Making data a first-class Web citizen

Working on my startup, Faviki, I have realized how hard it is to get even basic data about a webpage. Faviki is a bookmarking app that lets users connect webpages with structured data from DBpedia.

I was trying hard to figure out how to take it to the next level — to get more data from web pages, connect it with the rest of the graph and help users organize bookmarks better using not just tags (“strings”), but “things” and their relations.

However, it didn’t take long before I realized that despite all the promising semantic standards that enable doing some cool and powerful things with data, in reality, getting even the most basic data from an average webpage can be incredibly hard.

Take the title of the page for instance. In the spirit of working on the “things” level, I tried to get the “real” title, or the name of the thing the webpage is about (i.e. the primary topic).

Getting the value of the <title> tag is easy, but the problem is that it typically contains additional text like the name of the website and SEO keyword noise.

To get the actual title, one can search for the <h1> tag in the source, but, as with the <title> tag, it’s often abused or not used properly.

Or maybe we can compare the <h1> with the <title>, hoping that the <h1> is a subset of it? I even thought about downloading a few other pages and trying to figure out the general pattern behind the <title> tag.
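To make the heuristic concrete, here is a minimal sketch of the <h1>-vs-<title> comparison (my own illustration, not Faviki’s actual code; the sample HTML is made up):

```python
from html.parser import HTMLParser

class TitleH1Parser(HTMLParser):
    """Collect the text of the <title> tag and the first <h1> tag."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = ""
        self._in = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" or (tag == "h1" and not self.h1):
            self._in = tag

    def handle_endtag(self, tag):
        if tag == self._in:
            self._in = None

    def handle_data(self, data):
        if self._in == "title":
            self.title += data
        elif self._in == "h1":
            self.h1 += data

def real_title(html):
    """Guess the 'real' title: if the <h1> text is contained in the noisy
    <title>, trust the <h1>; otherwise fall back to whatever we have."""
    p = TitleH1Parser()
    p.feed(html)
    title, h1 = p.title.strip(), p.h1.strip()
    if h1 and h1 in title:
        return h1
    return title or h1

html = ('<html><head><title>Galaxy S4 review | Tech | Example Site</title></head>'
        '<body><h1>Galaxy S4 review</h1></body></html>')
```

Of course, this breaks as soon as the page rephrases the <h1>, which is exactly the fragility I am complaining about.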

The “right” way

You may argue that these are just dirty hacks and that HTML is not suited for this. It’s a document format, and data should be described using RDF, Microdata, OData, CSV or other syntaxes, provided either in a separate document or embedded in HTML.

The trouble is that in a bookmarking application, you deal with random webpages, most of which don’t publish any data whatsoever. But let’s say we want to make use of the ones that do, in order to provide richer data and a better experience for end users.

In order to get the page title, we need to search the webpage’s HTML code for embedded data, or for <link> and <meta> tags (rel="alternate", rel="primarytopic", etc.) pointing to external resources that might lead us to the data we need. There are a number of options, even if we limit ourselves to the RDF model: RDFa, Turtle, RDF/XML, JSON-LD…
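The discovery step alone already requires parsing HTML. A sketch of scanning the head for <link> tags that advertise machine-readable alternatives (the rel/type patterns shown are common conventions, not a guarantee of what any given site publishes):

```python
from html.parser import HTMLParser

class DataLinkParser(HTMLParser):
    """Collect <link rel="alternate" type="..."> entries pointing to data."""
    DATA_TYPES = {"application/rdf+xml", "text/turtle", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self.data_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "alternate" and a.get("type") in self.DATA_TYPES:
            self.data_links.append((a["type"], a.get("href")))

p = DataLinkParser()
p.feed('<head><link rel="alternate" type="text/turtle" href="/data/page.ttl">'
       '<link rel="stylesheet" href="/style.css"></head>')
```

And this only tells us *where* the data might be; each of the advertised formats then needs its own parser.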

Now, this diversity may sound like a good thing, allowing you to use the syntax that best fits your needs and taste. In a perfect world, that may be the case. But in reality, if my preferred syntax is not published, I must use whatever is available. So ultimately both publishers and consumers must cover as many alternatives as possible, which is a big burden.

What about SPARQL?

If one only needs a single atomic piece of data, such as the title, isn’t the most appropriate solution to send a simple query to a SPARQL endpoint?

The problem is that the number of websites providing a SPARQL endpoint, on one hand, and of developers familiar with SPARQL syntax, on the other, is still small.

Another problem: even if a website does provide a SPARQL endpoint, how do we find the endpoint URI given a random webpage URL? How many websites that provide a SPARQL endpoint publish a voiD file describing the dataset (where one should be able to find the SPARQL endpoint URL)?

Finally, are the relations between ordinary web pages and the data stored in the dataset accessible via SPARQL at all?

Take DBpedia, for instance. http://dbpedia.org/page/London has this tag encoded in its HTML source:

<link rel="foaf:primarytopic" href="http://dbpedia.org/resource/London"/>

… suggesting the triple:

<http://dbpedia.org/page/London> foaf:primarytopic <http://dbpedia.org/resource/London> .

But if you try to get this triple using SPARQL:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?o WHERE {
    <http://dbpedia.org/page/London> foaf:primarytopic ?o .
}

… you will get nothing.

Therefore, I first need to figure out where the SPARQL endpoint is by parsing the voiD file, then download and parse the web page to get the “primarytopic” resource http://dbpedia.org/resource/London, and only then use that resource directly in a SPARQL query.
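A sketch of that roundabout process (DBpedia’s public endpoint is http://dbpedia.org/sparql; the actual network calls are left as comments so the example stays self-contained, and the rdfs:label query is just an illustration of the kind of follow-up question one would ask):

```python
import re

def extract_primary_topic(html):
    """Pull the foaf:primarytopic resource URI out of the page's <link> tag."""
    m = re.search(r'<link rel="foaf:primarytopic" href="([^"]+)"\s*/?>', html)
    return m.group(1) if m else None

def build_label_query(resource_uri):
    """Build a SPARQL query asking for the resource's English rdfs:label."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {{ <{0}> rdfs:label ?label . "
        "FILTER (lang(?label) = 'en') }}".format(resource_uri)
    )

page = '<link rel="foaf:primarytopic" href="http://dbpedia.org/resource/London"/>'
topic = extract_primary_topic(page)
query = build_label_query(topic)

# To actually run it (requires network access):
# import urllib.parse, urllib.request
# url = "http://dbpedia.org/sparql?format=json&query=" + urllib.parse.quote(query)
# results = urllib.request.urlopen(url).read()
```

Two fetches, two parsers, and one query language, just to ask one question about one page.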

The frustration

The more I thought about the problem of data access, the more frustrated I got. It seemed that most options had already been considered and that there was hardly any room for innovation.

Historically, every new solution tried to solve some other solution’s problems, ending up as a balance between different constraints.

For instance, the Turtle syntax is much simpler than RDF/XML, but it requires a special parser and you cannot use the XML stack. RDFa doesn’t require a separate document, but it is mixed with HTML content, hard to read, and makes the original file bigger. Schema.org’s Microdata is a simpler alternative to RDFa, but at the cost of being far less expressive.

It was hard to imagine a new syntax simpler and more elegant than Turtle. But it’s not just about syntax. I was always annoyed by the fact that when I stumble upon a Turtle file on the Web, I either need to download it and read it outside of the browser, or, if it opens in the browser, I can’t click on the damn links (URIs)!

And I don’t buy the story about its human-friendliness either. Look at this, for instance: it’s just painful to read.

Finally, it wasn’t just about simplicity and elegance either. Take HTML, for example: it’s definitely not the most elegant syntax in the world, but it is still enormously successful.

It was hard to imagine some original and radically different approach. Still, I had a strong feeling that something was wrong and that there must be a better way.

The idea

One day, a strange idea struck me. I was looking at some news page on the Guardian and thought: what if the “title” segment were just added to the URL? For instance, take the URL:

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone

If you need to get the title, you simply append the “title” segment to the URL. The result is:

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/title

When you look up this new URL, you get an HTTP 200 OK response with the body:

Will the Samsung Galaxy S4 eclipse the iPhone?

That is, you get raw data, and the syntax is not just easier to parse — there is practically no syntax at all! Similarly, you can look up the description and author at the following URLs, respectively:

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/description

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/author
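The client side of this scheme boils down to URL concatenation plus a plain GET. A minimal sketch (to be clear, the Guardian does not actually implement this; the URLs are my hypothetical example):

```python
def property_url(page_url, prop):
    """Address a piece of data by appending a property segment to a page URL."""
    return page_url.rstrip("/") + "/" + prop

page = "http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone"

# Fetching the data would then be an ordinary HTTP request
# (requires a server that cooperates with the scheme):
# import urllib.request
# title = urllib.request.urlopen(property_url(page, "title")).read().decode()
```

No parser, no query language; the whole “API” is the URL structure itself.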

Relation to RDF

How does this fit with RDF and triples? Here is how the RDF might look:

<http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone>
    dc:title "Will the Samsung Galaxy S4 eclipse the iPhone?" .

Now, if you append the predicate to the subject URI, you will get:

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/dc:title

By looking up this new URI, we get the literal value of the title (the object of the triple). But what does the new URI identify? It’s simply the URI of the page’s title. Therefore, the title is not just a “string” any more; it has become a “thing” – a separate resource identified by a URI.

<http://www.guardian.co.uk/.../samsung-galaxy-s4-iphone>
    dc:title <http://www.guardian.co.uk/.../samsung-galaxy-s4-iphone/dc:title> .

The result is that the title and its value are separated, which makes sense. The title != the value of the title.

<http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/dc:title>
    rdf:value "Will the Samsung Galaxy S4 eclipse the iPhone?" .

By using property names with prefixes (“dc:title”) as segments, we limit the properties to ones defined in vocabularies, making them unambiguous and predictable.

(Note that the : character is used here for clarity, although it is a reserved character according to the URI spec. In the rest of the blog I have used underscores instead.)
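The mapping between prefixed names and URL segments is mechanical. A sketch of the underscore convention the note above mentions (the prefix table is a small assumed sample, not a complete registry):

```python
# A sample prefix table; a real implementation would need the site's
# full set of vocabularies.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def predicate_to_segment(curie):
    """Turn a prefixed name like 'dc:title' into a URL-safe segment 'dc_title'."""
    return curie.replace(":", "_", 1)

def segment_to_predicate_uri(segment):
    """Expand a segment like 'dc_title' back into the full predicate URI."""
    prefix, _, local = segment.partition("_")
    return PREFIXES[prefix] + local
```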

Using HTML

In some cases, however, we must use a syntax. When dealing with RDF links, as in the case of stating the author, we need a way to say that the value is not a literal string but a thing identified by a URI.

Should we invent a new syntax? Of course not; there is already “the Web way” of writing URIs — the HTML <a> tag.

Therefore, a lookup of

http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone/foaf:maker

…will return the hyperlink:

<a href="http://www.guardian.co.uk/profile/juliette-garside">Juliette Garside</a>

This corresponds to the following triple:

<http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone>
    foaf:maker <http://www.guardian.co.uk/profile/juliette-garside> .

(For simplicity, I am using the web page URL as the identifier for the person.)
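Turning such a response back into a triple is a one-liner. A sketch (the URLs are the article’s example; a real client would use a proper HTML parser rather than a regex):

```python
import re

def link_to_triple(subject_uri, predicate, body):
    """Turn an '<a href="...">...</a>' response body into an RDF-style triple."""
    m = re.search(r'<a href="([^"]+)">', body)
    if not m:
        return None
    return (subject_uri, predicate, m.group(1))

triple = link_to_triple(
    "http://www.guardian.co.uk/technology/2013/mar/10/samsung-galaxy-s4-iphone",
    "foaf:maker",
    '<a href="http://www.guardian.co.uk/profile/juliette-garside">Juliette Garside</a>',
)
```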

If there are several values for the same property, e.g. descriptions in different languages, we can append segments that play the role of local “keys” in the rdfs:comment collection:

http://www.guardian.co.uk/.../samsung-galaxy-s4-iphone/rdfs:comment/en

http://www.guardian.co.uk/.../samsung-galaxy-s4-iphone/rdfs:comment/fr

Now, if one looks up the collection resource itself:

http://www.guardian.co.uk/.../samsung-galaxy-s4-iphone/rdfs:comment

… we need a way to write down this collection. Again, there is no need to reinvent the wheel — we can use an HTML list, either a <ul>:

<ul>
    <li><a href="en">english</a></li>
    <li><a href="fr">french</a></li>
</ul>

… or a <dl>, which also encodes the values:

<dl>
    <dt><a href="en">english</a></dt><dd>The Galaxy S4 will be unveiled in New York this week...</dd>
    <dt><a href="fr">french</a></dt><dd>The Galaxy S4 sera dévoilée à New York cette semaine...</dd>
</dl>
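A consuming client would read the keys as relative links and resolve them against the collection URL. A sketch (the collection URL is a hypothetical example, and the regex assumes the compact <dl> shape shown above):

```python
import re
from urllib.parse import urljoin

def parse_dl(collection_url, body):
    """Map each <dt> key to (absolute key URL, <dd> value)."""
    pairs = re.findall(
        r'<dt><a href="([^"]+)">[^<]*</a></dt><dd>([^<]*)</dd>', body)
    return {key: (urljoin(collection_url + "/", key), value)
            for key, value in pairs}

body = ('<dl><dt><a href="en">english</a></dt>'
        '<dd>The Galaxy S4 will be unveiled in New York this week...</dd>'
        '<dt><a href="fr">french</a></dt>'
        '<dd>The Galaxy S4 sera devoilee a New York cette semaine...</dd></dl>')
result = parse_dl("http://example.org/page/rdfs_comment", body)
```

Each key thus becomes a dereferenceable resource in its own right, which is the whole point.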


Data as a first-class citizen

It is nothing new that there has been a large gap between the current Web and the Web of data (Semantic Web) vision. Linked Data appeared in 2006 as an attempt to close this gap by applying Web principles to data publishing.

The problem is that the “Web” expected data to adjust to it by using HTTP URIs, but it didn’t really return the favor. It has remained the same Web of documents, with data described in its building blocks — the same old, boring, flat web documents.

In reality, sending a few HTTP requests and getting simple, raw data is much easier than parsing documents and dealing with the current syntax mess. It is especially useful for data discovery, in which one asks simple questions of predictable URLs and obtains short answers giving clues (links) about the state of the dataset.

On the publisher side, the implementation is perhaps not as easy as uploading, say, a Turtle file, but it’s not too hard either. The real benefit is that it is easy to understand what is going on — essentially, there is no need to learn a new syntax.

Remember the last time somebody started explaining how to get to some information on the Web? “Search for this, then click there, then…”, provoking you to interrupt them and say, “just give me the goddamn URL!”

This is a good metaphor for what we need to do with data as well. The approach I have described gives each piece of data its own (HTTP) URL, meaning that one can share atomic data as easily as we share ordinary web pages now.

To put it in fancy words, this way data becomes a first-class citizen of the Web.

But data is identified using URIs; that’s the whole point of RDF and the Semantic Web! Isn’t data already a first-class citizen? No, it isn’t, because by dereferencing these URIs you don’t get plain data, but documents.

By adding an intermediary resource that acts as a “variable” whose value is returned when dereferenced, we finally allow (atomic) data to have “equal rights” with documents, taking a necessary step for the Web of data to arise.