Problems of Linked Data (4/4): Consuming data

The problem of consuming data published in the Linked Data style is best understood through an example. Imagine a user or a software agent wanting to find out what the capital of Germany is. Let's assume that the user already knows the URI reference that represents the concept of Germany (http://dbpedia.org/resource/Germany) and the URI of the property "has capital" (http://dbpedia.org/ontology/capital). The user is thus looking for the object of a triple in which the subject and the predicate are known:

<http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .

The user expects to find the answer by looking up the URI http://dbpedia.org/resource/Germany. To get the desired data, one has to go through the following procedure (a code sketch of the whole sequence follows the list):

  • First, the user sends an HTTP request to http://dbpedia.org/resource/Germany. In the HTTP headers, he specifies the RDF notation (format) in which he wants to receive the description of the resource. For the RDF/XML syntax, the "Accept:" header in the HTTP GET request should look like this:

    Accept: text/html;q=0.5, application/rdf+xml

    The "Accept:" header indicates that the client would take either HTML or RDF, but would prefer RDF. This preference is expressed by the quality value q=0.5 attached to HTML. This mechanism is called content negotiation.

  • The server would answer:

    HTTP/1.1 303 See Other
    Location: http://dbpedia.org/data/Germany.rdf
    Vary: Accept

    This is a 303 redirect, which tells the client that a Web document containing a description of the requested (non-information) resource, in the requested format, can be found at the URI http://dbpedia.org/data/Germany.rdf (the "Vary:" header is required so that caches work correctly).

  • Next, the client de-references the new URI given in the server's response by looking up http://dbpedia.org/data/Germany.rdf.

  • The server then responds with a "200 OK" message, telling the client that the response contains the representation of an information resource. The "Content-Type:" header indicates the delivered RDF/XML format, and the body of the message contains the representation describing the desired non-information resource, i.e. the triples encoded in the RDF/XML notation. This description can be of significant size – in this particular case (http://dbpedia.org/data/Germany.rdf) it weighs nearly half a megabyte (428 KB).

  • When the download is complete, the description must be parsed, which requires a special library. The usual procedure is that the triples are loaded into a local graph, and queries are then performed, depending on the implementation, via API methods or SPARQL.

  • Finally, the desired information is obtained – the URI reference of the capital of Germany is http://dbpedia.org/resource/Berlin (34 bytes). If you need additional information describing Berlin, you have to repeat the entire procedure with the new URI http://dbpedia.org/resource/Berlin.
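
Here is how the whole procedure might look in code – a minimal sketch in Python, assuming the third-party requests and rdflib libraries (any HTTP client and RDF parser would do):

import requests
from rdflib import Graph, URIRef

GERMANY = "http://dbpedia.org/resource/Germany"
CAPITAL = URIRef("http://dbpedia.org/ontology/capital")

# Content negotiation: ask for RDF/XML, accept HTML as a fallback.
# requests follows the 303 redirect to /data/Germany.rdf automatically.
response = requests.get(
    GERMANY,
    headers={"Accept": "text/html;q=0.5, application/rdf+xml"},
)
response.raise_for_status()

# The final "200 OK" response carries the full RDF/XML description
# (hundreds of kilobytes for a resource like Germany).
graph = Graph()
graph.parse(data=response.text, format="xml")

# Query the local graph for the object of the known triple.
for capital in graph.objects(URIRef(GERMANY), CAPITAL):
    print(capital)  # http://dbpedia.org/resource/Berlin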


As this example shows, access to an RDF triple and its object requires a significant number of steps, as well as the time for downloading the representation, parsing it and querying the results. It also requires programming skills, the necessary libraries, and knowledge of their methods or of the SPARQL language for creating queries.

On the other hand, there is an alternative way of fetching data – via a SPARQL endpoint. Data from the example can be obtained using a simple query sent (as a query string) to http://dbpedia.org/sparql/:

SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}
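
Since the endpoint speaks the SPARQL protocol over HTTP, such a query can be sent with any HTTP client. A minimal sketch in Python with requests, asking for the standard JSON results format via the Accept header:

import requests

query = """
SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}
"""

response = requests.get(
    "http://dbpedia.org/sparql/",
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)

# Each binding maps the variable name to its value.
for binding in response.json()["results"]["bindings"]:
    print(binding["object"]["value"])  # http://dbpedia.org/resource/Berlin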

The problem is that the user has to know yet another standard – the SPARQL language. Also, many sites that publish Linked Data don't have a SPARQL endpoint at all: an endpoint is not a mandatory requirement of Linked Data, but a recommendation for large datasets such as DBpedia in our example.

However, Linked Data is not about a single SPARQL endpoint for accessing data, but rather the opposite – it's about breaking the dataset into a web of interconnected resources identified by HTTP URIs. In this sense, the described procedure for obtaining a simple piece of information is the typical and recommended way of accessing data published according to the Linked Data principles. Therefore, there is an obvious problem – the inability to perform simple operations in a quick and easy way.

Leigh Dodds covered the data access problem in the blog post "RDF Data Access Options, or Isn't HTTP already the API?". The post was a follow-up to a discussion on the limitations of Linked Data, triggered by the following comment of his:

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely.

The Linked Data API is an additional layer that provides a simple REST API over RDF graphs to bridge the gap between Linked Data and SPARQL. The API layer acts as a proxy for any SPARQL endpoint, allowing more sophisticated queries without knowledge of SPARQL. In effect, it "converts" Linked Data and SPARQL into a REST API – an approach that is widely accepted and familiar to web developers.
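
To illustrate the idea, here is a hypothetical request that such a layer might accept (the URL pattern is invented for illustration, not taken from the Linked Data API specification):

GET /api/countries/Germany/capital HTTP/1.1
Accept: application/json

which the API layer would translate into a SPARQL query against the endpoint it proxies:

SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}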

The view that an extra layer in the form of the Linked Data API is needed provoked the question that Ed Summers asked on Twitter:

@ldodds but your blog post suggests that an API for linked data is needed; isn’t http already the API?

This is the crucial question about the nature of Linked Data, and it can also be asked as: "Isn't Linked Data already the API?". The aforementioned blog post by Leigh Dodds deals with this problem and analyzes the limitations of Linked Data. The author states that Linked Data provides two basic methods of data access:

  • Resource Lookups: by de-referencing URIs we can obtain a (typically) complete description of a resource.
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; this is (typically, though not necessarily) reconstituted into a graph on the client, as in the sketch below.
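
Graph traversal is a direct continuation of the earlier Python sketch: follow the capital link and de-reference the related resource in turn (the variable names come from that sketch):

# Continuation of the earlier sketch: de-reference Berlin as well.
berlin = next(graph.objects(URIRef(GERMANY), CAPITAL))

berlin_graph = Graph()
berlin_graph.parse(str(berlin))  # rdflib performs the HTTP lookup itself
print(len(berlin_graph), "triples describing", berlin)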

Leigh then argues that, in order to provide a more advanced level of functionality, at least two additional important aspects of data interaction should be supported (both are sketched in SPARQL after the list):

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.
  • Existence Checks: ability to determine whether a particular structure is present in a graph.
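
Both are easy to express in SPARQL. A sketch against the DBpedia vocabulary (the paging values are arbitrary):

# Listing: countries and their capitals, paged and sorted.
SELECT ?country ?capital WHERE {
  ?country <http://dbpedia.org/ontology/capital> ?capital .
}
ORDER BY ?country
LIMIT 10 OFFSET 20

# Existence check: is this particular triple present in the graph?
ASK {
  <http://dbpedia.org/resource/Germany>
      <http://dbpedia.org/ontology/capital>
      <http://dbpedia.org/resource/Berlin> .
}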

SPARQL can handle all of these options, as well as far more complex operations. However, by using SPARQL one is stepping around HTTP, which is the basic assumption of both the traditional Web and the Web of data. From a hypermedia perspective, using parameterised URLs, i.e. queries integrated into the HTTP protocol, is a much more natural solution than tunnelling SPARQL queries. The hypermedia principle is important not only in the REST architecture, but also in Linked Data, which is based on the Web technologies HTTP and URI. Leigh therefore argues that the Linked Data API could be a good solution to this problem.
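
The contrast is easy to see side by side (the parameter names in the second request are hypothetical, loosely modelled on the Linked Data API conventions):

# Tunnelled SPARQL: the query is an opaque blob inside a single parameter.
GET /sparql?query=SELECT+%3Fobject+WHERE+%7B+...+%7D HTTP/1.1

# Parameterised URL: the intent is expressed in the URL itself,
# so the result is an addressable, linkable, cacheable resource.
GET /api/countries?capital=Berlin&_sort=name&_page=2 HTTP/1.1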

One can conclude from Leigh's post that Linked Data alone is not enough, which suggests two possibilities: either it represents basic functionality that can be built upon, or it doesn't provide even the basic functionality.

Linked Data enables traversing a labeled, directed graph. It uses a universal interface based on dereferenceable HTTP URIs, but beneath that there is a large diversity of syntaxes. After de-referencing a URI reference, one can face any of numerous formats, which are sometimes rendered as HTML, sometimes as XML, and sometimes as plain text. They are often unreadable, so you have to look at the source code to try to figure out what they're about. And when you do that, you often can't click the links, so you have to copy and paste URIs. And sometimes descriptions cannot be opened in the browser at all, so they're downloaded; in that case you have to open them in a text editor – so better turn the "URL Highlighting" option on. Sorry, but that is not a good user experience.

An HTTP URI as a universal interface is just not enough. The Web has clearly shown the need for a universal syntax, and for a universal way of encoding hyperlinks – its most fundamental elements. What is the <a href=""> equivalent in the Linked Data world? Where is the universal syntax for a hyperlink in the Web of data context?
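
The point is easy to demonstrate: in HTML, a link always looks the same, while the very same RDF link can be written in any number of notations. Two common ones, for the triple from our example – in Turtle:

<http://dbpedia.org/resource/Germany>
    <http://dbpedia.org/ontology/capital>
    <http://dbpedia.org/resource/Berlin> .

and in RDF/XML (assuming the dbo prefix is bound to http://dbpedia.org/ontology/):

<rdf:Description rdf:about="http://dbpedia.org/resource/Germany">
  <dbo:capital rdf:resource="http://dbpedia.org/resource/Berlin"/>
</rdf:Description>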

The conceptual problem of Linked Data was covered in one of the previous posts, where I analyzed the decision to decompose an RDF graph into so-called RDF molecules. One shouldn't forget that this approach is associated with the deeper problems of the RDF model. In practice, this decision manifests itself exactly as described above: accessing the object of an RDF triple – a very simple operation – requires considerable effort and time.

It can be concluded that, although founded on a good initial idea, Linked Data has a lot of problems and suffers from serious inconsistencies. Linked Data is not properly defined, and the amount of room it leaves for different interpretations points to a substantial weakness.

Linked Data celebrates HTTP URIs, but a significant number of the nodes in a graph are not identified by HTTP URIs. It aims to build the Web of data, but still centres around documents. It tries to introduce a new paradigm, but is stuck in the old mindset. It is inspired by the original Web, but is unable to provide its level of simplicity.

Publishing data by the Linked Data rules is, for most people, very hard. Consuming data is hard. Understanding the underpinning theory is hard. Almost everything about Linked Data is hard. And what do you get? Not even the basic functionality of an API. Traversing a graph and getting data is difficult and inefficient when done programmatically, and almost impossible in a browser.

Considering the serious data access problems, the idea of adding another layer of complexity – some kind of API – sounds like the only reasonable solution. However, instead of fixing the consequences, let's focus on the causes for a change. And the major cause is the set of inherited problems of the underlying (RDF) data model.

