Problems of Linked Data (1/4): Identity

Linked Data is defined in Wikipedia as follows:

Linked Data describes a method of publishing structured data so that it can be interlinked and become more useful.

Linked Data has emerged from the ambitious Semantic Web idea, as the result of the need for a more pragmatic approach in which the emphasis is not so much on semantics. Linked Data relies on the fundamental Web technologies HTTP and URI, and is based upon the idea that the same principles that led to the success of the Web should be applied to the Semantic Web, aiming to create the Web of data as an extension of the existing Web of documents. As explained in Linked Data – The Story So Far:

While the Semantic Web, or Web of Data, is the goal or the end result of this process, Linked Data provides the means to reach that goal.

The basic idea of Linked Data is relatively simple. Tim Berners-Lee’s note on Linked Data describes four rules for publishing data on the Web:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.
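
In practice, rules 2 and 3 are usually put to work through an ordinary HTTP GET request in which the client states which representation it would like. The following is a minimal sketch, not a definitive client: the URI http://example.org/id/alice and the function name build_lookup_request are hypothetical, and the request is only built, not sent.

```python
from urllib.request import Request

def build_lookup_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for an RDF representation."""
    # Prefer Turtle, accept RDF/XML as a fallback.
    accept = "text/turtle, application/rdf+xml;q=0.9"
    return Request(uri, headers={"Accept": accept})

# Hypothetical URI naming a "thing" (rule 1) with an HTTP scheme (rule 2).
req = build_lookup_request("http://example.org/id/alice")
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup. Whether the server answers with RDF, HTML or something else is exactly what the debate described below is about.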

Linked Data describes a method of publishing aimed at individuals and groups who want to participate in the creation of the Web of Data, or the Giant Global Graph. URIs are limited to HTTP URIs, which make it possible for a person or a software agent to follow links – traverse a graph and get new information, much like the original Web.

Linked Data is realized through the Linking Open Data (LOD) project, which integrates datasets published in this way. In September 2010, there were 203 datasets containing over 25 billion RDF triples, of which about 395 million were RDF links. The number of datasets is steadily increasing, indicating that Linked Data is a step in the right direction.

However, despite the success, Linked Data has numerous problems, related both to the concept and to the implementation of the idea. Four main categories of problems can be singled out. This post deals with the first of them:

Identity

The simple question of „What is exactly Linked Data?“ is not easy to answer. This problem is described in the briefing paper The Semantic Web, Linked and Open Data published in July 2010:

There is currently considerable ambiguity as to the exact nature of Linked Data. The debate primarily centres around whether Linked Data must adhere to the four principles outlined in Tim Berners-Lee’s “Linked Data Design Issues”, and in particular whether use of RDF and SPARQL is mandatory. Some argue that RDF is integral to Linked Data, others suggest that while it may be desirable, use of RDF is optional rather than mandatory.

In mid-2009 Paul Miller asked the question Does Linked Data need RDF? He stated that the idea that Linked Data can only be Linked Data if expressed in RDF is a dogmatism that makes him „deeply uncomfortable“. A big debate questioning the (once basic) assumptions of Linked Data began, and there is still no consensus today.

The problem arose because of the imprecise definition of the Linked Data rules that can be interpreted in different ways. The document that defines Linked Data and its rules is a personal note by Tim Berners-Lee and hasn’t been formally approved by the W3C. Even the term “Linked Data” itself is controversial and contributes to the confusion in the sense that the exact concept of “Linked Data” is conflated with the general idea of linking (any) data. Anyway, the third rule for publishing Linked Data has led to the most confusion and debate:

3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

Some interpret this rule as a requirement for the mandatory use of RDF, while others see it solely as a recommendation. It was discovered that the part in parentheses (RDF*, SPARQL) was added later, which raised the question of when and for what reason. An additional problem is that in the same document the author states that the four rules are not actually rules, but expectations of behaviour:

Breaking them will not destroy anything, but misses an opportunity to make data interconnected.

Another problem is that the whole Linked Data concept is defined on the basis of the RDF molecule idea. An RDF molecule represents the part of an RDF graph with the lowest possible level of granularity without losing meaning and is heavily based on the limitations of RDF, namely the existence of blank nodes. This concept is referred to as the „useful information“ which should be provided when someone looks up a URI (reference).

In an interview in July 2009, Tim Berners-Lee was clear that getting data open and machine-readable is more important than getting it linked, citing as an example the „remarkably popular“ CSV files that can be exported from spreadsheets or databases. Therefore, one should exploit existing structured data that is not in RDF, since it is more useful published on the Web than not published at all. This data can then be converted “on the fly” into RDF, if needed.
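
To give an idea of what converting CSV “on the fly” might look like, here is a minimal Python sketch that turns CSV rows into N-Triples lines. It is an illustration under stated assumptions, not a recommended mapping: the http://example.org/ namespace, the id/ and prop/ path segments, and the sample data are all made up, and the identifier and column names are assumed to be safe to place in a URI unescaped.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, not a real dataset

def csv_to_ntriples(text: str, id_column: str) -> list[str]:
    """Map every non-empty cell to one triple: <row URI> <column URI> "value" ."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}id/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id becomes the subject; empty cells yield no triple
            predicate = f"<{BASE}prop/{column}>"
            # Escape backslashes and double quotes inside the literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples

SAMPLE = "id,name,city\n1,Alice,Belgrade\n2,Bob,\n"
lines = csv_to_ntriples(SAMPLE, "id")
for line in lines:
    print(line)
```

Every value ends up as a plain string literal here. Deciding which cells should instead become links to other URIs is the part such a naive conversion cannot do on its own.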

Tim wasn’t the only one who revised his opinion. It seems that many others in the Linked Data community are slowly converting from RDF purists to Linked Data pragmatists. For instance, Michael Hausenblas made his point clear in 2009 that „RDF is, next to URIs and HTTP at the core of linked data“ and asked What else? However, in a recent blog post titled Towards Networked Data he stated that he has accepted the more inclusive approach – the one that doesn’t require RDF.

Does Linked Data need RDF?

In 2010, a star rating system was added to the “Linked Data Design Issues” as a mechanism for evaluating the quality of published data, which is rated from one to five stars. To earn one star, one just has to put data on the Web (with an open licence), while every additional effort that brings the data closer to the Linked Data principles is rewarded with an extra star. It’s explicitly stated that the use of the RDF and SPARQL standards is rewarded with four stars, and five stars are given if data is linked to other people’s data to provide context. The goal of this evaluation is to encourage people to publish data on the Web in general, as well as to motivate them to use RDF and interlink data. Therefore, Linked Data can be seen in a broader sense – as a process, where RDF is not required, but its value is emphasized and its use is encouraged.

This approach has several important repercussions. It turned out that Linked Data is actually „linked“ only when data is rated with five stars, so the name „Linked Data“ doesn’t make much sense for the lower rated data. The „3 star“ data is thus interpreted as Open Data (data under an open licence and in non-proprietary formats), but the remaining problem is that between Open Data (3 stars) and Linked Data (5 stars) there is a requirement for using RDF, implying that Linked Data must be based on RDF.
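
The cumulative reading of the star scheme can be sketched in a few lines of Python. This is only an illustration: the Dataset class and its field names are made up for this example, and the two-star criterion (machine-readable structured data) is taken from the star scheme in the “Linked Data Design Issues” note rather than from the summary above.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    # Hypothetical field names, one per star criterion, in order.
    on_web_open_licence: bool      # 1 star
    machine_readable: bool         # 2 stars
    non_proprietary_format: bool   # 3 stars
    uses_rdf_standards: bool       # 4 stars
    links_to_other_data: bool      # 5 stars

def star_rating(d: Dataset) -> int:
    """Count criteria in order and stop at the first one that is not met."""
    criteria = [
        d.on_web_open_licence,
        d.machine_readable,
        d.non_proprietary_format,
        d.uses_rdf_standards,
        d.links_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars

# Open CSV data that links to other data but does not use RDF.
linked_csv = Dataset(True, True, True, False, True)
print(star_rating(linked_csv))
```

Under this reading, data that is open, structured and even interlinked, but not expressed in RDF, stops at three stars.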

The most recent approach, however, treats Open Data, Linked Data and RDF as three separate concepts that overlap. In other words, Linked Data has diverged into three separate sets, and only the intersection of all three sets results in „5 star“ data. Therefore, although the majority of Linked Data is expressed in RDF and relies on RDF links, links can be encoded in different formats, such as XML, KML, etc. RDF is not needed anymore for linking, so it’s left as an option. Interlinking is thus seen as a distinctive feature of Linked Data, which doesn’t require RDF.

There is no consensus though on whether Open Data must have clear copyright and an open licence, or whether Linked Data URIs have to be resolvable, as well as many other details, but it seems that there is an agreement on general matters.

However, official definitions don’t follow this change. The difference between this new approach to Linked Data and the official definition is noticeable. On the official W3C page about Linked Data, RDF is described as a common format that is needed in order to realize the Linked Data idea. The question “What is Linked Data Used For” is answered as follows:

Linked Data lies at the heart of what Semantic Web is all about: large scale integration of, and reasoning on, data on the Web.

RDF is an essential element of the Semantic Web – if Linked Data “lies at the heart of the Semantic Web”, integration and reasoning can hardly be imagined without RDF. This is a pretty clear sign that „RDF purism“ is still prevalent in Linked Data circles.

On the other hand, many well-known figures in the community are openly against this exclusive approach. Kingsley Idehen claims that „conflating RDF and Linked Data is the worst thing to ever happen to the Linked Data vision“. According to him, Linked Data is all about using dereferenceable URIs to manifest the universal model via triples (EAV), while RDF is just an implementation detail. Kingsley compares Linked Data (and Web architecture) to object-oriented technology, in which objects have an identity distinct from their representations. He applies OO terminology to the Linked Data context, using the terms „Data object“ and „Object representation“.

Stefan Decker introduces the term „composability“, which refers to the principle that „the value of data can be increased if it can be combined with other data.“ Data formats that allow composability are those that:

  1. have no schema
  2. are self-describing
  3. are “object centric”. In order to integrate information about different entities, data must be related to these entities
  4. are graph-based, because object-centric data sources, when composed, result in a graph, in the general case

Stefan claims that „any data format that fulfils the requirements (thus enabling the data Web) is „more or less“ isomorphic to RDF“.
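
The composability requirements above can be illustrated with a small Python sketch. It is not Stefan Decker’s formalization, just an assumption-laden toy: two schema-less, object-centric sources are represented as sets of (entity, attribute, value) tuples, and the ex: identifiers and attribute names are invented for the example.

```python
Triple = tuple[str, str, str]  # (entity, attribute, value)

# Two independent, schema-less sources describing different entities.
source_a: set[Triple] = {
    ("ex:alice", "name", "Alice"),
    ("ex:alice", "knows", "ex:bob"),
}
source_b: set[Triple] = {
    ("ex:bob", "name", "Bob"),
    ("ex:bob", "livesIn", "ex:belgrade"),
}

def compose(*sources: set[Triple]) -> set[Triple]:
    """Composing sources is a plain set union; no schema alignment is needed."""
    merged: set[Triple] = set()
    for source in sources:
        merged |= source
    return merged

def reachable(graph: set[Triple], start: str) -> set[str]:
    """Collect every value reachable from start by following triples."""
    seen: set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for entity, _attribute, value in graph:
            if entity == node and value not in seen:
                seen.add(value)
                frontier.append(value)
    return seen

graph = compose(source_a, source_b)
print("ex:belgrade" in reachable(source_a, "ex:alice"))
print("ex:belgrade" in reachable(graph, "ex:alice"))
```

The path from ex:alice to ex:belgrade exists only in the composed graph, which is the sense in which combining data increases its value. It works only because both sources use the same identifier for ex:bob.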

In the above-mentioned post, Michael Hausenblas generalizes the Linked Data principles: the four rules are referred to as entity identity, access, structure and integration. These short descriptions are much clearer and to the point, and the common term „entity“ is friendlier and more understandable to people coming from different backgrounds than the overly academic „RDF molecule“ and „Minimum Spanning Graph“, or the too vague „useful information“.

It seems that all these new approaches to Linked Data share two recurring patterns. The first is generalization – RDF has been replaced with the EAV abstraction and the general concept of a labeled directed graph. The term “entity” encapsulates the ideas of a URI reference and its representation, or „RDF molecule“. The controversial third rule is now about structure, not specific standards. Linked Data is clearly differentiated from RDF, which has become just one of the possible implementations.

The second pattern is the use of ideas and terminology from the object-oriented model. Terms like „data object“, „entity“ and „object centric data“ all refer to the same thing – the concept of an object in the object-oriented sense. This shouldn’t be surprising because, after all, an object is a graph. A node in a graph representing an object is determined by the nodes it is immediately connected to. These nodes, together with the respective links, basically represent the properties and relations of a node, forming a set of triples that describe it. In the context of RDF this is actually a bit more complex, but in its simplified form this concept is indeed very close to the concept of an object in the OO model.
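
The simplified view of an object as a node plus its immediate links can be sketched as follows. This is a toy illustration in Python: the triples and the ex: identifiers are invented, and RDF-specific complications such as blank nodes are deliberately ignored.

```python
from collections import defaultdict

# A tiny graph as (entity, attribute, value) triples; all names are made up.
triples = [
    ("ex:alice", "name", "Alice"),
    ("ex:alice", "knows", "ex:bob"),
    ("ex:bob", "name", "Bob"),
]

def as_objects(triples):
    """Group triples by entity: each entity becomes an object whose properties
    are the outgoing links of its node in the graph."""
    objects = defaultdict(lambda: defaultdict(list))
    for entity, attribute, value in triples:
        objects[entity][attribute].append(value)
    # Convert the nested defaultdicts to plain dicts for readability.
    return {entity: dict(attrs) for entity, attrs in objects.items()}

objects = as_objects(triples)
print(objects["ex:alice"])
```

Each property holds a list because nothing in the triple model limits an entity to one value per attribute. The value ex:bob in the „knows“ property is itself the key of another object, which is where the object level meets the graph level.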

This is interesting because an object has never been so clearly recognized as an intermediary structure between triples and the global graph. This pattern indicates a potential for further connecting these three levels of modularity: logic/semantics on the triple level, OO principles on the object level, and graph theory on the graph level. These powerful ideas, bound together, could potentially lead to a new paradigm that benefits from each of them.

When linking data, i.e. creating relationships that can be described, the idea of a triple emerges almost by itself. But the pure idea of a triple and RDF as a final product are quite different things. If done right, RDF would be a simple and powerful framework for arranging those triples and could become a logical choice for people. In that case, we would not even perceive it as RDF; we would see data and relations, and RDF would feel natural and intuitive. Graphic designer Joe Sparano once said: „Good design is obvious. Great design is transparent.“ In other words, a great design is taken for granted. It works so well that it’s invisible. If RDF were invisible, we wouldn’t be having this discussion at all.

RDF is, unfortunately, painfully “visible”. The RDF model is badly designed, containing many special cases that cause a lot of trouble in practice. It has shown its utter inflexibility – an inability to adapt to any environment it was not originally intended for. The biggest problem is that there is still no awareness of this, so instead of fixing the basic data model, new standards are built on top of it.

It seems that RDF is paying the price for constantly isolating itself from other related concepts over the years. For instance, is RDF really so different from the OO model that objects are almost never mentioned when RDF is described? How many web developers would have understood RDF better if it had been explained in terms of its similarities to and differences from the OO model? How much better and more useful would RDF be if the Semantic Web and OO communities had worked together?

Nevertheless, I think that, regardless of how much RDF sucks, running away from it is not a long-term solution, because it’s running away from problems instead of solving them. OK, let’s use EAV instead of RDF, or call it a labeled directed graph that uses URIs as identifiers. But the same problems that exist in RDF will arise in another model, whatever you call it, because the challenges of RDF are universal when it comes to linking data on the Web. Justin Leavesley nailed it in a comment:

Ask yourself the question. Why hasn’t the linking of data taken off before? If there is all this data out there, why didn’t it just get linked together?

