It’s a bit hard to write about Linked Data because of the many changes it’s going through. Therefore, until it becomes stable again, I’ll stick to the official definition of Linked Data, the one that assumes RDF. To get a clear perspective on the problem, it’s important to analyze the original idea. Many problems I’ll be focusing on are not directly related to RDF, and when they are, they can be understood in a wider sense, as universal problems of directed, labeled graphs in the Web context. For the sake of practicality, and to avoid repeating “directed, labeled graphs that use URIs as identifiers” all the time, I’ll just use the term “RDF”.
In the context of Linked Data, a parallel is often drawn between the Web of documents and the Web of data. For example, in the tutorial How to Publish Linked Data on the Web, this comparison is made in several places:
The Web of Data can be accessed using Linked Data browsers, just as the traditional Web of documents is accessed using HTML browsers. [...] The glue that holds together the traditional document Web is the hypertext links between HTML pages. The glue of the data web is RDF links.
The Web of documents is the current Web. Simply put, it represents a network (graph) whose nodes are Web documents. Each document is identified by a (unique) HTTP URI and returns some content when one looks up the URI. Similarly, the Web of data (or the Semantic Web) in the context of Linked Data represents a network whose nodes are URI references, which, when looked up, return (or redirect to) a document containing the description of the resource identified by the URI reference in question.
This description encodes the set of triples describing the resource, representing the basic unit of the Web of Data, called the RDF molecule. The RDF molecule is described in the paper Tracking RDF Graph Provenance using RDF Molecules, in which the authors state that in the Web of data, compared to the Web of documents, information is structured and encoded at a much finer level of granularity. They then determine that level of granularity, i.e. find the smallest components into which an RDF graph can be decomposed without losing meaning.
RDF documents are identified as poor candidates because they may contain redundant or unrelated data. On the other hand, an RDF triple, the smallest subset of an RDF graph, brings a different problem. Namely, when multiple triples share a blank node, decomposition causes loss of information. They discuss the following example:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:x foaf:firstName "Li" .
_:x foaf:surname "Ding" .
The graph (G1) describes a foaf:Person with first name “Li” and surname “Ding”. If we decompose G1 into its two triples and treat each as a separate RDF graph, we lose the information that there exists a person that both has the first name “Li” and also the surname “Ding”.
The solution the authors provide is the RDF molecule, which is described as follows:
In order to handle the information loss caused by triple level operation, we propose a higher level of granularity, the RDF molecule. An RDF graph’s molecules are the smallest components into which the graph can be decomposed into separate sub-graphs without loss of information.
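The decomposition can be illustrated with a minimal Python sketch. It is not the algorithm from the paper; it assumes the simplest possible reading of the definition, in which triples that share a blank node must stay together and every other triple forms a molecule on its own. Terms are plain strings, blank nodes are recognized by the "_:" prefix, and the function names `is_blank` and `molecules`, as well as the `ex:` names, are made up for this illustration.

```python
from collections import defaultdict

def is_blank(term):
    # Assumption of this sketch: blank nodes are written with a "_:" prefix.
    return term.startswith("_:")

def molecules(triples):
    """Group triples so that triples sharing a blank node stay together."""
    triples = list(triples)
    parent = list(range(len(triples)))  # union-find over triple indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # blank node -> index of the first triple that uses it
    for i, (s, _, o) in enumerate(triples):
        for term in (s, o):
            if is_blank(term):
                if term in first_seen:
                    parent[find(i)] = find(first_seen[term])
                else:
                    first_seen[term] = i

    groups = defaultdict(list)
    for i, triple in enumerate(triples):
        groups[find(i)].append(triple)
    return list(groups.values())

# The two triples from the example, plus one triple without blank nodes.
graph = [
    ("_:x", "foaf:firstName", '"Li"'),
    ("_:x", "foaf:surname", '"Ding"'),
    ("ex:a", "foaf:knows", "ex:b"),
]
result = molecules(graph)
```

Under these assumptions, the two triples about _:x end up in a single molecule, while the triple that contains no blank node is a molecule by itself.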
Translated into the Linked Data language, the RDF molecule represents a set of triples that describe a node, so that it contains the arcs out of that node, and the arcs in. In other words, it returns any RDF triples in which the term appears as either subject or object.
A graph is called browsable if looking up the URI of any of its nodes returns information that describes the node. Describing a node means:
- Returning all statements where the node is a subject or object; and
- Describing all blank nodes attached to the node by one arc.
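The two rules above can be sketched in a few lines of Python. This is only an illustration over an in-memory list of triples, not how any particular Linked Data server works; the function `describe`, the "_:" convention for blank nodes and the `ex:` names are assumptions of the sketch.

```python
def is_blank(term):
    # Assumption of this sketch: blank nodes are written with a "_:" prefix.
    return term.startswith("_:")

def describe(graph, node):
    """Return the triples that describe `node` under the two rules above."""
    # Rule 1: all statements where the node is a subject or an object.
    direct = [t for t in graph if t[0] == node or t[2] == node]
    # Rule 2: blank nodes one arc away from the node get described as well.
    blanks = {
        term
        for s, _, o in direct
        for term in (s, o)
        if is_blank(term) and term != node
    }
    attached = [
        t for t in graph
        if t not in direct and (t[0] in blanks or t[2] in blanks)
    ]
    return direct + attached

graph = [
    ("ex:li", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", '"Ding"'),
    ("ex:other", "foaf:name", '"Someone else"'),
]
description = describe(graph, "ex:li")
```

Here the description of ex:li contains the first two triples: the statement about ex:li itself and the statement about the blank node it points to. The unrelated third triple is left out.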
The concept of the RDF molecule is problematic for several reasons.
First, the blank node is a special case of the RDF model that causes numerous problems. It is also the main reason why an RDF graph is decomposed into RDF molecules rather than RDF triples. This means that if blank nodes were removed completely from the RDF model, the level of granularity would suddenly be lowered to the level of an RDF triple, which would significantly change the structure of the Web of data and the concept of Linked Data.
Blank nodes are not solely the problem of RDF, as I carelessly stated in the previous post. On the contrary, they are the paradigmatic example of what I meant by universal problems of a labeled directed graph in the Web context. I realized that while reading through the discussions on the JSON-LD mailing list. Creators of the new syntax treat RDF as an option, following the trend of redefining and generalizing Linked Data. In that sense, they use the (somewhat friendlier) term “unlabeled node” for what is called “blank node” (or bnode) in RDF. Unsurprisingly, unlabeled nodes turned out to be the subject of fierce debates. Unlabeled nodes come naturally in JSON, but are not allowed in the newly redefined Linked Data, making it virtually impossible to reach consensus. The format is even split into two sub-formats, one that allows unlabeled nodes and another that doesn’t. The situation is so dramatic that Manu Sporny asked Is Linked Data useless?, and stated that if JSON-LD didn’t support unlabeled nodes, his company wouldn’t use the format at all.
As I concluded in the previous post, merely running away from RDF will not solve the problems.
Another problem with the concept of the RDF molecule is that an RDF graph in the context of the Web of data is seen as a set of triples, rather than a graph. It’s true that an RDF graph is defined as a set of triples, but in the context of the Web of data, one should focus on its graph aspect, in which the structure consists of nodes and links, rather than triples. On the Web of documents, documents are nodes of the Web graph, while RDF molecules don’t represent nodes of an RDF graph, but its subsets. Therefore, this kind of Web should be called the “Web of molecules” rather than the Web of data.
In the Linked Data context, a node of the Web of data is redirected to the document containing its description. “Data” (or “datum”, if you’re pedantic) as the basic unit of this new Web of data, represents a new paradigm that needs to take over the role that the “document” had before. However, this idea is not fully elaborated, and documents still exist as data containers. In other words, the data structure is “glued” onto the (2D) document instead of being implemented via (3D) HTTP URIs.
Creating the Web of data is a challenge because an RDF graph and the Web graph are two different types of graphs. Arcs (links) in an RDF graph have names, while hyperlinks on the Web have only direction. Another problem is the fact that, of the three types of nodes in an RDF graph, only URI references are identified by a URI. How can one follow the idea of Web documents and assign a URI to each “datum” on the Web of data, when blank nodes and literals have no URI? On the Web of documents, every document has a URI; there are no “blank documents” or “literal documents”.
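To make the difference concrete, here is a small Python sketch that counts the nodes of a tiny RDF graph by type. It assumes a simplified string notation that is not part of any specification: blank nodes start with "_:", literals start with a double quote, and everything else is treated as a URI reference. The names `kind` and `node_kinds` and the `ex:` terms are hypothetical.

```python
from collections import Counter

def kind(term):
    # Assumed notation: "_:" marks a blank node, a leading quote marks a
    # literal, and anything else is treated as a URI reference.
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        return "literal"
    return "uri"

def node_kinds(graph):
    """Count the distinct nodes (subjects and objects) of each type."""
    nodes = {term for s, _, o in graph for term in (s, o)}
    return Counter(kind(n) for n in nodes)

graph = [
    ("ex:li", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", '"Ding"'),
    ("ex:li", "foaf:name", '"Li"'),
]
counts = node_kinds(graph)
```

In this graph only one of the four nodes is a URI reference, so only one node could ever be looked up the way a Web document is. The blank node and the two literals have nothing to dereference.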
One can ask what “data” actually is in the Web of data context, and what its relation is to a node on the one hand, and a triple on the other. In the general case, data is defined as the lowest level of abstraction, which on its own carries no meaning. Analogous to the Web, where all documents have a URI, each datum in the context of the Web of data should be identified by a URI. Unfortunately, it seems that this “lowest level of abstraction” that has a URI doesn’t come to the fore in Linked Data, which has missed an opportunity to address this issue more seriously.