There are a number of challenges regarding the realization of the Semantic Web vision. In the series of blog posts to follow, I’ll be focusing on the Linked Data field and the ideas, technologies and standards it’s based on. I’ll analyze the problems of three important aspects of Linked Data: the RDF model, the RDF notations and the drawbacks of Linked Data itself.
Problems regarding the RDF model
RDF is a data model that represents knowledge in the form of simple statements called RDF triples, which consist of a subject, a predicate and an object, much like simple sentences in a human language. The subject is the thing (resource) that a statement describes, the predicate identifies a property or a relation, while the object is the value of a property or the target of a relation.
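The subject–predicate–object structure can be sketched in plain Python, with a triple as a simple tuple. The URIs below are hypothetical examples, not part of any real vocabulary apart from the well-known FOAF `name` property:

```python
# A minimal sketch of an RDF triple as a (subject, predicate, object) tuple.
# The example.org URI is hypothetical; foaf:name is the FOAF "name" property.
triple = (
    "http://example.org/people#alice",   # subject: the resource being described
    "http://xmlns.com/foaf/0.1/name",    # predicate: the property
    "Alice",                             # object: a literal value
)

subject, predicate, obj = triple
print(f"{subject} --[{predicate}]--> {obj}")
```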
In RDF, there are three types of nodes – URI references, blank nodes and literals. URI references identify resources, blank nodes represent anonymous resources that are not assigned a URI, and literals denote values such as numbers or dates. The subject of an RDF triple may be a URI reference or a blank node, the predicate must be a URI reference, and the object may be of any of the three kinds (a URI reference, a literal or a blank node). When combined together, RDF triples form a directed, labeled graph. Subjects and objects of RDF triples become nodes in an RDF graph, and predicates become arcs connecting them.
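How triples combine into a directed, labeled graph can be illustrated with a short sketch: subjects and objects become nodes, and each predicate labels an arc between them. All URIs here are hypothetical:

```python
# Sketch: combining RDF triples into a directed, labeled graph.
# Subjects and objects become nodes; each predicate labels an arc.
# The example.org URIs are hypothetical.
EX = "http://example.org/"

triples = [
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob",   EX + "knows", EX + "carol"),
    (EX + "alice", EX + "name",  "Alice"),   # object may also be a literal
]

nodes = set()
edges = []  # (from_node, arc_label, to_node)
for s, p, o in triples:
    nodes.update([s, o])     # subjects and objects become graph nodes
    edges.append((s, p, o))  # the predicate labels the connecting arc

print(len(nodes), "nodes,", len(edges), "arcs")
```

Note that the same resource (here `bob`) appears once as an object and once as a subject, yet it is a single node in the resulting graph.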
The RDF model is based on a simple idea, but it has problems that make it unnecessarily complicated, thus decreasing its value. These problems can be divided into three categories:
- the existence of nodes that have no name
- problems associated with the literals
- the lack of a unique concept of the node
Nodes without a name are a special kind of node called blank nodes (bnodes). These nodes simply indicate the existence of a thing, without using, or saying anything about, the name of that thing. Therefore, they are referred to as existential variables of an RDF graph.
Due to the absence of a name (URI), manipulating data containing blank nodes is much harder – they make otherwise trivial operations far more complex. They complicate the lives of data consumers, especially if data changes in the future. Blank nodes add a lot of complexity to the standards built upon them, and the implementations consuming them. They are poorly understood and difficult for beginners.
While in theory blank nodes don’t have a name, in practice, when publishing data, they can be assigned an ID in a local graph/document scope, in order to enable several RDF triples to reference the same unidentified resource. This local identifier is called a “blank node identifier” and it’s different from URIs or literals, because it doesn’t provide a unique name in the global context. Because of that, blank node identifiers require special treatment:
When graphs are merged, their blank nodes must be kept distinct if meaning is to be preserved; this may call for re-allocation of blank node identifiers. Such blank node identifiers are not part of the RDF abstract syntax, and the representation of triples containing blank nodes is entirely dependent on the particular concrete syntax used.
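The re-allocation requirement can be sketched in plain Python. Two documents may both use the local identifier `_:b0` for different anonymous resources, so a naive union would wrongly fuse them; renaming keeps them distinct. The data and the renaming scheme are hypothetical:

```python
# Sketch: merging two graphs whose blank nodes must be kept distinct.
# Both graphs happen to use the local identifier "_:b0", but for
# *different* anonymous resources. Data and suffix scheme are hypothetical.
def standardize_apart(triples, suffix):
    """Rename blank node identifiers so they are unique to this graph."""
    def rename(term):
        if isinstance(term, str) and term.startswith("_:"):
            return term + suffix
        return term
    return [(rename(s), p, rename(o)) for s, p, o in triples]

g1 = [("_:b0", "ex:name", "Alice")]
g2 = [("_:b0", "ex:name", "Bob")]   # a different anonymous person

merged = standardize_apart(g1, "_g1") + standardize_apart(g2, "_g2")
print(merged)
# The formerly identical identifiers are now distinct:
# [('_:b0_g1', 'ex:name', 'Alice'), ('_:b0_g2', 'ex:name', 'Bob')]
```

A naive `g1 + g2` union would have left two triples about what looks like a single resource `_:b0`, silently changing the meaning of both graphs.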
Different parsers treat blank nodes differently, some using methods for automatically assigning URIs to them (Skolemization), which complicates things further. Skolemization refers to replacing existential variables with unique constants, or simply – a way of assigning URIs to blank nodes.
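A minimal sketch of Skolemization: each blank node is replaced with a freshly minted, globally unique URI, and every occurrence of the same identifier maps to the same URI. The `.well-known/genid/` pattern follows the RDF 1.1 recommendation for Skolem IRIs; the base URI and data are hypothetical:

```python
# Sketch of Skolemization: replace each blank node with a fresh unique URI.
# The ".well-known/genid/" path follows the RDF 1.1 Skolem IRI convention;
# base URI and data are hypothetical.
import uuid

def skolemize(triples, base="http://example.org/.well-known/genid/"):
    mapping = {}  # blank node id -> Skolem URI (stable within this graph)
    def skolem(term):
        if isinstance(term, str) and term.startswith("_:"):
            if term not in mapping:
                mapping[term] = base + uuid.uuid4().hex
            return mapping[term]
        return term
    return [(skolem(s), p, skolem(o)) for s, p, o in triples]

data = [("_:addr", "ex:city", "Belgrade"),
        ("ex:alice", "ex:address", "_:addr")]
skolemized = skolemize(data)
for t in skolemized:
    print(t)
```

Both occurrences of `_:addr` end up as the same Skolem URI, so the graph’s shape is preserved while the anonymous resource gains a globally unique name.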
Blank nodes were originally created as a shortcut for publishers. They enable indirect referencing, which is close to the human way of thinking, where not everything is precisely identified but rather referred to with unspecified words (pronouns) such as “somebody” or “something”.
However, blank nodes have caused such problems that people question whether they are worth the effort, and consider alternative solutions. Some people suggest removing them altogether from RDF, but this brings its own problems. The standard interpretation of blank nodes as existential variables is deeply ingrained in the standards. Also, choosing a URI for each resource is a non-negligible effort for publishers. An additional problem is that there is already a significant amount of published data that contains blank nodes (FOAF profiles, for instance).
Blank nodes are a frequently discussed and controversial topic that divides the Semantic Web community. Richard Cyganiak wrote a short analysis of the problem of blank nodes in the blog post titled “Blank nodes considered harmful”. He asks the question “Are blank nodes evil?”, and answers:
Not always. Sometimes they are tolerable, sometimes they are a necessary last resort, and sometimes they are good enough. But they are never good.
Richard then discusses several situations in which blank nodes can be tolerated, such as transient data not meant to be stored, or unimportant auxiliary resources, like some n-ary relations. Finally, he reaches the conclusion:
The higher the percentage of blank nodes in a dataset, the less useful it is.