The Semantic Web is often described as an extension of the current Web. The idea of what extending the Web should look like can be seen in Linked Data.
In order to better understand the importance of Linked Data, one has to understand the context in which it emerged, i.e. the problem it has been trying to solve.
In his 2006 note "Linked Data - Design Issues", Tim Berners-Lee wrote:
Many research and evaluation projects in the few years of the Semantic Web technologies produced ontologies, and significant data stores, but the data, if available at all, is buried in a zip archive somewhere, rather than being accessible on the web as linked data.
Put in this perspective, Linked Data did an important thing: it required that data actually be put on the Web, and demanded that resolvable (HTTP) URIs be used as identifiers. It set the rules for using existing Web technologies to publish and connect structured data on the Web. The data that extends the old Web is interconnected and itself forms a new Web, often referred to as the Semantic Web or the Web of data.
Therefore, one can say that Linked Data paved the way for structured data to evolve into what really can be considered as some sort of a web. This web, the Web of data, is perhaps not so magnificent as once seen in the Semantic Web vision, but, for the first time, the „Web“ part of the „Semantic Web“ has started to take off.
Different data models
The Web is a type of graph where all nodes (web resources) are identified by URIs and edges (links) have a direction, but not a name. This kind of a graph is called a directed graph.
On the other hand, the RDF model is a graph as well, albeit a different kind – a labeled, directed graph. It differs from the Web graph in that its edges (links) besides a direction have a label. Some nodes in an RDF graph can be identified by URIs (URI references), but in addition there are nodes that are not identified at all (blank nodes), and ones that are identified by their value and not by a URI (literals).
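| | Web graph | RDF graph |
|---|---|---|
| Type | Directed graph | Directed labeled graph |
| Edges | Have direction | Have direction and labels |
| Nodes | Identified by URIs | Not all are identified by URIs |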
The above table compares the two graphs. There are clearly significant differences, but the two models also have a couple of things in common:
- nodes identified by URIs
- all edges have a direction
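To make the difference between the two models concrete, here is a minimal sketch in Python. The tuple-based representation, the node-kind tags and the `nodes_without_uri` helper are my own illustration (not part of any RDF library), and the blank node label `_:b0` and the hyperlink between the two sites are made up for the example.

```python
# Node kinds used to tag RDF nodes in this sketch (hypothetical names).
URI, BLANK, LITERAL = "uri", "blank", "literal"

# Web graph: directed, unlabeled edges; every node is a URI.
web_edges = {
    ("http://chucknorris.com/", "http://brucelee.com/"),
}

# RDF graph: directed, labeled edges (triples); a node may be
# a URI reference, a blank node, or a literal.
rdf_triples = [
    ((URI, "http://chucknorris.com/data_/chuck"),
     "foaf:nick",
     (LITERAL, "Chuck Norris")),
    ((URI, "http://chucknorris.com/data_/chuck"),
     "foaf:based_near",
     (BLANK, "_:b0")),
]


def nodes_without_uri(triples):
    """Return the nodes that have no URI and so no obvious Web resource."""
    found = []
    for subject, _predicate, obj in triples:
        for node in (subject, obj):
            if node[0] != URI and node not in found:
                found.append(node)
    return found
```

Calling `nodes_without_uri(rdf_triples)` lists the literal and the blank node, which are exactly the nodes that have no counterpart in the Web graph.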
Linked Data has been utilizing these similarities in order to „project“ an RDF graph to the Web graph. RDF nodes identified by URIs become web resources and the part of a graph describing them is retrieved by dereferencing those URIs. Other nodes in the RDF graph that are not identified by URIs are obviously not assigned a web resource; they are simply encoded in triples using RDF notations. Labeled (or typed) links are in a similar way realized on the syntax level.
Considering all the limitations it faced, Linked Data has offered perhaps the only reasonable solution. Of course, one can argue that there are many unnecessarily complicated aspects of it, partly caused by the same limitations and partly because of a number of problematic decisions.
But the essential idea of Linked Data seems right. After all, given all the differences between the two kinds of graphs, what is the alternative? Is there really any better way to project the RDF graph to the Web?
The key problem is not in Linked Data itself. Given all the circumstances, it does a pretty good job in “adjusting the data” for the Web. But the problem is the mere idea of „adjusting“ data, which implies that data is not modeled for the Web in the first place.
That is the wrong way of solving the problem. It is not the data that needs to be adapted, but the model the data is based on.
One can argue that the RDF model is a good, even a perfect model that implements the ideas of description logic. But from an evolutionary perspective, perfection only ever means perfect adaptation to an environment; it is not measured against some absolute criteria.
The Web is based on simple and pretty clear rules – one of them is that all nodes (resources) are identified by URIs. One could ask, how it’s possible at all that in a model that has the ambition to live on such a Web, there are nodes that don’t follow this obvious requirement of the environment.
On the most basic, data model level, the directed graph has to somehow hold all the information of the directed, labeled graph. In other words, every element of the RDF graph has to be projected to some element of the Web graph.
A logical assumption is that for every node of an RDF graph, there has to be a corresponding node of the Web graph. But if we try to do that, soon we will face serious problems.
Let’s use the same RDF graph example we used in the previous posts.
We can see from the above image that although URI references can be projected to the nodes on the Web identified with the same URIs, projecting blank nodes and literals is impossible.
Another challenge is projecting edges (links) from the RDF graph to the Web graph. Even if we find a way to project all the nodes, how to project the links’ labels on directed edges (hyperlinks)?
We can compare this kind of projection with projecting a 3D object onto a 2D plane, where the challenge is to map the information from the 3D space to the space with one dimension less.
Following this metaphor, we can refer to the label of a link as that third dimension. The hyperlinks in the (directed) Web graph just don’t have that dimension.
The good thing is that we have one clear requirement: to project an RDF graph to the Web graph, all nodes in the RDF graph must be identified by URIs. There is simply no other way to name nodes on the Web.
Therefore, the only way is to change the RDF model, so that every node gets a name – not just any name, but a URI. Blank nodes are not acceptable any more and the method of identification of literals must be changed, so that they are identified by URIs and not by values they represent.
Changing the RDF model may sound like a ridiculous idea at this time. After all, it has been used for all these years and has proved to work in various contexts. The problem is that it doesn’t work properly in the Web context. If we want to build the Semantic Web, the RDF model must adapt to the Web environment.
On this blog I’ve already described how each node in the RDF model can be assigned a URI using a very simple method, so I won’t go into details here. In short, the only way to do so is by utilising paths, i.e. using names for nodes that correspond to graph traversal. In this way we can assign URIs to what were previously considered blank nodes. The same principle can be applied to literals as well. Therefore, with relatively small changes to the model, all nodes can be assigned a URI.
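As a rough sketch of path-based naming, the function below walks a nested description and gives every nested node (what RDF would model as a blank node) a URI built from the path that leads to it. The function name `name_nodes`, the nested-dict input format and the placeholder longitude value are assumptions made for this example; the rule of turning a CURIE such as `foaf:nick` into the segment `foaf_nick` follows the URIs used in this post.

```python
def name_nodes(subject_uri, description):
    """Yield (subject, predicate, object) triples with path-based URIs.

    `description` maps predicate CURIEs to either a nested dict
    (a would-be blank node) or a plain value.
    """
    for curie, value in description.items():
        # foaf:based_near -> foaf_based_near
        segment = curie.replace(":", "_", 1)
        child_uri = f"{subject_uri}/{segment}"
        if isinstance(value, dict):
            # The nested node is named by the path used to reach it.
            yield (subject_uri, curie, child_uri)
            yield from name_nodes(child_uri, value)
        else:
            # Plain values are left as they are in this sketch.
            yield (subject_uri, curie, value)


chuck = "http://chucknorris.com/data_/chuck"
triples = list(name_nodes(chuck, {
    "foaf:based_near": {"geo:long": "0.0"},  # placeholder value, not real data
}))
```

Here the node that would otherwise be blank ends up named `http://chucknorris.com/data_/chuck/foaf_based_near`.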
In the context of projecting nodes, special attention must be given to literals. Literal nodes are special in that they hold two pieces of information: a name (URI) and the data they represent. Therefore, they can be referenced by the name and by the value.
In the Web context, nodes are always referenced by their names (URIs). But in the context of RDF notations, it makes more sense to reference a literal by its value. Therefore, one must have a way to come up with a literal URI based on the context.
Take an example where two literals are referenced by values:
<http://chucknorris.com/data_/chuck> foaf:nick "Chuck Norris" .
<http://chucknorris.com/data_/chuck> foaf:nick "Fatality" .
What are the URIs of the literals? We can mint <http://chucknorris.com/data_/chuck/foaf_nick>, but there is simply no additional information to differentiate the two. This can be solved with an additional node that takes place between the URI reference and the literal:
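<http://chucknorris.com/data_/chuck> foaf:nick <http://chucknorris.com/data_/chuck/foaf_nick/1> .
<http://chucknorris.com/data_/chuck/foaf_nick/1> rdf:value "Chuck Norris" .
<http://chucknorris.com/data_/chuck> foaf:nick <http://chucknorris.com/data_/chuck/foaf_nick/2> .
<http://chucknorris.com/data_/chuck/foaf_nick/2> rdf:value "Fatality" .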
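Now, we can easily come up with the URIs for both literals:
<http://chucknorris.com/data_/chuck/foaf_nick/1/rdf_value>
<http://chucknorris.com/data_/chuck/foaf_nick/2/rdf_value>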
By convention, every literal’s URI ends with rdf_value. In future posts, we’ll see that this approach to literals adds other important benefits, especially when it comes to bringing the RDF model and the OO model closer together.
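The numbering of the intermediate nodes and the rdf_value ending can be sketched in a few lines of Python. The helper name `literal_uris` is hypothetical, and numbering the values from 1 in the order given is an assumption taken from the two nicknames in the example.

```python
def literal_uris(subject_uri, predicate_curie, values):
    """Map each literal URI to its value.

    URIs have the form <subject>/<prefix_name>/<n>/rdf_value,
    where n counts the values from 1.
    """
    segment = predicate_curie.replace(":", "_", 1)
    return {
        f"{subject_uri}/{segment}/{n}/rdf_value": value
        for n, value in enumerate(values, start=1)
    }


nicks = literal_uris(
    "http://chucknorris.com/data_/chuck",
    "foaf:nick",
    ["Chuck Norris", "Fatality"],
)
```

Each nickname now has its own name on the Web, while RDF notations can still refer to it by value.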
In any case, we made the first important step in the challenge of projecting the RDF graph to the Web graph the proper way. Now we have the RDF model in which all nodes are identified by URIs.
As seen in the above image, every node in the RDF graph can be projected to a node in the Web graph.
We’ve figured out how to assign a URI to all RDF nodes, but what about the inherently nameless links in the Web graph? How to project labeled edges of the RDF model to the nameless edges of the Web graph?
In the first problem we had to change the RDF model. Does it mean that we must change the Web to come up with named links? After all, there is no way for directed edges in a directed graph to hold any other information than a direction.
The thing is, the Web is not just a random directed graph. It contains websites which form hierarchical structures. These trees are comprised of nodes from a single website, organized in a hierarchical order.
In a tree, the URI of every child node is the URI of its parent + „something“. That „something“ is what links these two nodes, representing the information that a „tree“ link holds, i.e. its name. These links can have a name, being quite different from the typical nameless „graph“ (hyper)links.
The ability for „tree“ links to hold a name is what we are looking for. Now here comes the exciting part. Nodes linked with tree links form paths, the same kind of paths we used for naming all the nodes in the RDF model! In fact, the named (or typed) links appear almost by themselves when using the concept of paths to assign all nodes names.
In the RDF context, the parent + „something“ of a tree path equals the subject URI + predicate CURIE, and the URI created this way becomes the object (child) URI.
The predicate URI is represented in the form of a CURIE (prefix:name), which needs a definition of the prefix. Given that a website can be seen as the namespace of all of its web pages, the prefix is defined for every website separately.
For now, let’s see what projecting the labeled edges will look like in practice:
In the above image, one can see that CURIEs represent the differences between the URIs of adjacent nodes in the hierarchy. Or, from the RDF point of view, the differences between the URIs of the subject and the object in a triple.
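Reading the label back from a tree link is the reverse operation: take the difference between the child URI and the parent URI and turn it into a CURIE. The sketch below assumes that a prefix never contains an underscore, so the first underscore in a segment separates prefix from name; `predicate_between` is a name invented for this example.

```python
def predicate_between(parent_uri, child_uri):
    """Recover the predicate CURIE from the difference between two URIs."""
    base = parent_uri.rstrip("/") + "/"
    if not child_uri.startswith(base):
        raise ValueError("child URI is not below the parent URI")
    segment = child_uri[len(base):]
    if not segment or "/" in segment:
        raise ValueError("child URI is not a direct child of the parent")
    # Assumption: the first underscore separates prefix from local name.
    prefix, _, name = segment.partition("_")
    return f"{prefix}:{name}"


chuck = "http://chucknorris.com/data_/chuck"
label = predicate_between(chuck, chuck + "/foaf_based_near")
```

In this sketch `label` comes out as `foaf:based_near`, the name of the tree link between the two nodes.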
However, what about the edges targeting nodes from external websites, which are not part of the hierarchy? In the image, these edges are marked with red question marks: another problem we have to solve.
In the RDF graph example, there are triples in which the object is not the child in a hierarchy. For instance:
<http://chucknorris.com/data_/chuck> foaf:knows <http://brucelee.com/data_/bruce> .
In the Turtle notation, this triple expresses all the information we need to know. However, there are actually two distinct concepts here: the reference and the target node. The reference can be understood as a variable, or a node that points to the target. If we want to project this relation, we must separate these two concepts. In other words, we will need an additional node that will act like the reference.
If we follow the same principles as where the object is a child node, we will end up with a URI like this:
<http://chucknorris.com/data_/chuck/foaf_knows/bruce>
This way, we encoded the information about the link as the difference between the URI of the subject, <http://chucknorris.com/data_/chuck>, and the URI of the reference.
In the Web context, "pointing to" means "linking to". Thus, we will use hyperlinks to connect references with targets. Therefore, the URI <http://chucknorris.com/data_/chuck/foaf_knows/bruce> identifies the reference that holds a hyperlink to <http://brucelee.com/data_/bruce>.
It says that the first node is the same as the second, meaning the hyperlink represents the „same as“ relation. Because hyperlinks can’t hold a name, they all have to share the same meaning. In the RDF context, this meaning can be expressed with the property owl:sameAs.
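A small sketch of this step: given a triple whose object lives on another website, mint a reference node under the subject and connect it to the target. The function name `project_external_link` and the explicit `key` argument are assumptions for illustration, and the hyperlink is written as owl:sameAs, the property this post uses for reference nodes in the prefix definitions.

```python
def project_external_link(subject_uri, predicate_curie, key, target_uri):
    """Return the triple stating that a local reference node equals the target.

    The reference URI is <subject>/<prefix_name>/<key>; the hyperlink from
    the reference to the external target is expressed as owl:sameAs.
    """
    segment = predicate_curie.replace(":", "_", 1)
    reference_uri = f"{subject_uri}/{segment}/{key}"
    return (reference_uri, "owl:sameAs", target_uri)


link = project_external_link(
    "http://chucknorris.com/data_/chuck",
    "foaf:knows",
    "bruce",
    "http://brucelee.com/data_/bruce",
)
```

The label foaf:knows is carried by the tree path to the reference node, so the hyperlink itself needs no name.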
In the image, three hyperlinks are depicted in the blue color. Also, one can notice the intermediate node http://chucknorris.com/data_/chuck/foaf_knows taking place between http://chucknorris.com/data_/chuck and the two child reference nodes.
The tree links between this node and its child nodes are not named by a CURIE, and thus are not defined explicitly. Their names are the "keys" that distinguish the members of the "array" the node represents. A property is implied between the members and the http://chucknorris.com/data_/chuck/foaf_knows node, and that node is a subclass of the range of the foaf:knows property (i.e. foaf:Person). This will be discussed in more detail in future posts.
We are almost finished. The final thing we have to do is to provide a "dictionary": the definitions for all the prefixes used in CURIEs.
The question is whether it’s better to place all data together with the prefix "subtree" on a single tree branch in a website or to use several branches. The latter approach allows binding the prefix_ segment directly to the website address, which results in somewhat nicer URIs.
In practice this will look like this: one branch of the website tree is used for projecting the RDF graph, and another is used for prefix definitions, in a similar way it’s done in RDF notations.
In this case, the prefixes used in CURIEs (such as foaf and geo) are the child nodes of the standard prefix_ node. These nodes are reference nodes to the relevant namespace URIs. For instance, the foaf prefix will be defined as follows:
<http://chucknorris.com> prefix: <http://chucknorris.com/prefix_/foaf> .
<http://chucknorris.com/prefix_/foaf> owl:sameAs <http://xmlns.com/foaf/0.1/> .
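With the prefix nodes in place, a URI segment such as foaf_nick can be expanded to the full predicate URI by looking up the prefix under the site’s prefix_ node. The sketch below keeps the lookup table in a plain dict; the names `PREFIX_TARGETS` and `expand_segment` are hypothetical, and the FOAF namespace is written with its trailing slash, as FOAF publishes it, so that simple concatenation yields the property URI.

```python
# Reference node -> namespace URI it points to (the owl:sameAs target).
PREFIX_TARGETS = {
    "http://chucknorris.com/prefix_/foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_segment(site, segment, prefix_targets):
    """Expand a path segment like 'foaf_nick' into a full predicate URI."""
    # Assumption: the first underscore separates prefix from local name.
    prefix, _, name = segment.partition("_")
    reference_uri = f"{site.rstrip('/')}/prefix_/{prefix}"
    return prefix_targets[reference_uri] + name


nick_uri = expand_segment("http://chucknorris.com", "foaf_nick", PREFIX_TARGETS)
```

Because the table is keyed by the site’s own prefix_ nodes, every website can define its prefixes separately.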
In the image below, the namespaces are shown as single nodes for simplicity. They are of course nodes in the websites forming similar structures as the other depicted websites.
The final image shows the projected RDF graph in a simpler and friendlier way. As opposed to the previous image, where full URIs are written together with redundant typed links, here the nodes' names are CURIE segments. The name of a node represents the predicate between it and its parent node. The full URI of any node can be obtained by connecting all the nodes in the hierarchy.
Literals are terminal nodes and are represented as leaves, while their CURIEs are not shown explicitly. A literal’s URI must always end with the rdf_value segment, so that segment is implied. For instance, the URI of the rightmost leaf can be obtained in the following way:
'http://chucknorris.com'
+ '/' + 'data_'
+ '/' + 'chuck'
+ '/' + 'foaf_based_near'
+ '/' + 'geo_long'
+ '/' + 'rdf_value' =
'http://chucknorris.com/data_/chuck/foaf_based_near/geo_long/rdf_value'
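The same concatenation can be written as a tiny helper that joins the node names from the root down and appends the implied rdf_value segment for literals. The function name `full_uri` and its `is_literal` flag are made up for this sketch.

```python
def full_uri(site, path, is_literal=False):
    """Join the node names along a tree path into a full URI."""
    segments = [site.rstrip("/"), *path]
    if is_literal:
        # The final segment of a literal is implied in the image.
        segments.append("rdf_value")
    return "/".join(segments)


leaf = full_uri(
    "http://chucknorris.com",
    ["data_", "chuck", "foaf_based_near", "geo_long"],
    is_literal=True,
)
```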
Reference nodes have a blue border and provide the connection to a target node using a hyperlink (blue arrow). They can also be considered terminal nodes in the website context because, like literals, they can’t branch further.
The final image shows that the projection of an RDF graph onto the Web is possible and can be done in an elegant way.