The Web is just a bunch of trees plus shortcuts

“Graph thinking” is one of the biggest conceptual problems when it comes to learning and understanding Linked Data and the RDF model, according to Rob Styles. Here, the term “graph thinking” refers to the ability to think about data as a graph, a web, a network. People, although they understand the concept of a graph, are used to thinking about data from one point of view or another, and have difficulty when they need to “put themselves above the data”, i.e. imagine a graph as a whole.

It’s interesting that for developers it can be even harder (compared to non-programmers):

Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about the problem. Even for those fluent in object-oriented design (a graph model) the practical implications of working with a graph of objects leads us to develop, predominantly, trees.

Similarly, it seems that most people understand that the Web is a huge graph consisting of web pages and hyperlinks between them. However, the Web is “experienced” from the perspective of particular Web sites or pages (which are organized predominantly hierarchically), rather than a Web graph as a whole.

For example, the typical navigation menu on a website contains a list of hyperlinks to internal web pages (top-level menu), representing hierarchically organized “child” nodes of the tree forming around the website as its root. External hyperlinks to other Web sites and pages, as well as internal (relative) hyperlinks that “skip” this hierarchy, break the tree structure and create a graph*.

[Figure: graph, tree, links]

Another “graph” that people seem to understand intuitively is a file system. File systems typically have directories (folders) and allow hierarchies where directories may contain subdirectories. These trees are relatively easy to understand, but are somewhat limited when it comes to navigation. In a tree, you can go one level up, or one level down.

Fortunately, you’re not limited to this kind of “tree link”, but can “jump” to any part of the file system. You can do that thanks to shortcuts, which are possible because every folder or file has a unique address: a path that can be easily manipulated. So when starting a program, you don’t have to go to the exact location of the executable file on the disk every time; you can simply click on the shortcut on the Desktop. In the same way that hyperlinks break the hierarchy of websites, shortcuts break the hierarchical structure of folders in a file system.
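To make the “trees plus shortcuts” picture a bit more concrete, here is a minimal Python sketch. The folder names and the helper has_single_parents are invented for the illustration. It only checks the one property that distinguishes a tree from a more general graph here: in a tree, every node is reached from exactly one parent.

```python
from collections import Counter

# Hypothetical folder hierarchy: (parent, child) pairs forming a tree.
TREE_LINKS = [
    ("/", "/home"),
    ("/", "/usr"),
    ("/home", "/home/vuk"),
    ("/usr", "/usr/bin"),
]

# A hypothetical shortcut that jumps across the hierarchy.
SHORTCUTS = [("/home/vuk", "/usr/bin")]


def has_single_parents(edges):
    """Return True if no node is the target of more than one link."""
    parents = Counter(child for _, child in edges)
    return all(count == 1 for count in parents.values())


print(has_single_parents(TREE_LINKS))              # tree links only
print(has_single_parents(TREE_LINKS + SHORTCUTS))  # tree links plus a shortcut
```

With the shortcut added, /usr/bin can be reached from two places, so the structure is no longer a plain tree.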

It seems that a predominantly hierarchical view of a graph (plus “shortcut” links) is intuitively understood, and that this fact should be used to facilitate understanding of the RDF model.

Linked Data is a step in this direction. In the Linked Data context, resources are identified by HTTP URIs, and their descriptions (obtained by dereferencing the URIs) contain all the RDF triples in which a particular resource appears as the subject or the object. In short, the description contains the part of a graph in which one node becomes the “root” relative to the other nodes, which can be thought of as its child nodes. Again, RDF links break the tree structures, connecting these subgraphs (RDF molecules or data objects) into a single giant global graph.
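The notion of a resource description used above can be sketched in a few lines of Python. The triples and the describe helper are made up for the example; the sketch simply selects every triple in which the resource appears as the subject or the object.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:knows", "ex:alice"),
    ("ex:carol", "foaf:knows", "ex:bob"),
]


def describe(resource, triples):
    """Return the triples that have `resource` as subject or object."""
    return [t for t in triples if resource in (t[0], t[2])]


for triple in describe("ex:alice", TRIPLES):
    print(triple)
```

The returned subgraph is the part of the graph “rooted” at ex:alice; the triple about Bob's name belongs to another description.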

However, the problem is that you can’t browse this Linked Data graph the way you browse the Web or your file system. You are not allowed to traverse the nodes “hidden” in the documents containing the descriptions: you must download and parse them. These bits of data don’t have addresses, paths you can refer to or use for shortcuts.

When it comes to the Semantic Web and RDF, it seems that the idea of paths is primarily applied in the context of query languages. But what about paths as a part of the RDF model itself?

Tim Berners-Lee has written about them in the document Shorthand: Paths and lists, and @keywords:

Often it turns out that you need to refer to something indirectly through a string of properties, such as “George’s mother’s assistant’s home’s address’ zipcode”. This is traversal of the graph.

Such an indirect referencing can be expressed through a series of RDF triples chained with a number of blank nodes:

[is con:zipcode of [
    is con:address of [
        is con:home of [
            is off:assistant of [
                is rel:mother of :George]]]]]

The author then presents a more elegant notation: a shortcut inspired by the way method and attribute access is chained in object-oriented languages (dot notation), where “.” (dot) is used as a delimiter:

:George.rel:mother
          .off:assistant
            .con:home
              .con:address
                 .con:zipcode

This is forward traversal of the graph, where with each “.” you move from something to its property. So ?x.con:mailbox is x’s mailbox, and in fact in english you can read the “.” as “’s”.
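The forward traversal described in the quote can be sketched in Python over a toy set of triples. This is only an illustration: the node names (:Martha, :Alfred and so on), the zipcode value and the follow_path helper are all invented, and the sketch assumes that each property has a single value.

```python
# Toy triple store: (subject, predicate) -> object.
# All node names and values are invented for the illustration.
TRIPLES = {
    (":George", "rel:mother"): ":Martha",
    (":Martha", "off:assistant"): ":Alfred",
    (":Alfred", "con:home"): ":AlfredsHome",
    (":AlfredsHome", "con:address"): ":AlfredsAddress",
    (":AlfredsAddress", "con:zipcode"): "02139",
}


def follow_path(start, path, triples=TRIPLES):
    """Walk from `start` along a dot-separated chain of properties."""
    node = start
    for prop in path.split("."):
        # Each "." moves from the current node to the value of its property.
        node = triples[(node, prop)]
    return node


zipcode = follow_path(
    ":George", "rel:mother.off:assistant.con:home.con:address.con:zipcode"
)
print(zipcode)
```

Every intermediate result of the walk (George’s mother, her assistant, and so on) is a node in its own right, which is exactly the point the quote makes.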

Let me repeat what I think is one of the most powerful and yet one of the most neglected ideas of the Semantic Web:

You move from something to its property.
?x.con:mailbox is x’s mailbox.

In Linked Data, you don’t move from something to its property. You can only move from something to “something else”. Now, if you can move to the property, it means you can stop, rest a bit, look around you. If you look behind, you’ll see a single node, the parent. And if you look ahead, you’ll see the children nodes, through which you can go on the journey, one node at a time. You are placed on the part of the global graph that has a form of a tree.

The statement “?x.con:mailbox is x’s mailbox” suggests that the “mailbox” relation is “instantiated”, materialized in the form of a distinct node that depends on its parent. That node has a dual nature, encoding both the relation and the node involved in the relation.

This approach is the one that fully respects the nature of a directed labeled graph. It’s elegant and provides flexibility in expression. It facilitates implementation of n-ary relations and encourages modular design. It allows deep, nested structures instead of flat ones. It uses indirect referencing, which is how people think and refer to things.

Finally, it indirectly acknowledges the hierarchical aspect of the RDF graph. It is quite similar to the structure of the websites. This is the only approach that enables realization of the Web of data, i.e. proper projection of an RDF graph to the Web graph.

So, how come such a powerful idea has never come to life? First, Tim presented this idea primarily as a syntax convention (sugar), failing to realize the full potential of his own words. Second, it relies on hated, URI-less, evil blank nodes. The only way to fix it is to somehow add URIs to these nodes. But isn’t the very absence of URI references what makes this approach possible?

It sounds almost like a paradox. On one hand you have paths without URIs, and on the other there are opaque URIs… containing no paths. There are two clear requirements – one from each side of the equation. Paths and URIs are both needed. Therefore, we have no other choice than to connect them.

And don’t forget: ?x.con:mailbox is x’s mailbox.

* Of course, a tree is already a (kind of) graph. Here, the term “graph” can be thought of as a graph in a wider sense.

Getting rid of typed literals

So far I have been dealing with the complexities of the RDF model: the mess surrounding the concept of a node, the many different types of nodes, as well as the different methods of identifying nodes that have a name. However, there are aspects which are not just overcomplicated, but plain ugly as well. Here I am referring to the last bits of the RDF model that need some serious attention: typed literals and literals with language tags.

Like blank nodes, this is a subject that has caused many debates over the years. The people who made the model somehow always find a justification for every ugly bit of it. However, the fact that the RDF model is “logically consistent” (or whatever the right word is) doesn’t mean that it’s the only solution, and certainly doesn’t mean that it is the best one.

In the post on redefining the node of an RDF graph, I described nodes from two aspects: a node as a data structure and a node as a symbol. As a data structure, a node is basically an object with two attributes: a name and data. Therefore, if the data attribute holds a datatype or a language tag in addition to the value, it will no longer contain atomic data. It becomes an object itself with two attributes. And that sucks.

So why does the datatype have to be explicitly added to a literal every time? Couldn’t the datatype be declared, in a similar way to how an instance is declared? If the range of a property can be described, and the type of an instance can then be inferred, why doesn’t that work for literals, too?

The problem

In the discussion on this subject on the RDF Working Group mailing list, Antoine Zimmermann explained it nicely:

Often, one would like to write:

ex:prop  rdfs:range  xsd:decimal .
ex:sub   ex:prop     "42" .

and infer that “42” is a decimal number. However, what one gets from these two triples is that “42” is a sequence of 2 characters AND a decimal, which is inconsistent.

Antoine then writes that overcoming this in the RDF model is hard, because “literals are universal identifiers, just like URIs”. So “42” identifies the same thing in all situations. He goes on with another example, which would not be possible in practice:

ex:prop rdfs:range xsd:decimal .
ex:sub ex:prop "42" .
ex:password rdfs:range xsd:string .
ex:sub2 ex:password "42" .

Here, certainly some people would expect the first “42” to be denoting the number, while the second is just two characters. But this implicitly assumes that the denotation of literals is contextual: it would depend on which predicate is used in the triple. While it would be possible, in principle, to define a language where this makes sense, it does not fit at all with the RDF data model.

In another response in the same thread, Pat Hayes wrote, regarding inferring datatypes from the property definitions:

[...] in general, a property range is a class, not a datatype. What happens when the range is just a class and has no associated L2V mapping? Also, a property can have many ranges. What happens if two of them are datatype classes? Which one gets control over the interpretation of the literal string? [...]

A new literal

Before I start discussing potential solutions for the above problems, let’s recall how literals are different in the new RDF model I’ve been proposing in this blog, compared to the classical RDF model:

  • Literal nodes are identified by URIs, not by their values
  • Literal nodes are not literal values themselves, but the symbols representing the values
  • A literal node is always the object of the rdf:value predicate in a triple

In the new model, literals are no longer treated as universal identifiers. They are, like every other node in the RDF model, identified by a URI. They represent literal values, whose meaning depends on the context. This context is not defined by the literal, but by the node that has that literal as a value. This node is an instance (object) of one or more classes. It has one main value, represented by a literal and realized by the rdf:value property. This kind of node resembles primitive data types from programming languages, such as string, integer, float, boolean…

The difference is that in programming, you don’t have to define the meaning of every variable. In object oriented programming, for example, the „primitive“ variable named weight is often implemented as an instance of the Float class. In RDF, however, the weight has a meaning that is not limited to the datatype of its value.

rdf:datatype

Therefore, we have a type of an object and a datatype of that object’s value as distinct concepts. For instance, a product weight can be an instance of the class ex:Weight, and its value can be of float datatype. Check out the following example:

ex:Weight
    rdf:type      rdfs:Class ;
    rdf:datatype  xsd:float .

ex:hasWeight
    rdfs:range    ex:Weight .

<http://example.com/data_/product/item10245>
    ex:hasWeight  <http://example.com/data_/product/item10245/ex_hasWeight> .
<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    ex:inUnit     ex:kg .

The new property rdf:datatype used here has the meaning “the value of an instance of the class S is of datatype O”. Therefore, the value of http://example.com/data_/product/item10245/ex_hasWeight is of datatype xsd:float.

The literal is represented by its value 2.4. There is no need to explicitly write its URI in the syntax. If needed, it can be easily obtained by joining the URI of its parent node with the CURIE segment rdf_value. The result is http://example.com/data_/product/item10245/ex_hasWeight/rdf_value.

2.4 is not a universal identifier any more, so 2.4 doesn’t have to always identify the same thing. Its meaning thus becomes relative to the context, not absolute in the whole RDF graph. For instance, it can sometimes be a float number, and sometimes a string. It’s just a value whose meaning depends on the type of its parent node. In this case, the literal acts as the value of an instance of the class ex:Weight whose rdf:datatype is described as xsd:float, so we can say that 2.4 is a float number.
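The following Python sketch shows how a consumer might interpret a literal from its context, along the lines described above. It is only an illustration under several assumptions: the graph is a plain dictionary, the class of a value node is taken from the rdfs:range of the property, and ex:Code and ex:hasCode are hypothetical additions that do not appear in the example above.

```python
# Hypothetical in-memory graph: subject -> {predicate: object}.
GRAPH = {
    "ex:Weight": {"rdf:datatype": "xsd:float"},
    "ex:Code": {"rdf:datatype": "xsd:string"},   # hypothetical class
    "ex:hasWeight": {"rdfs:range": "ex:Weight"},
    "ex:hasCode": {"rdfs:range": "ex:Code"},     # hypothetical property
    "item10245": {
        "ex:hasWeight": "item10245/ex_hasWeight",
        "ex:hasCode": "item10245/ex_hasCode",
    },
    "item10245/ex_hasWeight": {"rdf:value": "2.4"},
    "item10245/ex_hasCode": {"rdf:value": "2.4"},
}

# Assumed lexical-to-value mappings for two XSD datatypes.
L2V = {"xsd:float": float, "xsd:string": str}


def typed_value(subject, predicate, graph=GRAPH):
    """Interpret the rdf:value of the node reached via `predicate`."""
    node = graph[subject][predicate]
    node_class = graph[predicate]["rdfs:range"]   # class of the value node
    datatype = graph[node_class]["rdf:datatype"]  # datatype of its value
    return L2V[datatype](graph[node]["rdf:value"])


weight = typed_value("item10245", "ex:hasWeight")
code = typed_value("item10245", "ex:hasCode")
print(repr(weight), repr(code))
```

The same lexical form "2.4" ends up as a float in one context and as a string in the other, because the interpretation is driven by the parent node rather than by the literal itself.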

This is a good example of why it is important to make a clear distinction between the concepts of a node, its name and the data it holds. In the case of the literal 2.4, one can say that this literal is a node that has the name http://example.com/data_/product/item10245/ex_hasWeight/rdf_value and holds the value 2.4. In short, literal node != literal value, and literal value != literal name.

Furthermore, a property range is always a class, and a datatype is the range of the new, special property rdf:datatype. This way, classes and datatypes with an associated L2V mapping are clearly separated.

In reality, one cannot expect that a class will have just one rdf:datatype. In addition, a node can be an instance of more than one class, each having a different datatype defined. As a solution, one can describe the instance itself with the preferred rdf:datatype, as in the following example:

<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    rdf:datatype  xsd:float ;
    ex:inUnit     ex:kg .

The datatype information would be available at the URI http://example.com/data_/product/item10245/ex_hasWeight/rdf_datatype, providing a much more elegant way of expressing datatypes than typed literals.

But it shouldn’t be mandatory. Often, the context can help. In this case, there is the property ex:inUnit whose rdfs:domain is, say, ex2:Measure, with an rdf:datatype of xsd:float. In addition, when in doubt, a data consumer can check the superclass of ex:Weight and find out that it is, say, ex3:Measure, where ex3 is an authoritative source.
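One possible lookup order for a data consumer is sketched below in Python: a datatype stated on the instance wins, otherwise the class and then its superclasses are consulted. This is an assumption about how the pieces could fit together, not a specification. The sketch assumes that the class of a node is stated explicitly with rdf:type, and the node names weight1 and weight2 and the resolve_datatype helper are invented.

```python
# Hypothetical graph: subject -> {predicate: object}.
GRAPH = {
    "ex3:Measure": {"rdf:datatype": "xsd:float"},
    "ex:Weight": {"rdfs:subClassOf": "ex3:Measure"},
    "weight1": {"rdf:type": "ex:Weight", "rdf:value": "2.4"},
    "weight2": {
        "rdf:type": "ex:Weight",
        "rdf:value": "2.4",
        "rdf:datatype": "xsd:string",  # instance-level override
    },
}


def resolve_datatype(node, graph=GRAPH):
    """Return the datatype for `node`, or None if nothing declares one."""
    description = graph.get(node, {})
    # 1. A datatype stated on the instance itself takes precedence.
    if "rdf:datatype" in description:
        return description["rdf:datatype"]
    # 2. Otherwise climb from the node's class through its superclasses.
    cls = description.get("rdf:type")
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)  # guard against cyclic subclass declarations
        cls_description = graph.get(cls, {})
        if "rdf:datatype" in cls_description:
            return cls_description["rdf:datatype"]
        cls = cls_description.get("rdfs:subClassOf")
    return None


print(resolve_datatype("weight1"))
print(resolve_datatype("weight2"))
```

Returning None when nothing is declared leaves the decision to the consumer, which matches the idea that declaring a datatype should not be mandatory.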

Chances are that, ultimately, a limited number of superclasses defining datatypes will emerge, and that classes whose instances can have values will, in order to be trusted, have to be declared as subclasses of these superclasses.

Finally, one last note on the datatypes in RDF. My impression is that a few datatypes are used much more frequently than the others. These most frequently used datatypes should really be part of the core RDF (or RDFS) ontology; for instance, rdf:String, rdf:Number, rdf:Boolean and rdf:Date would be just enough. Actually, it would make sense to merge RDF and RDFS with a few elements from OWL (so-called RDFS++) into one core ontology. Things are sometimes just ridiculously unintuitive: take rdf:Property and rdfs:Class for instance. But that’s another story.

rdf:language

Finally, we get to the literals with language tags – another special case of literals. Let’s look at the example from DBpedia:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrade is the capital and largest city of Serbia. "@en .
<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrad ist die Hauptstadt der Republik Serbien."@de .

In this snippet, the language tags en and de, placed at the end of literals using (another) special syntax @, denote that the literals are in English and German, respectively.

The solution for literals with language tags is based on the same principles as the solution for typed literals. Let’s focus on the English version. First, we need to distinguish the concept of a value from the resource that contains that value:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment <http://dbpedia.org/resource/Belgrade/rdfs_comment/en> .

<http://dbpedia.org/resource/Belgrade/rdfs_comment/en>
    rdf:value "Belgrade is the capital and largest city of Serbia." .

…which can be shortened to:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia." .
]

...or shortened even further:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment:en "Belgrade is the capital and largest city of Serbia." .

Here, rdfs:comment:en is an example of the extended CURIE described in an earlier post.

http://dbpedia.org/resource/Belgrade/rdfs_comment/en is a URI of a node whose value is the description in English. The URI segment at the end of the URI is required because there are several rdfs:comment properties. en can be a convention that has the role of the previously used language tag (in the form of @en). This solution shifts the language information into the URI and doesn't require special treatment in the RDF model, as was the case with literals with language tags.

However, in order to explicitly describe the language, an additional triple is needed:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia." ;
    rdf:language <http://dbpedia.org/resource/English_language> .
]

Here I used the new rdf:language property to describe the language of the comment. The subject of the RDF triple is a comment identified by http://dbpedia.org/resource/Belgrade/rdfs_comment/en, and the object is the DBpedia resource representing English language.

It seems reasonable that rdf:language will be used to describe primarily instances, compared to rdf:datatype intended for the description of classes. However, there should be no restrictions – each should be allowed to describe both classes and instances.
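The convention of carrying the language in the last URI segment can be illustrated with a short Python sketch. The helper names language_node_uri and language_of are invented, and treating the last segment as a language code is only the convention suggested in this post, not an established standard.

```python
def language_node_uri(resource_uri, curie, lang):
    """Build the URI of the node holding the `lang` variant of a property."""
    prefix, local = curie.split(":", 1)
    return f"{resource_uri}/{prefix}_{local}/{lang}"


def language_of(node_uri):
    """Read the language back from the last URI segment (by convention)."""
    return node_uri.rsplit("/", 1)[-1]


belgrade = "http://dbpedia.org/resource/Belgrade"
comment_en = language_node_uri(belgrade, "rdfs:comment", "en")
comment_de = language_node_uri(belgrade, "rdfs:comment", "de")

print(comment_en)
print(language_of(comment_de))
```

A consumer that wants certainty rather than a naming convention would still look for the explicit rdf:language triple on the node.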

Conclusion

If we want a truly simple and flexible RDF model, the denotation of literals should be contextual. A literal serves the role of the value of some other node in a graph. This node carries the information about a literal – its datatype as well as the language.

In that sense, two new properties are proposed: rdf:datatype and rdf:language, which are special in that they don’t describe a resource directly, but rather its value (or the value of its instance, in the case of a class) expressed using the rdf:value property. The meaning of the literal value depends on that node. This is possible because the universal identifier of a literal is its URI, not the value.

The proposed realization of literals, along with the removal of blank nodes, greatly simplifies the RDF model. Resources are described exclusively with RDF triples, using URI references and plain literals. "Special" literal nodes are discarded from the model.

The new RDF model has a clearly defined node that always has a name identified using the same mechanism (URI), i.e. all names belong to a single set. There are only two kinds of nodes - one representing a resource and one representing a literal value. The fact that all nodes now have a URI as name enables the next step – a (proper) realization of an RDF graph on the Web.


The “RDF graph” URI pattern

Anyone involved in anything having to do with the Semantic Web or Linked Data knows how much time and energy is wasted on endless discussions on the blank node issue. It is a controversial topic because, on the one hand, blank nodes cause huge problems in practice, while on the other, they enable great flexibility of expression.

A more profound reason is hidden in this flexibility, one which can perhaps explain how blank nodes have survived as a part of the RDF model all these years despite all the headaches they have caused. The thing is, blank nodes reflect a human way of referencing things. Let me show an example:

If I want to talk about my left arm, it’s quite unnatural to invent a new identifier for it. I’ll just say “my left arm”, describing it relative to myself, and a listener will understand. This is possible due to the human ability to understand the context. He or she knows that the pronoun “my” refers to me as something unique, the arm being part of me, and the “left” finally specifying the exact arm. So, my left arm, unique in the universe, is referenced quite simply and elegantly.

In RDF, it can be expressed with the two statements (triples) as: “I have an arm. It (has a property that) is left”. Let’s assume that we know that the blank node is of a type „ex:Arm“ implicitly through the property:

ex:hasArm rdfs:range ex:Arm .

Given that my URI is http://milicicvuk.com/data_/vuk, and assuming the relevant properties are defined in the ex ontology, we can express it with the following triples:

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left .
]

The left arm is represented by the blank node, which is the object in the first triple and the subject in the second, thus chaining them and forming a rather readable code.

Now, let’s take a slightly more complicated example. I can say something like “the 5 cm scar on my left arm” (or my left arm’s 5 cm scar). Again, the scar is relative to the left arm, and the arm is relative to me. Translated to RDF, it becomes: “I have an arm which has a property that is left and has a scar that has a length which has a value 5 and is in unit of cm”. This rather cumbersome sentence is much clearer when written in Turtle notation using nested blank nodes:

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left ;
    ex:hasScar [
        ex:hasLength [
            rdf:value "5" ;
            ex:inUnit ex:cm .
        ]
    ]
]

Here we have three blank nodes that connect various statements which results in pretty elegant and readable code. This level of elegance and readability can never be achieved by using URI references.

That’s what makes blank nodes cool – they allow referencing relative to another thing. You can, instead of minting identifiers for every possible resource, just say, „something“ or „someone“, which is related to something else that has an identifier. The trouble is that this coolness is greatly diminished due to the negative side of not having global identifiers.

The question is: is it possible to keep the flexibility of blank nodes while having URIs at the same time? The answer is: yes, there is an elegant solution that allows just that.

Namespaces

To understand it, let’s try to look at the problem from the perspective of a namespace. The idea of a namespace is related to that of a context. A namespace is defined as a container that provides context of identifiers. A namespace has a unique name in the global space, allowing otherwise ambiguous identifiers to also become globally unique.

Now, let’s for a moment look at the part enclosed between [ and ] in the first RDF example. The subject and the predicate of the first triple (<http://milicicvuk.com/data_/vuk> ex:hasArm) act as the namespace of the part between square brackets. It uniquely defines the “container” that provides context for local identifiers.

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left .
]

However, in order for this namespace to be usable, we must convert it to a URI. The URI http://milicicvuk.com/data_/vuk alone can be seen as a kind of namespace for the predicate ex:hasArm. Of course, ex:hasArm is also a unique CURIE, but in this context, it acts as a local identifier.

Put in this perspective, it is not hard to figure out what the full name of that identifier is. It is formed as with any other namespaced identifier, by concatenating the namespace with the local name.

As a delimiter, we are going to use the “slash” character “/”, a standard delimiter of URI segments. The result is:

http://milicicvuk.com/data_/vuk/ex:hasArm

Another thing we have to do is to replace the URL unfriendly character “:” with something else. Let’s use the “_” char[*]. Finally, we get:

http://milicicvuk.com/data_/vuk/ex_hasArm

We got the URI of the namespace defined by the subject and the predicate of the triple. Another way of looking at this URI is as the full name of the “local identifier” ex:hasArm, defined in the http://milicicvuk.com/data_/vuk context. In any case, in the context of this new namespace we are going to define new local identifiers, using the following template:

http://milicicvuk.com/data_/vuk/ex_hasArm/localIdentifier

Namespaces are cool because they allow us not to worry about the global scope. The uniqueness of a namespace guarantees that all new identifiers defined in its context will also be unique. This way we reduced the problem of creating a whole new URI to the problem of inventing a name which has to be only locally unique.

In this particular case, given that I have just two arms, the local identifiers “left” and “right” will do the job nicely[**]. The full URIs (with the namespace) will thus look like this:

http://milicicvuk.com/data_/vuk/ex_hasArm/left

http://milicicvuk.com/data_/vuk/ex_hasArm/right

Therefore, the resource (the arm) that was previously represented by a blank node now has a URI (http://milicicvuk.com/data_/vuk/ex_hasArm/left). The blank node has just evolved into a URI reference while keeping its flexibility of expression!
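The naming steps described above (concatenate the subject URI with the property, replace “:” with “_”, then append the local identifier) can be sketched as a small Python function. The name mint_uri is invented for the illustration, and the sketch assumes the single underscore delimiter used in this post.

```python
def mint_uri(subject_uri, curie, local_id):
    """Build the URI of a node that would otherwise be a blank node."""
    prefix, local = curie.split(":", 1)
    # The subject URI plus the property segment forms the namespace...
    namespace = f"{subject_uri}/{prefix}_{local}"
    # ...and the local identifier only has to be unique inside it.
    return f"{namespace}/{local_id}"


me = "http://milicicvuk.com/data_/vuk"
left_arm = mint_uri(me, "ex:hasArm", "left")
right_arm = mint_uri(me, "ex:hasArm", "right")

print(left_arm)
print(right_arm)
```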

Additionally, we have a clear pattern for other URIs, too. What about identifiers for legs? No problem:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left

http://milicicvuk.com/data_/vuk/ex_hasLeg/right

As I mentioned earlier, every new URI is at the same time the namespace for new identifiers. New namespaces can be built on the basis of the previous ones, forming the chain of nested namespaces. Namely, ex:hasScar is a local identifier in the context defined by http://milicicvuk.com/data_/vuk/ex_hasLeg/left namespace. Suppose it’s a scar from a surgery, suggesting the local identifier surgery:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery

Again, the new URI is the namespace of the subject http://milicicvuk.com/data_/vuk/ex_hasLeg/left and the predicate ex_hasScar, forming the container for the local identifier “surgery”. The full URI of the scar is therefore the URI of the object of that triple, previously being a blank node:

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left> ex:hasScar <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery> .

What about literals? The exact same method can be applied to constructing the URIs of literals as well. The literal “5” in the second RDF example will get the URI:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength/rdf_value

Literals’ URIs by convention always end with rdf_value segment because literal nodes are always values of the rdf:value property. Also, literal nodes are special in that they are terminal nodes, meaning they can not branch further (and thus can not serve as namespaces for new identifiers).
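Chaining nested namespaces, and ending a literal’s URI with rdf_value, can be sketched in Python as follows. The helpers mint_path and literal_uri are hypothetical names; a step whose key is None stands for a property that needs no local identifier, such as ex:hasLength here.

```python
def mint_path(root_uri, steps):
    """Extend `root_uri` with a chain of (property CURIE, key) steps."""
    uri = root_uri
    for curie, key in steps:
        # The property becomes a URI segment, with ":" replaced by "_".
        uri = f"{uri}/{curie.replace(':', '_', 1)}"
        if key is not None:
            uri = f"{uri}/{key}"
    return uri


def literal_uri(parent_uri):
    """Literal nodes are terminal and always end with rdf_value."""
    return f"{parent_uri}/rdf_value"


me = "http://milicicvuk.com/data_/vuk"
scar = mint_path(me, [("ex:hasLeg", "left"), ("ex:hasScar", "surgery")])
length = mint_path(scar, [("ex:hasLength", None)])

print(scar)
print(literal_uri(length))
```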

URI patterns

You may have recognized a pattern used in these URIs. It is a variation of a well-known URI pattern used on the Web that consists of two parts: one representing the collection, and the other being one individual (instance) of the collection.

This pattern is also used in Linked Data. In the book Linked Data patterns, URIs of this kind are called patterned URIs and are recommended as a way of creating more hackable and human-readable URIs. The authors suggest using pluralized class names as the first part of the URI pattern, and an identifier as the second.

For example if an application will be publishing data about book resources, which are modelled as the rdf:type ex:Book. One might construct URIs of the form:

/books/12345

Where /books is the base part of the URI indicating “the collection of books”, and the 12345 is an identifier for an individual book.

In another, hierarchical URIs pattern, the authors state:

Where a natural hierarchy exists between a set of resources use Patterned URIs that conform to the following pattern:

:collection/:item/:sub-collection/:item

E.g. in a system which is publishing data about individual books and their chapters, we might use the following identifier for chapter 1 of a specific book:

/books/12345/chapters/1

The /chapters URI will naturally reflect to the collection of all chapters within a specific book. The /books URI maps to the collection of all books within a system, etc.

A pattern for naming nodes of an RDF graph can be considered as a kind of “hierarchical URIs” pattern where a property name is used instead of a pluralized class. Its form can be written as follows:

:property/:item/:sub-property/:item

“Hierarchical” is perhaps not the best name for the relations between nodes in a graph, but bear in mind that the part of a graph described this way has the form of a tree with the described resource as a root. Anyway, to differentiate it from the other URI patterns, let’s call it the “RDF graph” URI pattern.

The “RDF graph” URI pattern

Using properties instead of class names explicitly states the relations between the nodes. Also, information about the item’s class can be preserved if contained in the property name, as is the case with the ex:Arm class in ex:hasArm.

The “RDF graph” pattern can be applied to the entire URI of a node, starting from the domain name to the last segment. The default namespace website.com/data_ is a container for the root level nodes which then branch to the lowest level nodes using the same pattern. For instance, in the URI http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength/rdf_value, there are five “property” parts of the URI denoting properties (data:, ex:hasLeg, ex:hasScar, ex:hasLength and rdf:value) and three item (or key) parts (vuk, left, surgery)[***].

The diagram showing a part of RDF graph describing all the nodes contained in the URI looks like this:

[Figure: RDF graph URI pattern]

The triples in the Turtle syntax look like this:

<http://milicicvuk.com>
    data:
        <http://milicicvuk.com/data_/vuk> .

<http://milicicvuk.com/data_/vuk>
    ex:hasLeg
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left>
    ex:hasScar
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery>
    ex:hasLength
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength>
    rdf:value
        "5" .

Using the more concise syntax based on extended CURIEs, it will look as follows:

<http://milicicvuk.com> data::vuk [
    ex:hasLeg:left [
        ex:hasScar:surgery [
            ex:hasLength [
                rdf:value "5" .
            ]
        ]
    ]
]

Note that the literal is represented by its value (5). Its URI, if needed, can be easily inferred from its parent URI.

Different syntactic representations of the first “property” part and the second “item” part of the URI allow a URI to be readable not just to people, but to machines as well. In a form such as /books/12345/chapters/1 we intuitively know which part is which, but there are no syntactic constraints that explicitly make those parts distinct. In the “RDF graph” pattern, the property segment is always in the form of a CURIE, which enables a parser to automatically identify and distinguish between the segments.

Furthermore, the prefixes of the CURIE properties are defined on the default namespace website.com/prefix_, so the full properties’ URIs can be obtained automatically as well. For instance, the full URI of the ex prefix could be retrieved from the http://milicicvuk.com/prefix_/ex path.

This approach allows a generic algorithm to identify URIs implemented using the “RDF graph” pattern and to distinguish them from ordinary, opaque URIs. The parser can then sort out the two types of segments and decompose the URIs into triples thanks to the explicitly defined meanings of the properties. This means that the parser is able not just to “read” the URI, but also to “understand” it, by recursively parsing all the relevant URIs and getting the triples it needs to learn. Its “knowledge” can also be used to guess new URIs by recombining the segments, much as humans do with readable/hackable URIs. Finally, because triples “live” in URIs and are inseparable from them, the source of the triples is always known.
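The decomposition described above can be sketched in a few lines of Python. This is only an illustration, not a reference parser: the function names are made up for this example, and it assumes that property segments are exactly those containing an underscore, so keys such as vuk or left must not contain one (the footnote on the delimiter applies here as well). The object of the last triple is the URI of the literal node, not its value.

```python
from urllib.parse import urlsplit


def is_property_segment(segment):
    # Assumption: only property segments (CURIEs written with "_")
    # contain an underscore; key segments never do.
    return "_" in segment


def segment_to_curie(segment):
    # "ex_hasLeg" -> "ex:hasLeg", "data_" -> "data:"
    prefix, local_name = segment.split("_", 1)
    return f"{prefix}:{local_name}"


def uri_to_triples(uri):
    """Decompose an "RDF graph" pattern URI into (subject, property, object)."""
    parts = urlsplit(uri)
    subject = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    triples = []
    i = 0
    while i < len(segments):
        segment = segments[i]
        if not is_property_segment(segment):
            raise ValueError(f"expected a property segment, got {segment!r}")
        obj = f"{subject}/{segment}"
        i += 1
        # An optional key segment may follow the property segment.
        if i < len(segments) and not is_property_segment(segments[i]):
            obj = f"{obj}/{segments[i]}"
            i += 1
        triples.append((subject, segment_to_curie(segment), obj))
        subject = obj
    return triples


example = ("http://milicicvuk.com/data_/vuk/ex_hasLeg/left"
           "/ex_hasScar/surgery/ex_hasLength/rdf_value")
for s, p, o in uri_to_triples(example):
    print(s, p, o)
```

Applied to the example URI, the sketch yields five triples whose properties are data:, ex:hasLeg, ex:hasScar, ex:hasLength and rdf:value, in that order.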

There is another important repercussion of using the “RDF graph” pattern. Because properties (in the form of CURIEs) become part of the URI, they limit the publisher’s choices when it comes to generating URIs. In other words, the ontology directs the creation of URIs by providing the names for properties. The burden and responsibility of minting URIs is thus transferred from the publisher to the ontology creator. The only things the publisher has to worry about are the local identifiers (“left”, “right” and “surgery” in the above examples). These kinds of “keys” can be recommended by the ontology maker, or can (perhaps more probably) arise as conventions from the community’s best practices.

[*] It could be that some less frequently used character, or double underscore “__” would serve better.

[**] In other cases, if there are many identifiers, or descriptive names aren’t important, simple indexes can be automatically generated or existing IDs can be used.

[***] Note that the :property/:sub-property pattern is also possible if there is a single item, as in ex_hasLength/rdf_value. All the combinations will be discussed in more detail in future posts.


Extended CURIE (prefix:localName:key)

In the post Assigning a URI to each node of an RDF graph, I described the mechanism that enables all nodes to get URIs. For example, the age of a person identified by the URI reference http://chucknorris.com/data_/chuck can be described by using a “classic” blank node as follows (using the Turtle syntax):

<http://chucknorris.com/data_/chuck> foaf:age [ rdf:value "30" ] .

When the blank node gets a URI by appending the predicate CURIE to the subject URI, the example will look like this:

<http://chucknorris.com/data_/chuck>
   foaf:age <http://chucknorris.com/data_/chuck/foaf_age> .

<http://chucknorris.com/data_/chuck/foaf_age>
   rdf:value "30" .

The blank node got a URI, but the initial elegance of the syntax is lost. If a single-valued property is used, as in this case, the URI of the object doesn’t contain additional (key) segments, so it can be derived automatically (http://chucknorris.com/data_/chuck + “/” + foaf_age). With that in mind, and given that the property rdf:value is assumed when it comes to a literal, the previous example can be expressed in a simpler way, using syntactic sugar:

<http://chucknorris.com/data_/chuck> foaf:age "30" .

... which has exactly the same meaning as the syntax from the beginning of the post:

<http://chucknorris.com/data_/chuck> foaf:age [ rdf:value "30" ] .

For multi-valued properties, as is the case with foaf:nick property for instance, the URI can’t be derived automatically. In that case, one can use the “extended” CURIE syntax. For example, if http://chucknorris.com/data_/chuck has two nicknames that are identified by http://chucknorris.com/data_/chuck/foaf_nick/1 and http://chucknorris.com/data_/chuck/foaf_nick/2, the Turtle syntax using extended CURIEs might look like this:

<http://chucknorris.com/data_/chuck> foaf:nick:1 "Chuck" .
<http://chucknorris.com/data_/chuck> foaf:nick:2 "Fatality" .

The key segments 1 and 2 are appended to the existing CURIE using the (already used) : delimiter. The same principle applies if the key segments are not numbers – for example, foaf:nick:byFriend and foaf:nick:byGirlfriend are valid as well. If the nickname is described by other properties besides rdf:value, the Turtle syntax might look like this:

<http://chucknorris.com/data_/chuck> foaf:nick:byGirlfriend [
   rdf:value "Fatality" ;
   foaf:maker <http://chucknorris.com/data_/chucksGirlfriend/1880341>
] .

This syntax is equivalent to the classic Turtle syntax:

<http://chucknorris.com/data_/chuck>
   foaf:nick <http://chucknorris.com/data_/chuck/foaf_nick/byGirlfriend> .

<http://chucknorris.com/data_/chuck/foaf_nick/byGirlfriend>
   rdf:value "Fatality";
   foaf:maker <http://chucknorris.com/data_/chucksGirlfriend/1880341> .

The general form of the extended CURIE is prefix:localName:key, in which at least one of the first two elements (prefix or localName) is mandatory. The table shows the possible variations of the extended CURIE syntax along with their URI equivalents:

property (extended CURIE)    URI segment(s)
prefix:localName             /prefix_localName
:localName                   /_localName
prefix:                      /prefix_
prefix:localName:key         /prefix_localName/key
:localName:key               /_localName/key
prefix::key                  /prefix_/key

Based on the information in the extended CURIE, the URI of a triple’s object can easily be derived in all of these variations, and the syntax stays free of long URIs and redundancy.
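The mapping in the table is mechanical enough to write down as code. The following Python sketch is one possible reading of it; the function names extended_curie_to_path and object_uri are hypothetical, and the error handling is an assumption rather than part of the proposal.

```python
def extended_curie_to_path(curie):
    """Map an extended CURIE (prefix:localName[:key]) to its URI segment(s)."""
    parts = curie.split(":")
    if len(parts) == 2:
        prefix, local_name = parts
        key = None
    elif len(parts) == 3:
        prefix, local_name, key = parts
    else:
        raise ValueError(f"not an extended CURIE: {curie!r}")
    # At least one of prefix and localName has to be present.
    if not prefix and not local_name:
        raise ValueError(f"prefix or localName is required: {curie!r}")
    path = f"/{prefix}_{local_name}"
    if key:
        path += f"/{key}"
    return path


def object_uri(subject_uri, curie):
    # The object's URI is the subject's URI extended by the segment(s).
    return subject_uri.rstrip("/") + extended_curie_to_path(curie)


print(object_uri("http://chucknorris.com/data_/chuck", "foaf:nick:byGirlfriend"))
print(object_uri("http://milicicvuk.com", "data::vuk"))
```

The two calls reproduce object URIs used earlier in the post: the byGirlfriend nickname node and the root-level vuk node.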


Literals, blank nodes, n-ary relations and rdf:value

A literal node is a specific type of node because it represents a value and as such is always dependent on the resource whose value it represents. As has been discussed in the post Problems of the RDF model: Literals, there is a need to clearly separate the concept of a literal from this resource, which acts somewhat like a primitive variable. The rdf:value property has been used to implement this idea, as shown in the example where the blank node takes the role of an instance of the class “Nick”, while the literal represents the value of this instance:

<http://carlosraynorris.com/data_/carlos> foaf:nick [ rdf:value "Chuck" ] .

The rdf:value is an interesting property that needs to be further analyzed. In the RDF Primer, this property is described in the context of modeling n-ary relationships. N-ary relationships are those that exist between more than two resources (a triple represents a binary relationship between the subject and the object). A general way to represent any n-ary relation in RDF, as described in the RDF Primer, is to…

[...] select one of the participants to serve as the subject of the original relation, then specify an intermediate resource to represent the rest of the relation, then give that new resource properties representing the remaining components of the relation.

This new resource, which is typically represented as a blank node, takes the role of the “glue” – it becomes the subject in the new triples that describe the other resources of the n-ary relation. As an example of an n-ary relation, a person’s address information is used:

address(exstaff:85740, "1501 Grant Avenue", "Bedford", "Massachusetts", "01730")

It is then “broken” in the described way into the following triples:

exstaff:85740   exterms:address        _:johnaddress .
_:johnaddress   exterms:street         "1501 Grant Avenue" .
_:johnaddress   exterms:city           "Bedford" .
_:johnaddress   exterms:state          "Massachusetts" .
_:johnaddress   exterms:postalCode     "01730" .

In the first triple, exstaff:85740 (the URI reference identifying one of the employees) takes the role of the subject, while the blank node identifier _:johnaddress (identifying John’s address) becomes the object. This blank node then becomes the subject in the rest of the triples describing the elements of the address – the street, city, state and zip code.

In this example, none of the individual parts of the structured value (the address) could be considered the “main” value (of the exterms:address property); all of the parts contribute equally to the value. However, in some cases one of the parts of the structured value is often thought of as the “main” value, with the other parts of the relation providing additional contextual or other information that qualifies the main value. Such a case is described in the following example from the same document:

exproduct:item10245   exterms:weight   _:weight10245 .
_:weight10245         rdf:value        "2.4"^^xsd:decimal .
_:weight10245         exterms:units    exunits:kilograms .

These three triples describe the product exproduct:item10245, weighing 2.4 kg. The rdf:value is used as a convenient property to represent the main value of the weight, which equals 2.4. In the RDF Primer, this decision is explained as follows:

There is no need to use rdf:value for these purposes (e.g., a user-defined property name, such as exterms:amount, could have been used instead of rdf:value), and RDF does not associate any special meaning with rdf:value. rdf:value is simply provided as a convenience for use in these commonly-occurring situations.

Therefore, the rdf:value property has no precisely defined meaning. The rdf:value “is typically used to identify the ‘primary’ or ‘major’ value of a property which has several values, or has as its value a complex entity with several facets or properties of its own.” However, the standard use cases where this property imposes itself as an intuitive choice suggest that the meaning of the rdf:value property might perhaps be defined more precisely.

Let’s look at the general case of a literal RDF triple in the (classical) RDF model:

resource    property    literal .

A literal, therefore, represents the value of the property of a resource. However, this is conceptually wrong because the “value” of the property of a resource in general is a new resource (identified by a URI reference). In the example at the beginning of this post, the literal “Chuck” is not the property’s value, but the value of a new concept that could be called “Carlos’ nickname”. This new concept, which can be loosely referred to as a primitive variable, is missing in a literal triple.

In other words, the object of a literal triple is a “complex entity”, which has two aspects – a “variable” and a value. In the same way that a new node was introduced when realizing an n-ary relationship, we can create two triples in which the “primitive variable” serves as the “glue” connecting them. Thus, the general case of using literals with the rdf:value property should look like this:

resource               property      primitive variable .
primitive variable     rdf:value     literal .

The data a literal represents has no meaning on its own – its meaning depends on the “primitive variable” resource whose value the literal is. The “primitive variable” is a node identified by a URI reference that acts as a primitive variable because it can be represented by a single value. The rdf:value thus explicitly describes the relationship between a URI reference and a literal and must be a single-valued property. This means that the rdf:value property is the only instance of the class “owl:DatatypeProperty”, while all the other properties are instances of the class “owl:ObjectProperty”. In other words, each literal triple must have the rdf:value property as its predicate.

Thus there are three constraints that define a literal:

  • A literal must be the object of an RDF triple in which the rdf:value is the predicate
  • A URI reference may have only one rdf:value property
  • A literal cannot be described by further properties
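The three constraints can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: triples are plain tuples, a hypothetical Literal class marks literal values, and URIs and CURIEs are ordinary strings. It is not tied to any existing RDF library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical marker type for a literal value."""
    value: str


RDF_VALUE = "rdf:value"


def check_literal_constraints(triples):
    """Return a list of messages, one per violated constraint."""
    violations = []
    value_counts = Counter()
    for s, p, o in triples:
        # Constraint 3: a literal cannot be described by further properties.
        if isinstance(s, Literal):
            violations.append(f"literal {s.value!r} is used as a subject")
        # Constraint 1: a literal may only be the object of rdf:value.
        if isinstance(o, Literal) and p != RDF_VALUE:
            violations.append(f"literal {o.value!r} is the object of {p}")
        if p == RDF_VALUE:
            value_counts[s] += 1
    # Constraint 2: at most one rdf:value per URI reference.
    for s, count in value_counts.items():
        if count > 1:
            violations.append(f"{s} has {count} rdf:value properties")
    return violations


nick = "http://carlosraynorris.com/data_/carlos/foaf_nick"
graph = [
    ("http://carlosraynorris.com/data_/carlos", "foaf:nick", nick),
    (nick, RDF_VALUE, Literal("Chuck")),
]
print(check_literal_constraints(graph))
```

For the two-triple form of the nickname example the list of violations is empty, while a classic literal triple such as carlos foaf:nick "Chuck" would be reported.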

A “primitive variable” node can be described by other properties that more closely describe its value – for example, the language used, the unit of measurement, the currency and so on. It’s worth noting that these properties refer to the “primitive variable” rather than the literal. For example, it is a specific nickname that is in English, not the value “Chuck”; a specific weight that is expressed in kilograms, not the value “2.4”; and a specific product’s price that is in EUR, not the value “99.99”. Also, the literal datatype is defined when describing the class to which the “primitive variable” belongs, as will be discussed in the next post.

A plain literal is not a string – it is a node identified by a URI reference representing a value, which can be of a string datatype. The URI of a literal is obtained in the same way as with blank nodes – by appending the property CURIE to the URI reference whose value is the literal. Since this property is always rdf:value, a literal has a standard URI “primitiveVariableURI/rdf_value”.
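As a small illustration of the last two sections, the following Python sketch expands a classic literal triple into the two-triple form and derives the literal’s standard URI. It assumes a single-valued property (so no key segment is needed); the function name expand_literal_triple is made up for this example.

```python
RDF_VALUE = "rdf:value"


def expand_literal_triple(subject_uri, property_curie, value):
    """Turn (resource, property, literal) into the two-triple form.

    Returns the two triples and the URI of the literal node.
    Assumes a single-valued property, so no key segment is appended.
    """
    prefix, local_name = property_curie.split(":", 1)
    # URI of the "primitive variable": subject URI + "/" + prefix_localName
    variable_uri = f"{subject_uri}/{prefix}_{local_name}"
    triples = [
        (subject_uri, property_curie, variable_uri),
        (variable_uri, RDF_VALUE, value),
    ]
    literal_uri = f"{variable_uri}/rdf_value"
    return triples, literal_uri


triples, literal_uri = expand_literal_triple(
    "http://chucknorris.com/data_/chuck", "foaf:age", "30")
for triple in triples:
    print(triple)
print(literal_uri)
```

For the foaf:age example, the primitive variable gets the URI http://chucknorris.com/data_/chuck/foaf_age, and the literal’s URI is that URI followed by /rdf_value.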

In an RDF notation, a literal is always represented by its value, while its URI can easily be inferred if needed. The URI is important when a literal is used in the Web context. There are two ways to implement a literal on the Web – as a web resource that has the literal’s URI (primitiveVariableURI/rdf_value) and returns the literal value, or as a shortcut – the content of the web resource “primitiveVariableURI”. The realization of these two methods will be discussed in more detail in future posts.


Assigning a URI to each node of an RDF graph

Before we start, let’s remind ourselves of the example RDF graph we used in the previous post:

The challenge is to figure out URIs for nodes having question marks, namely blank nodes and literals.

How can a URI be provided for each node of an RDF graph? The solution to this problem can be found in the very nature of the Web. Namely, a unique (HTTP) URI for all nodes can be obtained in a similar way to how ordinary web pages get their URLs. The domain of each website is unique, while web pages, which naturally have ambiguous names, get unique URLs in the context of a web site.

For instance, imagine that the website http://chucknorris.com has a contact page. The term “contact” is ambiguous and exists on a number of web pages, but the URL http://chucknorris.com/contact becomes globally unique. In the context of triples of an RDF graph, http://chucknorris.com would become the subject, the “contact” predicate, and the http://chucknorris.com/contact the object of the RDF triple.

<http://chucknorris.com> “contact” <http://chucknorris.com/contact> .

However, there are two significant differences between web pages and nodes of an RDF graph. First, the properties that make up predicates in an RDF graph are URIs themselves, and not mere words (like the “contact” in the example). Secondly, a resource can be linked by the same properties to several different values, i.e. there may be several RDF triples with the same subjects and predicates, but different objects. In this case, simple concatenation of the subject and the predicate is not enough to create a unique URI.

The idea for solving the first problem can be found in the CURIE syntax. CURIE defines an abbreviated syntax for expressing URIs in the “prefix:localName” form, which is already widely used in RDF notations. It consists of a prefix and a local name separated by the colon (:) delimiter. The prefix is a reference to a URI namespace, i.e. the part of a URI common to all resources of a domain. For example, resources defined by the FOAF ontology share the namespace http://xmlns.com/foaf/0.1/, which is usually mapped to the prefix “foaf”. The CURIE for the property http://xmlns.com/foaf/0.1/based_near will therefore become “foaf:based_near”.

By extending the URI of the subject (http://chucknorris.com/data_/chuck) with the predicate in the CURIE form (foaf:based_near), the blank node from the above example will obtain the URI http://chucknorris.com/data_/chuck/foaf:based_near. However, the character “:” is reserved in the URI syntax and is not allowed in file and folder names on some systems, as well as in other contexts, so an alternative delimiter is needed. Instead of the “:” we can use the underscore (_), making the previous example look like this:

http://chucknorris.com/data_/chuck/foaf_based_near

The triple in question will look like this:

<http://chucknorris.com/data_/chuck>
   foaf:based_near
      <http://chucknorris.com/data_/chuck/foaf_based_near> .

The same method can be applied to other blank nodes, for instance:

<http://chucknorris.com/data_/chuck/foaf_based_near>
   geo:lat
      <http://chucknorris.com/data_/chuck/foaf_based_near/geo_lat> .

When using the CURIE syntax, one needs to define the prefixes and map them to the appropriate namespaces. This definition is usually located at the beginning of a document. For example, in the Turtle notation the keyword “@prefix” is used at the beginning of a file, while in notations based on XML, it is usually defined on the root tag using the “prefix” or “xmlns” attributes. Since the web site has a tree structure, the logical choice for the definition of a prefix is the root of the tree. Prefixes are therefore defined at the website level and placed on the “website.com/prefix_” path. For example, the URL http://chucknorris.com/prefix_/foaf can return the reference to http://xmlns.com/foaf/0.1/ namespace. Therefore, for the CURIE form of a URI, the full URI can be obtained in a relatively simple way.
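A minimal Python sketch of this lookup follows. It does not perform any HTTP requests: the namespaces dictionary stands in for whatever the “prefix_” URLs would return, and the function names are hypothetical.

```python
def prefix_lookup_url(site_root, prefix):
    # Where a site following the pattern would publish the prefix definition.
    return f"{site_root.rstrip('/')}/prefix_/{prefix}"


def expand_curie(curie, namespaces):
    """Expand "prefix:localName" using a prefix -> namespace mapping."""
    prefix, local_name = curie.split(":", 1)
    return namespaces[prefix] + local_name


def expand_segment(segment, namespaces):
    """Expand a URI segment such as "foaf_based_near" to the full property URI."""
    prefix, local_name = segment.split("_", 1)
    return namespaces[prefix] + local_name


# Stand-in for the result of looking up http://chucknorris.com/prefix_/foaf
namespaces = {"foaf": "http://xmlns.com/foaf/0.1/"}

print(prefix_lookup_url("http://chucknorris.com", "foaf"))
print(expand_curie("foaf:based_near", namespaces))
print(expand_segment("foaf_based_near", namespaces))
```

Splitting the segment at the first underscore only is what lets a local name such as based_near keep its own underscore.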

The second problem is related to the assignment of URIs in the situation where there are multiple RDF triples with the same subjects and predicates, but different objects. For example, what will happen if the node http://chucknorris.com/data_/chuck from the example graph is connected using the same property “foaf:based_near” to multiple (geo:Point) nodes? In that case, the http://chucknorris.com/data_/chuck/foaf_based_near URI is not suitable because it is unclear to which node it refers. It is therefore necessary to provide a mechanism that allows a distinct URI for each node.

Here an analogy with arrays in programming languages can help. If the based_near is the name of an array, its members will be named as based_near[0], based_near[1] and so on. One can also use an associative array (hash), where instead of numbers, (descriptive) keys are used as indexes, for example – based_near['belgrade'] and based_near['pancevo'].

In the HTTP context, the names of array members will become the URIs http://chucknorris.com/data_/chuck/foaf_based_near/1 and http://chucknorris.com/data_/chuck/foaf_based_near/2 (for simplicity and compatibility with other standards the indices start from 1 instead of 0). The associative array equivalents would be http://chucknorris.com/data_/chuck/foaf_based_near/belgrade and http://chucknorris.com/data_/chuck/foaf_based_near/pancevo.
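The array analogy translates directly into code. The Python sketch below uses hypothetical helper names and follows the conventions from the text: the property is written with “_” instead of “:”, and numeric indices start from 1.

```python
def property_segment(curie):
    # "foaf:based_near" -> "foaf_based_near"
    prefix, local_name = curie.split(":", 1)
    return f"{prefix}_{local_name}"


def mint_uri(subject_uri, curie, key=None):
    """URI of an object node; the key is omitted for single-valued properties."""
    uri = f"{subject_uri}/{property_segment(curie)}"
    return uri if key is None else f"{uri}/{key}"


def mint_indexed_uris(subject_uri, curie, count):
    # Indices start from 1, as in the text, rather than from 0.
    return [mint_uri(subject_uri, curie, index) for index in range(1, count + 1)]


chuck = "http://chucknorris.com/data_/chuck"
print(mint_indexed_uris(chuck, "foaf:based_near", 2))
print(mint_uri(chuck, "foaf:based_near", "belgrade"))
print(mint_uri(chuck, "foaf:based_near", "pancevo"))
```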

These segments should be chosen carefully to ensure the stability of the URIs. Changing them later affects the URIs of all child nodes, since these contain the URI of the parent node. These “key” segments can also be used when a property currently has only one value, if more are expected in the future. This ensures that adding a new object for the same property later will not change the existing URI. If the property is single-valued, the key can be omitted.

Appending the predicate URIs in shortened (CURIE) form to the subject’s URI, together with appending arbitrary keys to the resulting URI, gives a simple mechanism for assigning URIs to all nodes of an RDF graph. “Blank” nodes are now identified by URIs just like URI references. Using the same method, literals can get a URI as well, which will be discussed in more detail in the following post. With URIs assigned to blank nodes, our example graph looks like this:

URIs tailored this way are always defined in the context of the “parent” URI, which makes them dependent on it. The nodes they identify represent some kind of property of the node in whose context they have been defined, meaning that deleting the parent will cause the deletion of its children. However, the “initial” nodes (for example http://chucknorris.com/data_/chuck) are in a similar way dependent on the web site, so viewed that way there are no fundamental differences between the “initial” and the “blank” nodes.


Fixing the RDF model: (re)defining a node of an RDF graph

In the previous posts, I analyzed the problems of the RDF model – the existence of blank nodes, various problems related to plain and typed literals and the absence of the universal concept of a node in an RDF graph. A node, the basic element of an RDF graph, is not clearly defined. There are conceptually completely different types of nodes, with no unique method of identification. This is the key problem that more or less directly causes other problems of the RDF model and technologies that are based on RDF. It is therefore necessary to start with this problem.

There are three types of nodes in an RDF graph – URI references, blank nodes and literals. The picture above shows a typical RDF graph that contains these three types of nodes. The graph describes a person identified by the URI http://chucknorris.com/data_/chuck, his name and whom he knows. He is based near a geographical point, which is described as well. The rdf, foaf and geo ontologies are used.

URI references and blank nodes are usually shown as circles or ellipses, while literals are depicted as rectangles. The ellipses that represent resources contain the name (URI) of the resource (except for a blank node, whose ellipse is empty), while the rectangles representing literals contain a literal value.

In order to define the universal concept of a node, we have to analyze features and aspects that are common to all nodes. There are two main ways to approach a node – it can be viewed as a data structure and as a symbol.

A node as a data structure

When viewed as a data structure, two main aspects of a node can be singled out – its name and the data that it may hold. The example graph shows that nodes are determined by the first or the second aspect. URI references are determined by their name, while literals are determined by their literal value, i.e. some data. One can ask a question: do nodes represented by a URI hold some data, and what is the name of a literal?

URI references may represent information or non-information resources. Non-information resources are determined by their name (URI), which distinctly separates them from other resources, and they are not literal values, but represent specific things, concepts or ideas. Therefore, URI references representing non-information resources don’t hold any data.

Information resources contain information; however, this information refers to their representation, which is a concept distinct from the resource. In other words, if the representation were presented in an RDF graph, it would become a literal rather than a URI reference. One can say that information resources contain data, while literals are data themselves. Therefore, URI references that represent information resources also lack data.

On the other hand, what is the name of literals and blank nodes? Literals are identified by the values they represent, so it can be said that the name is equal to their value. Blank nodes “indicate the existence of a thing, without using, or saying anything about, the name of that thing”. Blank nodes can have a local name, but it is not part of the abstract syntax, in which a blank node “has no intrinsic name”. Therefore, blank nodes by definition have no name.

The above analysis can be represented in a simple table:

Node type        name    data
URI reference    yes     no
Blank node       no      no
Literal          yes     yes

We can conclude that all nodes, no matter how different from each other, are determined by these two fundamental aspects: a name and data. In other words, one can speak about the universal concept of a node, a superclass from which the subclasses (URI references, blank nodes and literals) inherit, and which differ in how these two aspects are manifested.

[Figure: fixing rdf node]

URI references hold no data, and blank nodes in addition have no name. In order to describe these situations we can use the NULL value, which indicates that there is no value and is different from zero or empty string (“”).

Node type        name      data
URI reference    URI       NULL
Blank node       NULL      NULL
Literal          string    string

The table shows the different values of the name and data aspects for the different types of nodes. It should be mentioned that a typed literal is a string combined with a datatype URI. In the table, a plain literal is shown for simplicity’s sake.

The universal concept of a node can be realized as an unordered set of name/value pairs, namely two pairs that both can have a NULL as a value. This data structure is referred to as an object, record, struct, hash table, associative array and others, depending on the context.
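As a rough illustration, the table above can be expressed as a record with two fields in Python, with None playing the role of NULL. The class name Node and the sample values are only assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A node as two name/value pairs; None stands for NULL."""
    name: Optional[str]
    data: Optional[str]


# The three node types of the classical RDF model, following the table:
uri_reference = Node(name="http://chucknorris.com/data_/chuck", data=None)
blank_node = Node(name=None, data=None)
# A literal is identified by its own value, so name and data coincide.
literal = Node(name="Chuck", data="Chuck")

for node in (uri_reference, blank_node, literal):
    print(node)
```

Note that None is distinct from the empty string here, just as NULL is described as different from zero or “”.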

Using this new concept of a node, the previous RDF graph can be represented as follows:

This graph is a good example that shows the mess around the various methods of identifying the nodes. URIs and strings are used as identifiers for different nodes, and a node can also be blank.

Key issues regarding the definition of the universal node of an RDF graph are whether a node can be unnamed, and whether there may be several ways of identifying nodes. In previous posts numerous problems caused by the existence of blank nodes have been discussed. In a context where the focus is on data, the ability to easily reference a node is expected and logical. It is therefore necessary that all nodes have a name.

However, the existence of the name itself is not enough. A simple model requires a single way of creating the names, i.e. the IDs for all nodes. One of the assumptions for the realization of the original Web was the existence of a single mechanism for assigning IDs at a global level, i.e. URIs to all resources. A URI has a key role when it comes to the realization of RDF and the Web of Data (Linked Data), so a solution which allows nodes that are not identified by a URI can be rightfully questioned.

A node as a symbol

A node is clearly used for representing various stuff – real-world objects, ideas, anything you can imagine. So, a logical assumption is that a node is some kind of symbol. Let’s see what Wikipedia says about a symbol:

A symbol is something which represents an idea, a physical entity or a process but is distinct from it.

This definition is very close to the idea of a URI reference, which may represent practically anything. It is also clearly distinct from the thing it represents. A URI reference representing Chuck Norris and Chuck Norris himself are not the same thing. Therefore, a URI reference can be referred to as a symbol. The same can be said of blank nodes, which basically have the same properties as URI references, with the difference that they have no name (which seems not to be required by the symbol definition).

On the other hand, a literal is defined as a “string combined with an optional language tag” or “with a datatype URI”. “Plain literals are considered to denote themselves, so have a fixed meaning.”

If a literal refers to itself, it is not distinct from the entity it represents. A literal doesn’t represent data, it is data itself. Thus, the literal does not meet the fundamental criteria to be a symbol, meaning that an RDF graph consists of a mixture of symbols with some elements that are not symbols. In other words, the structure of an RDF graph as an abstract representation is not clearly separated from what the graph with nodes represents.

To understand why this is a problem, let’s look at the “role of context in symbolism”, where a rather brief but clear description with an example is given:

The context of a symbol may change its meaning. Similar five–pointed stars might signify a law enforcement officer or a member of the armed services, depending upon the uniform.

Therefore, one of the symbol’s properties is that its meaning depends on a context. The meaning of a URI reference is determined by its relations with other nodes, i.e. the triples which describe it. Connect it to different nodes and you’ll change its meaning.

On the contrary, a literal “has a fixed meaning”. Which is interesting because data, by definition, “on its own carries no meaning”. In an RDF graph, the property used in a literal triple does not affect the meaning of the literal. Even if the property’s range is defined, information about the meaning of the literal is contained only in the literal itself and is immutable.

As I previously stated, the problem is that a literal doesn’t represent a literal value, it is that value itself. Another problem is that this value is used as a universal identifier. Both things are against the nature of a symbol. They also make a literal a completely different kind of node than a URI reference. Can one simple model stand so much variety?

The RDF model can be made much simpler. It can have all nodes conceptually equal and identified using the same mechanism.

A graph is always an abstract representation, containing the nodes that always represent, i.e. symbolize things. A literal, therefore, as the node of an RDF graph, has to be a symbol, distinct from what it represents. Having said that, one must distinguish between a literal node and a literal value that the node represents, the same way a URI reference is distinguished from a resource it refers to.

Secondly, the use of a literal value as an identifier is clearly a bad idea. Introducing another way of identification in a context in which there is already a powerful identifier, the URI, unnecessarily adds to the complexity of the RDF model. Finally, RDF is realized on the Web, where a URI is the natural identifier as well.

What is the meaning of a literal and how to identify it correctly? In an earlier blog post, I compared the RDF model to the object-oriented model, making an analogy between objects and URI references. A literal in the OO context is “identified” as the value of an object’s property. If we take this analogy, a literal node should have a special role in an RDF graph – one in which it acts as a value of another node.

Therefore, there are only two types of nodes. A newly defined literal is also identified by a URI, causing the term “URI reference” to become problematic. However, for simplicity, I will keep on using the old terminology, while a literal can be understood as a URI reference that holds data.

The definition of a node

On the basis of the above analysis we can single out several requirements that the RDF model must meet in order to achieve maximum simplicity and consistency. Besides the things we already know – that a node is an element of a graph connected to other nodes via typed links, and that it can represent a resource or a literal value, we can add a few more:

  • A node has two aspects – a name and data
  • A node’s name must always be a URI
  • A node is a symbol, meaning it is always distinct from what it represents
  • A node that holds data is a special kind of node acting as the value of another node

Thanks to these requirements and constraints, it seems that we have enough material to try to finally define a node. A node of an RDF graph, therefore, can be defined as a symbol identified by a URI that represents a resource or a literal value, connected by typed links with other nodes, forming a directed, labeled graph.

Structurally, a node is determined by a name and data and consists of two key-value pairs. The name is always a URI, whose primary role is to identify the node in a global context. A URI, however, has other important functions that will be further analyzed in future posts. If it represents a literal value, a node has a special role in the graph – it acts as the value of another node.
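To make the definition a bit more tangible, here is a minimal Python sketch of such a node. It is only one possible reading: the class name UriNode, the holds_data property and the simple “has a scheme and a host” check standing in for URI validation are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UriNode:
    """A node under the new definition: the name is always a URI."""
    name: str
    data: Optional[str] = None

    def __post_init__(self):
        # Simplified check (an assumption): require a scheme and a host.
        parts = urlsplit(self.name)
        if not (parts.scheme and parts.netloc):
            raise ValueError(f"a node's name must be a URI, got {self.name!r}")

    @property
    def holds_data(self):
        # A node that holds data acts as the value of another node.
        return self.data is not None


resource = UriNode("http://chucknorris.com/data_/chuck")
literal = UriNode("http://chucknorris.com/data_/chuck/foaf_age/rdf_value", data="30")
print(resource.holds_data, literal.holds_data)
```

A node named by a bare string such as "Chuck", or with no name at all, can no longer be constructed, which mirrors the requirement that every node be identified by a URI.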

Now we need to materialize this theory in practice. The previous graph example, according to the new definition will look like this:

First, note that the notation is simplified: instead of using the “DATA: null”, we simply omitted the rectangles. Also, there is no need to repeat the “NAME:” and the “DATA:” all the time, because the names are already represented by the ellipses and data is represented by the recatangles.

Three new blank nodes are added, one for each literal. “Blank” nodes and literals have a question mark instead of a URI.

Let’s first focus on the identification and the challenge of assigning URIs to all nodes. For now, let’s concentrate on blank nodes. How to assign a URI to blank nodes? More on that in the next post.


Problems of Linked Data (4/4): Consuming data

The problem of consuming data published in the Linked Data style can be best understood by an example. For instance, imagine a user or a software agent wanting to find out what the capital of Germany is. Let’s assume that the user already knows the URI reference that represents the concept of Germany (http://dbpedia.org/resource/Germany) and the URI of the property “has capital” (http://dbpedia.org/ontology/capital). Therefore, the user is basically looking for the object of a triple in which the subject and the predicate are known.

<http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .

The user expects to find the answer by looking up the URI http://dbpedia.org/resource/Germany. To get the desired data, one has to go through the following procedure:

  • First, the user sends an HTTP request to http://dbpedia.org/resource/Germany. In the HTTP headers he specifies the RDF notation (format) in which he wants to receive the description of the resource. For RDF/XML syntax, the “Accept:” header in the HTTP GET request should look like this:

    Accept: text/html;q=0.5, application/rdf+xml

    The “Accept:” header indicates that the client would take either HTML or RDF, but would prefer RDF. This preference is indicated by the quality value q=0.5 for HTML. This is called content negotiation.

  • The server would answer:

    HTTP/1.1 303 See Other
    Location: http://dbpedia.org/data/Germany.rdf
    Vary: Accept

    This is a 303 redirect, which tells the client that a Web document containing a description of the requested (non-information) resource, in the requested format, can be found at the URI http://dbpedia.org/data/Germany.rdf (the “Vary:” header is required so that caches work correctly).

  • Next, the client will try to de-reference the new URI, looking up the http://dbpedia.org/data/Germany.rdf, given in the response from the server.

  • The server then responds with a “200 OK” message, thus telling the client that the response contains the representation of the information resource. The “Content-Type:” header indicates the desired RDF/XML format, and the rest of the message contains the representation describing the desired non-information resource, i.e. the triples encoded in the RDF/XML notation. This description can be of significant size – in this particular case (http://dbpedia.org/data/Germany.rdf) it weighs nearly half a megabyte (428KB).

  • When the download is complete, the description must be parsed, which requires a special library. The usual procedure is that the triples are loaded into a local graph, while queries are performed, depending on the implementation, via API methods or SPARQL.

  • Finally, the desired information is obtained — the URI reference of the capital of Germany is http://dbpedia.org/resource/Berlin (34 bytes). If you need some additional information describing Berlin, you have to repeat the entire procedure with a new URI http://dbpedia.org/resource/Berlin.
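The whole procedure can be sketched in a few dozen lines of Python using only the standard library. This is a rough illustration under several assumptions: the function names are mine, the parser only understands the simplest RDF/XML shape (an rdf:Description element with an rdf:about attribute), and the SAMPLE document is a hand-written stand-in for the real 428KB response. A real client would use a proper RDF library.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ACCEPT = "text/html;q=0.5, application/rdf+xml"


def build_request(uri):
    """Step 1: a GET request that prefers RDF/XML over HTML."""
    return urllib.request.Request(uri, headers={"Accept": ACCEPT})


def split_uri(uri):
    """Split a property URI into namespace and local name."""
    cut = max(uri.rfind("/"), uri.rfind("#")) + 1
    return uri[:cut], uri[cut:]


def find_objects(rdf_xml, subject, predicate):
    """Steps 5-6: parse the description and pick the matching objects."""
    namespace, local = split_uri(predicate)
    root = ET.fromstring(rdf_xml)
    found = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        if description.get(f"{{{RDF_NS}}}about") != subject:
            continue
        for prop in description.findall(f"{{{namespace}}}{local}"):
            resource = prop.get(f"{{{RDF_NS}}}resource")
            found.append(resource if resource is not None else prop.text)
    return found


def lookup(subject, predicate):
    """Steps 1-6 over the network; urlopen follows the 303 redirect."""
    with urllib.request.urlopen(build_request(subject)) as response:
        return find_objects(response.read(), subject, predicate)


# Hand-written stand-in for the document behind the redirect.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dbo="http://dbpedia.org/ontology/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Germany">
    <dbo:capital rdf:resource="http://dbpedia.org/resource/Berlin"/>
  </rdf:Description>
</rdf:RDF>"""

capitals = find_objects(
    SAMPLE,
    "http://dbpedia.org/resource/Germany",
    "http://dbpedia.org/ontology/capital",
)
```

Calling lookup() with the same two URIs would perform the real round trips. Even this stripped-down version shows how much machinery stands between the user and a 34-byte answer.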


As seen in this example, the access to an RDF triple and its object requires a significant number of steps, as well as the time for downloading the representation, parsing it and querying the results. It requires programming skills, the necessary libraries and knowing their methods or the SPARQL language for creating queries.

On the other hand, there is an alternative way of fetching data – via a SPARQL endpoint. Data from the example can be obtained using a simple query sent (as a query string) to http://dbpedia.org/sparql/:

SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}
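As a rough sketch, this is how the query above could be turned into a request URL and how an answer could be read back in Python. The `query` parameter name comes from the SPARQL protocol and the response shape from the SPARQL JSON results format. The function names and the SAMPLE_RESPONSE document are my own illustrative assumptions, not output captured from DBpedia.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql/"
QUERY = """SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}"""


def query_url(endpoint, query):
    """Encode the SPARQL query as the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


def values(results_json, variable):
    """Extract the values bound to one variable from a JSON result set."""
    document = json.loads(results_json)
    return [
        row[variable]["value"]
        for row in document["results"]["bindings"]
        if variable in row
    ]


# Hand-written example in the SPARQL JSON results format. An endpoint
# returns this shape when asked for application/sparql-results+json.
SAMPLE_RESPONSE = """{
  "head": {"vars": ["object"]},
  "results": {"bindings": [
    {"object": {"type": "uri", "value": "http://dbpedia.org/resource/Berlin"}}
  ]}
}"""

url = query_url(ENDPOINT, QUERY)
answer = values(SAMPLE_RESPONSE, "object")
```

One round trip replaces the redirect, download and parse cycle, at the price of learning a query language and a result format.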

The problem is that the user has to know another standard: the SPARQL language. Also, many sites that publish Linked Data don’t have a SPARQL endpoint; it is not a mandatory requirement of Linked Data, but a recommendation if a dataset is large, such as in our DBpedia example.

However, Linked Data is not about a single SPARQL endpoint for accessing data, but rather the opposite – it’s about breaking the dataset into a web of interconnected resources identified by HTTP URIs. In this sense, the described procedure for obtaining simple information is the typical and recommended way of accessing data published according to Linked Data principles. Therefore, there is an obvious problem – an inability to perform simple operations in a quick and easy way.

Leigh Dodds covered the data access problem in the blog post RDF Data Access Options, or Isn’t HTTP already the API?. The post was a follow-up to the discussion on limitations of Linked Data, triggered by his following comment:

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely.

The Linked Data API is an additional layer that provides a simple REST API over RDF graphs to bridge the gap between Linked Data and SPARQL. The API layer acts as a proxy for any SPARQL endpoint, allowing more sophisticated queries without the knowledge of SPARQL. This API allows Linked Data and SPARQL to “convert” into REST API – a method that is widely accepted and familiar to web developers.

The view that an extra layer in the form of Linked Data API is needed has provoked the question that Ed Summers asked on Twitter:

@ldodds but your blog post suggests that an API for linked data is needed; isn’t http already the API?

This is the crucial question about the nature of Linked Data, which can also be asked as: “Isn’t Linked Data already the API?”. The aforementioned blog post by Leigh Dodds deals with this problem and analyzes limitations of Linked Data. The author states that Linked Data provides two basic methods of data access:

  • Resource Lookups: by dereferencing URIs we can obtain a (typically) complete description of a resource.
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; this is (typically, not necessarily) reconstituted into a graph on the client.

Leigh then argues that in order to provide an advanced level of functionality, at least two additional important aspects of data interaction should be provided:

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.
  • Existence Checks: ability to determine whether a particular structure is present in a graph.
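The four kinds of interaction listed above can be illustrated over a tiny in-memory set of triples. This is only a sketch: the `ex:` names, the function names and the paging scheme are illustrative assumptions, and a real service would expose these operations over HTTP rather than as local functions.

```python
TRIPLES = [
    ("ex:Germany", "rdf:type", "ex:Country"),
    ("ex:France", "rdf:type", "ex:Country"),
    ("ex:Italy", "rdf:type", "ex:Country"),
    ("ex:Germany", "ex:capital", "ex:Berlin"),
    ("ex:Berlin", "rdf:type", "ex:City"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def lookup(node):
    """Resource lookup: the description of a single node."""
    return match(s=node)


def traverse(start, depth):
    """Graph traversal: follow outgoing links up to 'depth' hops."""
    seen, frontier = {start}, [start]
    for _hop in range(depth):
        frontier = [
            o for node in frontier
            for (_s, _p, o) in lookup(node) if o not in seen
        ]
        seen.update(frontier)
    return seen


def list_things(rdf_type, page=1, size=2):
    """Listing: a sorted, paged collection of things of one type."""
    items = sorted(t[0] for t in match(p="rdf:type", o=rdf_type))
    start = (page - 1) * size
    return items[start:start + size]


def exists(s=None, p=None, o=None):
    """Existence check: is a particular structure present in the graph?"""
    return bool(match(s, p, o))
```

Plain dereferencing only gives a client the first two functions; the last two are what Leigh identifies as missing.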

SPARQL can handle all of these options, as well as far more complex operations. However, by using SPARQL one is stepping around HTTP, which is the basic assumption of the traditional Web and the Web of data. From a hypermedia perspective, using parameterised URLs, i.e. queries integrated in the HTTP protocol, is a much more natural solution than tunneling SPARQL queries. The hypermedia principle is important not only in the REST architecture, but also in Linked Data, which is based on the Web technologies HTTP and URI. Leigh therefore argues that the Linked Data API could be a good solution for this problem.

One can conclude from Leigh’s post that Linked Data is not enough, suggesting two possibilities: that it represents the basic functionality that can be built on, or that it doesn’t provide even the basic functionality.

Linked Data enables traversing a labeled, directed graph. It uses the universal interface based on dereferenceable HTTP URIs, but beneath that, there is a large diversity of syntaxes. After de-referencing a URI reference, one can face any of the numerous formats, which are sometimes rendered as HTML, sometimes as XML, and sometimes as plain text. They are often unreadable so you have to look at the source code to try to figure out what they’re about. And when you do that, you often can’t click the links, so you have to copy/paste URIs. And sometimes descriptions cannot be opened in the browser, so they’re downloaded. In that case you have to open them in a text editor. So better turn the “URL Highlighting” option on. Sorry, but that is not a good user experience.

HTTP URI as a universal interface is just not enough. The Web has clearly shown the need for a universal syntax, and a universal way of encoding hyperlinks – its most fundamental elements. What is the <a href=""> equivalent in the Linked Data world? Where is the universal syntax for a hyperlink in the Web of data context?

The conceptual problem of Linked Data has been covered in one of the previous posts, where I analyzed the decision to decompose an RDF graph into so-called RDF molecules. One shouldn’t forget that this approach is associated with the deeper problems of the RDF model. The manifestation of this decision in practice is shown in the case of accessing the object of an RDF triple – a very simple operation that requires considerable effort and time.

It can be concluded that, although founded on a good initial idea, Linked Data has a lot of problems and suffers from serious inconsistencies. Linked Data is not defined properly. A lot of room for different interpretations indicates its substantial weakness.

Linked Data celebrates HTTP URIs, but a significant number of the nodes in a graph are not identified by HTTP URIs. It aims to build the Web of data, but still centres around documents. It tries to introduce a new paradigm, but is stuck in the old mindset. It is inspired by the original Web, but is unable to provide its level of simplicity.

Publishing data by Linked Data rules is very hard for most people. Consuming data is hard. Understanding the underpinning theory is hard. Almost everything in Linked Data is hard. And what do you get? Not even the basic functionality of an API. Traversing a graph and getting data is difficult and inefficient if done programmatically and almost impossible in a browser.

Considering the serious data access problems, the idea of adding another layer of complexity – some kind of API – sounds like the only reasonable solution. However, for a change, let’s focus on the causes instead of fixing the consequences. And the major cause is the inherited problems of the underlying (RDF) data model.


Problems of Linked Data (3/4): Publishing data

In the blog post What people find hard about Linked Data, Rob Styles covered the difficulties that people face when they first learn about publishing Linked Data. His analysis is based on the experience of teaching Linked Data to hundreds of people with different profiles and backgrounds. According to Rob, people find Linked Data hard to learn because of several steps along the way – certain things that are conceptually difficult to grasp.

One of these is understanding the difference between URI and URL:

First they [people on the course] have to recognise that they need different URIs for the document and the thing the document describes. It’s a leap to understand:

  • that they can just make these up
  • that no meaning should be inferred from the words in it (and yet best practice is to make the readable)
  • that they can say things about other peoples’ URIs (though those statements won’t be de-referencable)
  • that they can choose their own URIs and URI patterns to work to

The information/non-information resource distinction forms part of this difficulty too. While for naive cases this is easy to understand, how a non-information resource gets de-referenced and you get back a description of it is difficult.

Rob puts together HTTP, 303s and supporting custom URIs in a separate set of problems:

[...] Most web devs today will have had no reason to experience more than 200, 404 and 302 [HTTP status codes] — some will understand 401 if they’ve done some work with logins, but even then most of the framework will hide that for you.

So, the need to route requests to code using a mechanism other than filename in URL is something that, while simple, most people haven’t done before. Add into that the need to handle non-information resources, issue raw 303s and then handle the request for a very similar document URL and you have a bit of stuff that is out of the norm — and that looks complicated.

When Richard Cyganiak asked about the Impractical features of the RDF stack on answers.semanticweb.com, by far the highest voted answer, by Ed Summers, refers to the problem of the difference between information and non-information resources:

As a software developer the worst thing about Linked Data for me is trying to decide if something is an Information Resource or not… and minting identifiers and defining server side behavior accordingly. httpRange-14 is dead, long live httpRange-14! I personally have come to prefer REST‘s laissez-faire approach to the nature of resources. URLs identify Resources. Resources can be anything. When you resolve a URL you get a Representation back. Does it really have to be more complicated than that?

According to the httpRange-14 resolution, a request for a non-information resource must not be answered with the HTTP response “200 OK”; instead, the server should redirect to the URL where the resource is described. Many large websites like Google, Yahoo, Bing, Facebook, New York Times and Freebase are violating httpRange-14, sending a clear message of its impracticality. This idea is not strongly supported even in the Linked Data community, where people often debate this controversial topic.

The content negotiation is another aspect that contributes to the complexity of Linked Data. In the Frequently Observed Problems on the Web of Data, an entire chapter is dedicated to frequent mistakes in practice related to how a document is accessed on the Web, with particular reference to HTTP-related issues. A significant number of errors is covered: incorrect Content-Type, content negotiation, incorrect interpretation of the Accept Header, missing Vary Header and the problems with caching.
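To show what a publisher has to get right, here is a minimal Python sketch of the server-side decision, written as a pure function so it can be read without any web framework. The `/resource/` and `/data/` paths mirror the DBpedia example from the consuming-data post; the `/page/` path, the function names and the fallback to HTML are my own assumptions.

```python
def parse_accept(header):
    """Return the acceptable media types of an Accept header, best first."""
    preferences = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if fields[0] and quality > 0:
            preferences.append((-quality, position, fields[0]))
    return [media for _, _, media in sorted(preferences)]


# Where the description of /resource/<name> lives, per media type.
DOCUMENTS = {
    "application/rdf+xml": "/data/{name}.rdf",
    "text/html": "/page/{name}",  # assumed HTML location
}


def respond(path, accept="*/*"):
    """Return (status, headers) for a request path and Accept header."""
    if path.startswith("/resource/"):
        # Non-information resource: never 200, always redirect with 303.
        name = path[len("/resource/"):]
        chosen = "text/html"  # assumed default when nothing matches
        for media in parse_accept(accept):
            if media in DOCUMENTS:
                chosen = media
                break
        location = DOCUMENTS[chosen].format(name=name)
        # Vary tells caches that the answer depends on the Accept header.
        return 303, {"Location": location, "Vary": "Accept"}
    if path.startswith("/data/") and path.endswith(".rdf"):
        return 200, {"Content-Type": "application/rdf+xml"}
    if path.startswith("/page/"):
        return 200, {"Content-Type": "text/html"}
    return 404, {}
```

Each branch corresponds to one of the frequently observed mistakes: answering 200 for the thing itself, forgetting Vary, misreading the q values, or sending the wrong Content-Type.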

Publishing Linked Data is often perceived as unduly difficult, demotivating people interested in publishing data. An average potential publisher has been „spoiled“ by much simpler solutions on the Web. She is used to getting quick explanations, and learning from 5-minute tutorials. When it comes to Linked Data, you need 5 minutes just to (try to) explain the difference between information and non-information resources. People have no other option than to learn how to publish Linked Data from 100-page books and 3-hour lectures. It seems it’s not possible to explain Linked Data in less time and that’s what we should worry about.

It looks like the (selfish & lazy) nature of an average internet user is not well understood and exploited in Linked Data. For instance, explaining why the world will be a better place if one includes links to other things is not what motivates people. I don’t create hyperlinks on this blog because I will help „connecting data islands into a global, interconnected data space“. I do that because the links add value to my blog and help my readers by providing the context. The mere idea of telling people to make links in 2011 is just wrong. If one thing comes naturally on the Web (whether it is the Web of documents or the Web of data), it’s linking.

Linked Data is trying to follow the principles of the original Web, but instead of focusing on the most important one – simplicity – it insists on the implementation of various relatively complex and geeky technologies of the Web architecture. One can argue that none of the technologies individually is that hard to understand and implement, but taken together, they make publishing Linked Data complex, esoteric and different to what people are used to on the Web.


Problems of Linked Data (2/4): Concept

It’s a bit hard to write about Linked Data because of the many changes it’s going through. Therefore, until it becomes stable again, I’ll stick to the official definition of Linked Data, the one that assumes RDF in it. In order to get a clear perspective on the problem, it’s important to analyze the original idea. Many problems I’ll be focusing on are not directly related to RDF, and when they are, they can be understood in a wider sense, as universal problems of directed, labeled graphs in the Web context. For the sake of practicality and avoiding repeating „directed, labeled graphs that use URIs as identifiers“ all the time, I’ll just use the term „RDF“.

In the context of Linked Data, a parallel is often drawn between the Web of documents and the Web of data. For example, in the tutorial How to Publish Linked Data on the Web, this comparison is made in several places:

The Web of Data can be accessed using Linked Data browsers, just as the traditional Web of documents is accessed using HTML browsers. [...] The glue that holds together the traditional document Web is the hypertext links between HTML pages. The glue of the data web is RDF links.

The Web of documents is the current Web. Simply put, it represents a network (graph) whose nodes are Web documents. Each document is identified by a (unique) HTTP URI and returns some content when one looks up the URI. Similarly, the Web of data (or the Semantic Web) in the context of Linked Data represents a network whose nodes are URI references, which, when looked up, return (or redirect to) a document containing the description of a resource identified by the URI reference in question.

This description encodes the set of triples describing the resource, representing the basic unit of the Web of Data, called the RDF molecule. The RDF molecule is described in the paper Tracking RDF Graph Provenance using RDF Molecules, in which the authors state that in the Web of data, compared to the Web of documents, information is structured and encoded in a much finer level of granularity. They then determine the level of granularity, i.e. find the smallest components into which an RDF graph can be decomposed without losing meaning.

RDF documents are identified as poor candidates because they may contain redundant or unrelated data. On the other hand, an RDF triple, the smallest subset of an RDF graph, brings a different problem. Namely, when multiple triples share a blank node, decomposition causes loss of information. They discuss the following example:

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
_:x foaf:firstName "Li" .
_:x foaf:surname "Ding" .

The graph describes a foaf:Person with first name “Li” and surname “Ding”. If we decompose G1 into its two triples and treat each as a separate RDF graph, we lose the information that there exists a person that both has the first name “Li” and also the surname “Ding”.
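The loss of information can be reproduced with a small Python sketch. It relies on the fact that blank node labels are only meaningful inside one graph, so merging separate graphs means renaming their blank nodes apart. The function names are illustrative assumptions, and literals are represented as plain strings for brevity.

```python
from itertools import count

# The example graph G1: one blank node shared by two triples.
G1 = [
    ("_:x", "foaf:firstName", "Li"),
    ("_:x", "foaf:surname", "Ding"),
]

_fresh_ids = count(1)


def standardise_apart(graph):
    """Give every blank node of one graph a fresh, globally unused label."""
    mapping = {}

    def rename(term):
        if not term.startswith("_:"):
            return term
        if term not in mapping:
            mapping[term] = f"_:b{next(_fresh_ids)}"
        return mapping[term]

    return [(rename(s), p, rename(o)) for s, p, o in graph]


def merge(*graphs):
    """Merge graphs, keeping the blank nodes of different graphs distinct."""
    merged = []
    for graph in graphs:
        merged.extend(standardise_apart(graph))
    return merged


def someone_named(graph, first, last):
    """Is there one node with both the given first name and surname?"""
    firsts = {s for s, p, o in graph if p == "foaf:firstName" and o == first}
    lasts = {s for s, p, o in graph if p == "foaf:surname" and o == last}
    return bool(firsts & lasts)


# Decompose G1 into two single-triple graphs, then merge them again.
decomposed = merge([G1[0]], [G1[1]])
```

In G1 the check succeeds, but in the re-merged graph the two triples hang off two different blank nodes, so the same check fails.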

The solution the authors provide is the RDF molecule, which is described as follows:

In order to handle the information loss caused by triple level operation, we propose a higher level of granularity, the RDF molecule. An RDF graph’s molecules are the smallest components into which the graph can be decomposed into separate sub-graphs without loss of information.

Translated into the Linked Data language, the RDF molecule represents a set of triples that describe a node, so that it contains the arcs out of that node, and the arcs in. In other words, it returns any RDF triples in which the term appears as either subject or object.

A graph is called browsable, if, for the URI of any of its nodes that is looked up, information which describes the node is returned. Describing a node means:

  1. Returning all statements where the node is a subject or object; and
  2. Describing all blank nodes attached to the node by one arc.
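The two rules above can be sketched as a function over a list of triples. This is one possible reading, not a reference implementation: I interpret "describing a blank node" as returning the statements that blank node takes part in, and the `ex:` names and sample graph are illustrative assumptions.

```python
def describe(graph, node):
    """Return the description of a node following the two rules."""
    # Rule 1: all statements where the node is a subject or object.
    direct = [t for t in graph if node in (t[0], t[2])]

    # Rule 2: blank nodes attached to the node by one arc...
    blanks = {
        term
        for s, _p, o in direct
        for term in (s, o)
        if term.startswith("_:") and term != node
    }
    # ...are described by the statements they take part in.
    attached = [
        t for t in graph
        if (t[0] in blanks or t[2] in blanks) and t not in direct
    ]
    return direct + attached


GRAPH = [
    ("ex:Li", "foaf:knows", "_:a"),
    ("_:a", "foaf:firstName", "Tim"),
    ("ex:Li", "foaf:name", "Li Ding"),
    ("ex:Other", "foaf:knows", "ex:Li"),
    ("ex:Other", "foaf:name", "Someone"),
]

description = describe(GRAPH, "ex:Li")
```

The result contains the arcs out of and into ex:Li plus the statement about the attached blank node, while the unrelated name of ex:Other is left out.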

The concept of the RDF molecule is problematic for several reasons.

First, a blank node is a special case in the RDF model that causes numerous problems. The blank node is, therefore, the main reason why an RDF graph is decomposed not into RDF triples, but into RDF molecules. This means that if blank nodes were removed completely from the RDF model, the level of granularity would suddenly be lowered to the level of an RDF triple, which would significantly change the structure of the Web of data and the concept of Linked Data.

Blank nodes are not solely the problem of RDF, as I carelessly stated in the previous post. On the contrary, they are the paradigmatic example of what I meant by universal problems of a labeled directed graph in the Web context. I realized that while reading through the discussions on the JSON-LD mailing list. Creators of the new syntax treat RDF as an option, following the trend of redefining and generalizing Linked Data. In that sense, they use the (somewhat friendlier) term “unlabeled node” for what is called “blank node” (or bnode) in RDF. No wonder it turned out that unlabeled nodes are the subject of fierce debates. Unlabeled nodes come naturally in JSON, but are not allowed in the newly redefined Linked Data, making it virtually impossible to reach consensus. The format is even split into two sub-formats, one that allows unlabeled nodes and another that doesn’t. The situation is so dramatic that Manu Sporny asked Is Linked Data useless?, and stated that if JSON-LD didn’t support unlabeled nodes, his company wouldn’t use the format at all.

As I concluded in the previous post, merely running away from RDF will not solve the problems.


Another problem of the concept of the RDF molecule is that an RDF graph in the context of the Web of data is seen as a set of triples, rather than a graph. It’s true that an RDF graph is defined as a set of triples, but in the context of the Web of data, one should focus on its graph aspect, in which the structure consists of nodes and links, rather than triples. On the Web of documents, documents are nodes of the Web graph, while RDF molecules don’t represent nodes of an RDF graph, but its subsets. Therefore, this kind of Web should be called the “Web of molecules” rather than the Web of data.

In the Linked Data context, a node of the Web of data is redirected to the document containing its description. “Data” (or “datum”, if you’re pedantic) as the basic unit of this new Web of data, represents a new paradigm that needs to take over the role that the “document” had before. However, this idea is not fully elaborated, and documents still exist as data containers. In other words, the data structure is “glued” onto the (2D) document instead of being implemented via (3D) HTTP URIs.

Creating the Web of data is a challenge because an RDF graph and the Web graph are two different types of graphs. Arcs (links) in an RDF graph have names, while hyperlinks on the Web have only direction. Another problem is the fact that in an RDF graph, out of three types of nodes, only URI references are identified by a URI. How to follow the idea of Web documents and assign each “data” on the Web of data a URI, when blank nodes and literals have no URI? On the Web of documents, every document has a URI; there are no “blank documents” and “literal documents”.

One can ask what “data” actually is in the Web of data context, and what its relation is to a node on the one hand, and a triple on the other. In the general case, data is defined as the lowest level of abstraction, which on its own carries no meaning. Analogous to the Web, where all documents have a URI, each data in the context of the Web of data should be identified by a URI. Unfortunately, it seems that this “lowest level of abstraction” that has a URI doesn’t come to the fore in Linked Data, which has missed an opportunity to address this issue more seriously.