What the Web looks like

How do you imagine the Web (or do you at all)?

I imagine it like this:

First, let’s imagine a small website. The website typically has a hierarchical structure, i.e. a tree:

web_3d_visualisation_1

Here, the circles are web pages – the big one is a homepage, smaller circles are the next level of hierarchy and so on. The web pages of the same level are drawn at the same distance from their parent web page.

The web pages are connected with tree links. “Tree” links can connect only a parent with its child in a hierarchy, hence the name.

As we know, trees can be organized in a number of different ways. Also, websites can vary from a few web pages to millions of them, causing a huge variety of possible arrangements and shapes.

web_3d_visualisation_2

But a Web with just tree links would be a boring place. Websites would be isolated islands, not aware of each other. Also, the navigation between the web pages on a single website would be limited by the hierarchical order of web pages. The only way to get to a desired destination would be to find the right branch, and then, one node at a time, travel to your target web page. Just like you do every day when browsing through the folders in your file system.

Fortunately, there is another kind of link: a hyperlink. This link is not restricted by a hierarchical order and can magically connect any two web pages, not just in a single website, but on the whole Web. No matter how low in a hierarchy, when it comes to a hyperlink, every web page is equal.

Therefore, hyperlinks “break” the hierarchical order of a website and cause the tree to become a graph, so we can also call them “graph” links. An analog in a file system is a shortcut that lets you “teleport” to any place on the disk.
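The step from tree to graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page names and links below are hypothetical, and links are treated as undirected for the purpose of the check. A structure is a tree when it is connected and has exactly one link fewer than it has pages; a single hyperlink that skips the hierarchy breaks that property.

```python
from collections import deque

def is_tree(pages, links):
    """Return True if the undirected graph (pages, links) is a tree:
    connected and with exactly len(pages) - 1 links."""
    if len(links) != len(pages) - 1:
        return False
    neighbours = {page: set() for page in pages}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Breadth-first search from an arbitrary page to test connectivity.
    start = next(iter(pages))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(pages)

# Hypothetical site: a homepage, two sections and two leaf pages.
pages = ["home", "guitars", "amps", "electric-guitars", "tube-amps"]
tree_links = [
    ("home", "guitars"),
    ("home", "amps"),
    ("guitars", "electric-guitars"),
    ("amps", "tube-amps"),
]
hyperlink = ("electric-guitars", "tube-amps")  # skips the hierarchy

site_is_tree = is_tree(pages, tree_links)
site_with_hyperlink_is_tree = is_tree(pages, tree_links + [hyperlink])
```

With only tree links the check succeeds; with the extra hyperlink it fails, because the structure now contains a cycle.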

With added hyperlinks, our four websites will look something like this:

web_3d_visualisation_3

We have added the third dimension. That way we ensure that the hyperlinks don’t overlap.

Now, let’s try to imagine that our little “Web” is growing. New domains are registered and new websites are emerging. New hyperlinks are added pointing from the new web pages to the existing ones and vice versa. As we see in the next image, everything is pretty much the same, just on a larger scale.

web_3d_visualisation_4

With a fair amount of certainty we can conclude that if we add more and more web pages and finally reach the real size of the Web, the same basic structure will remain.

However, there is a problem. In this kind of structure, not all websites are equal. The average distance to other websites is much shorter if you’re near the center, meaning that centrally positioned websites are more privileged than the ones placed on the edge. This arrangement doesn’t reflect the democracy of the Web. Also, this kind of spatial structure is not elegant, in the sense that it results in very long hyperlinks connecting distant parts.

Maybe we could give up the “flat” arrangement of the websites and allow them to float in 3D space, perhaps something like stars in the Universe. However, this doesn’t solve the problem. There are still central and peripheral parts, like in the previous model.

We need a structure in which there won’t be any “edges”. An elegant solution would be a 3D geometrical object that enables the desired equality. We can imagine that the flat surface in the previous image is actually the surface of a big sphere that just looks flat when viewed from a close distance, just like the Earth. A sphere ensures that all web pages are equal and bounds the maximum length of a hyperlink: the most distant web page is no farther away than the diameter of the sphere.

web_3d_visualisation_5

The interior of the sphere is white because of the vast amount of (white) hyperlinks. If we grab a website and pull it from the sphere, this white matter of hyperlinks becomes clearly visible.

web_3d_visualisation_6

Although this model looks pretty elegant, it still has a problem. The Web is growing constantly, meaning that the sphere is getting bigger and bigger. In a purely theoretical model, this doesn’t make much difference. But imagine the Web realized in the physical world. It would eventually become so big that the cost of material for hyperlinks would be extremely high.

In addition, the maximum size of the Web would have to be restricted in some way, not just due to cost but for practical reasons. Perhaps it would be placed in some kind of container, which would hold it and possibly protect it from the environment. So how would the Web grow in such limiting conditions?

Well, the “cortex” of the Web can be folded, allowing a much larger surface for websites to grow, while retaining the basic idea of equality and keeping the hyperlinks relatively short. Unfortunately, I’ve not provided an image for this. But I have no doubt that your brain is capable of creating one.


Two types of links on the Web

In the last post I discussed the hierarchical aspect of the Web, suggesting that there are two types of links on the Web: tree links and graph links.

Tree links

The Web consists of websites, which typically have a tree structure, i.e. one that involves parent-child links between the various levels of the hierarchy. A homepage is a root connected with the top-level web pages, which are connected to the next level and so on. Let’s call these kinds of links „tree“ links. Such links can be implemented in a website in two ways: implicitly and explicitly.

Implicit tree links

Take, for instance, the following two web pages, assuming that a product with the ID „456“ belongs to a category identified by „123“.

http://website.com/?categoryID=123

http://website.com/?productID=456

These two web pages form a parent-child (category-product) relation, even though that’s not clear from the structure of the URLs. Here, the URLs don’t provide any hint about the relation between the two, so we have no idea what their relation is until we look them up.

Explicit tree links

Another method is the one using explicit tree links:

http://website.com/categories/123

http://website.com/categories/123/products/456

Compared to implicit tree links, the hierarchical structure is clear from the URLs. The difference between the source (parent) and target (child) URLs is the edge between the nodes in the tree. This difference (products/456) can be implemented as a relative hyperlink from the source to its immediate child:

<a href="products/456">products</a>
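One detail worth checking when implementing such a relative link is how it resolves against the parent URL. The small sketch below uses urljoin from Python’s standard library, which applies the usual relative-reference resolution rules, together with the example URLs above. It suggests that the parent URL has to be served with a trailing slash for products/456 to land on the intended child; otherwise the last segment of the parent path is replaced.

```python
from urllib.parse import urljoin

parent = "http://website.com/categories/123"

# Without a trailing slash, the last path segment ("123") is replaced.
without_slash = urljoin(parent, "products/456")

# With a trailing slash, the relative link resolves to the child URL.
with_slash = urljoin(parent + "/", "products/456")

print(without_slash)
print(with_slash)
```

An alternative, under the same assumptions, is to keep the parent URL without the slash and write the relative link as 123/products/456.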

Take another example:

http://website.com/guitars/electric-guitars

Here, the tree structure is clear as well:

http://website.com

http://website.com/guitars

http://website.com/guitars/electric-guitars

Each additional child’s URL contains an extra segment that distinguishes it from its parent. The difference between the URLs is encoded explicitly, as a part of the target web page’s URL.
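Because the hierarchy is encoded in the path, the chain of parents can be recovered from a URL alone. The following is a minimal sketch, assuming that every path segment corresponds to one level of the tree; the function name ancestors is made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def ancestors(url):
    """Return the chain of URLs from the site root down to `url`,
    assuming one path segment per level of the hierarchy."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    chain = []
    for depth in range(len(segments) + 1):
        path = "/" + "/".join(segments[:depth]) if depth else ""
        chain.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return chain

chain = ancestors("http://website.com/guitars/electric-guitars")
for page in chain:
    print(page)
```

For the guitar example this yields the same three URLs listed above, from the homepage down to the target page.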

Paths

In the last post I touched on the concept of paths in a graph. Paths allow one to refer to something indirectly, allowing the traversal from something to its property. We can make Tim Berners-Lee’s example „?x.con:mailbox is x’s mailbox“ more general: x/property is x’s property, where x is the source node that can be a website, or any other node in the hierarchy [*].

Therefore,

<http://website.com/property> is <http://website.com>’s property

meaning that in

<a href="property">this property</a>

placed in http://website.com, we know not just the direction of the link, but also its name, encoded as the value of the href attribute.

In the context of a hierarchy, this property is limited to „has child“, or „is parent of“. However, as we see in the example

http://website.com/guitars/electric-guitars

the properties „guitars“ and „electric-guitars“ represent richer relations between the web pages, that are intuitively understood by people.

Explicit tree links therefore connect web pages having URLs in the form of paths. They have the following features:

  • The URL of a target web page contains the URL of the source web page
  • The target web page is the source’s property, and is dependent on it
  • The difference between the URLs encodes the name of relation between web pages

For example:

<http://website.com/guitars> links to <http://website.com/guitars/electric-guitars>

  • The URL of the source web page is obviously contained in the URL of the target web page (http://website.com/guitars/electric-guitars)
  • The target node is the source’s property „electric-guitars“. If http://website.com/guitars is deleted, the child is deleted too, so it’s existentially dependent on it.
  • The difference between URLs is „electric-guitars“, representing the relation between the web pages
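The first and third features can be checked mechanically. The sketch below is an illustration only: the function name tree_link_name is hypothetical, and it assumes that a tree link is recognized purely by comparing path segments on the same host. It returns the difference between the two URLs when the source URL is contained in the target URL, and None otherwise.

```python
from urllib.parse import urlsplit

def tree_link_name(source, target):
    """Return the path difference between `target` and `source` if the
    source URL is contained in the target URL, otherwise None."""
    src, tgt = urlsplit(source), urlsplit(target)
    if (src.scheme, src.netloc) != (tgt.scheme, tgt.netloc):
        return None
    src_segments = [s for s in src.path.split("/") if s]
    tgt_segments = [s for s in tgt.path.split("/") if s]
    n = len(src_segments)
    # Compare whole segments so "/guitars" is not a prefix of "/guitars-old".
    if len(tgt_segments) <= n or tgt_segments[:n] != src_segments:
        return None
    return "/".join(tgt_segments[n:])

name = tree_link_name(
    "http://website.com/guitars",
    "http://website.com/guitars/electric-guitars",
)
unrelated = tree_link_name(
    "http://website.com/guitars",
    "http://anotherwebsite.com/guitar-parts",
)
print(name, unrelated)
```

For the example above the name is electric-guitars; for a link that leaves the hierarchy there is no difference to extract, so the result is None.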

The third point is the key. It tells us that tree links can have a name (or a type), without encoding a rel attribute in the <a> tag of the web page that links to the target page. This name denotes a relation between two nodes in a graph (tree), using nothing but a fundamental technology of the Web architecture – the URI.

In the hierarchical context, it has the meaning „has child“ (or „is parent of“). Not much, but the idea that the meaning of a relation between nodes is encoded in the path is exciting. It suggests that the meaning of other types of relations can be encoded as well, using the same principle.

The problem is that these relations are human-readable only. A machine has no clue what the string „guitars“ represents. But if we use a URI instead of a string, we can encode the explicit relation (predicate).

A short history of tree links

Historically, explicit tree links (and paths) were common in the early phase of the Web, when static files were published and URLs simply reflected the directories on the server. Soon, dynamic pages with query string URLs emerged and the relations between web pages were no longer clear from their URLs.

Finally, Web 2.0 popularized URL rewriting, leaving the actual locations of files on the server irrelevant. At the same time, URLs based on friendly patterns, often encoding the hierarchical relations between the web pages, have been gaining popularity again.

One of the important axioms of the Web architecture is the opacity axiom. It’s a rule requiring URLs to be opaque, i.e. treated like compact identifiers/addresses that don’t encode any data that could indicate the nature of the resource behind them. As a result, the value of paths was not recognized, and it was not until the popularization of friendly and „hackable“ URLs that the rule was somewhat revised and a level of transparency was allowed in certain situations.

The opacity axiom makes sense in the global Web context where, in general, one can’t rely on the string value of the URL (URI). However, the power of paths and indirect referencing, especially with regard to their ability to give a link a name, can’t be ignored any more. If explicitly identified and defined using URIs, these link names open up a completely new paradigm which will require different rules. In that context, URIs are not just transparent but machine-readable, leaving the opacity axiom deprecated.

Graph links

The second type of links are the links that are not limited by the hierarchical order, but enable teleporting to any web page on the Web. These links are the „real“ hyperlinks, which can also be called „graph“ links, to differentiate them from tree links.

Compared to explicit tree links, hyperlinks have the following features:

  • The URL of a target web page doesn’t contain the URL of the source web page
  • A target web page is not the property of the source web page and is not in any way dependent on it
  • They always hold the same meaning

For example:

<http://website.com/guitars> links to <http://anotherwebsite.com/guitar-parts>

  • The URL of the source web page is obviously not contained in the target’s URL
  • http://anotherwebsite.com/guitar-parts is not dependent on http://website.com/guitars. If http://website.com/guitars disappears, that doesn’t affect the existence of the target.
  • Because the URL of the source is not contained in the target URL, there is no difference between URLs, so the type of relation between them can’t be encoded.

Although graph links are capable of holding a meaning, this meaning is always the same. It just says: this web page points to another. Maybe it can be defined more specifically, using the verb „mention“ instead of „point“. But whatever common meaning people agree upon, it’s the same in all possible contexts.

The graph links, or hyperlinks, are directed edges of the Web graph, so it’s quite natural for them to hold only the information about the direction. The fact that one web page points to (or mentions) another tells us about the one-way direction of the hyperlink between them and nothing more.

Therefore, hyperlinks are the type of links on the Web that inherently can’t hold any other information than the direction. That’s why any effort to add a name to a hyperlink is ultimately doomed to fail. A hyperlink is just not intended for that purpose.

One example is using HTML tags and rel attributes to try to describe hyperlinks. Another is RDF links that use predicates with explicitly described meaning, and can be encoded in a number of special syntaxes. The problem is that both approaches don’t respect the fact that hyperlinks inherently can’t contain any other information than the direction.

If we want to make the Semantic Web (as an extension to the Web) a reality, first we have to fully understand the Web. And we can’t understand the Web without understanding its fundamental elements – links. There are two types of links – tree links and hyperlinks. The only way for the Web to evolve into the Semantic Web is to recognize the fact that explicit tree links are the only ones able to encode names, and to use them together with hyperlinks, respecting the distinct but powerful nature of both.

[*] I am using the terms property and relation somewhat interchangeably here. In one of the future posts I’ll cover the difference between the two in the context of paths in more detail.

The Web is just a bunch of trees plus shortcuts

“Graph thinking” is one of the biggest conceptual problems when it comes to learning and understanding Linked Data and the RDF model, according to Rob Styles. Here, the term “graph thinking” refers to the ability to think about data as a graph, a web, a network. Although people understand the concept of a graph, they are used to thinking about data from one point of view or another, and have difficulty when they need to “put themselves above the data”, i.e. imagine a graph as a whole.

It’s interesting that for developers it can be even harder (compared to non-programmers):

Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about the problem. Even for those fluent in object-oriented design (a graph model) the practical implications of working with a graph of objects leads us to develop, predominantly, trees.

Similarly, it seems that most people understand that the Web is a huge graph consisting of web pages and hyperlinks between them. However, the Web is “experienced” from the perspective of particular Web sites or pages (which are organized predominantly hierarchically), rather than a Web graph as a whole.

For example, the typical navigation menu on a website contains a list of hyperlinks to internal web pages (top-level menu), representing hierarchically organized “child” nodes of the tree forming around the website as its root. External hyperlinks to other Web sites and pages, as well as internal (relative) hyperlinks that “skip” this hierarchy, break the tree structure and create a graph*.

graph tree links

Another „graph“ that people seem to understand intuitively is a file system. File systems typically have directories (folders) and allow hierarchies where directories may contain subdirectories. These trees are relatively easy to understand, but are somewhat limited when it comes to navigation. In a tree, you can go one level up, or one level down.

Fortunately, you’re not limited to this kind of “tree links”, but can “jump” to any part of the file system. You can do that thanks to shortcuts, and these are possible due to the fact that every folder or file has a unique address – a path that can be easily manipulated. So when starting a program, you don’t have to go to the exact location of the executable file on the disk every time, but rather click on the shortcut on the Desktop. Just as hyperlinks break the hierarchy of websites, shortcuts break the hierarchical structure of folders in a file system.

It seems that a predominantly hierarchical (plus “shortcut” links) view of a graph is intuitively understood, and that this fact should be used in order to facilitate understanding of the RDF model.

Linked Data is a step in this direction. In the Linked Data context, resources are identified by HTTP URIs, and their descriptions (obtained by dereferencing the URIs) contain all the RDF triples in which a particular resource appears as the subject or the object. In short, the description contains the part of a graph in which one node becomes the “root” relative to the other nodes, which can be thought of as its child nodes. Again, RDF links break the tree structures, connecting these subgraphs (RDF molecules or data objects) into a single global giant graph.
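As a rough illustration of what such a description contains, the sketch below models a graph as a plain list of (subject, predicate, object) tuples and selects every triple in which a given resource appears as the subject or the object. The data and the ex: names are invented for the example, and no RDF library is involved.

```python
def describe(triples, resource):
    """Return the triples in which `resource` is the subject or the object."""
    return [t for t in triples if t[0] == resource or t[2] == resource]

# Hypothetical data, loosely following the vocabulary prefixes used in this post.
triples = [
    ("ex:George", "rel:mother", "ex:Martha"),
    ("ex:Martha", "off:assistant", "ex:Bob"),
    ("ex:Bob", "con:home", "ex:BobsHome"),
]

description = describe(triples, "ex:Martha")
for triple in description:
    print(triple)
```

Here ex:Martha becomes the “root” of a small subgraph: one triple points at her and one leads away from her, while the triple about ex:Bob’s home is left out.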

However, the problem is that you can’t browse this Linked Data graph the way you do on the Web, or in your file system. You are not allowed to traverse the nodes „hidden“ in documents containing the descriptions – you must download and parse them. These bits of data don’t have addresses, paths you can refer to or use for shortcuts.

When it comes to the Semantic Web and RDF, it seems that the idea of paths is primarily applied in the context of query languages. But what about paths as a part of the RDF model itself?

Tim Berners-Lee has written about them in the document Shorthand: Paths and lists, and @keywords:

Often it turns out that you need to refer to something indirectly through a string of properties, such as “George’s mother’s assistant’s home’s address’ zipcode”. This is traversal of the graph.

Such an indirect referencing can be expressed through a series of RDF triples chained with a number of blank nodes:

[is con:zipcode of [
    is con:address of [
        is con:home of [
            is off:assistant of [
                is rel:mother of :George]]]]]

The author then presents a more elegant notation – a shortcut inspired by the cascading style used for methods and attributes in object-oriented languages (dot notation), where „.“ (dot) is used as a delimiter:

:George.rel:mother
          .off:assistant
            .con:home
              .con:address
                 .con:zipcode

This is forward traversal of the graph, where with each “.” you move from something to its property. So ?x.con:mailbox is x’s mailbox, and in fact in english you can read the “.” as ” ‘s”.
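The forward traversal described in the quote can be sketched over a list of (subject, predicate, object) tuples. This is only an illustration of the idea, not the N3 implementation: the function name follow_path and all of the data (names, home, zipcode) are invented for the example.

```python
def follow_path(triples, start, path):
    """Follow a list of properties from `start`, one step per property,
    and return the set of nodes reached at the end of the path."""
    current = {start}
    for prop in path:
        current = {o for (s, p, o) in triples if s in current and p == prop}
    return current

# Hypothetical data for "George's mother's assistant's home's address' zipcode".
triples = [
    (":George", "rel:mother", ":Martha"),
    (":Martha", "off:assistant", ":Bob"),
    (":Bob", "con:home", ":BobsHome"),
    (":BobsHome", "con:address", ":BobsAddress"),
    (":BobsAddress", "con:zipcode", "02139"),
]

path = ["rel:mother", "off:assistant", "con:home", "con:address", "con:zipcode"]
zipcodes = follow_path(triples, ":George", path)
print(zipcodes)
```

Each step of the loop plays the role of one “.” in the notation: it moves from the current nodes to the values of the next property.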

Let me repeat what I think is one of the most powerful and yet one of the most neglected ideas of the Semantic Web:

You move from something to its property.
?x.con:mailbox is x’s mailbox.

In Linked Data, you don’t move from something to its property. You can only move from something to “something else”. Now, if you can move to the property, it means you can stop, rest a bit, look around you. If you look behind, you’ll see a single node, the parent. And if you look ahead, you’ll see the child nodes, through which you can go on the journey, one node at a time. You are placed on the part of the global graph that has the form of a tree.

The statement “?x.con:mailbox is x’s mailbox” suggests that the “mailbox” relation is “instantiated”, materialized in the form of a distinct node that is dependent on its parent. That node has a dual nature, encoding the relation and the node involved in the relation.

This approach is the one that fully respects the nature of a directed labeled graph. It’s elegant and provides flexibility in expression. It facilitates implementation of n-ary relations and encourages modular design. It allows deep, nested structures instead of flat ones. It uses indirect referencing, which is how people think and refer to things.

Finally, it indirectly acknowledges the hierarchical aspect of the RDF graph. It is quite similar to the structure of the websites. This is the only approach that enables realization of the Web of data, i.e. proper projection of an RDF graph to the Web graph.

So, how come such a powerful idea has never come to life? First, Tim presented this idea primarily as a syntax convention (sugar), failing to realize the full potential of his own words. Second, it relies on hated, URI-less, evil blank nodes. The only way to fix it is to somehow add URIs to these nodes. But, isn’t the very absence of URI references what makes this approach possible?

It sounds almost like a paradox. On one hand you have paths without URIs, and on the other there are opaque URIs… containing no paths. There are two clear requirements – one from each side of the equation. Paths and URIs are both needed. Therefore, we have no other choice than to connect them.

And don’t forget: ?x.con:mailbox is x’s mailbox.

* Of course, a tree is already a (kind of a) graph. Here, the term “graph” can be thought of as a graph in a wider sense.

Getting rid of typed literals

So far I have been dealing with the complexities of the RDF model – the mess surrounding the concept of a node, many different types of nodes, as well as different methods of identification of nodes that have a name. However, there are aspects which are not just overcomplicated, but plain ugly as well. Here I am referring to the last bits of the RDF model that need some serious attention: typed literals and literals with language tags.

Like blank nodes, this is a subject that has caused many debates over the years. The people who made the model somehow always find a justification for every ugly bit of it. However, the fact that the RDF model is „logically consistent“ (or whatever the right word is) doesn’t mean that it’s the only solution, and certainly doesn’t mean that the solution is the best one.

In the post on redefining the node of an RDF graph, I described nodes from the two aspects: a node as a data structure and a node as a symbol. As a data structure, a node is basically an object with two attributes: a name and data. Therefore, if the data attribute holds a datatype or a language tag in addition to the value, it will no longer contain atomic data. It becomes an object itself with two attributes. And that sucks.

So why does the datatype have to be explicitly added to a literal every time? Couldn’t the datatype be declared, in a similar way to how an instance is declared? If the range of a property can be described, and then the type of an instance can be inferred, why doesn’t that work for literals, too?

The problem

In the discussion on this subject on the RDF Working Group mailing list, Antoine Zimmermann explained it nicely:

Often, one would like to write:

ex:prop  rdfs:range  xsd:decimal .
ex:sub   ex:prop     "42" .

and infer that “42” is a decimal number. However, what one gets from these two triples is that “42” is a sequence of 2 characters AND a decimal, which is inconsistent.

Antoine then writes that overcoming this in the RDF model is hard, because „literals are universal identifiers, just like URIs“. So „42“ identifies the same thing in all situations. He goes on with another example, which would not be possible in practice:

ex:prop rdfs:range xsd:decimal .
ex:sub ex:prop "42" .
ex:password rdfs:range xsd:string .
ex:sub2 ex:password "42" .

Here, certainly some people would expect the first “42” to be denoting the number, while the second is just two characters. But this implicitly assumes that the denotation of literals is contextual: it would depend on which predicate is used in the triple. While it would be possible, in principle, to define a language where this makes sense, it does not fit at all with the RDF data model.

In another response in the same thread, Pat Hayes wrote, regarding inferring datatypes from the property definitions:

[...] in general, a property range is a class, not a datatype. What happens when the range is just a class and has no associated L2V mapping? Also, a property can have many ranges. What happens if two of them are datatype classes? Which one gets control over the interpretation of the literal string? [...]

A new literal

Before I start discussing potential solutions for the above problems, let’s recall how literals are different in the new RDF model I’ve been proposing in this blog, compared to the classical RDF model:

  • Literal nodes are identified by URIs, not by their values
  • Literal nodes are not literal values themselves, but the symbols representing the values
  • A literal node is always the object of the rdf:value predicate in a triple

In the new model, literals are no longer treated as universal identifiers. They are, like every other node in the RDF model, identified by a URI. They represent literal values, whose meaning depends on the context. This context is not defined by a literal, but by a node having that literal as a value. This node is an instance (object) of one or more classes. It has one main value, represented by a literal and realized by the rdf:value property. This kind of node resembles primitive data types from programming languages, such as string, integer, float, boolean…

The difference is that in programming, you don’t have to define the meaning of every variable. In object oriented programming, for example, the „primitive“ variable named weight is often implemented as an instance of the Float class. In RDF, however, the weight has a meaning that is not limited to the datatype of its value.

rdf:datatype

Therefore, we have a type of an object and a datatype of that object’s value as distinct concepts. For instance, a product weight can be an instance of the class ex:Weight, and its value can be of float datatype. Check out the following example:

ex:Weight
    rdf:type      rdfs:Class ;
    rdf:datatype  xsd:float .

ex:hasWeight
    rdfs:range    ex:Weight .

<http://example.com/data_/product/item10245>
    ex:hasWeight  <http://example.com/data_/product/item10245/ex_hasWeight> .
<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    ex:inUnit     ex:kg .

The new property rdf:datatype used here has the meaning „the value of the class S is of datatype O“. Therefore, the value of http://example.com/data_/product/item10245/ex_hasWeight, which the range of ex:hasWeight makes an instance of ex:Weight, is of datatype xsd:float.

The literal is represented by its value 2.4. There is no need to explicitly write its URI in the syntax. If needed, it can be easily obtained by concatenating the URI of its parent node and the CURIE segment rdf_value. The result is http://example.com/data_/product/item10245/ex_hasWeight/rdf_value.
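The URI construction used in this example is simple enough to sketch. The helper below is hypothetical (child_uri is not part of any RDF library) and assumes the convention visible in the URIs above: a CURIE such as rdf:value becomes the path segment rdf_value appended to the parent’s URI.

```python
def child_uri(parent_uri, curie):
    """Append a CURIE such as "rdf:value" to `parent_uri` as a path
    segment, replacing the colon with an underscore."""
    prefix, _, local = curie.partition(":")
    return f"{parent_uri.rstrip('/')}/{prefix}_{local}"

product = "http://example.com/data_/product/item10245"
weight = child_uri(product, "ex:hasWeight")
literal = child_uri(weight, "rdf:value")

print(weight)
print(literal)
```

Applied twice, it reproduces first the URI of the weight node and then the URI of its literal.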

2.4 is not a universal identifier any more, so 2.4 doesn’t have to always identify the same thing. Its meaning thus becomes relative to the context, not absolute in the whole RDF graph. For instance, it can sometimes be a float number, and sometimes a string. It’s just a value whose meaning depends on the type of its parent node. In this case, the literal acts as the value of an instance of the class ex:Weight, whose rdf:datatype is described as xsd:float, so we can say that 2.4 is a float number.
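How a data consumer might resolve the datatype from the context can be sketched as follows. This is a simplified illustration, not a definitive algorithm: the graph is a plain list of tuples mirroring the example above, the lookup goes from the linking property to its rdfs:range and then to the proposed rdf:datatype, and the XSD_PARSERS table is an assumption made for this sketch.

```python
# Assumed mapping from datatype names to Python parsers (illustration only).
XSD_PARSERS = {"xsd:float": float, "xsd:string": str}

def objects(triples, subject, predicate):
    """Return all objects of triples matching `subject` and `predicate`."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

def typed_value(triples, node):
    """Interpret the rdf:value of `node` using the rdf:datatype of the
    class named as rdfs:range of a property that points at `node`."""
    raw = objects(triples, node, "rdf:value")[0]
    for (s, p, o) in triples:
        if o == node:
            for cls in objects(triples, p, "rdfs:range"):
                for datatype in objects(triples, cls, "rdf:datatype"):
                    return XSD_PARSERS[datatype](raw)
    return raw  # no datatype found in the context: keep the plain string

ITEM = "http://example.com/data_/product/item10245"
WEIGHT = ITEM + "/ex_hasWeight"
triples = [
    ("ex:Weight", "rdf:type", "rdfs:Class"),
    ("ex:Weight", "rdf:datatype", "xsd:float"),
    ("ex:hasWeight", "rdfs:range", "ex:Weight"),
    (ITEM, "ex:hasWeight", WEIGHT),
    (WEIGHT, "rdf:value", "2.4"),
    (WEIGHT, "ex:inUnit", "ex:kg"),
]

weight_value = typed_value(triples, WEIGHT)
print(weight_value)
```

The literal itself stays a plain string in the data; only the lookup through its parent node turns it into a float.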

This is a good example of why it is important to make a clear distinction between the concepts of a node, its name and the data it holds. In the case of the literal 2.4, one can say that this literal is a node that has the name http://example.com/data_/product/item10245/ex_hasWeight/rdf_value and holds the value 2.4. In short, literal node != literal value, and literal value != literal name.

Furthermore, a property range is always a class, and a datatype is the range of the new, special property rdf:datatype. This way, classes and datatypes with an associated L2V mapping are clearly separated.

In reality, one cannot expect that a class will have just one rdf:datatype. In addition, a node can be an instance of more than one class, each having a different datatype defined. As a solution, one can describe the instance itself with the preferred rdf:datatype, as in the following example:

<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    rdf:datatype  xsd:float ;
    ex:inUnit     ex:kg .

The datatype information would be on the URI http://example.com/data_/product/item10245/ex_hasWeight/rdf_datatype, providing a much more elegant way of expressing datatypes than using typed literals.

But it shouldn’t be mandatory. Often, the context can help. In this case, there is the property ex:inUnit whose rdfs:domain is, say, ex2:Measure, with an rdf:datatype of xsd:float. In addition, when in doubt, a data consumer can check the superclass of ex:Weight and find out that it is, say, ex3:Measure, where ex3 is an authoritative source.

Chances are that, ultimately, a limited number of superclasses defining datatypes will emerge, and that classes whose instances can have values will have to be declared as subclasses of these superclasses in order to be trusted.

Finally, one last note on the datatypes in RDF. My impression is that a few datatypes are used much more frequently than the others. These most frequently used datatypes should really be part of the core RDF (or RDFS) ontology; for instance, rdf:String, rdf:Number, rdf:Boolean and rdf:Date would be quite enough. Actually, it would make sense to merge RDF and RDFS with a few elements from OWL (so-called RDFS++) into one core ontology. Things are sometimes just ridiculously unintuitive: take rdf:Property and rdfs:Class, for instance. But that’s another story.

rdf:language

Finally, we get to the literals with language tags – another special case of literals. Let’s look at the example from DBpedia:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrade is the capital and largest city of Serbia. "@en .
<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrad ist die Hauptstadt der Republik Serbien."@de .

In this snippet, the language tags en and de, placed at the end of literals using (another) special syntax @, denote that the literals are in English and German language, respectively.

The solution for literals with language tags is based on the same principles as the solution for typed literals. Let’s focus on the English version. First, we need to distinguish the concept of a value from the resource that contains that value:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment <http://dbpedia.org/resource/Belgrade/rdfs_comment/en> .

<http://dbpedia.org/resource/Belgrade/rdfs_comment/en>
    rdf:value "Belgrade is the capital and largest city of Serbia." .

…which can be shortened to:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia."
] .

…or shortened even further:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment:en "Belgrade is the capital and largest city of Serbia." .

Here, rdfs:comment:en is an example of the extended CURIE described in an earlier post.

http://dbpedia.org/resource/Belgrade/rdfs_comment/en is the URI of a node whose value is the description in English. The URI segment at the end of the URI is required because there are several rdfs:comment properties. en can be a convention that has the role of the previously used language tag (in the form of @en). This solution shifts the language information into the URI and doesn't require special treatment in the RDF model, as was the case with literals with language tags.

However, in order to explicitly describe the language, an additional triple is needed:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia." ;
    rdf:language <http://dbpedia.org/resource/English_language>
] .

Here I used the new rdf:language property to describe the language of the comment. The subject of the RDF triple is the comment identified by http://dbpedia.org/resource/Belgrade/rdfs_comment/en, and the object is the DBpedia resource representing the English language.
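The rewriting of a language-tagged literal into this form can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name language_node_triples is made up, triples are plain tuples, and the language resource is supplied by the caller rather than looked up anywhere.

```python
def language_node_triples(subject, curie, text, tag, language_uri):
    """Rewrite `subject curie "text"@tag` into the three triples proposed above.

    The node URI is built as subject + "/" + CURIE (":" replaced by "_") + "/" + tag.
    """
    node = f"{subject}/{curie.replace(':', '_', 1)}/{tag}"
    return [
        (subject, curie, node),
        (node, "rdf:value", text),
        (node, "rdf:language", language_uri),
    ]


triples = language_node_triples(
    "http://dbpedia.org/resource/Belgrade",
    "rdfs:comment",
    "Belgrade is the capital and largest city of Serbia.",
    "en",
    "http://dbpedia.org/resource/English_language",
)
for triple in triples:
    print(triple)
```

The tag is reused as the key segment of the URI, which matches the convention described above, while the explicit rdf:language triple carries the actual language information.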

It seems reasonable that rdf:language will primarily be used to describe instances, whereas rdf:datatype is intended for the description of classes. However, there should be no restrictions – each should be allowed to describe both classes and instances.

Conclusion

If we want a truly simple and flexible RDF model, the denotation of literals should be contextual. A literal serves the role of the value of some other node in a graph. This node carries the information about a literal – its datatype as well as the language.

In that sense, two new properties are proposed: rdf:datatype and rdf:language, which are special in that they don’t describe a resource directly, but rather its value (or the value of its instance, in the case of a class) expressed using the rdf:value property. The meaning of the literal value depends on that node. This is possible because the universal identifier of a literal is its URI, not the value.

The proposed realization of literals, along with the removal of blank nodes, greatly simplifies the RDF model. Resources are described exclusively with RDF triples, using URI references and plain literals. "Special" literal nodes are discarded from the model.

The new RDF model has a clearly defined node that always has a name identified using the same mechanism (URI), i.e. all names belong to a single set. There are only two kinds of nodes - one representing a resource and one representing a literal value. The fact that all nodes now have a URI as name enables the next step – a (proper) realization of an RDF graph on the Web.


The “RDF graph” URI pattern

Anyone involved in anything having to do with the Semantic Web or Linked Data knows how much time and energy is wasted on endless discussions of the blank node issue. It is a controversial topic because, on the one hand, blank nodes cause huge problems in practice, while on the other, they allow great flexibility of expression.

Behind this flexibility lies a more profound reason, which can perhaps explain how blank nodes have survived as part of the RDF model all these years despite all the headaches they have caused. The thing is, blank nodes reflect a human way of referencing things. Let me show an example:

If I want to talk about my left arm, it’s quite unnatural to invent a new identifier for it. I’ll just say „my left arm“, describing it relative to myself, and a listener will understand. This is possible thanks to the human ability to understand context. The listener knows that the pronoun „my“ refers to me as something unique, that the arm is part of me, and that “left” finally specifies the exact arm. So, my left arm, unique in the universe, is referenced quite simply and elegantly.

In RDF, this can be expressed with two statements (triples): “I have an arm. It (has a property that) is left”. Let’s assume that we know implicitly, through the property, that the blank node is of the type „ex:Arm“:

ex:hasArm rdfs:range ex:Arm .

Given that my URI is http://milicicvuk.com/data_/vuk, and assuming the relevant properties are defined in the ex ontology, we can express it with the following triples:

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left
] .

The left arm is represented by the blank node, which is the object in the first triple and the subject in the second, thus chaining them and forming a rather readable code.

Now, let’s take a slightly more complicated example. I can say something like “the 5 cm scar on my left arm” (or “my left arm’s 5 cm scar”). Again, the scar is relative to the left arm, and the arm is relative to me. Translated to RDF, it becomes: “I have an arm which has a property that is left and has a scar that has a length which has a value of 5 and is in the unit cm.” This rather cumbersome sentence is much clearer when written in Turtle notation using nested blank nodes:

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left ;
    ex:hasScar [
        ex:hasLength [
            rdf:value "5" ;
            ex:inUnit ex:cm
        ]
    ]
] .

Here we have three blank nodes that connect various statements, which results in pretty elegant and readable code. This level of elegance and readability can never be achieved by using URI references.

That’s what makes blank nodes cool – they allow referencing relative to another thing. You can, instead of minting identifiers for every possible resource, just say, „something“ or „someone“, which is related to something else that has an identifier. The trouble is that this coolness is greatly diminished due to the negative side of not having global identifiers.

The question is: is it possible to keep the flexibility of blank nodes while having URIs at the same time? The answer is: yes, there is an elegant solution that allows just that.

Namespaces

To understand it, let’s try to look at the problem from the perspective of a namespace. The idea of a namespace is related to that of a context. A namespace is defined as a container that provides context for identifiers. A namespace has a unique name in the global space, allowing otherwise ambiguous identifiers to become globally unique as well.

Now, let’s look for a moment at the part enclosed between [ and ] in the first RDF example. The subject and the predicate of the first triple (<http://milicicvuk.com/data_/vuk> ex:hasArm) act as the namespace of the part between the square brackets. Together they uniquely define the “container” that provides context for local identifiers.

<http://milicicvuk.com/data_/vuk> ex:hasArm [
    ex:hasProperty ex:Left
] .

However, in order for this namespace to be usable, we must convert it to a URI. The URI http://milicicvuk.com/data_/vuk alone can be seen as a kind of namespace for the predicate ex:hasArm. Of course, ex:hasArm is also a unique C(URI)E, but in this context, it acts as a local identifier.

Put in this perspective, it is not hard to figure out what the full name of that identifier is. It can be made as with every other namespaced variable, by concatenating the namespace with the local name.

As a delimiter, we are going to use the “slash” character “/”, a standard delimiter of URI segments. The result is:

http://milicicvuk.com/data_/vuk/ex:hasArm

Another thing we have to do is to replace the URL unfriendly character “:” with something else. Let’s use the “_” char[*]. Finally, we get:

http://milicicvuk.com/data_/vuk/ex_hasArm

We got the URI of the namespace defined by the subject and the predicate of the triple. Another way of looking at this URI is as the full name of the “local identifier” ex:hasArm, defined in the http://milicicvuk.com/data_/vuk context. In any case, in the context of this new namespace we are going to define new local identifiers, using the following template:

http://milicicvuk.com/data_/vuk/ex_hasArm/localIdentifier

Namespaces are cool because they allow us not to worry about the global scope. The uniqueness of a namespace guarantees that all new identifiers defined in its context will also be unique. This way we reduced the problem of creating a whole new URI to the problem of inventing a name which has to be only locally unique.

In this particular case, given that I have just two arms, the local identifiers „left“ and „right“ will do the job nicely[**]. The full URIs (with the namespace) will thus look like this:

http://milicicvuk.com/data_/vuk/ex_hasArm/left

http://milicicvuk.com/data_/vuk/ex_hasArm/right

Therefore, the resource (the arm) that was previously represented by a blank node got a URI (http://milicicvuk.com/data_/vuk/ex_hasArm/left). The blank node has simply evolved into a URI reference while keeping its flexibility of expression!

Additionally, we have a clear pattern for other URIs, too. What about identifiers for legs? No problem:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left

http://milicicvuk.com/data_/vuk/ex_hasLeg/right

As I mentioned earlier, every new URI is at the same time the namespace for new identifiers. New namespaces can be built on the basis of the previous ones, forming the chain of nested namespaces. Namely, ex:hasScar is a local identifier in the context defined by http://milicicvuk.com/data_/vuk/ex_hasLeg/left namespace. Suppose it’s a scar from a surgery, suggesting the local identifier surgery:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery

Again, the new URI is the namespace of the subject http://milicicvuk.com/data_/vuk/ex_hasLeg/left and the predicate ex_hasScar, forming the container for the local identifier “surgery”. The full URI of the scar is therefore the URI of the object of that triple, previously being a blank node:

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left> ex:hasScar <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery> .

What about literals? The exact same method can be applied to constructing the URIs of literals as well. The literal “5” in the second RDF example will get the URI:

http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength/rdf_value

Literals’ URIs by convention always end with the rdf_value segment because literal nodes are always values of the rdf:value property. Also, literal nodes are special in that they are terminal nodes, meaning they cannot branch further (and thus cannot serve as namespaces for new identifiers).
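The chaining of nested namespaces can be sketched in Python as follows. The helper names child_uri and nested_uri are my own, and the sketch assumes that “:” is always replaced by a single “_” as described above.

```python
from typing import List, Optional, Tuple


def child_uri(parent: str, curie: str, key: Optional[str] = None) -> str:
    """Append a predicate CURIE (':' replaced by '_') and an optional local key."""
    uri = f"{parent.rstrip('/')}/{curie.replace(':', '_', 1)}"
    return uri if key is None else f"{uri}/{key}"


def nested_uri(root: str, steps: List[Tuple[str, Optional[str]]]) -> str:
    """Walk a chain of (CURIE, key) steps; each result is the namespace for the next."""
    uri = root
    for curie, key in steps:
        uri = child_uri(uri, curie, key)
    return uri


vuk = "http://milicicvuk.com/data_/vuk"
scar = nested_uri(vuk, [("ex:hasLeg", "left"), ("ex:hasScar", "surgery")])
literal = nested_uri(scar, [("ex:hasLength", None), ("rdf:value", None)])
print(scar)
print(literal)
```

Passing None as the key covers the case where no local identifier is needed, as with ex:hasLength and the terminal rdf:value segment.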

URI patterns

You may have recognized the pattern used in these URIs. It is a variation of a well-known URI pattern used on the Web that consists of two parts: one representing the collection, and the other an individual (instance) of the collection.

This pattern is also used in Linked Data. In the book Linked Data patterns, URIs of this kind are called patterned URIs and are recommended as a way of creating more hackable and human-readable URIs. The authors suggest using pluralized class names as the first part of the URI pattern, and an identifier as the second.

For example if an application will be publishing data about book resources, which are modelled as the rdf:type ex:Book. One might construct URIs of the form:

/books/12345

Where /books is the base part of the URI indicating “the collection of books”, and the 12345 is an identifier for an individual book.

In another, hierarchical URIs pattern, the authors state:

Where a natural hierarchy exists between a set of resources use Patterned URIs that conform to the following pattern:

:collection/:item/:sub-collection/:item

E.g. in a system which is publishing data about individual books and their chapters, we might use the following identifier for chapter 1 of a specific book:

/books/12345/chapters/1

The /chapters URI will naturally reflect to the collection of all chapters within a specific book. The /books URI maps to the collection of all books within a system, etc.

A pattern for naming nodes of an RDF graph can be considered as a kind of “hierarchical URIs” pattern where a property name is used instead of a pluralized class. Its form can be written as follows:

:property/:item/:sub-property/:item

“Hierarchical” is perhaps not the best name for the relations between nodes in a graph, but bear in mind that the part of a graph described this way has the form of a tree with the described resource as its root. Anyway, to differentiate it from the other URI patterns, let’s call it the “RDF graph” URI pattern.

The “RDF graph” URI pattern

Using properties instead of class names explicitly states the relations between the nodes. Also, information about the item’s class can be preserved if it is contained in the property name, as is the case with the ex:Arm class in ex:hasArm.

The “RDF graph” pattern can be applied to the entire URI of a node, starting from the domain name to the last segment. The default namespace website.com/data_ is a container for the root level nodes, which then branch to the lowest level nodes using the same pattern. For instance, in the URI http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength/rdf_value, there are five “property” parts denoting properties (data:, ex:hasLeg, ex:hasScar, ex:hasLength and rdf:value) and three item (or key) parts (vuk, left, surgery)[***].

The diagram showing a part of RDF graph describing all the nodes contained in the URI looks like this:

RDF graph URI pattern

The triples in the Turtle syntax look like this:

<http://milicicvuk.com>
    data:
        <http://milicicvuk.com/data_/vuk> .

<http://milicicvuk.com/data_/vuk>
    ex:hasLeg
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left>
    ex:hasScar
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery>
    ex:hasLength
        <http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength> .

<http://milicicvuk.com/data_/vuk/ex_hasLeg/left/ex_hasScar/surgery/ex_hasLength>
    rdf:value
        "5" .

Using the more concise syntax based on extended CURIEs, it will look as follows:

<http://milicicvuk.com> data::vuk [
    ex:hasLeg:left [
        ex:hasScar:surgery [
            ex:hasLength [
                rdf:value "5"
            ]
        ]
    ]
] .

Note that the literal is represented by its value (5). Its URI, if needed, can be easily inferred from its parent URI.

The different syntactic representations of the “property” part and the “item” part make a URI readable not just to people, but to machines as well. In a form such as /books/12345/chapters/1 we intuitively know which part is which, but there are no syntactic constraints that explicitly make those parts distinct. In the “RDF graph” pattern, the property segment is always in the form of a CURIE, which enables a parser to automatically identify and distinguish between the segments.

Furthermore, the prefixes of the CURIE properties are defined on the default namespace website.com/prefix_, so the full properties’ URIs can be obtained automatically as well. For instance, the full URI of the ex prefix could be retrieved from the http://milicicvuk.com/prefix_/ex path.
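As a rough Python sketch of this lookup: the helper names are mine, the fetch argument is a stand-in for an HTTP GET against the prefix_ path, and the namespace http://example.org/ns# is a made-up placeholder, since the real value for ex is not given here.

```python
from typing import Callable
from urllib.parse import urlsplit


def prefix_url(node_uri: str, prefix: str) -> str:
    """URL under the site's prefix_ path where the prefix definition is expected."""
    parts = urlsplit(node_uri)
    return f"{parts.scheme}://{parts.netloc}/prefix_/{prefix}"


def full_property_uri(node_uri: str, segment: str, fetch: Callable[[str], str]) -> str:
    """Expand a property segment such as 'ex_hasLeg' into a full property URI."""
    prefix, local_name = segment.split("_", 1)
    namespace = fetch(prefix_url(node_uri, prefix))
    return namespace + local_name


# Stand-in for the Web: maps a prefix URL to the namespace it would return.
# The namespace value below is hypothetical.
fake_web = {"http://milicicvuk.com/prefix_/ex": "http://example.org/ns#"}

node = "http://milicicvuk.com/data_/vuk/ex_hasLeg/left"
print(prefix_url(node, "ex"))
print(full_property_uri(node, "ex_hasLeg", fake_web.__getitem__))
```

In a real setting fetch would dereference the URL over HTTP; how the namespace is represented in the response is left open here.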

This approach allows a generic algorithm for identifying URIs implemented using the “RDF graph” pattern and distinguishing them from ordinary, opaque URIs. The parser can then sort out the two types of segments and decompose the URIs into triples thanks to the explicitly defined meanings of the properties. This means that the parser is able not just to “read” the URI, but also to “understand” it, by recursively parsing all the relevant URIs and getting the triples it needs to learn. Its “knowledge” can also be used to guess new URIs by recombining the segments, much as humans do with readable/hackable URIs. Finally, because triples “live” in URIs and are inseparable from them, the source of the triples is always known.
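A minimal sketch of such a parser in Python might look like this. It rests on one simplifying assumption of mine: a segment is a property exactly when it contains an underscore, so key segments must not contain one (which is one more argument for a less common delimiter). The function name decompose is hypothetical.

```python
from urllib.parse import urlsplit


def decompose(uri):
    """Split an "RDF graph" pattern URI into (subject, predicate, object) triples."""
    parts = urlsplit(uri)
    subject = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    triples = []
    i = 0
    while i < len(segments):
        segment = segments[i]
        if "_" not in segment:
            raise ValueError(f"expected a property segment, got {segment!r}")
        prefix, local_name = segment.split("_", 1)
        obj = f"{subject}/{segment}"
        i += 1
        # A key segment follows unless the next segment is itself a property.
        if i < len(segments) and "_" not in segments[i]:
            obj = f"{obj}/{segments[i]}"
            i += 1
        triples.append((subject, f"{prefix}:{local_name}", obj))
        subject = obj  # the object becomes the namespace for the next step
    return triples


uri = ("http://milicicvuk.com/data_/vuk/ex_hasLeg/left/"
       "ex_hasScar/surgery/ex_hasLength/rdf_value")
for s, p, o in decompose(uri):
    print(p, "->", o)
```

The URI alone yields the structure of the graph; the literal value itself (here “5”) still has to be obtained by dereferencing the last URI.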

There is another important repercussion of using the “RDF graph” pattern. Because properties (in the form of CURIEs) become part of the URI, they limit the publisher’s choice when it comes to generating URIs. In other words, the ontology directs the creation of URIs by providing the names for properties. The burden and responsibility of minting URIs is thus transferred from the publisher to the ontology creator. The only things the publisher has to worry about are the local identifiers („left“, „right“ and „surgery“ in the above examples). These kinds of “keys” can be recommended by the ontology maker, or can (perhaps more probably) arise as conventions from the community’s best practices.

[*] It could be that some less frequently used character, or double underscore “__” would serve better.

[**] In other cases, if there are many identifiers, or descriptive names aren’t important, simple indexes can be automatically generated or existing IDs can be used.

[***] Note that the :property/:sub-property pattern is also possible if there is a single item, as in ex_hasLength/rdf_value. All the combinations will be discussed in more detail in the future posts.


Extended CURIE (prefix:localName:key)

In the post Assigning a URI to each node of an RDF graph, I described the mechanism that enables all nodes to get URIs. For example, the age of a person identified by the URI reference http://chucknorris.com/data_/chuck can be described using a “classic” blank node as follows (in the Turtle syntax):

<http://chucknorris.com/data_/chuck> foaf:age [ rdf:value "30" ] .

When the blank node gets a URI by appending the predicate CURIE to the subject URI, the example will look like this:

<http://chucknorris.com/data_/chuck>
   foaf:age <http://chucknorris.com/data_/chuck/foaf_age> .

<http://chucknorris.com/data_/chuck/foaf_age>
   rdf:value "30" .

The blank node got the URI, but the initial elegance of the syntax is lost. If a single-valued property is used, as in this case, the URI of the object doesn’t contain additional (key) segments, so it can be derived automatically (http://chucknorris.com/data_/chuck + “/” + foaf_age). With that in mind, and given that the property rdf:value is assumed when it comes to a literal, the previous example can be expressed in a simpler way, using syntactic sugar:

<http://chucknorris.com/data_/chuck> foaf:age "30" .

…which has exactly the same meaning as the syntax from the beginning of the post:

<http://chucknorris.com/data_/chuck> foaf:age [ rdf:value "30" ] .

For multi-valued properties, as is the case with foaf:nick property for instance, the URI can’t be derived automatically. In that case, one can use the “extended” CURIE syntax. For example, if http://chucknorris.com/data_/chuck has two nicknames that are identified by http://chucknorris.com/data_/chuck/foaf_nick/1 and http://chucknorris.com/data_/chuck/foaf_nick/2, the Turtle syntax using extended CURIEs might look like this:

<http://chucknorris.com/data_/chuck> foaf:nick:1 "Chuck" .
<http://chucknorris.com/data_/chuck> foaf:nick:2 "Fatality" .

The key segments 1 and 2 are appended to the existing CURIE using the (already used) : delimiter. The same principle applies if the key segments are not numbers – for example, foaf:nick:byFriend and foaf:nick:byGirlfriend are valid as well. If the nickname is described by other properties besides rdf:value, the Turtle syntax might look like this:

<http://chucknorris.com/data_/chuck> foaf:nick:byGirlfriend [
   rdf:value "Fatality" ;
   foaf:maker <http://chucknorris.com/data_/chucksGirlfriend/1880341>
] .

This syntax is equivalent to the classic Turtle syntax:

<http://chucknorris.com/data_/chuck>
   foaf:nick <http://chucknorris.com/data_/chuck/foaf_nick/byGirlfriend> .

<http://chucknorris.com/data_/chuck/foaf_nick/byGirlfriend>
   rdf:value "Fatality";
   foaf:maker <http://chucknorris.com/data_/chucksGirlfriend/1880341> .

The general form of the extended CURIE is prefix:localName:key, in which at least one of the first two elements (prefix or localName) is mandatory. The table shows the possible variations of the extended CURIE syntax along with their URI equivalents:

property (extended CURIE)    URI segment(s)
prefix:localName             /prefix_localName
:localName                   /_localName
prefix:                      /prefix_
prefix:localName:key         /prefix_localName/key
:localName:key               /_localName/key
prefix::key                  /prefix_/key

Based on the information in the extended CURIE, the URI of a triple’s object can easily be derived in every variant, and the syntax is free from long URIs and redundancy.
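The mapping in the table is mechanical enough to sketch in Python. The function names curie_to_segments and object_uri are made up for this illustration, and the sketch assumes that neither the local name nor the key contains a colon.

```python
def curie_to_segments(curie: str) -> str:
    """Translate an extended CURIE (prefix:localName:key) into URI segments."""
    parts = curie.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"not an extended CURIE: {curie!r}")
    prefix, local_name = parts[0], parts[1]
    if not prefix and not local_name:
        raise ValueError("prefix and localName cannot both be empty")
    path = f"/{prefix}_{local_name}"
    if len(parts) == 3:
        if not parts[2]:
            raise ValueError("empty key segment")
        path += f"/{parts[2]}"
    return path


def object_uri(subject: str, curie: str) -> str:
    """Derive the URI of a triple's object from its subject and extended CURIE."""
    return subject.rstrip("/") + curie_to_segments(curie)


chuck = "http://chucknorris.com/data_/chuck"
print(object_uri(chuck, "foaf:age"))
print(object_uri(chuck, "foaf:nick:byGirlfriend"))
```

All six rows of the table fall out of the same two rules: join prefix and local name with “_”, then append the key, if any, as a further segment.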


Literals, blank nodes, n-ary relations and rdf:value

A literal node is a specific type of node because it represents a value and as such always depends on the resource whose value it represents. As discussed in the post Problems of the RDF model: Literals, there is a need to clearly separate the concept of a literal from this resource, which acts somewhat like a primitive variable. The rdf:value property has been used to implement this idea, as shown in the example where the blank node takes the role of an instance of the class “Nick”, while the literal represents the value of this instance:

<http://carlosraynorris.com/data_/carlos> foaf:nick [ rdf:value "Chuck" ] .

The rdf:value is an interesting property that needs to be further analyzed. In the RDF Primer, this property is described in the context of modeling n-ary relationships. N-ary relationships are those that exist between more than two resources (a triple represents a binary relationship between the subject and the object). A general way to represent any n-ary relation in RDF, as described in the RDF primer, is to…

[...] select one of the participants to serve as the subject of the original relation, then specify an intermediate resource to represent the rest of the relation, then give that new resource properties representing the remaining components of the relation.

This new resource, which is typically represented as a blank node, takes the role of the “glue” – it becomes the subject in the new triples that describe the other resources of the n-ary relation. As an example of an n-ary relation, a person’s address information is used:

address(exstaff:85740, "1501 Grant Avenue", "Bedford", "Massachusetts", "01730")

It is then “broken” in the described way into the following triples:

exstaff:85740   exterms:address        _:johnaddress .
_:johnaddress   exterms:street         "1501 Grant Avenue" .
_:johnaddress   exterms:city           "Bedford" .
_:johnaddress   exterms:state          "Massachusetts" .
_:johnaddress   exterms:postalCode     "01730" .

In the first triple, exstaff:85740 (the URI reference identifying one of the employees) takes the role of the subject, while the blank node identifier _:johnaddress (identifying John’s address) becomes the object. This blank node then becomes the subject in the rest of the triples describing the elements of the address – the street, city, state and zip code.

In this example, none of the individual parts of the structured value (the address) could be considered the “main” value (of the exterms:address property), all of the parts contribute equally to the value. However, in some cases one of the parts of the structured value is often thought of as the “main” value, with the other parts of the relation providing additional contextual or other information that qualifies the main value. Such a case is described in the following example of the same document:

exproduct:item10245   exterms:weight   _:weight10245 .
_:weight10245         rdf:value        "2.4"^^xsd:decimal .
_:weight10245         exterms:units    exunits:kilograms .

These three triples describe the product exproduct:item10245, which weighs 2.4 kg. The rdf:value property is used as a convenient way to represent the main value of the weight, which equals 2.4. In the RDF Primer, this decision is explained as follows:

There is no need to use rdf:value for these purposes (e.g., a user-defined property name, such as exterms:amount, could have been used instead of rdf:value), and RDF does not associate any special meaning with rdf:value. rdf:value is simply provided as a convenience for use in these commonly-occurring situations.

Therefore, the rdf:value property has no precisely defined meaning. The rdf:value “is typically used to identify the ‘primary’ or ‘major’ value of a property which has several values, or has as its value a complex entity with several facets or properties of its own.” However, the standard use cases where this property imposes itself as an intuitive choice suggest that the meaning of the rdf:value property might be defined more precisely.

Let’s look at the general case of a literal RDF triple in the (classical) RDF model:

resource    property    literal .

A literal, therefore, represents the value of the property of a resource. However, this is conceptually wrong because the “value” of the property of a resource in general is a new resource (identified by a URI reference). In the example at the beginning of this post, the literal “Chuck” is not the property’s value, but the value of a new concept that could be called “Carlos’ nickname”. This new concept, which can be loosely referred to as a primitive variable, is missing in a literal triple.

In other words, the object of a literal triple is a “complex entity”, which has two aspects – a “variable” and a value. In the same way that a new node was introduced in the realization of an n-ary relationship, we can create two triples in which the “primitive variable” serves as the “glue” connecting them. Thus, the general case of using literals with the rdf:value property should look like this:

resource               property      primitive variable .
primitive variable     rdf:value     literal .

The data a literal represents has no meaning on its own – its meaning depends on the “primitive variable” resource whose value the literal is. The “primitive variable” is a node identified by a URI reference; it acts as a primitive variable because it can be represented by a single value. The rdf:value property thus explicitly describes the relationship between a URI reference and a literal and must be a single-valued property. This means that rdf:value is the only instance of the class “owl:DatatypeProperty”, while all the other properties are instances of the class “owl:ObjectProperty”. In other words, each literal triple must have rdf:value as its predicate.

Thus there are three constraints that define a literal:

  • A literal must be the object of an RDF triple in which the rdf:value is the predicate
  • A URI reference may have only one rdf:value property
  • A literal cannot itself be described by further properties
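The three constraints can be checked mechanically. Below is a small Python sketch under my own assumptions: triples are plain tuples, URI references are strings, and literals are wrapped in a hypothetical Literal type so they can be told apart from URIs.

```python
from collections import Counter
from typing import NamedTuple


class Literal(NamedTuple):
    value: str


def check_literal_constraints(triples):
    """Return a list of human-readable violations of the three constraints."""
    errors = []
    value_counts = Counter()
    for s, p, o in triples:
        # Constraint 3: a literal is terminal and cannot be a subject.
        if isinstance(s, Literal):
            errors.append(f"literal {s.value!r} is used as a subject")
        # Constraint 1: a literal may only appear as the object of rdf:value.
        if isinstance(o, Literal) and p != "rdf:value":
            errors.append(f"literal {o.value!r} is the object of {p}, not rdf:value")
        if p == "rdf:value":
            value_counts[s] += 1
    # Constraint 2: at most one rdf:value per URI reference.
    for s, n in value_counts.items():
        if n > 1:
            errors.append(f"{s} has {n} rdf:value properties")
    return errors


carlos = "http://carlosraynorris.com/data_/carlos"
nick = carlos + "/foaf_nick"
print(check_literal_constraints([
    (carlos, "foaf:nick", nick),
    (nick, "rdf:value", Literal("Chuck")),
]))
print(check_literal_constraints([(carlos, "foaf:nick", Literal("Chuck"))]))
```

The first call describes the nickname through a “primitive variable” and passes; the second is a classic literal triple and is reported as a violation of the first constraint.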

A “primitive variable” node can be described by other properties that describe its value more closely – for example, the language used, the unit of measurement, the currency and so on. It’s worth noting that these properties refer to the “primitive variable” rather than the literal. For example, it is a specific nickname that is in English, not the string “Chuck”; a specific weight that is expressed in kilograms, not the value “2.4”; a specific product’s price that is in EUR, not its value “99.99”. Also, the literal datatype is defined when describing the class to which the “primitive variable” belongs, as will be discussed in the next post.

A plain literal is not a string – it is a node identified by URI reference representing a value, which can be of a string datatype. The URI of a literal is obtained in the same way as with blank nodes – by adding the property CURIE on the URI reference whose value is the literal. Since this property is always rdf:value, a literal has a standard URI “primitiveVariableURI/rdf_value”.

In an RDF notation, a literal is always represented by its value, while its URI can be concluded easily if needed. The URI is important when a literal is used in the Web context. There are two ways to implement a literal on the Web – as the web resource sharing the literal’s URI (primitiveVariableURI/rdf_value) and returning the literal value, or as a shortcut – the content of a web resource “primitiveVariableURI”. The realization of these two methods will be discussed in more detail in future posts.


Assigning a URI to each node of an RDF graph

Before we start, let’s remind ourselves of the example RDF graph we used in the previous post:

The challenge is to figure out URIs for nodes having question marks, namely blank nodes and literals.

How can a URI be provided for each node of an RDF graph? The solution to this problem can be found in the very nature of the Web. Namely, a unique (HTTP) URI for all nodes can be obtained in a way similar to how ordinary web pages get their URLs. The domain of each website is unique, while web pages, which naturally have ambiguous names, get unique URLs in the context of a website.

For instance, imagine that the website http://chucknorris.com has a contact page. The term “contact” is ambiguous and exists on a number of websites, but the URL http://chucknorris.com/contact is globally unique. In the context of triples of an RDF graph, http://chucknorris.com would become the subject, “contact” the predicate, and http://chucknorris.com/contact the object of the RDF triple.

<http://chucknorris.com> “contact” <http://chucknorris.com/contact> .

However, there are two significant differences between web pages and nodes of an RDF graph. First, the properties that make up predicates in an RDF graph are URIs themselves, and not mere words (like the “contact” in the example). Secondly, a resource can be linked by the same properties to several different values, i.e. there may be several RDF triples with the same subjects and predicates, but different objects. In this case, simple concatenation of the subject and the predicate is not enough to create a unique URI.

The idea for solving the first problem can be found in the CURIE syntax. CURIE defines an abbreviated syntax for expressing URIs in the “prefix:localName” form, which is already widely used in RDF notations. It consists of a prefix and a local name separated by the colon (:) delimiter. The prefix is a reference to a URI namespace, i.e. the part of a URI common to all resources of a domain. For example, resources defined by the FOAF ontology share the namespace http://xmlns.com/foaf/0.1/, which is usually mapped to the prefix “foaf”. The CURIE for the property http://xmlns.com/foaf/0.1/based_near will therefore become “foaf:based_near”.
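The relationship between a CURIE and its full URI can be illustrated with a short Python sketch. The prefix table is a plain dictionary here and the function names are my own; only the FOAF mapping comes from the text above.

```python
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Replace the prefix of a CURIE with the namespace it is mapped to."""
    prefix, sep, local_name = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefixes[prefix] + local_name


def compact_uri(uri: str, prefixes: dict) -> str:
    """Abbreviate a URI to a CURIE if a known namespace matches; else return it as is."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


print(expand_curie("foaf:based_near", PREFIXES))
print(compact_uri("http://xmlns.com/foaf/0.1/based_near", PREFIXES))
```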

By extending the URI of the subject (http://chucknorris.com/data_/chuck) with the predicate in the CURIE form (foaf:based_near), the blank node from the above example will obtain the URI http://chucknorris.com/data_/chuck/foaf:based_near. However, the character “:” is reserved in the URI syntax and forbidden in file and folder names, as well as in other contexts, so an alternative delimiter is needed. Instead of the “:” we can use the underscore (_), making the previous example look like this:

http://chucknorris.com/data_/chuck/foaf_based_near

The triple in question will look like this:

<http://chucknorris.com/data_/chuck>
   foaf:based_near
      <http://chucknorris.com/data_/chuck/foaf_based_near> .

The same method can be applied to other blank nodes, for instance:

<http://chucknorris.com/data_/chuck/foaf_based_near>
   geo:lat
      <http://chucknorris.com/data_/chuck/foaf_based_near/geo_lat> .

When using the CURIE syntax, one needs to define the prefixes and map them to the appropriate namespaces. This definition is usually located at the beginning of a document. For example, in the Turtle notation the keyword “@prefix” is used at the beginning of a file, while in notations based on XML, it is usually defined on the root tag using the “prefix” or “xmlns” attributes. Since the web site has a tree structure, the logical choice for the definition of a prefix is the root of the tree. Prefixes are therefore defined at the website level and placed on the “website.com/prefix_” path. For example, the URL http://chucknorris.com/prefix_/foaf can return the reference to http://xmlns.com/foaf/0.1/ namespace. Therefore, for the CURIE form of a URI, the full URI can be obtained in a relatively simple way.

The second problem is related to the assignment of URIs in the situation where there are multiple RDF triples with the same subject and predicate, but different objects. For example, what will happen if the node http://chucknorris.com/data_/chuck from the example graph is connected by the same property “foaf:based_near” to multiple (geo:Point) nodes? In that case, the http://chucknorris.com/data_/chuck/foaf_based_near URI is not suitable because it is unclear which node it refers to. It is therefore necessary to provide a mechanism that allows a distinct URI for each node.

Here an analogy with arrays in programming languages can help. If based_near is the name of an array, its members will be named based_near[0], based_near[1] and so on. One can also use an associative array (hash), where (descriptive) keys are used as indexes instead of numbers, for example based_near['belgrade'] and based_near['pancevo'].

In the HTTP context, the names of array members will become the URIs http://chucknorris.com/data_/chuck/foaf_based_near/1 and http://chucknorris.com/data_/chuck/foaf_based_near/2 (for simplicity and compatibility with other standards the indices start from 1 instead of 0). The associative array equivalents would be http://chucknorris.com/data_/chuck/foaf_based_near/belgrade and http://chucknorris.com/data_/chuck/foaf_based_near/pancevo.
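
The array analogy translates directly into code. The sketch below is illustrative only: member_uris is a made-up helper, and percent-encoding the key is an extra precaution assumed here, not something the scheme prescribes.

```python
from urllib.parse import quote


def member_uris(property_uri: str, keys) -> list[str]:
    """Append one key segment per object of a multi-valued property."""
    # quote() percent-encodes characters that are not safe in a path segment
    # (an assumption of this sketch; plain keys pass through unchanged).
    return [f"{property_uri}/{quote(str(key), safe='')}" for key in keys]


base = "http://chucknorris.com/data_/chuck/foaf_based_near"
numbered = member_uris(base, range(1, 3))           # indices start from 1
named = member_uris(base, ["belgrade", "pancevo"])  # descriptive keys
print(numbered)
print(named)
```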

These segments should be chosen carefully to ensure the stability of the URIs, because changing one later affects the URIs of all child nodes, which contain the URI of the parent node. These “key” segments can also be used when a property currently has only one object but more are expected in the future. This ensures that adding a new object for the same property later will not change the existing URI. If the property is unique, the key can be omitted.

Appending predicates in shortened (CURIE) form to the subject’s URI, and then appending arbitrary keys to the resulting URI, gives a simple mechanism for assigning URIs to all nodes of an RDF graph. “Blank” nodes are now identified by URIs just like URI references. Using the same method, literals can get a URI as well, which will be discussed in more detail in the following post. With URIs assigned to blank nodes, our example graph looks like this:

URIs tailored this way are always defined in the context of the “parent” URI, which makes them dependent on it. The nodes they identify represent some kind of property of the node in whose context they have been defined, meaning that deleting the parent will cause the deletion of its child. However, the “initial” nodes (for example http://chucknorris.com/data_/chuck) are in a similar way dependent on the website, so viewed that way there are no fundamental differences between the “initial” and the “blank” nodes.


Fixing the RDF model: (re)defining a node of an RDF graph

In the previous posts, I analyzed the problems of the RDF model – the existence of blank nodes, various problems related to plain and typed literals and the absence of the universal concept of a node in an RDF graph. A node, the basic element of an RDF graph, is not clearly defined. There are conceptually completely different types of nodes, with no unique method of identification. This is the key problem that more or less directly causes other problems of the RDF model and technologies that are based on RDF. It is therefore necessary to start with this problem.

There are three types of nodes in an RDF graph – URI references, blank nodes and literals. The picture above shows a typical RDF graph that contains these three types of nodes. The graph describes a person identified by the URI http://chucknorris.com/data_/chuck, his name and whom he knows. He is based near a geographical point, which is described as well. The rdf, foaf and geo ontologies are used.

URI references and blank nodes are usually shown as circles or ellipses, while literals are depicted as rectangles. The ellipses that represent resources contain the name (URI) of the resource (except for a blank node, whose ellipse is empty), while the rectangles representing literals contain a literal value.

In order to define the universal concept of a node, we have to analyze features and aspects that are common to all nodes. There are two main ways to approach a node – it can be viewed as a data structure and as a symbol.

A node as a data structure

When viewed as a data structure, two main aspects of a node can be singled out – its name and the data that it may hold. The example graph shows that nodes are determined by the first or the second aspect. URI references are determined by their name, while literals are determined by their literal value, i.e. some data. One can ask a question: Do nodes represented by a URI hold some data, and what is the name of a literal?

URI references may represent information or non-information resources. Non-information resources are determined by their name (URI), which distinctly separates them from other resources, and they are not literal values, but represent specific things, concepts or ideas. Therefore, URI references representing non-information resources don’t hold any data.

Information resources contain information; however, this information belongs to their representation, which is a concept distinct from the resource. In other words, if the representation were presented in an RDF graph, it would become a literal rather than a URI reference. One can say that information resources contain data, while literals are data themselves. Therefore, URI references that represent information resources also lack data.

On the other hand, what is the name of literals and blank nodes? Literals are identified by the values they represent, so it can be said that the name is equal to their value. Blank nodes “indicate the existence of a thing, without using, or saying anything about, the name of that thing”. Blank nodes can have a local name, but it is not part of the abstract syntax, in which a blank node “has no intrinsic name”. Therefore, blank nodes by definition have no name.

The above analysis can be represented in a simple table:

                name   data
URI reference   yes    no
Blank node      no     no
Literal         yes    yes

We can conclude that all nodes, no matter how different from each other, are determined by these two fundamental aspects: a name and data. In other words, one can speak about the universal concept of a node, a superclass from which the subclasses (URI references, blank nodes and literals) inherit, and which differ in how these two aspects are manifested.

fixing rdf node

URI references hold no data, and blank nodes in addition have no name. In order to describe these situations we can use the NULL value, which indicates that there is no value and is different from zero or empty string (“”).

                name     data
URI reference   URI      NULL
Blank node      NULL     NULL
Literal         string   string

The table shows the different values of the name and data aspects for the different types of nodes. It should be mentioned that a typed literal is a string combined with a datatype URI. In the table, a plain literal is shown for simplicity’s sake.

The universal concept of a node can be realized as an unordered set of name/value pairs, namely two pairs that both can have a NULL as a value. This data structure is referred to as an object, record, struct, hash table, associative array and others, depending on the context.
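
As a rough illustration, the table can be written down as a small Python record in which None plays the role of NULL. The class name Node, the helper kind and the literal value “Chuck Norris” are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A node as two name/value pairs; None stands for NULL."""
    name: Optional[str]
    data: Optional[str]


def kind(node: Node) -> str:
    """Classify a node by which of its two aspects are present."""
    if node.name is None:
        return "blank node"
    if node.data is None:
        return "URI reference"
    return "literal"


uri_ref = Node(name="http://chucknorris.com/data_/chuck", data=None)
blank = Node(name=None, data=None)
# A plain literal is identified by its own value, so name and data coincide.
literal = Node(name="Chuck Norris", data="Chuck Norris")

for node in (uri_ref, blank, literal):
    print(kind(node), node)
```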

Using this new concept of a node, the previous RDF graph can be represented as follows:

This graph is a good example that shows the mess around the various methods of identifying the nodes. URIs and strings are used as identifiers for different nodes, and a node can also be blank.

Key issues regarding the definition of the universal node of an RDF graph are whether a node can be unnamed, and whether there may be several ways of identifying nodes. In previous posts numerous problems caused by the existence of blank nodes have been discussed. In a context where the focus is on data, the ability to easily reference a node is expected and logical. It is therefore necessary that all nodes have a name.

However, the existence of the name itself is not enough. A simple model requires a unique way of creating the names, i.e. the IDs for all nodes. One of the assumptions for the realization of the original Web was the existence of a single mechanism for assigning IDs at a global level, i.e. URIs to all resources. A URI has a key role when it comes to the realization of RDF and the Web of Data (Linked Data), so a solution which allows nodes that are not identified by a URI can be rightfully questioned.

A node as a symbol

A node is clearly used for representing all kinds of things – real-world objects, ideas, anything you can imagine. So, a logical assumption is that a node is some kind of symbol. Let’s see what Wikipedia says about a symbol:

A symbol is something which represents an idea, a physical entity or a process but is distinct from it.

This definition is very close to the idea of a URI reference, which may represent practically anything. It is also clearly distinct from the thing it represents. A URI reference representing Chuck Norris and Chuck Norris are not the same things. Therefore, a URI reference can be referred to as a symbol. The same can be said of blank nodes, which basically have the same properties as URI references, with the difference that they have no name (which seems not to be required by the symbol definition).

On the other hand, a literal is defined as a “string combined with an optional language tag” or “with a datatype URI”. “Plain literals are considered to denote themselves, so have a fixed meaning.”

If a literal refers to itself, it is not distinct from the entity it represents. A literal doesn’t represent data, it is data itself. Thus, the literal does not meet the fundamental criteria to be a symbol, meaning that an RDF graph consists of a mixture of symbols with some elements that are not symbols. In other words, the structure of an RDF graph as an abstract representation is not clearly separated from what the graph with nodes represents.

To understand why this is a problem, let’s look at the “role of context in symbolism”, where a rather brief but clear description with an example is given:

The context of a symbol may change its meaning. Similar five–pointed stars might signify a law enforcement officer or a member of the armed services, depending the uniform.

Therefore, one of the symbol’s properties is that its meaning depends on a context. The meaning of a URI reference is determined by relations with other nodes, i.e. triples which describe it. Connect it to different nodes and you’ll change its meaning.

On the contrary, a literal “has a fixed meaning“. Which is interesting because data, by definition, “on its own carries no meaning“. In an RDF graph, the property used in a literal triple does not affect the meaning of the literal. Even if the property’s range is defined, information about the meaning of the literal is contained only in the literal itself and is immutable.

As I previously stated, the problem is that a literal doesn’t represent a literal value, it is that value itself. Another problem is that this value is used as a universal identifier. Both things are against the nature of a symbol. They also make a literal a completely different kind of node than a URI reference. Can one simple model stand so much variety?

The RDF model can be made much simpler. It can have all nodes conceptually equal and identified using the same mechanism.

A graph is always an abstract representation, containing the nodes that always represent, i.e. symbolize things. A literal, therefore, as the node of an RDF graph, has to be a symbol, distinct from what it represents. Having said that, one must distinguish between a literal node and a literal value that the node represents, the same way a URI reference is distinguished from a resource it refers to.

Secondly, the use of a literal value as an identifier is clearly a bad idea. Introducing another way of identification in a context in which there is already a powerful identifier, the URI, unnecessarily adds to the complexity of the RDF model. Finally, RDF is realized on the Web, where a URI is a natural identifier as well.

What is the meaning of a literal and how to identify it correctly? In an earlier blog post, I compared the RDF model to the object-oriented model, making an analogy between objects and URI references. A literal in the OO context is “identified” as the value of an object’s property. If we take this analogy, a literal node should have a special role in an RDF graph – one in which it acts as a value of another node.

Therefore, there are only two types of nodes. A newly defined literal is also identified by a URI, causing the term “URI reference” to become problematic. However, for simplicity, I will keep on using the old terminology, while a literal can be understood as a URI reference that holds data.

The definition of a node

On the basis of the above analysis we can single out several requirements that the RDF model must meet in order to achieve maximum simplicity and consistency. Besides the things we already know – that a node is an element of a graph connected to other nodes via typed links, and that it can represent a resource or a literal value, we can add a few more:

  • A node has two aspects – a name and data
  • A node’s name must always be a URI
  • A node is a symbol, meaning it is always distinct from what it represents
  • A node that holds data is a special kind of node acting as the value of another node

Thanks to these requirements and constraints, it seems that we have enough material to try to finally define a node. A node of an RDF graph, therefore, can be defined as a symbol identified by a URI that represents a resource or a literal value, connected by typed links with other nodes, forming a directed, labeled graph.

Structurally, a node is determined by a name and data and consists of two key-value pairs. A name is always a URI, whose primary role is to identify the node in a global context. A URI, however, has other important functions that will be further analyzed in future posts. If it represents a literal value, a node has a special role in the graph – it acts as the value of another node.

Now we need to materialize this theory in practice. The previous graph example, according to the new definition will look like this:

First, note that the notation is simplified: instead of using “DATA: null”, we simply omitted the rectangles. Also, there is no need to repeat “NAME:” and “DATA:” all the time, because the names are already represented by the ellipses and data is represented by the rectangles.

Three new blank nodes are added – one for each literal. “Blank” nodes and literals have a question mark instead of a URI.

Let’s first focus on the identification and the challenge of assigning URIs to all nodes. For now, let’s concentrate on blank nodes. How to assign a URI to blank nodes? More on that in the next post.


Problems of Linked Data (4/4): Consuming data

The problem of consuming data published in the Linked Data style can be best understood by an example. For instance, imagine a user or a software agent wanting to find out what the capital of Germany is. Let’s assume that the user already knows the URI reference that represents the concept of Germany (http://dbpedia.org/resource/Germany) and the URI of the property “has capital” (http://dbpedia.org/ontology/capital). Therefore, the user is basically looking for the object of a triple in which the subject and the predicate are known.

<http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .

The user expects to find the answer by looking up the URI http://dbpedia.org/resource/Germany. To get desired data, one has to go through the following procedure:

  • First, the user sends an HTTP request to http://dbpedia.org/resource/Germany. In the HTTP headers he specifies the RDF notation (format) in which he wants to receive the description of the resource. For RDF/XML syntax, the “Accept:” header in the HTTP GET request should look like this:

    Accept: text/html;q=0.5, application/rdf+xml

    The “Accept:” header indicates that the client would take either HTML or RDF, but would prefer RDF. This preference is indicated by the quality value q=0.5 for HTML. This is called content negotiation.

  • The server would answer:

    HTTP/1.1 303 See Other
    Location: http://dbpedia.org/data/Germany.rdf
    Vary: Accept

    This is a 303 redirect, which tells the client that a Web document containing a description of the requested (non-information) resource, in the requested format, can be found at the URI http://dbpedia.org/data/Germany.rdf (the “Vary:” header is required so that caches work correctly).

  • Next, the client will try to de-reference the new URI, looking up the http://dbpedia.org/data/Germany.rdf, given in the response from the server.

  • The server then responds with a “200 OK” message, thus telling the client that the response contains the representation of the information resource. The “Content-Type:” header indicates the desired RDF/XML format, and the rest of the message contains the representation describing the desired non-information resource, i.e. the triples encoded in the RDF/XML notation. This description can be of significant size – in this particular case (http://dbpedia.org/data/Germany.rdf) it weighs nearly half a megabyte (428 KB).

  • When the download is complete, the description must be parsed, which requires a special library. The usual procedure is that the triples are loaded into a local graph, and queries are then performed, depending on the implementation, via API methods or SPARQL.

  • Finally, the desired information is obtained — the URI reference of the capital of Germany is http://dbpedia.org/resource/Berlin (34 bytes). If you need some additional information describing Berlin, you have to repeat the entire procedure with a new URI http://dbpedia.org/resource/Berlin.
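
To make the amount of work concrete, here is a hedged sketch of the whole procedure using only the Python standard library. It is not how a production client would do it: a real client would use a full RDF parser, while find_objects below only understands the simplest rdf:Description / rdf:resource form. All function names and the inline SAMPLE document are invented for the example, and fetch_description is defined but not called, so nothing is downloaded.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ACCEPT = "text/html;q=0.5, application/rdf+xml"


def build_request(uri: str) -> urllib.request.Request:
    """A GET request that accepts HTML but prefers RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": ACCEPT})


def fetch_description(uri: str) -> bytes:
    """Look up the URI; urllib follows the 303 redirect and returns the document body."""
    with urllib.request.urlopen(build_request(uri)) as response:
        return response.read()


def find_objects(rdf_xml, subject: str, predicate: str) -> list[str]:
    """Return the URI objects of triples matching (subject, predicate, ?object)."""
    root = ET.fromstring(rdf_xml)
    objects = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        if description.get(f"{{{RDF_NS}}}about") != subject:
            continue
        for prop in description:
            # ElementTree writes qualified names as "{namespace}local".
            if prop.tag.startswith("{") and prop.tag[1:].replace("}", "", 1) == predicate:
                resource = prop.get(f"{{{RDF_NS}}}resource")
                if resource is not None:
                    objects.append(resource)
    return objects


# A tiny stand-in for the 428 KB document returned by the server.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:dbo="http://dbpedia.org/ontology/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Germany">
    <dbo:capital rdf:resource="http://dbpedia.org/resource/Berlin"/>
  </rdf:Description>
</rdf:RDF>"""

capitals = find_objects(
    SAMPLE,
    "http://dbpedia.org/resource/Germany",
    "http://dbpedia.org/ontology/capital",
)
print(capitals)  # ['http://dbpedia.org/resource/Berlin']
```

Even in this reduced form, a request, a redirect, a download and a parsing step stand between the user and a 34-byte answer.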

problems of linked data: consuming data

As seen in this example, the access to an RDF triple and its object requires a significant number of steps, as well as the time for downloading the representation, parsing it and querying the results. It requires programming skills, the necessary libraries and knowing their methods or the SPARQL language for creating queries.

On the other hand, there is an alternative way of fetching data – via a SPARQL endpoint. Data from the example can be obtained using a simple query sent (as a query string) to http://dbpedia.org/sparql/:

SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}
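
Sending the query “as a query string” means URL-encoding it into a GET request. A minimal sketch follows; sparql_get_url is a made-up helper name, and the request is only built here, not sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql/"
QUERY = """SELECT ?object WHERE {
  <http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/capital> ?object .
}"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query into the 'query' parameter of a GET URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```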

The problem is that the user has to know another standard – the SPARQL language. Also, many sites that publish Linked Data don’t have a SPARQL endpoint, which is not a mandatory requirement of Linked Data, but a recommendation if a dataset is large, such as in our DBpedia example.

However, Linked Data is not about a single SPARQL endpoint for accessing data, but rather the opposite – it’s about breaking the dataset into a web of interconnected resources identified by HTTP URIs. In this sense, the described procedure for obtaining simple information is the typical and recommended way of accessing data published according to Linked Data principles. Therefore, there is an obvious problem – an inability to perform simple operations in a quick and easy way.

Leigh Dodds covered the data access problem in the blog post RDF Data Access Options, or Isn’t HTTP already the API?. The post was a follow-up to the discussion on the limitations of Linked Data, triggered by his following comment:

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely.

The Linked Data API is an additional layer that provides a simple REST API over RDF graphs to bridge the gap between Linked Data and SPARQL. The API layer acts as a proxy for any SPARQL endpoint, allowing more sophisticated queries without knowledge of SPARQL. This API allows Linked Data and SPARQL to be “converted” into a REST API – a method that is widely accepted and familiar to web developers.

The view that an extra layer in the form of Linked Data API is needed has provoked the question that Ed Summers asked on Twitter:

@ldodds but your blog post suggests that an API for linked data is needed; isn’t http already the API?

This is the crucial question about the nature of Linked Data, which can also be asked as: “Isn’t Linked Data already the API?”. The aforementioned blog post by Leigh Dodds deals with this problem and analyzes the limitations of Linked Data. The author states that Linked Data provides two basic methods of data access:

  • Resource Lookups: by dereferencing URIs we can obtain a (typically) complete description of a resource.
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; this is (typically, not necessarily) reconstituted into a graph on the client.

Leigh then argues that in order to provide an advanced level of functionality, at least two additional important aspects of data interaction should be provided:

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.
  • Existence Checks: ability to determine whether a particular structure is present in a graph.

SPARQL can handle all of these options, as well as far more complex operations. However, by using SPARQL one is stepping around HTTP, which is the basic assumption of the traditional Web and the Web of data. From a hypermedia perspective, using parameterised URLs, i.e. queries integrated in the HTTP protocol, is a much more natural solution than tunneling SPARQL queries. The hypermedia principle is important not only in the REST architecture, but also in Linked Data, which is based on the Web technologies HTTP and URI. Leigh therefore argues that the Linked Data API could be a good solution for this problem.

One can conclude from Leigh’s post that Linked Data is not enough, which suggests two possibilities: that it represents the basic functionality that can be built on, or that it doesn’t provide even the basic functionality.

Linked Data enables traversing a labeled, directed graph. It uses the universal interface based on dereferenceable HTTP URIs, but beneath that, there is a large diversity of syntaxes. After de-referencing a URI reference, one can face any of the numerous formats, which are sometimes rendered as HTML, sometimes as XML, and sometimes as plain text. They are often unreadable, so you have to look at the source code to try to figure out what they’re about. And when you do that, you often can’t click the links, so you have to copy/paste URIs. And sometimes descriptions cannot be opened in the browser, so they’re downloaded. In that case you have to open them in a text editor. So better turn the “URL Highlighting” option on. Sorry, but that is not a good user experience.

An HTTP URI as a universal interface is just not enough. The Web has clearly shown the need for a universal syntax, and a universal way of encoding hyperlinks – its most fundamental elements. What is the <a href=""> equivalent in the Linked Data world? Where is the universal syntax for a hyperlink in the Web of data context?

The conceptual problem of Linked Data has been covered in one of the previous posts, where I analyzed the decision to decompose an RDF graph into so-called RDF molecules. One shouldn’t forget that this approach is associated with the deeper problems of the RDF model. The manifestation of this decision in practice is shown in the case of accessing the object of an RDF triple – a very simple operation that requires considerable effort and time.

It can be concluded that, although founded on the good initial idea, Linked Data has a lot of problems and suffers from serious inconsistencies. Linked Data is not defined properly. A lot of room for different interpretations indicates its substantial weakness.

Linked Data celebrates HTTP URIs, but a significant number of the nodes in a graph are not identified by HTTP URIs. It aims to build the Web of data, but still centres around documents. It tries to introduce a new paradigm, but is stuck in the old mindset. It is inspired by the original Web, but is unable to provide its level of simplicity.

Publishing data by Linked Data rules is very hard for most people. Consuming data is hard. Understanding the underpinning theory is hard. Almost everything in Linked Data is hard. And what do you get? Not even the basic functionality of an API. Traversing a graph and getting data is difficult and inefficient if done programmatically and almost impossible in a browser.

Considering the serious data access problems, the idea of adding another layer of complexity, some kind of API, sounds like the only reasonable solution. However, for a change, let’s focus on the causes instead of fixing the consequences. And the major cause is the inherited problems of the underlying (RDF) data model.