Getting rid of typed literals

So far I have been dealing with the complexities of the RDF model – the mess surrounding the concept of a node, the many different types of nodes, and the different methods of identifying nodes that have a name. However, some aspects are not just overcomplicated, but plain ugly as well. Here I am referring to the last bits of the RDF model that need serious attention: typed literals and literals with language tags.

Like blank nodes, this is a subject that has caused many debates over the years. The people who made the model somehow always find a justification for every ugly bit of it. However, the fact that the RDF model is „logically consistent“ (or whatever the right word is) doesn’t mean that it’s the only solution, and certainly doesn’t mean that it is the best one.

In the post on redefining the node of an RDF graph, I described nodes from two aspects: a node as a data structure and a node as a symbol. As a data structure, a node is basically an object with two attributes: a name and data. Therefore, if the data attribute holds a datatype or a language tag in addition to the value, it no longer contains atomic data. It becomes an object itself, with two attributes. And that sucks.

So why does the datatype have to be explicitly added to a literal every time? Couldn’t the datatype be declared, in a similar way to how an instance is declared? If the range of a property can be described, and the type of an instance can then be inferred, why doesn’t that work for literals, too?

The problem

In the discussion on this subject on the RDF Working Group mailing list, Antoine Zimmermann explained it nicely:

Often, one would like to write:

ex:prop  rdfs:range  xsd:decimal .
ex:sub   ex:prop     "42" .

and infer that “42” is a decimal number. However, what one gets from these two triples is that “42” is a sequence of 2 characters AND a decimal, which is inconsistent.

Antoine then writes that overcoming this in the RDF model is hard, because „literals are universal identifiers, just like URIs“. So „42“ identifies the same thing in every situation. He goes on with another example, which would not be possible in practice:

ex:prop rdfs:range xsd:decimal .
ex:sub ex:prop "42" .
ex:password rdfs:range xsd:string .
ex:sub2 ex:password "42" .

Here, certainly some people would expect the first “42” to be denoting the number, while the second is just two characters. But this implicitly assumes that the denotation of literals is contextual: it would depend on which predicate is used in the triple. While it would be possible, in principle, to define a language where this makes sense, it does not fit at all with the RDF data model.
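To make the „contextual“ reading concrete, here is a minimal Python sketch of what such a language would do with the four triples above. It is only an illustration, not part of RDF or of any RDF library: the RANGES and PARSERS tables and the denote function are hypothetical names, and the sketch assumes that every predicate has exactly one datatype as its range.

```python
from decimal import Decimal

# Hypothetical range declarations, mirroring the two rdfs:range triples above.
RANGES = {
    "ex:prop": "xsd:decimal",
    "ex:password": "xsd:string",
}

# Hypothetical lexical-to-value mappings for the two datatypes used here.
PARSERS = {
    "xsd:decimal": Decimal,
    "xsd:string": str,
}

def denote(predicate, lexical):
    """Interpret a lexical form according to the range of its predicate."""
    datatype = RANGES[predicate]
    return PARSERS[datatype](lexical)

number = denote("ex:prop", "42")        # a decimal number
password = denote("ex:password", "42")  # a two-character string
```

The same lexical form ends up denoting two different things, which is exactly what the classical model, where a literal is a universal identifier, does not allow.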

In another response in the same thread, Pat Hayes wrote, regarding inferring datatypes from the property definitions:

[...] in general, a property range is a class, not a datatype. What happens when the range is just a class and has no associated L2V mapping? Also, a property can have many ranges. What happens if two of them are datatype classes? Which one gets control over the interpretation of the literal string? [...]

A new literal

Before I start discussing potential solutions for the above problems, let’s recall how literals are different in the new RDF model I’ve been proposing in this blog, compared to the classical RDF model:

  • Literal nodes are identified by URIs, not by their values
  • Literal nodes are not literal values themselves, but the symbols representing the values
  • A literal node is always the object of the rdf:value predicate in a triple

In the new model, literals are no longer treated as universal identifiers. They are, like every other node in the RDF model, identified by a URI. They represent literal values, whose meaning depends on the context. This context is not defined by the literal, but by the node that has the literal as its value. This node is an instance (object) of one or more classes. It has one main value, represented by a literal and realized by the rdf:value property. This kind of node resembles the primitive data types from programming languages, such as string, integer, float, boolean…
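As a rough illustration of the difference, a literal node can be modelled as a name plus atomic data. The Node class and the two example URIs below are hypothetical, made up only for this sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    name: str                   # the URI that identifies the node
    data: Optional[str] = None  # atomic value; None for non-literal nodes

# Two literal nodes that hold the same value but have different names.
weight_value = Node("http://example.com/a/ex_weight/rdf_value", "42")
password_value = Node("http://example.com/b/ex_password/rdf_value", "42")

same_data = weight_value.data == password_value.data  # identical values
same_node = weight_value == password_value            # different nodes
```

Both nodes hold „42“, yet they are different nodes, because identity comes from the name and not from the value.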

The difference is that in programming, you don’t have to define the meaning of every variable. In object oriented programming, for example, the „primitive“ variable named weight is often implemented as an instance of the Float class. In RDF, however, the weight has a meaning that is not limited to the datatype of its value.

rdf:datatype

Therefore, we have a type of an object and a datatype of that object’s value as distinct concepts. For instance, a product weight can be an instance of the class ex:Weight, and its value can be of float datatype. Check out the following example:

ex:Weight
    rdf:type      rdfs:Class ;
    rdf:datatype  xsd:float .

ex:hasWeight
    rdfs:range    ex:Weight .

<http://example.com/data_/product/item10245>
    ex:hasWeight  <http://example.com/data_/product/item10245/ex_hasWeight> .
<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    ex:inUnit     ex:kg .

The new property rdf:datatype used here has the meaning „the value of S (or of an instance of S, when S is a class) is of datatype O“. Therefore, the value of http://example.com/data_/product/item10245/ex_hasWeight, which is an instance of ex:Weight, is of datatype xsd:float.

The literal is represented by its value 2.4. There is no need to explicitly write its URI in the syntax. If needed, it can be easily obtained by appending the CURIE segment rdf_value to the URI of its parent node. The result is http://example.com/data_/product/item10245/ex_hasWeight/rdf_value.
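The construction of that URI can be sketched as follows. The two helper functions are hypothetical, and the rule they implement (replace the colon of the CURIE with an underscore and append the result as a path segment) is my reading of the URIs used in the example above:

```python
def curie_to_segment(curie):
    """Turn a CURIE such as 'rdf:value' into the URI segment 'rdf_value'."""
    prefix, local = curie.split(":", 1)
    return f"{prefix}_{local}"

def child_uri(parent_uri, curie):
    """Append the CURIE segment to the URI of the parent node."""
    return f"{parent_uri.rstrip('/')}/{curie_to_segment(curie)}"

product = "http://example.com/data_/product/item10245"
weight = child_uri(product, "ex:hasWeight")
literal = child_uri(weight, "rdf:value")
```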

2.4 is not a universal identifier any more, so 2.4 doesn’t have to always identify the same thing. Its meaning thus becomes relative to the context, not absolute in the whole RDF graph. For instance, it can sometimes be a float number, and sometimes a string. It’s just a value whose meaning depends on the type of its parent node. In this case, the literal acts as the value of an instance of the class ex:Weight, whose rdf:datatype is described as xsd:float, so we can say that 2.4 is a float number.
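The chain of reasoning (property range, then class, then rdf:datatype) can be followed mechanically. Below is a minimal sketch over the triples of the weight example, stored as plain Python tuples; the helper functions are hypothetical and no RDF library is involved:

```python
ITEM = "http://example.com/data_/product/item10245"
WEIGHT = ITEM + "/ex_hasWeight"

# The triples of the example, as (subject, predicate, object) tuples.
TRIPLES = {
    ("ex:Weight", "rdf:type", "rdfs:Class"),
    ("ex:Weight", "rdf:datatype", "xsd:float"),
    ("ex:hasWeight", "rdfs:range", "ex:Weight"),
    (ITEM, "ex:hasWeight", WEIGHT),
    (WEIGHT, "rdf:value", "2.4"),
    (WEIGHT, "ex:inUnit", "ex:kg"),
}

def objects(subject, predicate):
    """All objects of the triples with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def inferred_classes(node):
    """Classes of a node, inferred from the range of its incoming properties."""
    classes = set()
    for s, p, o in TRIPLES:
        if o == node:
            classes |= objects(p, "rdfs:range")
    return classes

def value_datatypes(node):
    """Datatypes declared with rdf:datatype on the inferred classes of a node."""
    datatypes = set()
    for cls in inferred_classes(node):
        datatypes |= objects(cls, "rdf:datatype")
    return datatypes

classes = inferred_classes(WEIGHT)
datatypes = value_datatypes(WEIGHT)
```

For this graph the weight node is inferred to be an ex:Weight, and the only datatype found for its value is xsd:float.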

This is a good example of why it is important to make a clear distinction between the concepts of a node, its name and the data it holds. In the case of the literal 2.4, one can say that this literal is a node that has the name http://example.com/data_/product/item10245/ex_hasWeight/rdf_value and holds the value 2.4. In short, literal node != literal value, and literal value != literal name.

Furthermore, a property range is always a class, while a datatype is the range of the new, special property rdf:datatype. This way, classes and datatypes with an associated L2V mapping are clearly separated.

In reality, one cannot expect that a class will have just one rdf:datatype. In addition, a node can be an instance of more than one class, each with a different datatype defined. As a solution, one can describe the instance itself with the preferred rdf:datatype, as in the following example:

<http://example.com/data_/product/item10245/ex_hasWeight>
    rdf:value     "2.4" ;
    rdf:datatype  xsd:float ;
    ex:inUnit     ex:kg .

The datatype information would be on the URI http://example.com/data_/product/item10245/ex_hasWeight/rdf_datatype, providing a much more elegant way of expressing datatypes than typed literals.

But it shouldn’t be mandatory. Often, the context can help. In this case, there is the property ex:inUnit whose rdfs:domain is, say, ex2:Measure, with an rdf:datatype of xsd:float. In addition, when in doubt, a data consumer can check the superclass of ex:Weight, and find out that it is, say, ex3:Measure, where ex3 is an authoritative source.
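One possible lookup order for a data consumer is sketched below: first the instance itself, then its classes, then their superclasses. This order, the resolve_datatype function and the ex:w1 and ex:w2 nodes are my assumptions for the sake of illustration. Inference through rdfs:domain, as in the ex:inUnit case, is left out to keep the sketch short.

```python
def resolve_datatype(node, triples):
    """Return a datatype for the value of a node, or None if none is found.

    Assumed lookup order: the node itself, then its classes, then their
    superclasses (breadth-first). If several datatypes are declared at the
    same place, the alphabetically first one is returned, which is an
    arbitrary choice made only for this sketch.
    """
    def objects(subject, predicate):
        return sorted(o for s, p, o in triples if s == subject and p == predicate)

    queue = [node]
    seen = set()
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        declared = objects(current, "rdf:datatype")
        if declared:
            return declared[0]
        queue.extend(objects(current, "rdf:type"))
        queue.extend(objects(current, "rdfs:subClassOf"))
    return None

triples = {
    ("ex:w1", "rdf:type", "ex:Weight"),
    ("ex:w2", "rdf:type", "ex:Weight"),
    ("ex:w2", "rdf:datatype", "xsd:decimal"),  # declared on the instance itself
    ("ex:Weight", "rdfs:subClassOf", "ex3:Measure"),
    ("ex3:Measure", "rdf:datatype", "xsd:float"),
}

from_superclass = resolve_datatype("ex:w1", triples)
from_instance = resolve_datatype("ex:w2", triples)
unknown = resolve_datatype("ex:w3", triples)
```

Here ex:w1 gets its datatype from the superclass ex3:Measure, ex:w2 from its own description, and ex:w3 stays unresolved.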

Chances are that, ultimately, a limited number of superclasses defining datatypes will emerge, and that classes whose instances can have values will, in order to be trusted, have to be declared as subclasses of these superclasses.

Finally, one last note on the datatypes in RDF. My impression is that a few datatypes are used much more frequently than the others. These most frequently used datatypes should really be part of the core RDF (or RDFS) ontology; for instance, rdf:String, rdf:Number, rdf:Boolean and rdf:Date would be just enough. Actually, it would make sense to merge RDF and RDFS with a few elements from OWL (so-called RDFS++) into one core ontology. Things are sometimes just ridiculously unintuitive: take rdf:Property and rdfs:Class for instance. But that’s another story.

rdf:language

Finally, we get to the literals with language tags – another special case of literals. Let’s look at the example from DBpedia:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrade is the capital and largest city of Serbia. "@en .
<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrad ist die Hauptstadt der Republik Serbien."@de .

In this snippet, the language tags en and de, placed at the end of the literals using (another) special syntax @, denote that the literals are in English and German, respectively.

The solution for literals with language tags is based on the same principles as the solution for typed literals. Let’s focus on the English version. First, we need to distinguish the concept of a value from the resource that contains that value:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment <http://dbpedia.org/resource/Belgrade/rdfs_comment/en> .

<http://dbpedia.org/resource/Belgrade/rdfs_comment/en>
    rdf:value "Belgrade is the capital and largest city of Serbia." .

…which can be shortened to:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia."
] .

…or shortened even further:

<http://dbpedia.org/resource/Belgrade>
    rdfs:comment:en "Belgrade is the capital and largest city of Serbia." .

Here, rdfs:comment:en is an example of the extended CURIE described in an earlier post.

http://dbpedia.org/resource/Belgrade/rdfs_comment/en is the URI of a node whose value is the description in English. The segment at the end of the URI is required because there are several rdfs:comment properties. en can be a convention that takes over the role of the previously used language tag (in the form of @en). This solution shifts the language information into the URI and doesn't require special treatment in the RDF model, as was the case with literals with language tags.
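The relation between the shortest form and the full form can be sketched as a small expansion function. The function name is hypothetical, and the expansion rule (prefix_local followed by the qualifier as an extra path segment) is only what the examples above suggest; the extended CURIE itself is defined in the earlier post.

```python
def expand_shorthand(subject, extended_curie, value):
    """Expand 'subject prefix:local:qualifier "value"' into two triples."""
    prefix, local, qualifier = extended_curie.split(":")
    predicate = f"{prefix}:{local}"
    node = f"{subject}/{prefix}_{local}/{qualifier}"
    return [
        (subject, predicate, node),  # the resource points to the comment node
        (node, "rdf:value", value),  # the comment node holds the plain literal
    ]

triples = expand_shorthand(
    "http://dbpedia.org/resource/Belgrade",
    "rdfs:comment:en",
    "Belgrade is the capital and largest city of Serbia.",
)
comment_node = triples[0][2]
```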

However, in order to explicitly describe the language, an additional triple is needed:

<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia." ;
    rdf:language <http://dbpedia.org/resource/English_language>
] .

Here I used the new rdf:language property to describe the language of the comment. The subject of the RDF triple is the comment identified by http://dbpedia.org/resource/Belgrade/rdfs_comment/en, and the object is the DBpedia resource representing the English language.

It seems reasonable that rdf:language will be used to describe primarily instances, compared to rdf:datatype intended for the description of classes. However, there should be no restrictions – each should be allowed to describe both classes and instances.

Conclusion

If we want a truly simple and flexible RDF model, the denotation of literals should be contextual. A literal serves the role of the value of some other node in a graph. This node carries the information about a literal – its datatype as well as the language.

In that sense, two new properties are proposed: rdf:datatype and rdf:language, which are special in that they don’t describe a resource directly, but rather its value (or the value of its instance, in the case of a class) expressed using the rdf:value property. The meaning of the literal value depends on that node. This is possible because the universal identifier of a literal is its URI, not the value.

The proposed realization of literals, along with the removal of blank nodes, greatly simplifies the RDF model. Resources are described exclusively with RDF triples, using URI references and plain literals. "Special" literal nodes are discarded from the model.

The new RDF model has a clearly defined node that always has a name identified using the same mechanism (URI), i.e. all names belong to a single set. There are only two kinds of nodes: one representing a resource and one representing a literal value. The fact that all nodes now have a URI as a name enables the next step – a (proper) realization of an RDF graph on the Web.

