So far I have been dealing with the complexities of the RDF model: the mess surrounding the concept of a node, the many different types of nodes, and the different methods of identifying nodes that have a name. However, there are aspects which are not just overcomplicated, but plain ugly as well. Here I am referring to the last bits of the RDF model that need some serious attention: typed literals and literals with language tags.
Like blank nodes, this is a subject that has caused many debates over the years. The people who made the model somehow always find a justification for every ugly bit of it. However, the fact that the RDF model is „logically consistent“ (or whatever the right word is) doesn’t mean that it’s the only solution, and certainly doesn’t mean that it is the best one.
In the post on redefining the node of an RDF graph, I described nodes from two aspects: a node as a data structure and a node as a symbol. As a data structure, a node is basically an object with two attributes: a name and data. Therefore, if the data attribute holds a datatype or a language tag in addition to the value, it no longer contains atomic data. It becomes an object itself, with two attributes. And that sucks.
So why does the datatype have to be explicitly added to a literal every time? Couldn’t the datatype be declared, in a similar way to how an instance is declared? If the range of a property can be described, and then the type of an instance can be inferred, why doesn’t that work for literals, too?
Often, one would like to write:
ex:prop rdfs:range xsd:decimal .
ex:sub ex:prop "42" .
and infer that “42” is a decimal number. However, what one gets from these two triples is that “42” is a sequence of 2 characters AND a decimal, which is inconsistent.
Antoine then writes that overcoming this in the RDF model is hard, because „literals are universal identifiers, just like URIs“. So „42“ identifies the same thing in all situations. He goes on with another example, which would not be possible in practice:
ex:prop rdfs:range xsd:decimal .
ex:sub ex:prop "42" .
ex:password rdfs:range xsd:string .
ex:sub2 ex:password "42" .
Here, certainly some people would expect the first “42” to be denoting the number, while the second is just two characters. But this implicitly assumes that the denotation of literals is contextual: it would depend on which predicate is used in the triple. While it would be possible, in principle, to define a language where this makes sense, it does not fit at all with the RDF data model.
[...] in general, a property range is a class, not a datatype. What happens when the range is just a class and has no associated L2V mapping? Also, a property can have many ranges. What happens if two of them are datatype classes? Which one gets control over the interpretation of the literal string? [...]
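The "which range wins?" question from the quote can be made concrete with a small sketch. This is only an illustration in plain Python, with triples held as tuples of strings. The helper name `candidate_datatypes` and the second range on ex:prop are my assumptions, added to provoke the ambiguity; they are not part of the quoted examples.

```python
# Triples are plain (subject, predicate, object) tuples of strings.
# The second rdfs:range on ex:prop is hypothetical, added to show the conflict.
TRIPLES = [
    ("ex:prop", "rdfs:range", "xsd:decimal"),
    ("ex:prop", "rdfs:range", "xsd:string"),
    ("ex:password", "rdfs:range", "xsd:string"),
    ("ex:sub", "ex:prop", "42"),
    ("ex:sub2", "ex:password", "42"),
]


def candidate_datatypes(triples, predicate):
    """Return every xsd: range declared for the predicate, sorted by name."""
    return sorted(
        o for s, p, o in triples
        if s == predicate and p == "rdfs:range" and o.startswith("xsd:")
    )


# One candidate: a contextual reading of "42" would be unambiguous here.
print(candidate_datatypes(TRIPLES, "ex:password"))
# Two candidates: nothing in the graph says which one controls the literal.
print(candidate_datatypes(TRIPLES, "ex:prop"))
```

With a single declared range the predicate could, in principle, decide how the string is read; with two, some extra rule would be needed, which is exactly the objection raised in the quote.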
A new literal
Before I start discussing potential solutions for the above problems, let’s recall how literals are different in the new RDF model I’ve been proposing in this blog, compared to the classical RDF model:
- Literal nodes are identified by URIs, not by their values
- Literal nodes are not literal values themselves, but the symbols representing the values
- A literal node is always the object of the rdf:value predicate in a triple
In the new model, literals are no longer treated as universal identifiers. They are, like every other node in the RDF model, identified by a URI. They represent literal values, whose meaning depends on the context. This context is not defined by the literal, but by the node that has that literal as its value. This node is an instance (object) of one or more classes. It has one main value, represented by a literal and realized by the rdf:value property. This kind of node resembles primitive data types from programming languages, such as string, integer, float, boolean…
The difference is that in programming, you don’t have to define the meaning of every variable. In object-oriented programming, for example, the „primitive“ variable named weight is often implemented as an instance of the Float class. In RDF, however, the weight has a meaning that is not limited to the datatype of its value.
Therefore, the type of an object and the datatype of that object’s value are distinct concepts. For instance, a product weight can be an instance of the class ex:Weight, and its value can be of the float datatype. Check out the following example:
ex:Weight rdf:type rdfs:Class ;
    rdf:datatype xsd:float .
ex:hasWeight rdfs:range ex:Weight .
<http://example.com/data_/product/item10245> ex:hasWeight <http://example.com/data_/product/item10245/ex_hasWeight> .
<http://example.com/data_/product/item10245/ex_hasWeight> rdf:value "2.4" ;
    ex:inUnit ex:kg .
The new property rdf:datatype used here has the meaning „the value of the class S is of datatype O“. Therefore, the value of http://example.com/data_/product/item10245/ex_hasWeight is of datatype xsd:float.
The literal is represented by its value 2.4. There is no need to explicitly write its URI in the syntax. If needed, it can be easily obtained by joining the URI of its parent node and the CURIE segment rdf_value. The result is http://example.com/data_/product/item10245/ex_hasWeight/rdf_value.
2.4 is not a universal identifier any more, so 2.4 doesn’t always have to identify the same thing. Its meaning thus becomes relative to the context, not absolute in the whole RDF graph. For instance, it can sometimes be a float number, and sometimes a string. It’s just a value whose meaning depends on the type of its parent node. In this case, the literal acts as the value of an instance of the class ex:Weight, whose rdf:datatype is described as xsd:float, so we can say that 2.4 is a float number.
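To show how a data consumer might arrive at „2.4 is a float“, here is a rough sketch of the lookup in plain Python. It follows the example above: the node gets its class from the rdfs:range of the incoming property, and the class supplies rdf:datatype. The function names (`objects`, `classes_of`, `value_datatypes`) are made up for this illustration, and the inference is deliberately naive.

```python
PRODUCT = "http://example.com/data_/product/item10245"
WEIGHT = PRODUCT + "/ex_hasWeight"

# The triples from the example, as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:Weight", "rdf:type", "rdfs:Class"),
    ("ex:Weight", "rdf:datatype", "xsd:float"),
    ("ex:hasWeight", "rdfs:range", "ex:Weight"),
    (PRODUCT, "ex:hasWeight", WEIGHT),
    (WEIGHT, "rdf:value", "2.4"),
    (WEIGHT, "ex:inUnit", "ex:kg"),
]


def objects(triples, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def classes_of(triples, node):
    """Explicit rdf:type values plus the rdfs:range of every incoming property."""
    found = set(objects(triples, node, "rdf:type"))
    for s, p, o in triples:
        if o == node:
            found.update(objects(triples, p, "rdfs:range"))
    return found


def value_datatypes(triples, node):
    """Datatypes that the node's classes declare for the node's value."""
    result = set()
    for cls in classes_of(triples, node):
        result.update(objects(triples, cls, "rdf:datatype"))
    return result


print(classes_of(TRIPLES, WEIGHT))
print(value_datatypes(TRIPLES, WEIGHT))
```

The literal itself carries nothing but the characters; everything else is read off its parent node.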
This is a good example of why it is important to make a clear distinction between the concepts of a node, its name and the data it holds. In the case of the literal 2.4, one can say that this literal is a node that has the name http://example.com/data_/product/item10245/ex_hasWeight/rdf_value and holds the value 2.4. In short, literal node != literal value, and literal value != literal name.
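Deriving the name of a literal node can be sketched in a couple of lines of Python. I am assuming here, based only on the URIs shown in this post, that a CURIE becomes a URI segment by writing its colon as an underscore; the helper name `child_uri` is hypothetical.

```python
def child_uri(parent_uri, curie):
    """Append a CURIE to a parent URI as one path segment (':' written as '_')."""
    return parent_uri.rstrip("/") + "/" + curie.replace(":", "_")


product = "http://example.com/data_/product/item10245"
weight = child_uri(product, "ex:hasWeight")   # the node that holds the value
literal = child_uri(weight, "rdf:value")      # the name of the literal node

print(weight)
print(literal)
```

The literal node gets its name from where it sits in the graph, and the value 2.4 never appears in that name.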
Furthermore, a property range is always a class, and a datatype is the range of the new, special property rdf:datatype. This way, classes and datatypes with an associated L2V mapping are clearly separated.
In reality, one cannot expect that a class will have just one rdf:datatype. In addition, a node can be an instance of more than one class, each having a different datatype defined. As a solution, one can describe the instance itself with the preferred rdf:datatype, as in the following example:
<http://example.com/data_/product/item10245/ex_hasWeight> rdf:value "2.4" ;
    rdf:datatype xsd:float ;
    ex:inUnit ex:kg .
The datatype information would be at the URI http://example.com/data_/product/item10245/ex_hasWeight/rdf_datatype, providing a much more elegant way of expressing datatypes than using typed literals.
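One possible reading of „describe the instance itself with the preferred rdf:datatype“ is a simple precedence rule: a datatype stated on the node wins, and class-level datatypes are used only when they agree. The sketch below is my assumption about how a consumer could implement that. The second datatype on ex:Weight and the function name `resolve_datatype` are invented for the illustration.

```python
NODE = "http://example.com/data_/product/item10245/ex_hasWeight"

# Hypothetical situation: ex:Weight allows two datatypes, so the class
# alone cannot settle how the value should be read.
CLASS_TRIPLES = [
    ("ex:Weight", "rdf:datatype", "xsd:float"),
    ("ex:Weight", "rdf:datatype", "xsd:decimal"),
]
INSTANCE_TRIPLES = [
    (NODE, "rdf:value", "2.4"),
    (NODE, "rdf:datatype", "xsd:float"),
    (NODE, "ex:inUnit", "ex:kg"),
]


def resolve_datatype(triples, node, node_classes):
    """Prefer a datatype stated on the node; else use an unambiguous class one."""
    own = [o for s, p, o in triples if s == node and p == "rdf:datatype"]
    if own:
        return own[0]
    inherited = {
        o for s, p, o in triples
        if s in node_classes and p == "rdf:datatype"
    }
    # None signals that the classes disagree or say nothing at all.
    return inherited.pop() if len(inherited) == 1 else None


with_hint = CLASS_TRIPLES + INSTANCE_TRIPLES
without_hint = CLASS_TRIPLES + INSTANCE_TRIPLES[:1]

print(resolve_datatype(with_hint, NODE, {"ex:Weight"}))
print(resolve_datatype(without_hint, NODE, {"ex:Weight"}))
```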
But it shouldn’t be mandatory. Often, the context can help. In this case, there is the property [...], whose rdfs:domain is, say, [...] xsd:float. In addition, when in doubt, a data consumer can check the superclass of ex:Weight, and find out that it is, say, [...], where ex3 is an authoritative source.
Chances are that, ultimately, a limited number of superclasses defining datatypes will emerge, and that classes whose instances can have values will, in order to be trusted, have to be declared as subclasses of these superclasses.
Finally, one last note on the datatypes in RDF. My impression is that a few datatypes are used much more frequently than the others. These most frequently used datatypes should really be part of the core RDF (or RDFS) ontology; for instance, [...] rdf:Date would be just enough. Actually, it would make sense to merge RDF and RDFS with a few elements from OWL (so-called RDFS++) into one core ontology. Things are sometimes just ridiculously unintuitive: take rdfs:Class for instance. But that’s another story.
Finally, we get to the literals with language tags – another special case of literals. Let’s look at the example from DBpedia:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrade is the capital and largest city of Serbia. "@en .
<http://dbpedia.org/resource/Belgrade> rdfs:comment "Belgrad ist die Hauptstadt der Republik Serbien."@de .
In this snippet, the language tags en and de, placed at the end of the literals using (another) special syntax @, denote that the literals are in English and German, respectively.
The solution for literals with language tags is based on the same principles as the solution for typed literals. Let’s focus on the English version. First, we need to distinguish the concept of a value from the resource that contains that value:
<http://dbpedia.org/resource/Belgrade> rdfs:comment <http://dbpedia.org/resource/Belgrade/rdfs_comment/en> .
<http://dbpedia.org/resource/Belgrade/rdfs_comment/en> rdf:value "Belgrade is the capital and largest city of Serbia." .
…which can be shortened to:
<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [ rdf:value "Belgrade is the capital and largest city of Serbia." . ]
...or shortened even further:
<http://dbpedia.org/resource/Belgrade> rdfs:comment:en "Belgrade is the capital and largest city of Serbia." .
rdfs:comment:en is an example of the extended CURIE described in an earlier post.
http://dbpedia.org/resource/Belgrade/rdfs_comment/en is a URI of a node whose value is the description in English. The segment at the end of the URI is required because there are several rdfs:comment values. The segment en can be a convention that takes over the role of the previously used language tag (in the form of @en). This solution shifts the language information into the URI and doesn't require special treatment in the RDF model, as was the case with literals with language tags.
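A small Python sketch of this convention: the language tag becomes the last segment of the node URI and can be read back from it. Both helper names are hypothetical, and writing the CURIE colon as an underscore is inferred from the URIs in this post rather than from a formal rule.

```python
def language_node_uri(subject_uri, curie, lang):
    """Build the URI of the node that holds one language version of a value."""
    return "/".join([subject_uri.rstrip("/"), curie.replace(":", "_"), lang])


def language_of(node_uri):
    """Read the language back from the last URI segment (a convention only)."""
    return node_uri.rsplit("/", 1)[-1]


belgrade = "http://dbpedia.org/resource/Belgrade"
comment_en = language_node_uri(belgrade, "rdfs:comment", "en")
comment_de = language_node_uri(belgrade, "rdfs:comment", "de")

print(comment_en)
print(language_of(comment_de))
```

Since this is only a naming convention, a consumer that needs certainty should not rely on the segment alone.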
However, in order to explicitly describe the language, an additional triple is needed:
<http://dbpedia.org/resource/Belgrade> rdfs:comment:en [
    rdf:value "Belgrade is the capital and largest city of Serbia." ;
    rdf:language <http://dbpedia.org/resource/English_language> .
]
Here I used the new rdf:language property to describe the language of the comment. The subject of the RDF triple is the comment identified by http://dbpedia.org/resource/Belgrade/rdfs_comment/en, and the object is the DBpedia resource representing the English language.
It seems reasonable that rdf:language will be used primarily to describe instances, whereas rdf:datatype is intended for the description of classes. However, there should be no restrictions: each should be allowed to describe both classes and instances.
If we want a truly simple and flexible RDF model, the denotation of literals should be contextual. A literal serves the role of the value of some other node in a graph. This node carries the information about a literal – its datatype as well as the language.
In that sense, two new properties are proposed: rdf:datatype and rdf:language, which are special in that they don’t describe a resource directly, but rather its value (or the value of its instance, in the case of a class) expressed using the rdf:value property. The meaning of the literal value depends on that node. This is possible because the universal identifier of a literal is its URI, not the value.
The proposed realization of literals, along with the removal of blank nodes, greatly simplifies the RDF model. Resources are described exclusively with RDF triples, using URI references and plain literals. "Special" literal nodes are discarded from the model.
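To illustrate what discarding the "special" literal nodes could look like in practice, here is a rough Python sketch that rewrites one classical triple with a typed or language-tagged literal into triples that use only URIs and plain literals. The function `convert` and its parameters are my own illustration, not a defined algorithm. The object of rdf:language has to be passed in explicitly, because I am not assuming any mapping from language tags to resources.

```python
def convert(subject, predicate, lexical, datatype=None, lang=None, language_uri=None):
    """Rewrite a classical literal triple into URI nodes plus a plain literal."""
    node = subject.rstrip("/") + "/" + predicate.replace(":", "_")
    if lang:
        node += "/" + lang  # the language tag moves into the URI
    triples = [
        (subject, predicate, node),
        (node, "rdf:value", lexical),
    ]
    if datatype:
        triples.append((node, "rdf:datatype", datatype))
    if language_uri:
        triples.append((node, "rdf:language", language_uri))
    return triples


# Classical form: item10245 ex:hasWeight "2.4"^^xsd:float
typed = convert(
    "http://example.com/data_/product/item10245",
    "ex:hasWeight", "2.4", datatype="xsd:float",
)

# Classical form: Belgrade rdfs:comment "..."@de
tagged = convert(
    "http://dbpedia.org/resource/Belgrade",
    "rdfs:comment", "Belgrad ist die Hauptstadt der Republik Serbien.", lang="de",
)

for triple in typed + tagged:
    print(triple)
```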
The new RDF model has a clearly defined node that always has a name identified using the same mechanism (URI), i.e. all names belong to a single set. There are only two kinds of nodes - one representing a resource and one representing a literal value. The fact that all nodes now have a URI as name enables the next step – a (proper) realization of an RDF graph on the Web.