In the previous posts, I analyzed the problems of the RDF model – the existence of blank nodes, various problems related to plain and typed literals and the absence of the universal concept of a node in an RDF graph. A node, the basic element of an RDF graph, is not clearly defined. There are conceptually completely different types of nodes, with no unique method of identification. This is the key problem that more or less directly causes other problems of the RDF model and technologies that are based on RDF. It is therefore necessary to start with this problem.
There are three types of nodes in an RDF graph – URI references, blank nodes and literals. The picture above shows a typical RDF graph that contains these three types of nodes. The graph describes a person identified by the URI http://chucknorris.com/data_/chuck, his name and whom he knows. He is based near a geographical point, which is described as well. The rdf, foaf and geo ontologies are used.
URI references and blank nodes are usually shown as circles or ellipses, while literals are depicted as rectangles. An ellipse that represents a resource contains the name (URI) of the resource (except for a blank node, whose ellipse is empty), while a rectangle representing a literal contains the literal value.
In order to define the universal concept of a node, we have to analyze features and aspects that are common to all nodes. There are two main ways to approach a node – it can be viewed as a data structure and as a symbol.
A node as a data structure
When viewed as a data structure, two main aspects of a node can be singled out – its name and the data that it may hold. The example graph shows that each node is determined by either the first or the second aspect. URI references are determined by their name, while literals are determined by their literal value, i.e. some data. One can ask a question: Do nodes represented by a URI hold some data, and what is the name of a literal?
URI references may represent information or non-information resources. Non-information resources are determined by their name (URI), which distinctly separates them from other resources, and they are not literal values, but represent specific things, concepts or ideas. Therefore, URI references representing non-information resources don’t hold any data.
Information resources contain information; however, this information refers to their representation, which is a concept distinct from the resource. In other words, if the representation were presented in an RDF graph, it would become a literal rather than a URI reference. One can say that information resources contain data, while literals are data themselves. Therefore, URI references that represent information resources also lack data.
On the other hand, what is the name of literals and blank nodes? Literals are identified by the values they represent, so it can be said that their name is equal to their value. Blank nodes “indicate the existence of a thing, without using, or saying anything about, the name of that thing“. Blank nodes can have a local name, but it is not part of the abstract syntax, in which a blank node “has no intrinsic name”. Therefore, blank nodes by definition have no name.
The above analysis can be represented in a simple table:
We can conclude that all nodes, no matter how different from each other, are determined by these two fundamental aspects: a name and data. In other words, one can speak about the universal concept of a node, a superclass from which the subclasses (URI references, blank nodes and literals) inherit, each based on a different manifestation of these two aspects.
URI references hold no data, and blank nodes in addition have no name. In order to describe these situations we can use the NULL value, which indicates that there is no value and is different from zero or an empty string (“”).
The table shows the different values of the name and data aspects for the different types of nodes. It should be mentioned that a typed literal is a string combined with a datatype URI. In the table, a plain literal is shown for simplicity’s sake.
The universal concept of a node can be realized as an unordered set of name/value pairs, namely two pairs that both can have a NULL as a value. This data structure is referred to as an object, record, struct, hash table, associative array and others, depending on the context.
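As a minimal sketch of such a record, assuming Python and a hypothetical `Node` class (neither is part of RDF itself), the three node types differ only in which of the two fields is NULL, written as `None` here. The person URI comes from the example graph; the literal value "Chuck Norris" is an assumed placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Universal node: two name/value pairs, either of which may be NULL."""
    name: Optional[str]  # None plays the role of NULL, distinct from ""
    data: Optional[str]


def uri_reference(uri: str) -> Node:
    # Determined by its name; holds no data.
    return Node(name=uri, data=None)


def blank_node() -> Node:
    # Has neither a name nor data.
    return Node(name=None, data=None)


def plain_literal(value: str) -> Node:
    # Identified by its value, so name and data coincide.
    return Node(name=value, data=value)


chuck = uri_reference("http://chucknorris.com/data_/chuck")
anonymous = blank_node()
label = plain_literal("Chuck Norris")  # assumed literal value
```

The sketch also makes the identification problem visible: the `name` field holds a URI in one case, an arbitrary string in another, and nothing at all in the third.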
Using this new concept of a node, the previous RDF graph can be represented as follows:
This graph is a good example that shows the mess around the various methods of identifying the nodes. URIs and strings are used as identifiers for different nodes, and a node can also be blank.
Key issues regarding the definition of the universal node of an RDF graph are whether a node can be unnamed, and whether there may be several ways of identifying nodes. In previous posts numerous problems caused by the existence of blank nodes have been discussed. In a context where the focus is on data, the ability to easily reference a node is expected and logical. It is therefore necessary that all nodes have a name.
However, the existence of the name itself is not enough. A simple model requires a single way of creating the names, i.e. the IDs, for all nodes. One of the assumptions behind the original Web was the existence of a single mechanism for assigning IDs at a global level, i.e. URIs to all resources. The URI has a key role in the realization of RDF and the Web of Data (Linked Data), so a solution which allows nodes that are not identified by a URI can be rightfully questioned.
A node as a symbol
A node is clearly used for representing all kinds of things – real-world objects, ideas, anything you can imagine. So, a logical assumption is that a node is some kind of symbol. Let’s see what Wikipedia says about a symbol:
A symbol is something which represents an idea, a physical entity or a process but is distinct from it.
This definition is very close to the idea of a URI reference, which may represent practically anything. It is also clearly distinct from the thing it represents. A URI reference representing Chuck Norris and Chuck Norris himself are not the same thing. Therefore, a URI reference can be referred to as a symbol. The same can be said of blank nodes, which basically have the same properties as URI references, with the difference that they have no name (which does not seem to be required by the symbol definition).
On the other hand, a literal is defined as a “string combined with an optional language tag” or “with a datatype URI“. “Plain literals are considered to denote themselves, so have a fixed meaning.”
If a literal refers to itself, it is not distinct from the entity it represents. A literal doesn’t represent data, it is data itself. Thus, the literal does not meet the fundamental criteria to be a symbol, meaning that an RDF graph consists of a mixture of symbols with some elements that are not symbols. In other words, the structure of an RDF graph as an abstract representation is not clearly separated from what the graph with nodes represents.
To understand why this is a problem, let’s look at the “role of context in symbolism“, where a rather brief but clear description is given, with an example:
The context of a symbol may change its meaning. Similar five–pointed stars might signify a law enforcement officer or a member of the armed services, depending upon the uniform.
Therefore, one of a symbol’s properties is that its meaning depends on context. The meaning of a URI reference is determined by its relations with other nodes, i.e. the triples which describe it. Connect it to different nodes and you’ll change its meaning.
On the contrary, a literal “has a fixed meaning“. This is interesting because data, by definition, “on its own carries no meaning“. In an RDF graph, the property used in a literal triple does not affect the meaning of the literal. Even if the property’s range is defined, information about the meaning of the literal is contained only in the literal itself and is immutable.
As I previously stated, the problem is that a literal doesn’t represent a literal value, it is that value itself. Another problem is that this value is used as a universal identifier. Both things are against the nature of a symbol. They also make a literal a completely different kind of node from a URI reference. Can one simple model stand so much variety?
The RDF model can be made much simpler. It can have all nodes conceptually equal and identified using the same mechanism.
A graph is always an abstract representation, containing the nodes that always represent, i.e. symbolize things. A literal, therefore, as the node of an RDF graph, has to be a symbol, distinct from what it represents. Having said that, one must distinguish between a literal node and a literal value that the node represents, the same way a URI reference is distinguished from a resource it refers to.
Secondly, the use of a literal value as an identifier is clearly a bad idea. Introducing another way of identification in a context in which there is already a powerful identifier, the URI, unnecessarily adds to the complexity of the RDF model. Finally, RDF is realized on the Web, where a URI is the natural identifier as well.
What is the meaning of a literal and how to identify it correctly? In an earlier blog post, I compared the RDF model to the object-oriented model, making an analogy between objects and URI references. A literal in the OO context is “identified” as the value of an object’s property. If we take this analogy, a literal node should have a special role in an RDF graph – one in which it acts as a value of another node.
Therefore, there are only two types of nodes. A newly defined literal is also identified by a URI, causing the term “URI reference” to become problematic. However, for simplicity, I will keep on using the old terminology, while a literal can be understood as a URI reference that holds data.
The definition of a node
On the basis of the above analysis we can single out several requirements that the RDF model must meet in order to achieve maximum simplicity and consistency. Besides the things we already know – that a node is an element of a graph connected to other nodes via typed links, and that it can represent a resource or a literal value, we can add a few more:
- A node has two aspects – a name and data
- A node’s name must always be a URI
- A node is a symbol, meaning it is always distinct from what it represents
- A node that holds data is a special kind of node acting as the value of another node
Thanks to these requirements and constraints, it seems that we have enough material to try to finally define a node. A node of an RDF graph, therefore, can be defined as a symbol identified by a URI that represents a resource or a literal value, connected by typed links with other nodes, forming a directed, labeled graph.
Structurally, a node is determined by a name and data and consists of two key-value pairs. A name is always a URI, whose primary role is to identify the node in a global context. A URI, however, has other important functions that will be further analyzed in future posts. If it represents a literal value, a node has a special role in the graph – it acts as the value of another node.
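The definition can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` and `Link` classes are hypothetical, the scheme check is a rough stand-in for real URI validation, and the URI given to the literal node (`http://example.org/nodes/name-1`) and its value are invented placeholders, since assigning such URIs is not covered here. The property URI is the FOAF `name` property.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Node:
    name: str                   # always a URI under the new definition
    data: Optional[str] = None  # set only when the node represents a literal value

    def __post_init__(self):
        # Rough check (an assumption of this sketch): a URI must have a scheme.
        if not urlparse(self.name).scheme:
            raise ValueError(f"node name must be a URI: {self.name!r}")

    @property
    def is_literal(self) -> bool:
        return self.data is not None


@dataclass(frozen=True)
class Link:
    """A typed, directed link between two nodes."""
    subject: Node
    predicate: str  # URI of the property
    obj: Node


chuck = Node("http://chucknorris.com/data_/chuck")
# Hypothetical URI and value for the literal node.
name_node = Node("http://example.org/nodes/name-1", data="Chuck Norris")
link = Link(chuck, "http://xmlns.com/foaf/0.1/name", name_node)
```

Here the literal node is a symbol like any other: it has its own name, and the value it stands for lives in `data` rather than serving as its identifier.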
Now we need to materialize this theory in practice. The previous graph example, according to the new definition will look like this:
First, note that the notation is simplified: instead of writing “DATA: null”, we simply omit the rectangles. Also, there is no need to repeat “NAME:” and “DATA:” all the time, because the names are already represented by the ellipses and the data is represented by the rectangles.
Three new blank nodes are added – one for each literal. “Blank” nodes and literals have a question mark instead of a URI.
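The step from the old graph to the new one can be sketched as a small transformation. This is a hedged illustration, not a definitive algorithm: the `Literal` and `Node` classes and the `lift_literals` function are hypothetical, the literal value and the `http://example.org/someone` URI are invented, and a name of `None` stands for the question mark in the figure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Literal:
    """A literal as it appears in the original graph."""
    value: str


@dataclass(eq=False)  # compare by identity: two unnamed nodes are still different nodes
class Node:
    name: Optional[str]         # None corresponds to "?" (no URI assigned yet)
    data: Optional[str] = None


OldTriple = Tuple[str, str, Union[str, Literal]]
NewTriple = Tuple[Node, str, Node]


def lift_literals(triples: List[OldTriple]) -> List[NewTriple]:
    """Replace every literal object with a fresh, still unnamed node holding its value."""
    by_uri: Dict[str, Node] = {}

    def node_for(uri: str) -> Node:
        # Reuse one Node per URI so links keep pointing at the same node.
        if uri not in by_uri:
            by_uri[uri] = Node(name=uri)
        return by_uri[uri]

    result: List[NewTriple] = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            target = Node(name=None, data=obj.value)  # one new node per literal
        else:
            target = node_for(obj)
        result.append((node_for(subject), predicate, target))
    return result


old_graph: List[OldTriple] = [
    ("http://chucknorris.com/data_/chuck", "http://xmlns.com/foaf/0.1/name",
     Literal("Chuck Norris")),                      # assumed literal value
    ("http://chucknorris.com/data_/chuck", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/someone"),                 # hypothetical URI
]
new_graph = lift_literals(old_graph)
```

After the transformation, the unnamed nodes are exactly the ones that still need a URI.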
Let’s first focus on the identification and the challenge of assigning URIs to all nodes. For now, let’s concentrate on blank nodes. How to assign a URI to blank nodes? More on that in the next post.