An RDF syntax (notation) is a concrete syntax for writing (serializing) RDF triples. There are many different notations, some of which are based on existing formats (XML, (X)HTML, JSON), while others use special formats (Turtle).
The problems of the RDF model, discussed in the previous posts, are inherited by RDF notations. Various special cases require special support in a syntax, which makes it more complex. In addition, RDF notations have a number of problems of their own, which can be divided into two categories:
- various problems in RDF notations
- the lack of a single dominant notation
Various problems in RDF notations
The RDF/XML syntax was developed and standardized by the W3C and is widely used for publishing Linked Data on the Web. It has existed since 1999 and was revised in 2004. XML, the dominant format for describing and exchanging structured data at the time, was the logical choice for an RDF syntax.
However, RDF/XML evolved into a complex and unintuitive notation. Because of its high expressivity, the same RDF graph can be serialized in many different ways, which makes standard XML tools very difficult to use. It is believed that RDF/XML has significantly slowed the adoption of RDF and the Semantic Web idea. RDF/XML was never clearly separated from RDF as a model, so the simplicity of the RDF model didn’t come to the fore. People who wanted to learn about RDF conflated the two concepts and were discouraged early on by the complexity of the RDF/XML syntax.
Given all this excellent simplicity [of the RDF model], you have to kind of boggle when you look at one of the first examples taken from a recent RDF Working Draft.
    <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar">
      <ex:editor>
        <rdf:Description>
          <ex:homePage>
            <rdf:Description rdf:about="http://purl.org/net/dajobe/">
            </rdf:Description>
          </ex:homePage>
        </rdf:Description>
      </ex:editor>
    </rdf:Description>
Where, pray tell, are the resources, properties, and values? What benefit could I expect to derive from viewing this particular source?
[...] At this point, the RDF evangelists pipe up and say “Well, Ordinary People ™ don’t have to look at the source, there will be tools to sort all that out.” Sorry, I just don’t believe that. If, in 1994, you’d needed DreamWeaver or equivalent to write for the Web, there wouldn’t be a Web today.
The example code in his post still exists today in the official RDF/XML syntax specification. It shows the hierarchical nature of XML and suggests that, beyond the problematic decisions that led to the unnecessary complexity of the notation, there is a deeper problem: the mismatch between the model and the syntax. XML is designed for a hierarchical (tree) data model, while RDF is a set of triples, i.e. a graph.
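To make the mismatch concrete, here is a minimal Python sketch that writes the example as the two triples it encodes. The `ex:` names are kept as they appear in the example, `_:b0` is an arbitrary label I chose for the unnamed editor node, and the third triple in `shared` is purely hypothetical, added only to show a node that is linked from two places.

```python
from collections import Counter

# The nested XML example, written as the two triples it encodes.
# "_:b0" is an arbitrary label for the unnamed (blank) editor node.
DOC = "http://www.w3.org/TR/rdf-syntax-grammar"
HOME = "http://purl.org/net/dajobe/"
triples = {
    (DOC, "ex:editor", "_:b0"),
    ("_:b0", "ex:homePage", HOME),
}

def nodes_with_several_parents(graph):
    """Return the nodes that appear as the object of more than one triple."""
    counts = Counter(obj for _, _, obj in graph)
    return sorted(node for node, count in counts.items() if count > 1)

# Hypothetical extra statement: a second document links to the same home page.
shared = triples | {("http://example.org/other-doc", "ex:seeAlso", HOME)}

print(nodes_with_several_parents(triples))  # []
print(nodes_with_several_parents(shared))   # ['http://purl.org/net/dajobe/']
```

In the first graph every node has at most one parent, so it can be nested as in the XML above. In the second, a serializer has to pick one parent under which to nest the home page and refer to it by URI from the other, and nothing in the model says which choice is the right one.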
Firstly, there is a mis-match in the data models: serialization involves turning a graph into a tree. There are many different ways to achieve that so, without applying some external constraints, the output can be highly variable. The problem is that those constraints can be highly specific, so are difficult to generalize. This results in a high degree of syntax variability of RDF/XML in the wild, and that undermines the ability to use RDF/XML with standard XML tools like XPath, XSLT, etc. They (unsurprisingly) operate only on the surface XML syntax not the “real” data model.
Secondly, people dislike RDF/XML because of the mis-match in (loosely speaking) the native data types. XML is largely about elements and attributes whereas RDF has resources, properties, literals, blank nodes, lists, sequences, etc. And of course there are those ever present URIs. This leads to additional syntax short-cuts and hijacking of features like XML Namespaces to simplify the output, whilst simultaneously causing even more variability in the possible serializations.
Thirdly, when it comes to parsing, RDF/XML just isn’t a very efficient serialization. It’s typically more verbose and can involve much more of a memory overhead when parsing than some of the other syntaxes.
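The first point, that XML tools see only the surface syntax, can be illustrated with Python's standard `xml.etree.ElementTree` module. The sketch below spells the same single triple in two ways that RDF/XML allows, once as a property element and once as a property attribute, and runs one fixed path query against both. The `ex:` namespace and the `ex:title` property are made up for the illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/terms/",  # hypothetical vocabulary
}

HEADER = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:ex="http://example.org/terms/">'
)

# Spelling 1: the property is a child element.
AS_ELEMENT = HEADER + """
  <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar">
    <ex:title>RDF/XML Syntax Specification</ex:title>
  </rdf:Description>
</rdf:RDF>"""

# Spelling 2: the same property is an XML attribute.
AS_ATTRIBUTE = HEADER + """
  <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar"
                   ex:title="RDF/XML Syntax Specification"/>
</rdf:RDF>"""

def titles_by_path(document):
    """Run one fixed path query, the way an XPath- or XSLT-based tool would."""
    root = ET.fromstring(document)
    return [el.text for el in root.findall("rdf:Description/ex:title", NS)]

print(titles_by_path(AS_ELEMENT))    # ['RDF/XML Syntax Specification']
print(titles_by_path(AS_ATTRIBUTE))  # []
```

Both documents describe the same graph, yet the query finds the title in only one of them. A stylesheet written against one spelling silently misses data published in the other.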
Leigh believes that a JSON-based notation will not solve the problem either. According to him, a JSON-based RDF notation will suffer from the same problems as RDF/XML, because JSON is essentially also a tree, has different data types, and so on. As a result, JSON will be used in an unnatural way and will lose the very qualities for which it was chosen in the first place.
You can’t shoe horn RDF in to JSON, no matter how hard you try – well, you can, but you loose all the benefits of JSON in the first place, because the data is RDF, triples and not objects, rdf nodes and not simple values.
Another problem of the JSON syntax is the lack of native support for hyperlinks, so custom solutions that rely on application logic have to be implemented. In the context of the Web of Data, this is a serious drawback.
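A small Python sketch of what this means in practice: after parsing, a URL is indistinguishable from any other string, so the application has to supply its own rule for what counts as a link. The document, its keys, and the `LINK_KEYS` convention are all invented for the illustration.

```python
import json

# Hypothetical JSON description; the keys and values are illustrative only.
document = json.loads("""
{
  "name": "Example Person",
  "homepage": "http://purl.org/net/dajobe/",
  "note": "http://purl.org/net/dajobe/ is mentioned here as plain text"
}
""")

# To the JSON parser every value is just a string; nothing marks a link.
print({key: type(value).__name__ for key, value in document.items()})
# {'name': 'str', 'homepage': 'str', 'note': 'str'}

# The application has to bring its own convention, here a hard-coded key list.
LINK_KEYS = {"homepage"}

def links(data):
    """Return the values that this application has decided to treat as links."""
    return {key: value for key, value in data.items() if key in LINK_KEYS}

print(links(document))  # {'homepage': 'http://purl.org/net/dajobe/'}
```

Any other consumer of the same document has to learn or guess the same convention, which is exactly the kind of out-of-band agreement that links on the Web are supposed to make unnecessary.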
Possibly the most popular RDF notations are Turtle and RDFa. Turtle is a simple and readable alternative to the RDF/XML syntax. Because it is a special-purpose format, it requires a separate parser. Also, until recently, it didn’t have a registered media type. Although very popular in the Semantic Web community, it is not widespread in the wild.
RDFa is an RDF notation that enables RDF triples to be embedded in (X)HTML documents using a set of attributes on (X)HTML elements. It doesn’t require separate documents, but instead allows people to add structure to existing content. It is by far the most successful RDF notation and has gained considerable popularity in recent years. Facebook’s Open Graph protocol is based on RDFa, and search engines support it. It is also supported by the popular CMS and blog platforms WordPress and Drupal 7, and used by many websites (BestBuy, overstock.com, examiner.com …).
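As a rough illustration of statements carried in HTML attributes, the following Python sketch pulls `property`/`content` pairs out of a page using only the standard `html.parser` module. The page fragment is invented, with property names in the style of Facebook's Open Graph markup, and the reader is deliberately a toy: it is not a conforming RDFa processor.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the "og:" property names follow the Open Graph style.
PAGE = """
<html>
  <head>
    <meta property="og:title" content="Example Movie" />
    <meta property="og:type" content="video.movie" />
  </head>
  <body><h1>Example Movie</h1></body>
</html>
"""

class PropertyCollector(HTMLParser):
    """Collect (property, content) pairs from elements carrying both attributes.

    A toy reader, not a conforming RDFa processor: it ignores prefixes,
    subjects (about/resource), nesting, and element text.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "property" in attributes and "content" in attributes:
            self.pairs.append((attributes["property"], attributes["content"]))

collector = PropertyCollector()
collector.feed(PAGE)
print(collector.pairs)
# [('og:title', 'Example Movie'), ('og:type', 'video.movie')]
```

The subject of both statements is implicit (the page itself), which is part of what makes this style easy to add to existing pages and also why the markup cannot stand on its own without the page.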
RDFa offers flexibility in describing the existing content of a web page, but it depends on that content and doesn’t work as an independent notation, which makes it a poor solution for general use. Another downside of RDFa is that it mixes content with metadata, which increases file size and violates the principle of separation of concerns.
RDFa is seriously threatened by the recent launch of Schema.org, a joint project of the Google, Bing and Yahoo search engines. Schema.org provides a collection of shared vocabularies intended for webmasters to create structured data in their web pages, thus helping search engines to better understand their content. Instead of RDFa, they chose Microdata, a simpler and less expressive syntax that uses HTML5 attributes to embed metadata.
As explained on the Schema.org website, a pragmatic approach led to the rejection of RDFa because of its substantial complexity. In addition, they focus on Microdata alone, because “supporting multiple syntaxes makes documentation for webmasters more complex and introduces more overhead in terms of defining new formats”.
It can be concluded that RDF notations are currently going through a crisis. The failure of RDF/XML has led to a search for simpler and more practical solutions. The notations that emerged over time (primarily Turtle and RDFa, and more recently the JSON-based ones and Microdata) all have their advantages and disadvantages. However, it seems that most of them work well in a particular context, but not as a universal notation.
RDFa had slowly begun to emerge as the dominant notation, but its future became questionable after the major search engines chose Microdata. One of the stated reasons for pushing only the Microdata syntax, the need for a single syntax, is particularly important and should be discussed in more detail.
The lack of a single dominant notation
A significant problem with RDF notations is the existence and use of multiple notations in practice. In the blog post Priorities for RDF, Jeni Tennison discusses two main reasons why the existence of multiple notations slows adoption:
First, publishers aren’t always generating data automatically; in a number of cases (which I think and hope will grow) RDF data is being generated just like CSV files are, as static documents which are simply published in the same way as other static documents. In these cases, publishers either have to do the research and make a decision about which format to use, or produce the data in multiple formats. This is a particular challenge when people aren’t convinced they want to generate RDF anyway.
Second, toolsets have to handle producing or consuming multiple formats. That means more code, more testing and more maintenance on both the production and consumption sides of the equation, all of which raise the implementation burden.
Jeni concludes that diversity is natural in the early stages of a technology’s use. But as the technology matures, agreed foundations are needed that, even if imperfect, provide stability and solve the above problems.
Your data source should at least provide RDF descriptions as RDF/XML which is the only official syntax for RDF. As RDF/XML is not very human-readable, your data source could additionally provide Turtle descriptions when asked for MIME-type application/x-turtle. In situations where your think people might want to use your data together with XML technologies such as XSLT or XQuery, you might additionally also serve a TriX serialization, as TriX works better with these technologies than RDF/XML.
In other words, the message to a potential data publisher is: “RDF/XML sucks, but you have to provide it. If you want a syntax that doesn’t suck so bad, additionally provide Turtle. Finally, if you want to provide an XML-based syntax that actually allows for using XML tools, provide TriX”.
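What this advice asks of a publisher can be sketched as a small content-negotiation function in Python. `application/rdf+xml` is the registered media type for RDF/XML and `application/x-turtle` is the one named in the quote; `application/trix` for TriX is an assumption made for this sketch, as are the function names. Wildcards such as `*/*` are not handled.

```python
# Media types offered by this hypothetical data source.
# "application/trix" is an assumed media type for TriX, used for illustration.
SUPPORTED = {
    "application/rdf+xml": "RDF/XML",
    "application/x-turtle": "Turtle",
    "application/trix": "TriX",
}
DEFAULT = "application/rdf+xml"

def parse_accept(header):
    """Split an Accept header into (media type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type, quality = fields[0].lower(), 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type:
            entries.append((media_type, quality))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

def choose_syntax(accept_header):
    """Pick the serialization to send; fall back to RDF/XML if nothing matches."""
    for media_type, quality in parse_accept(accept_header):
        if quality > 0 and media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return SUPPORTED[DEFAULT]

print(choose_syntax("application/x-turtle, application/rdf+xml;q=0.5"))  # Turtle
print(choose_syntax("text/html"))                                        # RDF/XML
```

The negotiation itself is the cheap part. Every entry in `SUPPORTED` also stands for a serializer that has to be written, tested and kept in sync with the others.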
I cannot speak for others, but I know how I felt after reading this for the first time, at the point when I finally got a slight idea of the difference between information and non-information resources and the purpose of 303 redirects.
It should be mentioned that this tutorial was published in 2007 and has since been superseded by the book Linked Data: Evolving the Web into a Global Data Space, which provides a more up-to-date introduction to Linked Data. Nevertheless, these issues are still relevant: the book describes five different RDF syntaxes, and the number is growing.
Perhaps an analogy with HTML is a good way to illustrate the problem. HTML is a format that could certainly be more elegant and expressive. However, HTML has survived thanks to its other properties, primarily a careful balance between interoperability and evolvability. It was made with the ability to evolve, adapt and survive. It has proved to be a good enough and easy enough solution for most people, without which the great success of the Web would be difficult to imagine. If the problem of syntax (HTML) had been solved differently, the Web might have developed at a faster or slower pace. But what would the Web look like if everyone had to publish information in multiple formats, and invest time and effort to learn and choose from alternatives?
One can conclude that overly complex solutions, the mismatch between the model and the syntax, misunderstanding the nature of the Web, and wrong priorities that prefer expressiveness and elegance to simplicity and practicality, are some of the problems associated with the RDF notations.