Can JSON and RDF be friends?

Is it possible to turn RDF into an idiomatic tree-based format? The history of RDF notations, especially the experience with RDF/XML and JSON-LD, suggests the opposite. Ian Davis summed it up nicely:

The main problem I see with the “idiomatic JSON” use case is that although it’s much more usable by the average web author, it’s always going to butt up against various mismatches in model: graphs vs trees, URIs vs shortnames, literals/languages/datatypes vs strings, repeated properties vs simple values, blank nodes, lists/collections vs arrays/dictionaries.

The blunt truth is all of those things make RDF an unfriendly model to web authors and I think it will be very hard, or impossible, to develop an idiomatic JSON serialisation that web authors will care about.

In this post I’ll try to tackle this problem by answering the following questions:

  1. What are the challenges of converting a random JSON document to triples?
  2. What compromises need to be made on both (JSON and RDF) sides to make this possible?
  3. How complex would the JSON -> RDF (and vice versa) parser be?

1. What are the challenges of converting a random JSON document to triples?

I remember a meeting in which a partner mentioned that they had the data needed for the project. When asked whether the data was in RDF format, he replied with great relief that it was “just some simple key-value pairs”. In my mind, key-value pairs and triples are just two sides of the same coin, so I felt kind of sad that people perceived the two as entirely different things.

What is it that makes a bunch of key-value pairs appear so different from a set of triples? Add the subject to key-value pairs and you get triples. Or vice versa: group the triples around the common subject and you get the key-value pairs. Well, in practice, it’s a bit more complicated.

Today, JSON is the de facto standard for describing data. Despite its simplicity, JSON is flexible and expressive. Still, only a limited number of patterns are used in practice, which makes it possible to identify them and understand how they would map to corresponding triples.

As a simple experiment, I am going to pick a random JSON document and try to naively convert it to triples, ignoring URIs, datatypes and other RDF features. Let’s use the description of the Underscore JavaScript library on npm.

Let’s start with the first four key-value pairs.

{
  "_id": "underscore",
  "_rev": "251-c05c3d825f5bc6b691649b4f90a3c894",
  "name": "underscore",
  "description": "JavaScript's functional programming helper library.",
  ...
}

I will use underscore as the subject for simplicity. The keys and values will become predicates and objects respectively.

underscore _id "underscore"
underscore _rev "251-c05c3d825f5bc6b691649b4f90a3c894"
underscore name "underscore"
underscore description "JavaScript's functional programming helper library."
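
The naive mapping above can be sketched in a few lines of Python. This is only an illustration: the function name flat_to_triples and the choice to represent a triple as a plain tuple are my assumptions, not part of any RDF library.

```python
import json

def flat_to_triples(subject, obj):
    """Map every primitive key-value pair to a (subject, predicate, object) tuple."""
    return [
        (subject, key, value)
        for key, value in obj.items()
        if not isinstance(value, (dict, list))  # nested values are out of scope here
    ]

doc = json.loads("""
{
  "_id": "underscore",
  "_rev": "251-c05c3d825f5bc6b691649b4f90a3c894",
  "name": "underscore",
  "description": "JavaScript's functional programming helper library."
}
""")

# The subject is chosen by hand, exactly as in the text.
triples = flat_to_triples("underscore", doc)
for s, p, o in triples:
    print(s, p, json.dumps(o))
```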

Nested objects

The conversion of flat JSON is unsurprisingly straightforward. What about nested objects?

{
...
  "repository": {
    "type": "git",
    "url": "git://github.com/jashkenas/underscore.git"
  }
...
}

The nested object could be treated as a blank node, but it’s better to use the node name we already have thanks to dot notation. The property repository, used as a predicate, can simply be interpreted as ‘has repository’.

underscore repository underscore.repository
underscore.repository type "git"
underscore.repository url "git://github.com/jashkenas/underscore.git"
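
Here is a rough sketch of the same idea in Python, assuming triples are plain tuples and that node names are built by joining the parent name and the key with a dot. The function name nested_to_triples is made up for this example.

```python
def nested_to_triples(subject, obj):
    """Recursively convert an object, naming nested objects by their dot path."""
    triples = []
    for key, value in obj.items():
        if isinstance(value, dict):
            child = f"{subject}.{key}"             # e.g. underscore.repository
            triples.append((subject, key, child))  # link the parent to the nested node
            triples.extend(nested_to_triples(child, value))
        else:
            triples.append((subject, key, value))
    return triples

doc = {
    "repository": {
        "type": "git",
        "url": "git://github.com/jashkenas/underscore.git",
    }
}
repo_triples = nested_to_triples("underscore", doc)
for triple in repo_triples:
    print(*triple)
```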

Let’s take a look at another example with two nested objects:

{
 ...
  "versions": {
    "1.0.3": {
      "name": "underscore",
      "description": "Functional programming aid for JavaScript. Works well with jQuery.",
      "url": "http://documentcloud.github.com/underscore/",
      ...
    }
  }
  ...
}

Following the same method as in the previous example, we would get:

underscore versions underscore.versions
underscore.versions 1.0.3 underscore.versions["1.0.3"]

This looks strange. While the property versions can be interpreted as ‘has versions’, 1.0.3 as a predicate doesn’t make much sense.

In order to interpret a structure of nested objects, we must understand the author’s intent. The problem is that associative arrays and objects have different semantics, but in JSON (i.e. JavaScript) they are written using the same syntax. An associative array is an array where descriptive keys are used instead of integer indices; in an object the keys are the names of its properties.

In the above example the value of versions is an associative array, but the value of 1.0.3 is an object. In other words, 1.0.3 was never meant to be a property of an object, but simply a key of an array.

Therefore, we must choose another strategy. We are going to connect underscore directly to the array item and treat 1.0.3 as the value of a special “key” property, while versions means ‘has version’:

underscore versions underscore.versions["1.0.3"]
underscore.versions["1.0.3"] key "1.0.3"

The conversion of the rest of the key-value pairs is straightforward:

underscore.versions["1.0.3"] name "underscore"
underscore.versions["1.0.3"] description "Functional programming aid for JavaScript. Works well with jQuery."
underscore.versions["1.0.3"] url "http://documentcloud.github.com/underscore/"
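
A possible sketch of this strategy in Python follows. Since plain JSON cannot tell us which objects are associative arrays, the converter has to be told; the assoc_keys parameter is a hypothetical way of doing that, and the sketch assumes every item of an associative array is itself an object.

```python
def assoc_to_triples(subject, obj, assoc_keys=frozenset()):
    """Convert an object to triples; keys listed in assoc_keys hold associative arrays."""
    triples = []
    for key, value in obj.items():
        if isinstance(value, dict) and key in assoc_keys:
            # Associative array: link the subject straight to each item and
            # record the item's key as the value of a special "key" property.
            for item_key, item in value.items():
                node = f'{subject}.{key}["{item_key}"]'
                triples.append((subject, key, node))
                triples.append((node, "key", item_key))
                triples.extend(assoc_to_triples(node, item, assoc_keys))
        elif isinstance(value, dict):
            # Ordinary nested object: name it by its dot path.
            node = f"{subject}.{key}"
            triples.append((subject, key, node))
            triples.extend(assoc_to_triples(node, value, assoc_keys))
        else:
            triples.append((subject, key, value))
    return triples

doc = {"versions": {"1.0.3": {"name": "underscore"}}}
version_triples = assoc_to_triples("underscore", doc, assoc_keys={"versions"})
for triple in version_triples:
    print(*triple)
```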

Arrays

Arrays can be parsed in the same way as associative arrays. Let’s use this snippet from the JSON file:

{
  "versions": {
    ...
    "1.1.3": {
      "maintainers": [
        {
          "name": "documentcloud",
          "email": "jeremy@documentcloud.org"
        },
        {
          "name": "jashkenas",
          "email": "jashkenas@gmail.com"
        }
      ]
    }
    ...
  }
}

The resulting triples would look like this:

underscore.versions["1.1.3"] maintainers underscore.versions["1.1.3"].maintainers[0]
underscore.versions["1.1.3"].maintainers[0] name "documentcloud"
underscore.versions["1.1.3"].maintainers[0] email "jeremy@documentcloud.org"
underscore.versions["1.1.3"] maintainers underscore.versions["1.1.3"].maintainers[1]
underscore.versions["1.1.3"].maintainers[1] name "jashkenas"
underscore.versions["1.1.3"].maintainers[1] email "jashkenas@gmail.com"

If the array values are primitives and the order is not important, like in this example…

{
  ...
  "keywords": ["util", "functional", "server", "client", "browser"],
  ...
}

… the subject could perhaps be connected directly to the values:

underscore keywords "util"
underscore keywords "functional"
underscore keywords "server"
underscore keywords "client"
underscore keywords "browser"

… or parsed in the same way as an object, using a “special” value property (rdf:value?):

underscore keywords underscore.keywords[0]
underscore.keywords[0] value "util"
underscore.keywords[0] index 0
underscore keywords underscore.keywords[1]
underscore.keywords[1] value "functional"
underscore.keywords[1] index 1
...
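
Both array strategies fit into one small Python sketch. The keep_order flag is a hypothetical switch between connecting primitives directly and wrapping them in value/index nodes; objects inside the array are assumed to be flat to keep the example short.

```python
def array_to_triples(subject, key, items, keep_order=False):
    """Convert a JSON array stored under `key` into triples."""
    triples = []
    for index, item in enumerate(items):
        node = f"{subject}.{key}[{index}]"
        if isinstance(item, dict):
            # Array of objects: one node per item, named by its index.
            triples.append((subject, key, node))
            triples.extend((node, k, v) for k, v in item.items())
        elif keep_order:
            # Primitive wrapped in a node that remembers its position.
            triples.append((subject, key, node))
            triples.append((node, "value", item))
            triples.append((node, "index", index))
        else:
            # Order is not important: connect the subject directly to the value.
            triples.append((subject, key, item))
    return triples

keywords = ["util", "functional"]
unordered = array_to_triples("underscore", "keywords", keywords)
ordered = array_to_triples("underscore", "keywords", keywords, keep_order=True)
```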

2. What compromises need to be made on both sides to make this possible?

The examples above cover different patterns used in the underscore JSON. In my experience with writing and reading different JSON documents I have seen these patterns repeating endlessly. However, JSON is flexible and it’s used in many ways, and in some cases it can’t be converted easily to RDF. Take for instance these two examples:

{
  "prop": [["foo", "bar"], ["fooValue", "barValue"]]
}

{
  "properties": [
    {
      "name": "foo",
      "value": "fooValue"
    },
    {
      "name": "bar",
      "value": "barValue"
    }
  ]
}

In the first example nested arrays are used to represent a table; in the second, key/value pairs are encoded explicitly. Are these examples idiomatic? I would say no, but the question is where the border between idiomatic and non-idiomatic JSON lies. Should users be deliberately constrained to simpler patterns, or should some of these patterns still be allowed as a kind of special, higher-abstraction syntax?

Distinguishing between keys and properties

In any case, the experiment has shown that one of the biggest problems is the inability of JSON to distinguish between objects and associative arrays. To solve this, we don’t need to extend JSON syntax; we can just use a different syntax for keys and properties.

In RDF, properties are identified by URIs, so it’s common to use the compact URI (CURIE) syntax in prefix:name form. The presence of the CURIE delimiter (:) can be used as an indication of whether a name is a key or a property. For instance, look at the following example from the FOAF specification, converted from RDF/XML to JSON:

{
  "foaf:account": {
    "freenode": {
      "foaf:accountServiceHomepage": {
        "href:": "http://www.freenode.net"
      }
    }
  }
}

Here we know that the keys containing : are the properties, while the others (freenode) are the keys of the associative array.
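
The rule is easy to express in code. The following Python sketch walks a document and sorts every name it finds into properties or keys purely by the presence of the colon; classify_keys is a name made up for this example.

```python
def classify_keys(obj):
    """Return (properties, keys): names with a ':' are properties, the rest are keys."""
    properties, keys = set(), set()
    for name, value in obj.items():
        (properties if ":" in name else keys).add(name)
        if isinstance(value, dict):
            nested_properties, nested_keys = classify_keys(value)
            properties |= nested_properties
            keys |= nested_keys
    return properties, keys

doc = {
    "foaf:account": {
        "freenode": {
            "foaf:accountServiceHomepage": {"href:": "http://www.freenode.net"}
        }
    }
}
properties, keys = classify_keys(doc)
print(sorted(properties), sorted(keys))
```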

Naming nodes with URIs

We need a way to deal with URIs as node names in JSON. In the previous section, we used implied names for nested objects thanks to dot notation. These names (i.e. paths) could be easily translated to URIs by using slashes (/) instead of dots.

That means that for nodes that are existentially dependent on other nodes, we would have to use this strict pattern instead of blank nodes or random URIs. Following the previous example, the resulting triples would be:

<http://danbri.org/danbri>
  foaf:account
    <http://danbri.org/danbri/foaf:account/freenode>

<http://danbri.org/danbri/foaf:account/freenode>
  foaf:accountServiceHomepage
    <http://www.freenode.net>

Such URIs can be “packed” into the tree, and written in JSON without any friction. The lack of order and structure in the graph and the unfriendliness of URIs are now hidden. The same data is much easier to read and write when organized in a tree. That’s exactly why people use tree-based syntaxes, and now we have RDF represented this way! And as an extra bonus, there is no need for blank nodes any longer.

Another challenge is encoding object URIs in JSON. There are a number of different proposals for how to do it. Here I followed the paradigm already used on the Web. When you write e.g. a blog post and mention something, you put in a hyperlink that usually points to some representative webpage describing that concept (e.g. a Wikipedia page, GitHub or Twitter profile, blog post etc.). Therefore, whenever you want to “break” the tree and point to a node on the Web of data, the URI is modelled as the value of the href: property. href: looks like a reserved word, but it’s also a CURIE, only without the “name” part.
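
To make the “unpacking” concrete, here is a minimal Python sketch that turns such a tree back into triples. It rests on assumptions of mine: the subject URI is supplied by the caller, an object whose names all contain a colon is treated as a nested node, any other object is treated as an associative array, and mixed objects are not handled. The function names parse and parse_value are hypothetical, and the sketch does not distinguish URIs from literals in its output.

```python
def parse(subject, obj):
    """Convert the properties of `obj` into triples about `subject`."""
    triples = []
    for prop, value in obj.items():
        triples.extend(parse_value(subject, prop, f"{subject}/{prop}", value))
    return triples

def parse_value(subject, prop, path, value):
    if not isinstance(value, dict):
        return [(subject, prop, value)]           # plain literal
    if "href:" in value:
        return [(subject, prop, value["href:"])]  # link out of the tree
    if all(":" in name for name in value):
        # Nested object: its URI is the slash-separated path from the root.
        return [(subject, prop, path)] + parse(path, value)
    # Associative array: every key extends the path by one segment.
    triples = []
    for key, item in value.items():
        triples.extend(parse_value(subject, prop, f"{path}/{key}", item))
    return triples

doc = {
    "foaf:account": {
        "freenode": {
            "foaf:accountServiceHomepage": {"href:": "http://www.freenode.net"}
        }
    }
}
foaf_triples = parse("http://danbri.org/danbri", doc)
for triple in foaf_triples:
    print(*triple)
```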

3. How complex would the parser be?

The basic algorithm for parsing triple-friendly JSON is extremely simple thanks to the standard way of “packing” triples into a tree structure. In general, the hierarchy of keys and properties directs how the tree is going to be parsed into RDF. Other than that, we would need a few “special” properties, like href: described above. Also, we need a way to map CURIEs to full URIs, but this doesn’t really require a special syntax (like @prefix in Turtle or xmlns: in XML); it could be described consistently using triples as well. Finally, if used, primitive JSON types could be mapped to corresponding XSD datatypes.
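
The two remaining pieces, CURIE expansion and datatype mapping, could look roughly like this in Python. The prefix table is hard-coded here only as a stand-in for a mapping that would itself be described in triples, and the particular JSON-to-XSD correspondence (integer, double, boolean, string) is my assumption rather than something fixed by the approach.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Stand-in prefix table; in the approach described above this mapping
# would be expressed as triples rather than hard-coded.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}

def expand_curie(curie, prefixes=PREFIXES):
    """Expand prefix:name into a full URI; raises KeyError for unknown prefixes."""
    prefix, _, name = curie.partition(":")
    return prefixes[prefix] + name

def xsd_datatype(value):
    """Pick an XSD datatype URI for a primitive JSON value (assumed mapping)."""
    if isinstance(value, bool):  # check bool first: in Python it is a subclass of int
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        return XSD + "double"
    if isinstance(value, str):
        return XSD + "string"
    raise TypeError(f"not a primitive JSON value: {value!r}")

print(expand_curie("foaf:account"))
print(xsd_datatype(0))
```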

One of the problems of RDF notations is that it’s possible to write an RDF graph in many different ways. The benefit of the approach described here is that it forces people to write JSON in a very predictable and consistent way. A small number of variations allows for a simple parser and easier understanding of data. JSON made this way can serve as a basis for a canonical RDF JSON format.

Finally, because this approach is not really made for JSON specifically but for tree-based formats in general, the same idea can be easily applied to e.g. XML or YAML, and the parser would essentially use the same algorithm.