[14 July 2009]
As reported in my previous post, I’ve been thinking about RDF a bit lately. So I’ve decided to dust off some meditations on the subject that originated several years ago.
I was feeding the dogs one evening when Enrique dropped by and complained bitterly about the shortcomings of various colleagues’ attempts to persuade people (including me) of the value of RDF: the overstatement, the misrepresentations of other technologies (both XML and relational databases), the overselling of RDF’s virtues. “True, they would make anyone with any marketing sense tear their hair out,” I said. “But it’s not rational to infer that there are no arguments for RDF, just because its advocates make such a poor show of arguing for it. If you want to understand what RDF does, without overstatement and without mischaracterization of other technologies, why don’t you try constructing a dispassionate account of what RDF does and doesn’t get us?”
Enrique’s response was something like what follows.
It should be noted that Enrique focuses here on RDF itself, not RDF + OWL. OWL was still very new at the time, and Enrique was reacting to years of rhetoric about how RDF, by itself, was semantically richer than XML. I have also corrected a slip or two in Enrique’s original effort; he couldn’t remember the term phatic, for example.
I wonder (Enrique said) if RDF can be summarized in three points:
- It proposes a way to think about information: there are things, they have properties.
- It proposes that we use a single universe of names for all individuals: URIs.
- It provides a single model of property attribution, namely the binary predicate, and thus gives us three well known roles (subject, verb, object, or relation-name, first-argument, second-argument) for participants in relations.
These may be worth some commentary.
It proposes a way to think about information: there are things, they have properties.
There’s no proof that all information, or all knowledge, or all propositions, can be thought of as being about things with properties. In fact, there are many very bright philosophers who deny it outright. But those who deny it don’t provide anything of similar convenience for machine processing.
Formal logic as usually taught today similarly tells us how to talk about things with properties. It’s quite plausible that there are things we can’t express conveniently or at all in formal logic — just look at the mess formal logicians are in trying to justify the truth table for material implication — but just as formal logic can be useful even if there are things it cannot do, so also for any way of talking about things and properties.
Things and properties, as usually considered, don’t capture very well the expressive, conative, metalingual, or phatic aspects of language, as Jakobson calls them (let alone the poetic), just the representational. Again, like logic.
It proposes that we use a single universe of names for all individuals: URIs.
URIs are interesting in part because they are simultaneously a unified set of names and a distributed system. Using them, we can eliminate ambiguity (if URIs are correctly used), though not synonymy.
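The ambiguity/synonymy distinction can be sketched in a few lines of Python, representing triples as tuples of strings. All the URIs below are hypothetical illustrations, not real identifiers: the point is that a single URI cannot accidentally mean two things, but two URIs can still name one thing without the data saying so.

```python
# Two hypothetical URIs that happen to denote the same person.
# URIs eliminate ambiguity: each URI means one thing. But synonymy
# remains: nothing in the triples below tells software that the two
# subject URIs coincide.
twain = "http://example.org/people/Mark_Twain"     # assumed URI
clemens = "http://example.org/people/S_L_Clemens"  # assumed URI

triples = {
    (twain, "http://example.org/prop/wrote", "Huckleberry Finn"),
    (clemens, "http://example.org/prop/born", "1835"),
}

# A query for everything about Twain silently misses the Clemens fact.
about_twain = {t for t in triples if t[0] == twain}
print(len(about_twain))  # 1, not 2: undetected synonymy
```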
Contrast naming disciplines in SQL, DTDs, programming languages, first-order predicate calculus, or natural language, with uncoordinated naming.
Contrast also naming disciplines involving central authority (‘use-mine-or-nothing’). If I remember correctly, there are central authorities who control Linnaean nomenclature, and names for specific geological formations, and the names of compounds given in official pharmacopeias.
It provides a single model of property attribution, namely the binary predicate, and thus gives us three well known roles (subject, verb, object, or relation-name, first-argument, second-argument) for participants in relations.
This simplicity, together with the lack of ambiguity in URIs when properly used, means that merger of arbitrary sets of triples is safe and easy. When predicates of arbitrary arity are allowed, merger can be more complex, or less effective, because when two sets of normalized relations are merged in a straightforward way, the result is not necessarily normalized. When things are resolved to triples, they are always in normal form. So the primary reasons for sub-optimal results after merging sets of triples are failures to merge owing to undetected synonymy, entailment relations other than synonymy (variation in specificity), variation in methods of currying n-ary predicates, and orbis-tertius variation.
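The “safe and easy merger” claim admits a minimal sketch, again with hypothetical URIs and triples as tuples: because every triple is already in normal form, merging two graphs is just set union, with shared triples collapsing automatically and no renormalization step.

```python
# Merging two sets of triples is set union; the result is again a
# valid set of triples, with no renormalization needed.
# All names here are hypothetical illustration URIs.
graph_a = {
    ("http://example.org/x", "http://example.org/prop/color", "red"),
    ("http://example.org/x", "http://example.org/prop/size", "large"),
}
graph_b = {
    ("http://example.org/x", "http://example.org/prop/color", "red"),  # duplicate, harmless
    ("http://example.org/y", "http://example.org/prop/color", "blue"),
}

merged = graph_a | graph_b  # union: still just a set of triples
print(len(merged))          # 3: the shared triple collapses
```

The same sketch also shows where merging can go quietly wrong: if the two graphs had curried an n-ary predicate differently, or used synonymous URIs, the union would still succeed mechanically while failing to connect what should be connected.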
“Orbis-tertius variation? What on earth are you talking about?” “Nothing on earth! It’s my short-hand way of talking about the radically underdetermined nature of our ad hoc ontologies. It’s a reference to Borges’s story Tlön, Uqbar, Orbis Tertius, my favorite treatise on ontology. Think of it as …” “… kind of an hommage. Right,” I said.
Ontological variation, by contrast, which shows up in variable-arity systems as difference of opinion about just what domains should be regarded as involved in an n-ary relation, does not cause problems for triples.
After he left, I realized I have no idea what Enrique meant by this.
Like the element/attribute distinction and child/parent, sibling/next-sib relations in SGML, this is a very thin standardization layer; it means almost nothing (which is why there can be so many sources of semantic variation). And again like the element/attribute model of SGML, that little turns out to be quite a lot, merely because that thin layer of standardization provides hooks that allow software to provide meaningful and useful operations defined in terms of those three roles. These operations can be performed without the software having the least idea of the meaning of the data (which is one reason it is so bizarre that Semantic Web enthusiasts insist so fervently on the implausible claim that the semantics of RDF data are overt in ways the semantics of other formats are not — I suspect the problem is that those particular enthusiasts think the distinction between circles and arrows counts as ‘semantics’.)
The semantic advantage of both SGML and RDF over some of their more obvious alternatives is this: precisely because they don’t define a prescribed semantics, the user can model whatever the user is interested in, using the primitive objects and relations built into the system to model whatever they wish to take as the primitive objects and relations of the system they are interested in representing. When this works well, operations on the primitive objects and relations can be used to model operations in the application domain, and the user has the feeling of being able to work ‘directly’ with the concepts of the application domain, with reduced need to pay attention to details of the representation. The ‘universality’ achieved by such a semantically thin layer of primitive notions is exactly parallel to the universality of s-expressions and relations, and it is not surprising that the advantages we feel to accrue from XML and/or RDF are very similar to the advantages claimed for s-expressions by Lisp enthusiasts and for the relational model by Codd, Date, and the relational warriors of the 1970s and 1980s.
Because the primitive notions of RDF (things and properties) are explicitly tied to ideas of modeling, they feel (at least to believers) more nearly ‘semantic’ than the notions of other systems (e.g. XML or s-expressions). The thinness of the triple layer can be an advantage, not only in simplifying the universe of possible primitive operations, but also in reducing threshold anxiety. (More elaborate modeling systems invariably require something like a leap of faith; RDF’s tenets are thin enough and bland enough to make its required leap of faith somewhat smaller and less frightening.) And thin as it is, the subject/verb/object model does allow an infrastructure that knows nothing of the semantics of the information to do a lot of useful things, just as the semantics of the relational model allow RDBMS which understand nothing of application semantics to do a lot of useful things.
Some things it does not do (although prominent exponents of the Semantic Web sometimes speak as if it did):
- provide ‘self-describing data’ (if such a thing exists at all)
- ensure that ‘the semantics’ of data are always explicit or always understood
- guarantee that data from different sources are usefully mergeable
- tell us how to understand, model, formalize our data
- tell us how to validate our data
- tell us how to express complex relations clearly (this problem is not only not addressed by RDF; RDF does as much as any notation or model can to render it insoluble)
This may have been true when Enrique first wrote this, but the situation has perhaps changed with the publication in 2006 of the Note “Defining N-ary Relations on the Semantic Web”, which recommends that tuples be reified with a gay abandon that might cause even avowed Platonists to pause and wonder whether all of those things really should be treated as individuals by our logic. Determined nominalists may be horrified by the recommendation, but it’s no longer true to say that the RDF community doesn’t say how to handle the problem.
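The reification move can be sketched as follows, with triples as tuples and every URI a hypothetical illustration: instead of asserting a ternary fact sold(seller, buyer, item) directly, mint a new individual for the sale itself and attach each participant to it with its own binary predicate.

```python
import itertools

# Reifying a ternary relation sold(seller, buyer, item) as triples,
# in the spirit of the 2006 Note: mint an individual for each
# instance of the relation. All URIs here are hypothetical.
counter = itertools.count(1)

def reify_sale(seller, buyer, item):
    sale = f"http://example.org/sale/{next(counter)}"  # the new individual
    return {
        (sale, "http://example.org/prop/seller", seller),
        (sale, "http://example.org/prop/buyer", buyer),
        (sale, "http://example.org/prop/item", item),
    }

triples = reify_sale("http://example.org/people/Ann",
                     "http://example.org/people/Ben",
                     "http://example.org/things/lamp")
print(len(triples))  # 3 binary triples stand in for one ternary fact
```

The Platonist’s discomfort is visible in the first line of the function body: every instance of the relation becomes a freshly minted thing in the universe of discourse.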
Some fans of RDF will perhaps feel that Enrique has shortchanged RDF here, but I have to say that Enrique’s arguments have gone a long way toward making me think RDF could be useful, even if I am still not as committed as I suspect some of my friends in the W3C’s Semantic Web Activity would like.
If this message in a bottle is ever read by anyone, I will be interested to hear back from you on whether you find Enrique’s analysis persuasive.