In his enlightening essay Si tacuisses, Enrique …, my colleague Thomas Roessler outlines some specific ways in which RDF’s provision of a strictly monotonic semantics makes some things possible for applications of RDF, and makes other things impossible. He concludes by saying:
RDF semantics, therefore, is exposed to criticism from two angles: On the small scale, it imposes restrictions on those who model data … that can indeed bite badly. On the large scale, real life isn’t monotonic …, and RDF’s modeling can’t deal with that….
XML is “dumb” enough to not be subject to either of these criticisms. It is, however, not even trying to address the issues that large-scale data integration and aggregation will bring.
I think TR may both underestimate the degree to which XML (like SGML before it) contributes to making large-scale data integration possible, and overestimate the contribution that can be made to this task by monotonic semantics. To make large-scale data integration and aggregation possible, what must be done? I think that in a lot of situations, the first task is not “ensure that the application semantics are monotonic” but “try to record the data in an application-independent, reusable form”. If you cannot say what the data mean without reference to a single authoritative application, then you cannot reuse the data. If you have not defined an application-independent semantics for the data, then you will experience huge difficulties with any reuse of the data. Bear in mind that data integration and aggregation (whether large-scale or small-) are intrinsically, necessarily, kinds of data reuse. No data reuse, no data integration.
For that reason, I think TR’s final sentence shows an underdeveloped appreciation for the relevant technologies. Like the development of centralized databases designed to control redundancy and store common information in application-independent ways, the development of descriptive markup in SGML helped lay an essential foundation for any form of secondary data integration. Or is there a way to integrate data usefully without knowing anything at all about what it means? Having achieved the hard-won ability to own and control our own information, instead of having it be owned by software vendors, we can now turn to ways in which we can organize its semantics to minimize downstream complications. But there is no need to begin the effort by saying “well, the effort to wrest control of information from proprietary formats is all well and good, but it really isn’t trying to solve the problems of large-scale data integration that we are interested in.”
(Enrique whistled when he read that sentence. “You really want to dive down that rathole? Look, some people worked hard to achieve something; some other people didn’t think highly enough of the work the first people did, or didn’t talk about it with enough superlatives. Do you want to spend this post addressing your deep-seated feelings of inadequacy and your sense of being under-appreciated? Or do you want to talk about data integration? Sheesh. Dry up, wouldja?”)
Conversely, I think TR may overestimate the importance of the contribution RDF, or any similar technology, can make to useful data integration. Any data store that can be thought of as a conjunction of sentences can be merged through the simple process of set union; RDF’s restriction to atomic triples contributes nothing (as far as I can currently see) to that mergeability. (Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?)
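To make the set-union point concrete, here is a minimal sketch in Python. It assumes that a data store can be modeled as a set of hashable "sentences"; the identifiers (`ex:doc1`, `ex:creator`, and so on) and the `merge` helper are hypothetical and appear in no real vocabulary or library. The same helper merges triples and wider relational rows alike, which suggests that the mergeability comes from the conjunction-of-sentences reading rather than from the triple shape.

```python
# A store is modeled as a set of "sentences" (hypothetical data throughout).
# Merging stores is plain set union: nothing asserted by either side is lost.

def merge(*stores):
    """Return the union of any number of stores (sets of hashable sentences)."""
    return set().union(*stores)

# Triple-shaped sentences: (subject, predicate, object).
triples_a = {
    ("ex:doc1", "ex:creator", "ex:person1"),
    ("ex:doc1", "ex:title", "An example title"),
}
triples_b = {
    ("ex:doc1", "ex:creator", "ex:person1"),  # also asserted in triples_a
    ("ex:person1", "ex:name", "An example name"),
}
merged_triples = merge(triples_a, triples_b)

# Relational-style sentences: rows of arity four merge in exactly the same way.
rows_a = {("doc1", "person1", 2008, "essay")}
rows_b = {("doc2", "person1", 2009, "note")}
merged_rows = merge(rows_a, rows_b)

# Every sentence from each source is still present after the merge.
print(triples_a <= merged_triples and triples_b <= merged_triples)
print(len(merged_triples), len(merged_rows))
```

Nothing in `merge` depends on the arity of the tuples; only hashability and equality of sentences are used.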
And it’s not clear to me that simple mechanical mergeability in itself contributes all that much to our ability to integrate data from different sources. Data integration, as I understand the term, involves putting together information from different sources to achieve some purpose or accomplish some task. But using information to achieve a purpose always involves understanding the information and seeing how it can be brought to bear on the problem. In my experience, finding or making a human brain with the required understanding is the hard part; once that’s available, the kinds of simple automatic mergers made possible by RDF or Topic Maps have seemed (in my experience, which may be woefully inadequate in this regard) a useful convenience, but not always an essential one. It might well be that the data from source A cannot be merged mechanically with that from source B, but an integrator who understands how to use the data from A and B to solve a problem will often experience no particular difficulty working around that impossibility.
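A small sketch of the gap between merging and integrating, again in Python and again with entirely hypothetical identifiers (`ex:creator`, `vendor:writtenBy`, and the `documents_by` helper are invented for illustration). The union succeeds mechanically, but a question asked in one source's vocabulary misses the other source's data until someone who understands both supplies the knowledge that the two predicates mean the same thing.

```python
# Two sources describe authorship with different (hypothetical) predicates.
source_a = {("ex:doc1", "ex:creator", "ex:person1")}
source_b = {("ex:doc2", "vendor:writtenBy", "ex:person1")}

# The mechanical merge is trivial and loses nothing.
merged = source_a | source_b

def documents_by(triples, person, predicates):
    """List subjects linked to `person` by any of the given predicates."""
    return sorted(s for (s, p, o) in triples if p in predicates and o == person)

# Asking only in source A's vocabulary misses the data from source B.
naive = documents_by(merged, "ex:person1", {"ex:creator"})

# The integrator's understanding, not the merge, closes the gap:
# here it is encoded by hand as "these two predicates mean the same thing".
informed = documents_by(merged, "ex:person1", {"ex:creator", "vendor:writtenBy"})

print(naive)
print(informed)
```

The set of equivalent predicates passed to `documents_by` is exactly the part that no union operation can produce on its own.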
I don’t mean to underestimate the utility of simple mechanical processing steps. They can reduce costs and increase reliability. (That’s why I’m interested in validation.) But by themselves they will never actually solve any very interesting problems, and the contribution of mechanical tools seems to me smaller than the contribution of the human understanding needed to deploy them usefully.
And finally, I think Thomas’s post raises an important and delicate question about the boundaries RDF sets to application semantics. An important prerequisite for useful data integration is, it would seem, that there be some useful data worth retaining and integrating. How thoroughly can we convince ourselves that in requiring monotonic semantics RDF has not excluded from its purview important classes of information most conveniently represented in other ways?