Descriptive markup and data integration

In his enlightening essay “Si tacuisses, Enrique …”, my colleague Thomas Roessler outlines some specific ways in which RDF’s provision of a strictly monotonic semantics makes some things possible for applications of RDF, and makes other things impossible. He concludes by saying:

RDF semantics, therefore, is exposed to criticism from two angles: On the small scale, it imposes restrictions on those who model data … that can indeed bite badly. On the large scale, real life isn’t monotonic …, and RDF’s modeling can’t deal with that….

XML is “dumb” enough to not be subject to either of these criticisms. It is, however, not even trying to address the issues that large-scale data integration and aggregation will bring.

I think TR may both underestimate the degree to which XML (like SGML before it) contributes to making large-scale data integration possible, and overestimate the contribution that can be made to this task by monotonic semantics. To make large-scale data integration and aggregation possible, what must be done? I think that in a lot of situations, the first task is not “ensure that the application semantics are monotonic” but “try to record the data in an application-independent, reusable form”. If you cannot say what the data mean without reference to a single authoritative application, then you cannot reuse the data. If you have not defined an application-independent semantics for the data, then you will experience huge difficulties with any reuse of the data. Bear in mind that data integration and aggregation (whether large-scale or small-) are intrinsically, necessarily, kinds of data reuse. No data reuse, no data integration.

For that reason, I think TR’s final sentence shows an underdeveloped appreciation for the relevant technologies. Like the development of centralized databases designed to control redundancy and store common information in application-independent ways, the development of descriptive markup in SGML helped lay an essential foundation for any form of secondary data integration. Or is there a way to integrate data usefully without knowing anything at all about what it means? Having achieved the hard-won ability to own and control our own information, instead of having it be owned by software vendors, we can now turn to ways in which we can organize its semantics to minimize downstream complications. But there is no need to begin the effort by saying “well, the effort to wrest control of information from proprietary formats is all well and good, but it really isn’t trying to solve the problems of large-scale data integration that we are interested in.”

(Enrique whistled when he read that sentence. “You really want to dive down that rathole? Look, some people worked hard to achieve something; some other people didn’t think highly enough of the work the first people did, or didn’t talk about it with enough superlatives. Do you want to spend this post addressing your deep-seated feelings of inadequacy and your sense of being under-appreciated? Or do you want to talk about data integration? Sheesh. Dry up, wouldja?”)

Conversely, I think TR may overestimate the importance of the contribution RDF, or any similar technology, can make to useful data integration. Any data store that can be thought of as a conjunction of sentences can be merged through the simple process of set union; RDF’s restriction to atomic triples contributes nothing (as far as I can currently see) to that mergeability. (Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?)
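
The set-union point is easy to make concrete. Here is a minimal sketch in Python that treats each store as nothing more than a set of sentences, here subject–predicate–object tuples; the identifiers and data are invented purely for illustration:

```python
# Two "data stores", each just a conjunction of sentences represented
# as a set of (subject, predicate, object) tuples.
store_a = {
    ("http://example.org/doc1", "dc:creator", "Alice"),
    ("http://example.org/doc1", "dc:title",   "On Markup"),
}
store_b = {
    ("http://example.org/doc1", "dc:creator", "Alice"),   # repeated assertion
    ("http://example.org/doc2", "dc:creator", "Bob"),
}

# Merging is plain set union; duplicate assertions collapse, nothing is retracted.
merged = store_a | store_b
for sentence in sorted(merged):
    print(sentence)

# Nothing here depends on the sentences being triples: sets of SQL rows,
# Topic Map assertions, or longer tuples would merge in exactly the same way.
```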

And it’s not clear to me that simple mechanical mergeability in itself contributes all that much to our ability to integrate data from different sources. Data integration, as I understand the term, involves putting together information from different sources to achieve some purpose or accomplish some task. But using information to achieve a purpose always involves understanding the information and seeing how it can be brought to bear on the problem. In my experience, finding or making a human brain with the required understanding is the hard part; once that’s available, the kinds of simple automatic mergers made possible by RDF or Topic Maps have seemed (in my experience, which may be woefully inadequate in this regard) a useful convenience, but not always an essential one. It might well be that the data from source A cannot be merged mechanically with that from source B, but an integrator who understands how to use the data from A and B to solve a problem will often experience no particular difficulty working around that impossibility.

I don’t mean to underestimate the utility of simple mechanical processing steps. They can reduce costs and increase reliability. (That’s why I’m interested in validation.) But by themselves they will never actually solve any very interesting problems, and the contribution of mechanical tools seems to me smaller than the contribution of the human understanding needed to deploy them usefully.

And finally, I think Thomas’s post raises an important and delicate question about the boundaries RDF sets to application semantics. An important prerequisite for useful data integration is, it would seem, that there be some useful data worth retaining and integrating. How thoroughly can we convince ourselves that in requiring monotonic semantics RDF has not excluded from its purview important classes of information most conveniently represented in other ways?

2 thoughts on “Descriptive markup and data integration”

  1. Let me be more precise than in that blog post: XML does not give you a single way to merge data; any large-scale aggregation that you want to do depends on knowledge about the vocabularies. With RDF, you don’t have that dependency.

    I’m not making any statement about the relative usefulness of these models for data integration (and have no good judgment on that) — that question is indeed the elephant in the room.

  2. Regarding: “Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?”

    My view (not coming from practice, admittedly) is that there are two benefits RDF gives us over relational data stores. (I don’t know enough about Topic Maps here…)

    First, when merging two relational data stores, you cannot be sure that the primary keys in similar tables identify the same things (especially with generated IDs), so you cannot simply put the contents of the tables together, even if the structure would allow it. In RDF, where relations replace tables and URIs replace primary keys, the different datasets to be merged can unambiguously use the same relations and resources, so in effect the content of the tables is merged. (Not sure this is clear; see the first sketch after these comments.)

    Second, even if two data sets do not share a single URI (apart from the RDF and RDFS vocabularies), the ease of merging them allows a query to contain the mappings: one can write a triple pattern that looks for either foaf:name or vcard:FN; the processor doesn’t know that they have similar semantics, but the person who writes the query does, and they have the power to express that; the second sketch after these comments shows the idea. In SQL, to my knowledge remaining from college, you cannot easily query across different databases; I know of no deployment where somebody actually puts the raw data from two databases into one, and of no SQL engine that acts as a façade over multiple databases and would thereby allow a similar query.

    It’s not necessarily that a real honest-to-God data integration effort would benefit from RDF set merging; it’s that data can be merged serendipitously, if need be, by the person who needs it, using only the common query language and common query engines.
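
The first point in the second comment, the collision of locally generated primary keys, can be sketched quickly. This is an illustration only, using Python’s built-in sqlite3 module; the table, columns, and names are invented:

```python
import sqlite3

def make_store(people):
    # A tiny in-memory relational store with a locally generated key space.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO person VALUES (?, ?)", people)
    return db

store_a = make_store([(1, "Alice")])   # id 1 was generated locally in store A
store_b = make_store([(1, "Bob")])     # id 1 was generated locally in store B

# Naively unioning the rows loses the fact that the two "1"s denote
# different people: the keys are only meaningful inside each store.
rows = (store_a.execute("SELECT * FROM person").fetchall()
        + store_b.execute("SELECT * FROM person").fetchall())
print(rows)   # [(1, 'Alice'), (1, 'Bob')] -- same key, different individuals

# The RDF analogue: identifiers are global URIs, so the same union is safe.
triples_a = {("http://example.org/a/alice", "name", "Alice")}
triples_b = {("http://example.org/b/bob",   "name", "Bob")}
print(triples_a | triples_b)   # no collision: the URIs never clash
```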
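The second point, bridging vocabularies at query time, might look roughly like this. The sketch assumes the rdflib library is available; the resources, names, and the particular vcard namespace are invented for illustration:

```python
from rdflib import Graph

data_a = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/alice> foaf:name "Alice" .
"""

data_b = """
@prefix vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> .
<http://example.org/staff/bob> vcard:FN "Bob" .
"""

g = Graph()
g.parse(data=data_a, format="turtle")
g.parse(data=data_b, format="turtle")   # parsing into one graph merges the triples

# The two datasets share no URIs, but the query author knows that
# foaf:name and vcard:FN mean roughly the same thing, and says so here.
query = """
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
SELECT ?person ?name WHERE {
  { ?person foaf:name ?name }
  UNION
  { ?person vcard:FN ?name }
}
"""
for row in g.query(query):
    print(row.person, row.name)
```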
