International versioning symposium (Balisage report 1)

[11 August 2008]

The Balisage 2008 conference begins tomorrow. Today, there is a one-day pre-conference symposium on Versioning XML Vocabularies and Systems. So far (this version of this post is being written just after lunch), I’ve learned a lot.

Peter Brown of Pensive spoke about the nature of versions (and thus of the versioning problem) as social constructs. Plato and Aristotle thought of reality as having natural joints, and of a good conceptual apparatus as cutting reality at those joints; that would suggest that there is typically some natural distinction between x (for any x) and not-x. Nowadays, Brown suggested, things look different. A look at French and American butchers’ charts shows that even the cuts of meat are culturally variable. But if even real joints don’t necessarily determine how the cow is cut, how much less will the joints of reality (as we understand reality) fully determine an ontology. Why, Brown asked, do we identify some things as versions of other things? What is the point of the identification; what is particular about the points at which the term is applied? Are versions used for identifying particular content? for tracking evolution? for tracking differences? for identifying canonical forms of variable material? The answer, essentially, is “yes”: versions are used for all of the above. That is to say, our use of the term version is culturally and functionally determined. He offered a number of useful bits of advice, most saliently perhaps the advice: “If anyone mentions versioning, assume nothing! In particular, don’t assume that you know what they mean.”

David Orchard talked about the fundamental problems of versioning from a down-to-earth point of view, with particular emphasis on the issues of forwards and backwards compatibility, their definition in terms of the sets of documents produced and accepted by various agents, and their effect on the deployment of protocols and software. He talked about specific things vocabulary designers can do to make their languages easier to extend and evolve.

A couple of points made in the discussion after David’s talk may be worth recording. He had pointed to the difficulty some schema authors experience in writing a schema which allows extension in all the right places, while still making clear what sort of messages are actually expected in the absence of extension; there is no law, however, that says a language or a protocol spec must be defined by a single schema. If you want producers of v.1 messages to produce only messages of a particular kind, and v.1 consumers to accept all of those messages but also a larger set (so that they will react calmly when confronted with v.2 messages), then you could do worse than to define two schemas: one for the producer language and one for the acceptor language.

Marc de Graauw picked up the notion of compatibility, and the idea of defining compatibility in terms of set theory, and gave a thorough exposition of how they interrelate. He gave short shrift to the idea of defining compatibility in terms of a formal logic like (for example) OWL; a serious logical definition of something as complicated as a real-life vocabulary like HL7 v3 is far too daunting to contemplate. Also, definitions like that are hard to reconcile with the idea of consumers or producers of messages as black boxes. Instead, he suggested defining the semantics of a vocabulary in terms of behavior. His slides pushed Venn diagrams for message sets just about as far as they can go.
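The set-theoretic picture can be sketched in a few lines of Python. This is a toy model of my own, not anything from the talks: messages stand in for documents, and the producer and acceptor sets for each version are invented for illustration. The point it demonstrates is the one David and Marc both made: compatibility is just a subset relation between what one side produces and what the other side accepts, and deliberately defining the acceptor set larger than the producer set is what leaves room for future versions.

```python
# Toy model of compatibility as set inclusion. All names and message
# sets here are invented for illustration; real message sets are, of
# course, infinite languages defined by schemas, not finite sets.

V1_PRODUCED = frozenset({"order", "invoice"})
V1_ACCEPTED = frozenset({"order", "invoice", "receipt"})  # acceptor set deliberately larger
V2_PRODUCED = frozenset({"order", "invoice", "receipt"})
V2_ACCEPTED = frozenset({"order", "invoice", "receipt", "credit-note"})

def backwards_compatible(old_produced, new_accepted):
    """v.2 consumers accept everything v.1 producers emit."""
    return old_produced <= new_accepted

def forwards_compatible(new_produced, old_accepted):
    """v.1 consumers accept everything v.2 producers emit."""
    return new_produced <= old_accepted

print(backwards_compatible(V1_PRODUCED, V2_ACCEPTED))  # True
print(forwards_compatible(V2_PRODUCED, V1_ACCEPTED))   # True, because v.1's acceptor set left room
```

In this toy, forwards compatibility holds only because the v.1 acceptor set was defined as a superset of the v.1 producer set, which is exactly the two-schema (producer language / acceptor language) idea from the discussion after David's talk.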

Laurel Shifrin of LexisNexis gave the first of a series of case studies, describing the various mechanisms LexisNexis has adopted to try to manage vocabulary change in an extremely heterogeneous and decentralized organization. The short answer: it’s not easy, and you really do benefit from executive buy-in. (Again, there was a lot more; I should try to elaborate this description later on.)

Debbie Lapeyre spoke about a vocabulary for (mostly biomedical) journal content used by the U.S. National Library of Medicine. The perceived stability and backwards-compatibility guarantees of the vocabulary were among the key reasons for its success and its wide adoption in a larger community. But over time, patience with the shortcomings and outright errors in the vocabulary waned, and even the user community was asking for a revision which would include incompatible changes. I had the impression that Debbie was a little surprised that the members of the responsible group were not lynched, but that she was glad that they weren’t, and that the new version can repair some problems she has had on her conscience as a vocabulary designer for a while.

Ken Holman talked about the way versioning problems are handled in UBL (the Universal Business Language). UBL has moved to a two-layer validation story: the only normative definition of UBL documents is given by the XSD schema documents, and business rules are enforced in a separate business-rule validation step (rather than by customizing local versions of the schema documents to incorporate business rules). Perhaps the most surprising restriction, for some observers, may be the UBL policy that its schema documents will never use xsd:choice constructions. Life is made harder by the fact that many consumers of UBL documents appear (by Ken’s account) to use systems which simply will not or cannot show their users invalid documents. It’s hard to make informed decisions about how to recover from a problem, if you’re using a system that won’t show you the problematic data. Fortunately, UBL has some ingenious people.
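The two-layer idea can be sketched as a pipeline. The sketch below is a toy of my own in Python, not UBL itself: the document shape, field names, and rules are all invented, and real UBL uses XSD for the structural layer and a separate rule language for the business layer. What it shows is only the division of labor: the first pass checks structure, and business rules run as a distinct second pass rather than being folded into the schema.

```python
# Toy two-layer validation in the spirit of UBL's approach:
# layer 1 checks structure, layer 2 checks business rules.
# Document shape and rules are invented for illustration.

invoice = {"id": "INV-17", "total": 120.0,
           "lines": [{"sku": "A1", "amount": 70.0}, {"sku": "B2", "amount": 50.0}]}

def validate_structure(doc):
    """Layer 1: the 'schema' pass -- required fields and types only."""
    errors = []
    for field, typ in (("id", str), ("total", float), ("lines", list)):
        if not isinstance(doc.get(field), typ):
            errors.append(f"structural: {field} missing or wrong type")
    return errors

def validate_business_rules(doc):
    """Layer 2: rules kept out of the schema, applied separately."""
    errors = []
    if sum(line["amount"] for line in doc.get("lines", [])) != doc.get("total"):
        errors.append("business: line amounts do not sum to total")
    return errors

errors = validate_structure(invoice)
if not errors:  # only run layer 2 on structurally valid documents
    errors = validate_business_rules(invoice)
print(errors)  # []
```

Keeping the rules in their own pass means the structural definition can stay stable while business rules change, which is one motivation for not customizing local copies of the schema documents.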