Balisage: The markup conference 2009

[3 April 2009]

Three weeks to go until the 24 April deadline for papers for the 2009 edition of Balisage: The markup conference.

We want your paper. So give it to us, already!

This is a peer-reviewed conference that seeks to be of interest both to the theorist and to the practitioner of markup. That makes it a lot of fun (at least for people like me who are interested both in theory and in practice, and who like to see them informing each other). And the peer reviews are unusual (at least in my experience of conference paper submissions) in the detail and passion of their comments.

If you have markup-related work to report on, you will not get better feedback from any conference on the planet. (Disclaimer: I am one of the organizers, and have been known to have a small soft spot in my heart for this conference. But don’t take my word for it: ask anyone who has spoken at Balisage how the peer review and the questions at the conference compared with other conferences.)

Details of submission procedure, of course, in the call for participation.

I look forward to reading your papers.

Further notes on optimistic concurrency and XML parsing

[22 August 2008]

I just posted some notes on a paper given at Balisage 2008 by Yu Wu et al. of Intel.

A few thoughts occurred to me in writing up those notes which might merit separate consideration.

How effective could pessimization be?

A key part of the optimistic concurrency algorithm presented by Yu Wu et al. is that the process of chunking the document needs to be quick. So they make some guesses, when chunking, that could later be proven wrong; in that case, the chunk needs to be re-parsed.

I suppose the worst-case scenario here is that a sufficiently lucky and malignant adversary could construct a document in which the context at the end of chunk 1 means that chunk 2 needs to be reparsed, the reparsing of chunk 2 reveals for the first time that chunk 3 now needs to be reparsed, and so on, so that in the end you use n time slices to parse n chunks, instead of n divided by the number of threads.

So there’s an interesting question: how long can we keep this up?

It’s pretty clear that if we know exactly where the pre-scanner will break the chunks, then we can devise an XML document that forces chunk 2 to be reparsed. Can we construct a document in which only the second, correct parse of chunk 2 reveals that chunk 3 now needs to be reparsed (i.e. in which the first parse of chunk 2 makes chunk 3 look OK, and the second one shows that it’s not OK)?

Can we make a document in which every time we reparse a chunk with the correct context, we discover that the next chunk also needs to be reparsed? How much reworking can an omniscient and malevolent XML author cause this algorithm to do? Remember that comments and CDATA sections do not nest; the worst I can figure out off hand is that a comment or CDATA section begins in chunk 1 and doesn’t end until the last chunk.
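
Just to make the adversary concrete, here is the kind of document I have in mind, sketched in Python (my own toy, with made-up element names, not anything from the paper): a comment opens in what will become the first chunk and does not close until the last one, so the speculative parse of every chunk in between is built on a false assumption and has to be discarded once the true context arrives from the left.

    def adversarial_document(n_chunks: int, chunk_size: int = 1024) -> str:
        # Comment text may contain '<' and '>', just not '--', so the middle of
        # the document can be made to look exactly like ordinary element content.
        filler = "<p>hello</p>" * (chunk_size // len("<p>hello</p>"))
        head = "<doc><!-- a comment that refuses to end "
        middle = filler * max(n_chunks - 2, 1)   # looks like markup, is comment text
        tail = " --><body>" + filler + "</body></doc>"
        return head + middle + tail

    doc = adversarial_document(8)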

How many chunks do you want?

The paper says fewer chunks are better than many chunks (to reduce post-processing costs), and that you want at least as many chunks as there are threads (to ensure that all cores can be busy). To simplify the examples I’ve been thinking about, I’ve been imagining that if I have eight threads, I’ll make eight chunks.

But if I’ve read the performance data and charts right, the biggest single reason the Horatian parser is not getting an eight-fold speedup when using eight threads is the need to reparse some chunks, owing to bad guesses about parse context made during the first parse. If we have eight threads and eight chunks, everything is fine for the first pass over the chunks. But if we need to reparse two of the chunks, then it rather looks as if six threads might be sitting idle waiting for the re-parsing to finish.

I wonder: would you get better results if you had shorter chunks, and more of them, to keep more threads busy longer? What you want is enough chunks to ensure that while you are reparsing some chunks, you still have other chunks for the other threads to parse.

As a first approximation, imagine that we have eight threads. Instead of eight chunks, we make fourteen chunks, and give the first eight of them to the eight threads. Let’s say two of them need to be reparsed; the reparsing goes on at the same time that the remaining six threads parse the remaining six chunks. The minimal path through the speculative parsing step remains the time it takes to parse two chunks, but the chunks are somewhat smaller now. The only question is how much additional time the post-processing step will now take, given that it has fourteen and not eight chunks to knit together.

And of course you need to bear in mind that if one chunk in four turns out to need re-parsing, then three or four out of the fourteen chunks are going to need reparsing, not just two. By the time you factor that in, and try to ensure that your last round of parsing doesn’t generate any new re-parse requests, things have gotten more complicated than I can conveniently deal with here (or elsewhere).
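
Out of curiosity, I tried to put rough numbers on this with a little scheduling toy (mine, not the paper’s; the one-in-four reparse rate and the greedy scheduler are pure assumptions for illustration, and the toy cheerfully ignores the fact that a reparse is only discovered after its left-hand neighbour has been fixed up):

    import heapq

    def speculative_makespan(n_chunks: int, n_threads: int, reparse_every: int = 4) -> float:
        chunk_cost = 1.0 / n_chunks                  # the whole document is 1 unit of work
        jobs = [chunk_cost] * n_chunks               # first-pass parses
        jobs += [chunk_cost] * (n_chunks // reparse_every)   # crude estimate of reparses
        threads = [0.0] * n_threads                  # time at which each thread is next free
        heapq.heapify(threads)
        for cost in jobs:
            free_at = heapq.heappop(threads)         # hand the job to the earliest-free thread
            heapq.heappush(threads, free_at + cost)
        return max(threads)

    for chunks in (8, 14, 32):
        print(chunks, "chunks:", round(speculative_makespan(chunks, 8), 3))
    # prints roughly 0.25, 0.214, 0.156: smaller chunks keep more threads busy,
    # at the price of giving the post-processor more seams to knit together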

Maybe that’s why the Intel paper was so non-committal about how to choose the number of chunks in the first place: it can get pretty complicated pretty fast.

Optimization and context independence in schema languages

One of the things that intrigues me about these results is that so much of what people have said needs to be done to schema languages to ensure that validation can be fast has nothing much to do with the speed gains shown by optimistic concurrency.

I thought for a while that this work did benefit from the fact that elements can be validated against XSD types without knowledge of their context (no reference to ancestors or siblings in any assertions, for example), but on reflection I’m not sure this is true: in order to find the right element declaration and type definition to bind an instance element, you need to know (a) the expanded name of the element (which means knowing the in-scope namespaces, which in practice means having looked at all of the ancestors of the element), and (b) the type assigned to the element’s parent (unless this element is itself the validation root). Once you have a type, it’s true that validation is independent of context. But the assignment of a type to an element or attribute does depend, in the normal case, on the context. It’s not clear to me that allowing upward-pointing XPath expressions in assertions or conditional type assignment would make much difference.
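
A toy way to see the point (my own illustration, with invented names; this is not XSD machinery, just the shape of the lookup): the binding of an element name to a type is a lookup whose key includes the parent’s type, so you cannot perform it for an arbitrary element plucked out of the middle of a document.

    # The same element name is bound to different types depending on the type
    # already assigned to its parent; without the parent, the bare name is not enough.
    TYPE_OF_CHILD = {
        ("CatalogueEntryType", "title"): "BookTitleType",   # <title> of a book
        ("PersonType",         "title"): "HonorificType",   # <title> of a person
    }

    def assign_type(parent_type: str, element_name: str) -> str:
        return TYPE_OF_CHILD[(parent_type, element_name)]

    assert assign_type("PersonType", "title") == "HonorificType"
    assert assign_type("CatalogueEntryType", "title") == "BookTitleType"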

To really exploit parallelism in validation, it would seem you want to eliminate the variable binding of expanded names to element declarations and to types.

Back to DTDs plus datatypes, anyone?

Optimistic concurrency and XML parsing and validation (Balisage report 3, in which chronology is abandoned)

[22 August 2008]

My brief hope (it would be misleading to refer to it as a “plan”) to report daily from Balisage 2008 has bitten the dust — it did that shortly after noon on the first day, when my account of Sandro Hawke’s work on XTAN turned out to take more time than I had available — but there is still a lot to say. I’m going to abandon the chronological approach, however, and write about things that come to mind, in a more or less random order.

One of my favorite papers this year was the one submitted by Yu Wu, Qi Zhang, Zhiqiang Yu, and Jianhui Li, of Intel, under the slightly daunting title “A Hybrid Parallel Processing for XML Parsing and Schema Validation”. (I think they are all members of the XML Engineering Team at the Intel China Software Center in Shanghai, but I am not sure I’ve read all the affiliation info correctly; I keep being distracted by the implications of an Intel software center having an XML engineering team.)

When I paraphrased this paper to a friend recently, her response was “Wow! That’s a lot more accessible than I would have guessed from the title.” So perhaps it’s worth while to try to record here the high points of the work, in a way that’s accessible to people with no more than lay knowledge of the relevant technical disciplines. (This goal is made easier, of course, by the fact that I don’t have more than lay knowledge of those disciplines myself.)

For technical details, of course, readers should go to the paper in the online proceedings of the conference; all errors in this summary are mine.

The elevator speech

The quick executive summary: XML parsing, and validation, can be a lot faster if performed by a multi-threaded process using optimistic concurrency.

By “optimistic concurrency”, I mean a strategy that parallelizes aggressively, even if that means doing some work speculatively, based on guesses made when parallelizing the work, guesses that might prove wrong. When the guesses are wrong, the process has to spend extra time later fixing the resulting problems. But if the speedup gained by the concurrency is great enough, it can outweigh the cost of wrong guesses. (This is a bit like the way Ethernet chooses to clean up after packet collisions, instead of attempting to prevent them the way token-ring networks do.)
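
In the abstract, the pattern looks something like this (a generic sketch of the idea as I understand it, in Python, not the Intel implementation):

    from concurrent.futures import ThreadPoolExecutor

    def optimistic(pieces, guess_context, true_context, do_work):
        """Process every piece in parallel on a guessed context, then redo
        (sequentially, for simplicity) any piece whose guess was wrong."""
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda p: do_work(p, guess_context(p)), pieces))
        for i, piece in enumerate(pieces):
            real = true_context(results, i)          # now the left context is known
            if real != guess_context(piece):         # wrong guess: pay the penalty
                results[i] = do_work(piece, real)
        return results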

A fast and simple-minded process divides the XML document into chunks, and multiple parallel threads handle multiple chunks at the same time. The fragmentary results achieved by the chunkwise parsing can be knit back together quickly; sometimes the fixup process shows that one or the other chunk needs to be reparsed.

What does Moore’s Law have to do with XML parsing?

OK, so much for the elevator speech. If you’re still interested, here is a little more detail.

First, some background. Moore’s Law says, roughly, that the number of transistors it’s possible to put on a chip doubles every eighteen months. For many years, this doubling of transistor count was accompanied by increases in clock speed. (I don’t understand the connection, myself, not being an electrical engineer. I just take it on faith.) But higher clock speeds apparently require more power and generate more heat, and eventually this causes problems for the people who actually use the chips. (Very few laptop designers are persuaded that water-cooled systems can be made to have the requisite portability. And liquid nitrogen, which would be the next step? Don’t get them started.)

So nowadays the doubling of transistors appears to be reflected in the rise of multi-core chips; dual-core chips in the current crop of off-the-shelf machines, with every expectation that the number of cores on standard chips will rise. Intel has already shipped chips with four and eight cores, although I haven’t seen any four-core machines on my list when shopping for laptops. (I’m not sure whether cores are expected to double every eighteen months indefinitely, or not; if they do, will we end up with a 1024-core chip vaguely resembling the Connection Machine in our laptops in another fifteen years?)

It used to be that performance rose about as fast as the transistor count, because the clock speed kept going up; even if software didn’t do anything smarter, it kept getting faster because the chip was faster. But to get performance benefits out of multi-core chips, a system is going to want to keep all of the cores busy whenever possible. People have been thinking about parallel computing for a long time, and at the 30,000-foot level, the answer so far seems to boil down to the simple, general principle “Gosh, that’s hard!”

Under those circumstances it seems plausible that a manufacturer of multi-core chips might see it as in the manufacturer’s own interest to show people how to multi-thread common applications, so as to make it easier to get as much performance as possible out of multi-core chips.

Parallelizing parsing

How do you parallelize the task of parsing an XML input stream? (There may be other ways to keep multiple threads busy, but this seems like the obvious choice if we can manage it.)

The answer wasn’t obvious to me. There are references to parallelism in the titles or summaries of a number of papers in Grune and Jacobs’s online bibliography on parsing techniques (part of the supplemental material to their book on Parsing Techniques), but none that leap out at me as being easy to understand.

One way to parallelize XML parsing is to run a pre-scanner over the document to select suitable places to divide it into chunks, and then to parse the chunks in parallel. (There is earlier work on this by Wei Lu et al., which Yu Wu et al. cite. They also cite work on other ways to parallelize XML parsing, but I don’t understand them and won’t try to describe them.)

The problem with (the existing state of) the pre-scanning approach, according to the Intel team, is that the pre-scanning takes a lot of time by itself, and once the parsing process itself is optimized, the overhead imposed by the pre-scanner ends up being prohibitive (or at least deplorable).

Chunking

So Yu Wu et al. use a simpler (and I guess faster) pre-scanner. They don’t attempt to find wholly optimal chunk boundaries, and they don’t attempt to ensure that the parse context at the chunk boundary is completely understood by the pre-scanner. They don’t go into great detail on the chunking method, but if I have understood it correctly, the primary criteria for division of the XML document into chunks are that (1) the chunks should be of about the same size, (2) there should be at least one chunk per thread, (3) other things being equal, fewer chunks are better (because this reduces the cost of post-processing), and (4) each chunk should start with a left angle bracket.

Until I looked at the paper again just now I thought each chunk had to begin with something that looks like a start-tag, but I can’t find that constraint in the paper. Maybe I hallucinated it.

So while I don’t know the details of the Intel pre-scanner, I imagine a pre-scanner that just looks for an angle bracket in the neighborhood of the desired chunk boundary: if you have an 800-MB document you want to divide into eight chunks (in order to allocate them to eight threads, say), my imaginary pre-scanner just looks in the vicinity of byte offset 100,000,000 for a left angle bracket, scanning forwards and/or backwards until it finds one. If we are reading the document from random-access storage, the pre-scanner doesn’t even have to scan the body of the chunk.
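
In other words, something on the order of the following (my guess at a pre-scanner, sketched in Python; the paper does not describe its chunker at this level of detail, and for simplicity this one only scans forwards):

    def chunk_boundaries(data: bytes, n_chunks: int) -> list[int]:
        """Pick nominal boundaries at equal byte offsets, then slide each one
        forward to the next left angle bracket so every chunk starts at one."""
        boundaries = [0]
        target = len(data) // n_chunks
        for i in range(1, n_chunks):
            pos = data.find(b"<", i * target)
            if pos == -1 or pos <= boundaries[-1]:
                break                                # no usable bracket: fewer, larger chunks
            boundaries.append(pos)
        return boundaries                            # chunk k runs from boundaries[k] to boundaries[k+1]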

Some readers will be saying to themselves at this point “But wait, not everything that looks like a start-tag or an end-tag is necessarily a start-tag or an end-tag. It might be in a CDATA section. It might be in a comment. What then?”

Hold that thought. You’re quite right, and we’ll come back to it.

Parallel parsing

Now, each thread is given a chunk of XML to parse, but only one thread gets to begin at the beginning of the document, so only one thread actually knows where it is. The other threads are all following the advice of the Latin poet Horace, who recommends beginning in the middle (in medias res). (I’m tempted to see if we could call this parsing technique Horatian parsing, for that reason. That’s probably not going to fly in the community at large, but it’s more compact than “hybrid parallel-processing parser” or any other descriptive phrase I can come up with denoting the algorithm presented by the Intel team, so I’ll use it in the rest of this description.)

The nature of XML, however, is that the element structure is fairly explicit in the document and doesn’t require a lot of context information. When in a chunk you see a start-tag, some content, and a matching end-tag, you know that’s an element, and you can build the appropriate data structure. But you will also run into some things that you can’t handle all by yourself: unmatched start-tags (for elements that end after the current chunk), unmatched end-tags (for elements that started in earlier chunks), and undeclared namespace prefixes (for namespace bindings declared in some ancestor element). The Horatian threads keep track of each of these phenomena, and produce as their output a convenient representation of parts of a parse tree (think DOM, not SAX), and lists of unmatched start-tags, end-tags, and namespace prefixes, together with some miscellaneous context information.

The post-processor can use the lists of unmatched bits to knit the data structures together: the last unmatched start-tag found in chunk 1 matches the first unmatched end-tag found in chunk 2, and so on.
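
Schematically, the fixup at one boundary pairs the last unmatched start-tag on the left with the first unmatched end-tag on the right and keeps going until one of the lists runs dry (this is my own toy rendering of the idea, not the data structures in the paper):

    from dataclasses import dataclass, field

    @dataclass
    class ChunkResult:
        unmatched_starts: list = field(default_factory=list)   # in document order
        unmatched_ends: list = field(default_factory=list)     # in document order

    def knit(left: ChunkResult, right: ChunkResult) -> list[tuple[str, str]]:
        pairs = []
        while left.unmatched_starts and right.unmatched_ends:
            pairs.append((left.unmatched_starts.pop(),    # last unmatched start-tag on the left
                          right.unmatched_ends.pop(0)))   # first unmatched end-tag on the right
        return pairs

    left = ChunkResult(unmatched_starts=["doc", "chapter"])
    right = ChunkResult(unmatched_ends=["chapter", "doc"])
    assert knit(left, right) == [("chapter", "chapter"), ("doc", "doc")]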

The post-processor can also use the context information to see whether any parsing needs to be redone. (Those of you who were worried about misleading angle brackets in comments and CDATA sections and so on? Remember I told you to hold that thought? OK, put it down now. This is where we deal with it.)

If chunk n ended in the middle of a comment or in a CDATA section, then the parsing of chunk n + 1 will need to be redone. But if we have divided the document into eight chunks, and one chunk (or even four) needs a second parsing, we are still running at about four times the speed of a sequential parser.
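
A caricature of that decision, with state names invented for the sketch:

    def chunks_to_reparse(end_states: list) -> list:
        """end_states[n] is the true lexical state at the end of chunk n once the
        left context is known: 'content', 'in-comment', or 'in-cdata'.  A chunk
        whose left neighbour really ends inside a comment or CDATA section was
        parsed on a wrong assumption and has to go back in the queue."""
        return [n + 1 for n, state in enumerate(end_states[:-1])
                if state in ("in-comment", "in-cdata")]

    assert chunks_to_reparse(["content", "in-comment", "content", "content"]) == [2]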

Validation

Once the chunks have been knit together (and, crucially, once all namespace prefixes are bound), the same chunking is used to perform schema-validity assessment: one thread validates chunk 1, another validates chunk 2, etc.

I won’t go into the fixup needed to knit together the various partial results; it suffices to observe that in both XSD and Relax NG, the validation needed for an element cannot be determined reliably without knowing its ancestry. The consequence is that validation cannot be performed reliably without doing the post-parsing fixup first. (You could guess which element declaration to use, I suppose, but I think the rate of wrong guesses would be too high.)

If you want to pass appinfo from the governing element declaration to the consumer, you will also in pathological cases need to know the left siblings of the element, in order to know where it occurs in the content model of its parent, so you can know which appinfo to use. I expect any schema validator designed for this kind of optimistic concurrency will therefore decline to expose appinfo from element declarations.

Performance

In theory, if parallelization imposes no overhead, then using two threads gives you a two-fold speedup, four threads a four-fold speedup, and so on.

In practice, the test results presented by the Intel group show, as one might expect, that there is some non-negligible overhead in this scheme. For four threads, their stand-alone parser shows a slightly better than two-fold speedup; for eight threads, slightly better than four-fold. Some of the overhead, clearly, is in the pre-scanning and the post-processing, but if I’m reading their graphs correctly, for eight threads neither of these processes comes close to the amount of time spent in parsing and validation, so I’m guessing that the main performance hit comes from the need to re-parse some chunks.

Small documents (which the paper describes as those smaller than 64 KB) do not benefit from the parallelization, and larger documents (larger than 1 MB) benefit more than medium-sized ones. They didn’t talk at any length about the nature of the test data, beyond saying that it was real data from real customers.

Optimization can be a real pain sometimes, and listening to people talk about optimization can be a good cure for insomnia. But sometimes, someone has a really simple Big Idea that makes sense even to someone as ignorant as me, and then it can be an awful lot of fun to hear about it and think about the implications.

Optimistic concurrency for XML processing: a Big Idea whose implications can be huge.

A way to allow both extensibility and interoperability (Balisage report 2)

[begun 11 August 2008, finished 19 August 2008]

At the versioning symposium the day before the Balisage 2008 conference, my colleague Sandro Hawke talked about a method he has developed for allowing a core language to be extended in a variety of ways while still retaining reasonably good interoperability. The details of his implementation are interesting, but what strikes me most forcibly is the underlying model of interoperability, which seems to me to bring some very useful new ideas to bear on the problem. There may be — I mean there must be — situations where these new ideas don’t help. It’s not actually clear to me how best to characterize the situations where they do help, and the situations where they don’t, but it does seem to me clear that in the situations where they do help, they can help a lot.

It took me a bit of effort to grasp the basic idea, and in fact I didn’t really get it until my evil twin Enrique started to explain it to me. The summary of the basic ideas that follows is Enrique’s, not Sandro’s, and Sandro should not be held responsible for it.

1 Start at the beginning.

“It’s a very good place to start,” I murmured. “What was that?” “Nothing.” “One more Sound of Music reference out of you and this explanation is over, got that?”

It’s obvious, and thus well understood by at least some people, that interchange is easy when everyone involved is speaking exactly the same language.

“The Lord said, ‘If as one people speaking one language they have begun to do this, then nothing they plan to do will be impossible for them.’” “And right he was. Sometimes Yahweh really knows what he’s talking about.”

2 But we have a lot of situations where we don’t want to, or can’t, speak exactly the same language.

3 In some cases, with unrelated languages, the only option seems to be translation, or non-communication.

“Non-communication?” I asked. “Always a popular option,” said Enrique. “Popular?” “Or at least, a common result. I assume it must be popular, given how many normally intelligent working group members engage in it. Do you really think that in the average working group argument, people are speaking the same language? Or even trying to?” “Well, of course they are. People working together in good faith …” “Good faith? working together? When was the last time you actually listened to what people were saying in a working group meeting?” “Don’t say that where my Working Group chairs can hear you, OK?” “You think they don’t know? Stick around, kid. Your naïveté is refreshing.”

4 An important case: We have a lot of situations where we want to support multiple versions / variants of the same vocabulary: a sequence of version 1, version 2, and version 3 is a simple example.

Another example is the definition of a core architecture and various competing (or complementary) extensions. For example, in the context of the W3C Rule Interchange Format, Sandro’s slide imagines a Core RIF dialect, extended separately by adding actions (to create a Production Rule Dialect) and by adding equality and functions (to create a Basic Logic Dialect), the Basic Logic Dialect in turn being extended by a negation-as-failure feature and a classical-negation feature (producing, respectively, a logic-programming dialect and a version of first-order logic).

“Is it an important pattern here,” I asked, “that we are talking about different dialects with overlapping expressive power?” “Possibly. What do you mean?” “The examples you mention seem to involve different dialects with semantics or functionality in common, or partially in common: you can approximate negation-as-failure with classical negation, or vice versa, if you translate directly from the one dialect to the other. If you had to translate them down into a common core language without any negation at all, you wouldn’t be able to approximate them nearly so well.” “You’re getting ahead of yourself here, but yes, that’s true of the examples so far.”

Another example is the definition of a central reference architecture and various extensions or specializations.

“You mean like HL7 and different specializations of it for general practitioners and for hospitals and other health-care institutions?” “Possibly.”

5 A common approach to such situations is a kind of hub and spoke model: specify an interchange language, an interlingua. Everybody translates into it, everybody translates out of it. So you have a 2×n problem, not an n-squared problem.

6 But in some situations, e.g. core language + multiple independent extensions, the only feasible interlingua is the core language without extensions. (Example: consider any attempt to write portable C, or portable SQL. Basic rule: stay away from vendor extensions at all costs, even if they would be a lot of help. Secondary rule: use them, but put ifdefs and so on around them.)

I.e. the interlingua is often not as expressive as the specific variants and extensions people are using. (That’s why they’re using the variants and not the interlingua.)

Put yet another way: going from the variants to the interlingua can be lossy, and often is.

7 Two things are wrong / can be wrong with an interlingua like that (or ANY interlingua solution), especially when a round trip through the interlingua produces something less useful or idiomatic than the original. (This is not logically necessary, it just almost always seems to be the case.)

Well, three.

(a) If two people (say, Ana Alicia and Ignacio) are doing blind interchange, and using the same extensions, it seems dumb for Ana Alicia to filter out all the extensions she uses, and for Ignacio not to get them. They would not have hurt him, they would have been helpful to him. The same applies if Ana and Ignacio are using similar but distinct extensions: if it’s possible to translate Ana’s extensions into a form that Ignacio can understand, that may be better than losing them entirely.

But that doesn’t work well if the interlingua can’t handle Ana Alicia’s extensions.

Sandro gave, among others, the example of the blink tag. The rules of HTML say that a browser that doesn’t understand the blink element should ignore its start- and end-tags. That makes the defined tags a sort of interlingua, and prescribes a simple (brutally simple, and also brutal) translation into the interlingua. But for text that should ideally be presented as blinking, it would be better to fall back to strong or some other form of strong highlighting, rather than rendering the contents using the same style as the surrounding text.

(b) Having extensions Ignacio doesn’t understand may or may not be a problem, depending on what Ignacio is planning on doing with the data.

(c) Even more important than (b): the best lossy translation to use may vary with Ignacio’s use of the data.

If one translation preserves the logical semantics of a set of logical rules (e.g. in RIF), but makes the rules humanly illegible, and another preserves human legibility but damages the logical content, then it matters whether Ignacio is running an inference engine or just displaying rules to users without doing any inference.

8 One advantage of direct translation pairs (the n-squared approach instead of 2×n with an interlingua) is that the output is often higher quality. This is true both for artificial and for natural languages.

“Really?” I asked. “Sure,” said Enrique. “Look at the secret history of any of the European machine-translation projects. Even the ones based on having an interlingua ended up with annotations that say in effect ‘if you’re going from Spanish to Italian, do this; if from Spanish to German, do that.’” “Where did you read that?” “Somewhere. No, I can’t cite a source. But I bet it’s true.”

But what about when we don’t know in advance what pairs are needed, or how many people are going to be playing? When we cannot prepare all n-squared translation pairs in advance? What do we do then?

9 Could we manage a system where the pairwise translation is selected dynamically, based on simpler parts (so the total cost is not quadratic but less)?

10 Yes. Here’s how, in principle.

(a) Identify one or more base interlingua notations.

“That’s got to be a centralized function, surely?” I asked. “Probably; if it’s not centralized you’re unlikely to reach critical mass. But when you’re thinking of this technique as a way to provide for the extensibility of a particular language, then it’s not likely to be an issue in practice, is it? That language itself will be the candidate interlingua.” “Then why say ‘one or more’?” I asked. “Because in principle you could have more than one. Some widely deployed language in the application domain might be a second interlingua.”

(b) Provide names for various extensions / variants / versions of the base notation(s).

This means you can describe any end point in the process as accepting a particular base language with zero or more named extensions.

“Does this need to be centralized?” I asked. “Not at all; just use normal procedures to avoid name conflicts between extensions named separately.” “You mean, use URIs?” “Yes, use URIs. Sometimes you’re not as dumb as you look, you know?”

(c) Identify well known purposes for the use of the data (display to user, logical reasoning, etc.). This can if necessary be done in a distributed way, although it’s probably most effective if at least the basic purposes are agreed upon and named up front.

(d) For each extension, say how (and whether) it can be translated into some other target language — whether the target is one of the interlinguas identified in step (a) or is an interlingua plus a particular set of extensions.

Define the translation. And quantify (somehow, probably using some subjective quality measure) how lossy it is, how good the output is, for each of the specific well known purposes.

“This,” said Enrique, “is why it’s better to centralize the naming of well known purposes. If you don’t have an agreed upon set of purposes, you end up with spotty, incomplete characterizations of various translations.” “And how is the translation to be specified?” I asked. “Any way you like. XTAN and GRDDL both use XSLT, but at this level of abstraction that’s an implementation detail. I take back what I said about your looks.”

(e) Now we have a graph of language variants, each defined as an interlingua plus extensions. Each node is a language variant. Each arc is a translation from one variant to another. Each arc has a cost associated with each well known purpose.

(f) To get from Ana Alicia’s language variant to Ignacio’s language variant, all Ignacio has to do (or Ana Alicia — whoever is taking charge of the translation) is find a path from Ana’s node in the graph

“The node representing Ana Alicia’s dialect, you mean.” “Pedant! Yes, of course that’s what I mean.”

to Ignacio’s. It need not be direct; it may involve several arcs.

(g) If there’s more than one path, Ignacio (let’s assume he is doing the translation) can choose the path that produces the highest quality output for his purposes.

Finding the optimal path through a weighted graph is not without its exciting points —

“You mean it’s hard.” “Well, of course I do. What did you think exciting meant in this context?! But only when the graph is large.”

— not (I was saying) without its exciting points, but it’s also a reasonably well understood problem, and perfectly tractable in a small graph of the kind you’re going to have in applications of this technique. Sandro’s implementation uses logical rules here to good effect.

(h) If there’s only one path, Ignacio can decide whether the output is going to do him any good, or if he should just bag the idea of reusing Ana Alicia’s data.
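
To make Enrique’s graph a little less abstract, here is a toy rendering of steps (e) through (g) in Python (my own invention, not Sandro’s XTAN code; the dialect names and cost numbers are made up, loosely following the blink example above): nodes are “interlingua plus extensions”, arcs are translations, each arc carries a quality penalty per well known purpose, and a shortest-path search picks the least lossy route for the purpose at hand.

    import heapq

    # (source variant, target variant) -> {purpose: quality penalty}; all values invented
    TRANSLATIONS = {
        ("core+blink",  "core+strong"): {"display": 1, "inference": 0},
        ("core+blink",  "core"):        {"display": 5, "inference": 0},
        ("core+strong", "core"):        {"display": 2, "inference": 0},
    }

    def best_path(start, goal, purpose):
        queue = [(0, start, [start])]
        seen = set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == goal:
                return cost, path
            if node in seen:
                continue
            seen.add(node)
            for (src, dst), penalties in TRANSLATIONS.items():
                if src == node:
                    heapq.heappush(queue, (cost + penalties[purpose], dst, path + [dst]))
        return None                                  # no route: bag the reuse idea (step h)

    print(best_path("core+blink", "core", "display"))
    # (3, ['core+blink', 'core+strong', 'core']): for display purposes it is cheaper
    # to degrade blink to strong than to drop the highlighting outright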

11 The description above is how it works in principle. But can it actually be implemented? Yes.

But for details, you’re going to have to look at Sandro’s talk on XTAN, or on other writeups he has done (cited in the talk).

International versioning symposium (Balisage report 1)

[11 August 2008]

The Balisage 2008 conference begins tomorrow. Today, there is a one-day pre-conference symposium on Versioning XML Vocabularies and Systems. So far (this version of this post is being written just after lunch), I’ve learned a lot.

Peter Brown of Pensive spoke about the nature of versions (and thus of the versioning problem) as social constructs. Plato and Aristotle thought of reality as having natural joints, and of a good conceptual apparatus as cutting reality at those joints; that would suggest that there is typically some natural distinction between x (for any x) and not-x. Nowadays, Brown suggested, things look different. A look at French and American butchers’ charts shows that even the cuts of meat are culturally variable. But if even real joints don’t necessarily determine how the cow is cut, how much less will the joints of reality (as we understand reality) fully determine an ontology. Why, Brown asked, do we identify some things as versions of other things? What is the point of the identification; what is particular about the points at which the term is applied? Are versions used for identifying particular content? for tracking evolution? for tracking differences? for identifying canonical forms of variable material? The answer, essentially, is “yes”: versions are used for all of the above. That is to say, our use of the term version is culturally and functionally determined. He offered a number of useful bits of advice, most saliently perhaps the advice: “If anyone mentions versioning, assume nothing! In particular, don’t assume that you know what they mean.”

David Orchard talked about the fundamental problems of versioning from a down to earth point of view, with particular emphasis on the issues of forwards and backwards compatibility, their definition in terms of the sets of documents produced and accepted by various agents, and their effect on deployment of protocols and software. He talked about specific things vocabulary designers can do to make it easier to extend and evolve their languages.

A couple of points made in the discussion after David’s talk may be worth recording. He had pointed to the difficulty some schema authors experience in writing a schema which allows extension in all the right places, while still making clear what sort of messages are actually expected in the absence of extension; there is no law, however, that says a language or a protocol spec must be defined by a single schema. If you want producers of v.1 messages to produce only a particular kind of messages, and v.1 consumers to accept all of those messages, but also a larger set of messages (so that they will react calmly when confronted with v.2 messages), then you could do worse than to define two schemas, one for the producer language and one for the acceptor language.

Marc de Graauw picked up the notion of compatibility, and the idea of defining compatibility in terms of set theory, and gave a thorough exposition of how they interrelate. He gave short shrift to the idea of defining compatibility in terms of a formal logic like (for example) OWL; it’s way too complicated to contemplate a serious logical definition of something as complicated as a real-life vocabulary like HL7 v3. Also, definitions like that are hard to reconcile with the idea of consumers or producers of messages as black boxes. Instead, he suggested defining the semantics of a vocabulary in terms of behavior. His slides pushed Venn diagrams for message sets just about as far as they can go.

Laurel Shifrin of Lexis / Nexis gave the first of a series of case studies, describing the various mechanisms Lexis / Nexis has adopted to try to manage vocabulary change in an extremely heterogeneous and decentralized organization. Answer: it’s not easy, and you really do benefit from executive buy-in. (Again, there was a lot more; I should try to elaborate this description further, later on.)

Debbie Lapeyre spoke about a vocabulary for (mostly biomedical) journal content used by the U.S. National Library of Medicine. The perceived stability and backwards-compatibility guarantees of the vocabulary were among the key reasons for the success of the vocabulary, and its wide adoption among a larger community. But over time, patience with the shortcomings and outright errors in the vocabulary waned, and even the user community was asking for a revision which would include incompatible changes. I had the impression that Debbie was a little surprised that the members of the responsible group were not lynched, but that she was glad that they weren’t, and that the new version can repair some problems she has had on her conscience as a vocabulary designer for a while.

Ken Holman talked about the way versioning problems are handled in UBL (the universal business language). UBL has moved to a two-layer validation story: the only normative definition of UBL documents is given by the XSD schema documents, and business rules are enforced in a separate business-rule validation step (rather than customizing local versions of the schema documents to incorporate business rules). Perhaps the most surprising restriction, for some auditors, may be the UBL policy that its schema documents will never use xsd:choice constructions. Life is made harder by the fact that many consumers of UBL documents appear (by Ken’s account) to use systems which simply will not or cannot show their users invalid documents. It’s hard to make informed decisions about how to recover from a problem, if you’re using a system that won’t show you the problematic data. Fortunately, UBL has some ingenious people.