An ineffable moment

[2008-12-31+07:00 / 2009-01-01Z]

If we add it all up, the XML Schema Working Group has spent a lot of time, since the beginning of the group, worrying about leap seconds.

XSD 1.0 attempts to accommodate them in its descriptions of the date/time types, but leaves some aspects of behavior unspecified. Accordingly, implementations of XSD 1.0 vary wildly in how carefully they handle leap seconds; not every implementation comes with built-in knowledge of when leap seconds have occurred in the past, and not every implementation enforces the rules which specify that, when they occur in the future, leap seconds will occur only at midnight, UTC, at the end of a month. You can read some things on the Web that seem to imply leap seconds can only occur at the end of June or December, or possibly also March and September, but that’s not what the relevant spec says. The quarter days are to be preferred, but in principle a leap second could occur at the end of any month. You can also read things on the Web that suggest many people are uncertain whether the responsible authorities could in principle insert (or delete) two leap seconds at a time, or even more. I was unsure myself, until some time after I read the relevant specification it finally dawned on me that the answer is no.

[“The relevant spec?” asked Enrique. “Which is that?” “Well, if you want to be precise, it’s Recommendation ITU-R TF.460-6: Standard-frequency and time-signal emissions, published by the International Telecommunications Union (Geneva: ITU, February 2002). I don’t remember how I managed to acquire a copy.” “And how can you be sure there can never be adjacent leap seconds, if it doesn’t say that flat out?” “What it says is that leap seconds are to be added at the end of some month, UTC, in order to keep UTC within 0.8 seconds of the appropriate solar time measure (maybe UT1 or UT2, but it’s been a while and I forget the details). If after adding two leap seconds, UTC is within 0.8 seconds of solar time, then before they were added it must have been more than 0.8 seconds off. But it’s not allowed to be more than 0.8 seconds off — that’s the point of adding or deleting leap seconds.” “What if there was a large change between January and June?” insisted Enrique. “Then the spec implies that a leap second should be added between January and June. The spec does not limit the insertion of leap seconds to December 31 and June 30, it just says to prefer those dates. I think the implication is pretty clear that if you need to add a leap second at the end of May, you are supposed to do so.”

[“Yeah, but what if the world slowed down by two seconds in the course of a single month? Isn’t that logically possible?” “Logically possible, yeah, I guess so. But astronomically implausible. If the rotation of the earth starts to vary that much, it’s likely to be because a large asteroid just hit us, or something. Under those circumstances, schema-validity is likely to be the least of our worries” “Well, my point exactly,” said Enrique. “If the world is falling apart, that’s the last time you want your systems to start failing because the schema validator doesn’t like your time stamps. There will be more important things to be worrying about!”]

In developing XSD 1.1, we spent a lot of time trying to nail things down better, but ultimately reached the conclusion that there just was no good way to allow all real leap seconds and only real leap seconds, to handle validation of dateTime values for the future, and to maintain the principle that a document’s schema-validity against a given schema is the same today and in the future; it should not change from day to day depending on decisions made by the managers of Universal Coordinated Time. In the end, we said that XSD 1.1 processors just don’t handle leap seconds at all: the moments in the global time-line which are occupied by leap seconds do not correspond to values in the xsd:dateTime value space.

It’s an important principle of schema design (and of the use of other formalisms as well, I think) that in the general case, what the formal notation can express may be only an approximation to the reality you are modeling. Some things may exist without being able to be spoken. Mostly we mean by that that specific rules that apply in a given context may not be expressible in a given formal notation, since the expressive power of the formalism may be hobbled in order to preserve its tractability. It’s nice, I think, that the principle is also instantiated by the dateTime type: there are some moments of UTC time that cannot be captured as values in the dateTime value space.

All this is on my mind, of course, because one of those moments is scheduled to occur today. At midnight UTC. Any moment, now, in fact.


[“Shouldn’t it be midnight local time?” hissed Enrique. “No, you’re thinking of shifting to and from Daylight Savings Time. Leap seconds are inserted at the same moment all around the globe. Hush, now, don’t spoil it. Just wait and watch.”]

And there it went. Midnight UTC has passed, and the sequence of seconds shown by the applet at for Mountain Time went:

  • 16:59:56
  • 16:59:57
  • 16:59:58
  • 16:59:59
  • 16:59:60 [That’s it! That’s it!! “Hey, come look at this!” I wanted to call to my wife.]
  • 17:00:00 [“No, wait, never mind. It’s over already.”
  • 17:00:01

All of these past weeks, as events in W3C and in the economy and in the world have gone from bad to worse, I’ve been waiting impatiently to shake the dust of this year from my feet, and yearning for 2009 and a new leaf. The new year will surely be hard in many ways, I tell myself, but it cannot be as bad as the year just ending. As far as I can tell, I am not alone; 2009 has a heavy freight of hope and expectations to carry. A heavier freight than it’s fair to ask any year to bear.

So I like the idea that between the old year, so widely and deservingly anathematized, and the new one which carries so much fragile hope, time paused, just for a second, to gather its forces before picking up its burden and marching forward again.

Happy New Year, o my readers. Happy New Year.

Further notes on optimistic concurrency and XML parsing

[22 August 2008]

I just posted some notes on a paper given at Balisage 2008 by Yu Wu et al. of Intel.

A few thoughts occurred to me in writing up those notes which might merit separate consideration.

How effective could pessimization be?

A key part of the optimistic concurrency algorithm presented by Yu Wu et al. is that the process of chunking the document needs to be quick. So they make some guesses, when chunking, that could later be proven wrong; in that case, the chunk needs to be re-parsed.

I suppose the worse-case scenario here is that a sufficiently lucky and malignant adversary could construct a document in which the context at the end of chunk 1 means that chunk 2 needs to be reparsed, and the reparsing of chunk 2 reveals for the first time that chunk 3 now needs to be reparsed, and so on, so that in the end you end up using n time slices to parse n chunks, instead of n divided by the number of threads.

So there’s an interesting question: how long can we keep this up?

It’s pretty clear that if we know exactly where the pre-scanner will break the chunks, then we can devise an XML document that forces chunk 2 to be reparsed. Can we construct a document in which only the second, correct parse of chunk 2 reveals that chunk 3 now needs to be reparsed (i.e. in which the first parse of chunk 2 makes chunk 3 look OK, and the second one shows that it’s not OK)?

Can we make a document in which every time we reparse a chunk with the correct context, we discover that the next chunk also needs to be reparsed? How much reworking can an omniscient and malevolent XML author cause this algorithm to do? Remember that comments and CDATA sections do not nest; the worst I can figure out off hand is that a comment or CDATA section begins in chunk 1 and doesn’t end until the last chunk.

How many chunks do you want?

The paper says fewer chunks are better than many chunks (to reduce post-processing costs), and that you want at least as many chunks as there are threads (to ensure that all cores can be busy). To simplify the examples I’ve been thinking about, I’ve been imagining that if I have eight threads, I’ll make eight chunks.

But if I’ve read the performance data and charts right, the biggest single reason the Horatian parser is not getting an eight-fold speedup when using eight threads is the need to reparse some chunks, owing to bad guesses about parse context made during the first parse. If we have eight threads and eight chunks, everything is fine for the first pass over the chunks. But if we need to reparse two of the chunks, then it rather looks as if six threads might be sitting idle waiting for the re-parsing to finish.

I wonder: would you get better results if you had shorter chunks, and more of them, to keep more threads busy longer? What you want is enough chunks to ensure that while you are reparsing some chunks, you still have other chunks for the other threads to parse.

As a first approximation, imagine that we have eight threads. Instead of eight chunks, we make fourteen chunks, and give the first eight of them to the eight threads. Let’s say two of them need to be reparsed; the reparsing goes on at the same time that the remaining six threads parse the remaining six chunks. The minimal path through the speculative parsing step remains the time it takes to parse two chunks, but the chunks are somewhat smaller now. The only question is how much additional time the post-processing step will now take, given that it has fourteen and not eight chunks to knit together.

And of course you need to bear in mind that if one chunk in four turns out to need re-parsing, then three or four out of the fourteen chunks are going to need reparsing, not just two. By the time you factor that in, and try to ensure that your last round of parsing doesn’t generate any new re-parse requests, things have gotten more complicated than I can conveniently deal with here (or elsewhere).

Maybe that’s why the Intel paper was so non-committal on the way to choose how many chunks to make in the first place: it can get pretty complicated pretty fast.

Optimization and context independence in schema languages

One of the things that intrigues me about these results is that so much of what people have said needs to be done to schema languages to ensure that validation can be fast has nothing much to do with the speed gains shown by optimistic concurrency.

I thought for a while that this work did benefit from the fact that elements can be validated against XSD types without knowledge of their context (no reference to ancestors or siblings in any assertions, for example), but on reflection I’m not sure this is true: in order to find the right element declaration and type definition to bind an instance element, you need to know (a) the expanded name of the element (which means knowing the in-scope namespaces, which in practice means having looked at all of the ancestors of the element), and (b) the type assigned to the element’s parent (unless this element is itself the validation root). Once you have a type, it’s true that validation is independent of context. But the assignment of a type to an element or attribute does depend, in the normal case, on the context. It’s not clear to me that allowing upward-pointing XPath expressions in assertions or conditional type assignment would make much difference.

To really exploit parallelism in validation, it would seem you want to eliminate the variable binding of expanded names to element declarations and to types.

Back to DTDs plus datatypes, anyone?

Optimistic concurrency and XML parsing and validation (Balisage report 3, in which chronology is abandoned)

[22 August 2008]

My brief hope (it would be misleading to refer to it as a “plan”) to report daily from Balisage 2008 has bitten the dust — it did that shortly after noon on the first day, when my account of Sandro Hawke’s work on XTAN turned out to take more time than I had available — but there is still a lot to say. I’m going to abandon the chronological approach, however, and write about things that come to mind, in a more or less random order.

One of my favorite papers this year was the one submitted by Yu Wu, Qi Zhang, Zhiqiang Yu, and Jianhui Li, of Intel, under the slightly daunting title “A Hybrid Parallel Processing for XML Parsing and Schema Validation”. (I think they are all members of the XML Engineering Team at the Intel China Software Center in Shanghai, but I am not sure I’ve read all the affiliation info correctly; I keep being distracted by the implications of an Intel software center having an XML engineering team.)

When I paraphrased this paper to a friend recently, her response was “Wow! That’s a lot more accessible than I would have guessed from the title.” So perhaps it’s worth while to try to record here the high points of the work, in a way that’s accessible to people to people with no more than lay knowledge of the relevant technical disciplines. (This goal is made easier, of course, by the fact that I don’t have more than lay knowledge of those disciplines myself.)

For technical details, of course, readers should go to the paper in the online proceedings of the conference; all errors in this summary are mine.

The elevator speech

The quick executive summary: XML parsing, and validation, can be a lot faster if performed by a multi-threaded process using optimistic concurrency.

By “optimistic concurrency”, I mean a strategy that parallelizes aggressively, even if that means doing some work speculatively, based on guesses made when parallelizing the work, guesses that might prove wrong. When the guesses are wrong, the process has to spend extra time later fixing the resulting problems. But if the speedup gained by the concurrency is great enough, it can outweigh the cost of wrong guesses. (This is a bit like the way Ethernet chooses to clean up after packet collisions, instead of attempting to prevent them the way token-ring networks do.)

A fast and simple-minded process divides the XML document into chunks, and multiple parallel threads handle multiple chunks at the same time. The fragmentary results achieved by the chunkwise parsing can be knit back together quickly; sometimes the fixup process shows that one or the other chunk needs to be reparsed.

What does Moore’s Law have to do with XML parsing?

OK, so much for the elevator speech. If you’re still interested, here is a little more detail.

First, some background. Moore’s Law says, roughly, that the number of transistors it’s possible to put on a chip doubles every eighteen months. For many years, this doubling of transistor count was accompanied by increases in clock speed. (I don’t understand the connection, myself, not being an electrical engineer. I just take it on faith.) But higher clock speeds apparently require more power and generate more heat, and eventually this causes problems for the people who actually use the chips. (Very few laptop designers are persuaded that water-cooled systems can be made to have the requisite portability. And liquid nitrogen, which would be the next step? Don’t get them started.)

So nowadays the doubling of transistors appears to be reflected in the rise of multi-core chips; dual-core chips in the current crop of off-the-shelf machines, with every expectation that the number of cores on standard chips will rise. Intel has already shipped chips with four and eight cores, although I haven’t seen any four-core machines on my list when shopping for laptops. (I’m not sure whether cores are expected to double every eighteen months indefinitely, or not; if they do, will we end up with a 1024-core chip vaguely resembling the Connection Machine in our laptops in another fifteen years?)

It used to be that performance rose about as fast as the transistor count, because the clock speed kept going up; even if software didn’t do anything smarter, it kept getting faster because the chip was faster. But to get performance benefits out of multi-core chips, a system is going to want to keep all of the cores busy whenever possible. People have been thinking about parallel computing for a long time, and at the 30,000-foot level, the answer so far seems to boil down to the simple, general principle “Gosh, that’s hard!”

Under those circumstances it seems plausible that a manufacturer of multi-core chips might see it as in the manufacturer’s own interest to show people how to multi-thread common applications, so as to make it easier to get as much performance as possible out of multi-core chips.

Parallelizing parsing

How do you parallelize the task of parsing an XML input stream? (There may be other ways to keep multiple threads busy, but this seems like the obvious choice if we can manage it.)

The answer wasn’t obvious to me. There are references to parallelism in the titles or summaries of a number of papers in Grune and Jacobs’s online bibliography on parsing techniques (part of the supplemental material to their book on Parsing Techniques), but none that leap out at me as being easy to understand.

One way to parallelize XML parsing is to run a pre-scanner over the document to select suitable places to divide it into chunks, and then to parse the chunks in parallel. (There is earlier work on this by Wei Lu et al., which Yu Wu et al. cite. They also cite work on other ways to parallelize XML parsing, but I don’t understand them and won’t try to describe them.)

The problem with (the existing state of) the pre-scanning approach, according to the Intel team, is that the pre-scanning takes a lot of time by itself, and once the parsing process itself is optimized, the overhead imposed by the pre-scanner ends up being prohibitive (or at least deplorable).


So Yu Wu et al. use a simpler (and I guess faster) pre-scanner. They don’t attempt to find wholly optimal chunk boundaries, and they don’t attempt to ensure that the parse context at the chunk boundary is completely understood by the pre-scanner. They don’t go into great detail on the chunking method, but if I have understood it correctly, the primary criteria for division of the XML document into chunks are that (1) the chunks should be of about the same size, (2) there should be at least one chunk per thread, (3) other things being equal, fewer chunks is better (because this reduces the cost of post-processing), and (4) each chunk should start with a left angle bracket.

Until I looked at the paper again just now I thought each chunk had to begin with something that looks like a start-tag, but I can’t find that constraint in the paper. Maybe I hallucinated it.

So while I don’t know the details of the Intel pre-scanner, I imagine a pre-scanner that just looks for an angle bracket in the neighborhood of the desired chunk boundary: if you have an 800-MB document you want to divide into eight chunks (in order to allocate them to eight threads, say), my imaginary pre-scanner just looks in the vicinity of byte offset 100,000,000 for a left angle bracket, scanning forwards and/or backwards until it finds one. If we are reading the document from random-access storage, the pre-scanner doesn’t even have to scan the body of the chunk.

Some readers will be saying to themselves at this point “But wait, not everything that looks like a start-tag or an end-tag is necessarily a start-tag or an end-tag. It might be in a CDATA section. It might be in a comment. What then?”

Hold that thought. You’re quite right, and we’ll come back to it.

Parallel parsing

Now, each thread is given a chunk of XML to parse, but only one thread gets to begin at the beginning of the document, so only one thread actually knows where it is. The other threads are all following the advice of the Latin poet Horace, who recommends beginning in the middle (medias in res). (I’m tempted to see if we could call this parsing technique Horatian parsing, for that reason. That’s probably not going to fly in the community at large, but it’s more compact than “hybrid parallel-processing parser” or any other descriptive phrase I can come up with denoting the algorithm presented by the Intel team, so I’ll use it in the rest of this description.)

The nature of XML, however, is that the element structure is fairly explicit in the document and doesn’t require a lot of context information. When in a chunk you see a start-tag, some content, and a matching end-tag, you know that’s an element, and you can build the appropriate data structure. But you will also run into some things that you can’t handle all by yourself: unmatched start-tags (for elements that end after the current chunk), unmatched end-tags (for elements that started in earlier chunks), and undeclared namespace prefixes (for namespace bindings declared in some ancestor element). The Horatian threads keep track of each of these phenomena, and produce as their output a convenient representation of parts of a parse tree (think DOM, not SAX), and lists of unmatched start-tags, end-tags, and namespace prefixes, together with some miscellaneous context information.

The post-processor can use the lists of unmatched bits to knit the data structures together: the last unmatched start-tag found in chunk 1 matches the first unmatched end-tag found in chunk 2, and so on.

The post-processor can also use the context information to see whether any parsing needs to be redone. (Those of you who were worried about misleading angle brackets in comments and CDATA sections and so on? Remember I told you to hold that thought? OK, put it down now. This is where we deal with it.)

If chunk n ended in the middle of a comment or in a CDATA section, then the parsing of chunk n + 1 will need to be redone. But if we have divided the document into eight chunks, and one chunk, or four, need a second parsing, we are still running at about four times the speed of a sequential parser.


Once the chunks have been knit together (and, crucially, once all namespace prefixes are bound), the same chunking is used to perform schema-validity assessment: one thread validates chunk 1, another validates chunk 2, etc.

I won’t go into the fixup needed to knit together the various partial results; it suffices to observe that in both XSD and Relax NG, the validation needed for an element cannot be determined reliably without knowing its ancestry. The consequence is that validation cannot be performed reliably without doing the post-parsing fixup first. (You could guess which element declaration to use, I suppose, but I think the rate of wrong guesses would be too high.)

If you want to pass appinfo from the governing element declaration to the consumer, you will also in pathological cases need to know the left siblings of the element, in order to know where it occurs in the content model of its parent, so you can know which appinfo to use. I expect any schema validator designed for this kind of optimistic concurrency will therefore decline to expose appinfo from element declarations.


In theory, if parallelization imposes no overhead, then by using two threads you get a two-fold speedup, four threads gets a four-fold speedup, etc.

In practice, the test results presented by the Intel group show, as one might expect, that there is some non-negligeable overhead in this scheme. For four threads, their stand-alone parser shows a slightly better than two-fold speedup; for eight threads, slightly better than four-fold. Some of the overhead, clearly, is in the pre-scanning and the post-processing, but if I’m reading their graphs correctly, for eight threads neither of these processes comes close to the amount of time spent in parsing and validation, so I’m guessing that the main performance hit comes from the need to re-parse some chunks.

Small documents (which the paper describes as those smaller than 64 KB) do not benefit from the parallelization, and larger documents (larger than 1 MB) benefit more than medium-sized ones. They didn’t talk at any length about the nature of the test data, beyond saying that it was real data from real customers.

Optimization can be a real pain sometimes, and listening to people talk about optimization can be a good cure for insomnia. But sometimes, someone has a really simple Big Idea that makes sense even to someone as ignorant as me, and then it can be an awful lot of fun to hear about it and think about the implications.

Optimistic concurrency for XML processing: a Big Idea whose implications can be huge.

Two styles of Monte Carlo testing for grammars

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only thing strings that grammar won’t accept are strings with two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI value against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127, and then concatenating the corresponding characters, then you generate an awful lot of uninteresting test cases, and you can go a long long time before you actually exercise a specific part of your grammar.

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of string. For example, “[92, 21, 48, 6, 41, 69, 47”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings ”}”, ”a”, ”2”, ”)”, ”|xyz”, ”|”, ”[\p{IsBasicLatin}-[aeiou]]”, and ”(”.
    Together, these yield the test string “ }a2)|xyz|[\p{IsBasicLatin}-[aeiou]](

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.

If I were a better mathematician, I might be able to characterize the difference. But in a purely informal way, I think one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, for most non-terminals some or most random character sequences will fail to match the non-terminal. Test cases generated by selecting characters at random will have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By making a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick and dirty test generator, the code for the test generator is at; the parser for the grammars I was testing is in the other files of the same directory.

Thinking about schema mappings

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in details. That is, like him their idea is to translate into a single common system of markup, so that the collection they are searching uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but were planning simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we discussed briefly the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course as I say it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps.

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary.

Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.