A way to allow both extensibility and interoperability (Balisage report 2)

[begun 11 August 2008, finished 19 August 2008]

At the versioning symposium the day before the Balisage 2008 conference, my colleague Sandro Hawke talked about a method he has developed for allowing a core language to be extended in a variety of ways while still retaining reasonably good interoperability. The details of his implementation are interesting, but what strikes me most forcibly is the underlying model of interoperability, which seems to me to bring some very useful new ideas to bear on the problem. There may be — I mean there must be — situations where these new ideas don’t help. It’s not actually clear to me how best to characterize the situations where they do help, and the situations where they don’t, but it does seem to me clear that in the situations where they do help, they can help a lot.

It took me a bit of effort to grasp the basic idea, and in fact I didn’t really get it until my evil twin Enrique started to explain it to me. The summary of the basic ideas that follows is Enrique’s, not Sandro’s, and Sandro should not be held responsible for it.

1 Start at the beginning.

“It’s a very good place to start,” I murmured. “What was that?” “Nothing.” “One more Sound of music reference out of you and this explanation is over, got that?”

It’s obvious, and thus well understood by at least some people, that interchange is easy when everyone involved is speaking exactly the same language.

“The Lord said, ‘If as one people speaking one language they have begun to do this, then nothing they plan to do will be impossible for them.’” “And right he was. Sometimes Yahweh really knows what he’s talking about.”

2 But we have a lot of situations where we don’t want to, or can’t, speak exactly the same language.

3 In some cases, with unrelated languages, the only option seems to be translation, or non-communication.

“Non-communication?” I asked. “Always a popular option,” said Enrique. “Popular?” “Or at least, a common result. I assume it must be popular, given how many normally intelligent working group members engage in it. Do you really think that in the average working group argument, people are speaking the same language? Or even trying to?” “Well, of course they are. People working together in good faith …” “Good faith? working together? When was the last time you actually listened to what people were saying in a working group meeting?” “Don’t say that where my Working Group chairs can hear you, OK?” “You think they don’t know? Stick around, kid. Your naïvete is refreshing.”

4 An important case: We have a lot of situations where we want to support multiple versions / variants of the same vocabulary: a sequence of version 1, version 2, and version 3 is a simple example.

Another example is the definition of a core architecture and various competing (or complementary) extensions. For example, in the context of the W3C Rule Interchange Format, Sandro’s slide imagines a Core RIF dialect, extended separately by adding actions (to create a Production Rule Dialect) and by adding equality and functions (to create a Basic Logic Dialect), the Basic Logic Dialect in turn being extended by a negation-as-failure feature and a classical-negation feature (producing, respectively, a logic-programming dialect and a version of first-order logic).

“Is it an important pattern here,” I asked, “that we are talking about different dialects with overlapping expressive power?” “Possibly. What do you mean?” “The examples you mention seem to involve different dialects with semantics or functionality in common, or partially in common: you can approximate negation-as-failure with classical negation, or vice versa, if you translate directly from the one dialect to the other. If you had to translate them down into a common core language without any negation at all, you wouldn’t be able to approximate them nearly so well.” “You’re getting ahead of yourself here, but yes, that’s true of the examples so far.”

Another example is the definition of a central reference architecture and various extensions or specializations.

“You mean like HL7 and different specializations of it for general practitioners and for hospitals and other health-care institutions?” “Possibly.”

5 A common approach to such situations is a kind of hub and spoke model: specify an interchange language, an interlingua. Everybody translates into it, everybody translates out of it. So you have a 2×n problem, not an n-squared problem.

6 But in some situations, e.g. core language + multiple independent extensions, the only feasible interlingua is the core language without extensions. (Example: consider any attempt to write portable C, or portable SQL. Basic rule: stay away from vendor extensions at all costs, even if they would be a lot of help. Secondary rule: use them, but put ifdefs and so on around them.)

I.e. the interlingua is often not as expressive as the specific variants and extensions people are using. (That’s why they’re using the variants and not the interlingua.)

Put yet another way: going from the variants to the interlingua can be lossy, and often is.

7 Two things are wrong / can be wrong with an interlingua like that (or ANY interlingua solution), especially when a round trip through the interlingua produces something less useful or idiomatic than the original. (This is not logically necessary, it just almost always seems to be the case.)

Well, three.

(a) If two people (say, Ana Alicia and Ignacio) are doing blind interchange, and using the same extensions, it seems dumb for Ana Alicia to filter out all the extensions she uses, and for Ignacio not to get them. They would not have hurt him, they would have been helpful to him. The same applies if Ana and Ignacio are using similar but distinct extensions: if it’s possible to translate Ana’s extensions into a form that Ignacio can understand, that may be better than losing them entirely.

But that doesn’t work well if the interlingua can’t handle Ana Alicia’s extensions.

Sandro gave, among others, the example of the blink tag. The rules of HTML say that a browser that doesn’t understand the blink element should ignore its start- and end-tags. That makes the defined tags a sort of interlingua, and prescribes a simple (brutally simple, and also brutal) translation into the interlingua. But for text that should ideally be presented as blinking, it would be better to fall back to strong or some other form of prominent highlighting, rather than rendering the contents using the same style as the surrounding text.

(b) Having extensions Ignacio doesn’t understand may or may not be a problem, depending on what Ignacio is planning on doing with the data.

(c) Even more important than (b): the best lossy translation to use may vary with Ignacio’s use of the data.

If one translation preserves the logical semantics of a set of logical rules (e.g. in RIF), but makes the rules humanly illegible, and another preserves human legibility but damages the logical content, then it matters whether Ignacio is running an inference engine or just displaying rules to users without doing any inference.

8 One advantage of direct translation pairs (the n-squared approach instead of 2×n with an interlingua) is that the output is often higher quality. This is true both for artificial and for natural languages.

“Really?” I asked. “Sure,” said Enrique. “Look at the secret history of any of the European machine-translation projects. Even the ones based on having an interlingua ended up with annotations that say in effect ‘if you’re going from Spanish to Italian, do this; if from Spanish to German, do that.’” “Where did you read that?” “Somewhere. No, I can’t cite a source. But I bet it’s true.”

But what if we don’t know in advance what pairs are needed, or how many people are going to be playing, and thus cannot prepare all n-squared translation pairs in advance? What do we do then?

9 Could we manage a system where the pairwise translation is selected dynamically, based on simpler parts (so the total cost is not quadratic but less)?

10 Yes. Here’s how, in principle.

(a) Identify one or more base interlingua notations.

“That’s got to be a centralized function, surely?” I asked. “Probably; if it’s not centralized you’re unlikely to reach critical mass. But when you’re thinking of this technique as a way to provide for the extensibility of a particular language, then it’s not likely to be an issue in practice, is it? That language itself will be the candidate interlingua.” “Then why say ‘one or more’?” I asked. “Because in principle you could have more than one. Some widely deployed language in the application domain might be a second interlingua.”

(b) Provide names for various extensions / variants / versions of the base notation(s).

This means you can describe any end point in the process as accepting a particular base language with zero or more named extensions.

“Does this need to be centralized?” I asked. “Not at all; just use normal procedures to avoid name conflicts between extensions named separately.” “You mean, use URIs?” “Yes, use URIs. Sometimes you’re not as dumb as you look, you know?”

(c) Identify well known purposes for the use of the data (display to user, logical reasoning, etc.). This can if necessary be done in a distributed way, although it’s probably most effective if at least the basic purposes are agreed upon and named up front.

(d) For each extension, say how (and whether) it can be translated into some other target language — whether the target is one of the interlinguas identified in step (a) or is an interlingua plus a particular set of extensions.

Define the translation. And quantify (somehow, probably using some subjective quality measure) how lossy it is, how good the output is, for each of the specific well known purposes.

“This,” said Enrique, “is why it’s better to centralize the naming of well known purposes. If you don’t have an agreed upon set of purposes, you end up with spotty, incomplete characterizations of various translations.” “And how is the translation to be specified?” I asked. “Any way you like. XTAN and GRDDL both use XSLT, but at this level of abstraction that’s an implementation detail. I take back what I said about your looks.”

(e) Now we have a graph of language variants, each defined as an interlingua plus extensions. Each node is a language variant. Each arc is a translation from one variant to another. Each arc has a cost associated with each well known purpose.

(f) To get from Ana Alicia’s language variant to Ignacio’s language variant, all Ignacio has to do (or Ana Alicia — whoever is taking charge of the translation) is find a path from Ana’s node in the graph

“The node representing Ana Alicia’s dialect, you mean.” “Pedant! Yes, of course that’s what I mean.”

to Ignacio’s. It need not be direct; it may involve several arcs.

(g) If there’s more than one path, Ignacio (let’s assume he is doing the translation) can choose the path that produces the highest quality output for his purposes.

Finding the optimal path through a weighted graph is not without its exciting points —

“You mean it’s hard?” “Well, of course I do. What did you think exciting meant in this context?! But only when the graph is large.”

— not (I was saying) without its exciting points, but it’s also a reasonably well understood problem, and perfectly tractable in a small graph of the kind you’re going to have in applications of this technique. Sandro’s implementation uses logical rules here to good effect.

(h) If there’s only one path, Ignacio can decide whether the output is going to do him any good, or if he should just bag the idea of reusing Ana Alicia’s data.
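To make the path-finding idea in (e) through (g) concrete, here is a rough sketch in Python — not Sandro’s XTAN implementation, just the shape of the idea, with all the variant names, purposes, and costs invented for illustration. Each arc carries a cost per well-known purpose, and whoever does the translation picks the cheapest chain of translations for the purpose at hand.

    import heapq

    # (from_variant, to_variant) -> {purpose: cost}; lower cost = better output.
    # Variants, purposes, and numbers are all made up for illustration.
    TRANSLATIONS = {
        ("core+naf", "core+classneg"): {"inference": 2, "display": 1},
        ("core+naf", "core"):          {"inference": 9, "display": 3},
        ("core+classneg", "core"):     {"inference": 8, "display": 3},
        ("core", "core+classneg"):     {"inference": 0, "display": 0},
    }

    def best_path(source, target, purpose):
        """Find the chain of translations from source to target that loses
        the least for the given purpose (plain Dijkstra over the arc costs)."""
        queue = [(0, source, [source])]
        seen = set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == target:
                return cost, path
            if node in seen:
                continue
            seen.add(node)
            for (frm, to), costs in TRANSLATIONS.items():
                if frm == node and purpose in costs:
                    heapq.heappush(queue, (cost + costs[purpose], to, path + [to]))
        return None   # no path at all: the situation of point (h)

    # Ignacio, running an inference engine, wants the path cheapest for inference:
    print(best_path("core+naf", "core+classneg", "inference"))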

11 The description above is how it works in principle. But can it actually be implemented? Yes.

But for details, you’re going to have to look at Sandro’s talk on XTAN, or on other writeups he has done (cited in the talk).

International versioning symposium (Balisage report 1)

[11 August 2008]

The Balisage 2008 conference begins tomorrow. Today, there is a one-day pre-conference symposium on Versioning XML Vocabularies and Systems. So far (this version of this post is being written just after lunch), I’ve learned a lot.

Peter Brown of Pensive spoke about the nature of versions (and thus of the versioning problem) as social constructs. Plato and Aristotle thought of reality as having natural joints, and of a good conceptual apparatus as cutting reality at those joints; that would suggest that there is typically some natural distinction between x (for any x) and not-x. Nowadays, Brown suggested, things look different. A look at French and American butchers’ charts shows that even the cuts of meat are culturally variable. But if even real joints don’t necessarily determine how the cow is cut, how much less will the joints of reality (as we understand reality) fully determine an ontology. Why, Brown asked, do we identify some things as versions of other things? What is the point of the identification; what is particular about the points at which the term is applied? Are versions used for identifying particular content? for tracking evolution? for tracking differences? for identifying canonical forms of variable material? The answer, essentially, is “yes”: versions are used for all of the above. That is to say, our use of the term version is culturally and functionally determined. He offered a number of useful bits of advice, most saliently perhaps the advice: “If anyone mentions versioning, assume nothing! In particular, don’t assume that you know what they mean.”

David Orchard talked about the fundamental problems of versioning from a down to earth point of view, with particular emphasis on the issues of forwards and backwards compatibility, their definition in terms of the sets of documents produced and accepted by various agents, and their effect on deployment of protocols and software. He talked about specific things vocabulary designers can do to make it easier to extend and evolve their languages.

A couple of points made in the discussion after David’s talk may be worth recording. He had pointed to the difficulty some schema authors experience in writing a schema which allows extension in all the right places, while still making clear what sort of messages are actually expected in the absence of extension; there is no law, however, that says a language or a protocol spec must be defined by a single schema. If you want producers of v.1 messages to produce only a particular kind of messages, and v.1 consumers to accept all of those messages, but also a larger set of messages (so that they will react calmly when confronted with v.2 messages), then you could do worse than to define two schemas, one for the producer language and one for the acceptor language.

Marc de Graauw picked up the notion of compatibility, and the idea of defining compatibility in terms of set theory, and gave a thorough exposition of how they interrelate. He gave short shrift to the idea of defining compatibility in terms of a formal logic like (for example) OWL; it’s way too complicated to contemplate a serious logical definition of something as complicated as a real-life vocabulary like HL7 v3. Also, definitions like that are hard to reconcile with the idea of consumers or producers of messages as black boxes. Instead, he suggested defining the semantics of a vocabulary in terms of behavior. His slides pushed Venn diagrams for message sets just about as far as they can go.

Laurel Shifrin of Lexis / Nexis gave the first of a series of case studies, describing the various mechanisms Lexis / Nexis has adopted to try to manage vocabulary change in an extremely heterogeneous and decentralized organization. Answer: it’s not easy, and you really do benefit from executive buy-in. (Again, there was a lot more; I should try to elaborate this description further, later on.)

Debbie Lapeyre spoke about a vocabulary for (mostly biomedical) journal content used by the U.S. National Library of Medicine. The perceived stability and backwards-compatibility guarantees of the vocabulary were among the key reasons for the success of the vocabulary, and its wide adoption among a larger community. But over time, patience with the shortcomings and outright errors in the vocabulary waned, and even the user community was asking for a revision which would include incompatible changes. I had the impression that Debbie was a little surprised that the members of the responsible group were not lynched, but that she was glad that they weren’t, and that the new version can repair some problems she has had on her conscience as a vocabulary designer for a while.

Ken Holman talked about the way versioning problems are handled in UBL (the universal business language). UBL has moved to a two-layer validation story: the only normative definition of UBL documents is given by the XSD schema documents, and business rules are enforced in a separate business-rule validation step (rather than customizing local versions of the schema documents to incorporate business rules). Perhaps the most surprising restriction, for some auditors, may be the UBL policy that their schema documents will never use xsd:choice constructions. Life is made harder by the fact that many consumers of UBL documents appear (by Ken’s account) to use systems which simply will not or cannot show their users invalid documents. It’s hard to make informed decisions about how to recover from a problem, if you’re using a system that won’t show you the problematic data. Fortunately, UBL has some ingenious people.

More questions about Monte Carlo test-case generation

[9 August 2008]

Dave Pawson’s response to the previous post here has made me wonder about ways to tweak the Monte Carlo methods for generating test strings using random number generators — in particular, ways to make them more systematic (and also, perhaps, more automatic).

At the moment, I don’t have answers, but a couple of questions have occurred to me, and if I write them down on a piece of paper I’m just going to lose the paper. So I’ll write them down here.

(1) Can we measure the effectiveness of different methods of test generation by counting how many random strings must be generated in order to obtain a certain number of ‘interesting’ test cases?

(2) One way to define ‘interesting’: given two grammars for similar but not identical languages, how many random strings must we generate, using different methods, in order to find ten strings that exercise the difference between the languages? (I.e. ten strings that are members of L1 but not of L2, or vice versa.)

If we think we know that L1 and L2 differ in exactly three ways, we can also ask: how many test cases must we generate in order to find examples (or: ten examples) of each of the three differences between the languages? (A method that quickly exercises the first two differences but never finds the third would not be our best choice for work with languages where we don’t know the answers in advance.)
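To make the measure in question (2) concrete, here is a rough Python sketch, in which generate, in_L1, and in_L2 are stand-ins for whatever string generator and grammar tests are actually being compared.

    def strings_until_n_differences(generate, in_L1, in_L2, wanted=10, cap=1_000_000):
        """Count how many random strings generate() must produce before we have
        `wanted` strings that are in L1 but not in L2, or vice versa."""
        found = tried = 0
        while found < wanted and tried < cap:
            s = generate()
            tried += 1
            if in_L1(s) != in_L2(s):    # the string exercises a difference
                found += 1
        return tried if found == wanted else None   # None: gave up at the cap

A method that needs fewer strings to reach the quota is, by this measure, the more effective one.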

(3) Another way: given any grammar, how many test strings must we generate, using a given method, in order to find three different strings which match each non-terminal in the grammar?

How many other ways of defining ‘interesting’ test cases can we find? How many other ways can we find of measuring the effectiveness of methods of generating test cases?

(4) At the moment, I know three methods for generating pseudo-random strings for testing grammars:

  1. Run the grammar itself as a generator, possibly using probabilities assigned to different branches at each choice point. John Cowan described this method in a comment on an earlier post here; Liam Quin also described essentially this method to me, in private conversation. If I need a name for this method, I’ll call it probabilistic sentence generation.
  2. For a given length of string n (given as a parameter, or itself chosen randomly), pick n characters randomly from the character set and concatenate them. I’ll call this random character concatenation.
  3. For a given number n, choose n items randomly from an array of strings constructed to include strings that match each non-terminal and terminal in the grammar’s vocabulary, and concatenate them. Since here the random selection is from strings representing the terminal + non-terminal vocabulary, I’ll call this random TNT concatenation.

Question: are there other ways to use random-number generators to generate test strings for grammars?
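For what it’s worth, the first method (probabilistic sentence generation) can be sketched in a few lines of Python. The grammar representation here is invented just for the sketch: each non-terminal maps to a list of (weight, alternative) pairs, an alternative being a sequence of terminals and non-terminals.

    import random

    # Toy grammar for numbers in scientific notation, invented for illustration.
    GRAMMAR = {
        "number": [(1, ["digits"]), (1, ["digits", "e-part"])],
        "e-part": [(1, ["e", "digits"]), (1, ["E", "digits"])],
        "digits": [(3, ["digit"]), (2, ["digit", "digits"])],
        "digit":  [(1, [d]) for d in "0123456789"],
    }

    def generate(symbol="number"):
        """Expand a symbol, choosing among its alternatives with the given weights."""
        if symbol not in GRAMMAR:                  # terminal: emit it as-is
            return symbol
        alternatives = GRAMMAR[symbol]
        weights = [w for w, _ in alternatives]
        _, chosen = random.choices(alternatives, weights=weights)[0]
        return "".join(generate(s) for s in chosen)

    print([generate() for _ in range(5)])          # e.g. ['3', '91e4', '7E20', ...]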

(5) Can the array of strings used in the TNT method be constructed automatically, by using random character concatenation until you have matches for each terminal and non-terminal, and the array is fully populated?

(Here, Enrique dug his elbow into my rib. “Well, obviously yes it can,” he hissed. “What kind of idiotic question is that?!” “Hey, I’m just brainstorming. Gimme a break. Besides, this helps set up the two questions that follow.”)

I should mention here that I have a strong gut feeling that it’s useful, in practical cases, to augment the grammar with a terminal class for characters (or strings) which can never occur in sentences of the language. For a grammar for URIs, that is, the class of characters illegal in URIs definitely needs to be represented in the TNT array; for a grammar for numbers in scientific notation, you definitely want to have a few entries in the TNT array for characters like space, or #, or letters of the alphabet other than ‘e’ and ‘E’, which occur in the coded character sets your users will be using but are not part of the alphabet of the language in the formal sense. I have no good formal grip on this gut feeling, and no good formal way to describe it; I’ll work on that. But for now, understand that when I talk about strings matching terminals, I also intend to include some characters that don’t match any terminal. Otherwise you will surely fail to exercise some error detection paths in your system.

(6) Can the TNT array be populated by a bootstrapping method? Would this work?

  1. Start with an array that just includes substrings for the terminals.
  2. Select a production rule whose right-hand side contains only terminals (or, later on, one whose right hand side contains only items for which the TNT array has substrings).
  3. Generate random strings until you have a match, or two matches, for that non-terminal. (N.B. for recursive rules you’ll want to treat each choice in the right-hand side as a separate production rule.)
  4. Add the new substrings to the TNT array.
  5. Repeat (i.e. return to step 2), until you have strings in the TNT array for each production rule (not just each non-terminal and terminal) in the grammar.

(7) Another possible bootstrapping method:

  1. Begin with a TNT array for all the terminals, as just described.
  2. Also begin with a count table showing the non-terminals (or better, the production rules) of the grammar, and how many substrings they have in the TNT array; to begin with, every production rule in the count table will have zero matches.
  3. Generate a random string from the TNT array.
  4. Check the string against each rule in the grammar table for which you have fewer than three strings in the TNT array; when you find a match, add that string to the TNT array and increment the number for that rule in the count table. If the string doesn’t match any rules (or any rules that aren’t already at 3), discard it and generate another (i.e. go back to step 3). If every entry in the count table has a count of 3, stop.

The description just given assumes that you want three strings in the TNT table for each rule in the grammar. That’s an arbitrary choice; it could be two, or one, or ten. And if there is reason to want especially thorough testing in some parts of the grammar, we could assign different quotas to different non-terminals. (If I’m particularly worried about character classes, then I want a lot of test cases that exercise that part of the grammar; I can get more by giving the non-terminals in that part of the grammar higher representation in the TNT table.)
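As a sanity check that method (7) is at least codable, here is a rough Python sketch. The rule identifiers and the matches(rule, s) test — which would really be a parser — are stand-ins, and the quota of three is the arbitrary choice just mentioned; it could as easily be a per-rule table.

    import random

    def bootstrap_tnt(terminal_strings, rules, matches, quota=3, max_tries=100_000):
        """Grow a TNT array from terminal strings alone, keeping a count per
        production rule until every rule has `quota` matching strings."""
        tnt = list(terminal_strings)
        counts = {rule: 0 for rule in rules}       # the count table of step 2
        for _ in range(max_tries):
            if all(c >= quota for c in counts.values()):
                break
            # step 3: random concatenation over the current TNT array
            s = "".join(random.choice(tnt) for _ in range(random.randint(2, 8)))
            # step 4: credit the string to some rule still under its quota
            for rule in rules:
                if counts[rule] < quota and matches(rule, s):
                    tnt.append(s)
                    counts[rule] += 1
                    break
        return tnt, counts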

(8) When I originally sketched out the TNT method I assumed that what I wanted was a few strings for each non-terminal (and terminal) in the grammar. In the course of working through the questions above I have come to believe that what we want is a few strings for each rule in the grammar: non-terminals with three rules need more strings than non-terminals with just one rule. Which is true? If the measurement methods described above work, it should be possible to measure the effectiveness of different approaches quantitatively.

(9) In setting up the TNT array, is the best default behavior going to be to provide the same number of strings for each item in the terminal + non-terminal vocabulary? Or would we do better to count the number of times each TNT vocabulary item is actually mentioned in the grammar, and assign slots proportionally? One mention: two strings. Eight mentions: sixteen strings. (Or should we do it logarithmically? One mention, two strings; eight mentions, three times two strings; n mentions, 2 times log2 n strings?)

(10) So far, this has all been about sequences of characters. How many of these methods can be transferred conveniently to the task of generating XML documents as test cases for document grammars?

Two styles of Monte Carlo testing for grammars

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only strings that grammar won’t accept are strings with two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI values against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127, and then concatenating the corresponding characters, then you generate an awful lot of uninteresting test cases, and you can go a long long time before you actually exercise a specific part of your grammar.
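(The simple method, for reference, is only a couple of lines of Python — restricted here to the range 32–126 so as to stay within the printable ASCII characters.)

    import random

    def random_ascii_string(length=20):
        """Random character concatenation: pick `length` characters at random
        from the printable ASCII range and concatenate them."""
        return "".join(chr(random.randint(32, 126)) for _ in range(length))

    samples = [random_ascii_string() for _ in range(10_000)]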

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[\p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of strings. For example, “[92, 21, 48, 6, 41, 69, 47”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings “}”, “a”, “2”, “)”, “|xyz”, “|”, “[\p{IsBasicLatin}-[aeiou]]”, and “(”.
    Together, these yield the test string “}a2)|xyz|[\p{IsBasicLatin}-[aeiou]](”.

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.
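In code, the modified method is hardly more complicated than the first one. A rough sketch, with a few invented entries standing in for the hundred-odd strings of the real set:

    import random

    # A few invented entries in the spirit of the regex-grammar set; the real
    # array is built by hand, as described in step 1 above.
    TNT = ["a", "2", "(", ")", "|", "|xyz", "[\\p{IsBasicLatin}-[aeiou]]", "}", "\\d", "-"]

    def tnt_test_string(tnt_strings, min_len=5, max_len=10):
        """Random TNT concatenation: pick a handful of grammatically interesting
        substrings at random and concatenate them."""
        k = random.randint(min_len, max_len)
        return "".join(random.choice(tnt_strings) for _ in range(k))

    print([tnt_test_string(TNT) for _ in range(5)])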

If I were a better mathematician, I might be able to characterize the difference. But in a purely informal way, I think one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, for most non-terminals some or most random character sequences will fail to match the non-terminal. Test cases generated by selecting characters at random will have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By making a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick and dirty test generator, the code for the test generator is at http://www.w3.org/XML/2008/03/xsdl-regex/testgenerator1.pl; the parser for the grammars I was testing is in the other files of the same directory.

Thinking about schema mappings

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in details. That is, like him their idea is to translate into a single common system of markup, so that the collection they are searching uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but were planning simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we discussed briefly the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course as I say it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps.

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary.
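To see the shape of that pipeline (and only the shape), here is a deliberately naive Python sketch. The “mapping” is a flat element-name dictionary, small enough to dodge every hard part of the real problem — which is exactly what the principle below is about — and all the names are invented.

    SOURCE_TO_TARGET = {"head": "title", "quote": "q", "para": "p"}

    def invert(mapping):                        # step 1: invert the mapping
        return {v: k for k, v in mapping.items()}

    def rewrite_query(query_terms, inverse):    # step 2: rewrite the target-vocabulary query
        return [inverse.get(t, t) for t in query_terms]

    def run_query(terms, documents):            # step 3: query the source-vocabulary data
        return [d for d in documents if all(t in d["elements"] for t in terms)]

    def map_results(results, mapping):          # step 4: express the results in the target vocabulary
        return [{**d, "elements": [mapping.get(e, e) for e in d["elements"]]}
                for d in results]

    docs = [{"id": 1, "elements": ["head", "para", "quote"]},
            {"id": 2, "elements": ["para"]}]
    target_query = ["title", "q"]               # the query, phrased in the target vocabulary
    hits = run_query(rewrite_query(target_query, invert(SOURCE_TO_TARGET)), docs)
    print(map_results(hits, SOURCE_TO_TARGET))  # doc 1, with its elements renamed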

Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.