… I’ve been wandering late … (travels, part 2)

[26 October 2008]

This is the second in a series of posts recording some of my impressions from recent travels.

After the XSLT meetings described in the previous post, and then a week at home, during which I was distracted by events that don’t need to be described here, I left again in early October for Europe. During the first half of last week [week before last, now], I was in Mannheim attending a workshop organized by the electronic publications working group of the Union of German Academies of Science. Most of the projects represented were dictionaries of one stripe or another, many of them historical dictionaries (the Thesaurus Linguae Latinae, the Dictionnaire Etymologique de l’Ancien Français, the Deutsches Rechtswörterbuch, the Qumran-Wörterbuch, the Wörterbuch der deutschen Winzersprache, both an Althochdeutsches Wörterbuch and an Althochdeutsches Etymologisches Wörterbuch, a whole set of dialect dictionaries, and others too numerous to name).

Some of the projects are making very well-planned, good use of information technology (the Qumran dictionary in Göttingen sticks particularly in my mind), but many suffer from the huge weight of a paper legacy, or from short-sighted decisions made years ago. I’m sure it seemed like a good idea at the time to standardize on Word 6, and to build the project work flow around a set of Word 6 macros which are thought not to run properly in Word 7 or later versions of Word, and which were built by a clever participant in the project who is now long gone, leaving no one who can maintain or recreate them. But however good an idea it seemed at the time, it was in reality a foolish decision for which the project is now paying a price (being stuck in outdated software, without the ability to upgrade, and with increasing difficulty finding support), and for which the academy sponsoring the project, and the future users of the work product, will continue paying for many years to come.

I gave a lecture Monday evening under the title “Standards in der geisteswissenschaftlichen Textdatenverarbeitung: Über die Zukunftssicherung von Sprachdaten” (“Standards in humanities text data processing: on securing the future of language data”), in which I argued that the IT practice of projects involved with the preservation of our common cultural heritage must attend to a variety of problems that can make their work inaccessible to posterity.

The consistent use of suitably developed and carefully chosen open standards is by far the best way to ensure that the data and tools we create today can still be used by the intended beneficiaries in the future. I ended with a plea for people with suitable backgrounds in the history of language and culture to participate in standardization work, to ensure that the standards developed at W3C or elsewhere provide suitable support for the cultural heritage. The main problem, of course, is that the academy projects are already stretched thin and have no resources to spare for extra work. But I hope that the academies will realize that they have a role to play here, which is directly related to their mission.

It’s best, of course, if appropriate bodies join W3C as members and provide people to serve in Working Groups. More universities, more academies, more user organizations need to join and help guide the Web. (Boeing and Chevron and other user organizations within W3C do a lot for all users of the Web, but there is only so much they can accomplish as single members; they need reinforcements!) But even if an organization does not or cannot join W3C, an individual can have an impact by commenting on draft W3C specifications. All W3C working groups are required by the W3C development process to respond formally to comments received on Last-Call drafts, and to make a good-faith effort to satisfy the originator of the comment, either by doing as suggested or by providing a persuasive rationale for not doing so. (Maybe it’s not a good idea after all, maybe it’s a good idea but conflicts with other good ideas, etc.) It is not unknown for individuals outside the working group (and outside W3C entirely) to have a significant impact on a spec just by commenting on it systematically.

Whether anyone in the academies will take the invitation to heart remains to be seen, though at least a couple of people asked hesitantly after the lecture how much membership dues actually run. So maybe someday …

Thinking about schema mappings

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in details. That is, like him, they propose to translate everything into a single common system of markup, so that the collection they are searching uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but were planning simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we briefly discussed the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.
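To make the hazard concrete, here is a minimal sketch in Python, with invented element names (nothing here comes from any of the projects mentioned): a translation step that keeps the line breaks of the source but has nowhere in the common vocabulary to record which of them were hyphenated.

    # A minimal sketch of lossy normalization, with invented markup:
    # the source records line-end hyphenation explicitly, the common
    # target vocabulary does not, and the translation silently drops
    # the distinction.
    import re

    source = "A docu<lb hyphenated='yes'/>ment and a plain<lb/> break."

    # Translate into the common vocabulary: every line break survives,
    # but the hyphenation attribute has no counterpart and disappears.
    common = re.sub(r"<lb\b[^>]*/>", "<lb/>", source)

    print(common)
    # A docu<lb/>ment and a plain<lb/> break.

Anyone who later enriches the stripped copy (say, by tagging the hyphenated words) will be adding information at exactly the points where information was removed, which is the re-unification headache just described.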

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course as I say it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.
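In XML terms, a sketch of the idea might look like the following; the element names and the mapping are invented for illustration, and a real mapping would of course be far more than a table of element names.

    # A minimal sketch of "materializing the view": translate the source
    # documents into the target vocabulary once, keep the result, and
    # query the translated copy directly.  Element names and the mapping
    # are invented for illustration.
    import xml.etree.ElementTree as ET

    # Hypothetical mapping from source-vocabulary element names to
    # target-vocabulary element names.
    SOURCE_TO_TARGET = {
        "divhead": "title",   # one project's name for a section heading
        "txt": "p",           # ... and its name for a paragraph
    }

    def materialize(source_xml):
        """Return a copy of the document re-tagged in the target vocabulary."""
        root = ET.fromstring(source_xml)
        for elem in root.iter():
            elem.tag = SOURCE_TO_TARGET.get(elem.tag, elem.tag)
        return root

    doc = "<div><divhead>On Dictionaries</divhead><txt>Some text.</txt></div>"
    view = materialize(doc)

    # The query is now phrased purely in target-vocabulary terms.
    print([t.text for t in view.iter("title")])   # ['On Dictionaries']

The price, as noted above, is that whatever the mapping does not carry over is simply gone from the materialized copy.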

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps (a small sketch follows the list).

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well-known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary.
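Here is the promised sketch, in the same invented vocabulary as before. The “query language” consists of nothing more than a single element name, and the mapping is a one-to-one renaming of elements; a real system could assume neither, and a realistic query language is exactly what makes step 2 hard.

    # A deliberately tiny sketch of the second technique.  The "query"
    # is just a target-vocabulary element name, and the mapping is a
    # one-to-one renaming of element names; both are simplifications,
    # but they are enough to show the four steps.
    import xml.etree.ElementTree as ET

    SOURCE_TO_TARGET = {"divhead": "title", "txt": "p"}

    # Step 1: invert the mapping (easy here only because it is one-to-one).
    TARGET_TO_SOURCE = {t: s for s, t in SOURCE_TO_TARGET.items()}

    def run_query(source_xml, target_name):
        source = ET.fromstring(source_xml)
        # Step 2: rewrite the target-vocabulary query in source terms.
        source_name = TARGET_TO_SOURCE.get(target_name, target_name)
        # Step 3: apply the rewritten query to the untouched source data.
        hits = source.iter(source_name)
        # Step 4: map the results back into the target vocabulary.
        results = []
        for hit in hits:
            mapped = ET.Element(SOURCE_TO_TARGET.get(hit.tag, hit.tag))
            mapped.text = hit.text
            results.append(mapped)
        return results

    doc = "<div><divhead>On Dictionaries</divhead><txt>Some text.</txt></div>"
    for r in run_query(doc, "title"):
        print(ET.tostring(r, encoding="unicode"))   # <title>On Dictionaries</title>

The source data never changes; only the query and its results cross the mapping, which is what distinguishes this from materializing the view.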

Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.
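As one thing to chew on, here is one way the contrast might look in markup terms. The element names are invented and the sketch describes no particular system; it is only meant to show why renaming elements one at a time is not translation.

    # A purely illustrative contrast between a word-by-word mapping
    # (tag-for-tag renaming) and an utterance-level mapping that treats
    # a whole construct as its unit.  All element names are invented.
    import xml.etree.ElementTree as ET

    src = ET.fromstring(
        "<bibl><author>Doe, J.</author><title>An Example</title></bibl>"
    )

    # Word by word: rename each element independently.  No single source
    # element corresponds to the target's one-string citation, so the
    # best this can do is relabel the parts.
    RENAME = {"bibl": "citation", "author": "name", "title": "work"}
    word_by_word = ET.fromstring(ET.tostring(src, encoding="unicode"))
    for e in word_by_word.iter():
        e.tag = RENAME.get(e.tag, e.tag)

    # Utterance level: map the whole <bibl> construct at once, deciding
    # the target content from the construct as a whole.
    citation = ET.Element("citation")
    citation.text = "{}, {}".format(src.findtext("author"), src.findtext("title"))

    print(ET.tostring(word_by_word, encoding="unicode"))
    # <citation><name>Doe, J.</name><work>An Example</work></citation>
    print(ET.tostring(citation, encoding="unicode"))
    # <citation>Doe, J., An Example</citation>

The second form is reachable only by looking at the whole construct; no amount of element-by-element renaming will produce it.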

Digital Humanities 2008

After two days of drizzle and gray skies, the sun came out on Saturday to make the last day of Digital Humanities 2008 memorable and ensure that the participants all remember Finland and Oulu as beautiful and not (only) as gray and wet and chilly. Watching the sun set over the water, a few minutes before midnight, by gliding very slowly sideways beneath the horizon, gave me great satisfaction.

The Digital Humanities conference is the successor to the annual joint conference of the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC), now organized by the umbrella organization they have founded, which in a bit of nomenclature worthy of Garrison Keillor is called the Alliance of Digital Humanities Organizations.

There were a lot of good papers this year, and I don’t have time to go through them all here, since I’m supposed to be getting ready to catch the airport bus. So I hope to do a sort of fragmented trip report in the form of followup posts on a number of projects and topics that caught my eye. A full-text XML search engine I had never heard of before (TauRo, from the Scuola Normale Superiore in Pisa), bibliographic software from Brown, and a whole long series of digital editions and databases are what jump to my mind now, in my haste. The attendance was better than I had expected, and confirmed what some have long suspected: King’s College London has become the 800-pound gorilla of humanities computing. Ten percent of the attendees had King’s affiliations, there was an endless series of reports on intelligently conceived and deftly executed projects from King’s, and King’s delegates seemed to play a disproportionately large role in the posing of incisive questions and in the interesting parts of discussions. There were plenty of good projects done elsewhere, too, but what Harold Short and his colleagues have done at King’s is really remarkable; someone interested in how institutions are built up to eminence (whether as a study in organizational management or because they want to build up some organization) should really do a study of how they have gone about it.

As local organizer, Lisa-Lena Opas-Hänninen has done an amazing job, and Espen Ore’s program committee deserves credit for a memorable program. Next year’s organizers at the University of Maryland in College Park have a tough act to follow.