Data curation for the humanities

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana-Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Cragin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown University Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result when data are migrated from one representation to another. This certainly works in theory, and will almost certainly work in practice, although the tools needed to test the idea still remain to be built. When I showed the screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.
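For readers who have never met the idea, a toy sketch may help (this is my own illustration, with a made-up template, not the actual OAI-PMH skeleton sentences from the talk): a skeleton sentence is an English or logical sentence with blanks, and a small program instantiates it by filling the blanks from the markup.

```python
import xml.etree.ElementTree as ET

# One skeleton sentence per element type; the blanks ({date}, {url})
# are filled from the element's attributes. The sentence below is a
# made-up example, not one of the real OAI-PMH skeletons.
SKELETONS = {
    "response": "There is an OAI-PMH response, issued at {date}, "
                "to a request addressed to {url}.",
}

def instantiate(xml_text):
    """Yield one English sentence for each element with a skeleton."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        template = SKELETONS.get(elem.tag)
        if template is not None:
            yield template.format(**elem.attrib)

doc = '<response date="2009-05-23" url="http://example.org/oai"/>'
for sentence in instantiate(doc):
    print(sentence)
```

The point of the exercise is that the generated sentences make the claims of the markup explicit, so that after a migration one can check whether the new representation still licenses the same sentences.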

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project (the name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curation; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

… And it don’t look like I’ll ever stop my wandering (travels, part 3)

[4 November 2008]

This is the third in a series of posts about recent travels.

From Mannheim, I traveled to Dublin to visit the Digital Humanities Observatory headed by Susan Schreibman; they stand at the beginning of a three-year project to provide access to Irish work in digital humanities and to prepare the way for long-term preservation. I wish them the best of luck in persuading the individual projects with whom they are to collaborate that the use of appropriate standards is the right way forward.

From Dublin, Susan and I traveled to Trier for <philtag n="7"/>, a small gathering whose name is a macaronic pun involving the German words Philologie (philology), Tag (congress, conference, meeting), and the English word tag. The meeting gathered together a number of interesting people, including several of those most actively interested in computer applications in philology, among them Werner Wegstein, who has organized most of the series, and whom I know from way back as a supporter of the TEI; Andrea Rapp, one of the senior staff at the Trier center of expertise in electronic access and publishing in the humanities; and Fotis Jannidis, currently teaching in Darmstadt and the founder and editor of the annual Computerphilologie, as well as a co-editor of the important electronic edition of the young Goethe. Wegstein is retiring from his chair in Würzburg, thus leading to the creation of a new chair in computational philology, for which both Rapp and Jannidis were on the short list; on the preceding Friday, they had given their trial lectures in Würzburg. Either way, Würzburg will get a worthy successor to Wegstein.

The general topic this year was “Communicating eHumanities: Archives, Textcentres, Portals”, and several of the reports were indeed focused on archives, or text centers, or portals. I spoke about the concept of schema mapping as a way of making it possible to provide a single, simple, unified user interface to heterogeneous collections, while still retaining rich tagging in resources that have it, and providing access to that rich markup through other interfaces. Susan Schreibman spoke about the DHO. Haraldur Bernharðsson of Reykjavík spoke about an electronic edition of the Codex Regius of the Poetic Edda, which cheered me a great deal, since the Edda is dear to my heart and I’m glad to see a well done electronic edition. Heike Neuroth, who is affiliated both with the Max Planck Digital Library in Berlin and with the Lower Saxon State and University Library in Göttingen, spoke on the crucial but underappreciated topic of data curation. (I did notice that many of the papers she cited as talking about the need for long-term preservation of data were published in proprietary formats, which struck me as unfortunate for both practical and symbolic reasons. But data curation is important, even if some who say so are doing a good job of making it harder to curate the data they produce.)

There were a number of other talks, all interesting and useful. But I think the high point of the two days was probably the public lecture by Fotis Jannidis under the title Die digitale Evolution der Kultur oder der bescheidene Beitrag der Geisteswissenschaften zur Virtualisierung der Welt (‘The digital evolution of culture, or the modest contribution of the humanities to the virtualization of the world’). Jannidis took as his point of departure a suggestion by Brewster Kahle that we really should try to digitize all of the artifacts produced till now by human culture, and refined and augmented Kahle’s back-of-the-envelope calculations about how much information that would involve, and how one might go about it. At one point he showed a graphic with representations of books and paintings and buildings and so on in the upper left, and digitizations of them in the upper right, and a little row of circles labeled Standards at the bottom, like the logs on which the stones of the pyramids make their way to the construction site, in illustrations of books about ancient Egypt.

It was at about this point that, as already pointed out, he said “Standards are the essential axle grease that makes all of this work.”

… I’ve been wandering late … (travels, part 2)

[26 October 2008]

This is the second in a series of posts recording some of my impressions from recent travels.

After the XSLT meetings described in the previous post, and then a week at home, during which I was distracted by events that don’t need to be described here, I left again in early October for Europe. During the first half of last week [week before last, now], I was in Mannheim attending a workshop organized by the electronic publications working group of the Union of German Academies of Science. Most of the projects represented were dictionaries of one stripe or another, many of them historical dictionaries (the Thesaurus Linguae Latinae, the Dictionnaire Etymologique de l’Ancien Français, the Deutsches Rechtswörterbuch, the Qumran-Wörterbuch, the Wörterbuch der deutschen Winzersprache, both an Althochdeutsches Wörterbuch and an Althochdeutsches Etymologisches Wörterbuch, a whole set of dialect dictionaries, and others too numerous to name).

Some of the projects are making very well planned, good use of information technology (the Qumran dictionary in Göttingen sticks particularly in my mind), but many suffer from the huge weight of a paper legacy, or from short-sighted decisions made years ago. I’m sure it seemed like a good idea at the time to standardize on Word 6, and to build the project work flow around a set of Word 6 macros which are thought not to run properly in Word 7 or later versions of Word, and which were built by a clever participant in the project who is now long gone, leaving no one who can maintain or recreate them. But however good an idea it seemed at the time, it was in reality a foolish decision for which the project is now paying a price (being stuck in outdated software, without the ability to upgrade, and with increasing difficulty finding support), and for which the academy sponsoring the project, and the future users of the work product, will continue paying for many years to come.

I gave a lecture Monday evening under the title “Standards in der geisteswissenschaftlichen Textdatenverarbeitung: Über die Zukunftssicherung von Sprachdaten” (‘Standards in humanities text-data processing: on securing the future of language data’), in which I argued that the IT practice of projects involved with the preservation of our common cultural heritage must attend to a variety of problems that can make their work inaccessible to posterity.

The consistent use of suitably developed and carefully chosen open standards is by far the best way to ensure that the data and tools we create today can still be used by the intended beneficiaries in the future. I ended with a plea for people with suitable backgrounds in the history of language and culture to participate in standardization work, to ensure that the standards developed at W3C or elsewhere provide suitable support for the cultural heritage. The main problem, of course, is that the academy projects are already stretched thin and have no resources to spare for extra work. But I hope that the academies will realize that they have a role to play here, which is directly related to their mission.

It’s best, of course, if appropriate bodies join W3C as members and provide people to serve in Working Groups. More universities, more academies, more user organizations need to join and help guide the Web. (Boeing and Chevron and other user organizations within W3C do a lot for all users of the Web, but there is only so much they can accomplish as single members; they need reinforcements!) But even if an organization does not or cannot join W3C, an individual can have an impact by commenting on draft W3C specifications. All W3C working groups are required by the W3C development process to respond formally to comments received on Last-Call drafts, and to make a good-faith effort to satisfy the originator of the comment, either by doing as suggested or by providing a persuasive rationale for not doing so. (Maybe it’s not a good idea after all, maybe it’s a good idea but conflicts with other good ideas, etc.) It is not unknown for individuals outside the working group (and outside W3C entirely) to have a significant impact on a spec just by commenting on it systematically.

Whether anyone in the academies will take the invitation to heart remains to be seen, though at least a couple of people asked hesitantly after the lecture how much membership dues actually run. So maybe someday …

Optimistic concurrency and XML parsing and validation (Balisage report 3, in which chronology is abandoned)

[22 August 2008]

My brief hope (it would be misleading to refer to it as a “plan”) to report daily from Balisage 2008 has bitten the dust — it did that shortly after noon on the first day, when my account of Sandro Hawke’s work on XTAN turned out to take more time than I had available — but there is still a lot to say. I’m going to abandon the chronological approach, however, and write about things that come to mind, in a more or less random order.

One of my favorite papers this year was the one submitted by Yu Wu, Qi Zhang, Zhiqiang Yu, and Jianhui Li, of Intel, under the slightly daunting title “A Hybrid Parallel Processing for XML Parsing and Schema Validation”. (I think they are all members of the XML Engineering Team at the Intel China Software Center in Shanghai, but I am not sure I’ve read all the affiliation info correctly; I keep being distracted by the implications of an Intel software center having an XML engineering team.)

When I paraphrased this paper to a friend recently, her response was “Wow! That’s a lot more accessible than I would have guessed from the title.” So perhaps it’s worth while to try to record here the high points of the work, in a way that’s accessible to people with no more than lay knowledge of the relevant technical disciplines. (This goal is made easier, of course, by the fact that I don’t have more than lay knowledge of those disciplines myself.)

For technical details, of course, readers should go to the paper in the online proceedings of the conference; all errors in this summary are mine.

The elevator speech

The quick executive summary: XML parsing, and validation, can be a lot faster if performed by a multi-threaded process using optimistic concurrency.

By “optimistic concurrency”, I mean a strategy that parallelizes aggressively, even if that means doing some work speculatively, based on guesses made when parallelizing the work, guesses that might prove wrong. When the guesses are wrong, the process has to spend extra time later fixing the resulting problems. But if the speedup gained by the concurrency is great enough, it can outweigh the cost of wrong guesses. (This is a bit like the way Ethernet chooses to clean up after packet collisions, instead of attempting to prevent them the way token-ring networks do.)

A fast and simple-minded process divides the XML document into chunks, and multiple parallel threads handle multiple chunks at the same time. The fragmentary results achieved by the chunkwise parsing can be knit back together quickly; sometimes the fixup process shows that one or the other chunk needs to be reparsed.

What does Moore’s Law have to do with XML parsing?

OK, so much for the elevator speech. If you’re still interested, here is a little more detail.

First, some background. Moore’s Law says, roughly, that the number of transistors it’s possible to put on a chip doubles every eighteen months. For many years, this doubling of transistor count was accompanied by increases in clock speed. (I don’t understand the connection, myself, not being an electrical engineer. I just take it on faith.) But higher clock speeds apparently require more power and generate more heat, and eventually this causes problems for the people who actually use the chips. (Very few laptop designers are persuaded that water-cooled systems can be made to have the requisite portability. And liquid nitrogen, which would be the next step? Don’t get them started.)

So nowadays the doubling of transistors appears to be reflected in the rise of multi-core chips; dual-core chips in the current crop of off-the-shelf machines, with every expectation that the number of cores on standard chips will rise. Intel has already shipped chips with four and eight cores, although I haven’t seen any four-core machines on my list when shopping for laptops. (I’m not sure whether cores are expected to double every eighteen months indefinitely, or not; if they do, will we end up with a 1024-core chip vaguely resembling the Connection Machine in our laptops in another fifteen years?)

It used to be that performance rose about as fast as the transistor count, because the clock speed kept going up; even if software didn’t do anything smarter, it kept getting faster because the chip was faster. But to get performance benefits out of multi-core chips, a system is going to want to keep all of the cores busy whenever possible. People have been thinking about parallel computing for a long time, and at the 30,000-foot level, the answer so far seems to boil down to the simple, general principle “Gosh, that’s hard!”

Under those circumstances it seems plausible that a manufacturer of multi-core chips might see it as in the manufacturer’s own interest to show people how to multi-thread common applications, so as to make it easier to get as much performance as possible out of multi-core chips.

Parallelizing parsing

How do you parallelize the task of parsing an XML input stream? (There may be other ways to keep multiple threads busy, but this seems like the obvious choice if we can manage it.)

The answer wasn’t obvious to me. There are references to parallelism in the titles or summaries of a number of papers in Grune and Jacobs’s online bibliography on parsing techniques (part of the supplemental material to their book on Parsing Techniques), but none that leap out at me as being easy to understand.

One way to parallelize XML parsing is to run a pre-scanner over the document to select suitable places to divide it into chunks, and then to parse the chunks in parallel. (There is earlier work on this by Wei Lu et al., which Yu Wu et al. cite. They also cite work on other ways to parallelize XML parsing, but I don’t understand them and won’t try to describe them.)

The problem with (the existing state of) the pre-scanning approach, according to the Intel team, is that the pre-scanning takes a lot of time by itself, and once the parsing process itself is optimized, the overhead imposed by the pre-scanner ends up being prohibitive (or at least deplorable).


So Yu Wu et al. use a simpler (and I guess faster) pre-scanner. They don’t attempt to find wholly optimal chunk boundaries, and they don’t attempt to ensure that the parse context at the chunk boundary is completely understood by the pre-scanner. They don’t go into great detail on the chunking method, but if I have understood it correctly, the primary criteria for division of the XML document into chunks are that (1) the chunks should be of about the same size, (2) there should be at least one chunk per thread, (3) other things being equal, fewer chunks is better (because this reduces the cost of post-processing), and (4) each chunk should start with a left angle bracket.

Until I looked at the paper again just now I thought each chunk had to begin with something that looks like a start-tag, but I can’t find that constraint in the paper. Maybe I hallucinated it.

So while I don’t know the details of the Intel pre-scanner, I imagine a pre-scanner that just looks for an angle bracket in the neighborhood of the desired chunk boundary: if you have an 800-MB document you want to divide into eight chunks (in order to allocate them to eight threads, say), my imaginary pre-scanner just looks in the vicinity of byte offset 100,000,000 for a left angle bracket, scanning forwards and/or backwards until it finds one. If we are reading the document from random-access storage, the pre-scanner doesn’t even have to scan the body of the chunk.
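To make my imaginary pre-scanner concrete, here is a small sketch (my own guess at the approach, not code from the paper): aim for equal-sized chunks, then nudge each boundary forward to the next left angle bracket.

```python
def chunk_boundaries(data: bytes, n_chunks: int) -> list[int]:
    """Pick chunk start offsets for an XML document.

    A loose imitation of a fast pre-scanner: divide the document into
    roughly equal chunks, moving each boundary forward to the next '<'
    so that every chunk starts at an angle bracket. The scanner can
    still be fooled by '<' inside comments or CDATA sections; the
    post-processor is expected to detect and repair that later.
    """
    target = len(data) // n_chunks
    boundaries = [0]
    for i in range(1, n_chunks):
        pos = data.find(b"<", i * target)
        if pos == -1 or pos <= boundaries[-1]:
            break  # document too small to yield this many chunks
        boundaries.append(pos)
    return boundaries

doc = b"<root><a>hello</a><b>world</b><c>!</c></root>"
print(chunk_boundaries(doc, 3))
```

Note that this scanner never looks at the body of a chunk at all, which is what makes it cheap compared with a pre-scanner that tries to track full parse context.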

Some readers will be saying to themselves at this point “But wait, not everything that looks like a start-tag or an end-tag is necessarily a start-tag or an end-tag. It might be in a CDATA section. It might be in a comment. What then?”

Hold that thought. You’re quite right, and we’ll come back to it.

Parallel parsing

Now, each thread is given a chunk of XML to parse, but only one thread gets to begin at the beginning of the document, so only one thread actually knows where it is. The other threads are all following the advice of the Latin poet Horace, who recommends beginning in the middle (in medias res). (I’m tempted to see if we could call this parsing technique Horatian parsing, for that reason. That’s probably not going to fly in the community at large, but it’s more compact than “hybrid parallel-processing parser” or any other descriptive phrase I can come up with denoting the algorithm presented by the Intel team, so I’ll use it in the rest of this description.)

The nature of XML, however, is that the element structure is fairly explicit in the document and doesn’t require a lot of context information. When in a chunk you see a start-tag, some content, and a matching end-tag, you know that’s an element, and you can build the appropriate data structure. But you will also run into some things that you can’t handle all by yourself: unmatched start-tags (for elements that end after the current chunk), unmatched end-tags (for elements that started in earlier chunks), and undeclared namespace prefixes (for namespace bindings declared in some ancestor element). The Horatian threads keep track of each of these phenomena, and produce as their output a convenient representation of parts of a parse tree (think DOM, not SAX), and lists of unmatched start-tags, end-tags, and namespace prefixes, together with some miscellaneous context information.

The post-processor can use the lists of unmatched bits to knit the data structures together: the last unmatched start-tag found in chunk 1 matches the first unmatched end-tag found in chunk 2, and so on.
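To make the knitting step concrete, here is a toy sketch (my own, not the Intel team’s code) that parses each chunk in isolation, records its unmatched start- and end-tags, and then pairs them up across chunk boundaries.

```python
import re

# Toy tag scanner: matches start-tags, end-tags, and empty elements.
# It assumes well-formed tags and ignores comments, CDATA, and text.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

def parse_chunk(chunk: str):
    """Parse one chunk Horatian-style, i.e. with no outer context.

    Returns the names of start-tags left open at the end of the chunk
    and the names of end-tags that matched nothing inside the chunk.
    """
    open_stack, unmatched_ends = [], []
    for close, name, empty in TAG.findall(chunk):
        if empty:                     # <x/> matches itself
            continue
        elif close:                   # </x>
            if open_stack:
                open_stack.pop()
            else:
                unmatched_ends.append(name)
        else:                         # <x>
            open_stack.append(name)
    return open_stack, unmatched_ends

def knit(chunks):
    """Pair unmatched tags across chunk boundaries: the unmatched
    starts of earlier chunks are consumed, last-in first-out, by the
    unmatched ends of later chunks."""
    pending = []
    for chunk in chunks:
        opens, ends = parse_chunk(chunk)
        for name in ends:
            assert pending and pending.pop() == name, "mismatched tags"
        pending.extend(opens)
    return not pending                # True if the document balanced

chunks = ["<doc><p>one</p><p>two", "</p><p>three</p></doc>"]
print(knit(chunks))
```

A real implementation would of course build and link DOM fragments rather than just checking names, and would carry namespace bindings and other context information along with the unmatched-tag lists; this sketch only shows the matching logic.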

The post-processor can also use the context information to see whether any parsing needs to be redone. (Those of you who were worried about misleading angle brackets in comments and CDATA sections and so on? Remember I told you to hold that thought? OK, put it down now. This is where we deal with it.)

If chunk n ended in the middle of a comment or in a CDATA section, then the parsing of chunk n + 1 will need to be redone. But if we have divided the document into eight chunks, and one chunk, or four, need a second parsing, we are still running at about four times the speed of a sequential parser.
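A back-of-the-envelope model (mine, not the paper’s) makes that arithmetic explicit: with one thread per chunk, the first pass costs one time unit, and each wave of re-parses costs one more, so one wrong chunk costs as much as four.

```python
import math

def expected_speedup(n_chunks: int, n_reparsed: int) -> float:
    """Crude speedup model, ignoring pre-scan and fixup overhead.

    Sequential cost: n_chunks chunk-parses, one after another.
    Parallel cost: one time unit for the first pass (all chunks at
    once, one thread each), plus one unit per wave of re-parses.
    """
    waves = math.ceil(n_reparsed / n_chunks)
    return n_chunks / (1 + waves)

# Eight chunks: whether one chunk or four need a second pass, they
# fit in a single extra wave, so the model gives 4x either way,
# matching the "about four times" figure above.
print(expected_speedup(8, 1), expected_speedup(8, 4))
```

The model also suggests why keeping the number of wrong guesses low matters less than keeping the number of *waves* low: the cost step is per pass, not per chunk.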


Once the chunks have been knit together (and, crucially, once all namespace prefixes are bound), the same chunking is used to perform schema-validity assessment: one thread validates chunk 1, another validates chunk 2, etc.

I won’t go into the fixup needed to knit together the various partial results; it suffices to observe that in both XSD and Relax NG, the validation needed for an element cannot be determined reliably without knowing its ancestry. The consequence is that validation cannot be performed reliably without doing the post-parsing fixup first. (You could guess which element declaration to use, I suppose, but I think the rate of wrong guesses would be too high.)

If you want to pass appinfo from the governing element declaration to the consumer, you will also in pathological cases need to know the left siblings of the element, in order to know where it occurs in the content model of its parent, so you can know which appinfo to use. I expect any schema validator designed for this kind of optimistic concurrency will therefore decline to expose appinfo from element declarations.


In theory, if parallelization imposes no overhead, then by using two threads you get a two-fold speedup, four threads gets a four-fold speedup, etc.

In practice, the test results presented by the Intel group show, as one might expect, that there is some non-negligible overhead in this scheme. For four threads, their stand-alone parser shows a slightly better than two-fold speedup; for eight threads, slightly better than four-fold. Some of the overhead, clearly, is in the pre-scanning and the post-processing, but if I’m reading their graphs correctly, for eight threads neither of these processes comes close to the amount of time spent in parsing and validation, so I’m guessing that the main performance hit comes from the need to re-parse some chunks.

Small documents (which the paper describes as those smaller than 64 KB) do not benefit from the parallelization, and larger documents (larger than 1 MB) benefit more than medium-sized ones. They didn’t talk at any length about the nature of the test data, beyond saying that it was real data from real customers.

Optimization can be a real pain sometimes, and listening to people talk about optimization can be a good cure for insomnia. But sometimes, someone has a really simple Big Idea that makes sense even to someone as ignorant as me, and then it can be an awful lot of fun to hear about it and think about the implications.

Optimistic concurrency for XML processing: a Big Idea whose implications can be huge.