Archive for the ‘computing in the humanities’ Category

XForms and XQuery tutorials at TEI members’ meeting

Monday, August 23rd, 2010

[23 August 2010]

The TEI has published a list of workshops to be offered at the TEI Members’ Meeting this November in Zadar, Croatia.

Together with Syd Bauman of Brown University, I’m offering two tutorial workshops: one on XForms and one on XQuery. Each will last a day and a half, and involve some talking heads, some group discussion, and as much hands-on work as we can manage.

There are several other very good workshops on offer: Norm Walsh on XProc, the TEI@Oxford team on the ODD system, Elena Pierazzo and Malte Rehbein on the encoding of genetic editions, and Andreas Witt et al. on TEI for transcriptions of speech.

The organizers remind me that there is an early-bird discount for those who register before 31 August. There is some chance that tutorials which fail to attract enough registrations will be canceled, so if you definitely want to come, you definitely want to register early, to help make sure your tutorial has enough registrants to make the cut.

Day of Digital Humanities, 18 March 2010

Wednesday, March 17th, 2010

[17 March 2010]

Tomorrow I’ll be participating in a mass experiment with self-consciousness, the 2010 edition of the Day of Digital Humanities. The organizers have persuaded close to 150 people who self-identify with the description “digital humanist” (list) to blog, during the course of 18 March, about what it is they actually spend their time doing. “The goal of the project” (say the organizers) “is to create a web site that weaves together the journals of the participants into a picture that answers the question, ‘Just what do computing humanists really do?’”

For the day, I’ll be using the special Day of Digital Humanities blog set up for me by the organizers; the blogs of all participants are aggregated on the project site; there is also an RSS feed.

ACH and ALLC co-sponsoring Balisage

Friday, February 12th, 2010

[12 February 2010]

The Association for Computers and the Humanities and the Association for Literary and Linguistic Computing have now signed on as co-sponsors of the Balisage conference held each year in August in Montréal. They join a number of other co-sponsors who also deserve praise and thanks, but I’m particularly happy about ACH and ALLC because they have provided such an important part of my intellectual home over the years.

Balisage will take place Tuesday through Friday, 3-6 August, this year; on Monday 2 August there will be a one-day pre-conference symposium on a topic to be announced real soon now. It’s a conference for anyone interested in descriptive markup, information preservation, access to and management of information, accessibility, device independence, data reuse — any of the things that descriptive markup helps enable. The deadline for peer review applications is 19 March; the deadline for papers is 16 April. Time to start thinking about what you’re going to write up; you don’t want to be caught up short at the last minute, without time to work out your idea properly.

Mark your calendars!

Trip report: Digital Humanities 2009

Monday, June 29th, 2009

[29 June 2009; subheads added 30 June 2009]

Last week I was at the University of Maryland attending Digital Humanities 2009, the annual joint conference of the Association for Computers and the Humanities (ACH), the Association for Literary and Linguistic Computing (ALLC), and the Society for Digital Humanities / Société pour l’étude des médias interactifs (SDH/SEMI) — the constituent societies of the Association of Digital Humanities Organizations. It had a fine concentration of digital humanists, ranging from stylometricians and adepts of authorship attribution to theorists of video games (this is a branch of cultural and medial studies, not to be confused with the game theory of von Neumann and Morgenstern — there may be a promotional opportunity in the slogan “Not your grandfather’s game theory!”, but I’ll let others take it up).

I saw a number of talks; some of the ones that stick in my mind are these.

Rockwell / Sinclair on Knowledge radio

Geoffrey Rockwell (Univ. of Alberta) and Stéfan Sinclair (McMaster U.) talked about “Animating the knowledge radio” and showed how one could lower the threshold of text analysis by processing raw streams of data without requiring that it be indexed first. The user’s wait while the stream is being processed can be made bearable, and perhaps even informative, by animating the processing being performed. In one of their examples, word clouds of the high-frequency words in a text are created, and the cloud changes as new words are read in and the list of most-frequent words changes. The analogy with radio (as a stream you tap into without saving it to disk first) is arresting and may have potential for doing more work than they currently make it do. I wonder, too, whether going forward their analysis would benefit by considering current work on continuous queries (a big topic in database research and practice today) or streaming query processors (more XQuery processors act on streams than act on static indexed data). Bottom line: the visuals were pretty and the discipline of making tools work on streams appears to be beneficial.
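To make the idea of a word list that changes as the stream is read a little more concrete, here is a minimal sketch in Python. It is not Rockwell and Sinclair's code; the function name `running_top_words`, the tokenizing pattern, and the sample stream are all invented for illustration, and the sketch assumes that each chunk of the stream ends on a word boundary.

```python
from collections import Counter
import re

def running_top_words(chunks, k=3):
    """Yield the k most frequent words seen so far, once per chunk of the stream."""
    counts = Counter()
    for chunk in chunks:
        # Crude tokenization (an assumption): lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", chunk.lower()))
        # Sort by descending count, then alphabetically, so ties are deterministic.
        yield sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical three-chunk "stream"; nothing is indexed or stored beyond the counts.
stream = ["the cat sat on the mat", "the dog sat", "a dog and a cat"]
snapshots = list(running_top_words(stream, k=2))
```

Each entry in `snapshots` is the state a word cloud would show after one more chunk has arrived; an animation would simply redraw from each snapshot in turn.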

Roued, Cayless, Stokes on paleography and the reading of manuscripts

Henriette Roued (Oxford) spoke on an Interpretation Support System she is building in company with others, to help readers and editors of ancient documents keep track of cruces, conjectures, arguments in favor of this or that reading of a crux, and so on. It was very impressive stuff. In the same session, Hugh Cayless (NYU) sketched out a kind of theoretical framework for the annotation of manuscript images, starting from a conventional scan, processing it into a page full of SVG paths, and attempting from that to build up a web of links connecting transcriptions to the image at the word token and line levels. This led to various hallway and lunchroom conversations about automatic detection of page structure, or mise en page, about which I know mostly that there are people who have studied it in some detail and whose results really ought to be relevant here. The session ended with Peter Stokes (Cambridge) talking about the past and future of computer-aided paleography. Among them, the three speakers seemed to have anticipated a good deal of what Claus Huitfeldt, Yves Marcoux, and I were going to say later in the week, and their pictures were nicer. This could have been depressing. But we decided to take this fact as a confirmation that our topic really is relevant.

One thing troubles me a bit. Both Roued and Cayless seem to take as a given that the regions of a document containing basic tokens provide a tessellation of the page; surely this is an oversimplification. It is perhaps true for most typewritten pages using the Roman alphabet, if they have no manuscript additions, but high ascenders, low descenders, complex scribal abbreviations, even printers’ ligatures all seem to require or suggest that it might be wise to assume that the regions occupied by basic tokens might overlap each other. (Not to mention the practice in times of paper shortage of overwriting the page with lines at right angles to the first set. And of course there are palimpsests.) And in pages with a lot of white space, it doesn’t seem obvious to me that all of the whitespace need be accounted for in the tabulation of basic tokens.

Bradley on the contributions of technical specialists to interdisciplinary projects

John Bradley (King’s College London) closed (my experience of) the first day of the conference by presenting a thought-provoking set of reflections on the contribution of specialists in digital humanities to projects undertaken jointly with humanists who are not particularly focused on the digital (analog humanists?). Of course, in my case he was preaching to the choir, but his arguments that those who contribute to the technical side of such projects should be regarded as partners, not as factotums, ought to be heeded by everyone engaged in interdisciplinary projects. Those who have ears to hear, let them hear.

Pierazzo on diplomatic editions

One of the high points of the conference for me was a talk on Wednesday by Elena Pierazzo (King’s College London), who spoke under the title “The limit of representation” about digital diplomatic editions, with particular reference to experience with a three-year project devoted to Jane Austen’s manuscripts of fiction. She spoke eloquently and insightfully about the difference between transcriptions (even diplomatic transcriptions) and the original, and about the need to choose intelligently when to capture some property of the original in a diplomatic edition and when to gesture instead toward the facsimile or leave the property uncaptured. This is a quiet step past Thomas Tanselle’s view (Studies in Bibliography 31 [1978]) that “the editor’s goal is to reproduce in print as many of the characteristics of the document as he can” — the history of digital editions, short as it is, provides plenty of examples to illustrate the proposition that editorial decisions should be driven by the requirements of the material and of the intended users of the edition, not (as in Tanselle’s view) by technology.

Ruecker and Galey on hermeneutics of design

Stan Ruecker (U. Alberta) and Alan Galey (Toronto) gave a paper on “Design as a hermeneutic process: Thinking through making from book history to critical design” which I enjoyed a great deal, and think I learned a lot from, but which appears to defy paraphrase. After discussing the competing views that design should be the handmaiden of content and that design can and should itself embody an argument, they considered several examples, reading each as the embodiment of an argument, elucidating the work and the argument, and critiquing the embodiment. It gave me a pleasure much like that of sitting in on a master class in design.

Huitfeldt, Marcoux, and Sperberg-McQueen on transcription

In the same session (sic), Claus Huitfeldt (Univ. of Bergen), Yves Marcoux (Univ. de Montréal), and I gave our paper on “What is transcription? part 2”; the slides are on the Web.

Rockwell and Day, Memento mori for projects

The session concluded with a presentation by Geoff Rockwell (again! and still at Alberta) and Shawn Day (Digital Humanities Observatory, RIA Dublin) called “Burying dead projects: Depositing the Globalization Compendium”. They talked about some of the issues involved in depositing digital work with archives and repositories, as illustrated by their experience with a several-year project to develop a collection of resources on globalization (the Globalization Compendium of the title). Deposit is, they said, a requirement for all projects funded by the Canadian Social Sciences and Humanities Research Council (SSHRC), and has been for some time, but the repositories they worked with were still working out the kinks in their processes, and their own initial plans for deposit were also subject to change (deposit of the material was, interestingly, from the beginning planned into the project schedule and budget, but in the course of the project they changed their minds about what “the material” to be deposited should include).

I was glad to hear the other talks in the session, but I never did figure out what the program committee thought these three talks had in common.

Caton on transcription and its collateral losses

On the final day of the conference, Paul Caton (National Univ. of Ireland, Galway) gave a talk on transcription, in which he extended the analysis of transcription which Claus Huitfeldt and I had presented at DH 2007 (later published in Literary & Linguistic Computing) to consider information beyond the sequence of graphemes presented by a transcription and its exemplar.

There are a number of methodological and terminological pitfalls here, which mean caution is advised. For example, we seem to have different ideas about the meaning of the term token, which some people use to denote a concrete physical object (or distinguishable part of an object), but which Paul seems to use to denote a particular glyph or graphetic form. And is the uppercase / lowercase distinction of English to be taken as graphemic? I think the answer is yes (changing the case of a letter does not always produce a minimal pair, but it sometimes does, which I think suffices); Paul explicitly says the answer is no.

Paul identifies, under the cover term modality, some important classes of information which are lost by (most) transcriptions: presentation modality (e.g. font shifts), accidental modality (turned letters, malformed letters, broken type, even incorrect letters and words out of sequence), and temporal modality (the effects of time upon a document).

I think that some of the phenomena he discusses can in fact be treated as extensions of the set of types used to read and transcribe a document, but that raises thorny questions to which I do not have the answer. I think Paul has placed his finger upon a sore spot in the analysis of types and tokens: the usual view of the types instantiated by tokens is that we have a flat unstructured set of them, but as the examples of upper- and lower-case H, roman, italic, and bold instances of the word here, and other examples (e.g. long and short s, i/j, v/u) illustrate, the types we use in practice often do not form a simple flat set in which the identity of the type is the only salient information: often types are related in special ways. We can say, for purposes of analysis and discussion, that a set of types which merges upper and lower case, on the one hand, and one which distinguishes them, on the other, are simply two different sets of types. But then, in practice, we operate not with one type system but with several, and the relations among type systems become a topic of interest. In particular, it’s obvious that some sets of types subsume others, and conversely that some are refinements of others. It’s not obvious that subsumption / refinement is the only relation among typesets that is worth worrying about. I assume that phonology has similar issues, both with identifying phonemes and with choosing the level of detail for phonetic transcriptions, but I know too little of phonology to be able to identify useful morals for application here.
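One way to make the refinement relation concrete is to model a type system as a function from tokens to types, and to say that one system refines another if it never merges two tokens that the other keeps apart. The sketch below is only an illustration under that assumption: plain strings stand in for tokens, and the three toy type systems and the function `refines` are invented here, not taken from Paul's talk or from our paper.

```python
from itertools import combinations

def refines(fine, coarse, tokens):
    """True if `fine` never merges two tokens that `coarse` keeps apart."""
    return all(
        coarse(a) == coarse(b)
        for a, b in combinations(tokens, 2)
        if fine(a) == fine(b)
    )

# Three toy type systems (assumptions for illustration only).
def case_sensitive(token):
    return token                      # every distinct form is its own type

def case_folded(token):
    return token.lower()              # merges upper and lower case

def long_s_merged(token):
    return token.replace("ſ", "s")    # merges long and short s, keeps case

tokens = ["H", "h", "s", "ſ", "S"]

finer_than_folded = refines(case_sensitive, case_folded, tokens)   # True
folded_vs_long_s = refines(case_folded, long_s_merged, tokens)     # False
long_s_vs_folded = refines(long_s_merged, case_folded, tokens)     # False
```

The last two results show two type systems neither of which refines the other: case folding merges H and h but keeps long and short s apart, while the other system does the reverse. That is at least one relation among type systems that subsumption and refinement alone do not capture.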

What, no markup theory?

Looking back over this trip report, I notice that I haven’t mentioned any talks on markup theory or practice. Partly this reflects the program: a lot of discussions of markup theory seem to have migrated from the Digital Humanities conference to the Balisage conference. But partly it’s illusory: there were plenty of mentions of markup, markup languages, and so on. Syd Bauman and Dot Porter talked about the challenge of improving the cross referencing of the TEI Guidelines, and many talks mentioned their encoding scheme explicitly (usually the TEI). The TEI appears to be in wide use, and some parts of the TEI which have long been neglected appear to be coming into their own: Jan Christoph Meister of Hamburg and his students have built an annotation system (CATMA) based on TEI feature structures, and at least one other poster or paper also applied feature structures to its problem. Several people also mentioned standoff markup (though when one otherwise interesting presenter proposed using character offsets as the way to point into a base text, I left quietly to avoid screaming at him during the question session).

The hallway conversations were also very rewarding this year. Old friends and new ones were present in abundance, and I made some new acquaintances I look forward to renewing at future DH conferences. The twitter stream from the conference was also abundant (archive); not quite as active as an IRC channel during a typical W3C meeting, but quite respectable nonetheless.

All in all, the local organizers at the Maryland Institute for Technology in the Humanities, and the program committee, are to be congratulated. Good job!

Sustainability, succession plans, and PURLs — Burial societies for libraries?

Sunday, May 24th, 2009

[24 May 2009]

At the Summer Institute on Data Curation in the Humanities (SIDCH) this past week in Urbana (see previous post), Dorothea Salo surveyed a variety of threats to the longevity of humanities data, including lack or loss of institutional commitment, and/or death (failure) of the institution housing the data. People serious about maintaining data accessible for long periods need to make succession plans: what happens to the extensive collection of digital data held by the XYZ State University’s Institute for the History of Pataphysical Research when the state legislature finally notices its existence and writes into next year’s budget a rule forbidding university administrators to fund it in any year which in the Gregorian calendar is either (a) a leap year or (b) not a leap year, and (c) requiring the administrators to wash their mouths out with soap for having ever funded the Institute in the first place?

Enough centers for computing in the humanities have been born in the last fifty years, flourished some years, and later died, that I can assure the reader that the prospect of going out of existence should concern not only institutes for the history of pataphysics, but all of us.

It’s good if valuable data held by an organization can survive its end; from the point of view of URI persistence it would be even better if the URL used to refer to the data didn’t have to change either.

I have found myself thinking, the last couple of days, about a possible method of mitigating this threat, that runs something like this.

  • A collection of reasonably like-minded organizations (or individuals) forms a mutual assistance pact for the preservation of data and URIs.
  • The group sets up and runs a PURL server, to provide persistent URLs for the data held by members of the group. [Alternate approach: they all agree to use OCLC's PURL server.]
  • Using whatever mechanism they choose, the members of the group arrange to mirror each other’s data in some convenient way. Some people will use rsync or similar tools; Dorothea Salo observed that LOCKSS software can also do this kind of job with very low cost in time and effort.
  • If any of the partners in the mutual assistance pact lose their funding or go out of existence for other reasons, the survivors agree on who will serve the decedent’s data. The PURL resolution tables are updated to point to the new location.
  • Some time before the count of partners is down to one, remaining partners recruit new members. (Once the count hits zero, the system has failed.)

    [Wendell Piez observed, when we got to this point of our discussion of this idea, “There's a Borges story in that, just waiting for someone to write it.” I won't be surprised if Enrique is working on one even as I write.]
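The bookkeeping behind the steps above is small enough to sketch. What follows is a toy model in Python, not an interface to OCLC's PURL server or to LOCKSS: the class `PreservationPact`, its methods, the member names, the `example` URLs, and the `min_members` threshold are all hypothetical, and the sketch assumes that a successor's mirror keeps the same path layout as the PURLs.

```python
class PreservationPact:
    """Toy model of the pact: members plus a PURL resolution table."""

    def __init__(self, members, min_members=2):
        self.members = set(members)
        self.min_members = min_members   # assumed threshold for recruiting
        self.purl_table = {}             # persistent path -> (hosting member, current URL)

    def register(self, purl, member, url):
        if member not in self.members:
            raise ValueError(f"{member} is not a member of the pact")
        self.purl_table[purl] = (member, url)

    def resolve(self, purl):
        """Return the current location for a persistent path."""
        return self.purl_table[purl][1]

    def member_fails(self, failed, successor, mirror_base_url):
        """Drop a failed member and repoint its PURLs at the successor's mirror."""
        if successor == failed or successor not in self.members:
            raise ValueError("successor must be a surviving member")
        self.members.discard(failed)
        for purl, (host, _old_url) in list(self.purl_table.items()):
            if host == failed:
                self.purl_table[purl] = (successor, mirror_base_url + purl)

    def needs_recruits(self):
        """True once membership is low enough that new partners should be sought."""
        return len(self.members) <= self.min_members


pact = PreservationPact({"xyz-state", "abc-college", "def-institute"})
pact.register("/pataphysics/corpus", "xyz-state",
              "https://xyz-state.example/data/pataphysics/corpus")
pact.member_fails("xyz-state", "abc-college", "https://abc-college.example/mirror")
new_location = pact.resolve("/pataphysics/corpus")
```

The persistent path never changes; only the table entry behind it does, which is the whole point of routing references through a PURL rather than through the host's own domain.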

In some cases, people will not want to use PURLs, because when they make available the kind of resources whose longevity is most obviously desirable, the domain name in the URLs performs a sort of marketing or public awareness function for their organization. I suppose one could allow the use of non-PURL domains, if the members of the group can arrange to ensure that upon the demise of an individual member the ownership of their domains passes seamlessly to some other member of the group, or to the group as a whole. But this works only for domain owners, and only if you can find a way to ensure the orderly transfer of domain ownership. Steve Newcomb, my colleague in the organizing committee for the Balisage conference on the theory and practice of markup, points out a difficulty here: in cases of bankruptcy, the domain name may be regarded as an asset and it may therefore be impossible to transfer it to the other members of the mutual assistance society.

It’s a little bit like the burial societies formed by immigrants in a strange land for mutual assurance, to ensure that when one dies, there will be someone to give them a decent burial according to the customs of their common ancestral homeland, so that their souls will not wander the earth restlessly in perpetuity.

It would be nice to think that the data collections we create will have a home in the future, lest their ghosts wander the earth restlessly, bewailing their untimely demise, accusing us of carelessness and negligence in letting them die, and haunting us in perpetuity.

Data curation for the humanities

Saturday, May 23rd, 2009

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana/Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Cragin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown University Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result, when data are migrated from one representation to another. This certainly works in theory, and almost certainly it will work in practice, although the tools still need to be built to test the idea in practice. When I showed the screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.
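For readers who would like something more concrete than a description, here is a deliberately simplified sketch of how skeleton sentences might be used to check a migration. It is a toy, not the tool that still needs to be built: the two vocabularies, the English skeletons, and the function `sentences` are invented for illustration, and it uses plain string templates where a serious treatment would use logical formulas and would have to deal with attributes, nesting, and context.

```python
import xml.etree.ElementTree as ET

def sentences(xml_text, skeletons):
    """Fill each known element's skeleton with its text; return the set of sentences."""
    root = ET.fromstring(xml_text)
    result = set()
    for element in root.iter():
        skeleton = skeletons.get(element.tag)
        if skeleton is not None:
            result.add(skeleton.format(text=(element.text or "").strip()))
    return result

# The same information in two hypothetical vocabularies, before and after migration.
old_doc = "<book><ttl>Emma</ttl><auth>Jane Austen</auth></book>"
new_doc = "<work><title>Emma</title><creator>Jane Austen</creator></work>"

# Skeleton sentences: what each element type is taken to assert (assumed meanings).
old_skeletons = {"ttl": "There is a work titled '{text}'.",
                 "auth": "There is a work written by '{text}'."}
new_skeletons = {"title": "There is a work titled '{text}'.",
                 "creator": "There is a work written by '{text}'."}

before = sentences(old_doc, old_skeletons)
after = sentences(new_doc, new_skeletons)
lost = before - after      # assertions the migration dropped
gained = after - before    # assertions the migration introduced
```

If both `lost` and `gained` are empty, the migrated document says what the original said, as far as these skeletons can tell; a non-empty `lost` points at information that did not survive the move.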

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project (the name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curation; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

… And it don’t look like I’ll ever stop my wandering (travels, part 3)

Tuesday, November 4th, 2008

[4 November 2008]

This is the third in a series of posts about recent travels.

From Mannheim, I traveled to Dublin to visit the Digital Humanities Observatory headed by Susan Schreibman; they stand at the beginning of a three-year project to provide access to Irish work in digital humanities and to prepare the way for long-term preservation. I wish them the best of luck in persuading the individual projects with whom they are to collaborate that the use of appropriate standards is the right way forward.

From Dublin, Susan and I traveled to Trier for <philtag n="7"/>, a small gathering whose name is a macaronic pun involving the German words Philologie (philology), Tag (congress, conference, meeting), and the English word tag. The meeting gathered together a number of interesting people, including several of those most actively interested in computer applications in philology, among them Werner Wegstein, who has organized most of the series, and whom I know from way back as a supporter of the TEI; Andrea Rapp, one of the senior staff at the Trier center of expertise in electronic access and publishing in the humanities; and Fotis Jannidis, currently teaching in Darmstadt and the founder and editor of the annual Computerphilologie, as well as a co-editor of the important electronic edition of the young Goethe. Wegstein is retiring from his chair in Würzburg, thus leading to the creation of a new chair in computational philology, for which both Rapp and Jannidis were on the short list; on the preceding Friday, they had given their trial lectures in Würzburg. Either way, Würzburg will get a worthy successor to Wegstein.

The general topic this year was “Communicating eHumanities: Archives, Textcentres, Portals”, and several of the reports were indeed focused on archives, or text centers, or portals. I spoke about the concept of schema mapping as a way of making it possible to provide a single, simple, unified user interface to heterogeneous collections, while still retaining rich tagging in resources that have it, and providing access to that rich markup through other interfaces. Susan Schreibman spoke about the DHO. Haraldur Bernharðsson of Reykjavík spoke about an electronic edition of the Codex Regius of the Poetic Edda, which cheered me a great deal, since the Edda is dear to my heart and I’m glad to see a well done electronic edition. Heike Neuroth, who is affiliated both with the Max Planck Digital Library in Berlin and with the Lower Saxon State and University Library in Göttingen, spoke on the crucial but underappreciated topic of data curation. (I did notice that many of the papers she cited as talking about the need for long-term preservation of data were published in proprietary formats, which struck me as unfortunate for both practical and symbolic reasons. But data curation is important, even if some who say so are doing a good job of making it harder to curate the data they produce.)

There were a number of other talks, all interesting and useful. But I think the high point of the two days was probably the public lecture by Fotis Jannidis under the title Die digitale Evolution der Kultur oder der bescheidene Beitrag der Geisteswissenschaften zur Virtualisierung der Welt (‘Digital evolution, or the modest contribution of the humanities to the virtualization of the world’). Jannidis took as his point of departure a suggestion by Brewster Kahle that we really should try to digitize all of the artifacts produced til now by human culture, and he refined and augmented Kahle’s back-of-the-envelope calculations about how much information that would involve, and how one might go about it. At one point he showed a graphic with representations of books and paintings and buildings and so on in the upper left, and digitizations of them in the upper right, and a little row of circles labeled Standards at the bottom, like the logs on which the stones of the pyramids make their way to the construction site, in illustrations of books about ancient Egypt.

It was at about this point that, as already pointed out, he said “Standards are the essential axle grease that makes all of this work.”

… I’ve been wandering late … (travels, part 2)

Sunday, October 26th, 2008

[26 October 2008]

This is the second in a series of posts recording some of my impressions from recent travels.

After the XSLT meetings described in the previous post, and then a week at home, during which I was distracted by events that don’t need to be described here, I left again in early October for Europe. During the first half of last week [week before last, now], I was in Mannheim attending a workshop organized by the electronic publications working group of the Union of German Academies of Science. Most of the projects represented were dictionaries of one stripe or another, many of them historical dictionaries (the Thesaurus Linguae Latinae, the Dictionnaire Etymologique de l’Ancien Français, the Deutsches Rechtswörterbuch, the Qumran-Wörterbuch, the Wörterbuch der deutschen Winzersprache, both an Althochdeutsches Wörterbuch and an Althochdeutsches Etymologisches Wörterbuch, a whole set of dialect dictionaries, and others too numerous to name).

Some of the projects are making very well planned, good use of information technology (the Qumran dictionary in Göttingen sticks particularly in my mind), but many suffer from the huge weight of a paper legacy, or from short-sighted decisions made years ago. I’m sure it seemed like a good idea at the time to standardize on Word 6, and to build the project work flow around a set of Word 6 macros which are thought not to run properly in Word 7 or later versions of Word, and which were built by a clever participant in the project who is now long gone, leaving no one who can maintain or recreate them. But however good an idea it seemed at the time, it was in reality a foolish decision for which the project is now paying a price (being stuck in outdated software, without the ability to upgrade, and with increasing difficulty finding support), and for which the academy sponsoring the project, and the future users of the work product, will continue paying for many years to come.

I gave a lecture Monday evening under the title “Standards in der geisteswissenschaftlichen Textdatenverarbeitung: Über die Zukunftssicherung von Sprachdaten”, in which I argued that the IT practice of projects involved with the preservation of our common cultural heritage must attend to a variety of problems that can make their work inaccessible to posterity.

The consistent use of suitably developed and carefully chosen open standards is by far the best way to ensure that the data and tools we create today can still be used by the intended beneficiaries in the future. I ended with a plea for people with suitable backgrounds in the history of language and culture to participate in standardization work, to ensure that the standards developed at W3C or elsewhere provide suitable support for the cultural heritage. The main problem, of course, is that the academy projects are already stretched thin and have no resources to spare for extra work. But I hope that the academies will realize that they have a role to play here, which is directly related to their mission.

It’s best, of course, if appropriate bodies join W3C as members and provide people to serve in Working Groups. More universities, more academies, more user organizations need to join and help guide the Web. (Boeing and Chevron and other user organizations within W3C do a lot for all users of the Web, but there is only so much they can accomplish as single members; they need reinforcements!) But even if an organization does not or cannot join W3C, an individual can have an impact by commenting on draft W3C specifications. All W3C working groups are required by the W3C development process to respond formally to comments received on Last-Call drafts, and to make a good-faith effort to satisfy the originator of the comment, either by doing as suggested or by providing a persuasive rationale for not doing so. (Maybe it’s not a good idea after all, maybe it’s a good idea but conflicts with other good ideas, etc.) It is not unknown for individuals outside the working group (and outside W3C entirely) to have a significant impact on a spec just by commenting on it systematically.

Whether anyone in the academies will take the invitation to heart remains to be seen, though at least a couple of people asked hesitantly after the lecture how much membership dues actually run. So maybe someday …

Words to remember

Wednesday, October 15th, 2008

[15 October 2008]

“Standards are the essential grease of the digital future.” -Fotis Jannidis (in a lecture in Trier on 13 October 2008)

Thinking about schema mappings

Sunday, August 3rd, 2008

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in details. That is, like him their idea is to translate into a single common system of markup, so that the collection they are searching uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but were planning simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we discussed briefly the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course as I say it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.
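
A minimal sketch of this first technique, in Python, may make it concrete. Everything in it is an assumption made for illustration: the two invented source vocabularies, the target vocabulary, the `SOURCE_TO_TARGET` table, and the `materialize` function are all hypothetical, and the mapping is a deliberately naive name-for-name renaming, far simpler than anything a real collection would need.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: element names from two invented source
# vocabularies, mapped onto one common target vocabulary.
SOURCE_TO_TARGET = {
    "para": "p",
    "paragraph": "p",
    "heading": "head",
    "title": "head",
}

def materialize(source_element):
    """Build a new tree in which every element name has been translated
    into the target vocabulary; the source tree is left untouched."""
    target_name = SOURCE_TO_TARGET.get(source_element.tag, source_element.tag)
    target = ET.Element(target_name, dict(source_element.attrib))
    target.text = source_element.text
    target.tail = source_element.tail
    for child in source_element:
        target.append(materialize(child))
    return target

# Two documents marked up differently.
doc_a = ET.fromstring("<doc><heading>One</heading><para>First text.</para></doc>")
doc_b = ET.fromstring("<doc><title>Two</title><paragraph>Second text.</paragraph></doc>")

# Apply the mapping once, then query the translated copies in the usual way.
collection = [materialize(doc) for doc in (doc_a, doc_b)]
paragraphs = [p.text for doc in collection for p in doc.iter("p")]
print(paragraphs)
```

The query at the end knows nothing about `para` or `paragraph`; it sees only the translated copies, which is the whole point of the technique and also the source of its cost, since anything the mapping drops is gone from the copy.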

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps.

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary.
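
The four steps can be sketched in Python under some heavy simplifying assumptions. Here a “query” is nothing more than a single target element name, the mapping is a hypothetical name-for-name table, and all the function names are invented for this illustration. With a real query language, step 2 is where the legerdemain comes in; this sketch sidesteps it entirely.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source-to-target mapping (many source names, one target name).
SOURCE_TO_TARGET = {"para": "p", "paragraph": "p", "heading": "head", "title": "head"}

def invert(mapping):
    """Step 1: invert the mapping. Because several source names can share
    a target name, the inverse maps each target name to a list."""
    inverse = {}
    for source_name, target_name in mapping.items():
        inverse.setdefault(target_name, []).append(source_name)
    return inverse

def rewrite_query(target_name, inverse):
    """Step 2: restate the target-vocabulary query in source terms.
    Names with no entry in the inverse pass through unchanged."""
    return inverse.get(target_name, [target_name])

def run_source_query(source_docs, source_names):
    """Step 3: run the rewritten query against the untouched source data."""
    wanted = set(source_names)
    return [el for doc in source_docs for el in doc.iter() if el.tag in wanted]

def translate_results(elements, mapping):
    """Step 4: apply the original mapping to copies of the results only."""
    translated = []
    for element in elements:
        result = copy.deepcopy(element)
        for node in result.iter():
            node.tag = mapping.get(node.tag, node.tag)
        result.tail = None  # drop text that followed the element in its source
        translated.append(result)
    return translated

def query(source_docs, target_name, mapping=SOURCE_TO_TARGET):
    source_names = rewrite_query(target_name, invert(mapping))
    hits = run_source_query(source_docs, source_names)
    return translate_results(hits, mapping)

docs = [
    ET.fromstring("<doc><heading>One</heading><para>First text.</para></doc>"),
    ET.fromstring("<doc><title>Two</title><paragraph>Second text.</paragraph></doc>"),
]
results = [ET.tostring(r, encoding="unicode") for r in query(docs, "p")]
print(results)
```

Only the query results are ever translated; the stored documents keep their original markup, so nothing is lost from the source and any later enrichment can be made there.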

Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.