Data curation for the humanities

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana/Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Cragin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well-developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown University Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result, when data are migrated from one representation to another. This certainly works in theory, and almost certainly it will work in practice, although the tools still need to be built to test the idea in practice. When I showed the screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.
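For readers who would like something concrete, here is a minimal sketch of the migration check, using English skeleton sentences rather than predicate calculus. It is not what I showed on screen: the element names, the templates, and the sentences() helper are all invented for illustration, and a real tool would need far richer skeletons than one sentence per element.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton sentences: one template per element name.
# Both vocabularies and their templates are invented for this example.
SOURCE_SKELETONS = {
    "title": "The work has the title '{text}'.",
    "author": "The work was written by '{text}'.",
}
TARGET_SKELETONS = {
    "dc_title": "The work has the title '{text}'.",
    "dc_creator": "The work was written by '{text}'.",
}


def sentences(xml_text, skeletons):
    """Fill the blank in each skeleton from the matching elements
    and return the resulting set of sentences."""
    root = ET.fromstring(xml_text)
    result = set()
    for element in root.iter():
        template = skeletons.get(element.tag)
        if template is not None:
            text = (element.text or "").strip()
            result.add(template.format(text=text))
    return result


source_doc = "<record><title>Emma</title><author>Jane Austen</author></record>"
good_migration = (
    "<rec><dc_title>Emma</dc_title><dc_creator>Jane Austen</dc_creator></rec>"
)
lossy_migration = "<rec><dc_title>Emma</dc_title></rec>"

# What the source document says, expressed as a set of sentences.
expected = sentences(source_doc, SOURCE_SKELETONS)

# Sentences licensed by the source but missing from a migrated version.
lost_in_good = expected - sentences(good_migration, TARGET_SKELETONS)
lost_in_lossy = expected - sentences(lossy_migration, TARGET_SKELETONS)

print("lost in good migration:", lost_in_good)
print("lost in lossy migration:", lost_in_lossy)
```

The point of the design is that the two representations never need to be compared directly: each is translated into the same neutral set of sentences, and the set difference shows what a migration dropped (or, taken in the other direction, what it added).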

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project. (The name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curation; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

3 thoughts on “Data curation for the humanities”

  1. Hello Michael,
    would you conclude from this summer institute that there is a need for data curators – is this a new job model for librarians? I am asking since teaching future librarians is a main part of my job, and having the TEI as part of our curriculum is something we actually discussed just today – with very useful input, see

  2. Felix, I hesitate to make anything like a prediction about job prospects for anyone, but it does seem clear that more and more library holdings will take electronic form, and that electronic holdings will impose new requirements on librarians seeking to ensure (a) that their collections are of active use to their communities and (b) that their collections are preserved.

    The fundamental goals for electronic materials in the collection will be essentially the same as for non-electronic materials, but the details of how to make them available for use, and how to preserve them against threats, will differ because of the new medium. So there is an odd mix of feeling “everything is the same, nothing is new” and thinking that everything is different. During the summer institute, several people observed that special requirements may arise for data curation from the uniqueness of materials, but it seems to me that unique and non-unique materials may still engender similar problems in some respects.

    I do encourage you to include consideration of data issues in your curriculum, and I think TEI is a good instantiation of the kinds of data and complications those concerned with humanities textual data will be confronted with.
