Archive for the ‘Data longevity’ Category

XML for the Long Haul

Tuesday, February 23rd, 2010

[23 February 2010]

The organizing committee for Balisage 2010 have announced the topic of this year’s one-day pre-conference symposium: “XML for the Long Haul: Issues in the Long-term preservation of XML,” and have issued a call for participation. The basic question is roughly this: what do we need to do to make sure that XML-encoded data is usable for long periods? Descriptive markup was invented by people who needed and desired longevity, application independence, and device independence for their data, so longevity is often used as a selling point for the use of SGML and XML. And as sometimes happens with selling points, the precise nature of the relation between XML and long-lived data is sometimes obscured, to the point where some potential customers may believe they have been told that the use of XML in itself guarantees data longevity. And maybe they have, but not (I should think) by anyone who knows what they are talking about. The use of XML, or more generally descriptive markup, may be a necessary condition of data longevity, but it’s unlikely to be sufficient, just as a hammer may be necessary (or extremely helpful) in getting a nail driven, but buying a hammer does not by itself get the nail into the wall.

There’s a lot to be said about the facets and ramifications of the topic, but I think I’ll save those for later posts. For now, I’ll just say that I’ll be chairing the symposium this year, and I hope to see readers of this blog in Montréal in August.

[My evil twin Enrique had been tugging my elbow for some time, and now asked “So why is the logo a moving truck? Will non-native speakers of English understand the reference?” I don't know (but if you do, I'm interested to learn: native speakers of other languages, please speak up! Does the logo make sense outside of English?), but I can at least explain. The English phrase “long haul” refers most literally to long distances, especially for the transport of freight or people (as in “long-haul flights” and “long-haul trucking”). In an extended sense (originally metaphorical, I guess) it denotes a protracted or difficult task (“we're in it for the long haul”) or an extended period of time. Long-term preservation of data and meaning involves a long haul both in the sense of being a difficult task and of involving long period of time. “Oh,” said Enrique. “I get it! The logo is a truck used for long-haul freight transport, the way XML may be used for long-haul preservation of information. Don't you think you should explain that somewhere?” “Maybe,” I said. Maybe I should.

XML as a sort-of open format

Thursday, December 3rd, 2009

[3 December 2009]

I just encountered the following statements in technical documentation for a family of products which I’ll leave nameless.

This document does not describe the complete XML schema for either [Application 1] or [Application 2]. The complete XML schema for both applications is not available and will not be made public.

Perhaps there can be good reasons for such a situation. Perhaps the developers really don’t know how to use any existing schema language to describe the set of documents they actually accept; perhaps only a Turing machine can actually identify set of documents accepted, and the developers were unwilling to work with a simpler set whose membership could be more cheaply decided. (Well, wait, those may be reasons, but they don’t actually qualify as “good”.)

I wonder whether this is an insidious attempt to look like the products have an open format (See? it’s XML! How much more open can you get?) while ensuring that the commercial products in question remain the only arbiters of acceptable documents? Or whether the programmers in question were just too lazy to specify a clean vocabulary and ensure that their software handles all documents which meet some standard of validity that does not require Turing completeness?

Having a partially defined XML format is, at least for me, still a great deal more convenient than having the format be binary and completely undocumented. But it certainly seems to fall a long distance short of what XML could make possible.

Quote of the day

Tuesday, August 25th, 2009

[25 August 2009]

Data persistence is a crapshoot. Load the dice.

-Dorothea Salo, Equipment and data curation, 7 August 2009 (on preferring widely supported open formats to niche formats and closed formats).

Mainframe terminal rooms and the oral tradition

Tuesday, July 7th, 2009

[7 July 2009]

A number of XML experts I know use Emacs for editing XML, employing either James Clark’s nxml mode or Lennart Staflin’s psgml mode. But few people who don’t already know Emacs are eager to learn it.

My evil twin Enrique suggested a reason: “In the old days,” (he means thirty years ago, when he first learned to use computers), “using a computer mostly meant using a mainframe. Which meant, on most university campuses, using a public terminal room. Which meant there were usually other people around who might be able to help figure out how to make the editor do something. Emacs was able to spread widely in that culture because the written documentation was not the only available source of information. (Did Emacs even have written documentation in those days?) Emacs, and a lot of other tools, were propagated by oral tradition.

“Nowadays, however, the oral traditions of the public terminal room are mostly dead. What the user cannot figure out how to use from the user interface and (perhaps) a glance at the documentation, might as well not be in the program. Fewer and fewer users will trouble to learn Emacs.

“I predict that when the people who first learned computing in a mainframe terminal room are dead, Emacs will be effectively dead, too. Its natural method of propagation is by looking over someone’s shoulder at what they are doing and asking ‘How did you do that?’ That doesn’t happen when computing almost always happens in private places.

“R.I.P., Emacs,” he intoned mournfully. “And probably TeX and LaTeX, too.”

“Well, hang on,” I said. “Neither Emacs nor TeX is dead yet.”

“Maybe not, but it’s only a matter of time. They’ll end up in the Retro-Computing Museum.” I could have sworn I saw a tear in his eye.

“But, you know, it’s only a matter of time for all of us. And besides, you’re wrong in at least some ways. I did indeed spend the first few years of my computing life haunting university terminal rooms. I got a lot of help from other people, and I passed it on. But I didn’t use TeX or Emacs until years later. The oral traditions of the terminal room, if they ever actually existed, had nothing to do with it. Both Emacs and TeX are perfectly capable of acquiring new users without oral transmission.”

He looked up. “You mean, there’s hope yet?”

“There’s always hope. But no, I’m still not going to help you debug that self-modifying 360 Assembler program you brought over. I’ve got work to do.”

Sustainability, succession plans, and PURLs — Burial societies for libraries?

Sunday, May 24th, 2009

[24 May 2009]

At the Summer Institute on Data Curation in the Humanities (SIDCH) this past week in Urbana (see previous post), Dorothea Salo surveyed a variety of threats to the longevity of humanities data, including lack or loss of institutional commitment, and/or death (failure) of the institution housing the data. People serious about maintaining data accessible for long periods need to make succession plans: what happens to the extensive collection of digital data held by the XYZ State University’s Institute for the History of Pataphysical Research when the state legislature finally notices its existence and writes into next year’s budget a rule forbidding university administrators to fund it in any year which in the Gregorian calendar is either (a) a leap year or (b) not a leap year, and (c) requiring the adminstrators to wash their mouths out with soap for having ever funded the Intitute in the first place?

Enough centers for computing in the humanities have been born in the last fifty years, flourished some years, and later died, that I can assure the reader that the prospect of going out of existence should concern not only institutes for the history of pataphysics, but all of us.

It’s good if valuable data held by an organization can survive its end; from the point of view of URI persistence it would be even better if the URL used to refer to the data didn’t have to change either.

I have found myself thinking, the last couple of days, about a possible method of mitigating this threat, that runs something like this.

  • A collection of reasonably like-minded organizations (or individuals) forms a mutual assistance pact for the preservation of data and URIs.
  • The group sets up and runs a PURL server, to provide persistent URLs for the data held by members of the group. [Alternate approach: they all agree to use OCLC's PURL server.]
  • Using whatever mechanism they choose, the members of the group arrange to mirror each other’s data in some convenient way. Some people will use rsync or similar tools; Dorothea Salo observed that LOCKSS software can also do this kind of job with very low cost in time and effort.
  • If any of the partners in the mutual assistance pact lose their funding or go out of existence for other reasons, the survivors agree on who will serve the decedent’s data. The PURL resolution tables are updated to point to the new location.
  • Some time before the count of partners is down to one, remaining partners recruit new members. (Once the count hits zero, the system has failed.)

    [Wendell Piez observed, when we got to this point of our discussion of this idea, “There's a Borges story in that, just waiting for someone to write it.” I won't be surprised if Enrique is working on one even as I write.]

In some cases, people will not want to use PURLs, because when they make available the kind of resources whose longevity is most obviously desirable, the domain name in the URLs performs a sort of marketing or public awareness function for their organization. I suppose one could allow the use of non-PURL domains, if the members of the group can arrange to ensure that upon the demise of an individual member the ownership of their domains passes seamlessly to some other member of the group, or to to the group as a whole. But this works only for domain owners, and only if you can find a way to ensure the orderly transfer of domain ownership. Steve Newcomb, my colleague in the organizing committee for the Balisage conference on the theory and practice of markup, points out a difficulty here: in cases of bankruptcy, the domain name may be regarded as an asset and it may therefore be impossible to transfer it to the other members of the mutual assistance society.

It’s a little bit like the burial societies formed by immigrants in a strange land for mutual assurance, to ensure that when one dies, there will be someone to give them a decent burial according to the customs of their common ancestral homeland, so that their souls will not wander the earth restlessly in perpetuity.

It would be nice to think that the data collections we create will have a home in the future, lest their ghosts wander the earth restlessly, bewailing their untimely demise, accusing us of carelessness and negligence in letting them die, and haunting us in perpetuity.

Data curation for the humanities

Saturday, May 23rd, 2009

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation in the Humanities (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana/Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Craigin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown Univerisity Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result, when data are migrated from one representation to another. This certainly works in theory, and almost certainly it will work in practice, although the tools still need to be built to test the idea in practce. When I showed thd screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project (the name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curations; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

Persistence and dereferenceability

Tuesday, March 31st, 2009

[31 March 2009]

My esteemed former colleague Thomas Roessler has posted a musing on the fragility of the electronic historical record and the difficulties of achieving persistence, when companies go out of existence and coincidentally stop maintaining their Web sites or paying their domain registration fees.

After reading Thomas’s post, my evil twin Enrique came close to throwing a temper tantrum. (Actually, that’s quite unfair. For Enrique, he was remarkably well behaved.)

“The semantic web partisans,” he shouted, “have spent the last ten years or more telling us that URLs are the perfect naming mechanism: a single, integrated space of names with distributed naming authority. Haven’t they?”

“Well,” I said, “strictly speaking, I think they have mostly been talking about URIs, for the last few years at least.” He ignored this.

“They have been telling us we should use URLs for naming absolutely everything. Including everything we care about. Including Aeschylus and Euripides! Homer! Sappho! Including Shelley, and Keats, and Pope!”

I couldn’t help starting to hum ‘Brush up your Shakespeare’ at this, but he ignored me. This in itself was unusual; he is usually a sucker for Cole Porter. I guess he really was kind of worked up.

“And when anyone expressed concern about (a) the fact that the power to mint URLs is tied up with the regular payment of fees, so it’s really not equally accessible to everyone, or (b) the possibility that URLs don’t have the kind of longevity needed for real persistence, they just told us again, louder, that we should be using URLs for everything.”

“Now, don’t bring up URNs!” I told him, in a warning tone. “We don’t want to open those old wounds again, do we?”

“And why the hell not?” he roared. “What do the SemWeb people think they are playing at?!”

“Well,” I said.

“Either they are surprised at this problem, in which case you have to ask: ‘How can they be surprised? What kind of idiots must they be not to have seen this coming?’“

“Well,” I said.

“Or else they aren’t surprised, in which case you have to ask what they are smoking! Is it their attention span so short that it has never occurred to them that names sometimes need to last for longer than Netscape, Inc., happens to be in business?”

“Well,” I said. I realized I didn’t really have a good answer.

“And you?!” he snarled, turning on me and grabbing my lapels. “You were there for years — you couldn’t take a moment to point out to them that a naming convention can be used for everything we care about only if it can be used for the monuments of human culture? You couldn’t be bothered to point out that URLs can be suitable for naming parts of our cultural heritage only if they can last for a few hundred, preferably a few tens of thousands, of years? What use are you?!”

“Well,” I said.

“What use are URLs and their much hyped dereferenceability, if they can break this fast?”

“Well,” I said.

Long pause.

I am not sure Enrique’s complaints are entirely fair, but I also didn’t know how to answer them. I fear he is still waiting for an answer.