Persistence and dereferenceability

[31 March 2009]

My esteemed former colleague Thomas Roessler has posted a musing on the fragility of the electronic historical record and the difficulties of achieving persistence, when companies go out of existence and consequently stop maintaining their Web sites or paying their domain registration fees.

After reading Thomas’s post, my evil twin Enrique came close to throwing a temper tantrum. (Actually, that’s quite unfair. For Enrique, he was remarkably well behaved.)

“The semantic web partisans,” he shouted, “have spent the last ten years or more telling us that URLs are the perfect naming mechanism: a single, integrated space of names with distributed naming authority. Haven’t they?”

“Well,” I said, “strictly speaking, I think they have mostly been talking about URIs, for the last few years at least.” He ignored this.

“They have been telling us we should use URLs for naming absolutely everything. Including everything we care about. Including Aeschylus and Euripides! Homer! Sappho! Including Shelley, and Keats, and Pope!”

I couldn’t help starting to hum ‘Brush up your Shakespeare’ at this, but he ignored me. This in itself was unusual; he is usually a sucker for Cole Porter. I guess he really was kind of worked up.

“And when anyone expressed concern about (a) the fact that the power to mint URLs is tied up with the regular payment of fees, so it’s really not equally accessible to everyone, or (b) the possibility that URLs don’t have the kind of longevity needed for real persistence, they just told us again, louder, that we should be using URLs for everything.”

“Now, don’t bring up URNs!” I told him, in a warning tone. “We don’t want to open those old wounds again, do we?”

“And why the hell not?” he roared. “What do the SemWeb people think they are playing at?!”

“Well,” I said.

“Either they are surprised at this problem, in which case you have to ask: ‘How can they be surprised? What kind of idiots must they be not to have seen this coming?’”

“Well,” I said.

“Or else they aren’t surprised, in which case you have to ask what they are smoking! Is their attention span so short that it has never occurred to them that names sometimes need to last for longer than Netscape, Inc., happens to be in business?”

“Well,” I said. I realized I didn’t really have a good answer.

“And you?!” he snarled, turning on me and grabbing my lapels. “You were there for years — you couldn’t take a moment to point out to them that a naming convention can be used for everything we care about only if it can be used for the monuments of human culture? You couldn’t be bothered to point out that URLs can be suitable for naming parts of our cultural heritage only if they can last for a few hundred, preferably a few tens of thousands, of years? What use are you?!”

“Well,” I said.

“What use are URLs and their much-hyped dereferenceability, if they can break this fast?”

“Well,” I said.

Long pause.

I am not sure Enrique’s complaints are entirely fair, but I also didn’t know how to answer them. I fear he is still waiting for an answer.

Managing to disagree

[30 March 2009]

For some reason, lately I’ve found an old remark of Allen Renear’s running through my head.

“We can disagree about many things; but can we disagree about everything?

“Or would that be like positing the existence of an airline so small, it has no nonstop flights?”

[Memory tells me that he said this at a meeting of the Society for Textual Scholarship; Google, aided by Robin Cover’s Cover Pages, tells me that it was in April 1995.]

It’s on my mind, perhaps, because with Claus Huitfeldt and Yves Marcoux I’ve been doing some work on a formal model of transcription, and when we have examined how multiple divergent transcriptions of the same exemplar look in our model, it has proven much harder than I would have thought possible to make the transcriptions actually contradict one another. (More on this in another post, perhaps.)

Consistency checking in prose

[26 March 2009]

Twice in the last couple of days I’ve run into documents with consistency problems. So I’ve been thinking about consistency checking in prose as a challenge.

The web site for a large organization has, in the About Us section of the site, a side bar saying so-and-so many employees in so-and-so many countries. And one of the documents within the About Us section talked about the organization’s efforts to be a good corporate citizen and employer at all of its so-and-so many locations. If you are in n locations in m countries, though, shouldn’t n be greater than or equal to m?
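The sanity check implicit here is mechanical once the two figures have been extracted from the prose; a minimal sketch (the function name and the figures are made up for illustration):

```python
def check_locations_vs_countries(locations: int, countries: int) -> list[str]:
    """Flag the inconsistency described above: an organization cannot
    operate in more countries than it has locations."""
    problems = []
    if locations < countries:
        problems.append(
            f"claimed {locations} locations but {countries} countries; "
            "expected locations >= countries"
        )
    return problems

# Hypothetical figures from the side bar and the body text:
print(check_locations_vs_countries(40, 60))
```

The hard part, of course, is not the comparison but reliably extracting the two numbers from running prose in the first place.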

The other example was documentation for a specialized XML vocabulary which included a summary of gaps in the vocabulary’s coverage and shortcomings in the design. The main gap, said the documentation, was that “the vocabulary offers no solution to the problem of XYZ.” But the vocabulary does offer a solution to that problem: the revision of the vocabulary to deal with problem XYZ is described with some pride two or three sections further up in the document.

One may speculate that in both cases, a perfectly true statement in the document was rendered false by later events, and statements added later to the document, reflecting the later state of affairs, contradict the earlier statements. (There was a gap in the vocabulary, and the documentation mentioned it as a potentially painful one. Eventually it was painful enough to be filled. And the tag-by-tag account of the markup was even revised to document the new constructs. But the description of gaps and shortcomings was not revised. And it’s not hard to believe that an organization may be in n locations at one point, and in a larger number of locations, in m countries, later on.)

In neither of these cases is the contradiction particularly embarrassing or problematic.

[“But I notice you name no names,” said Enrique. “Chicken.” “Hush,” I said. “The names are not relevant.”]

But of course the same problem happens in specs, where inconsistencies may have graver consequences.

[“Ha! I can think of some names under this rubric!” crowed Enrique. “Shut up!” I explained. “I’m trying to describe and understand the problem, not confess my sins.” “Oh, go on! Admitting you have a problem is the first step towards recovery. If you don’t admit that you have the problem, you’ll never — what the ?!” At this point, I managed to get the duct tape over his mouth.]

I think there must be two basic approaches to trying to avoid inconsistencies.

(1) You can try to prevent them arising at all.

(2) You can try to make them easier to detect automatically, so that an automated process can review a set of documents and flag passages that need attention.

Neither of these seems to be easy to do. But for both of them, it’s not hard to think of techniques that can help. And thinking about any kind of information system, whether it’s an XML vocabulary or a database management system or a programming language or a Web site content management system, or a complicated combination of the above, we can usefully ask ourselves:

How could we make it easier to prevent inconsistency in a set of documents?

How could we make it easier to keep a set of documents in synch with each other as they change?

How could we document the information dependencies between documents at a useful level of granularity? (Helpful, perhaps, to say “if document X changes, documents Y and Z, which depend on it, must be checked to see if they need corresponding revisions”, but a lot more helpful if you can say which sections in Y and Z depend on which bits of X.) Could we do it automatically?
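One way to record such dependencies at section granularity is a simple manifest that a checker can walk; this is only a sketch, and the document names, section labels, and timestamps below are all hypothetical:

```python
# Hypothetical dependency map: each (document, section) pair lists the
# (document, section) pairs it draws information from.
depends_on = {
    ("Y", "sec 2"): [("X", "sec 1")],
    ("Z", "sec 4"): [("X", "sec 1"), ("X", "sec 3")],
}

# Last-revised dates, encoded here as YYYYMMDD integers.
revised = {
    ("X", "sec 1"): 20090326,
    ("X", "sec 3"): 20090101,
    ("Y", "sec 2"): 20090102,
    ("Z", "sec 4"): 20090327,
}

def stale(depends_on, revised):
    """Flag sections revised less recently than a section they depend on;
    those are the passages that need a human to check for contradiction."""
    flags = []
    for target, sources in depends_on.items():
        for source in sources:
            if revised[source] > revised[target]:
                flags.append((target, source))
    return flags

print(stale(depends_on, revised))
# → [(('Y', 'sec 2'), ('X', 'sec 1'))]
```

A checker like this cannot say that anything is wrong, only that section 2 of Y has not been looked at since section 1 of X changed, which is exactly the kind of flag asked for above.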

It seems plausible that detecting inconsistencies between distant parts of a document would be easier if we could get a list of (some of) the entailments of each bit of a document.

How can we make it easier to document the entailments of a particular passage in a spec?

For making the entailments of a passage explicit (and thus amenable to mechanical consistency checking with other parts of the document set) I think there are several obvious candidates: RDF, Topic Maps, RDFa, the work Sam Hunting has been doing with embedded topic maps (see for example his report at Balisage 2008), colloquial XML structures designed for the specific system in question. Years ago, José Carlos Ramalho and colleagues were talking about semantic validation of document content; they must have had something to say about this too. (In the DTD, I recall, they used processing instructions.) Even standard indexing tools may be relevant.
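Whatever annotation syntax is chosen, the mechanical comparison itself can be quite simple once entailments have been extracted as subject / predicate / value triples; here is a sketch that assumes the extraction step has already happened, with made-up passage labels and triples echoing the XYZ example above:

```python
from collections import defaultdict

def find_contradictions(entailments):
    """Given (passage, subject, predicate, value) tuples extracted from a
    document set, flag subject/predicate pairs to which different passages
    assign different values."""
    seen = defaultdict(list)
    for passage, subject, predicate, value in entailments:
        seen[(subject, predicate)].append((passage, value))
    contradictions = []
    for key, claims in seen.items():
        values = {value for _, value in claims}
        if len(values) > 1:
            contradictions.append((key, claims))
    return contradictions

# Hypothetical triples: the documentation both asserts and denies
# that the vocabulary solves problem XYZ.
extracted = [
    ("sec 3", "vocabulary", "solves-XYZ", "yes"),
    ("sec 6", "vocabulary", "solves-XYZ", "no"),
    ("sec 1", "vocabulary", "version", "2.0"),
]
for key, claims in find_contradictions(extracted):
    print(key, claims)
# → ('vocabulary', 'solves-XYZ') [('sec 3', 'yes'), ('sec 6', 'no')]
```

In RDF or Topic Map terms the triples would of course be richer than this, but the point stands: the expensive part is annotating the passages, not comparing the annotations.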

How do these candidates compare? Are they all essentially equally expressive? Does one or the other make it easier to annotate the document? Is one or the other easier to process?

[“If you don’t admit that you have the problem, you’ll never be able to fix it. And you keep that roll of duct tape away from me, you hear?”]

Blogging and apothegms

[25 March 2009]

For some time now I’ve been carrying around a little notebook with (among other things) notes on various topics I have thought it would be useful and interesting to make blog posts about. I haven’t had time to work out coherent expositions or arguments on most of the topics, though, so nothing happens. All I’ve got are short fragments in a telegraphic style — just enough (I hope) to remind myself, when I come back to the topic, of the line of thought I wanted to pursue.

Sometimes I think I should post the notes I’ve got, despite their incomplete, inadequate formulations. It might not help you, dear reader (sorry), but it might make this lab notebook more useful for me.

(See also Matt Kirschenbaum’s ruminations from 2005 on the use(s) of blog posts, which is a message in a bottle I’ve just run across.)

And I have begun to wonder if this explains the aphoristic, telegraphic style I associate with the posthumous notebooks and journals of great writers, full of incomprehensibly terse remarks. Are the fragments of (say) the Schlegels nothing but notes for things they would later have worked up into blog posts, if only they had not been born two hundred years too soon?

Or perhaps I should say:

Notebook full of ideas for posts.

Telegraphic — aphoristic — apothegmatic?

Schlegels (Nietzsche?) as bloggers avant la lettre?

Is profundity nothing more than haste to get something — something — a trail of breadcrumbs? — down quick?

Hmm. Breadcrumbs. Guess DanC (all of DIG?) thinks so.