Daniel Boone meets the consistent Web

[22 July 2008]

My colleague Thomas Roessler writes:

[The monotonic semantics of RDF] guarantee that you won’t run into a world of inconsistency when you discover additional information, and they also guarantee that you can learn things about the world piece by piece.

My evil twin Enrique responds: So let us start with the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is identical to the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which I assume I can express using some predicate like the OWL sameAs.

And now let us discover additional information in another triple store, which contains the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is distinct from the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which it expresses using some predicate like the OWL differentFrom.

I’m having trouble understanding (concludes Enrique) how we can do this without either running into a world of inconsistency (a small world, perhaps, bounded in a nutshell, but still a world big enough for joe and Josephus to be both the same and different), or else running into a world in which we find that “inconsistency” has been defined to have a highly technical meaning under which the two triples just described are not actually inconsistent in the technical sense (why do I expect someone to start lecturing me about Herbrand models any moment now?), even though any application relying on the usual notions of identity and difference may find itself at a loss as to what to make of seeing them both in the same graph.
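
For concreteness, the merged graph Enrique has in mind fits in a few lines. The sketch below uses Python and rdflib purely because some toolkit had to be chosen; any RDF library would serve as well, and nothing here is more than an illustration.

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    NS1 = Namespace("http://www.w3.org/People/cmsmcq/2008/ns1#")
    NS2 = Namespace("http://www.w3.org/People/cmsmcq/2008/ns2#")

    g = Graph()
    g.add((NS1.joe, OWL.sameAs, NS2.Josephus))         # what the first store asserts
    g.add((NS1.joe, OWL.differentFrom, NS2.Josephus))  # what the second store adds

    # RDF itself swallows both triples without complaint; the contradiction
    # becomes an official "inconsistency" only when an OWL reasoner is run
    # over the merged graph.
    print(g.serialize(format="turtle"))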

I reminded Enrique of the American pioneer Daniel Boone, who proudly claimed that he had never been lost in his life. Never? Never. [Pause.] “But I was a mite bewildered once for three days.” [Rimshot.]

Balisage offers hope for the deadline-challenged

In my never-ending quest to help those who, like myself, never get around to things until the deadline is breathing down their necks, I have until now avoided mentioning that Balisage, the conference on markup theory and practice, has issued a call for late-breaking news.

The deadline for late-breaking submissions is 13 June 2008. It is now officially breathing down your neck.

There, will that do the job?

The 13th of June is this Friday. Just enough time to write up that great piece of work you just did, but not long enough to make a huge big thing of it and get all worked up in knots.

Balisage is an annual conference on markup theory and practice, held in early August each year in Montréal. Well, I say annual, but strictly speaking this is Balisage’s first year. The organizers have in the past been involved in other conferences in Montréal in August (most recently Extreme Markup Languages), and we regard Balisage as the natural continuation. So if you have always wanted to go to the Extreme Markup Languages conference, and are disappointed to see no announcements this year for that conference, come to Balisage. I think you’ll find what you’re looking for.

The full call for late-breaking news, and details of how to make a submission, are at http://www.balisage.net/latebreaking-call.html.

XML catalogs vs local caching proxy

[21 May 2008]

I have a senior colleague who has maintained, for several years, that SGML and XML catalogs are a deplorable special-case hack for a problem that should be solved by the more general means of HTTP caches. (Most recently, he was arguing against a proposal that the W3C distribute convenient packages of our most frequently used DTDs and schemas, with a catalog to make them easy to use. How someone so smart can be so deeply wrong-headed, I’m not sure.)

So when I had a network outage the other day that made it hard to get any work done, I thought about setting up a local caching proxy. Why did the outage make it hard to get anything done? Because I do use some software that doesn’t support catalogs, and which reacts to network outages by imposing a thirty-second delay for each DTD fetch (while its network request times out) and then proceeds anyway, without the DTD. Since it does proceed eventually, I can in fact build a new HTML version of the XSD spec (for example); it’s only that the process becomes painfully slow (or rather, even slower and more painful than usual).

But, I thought, the systems guys assure me that it’s not really hard for a user (not the system administrator, just a user) to set up a local caching proxy. So I’ll give it a try.

The upshot so far is: yes, it’s possible, though I wouldn’t call it easy. And managing catalogs still seems an order of magnitude easier and more straightforward. Here’s what I’ve done so far:

1 Apache ships with Mac OS X; it’s already running on my system (I use a local CGI script to log where my time goes), and mod_proxy (together with mod_cache) enables it to serve as a local caching proxy. So I decided to try that, instead of installing squid or something similar. Found instructions for configuring Apache as a local caching proxy on a Mac OS X site; they worked (although their author suggests commenting out the line “Deny all”, in the mistaken belief that otherwise nothing works). I did follow his advice to block a couple of random sites I can live without, in order to be able to request them and tell, from the resulting failure message, that the proxy service was working.
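
For the record, the relevant stanza of httpd.conf looks roughly like the sketch below. This is only a sketch: the module paths are those of a stock Mac OS X Apache 2.2 installation, the cache root is arbitrary, and your configuration may well differ.

    # --- sketch: Apache 2.2 as a local caching forward proxy ---
    LoadModule proxy_module       libexec/apache2/mod_proxy.so
    LoadModule proxy_http_module  libexec/apache2/mod_proxy_http.so
    LoadModule cache_module       libexec/apache2/mod_cache.so
    LoadModule disk_cache_module  libexec/apache2/mod_disk_cache.so

    ProxyRequests On
    ProxyVia On
    # No need to drop "Deny from all"; just let the local machine in.
    <Proxy *>
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Proxy>

    # Cache everything that passes through, on disk.
    # (The cache directory must exist and be writable by the Apache user.)
    CacheEnable disk /
    CacheRoot /private/var/tmp/http-proxy-cache
    CacheDirLevels 2
    CacheDirLength 1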

2 In System Preferences / Network / Airport / Proxies, I told the system to use http://localhost:80 as a proxy for HTTP requests.

I had illusions that this was it. At the system level (I fantasized), outgoing HTTP requests would be re-routed to the local Apache.

Ha.

This does suffice for Safari, and possibly for other Apple software (I don’t know, haven’t looked, don’t much care right now). But Firefox must be told separately about the proxy server. And Opera.

And the command-line tools that were the main reason I wanted a caching proxy in the first place? RXP, libxml, Saxon, and so on? Nope, not using the proxy.

3 After some disappointing experiences with the documentation for the tools I’m using (none of the documentation I found says anything at all about how to tell the software to use a proxy server), I learned from oblique references somewhere that setting the environment variable http_proxy works for some Unix tools.

So I tried export http_proxy=http://localhost:80 and curl, at least, started using the proxy server. libxml (and thus xmllint and xsltproc) also started using it, I think, or trying to, but the main symptom of this success was that they started emitting error messages informing me helpfully that

error : Operation in progress

When I stopped Apache, that message went away. When I unset the http_proxy environment variable, it also went away (whether Apache was running or not).

4 Along about this time I decided just to make libxml use my local catalogs. This turned out to be harder than I thought: setting XML_CATALOG_FILES=/Library/SGML/Public/catalog.xml elicited only this laconic message from xsltproc, xmllint, xmlcatalog, and anything else that uses libxml:

/Library/SGML/Public/Misc/catalog.xml:0: Catalog error : File /Library/SGML/Public/Misc/catalog.xml is not an XML Catalog

But of course it is an XML catalog. I can see that.

I validated it, just to make sure, using both xmllint and rxp. No problems.

5 Eventually, it became clear that libxml wanted an explicit namespace declaration in the root element. (I had been relying on the default value given in the DTD.) So <catalog> had to become <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> in all my XML catalogs. (DV, are you listening? The Namespaces Rec is quite explicit that namespace declarations may be defaulted by the DTD. Otherwise I never would have voted for it. RXP gets it right; thank you, Richard!)
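
For the record, a minimal catalog that libxml accepts has the shape shown below; the XHTML entries are only an illustration, standing in for whatever DTDs you actually keep locally.

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <!-- the explicit xmlns attribute is the part libxml insists on -->
      <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
              uri="xhtml/xhtml1-strict.dtd"/>
      <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
              uri="xhtml/xhtml1-strict.dtd"/>
    </catalog>

Once that much is in place, something like xmlcatalog catalog.xml "-//W3C//DTD XHTML 1.0 Strict//EN" is a quick way to confirm that resolution is actually happening.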

6 Eventually, the sick minds of Liam Quin and John Snelson suggested that perhaps I should try a different value for http_proxy: instead of http://localhost:80 I should try export http_proxy=http://127.0.0.1:80/. This eliminated the “error : Operation in progress” messages.
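
So for curl and the libxml-based tools the working recipe boils down to that one environment variable; Java-based tools like Saxon ignore http_proxy and have to be handed the proxy as JVM system properties instead. A sketch, in which the test URL and the jar name are only examples:

    export http_proxy=http://127.0.0.1:80/
    # The second and subsequent requests for the same resource should now
    # be answered out of the local Apache cache.
    curl -sI http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd | head -3
    # Java tools (Saxon among them) ignore http_proxy; they take the proxy
    # as JVM system properties, along the lines of
    #   java -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=80 -jar saxon.jar ...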

So I now have a local caching proxy working, and some of my tools, at least, are using it when they don’t find what they need using catalogs. I’ll assume that this is a Good Thing. But nothing I’ve seen so far tells me how to configure Apache (or squid, or any other proxy) the way I want to. I want a convenient list of the resources in the local cache, and I want to be able to mark some of them (e.g. the DTDs and the W3C stylesheets I use most often) as “Never ever delete this; ALWAYS have a copy handy; check every few months to see if it needs updating.” From the documentation of Apache and of Squid, I am inclined to believe this is not actually possible. At the very least, it’s not obvious. By default, Apache’s mod_proxy appears to plan to delete everything after 24 hours regardless of its expiration date. And the default size of the cache appears (can this possibly be?!) to be 5 KB.
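
The knobs that do exist look something like the following (the values are illustrative); they stretch the defaults, but none of them amounts to “never ever delete this”.

    # Stretch mod_cache's defaults (Apache 2.2); the values are illustrative.
    # Let cached copies be considered fresh for up to 30 days:
    CacheMaxExpire 2592000
    # When a response carries no expiry information, assume a week:
    CacheDefaultExpire 604800
    # Cache responses even when the server sends no Last-Modified header:
    CacheIgnoreNoLastMod On
    # And let mod_disk_cache keep objects up to 10 MB:
    CacheMaxFileSize 10485760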

So, so far, the caching proxy does not give me the guarantee I want: that the resources I care about will always be available, network or no network.

For catalogs, on the other hand, it would be nice to have some software that would augment the catalog with information about when a particular copy of the resource was fetched, when it was last modified, what its expiration date is (if the server provides one; surprising how few Web servers actually provide useful expiration information), and would check the Web periodically (say, once a month or so) to see whether any of my local copies of Web resources should be updated.
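
A crude approximation, for what it’s worth, is curl’s time-conditional fetch, run from cron every so often; the DTD named here is just an example, and the downloaded copy still has to be inspected and installed by hand.

    # Fetch a fresh copy only if the server's version is newer than the
    # timestamp on the local file.
    curl --silent \
         --time-cond xhtml1-strict.dtd \
         --output xhtml1-strict.dtd.new \
         http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd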

My interim conclusion is: both catalogs and HTTP caches could use improvement. As a way to ensure that the work I want to do can proceed without the network, however, catalogs are a lot more convenient, straightforward, and functional.

A kind of hommage

[12 March 2008]

I’m traveling to Germany, so I had an overnight flight last night. A long time ago I swore off red-eye flights from California to the East Coast, but for travel to Europe there seem to be no alternatives. Arriving in Amsterdam and then taking the train further seemed like a civilized way to organize things: it doesn’t make that much difference in the final arrival time, and trains are almost always more comfortable to travel in than airplanes.

I forgot to ask whether there were direct connections. There aren’t.

I had dimly imagined climbing into a train, finding a seat, and traveling further in a half sleep for a few hours before being deposited at my destination. The wait in Amsterdam seemed normal enough — I like Schiphol, it feels like one of the more civilized of European airports I travel through. But the change in Utrecht was a bit of a strain, and by noon and my second change of trains in Duisburg I had begun to feel as if I were in a surrealist movie.

Somewhere along the way, by a process that seems to have resembled automatic writing, the following account of an encounter with my evil twin appeared in my notebook.

My evil twin dropped by again the other day. He was not happy that in a couple of recent posts I had given him the pseudonym “Skippy”.

“Skippy!? What kind of name is that for an evil twin? I want a new name.”

“Look, I’m sorry if you didn’t like it. I was in kind of a hurry and it was the name that came to mind.”

“‘Skippy!’” He sulked a little more. “It sounds like the kind of evil twin George H. W. Bush would have.”

“Well, exactly,” I said. “It’s a reference to Garry Trudeau’s strips about Bush 41. Think of it as a kind of hommage.”

“Hommage, hell, you just got lazy. I don’t want to be Skippy, even as a cover name. Call me … Enrique.”

“Enrique.”

“Yes. Don’t you recognize it? It’s a reference to the Incredible Zambini Brothers.”

“Who? Sounds like an obscure San Francisco band.”

“No, you’re probably thinking of the Sons of Champlin. But it’s true, the Zambini brothers did have a cult classic once — The Incredible Zambini Brothers, All-Stars Again. It was a kind of combination jam session, cookbook, literary anthology, and football playbook. Riveting, really, if you have the right kind of chemical enhancements. So yeah, think of it as a kind of hommage.”

I have got to start getting more sleep when I do trans-Atlantic flights.

Scenes from a Recommendation 3: Boston, Prudential Tower

Another memory from the development of XML.

It’s November 1996, at the GCA SGML ’96 conference, at the Sheraton in Boston. The SGML on the Web Working Group and ERB have just been through an exhausting and exhilarating few weeks, when from a standing start we prepared the first public working draft of XML. At this conference, we have been given a slot for late-breaking news and will give the first public presentation of our work.

Lou Burnard, of Oxford University Computing Services, the founder of the Oxford Text Archive, is there to give an opening plenary talk about the British National Corpus, a 100-million-word representative corpus of British English, tagged in SGML. Lou and I are old friends; since 1988 we have worked together as editors of the Guidelines of the Text Encoding Initiative. Working together to shepherd a couple dozen working groups and task forces full of recalcitrant academics and other-worldly text theorists (“but why should a stanza have to contain lines of verse? I can perfectly well imagine a stanza containing no lines at all”) from requirements to draft proposals, to turn their wildly inconsistent and incomplete results into something resembling a coherent set of rules for encoding textual material likely to be useful to scholarship, and to produce in the end 1500 pages of mostly coherent prose documentation for the TEI encoding scheme, Lou and I have been effectively joined at the hip for years. We have consumed large quantities of good Scotch whisky together, and some quantities of beer and not so good whisky. We have told each other our life stories, in a state of sufficient inebriation that neither of us remembers any details beyond our shared admiration for Frank Zappa. We have sympathized with each other in our struggles with our respective managements; we have supported each other in working group and steering committee meetings; we have pissed each other off repeatedly, and learned, with a little help from our friends (thank you, Elaine), to patch things up and keep going. No one but my wife knew me better than Lou; no one but my wife could push my buttons and enrage me more effectively. (And she didn’t push those buttons nearly as often as Lou did.)

Tim Bray is also there, naturally. He and I have not worked together nearly as long as Lou and I have, but the compressed schedule and the intensity of the XML work have made it a similarly close relationship. We spend time on the phone arguing about the best way to define this feature or that, or counting noses to see which way a forthcoming decision is likely to come out (we liked to try to draft wording in advance of the decisions, when possible). We commiserate when Charles Goldfarb calls and spends a couple hours trying to wear us down on the technical issue of the day. (Fortunately, Charles called Tim and Jon Bosak more often than me. Either he decided he couldn’t wear me down, or he concluded I was a lightweight not worth worrying about. I’m not complaining.) Like Lou, Tim often reads a passage I have drafted and says “This is way too complicated, let’s just leave this, and this, and this, and that, out. See? Now it’s a lot simpler.”

At one point I believed it was generally a good idea for an editorial team to have a minimalist and a maximalist yoked together: the maximalist gets you the functionality you need, and the minimalist keeps it from being much more than you need. Maybe it is a good idea in general. Or maybe it was just that it worked well both in the TEI and in XML. At the very least, it’s suggestive that in the work on the XML Schema spec, I was the resident minimalist; if in any working group I am the minimalist, it’s a good bet that the product of that WG will be regarded as baroque by most readers.

It’s the evening before the conference proper, and there is a reception for attendees in a lounge at the top of the Prudential Tower. I am standing chatting with Tim Bray and Lauren Wood, and suddenly Lou comes striding urgently across the room towards us. He reaches us. He looks at me; he looks at Tim; and he says, in pitch-perfect tones of the injured spouse, “So this is the other editor you’ve been seeing behind my back!”