XML catalogs vs local caching proxy

[21 May 2008]

I have a senior colleague who has maintained, for several years, that SGML and XML catalogs are a deplorable special-case hack for a problem that should be solved by the more general means of HTTP caches. (Most recently, he was arguing against a proposal that the W3C distribute convenient packages of our most frequently used DTDs and schemas, with a catalog to make them easy to use. How someone so smart can be so deeply wrong-headed, I’m not sure.)

So when I had a network outage the other day that made it hard to get any work done, I thought about setting up a local caching proxy. Why did the outage make it hard to get anything done? Because I do use some software that doesn’t support catalogs, and which reacts to network outages by imposing a thirty-second delay for each DTD fetch (while its network request times out) and then proceeds anyway, without the DTD. Since it does proceed eventually, I can in fact build a new HTML version of the XSD spec (for example); it’s only that the process becomes painfully slow (or rather, even slower and more painful than usual).

But, I thought, the systems guys assure me that it’s not really hard for a user (not the system administrator, just a user) to set up a local caching proxy. So I’ll give it a try.

The upshot so far is: yes, it’s possible, though I wouldn’t call it easy. And managing catalogs still seems an order of magnitude easier and more straightforward. Here’s what I’ve done so far:

1. Apache ships with Mac OS X, it’s already running on my system (I use a local CGI script to log where my time goes), and mod_proxy enables it to serve as a local caching proxy. So I decided to try that, instead of installing Squid or something similar. Found instructions for configuring Apache as a local caching proxy on a Mac OS X site; they worked (although their author suggests commenting out the line “Deny all”, in the mistaken belief that otherwise nothing works). I followed his advice and blocked a couple of random sites I can live without, in order to be able to request them and tell, from the resulting failure message, that the proxy service was working.
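
For the record, the relevant httpd.conf stanza has roughly this shape. This is a sketch assuming Apache 2.2’s mod_proxy, mod_cache, and mod_disk_cache; the module paths and cache directory are illustrative, not copied from my actual configuration:

    LoadModule proxy_module libexec/apache2/mod_proxy.so
    LoadModule proxy_http_module libexec/apache2/mod_proxy_http.so
    LoadModule cache_module libexec/apache2/mod_cache.so
    LoadModule disk_cache_module libexec/apache2/mod_disk_cache.so

    # Act as a forward proxy, but only for local clients
    # (note: "Deny from all" stays in, pace the instructions above).
    ProxyRequests On
    <Proxy *>
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Proxy>

    # Cache what passes through, on disk.
    CacheEnable disk /
    CacheRoot /var/cache/apache-proxy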

2. In System Preferences / Network / AirPort / Proxies, I told the system to use http://localhost:80 as a proxy for HTTP requests.
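
(In principle the same preference can be set from the command line with Apple’s networksetup utility; this is untested, and “AirPort” is a guess at the right service name:

    networksetup -setwebproxy AirPort localhost 80

but the System Preferences route is what I actually used.)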

I had illusions that this was it. At the system level (I fantasized), outgoing HTTP requests would be re-routed to the local Apache.

Ha.

This does suffice for Safari, and possibly for other Apple software (I don’t know, haven’t looked, don’t much care right now). But Firefox must be told separately about the proxy server. And Opera.

And the command-line tools that were the main reason I wanted a caching proxy in the first place? RXP, libxml, Saxon, and so on? Nope, not using the proxy.

3. After some disappointing experiences with the documentation for the tools I’m using (none of the documentation I found says anything at all about how to tell the software to use a proxy server), I learned from oblique references somewhere that setting the environment variable http_proxy works for some Unix tools.

So I tried export http_proxy=http://localhost:80 and curl, at least, started using the proxy server. libxml (and thus xmllint and xsltproc) also started using it, I think, or trying to, but the main symptom of this success was that they started emitting error messages informing me helpfully that

error : Operation in progress

When I stopped Apache, that message went away. When I unset the http_proxy environment variable, it also went away (whether Apache was running or not).

4. Along about this time I decided just to make libxml use my local catalogs. This turned out to be harder than I thought: setting XML_CATALOG_FILES=/Library/SGML/Public/Misc/catalog.xml elicited, from xsltproc, xmllint, xmlcatalog, and anything else that uses libxml, only the laconic message: /Library/SGML/Public/Misc/catalog.xml:0: Catalog error : File /Library/SGML/Public/Misc/catalog.xml is not an XML Catalog.

But of course it is an XML catalog. I can see that.

I validated it, just to make sure, using both xmllint and rxp. No problems.

5. Eventually, it became clear that libxml wanted an explicit namespace declaration in the root element. (I had been relying on the default value given in the DTD.) So <catalog> had to become <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> in all my XML catalogs. (DV, are you listening? The Namespaces Rec is quite explicit that namespace declarations may be defaulted by the DTD. Otherwise I never would have voted for it. RXP gets it right; thank you, Richard!)
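
For concreteness, a minimal catalog of the shape libxml will now accept looks like this (the particular entries and local paths are an illustration only, not the contents of my actual catalog):

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
              uri="xhtml1/xhtml1-strict.dtd"/>
      <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
              uri="xhtml1/xhtml1-strict.dtd"/>
    </catalog>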

6. Eventually, the sick minds of Liam Quin and John Snelson suggested that perhaps I should try a different value for http_proxy: instead of http://localhost:80 I should try export http_proxy=http://127.0.0.1:80/. This eliminated the “error : Operation in progress” messages.
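
Putting the pieces together, then, the working recipe for the command-line tools amounts to something like this (the stylesheet and document names are placeholders, and the curl line is just a check that traffic is going through the proxy):

    export XML_CATALOG_FILES=/Library/SGML/Public/Misc/catalog.xml
    export http_proxy=http://127.0.0.1:80/
    curl -I http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd   # should now go via the proxy
    xsltproc mystylesheet.xsl mydoc.xml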

So I now have a local caching proxy working, and some of my tools, at least, are using it when they don’t find what they need using catalogs. I’ll assume that this is a Good Thing. But nothing I’ve seen so far tells me how to configure Apache (or Squid, or any other proxy) the way I want to. I want a convenient list of the resources in the local cache, and I want to be able to mark some of them (e.g. the DTDs and the W3C stylesheets I use most often) as “Never ever delete this; ALWAYS have a copy handy; check every few months to see if it needs updating.” From the documentation of Apache and of Squid, I am inclined to believe this is not actually possible. At the very least, it’s not obvious. By default, Apache’s mod_proxy appears to plan to delete everything after 24 hours regardless of its expiration date. And the default size of the cache appears (can this possibly be?!) to be 5 KB.
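
The closest thing on offer seems to be on the deletion side: in Apache 2.2, trimming the disk cache is delegated to the separate htcacheclean utility, run along these lines (the path and size limit here are invented):

    htcacheclean -n -d60 -p/var/cache/apache-proxy -l100M

and nothing in its options, so far as I can see, lets particular resources be pinned.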

So, for now, the caching proxy does not give me the guarantee I want: that the resources I care about will always be available, network or no network.

For catalogs, on the other hand, it would be nice to have some software that would augment the catalog with information about when a particular copy of the resource was fetched, when it was last modified, what its expiration date is (if the server provides one; surprising how few Web servers actually provide useful expiration information), and would check the Web periodically (say, once a month or so) to see whether any of my local copies of Web resources should be updated.
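
No such tool exists, as far as I know, but the individual check is not hard. Here is a Python sketch of the test such a tool might run for each catalog entry (the function and its interface are invented for illustration; a real tool would also need to record fetch dates and Expires headers in, or alongside, the catalog):

    import email.utils
    import urllib.request

    def remote_copy_is_newer(url, local_mtime):
        # Ask the server, via a HEAD request, whether its copy of url
        # has changed since local_mtime (seconds since the epoch).
        # Returns None when the server supplies no Last-Modified header
        # (surprisingly often, as noted above).
        req = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(req) as resp:
            last_modified = resp.headers.get('Last-Modified')
        if last_modified is None:
            return None
        remote = email.utils.parsedate_to_datetime(last_modified)
        return remote.timestamp() > local_mtime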

My interim conclusion is: both catalogs and HTTP caches could use improvement. As a way to ensure that the work I want to do can proceed without the network, however, catalogs are a lot more convenient, straightforward, and functional.

Balisage preliminary program

The Balisage 2008 program committee met this week, and we now have a preliminary program. (As I write, the schedule-at-a-glance isn’t up yet, but it really should be up any moment now.)

We’ve got a good range of topics, with talks on resource-oriented architectures, and a framework for managing constraints, and how markup relates to the philosophical problem of universals, and how to handle overlap in a format that makes good use of the non-overlapping properties of XML; and a parallel algorithm for XML parsing and validation that exploits multiple-core machines, and an XML Exchange for messaging that uses XQuery to route messages, and structural metadata as a socio-technological problem, and using the Atom Publishing Protocol to build dynamic applications.

The day before Balisage begins, there will be a one-day symposium on versioning issues, which has also shaped up very nicely. (More on that in another post.)

Great topics, great people (I’ve heard a lot of these speakers, and I know they’re good), and Montreal in August. What more could you want?

So think hard about being in Montreal the week of 11-15 August, for the versioning symposium and for Balisage 2008. I look forward to seeing people there!

Transitive apertures

Sometimes ignorance really is bliss.

If I had normal mathematical training, I would probably know the answers to some of the questions that have been puzzling me lately. But I would have missed the fun of figuring out the answers for myself.

Consider a relation R on some domain, and consider its transitive closure R*. Or, for some purposes, consider its positive transitive closure R+.

Now, sometimes when one is interested both in R and in R*, it seems convenient to have R be as small as it is possible to make it, as long as the transitive closure remains the same. So if there are any ‘redundant’ members of R (members that we can remove without changing R*), we would like to remove them. Since this operation feels like the opposite of transitive closure, I have privately called it “transitive aperture” since I first encountered it nine or ten years ago.

Given two relations R and R’, R’ is a transitive aperture of R if and only if R’ is a subset of R and R’* = R*. R’ is a minimal transitive aperture of R if and only if R’ is a transitive aperture of R and no proper subset R” of R’ is also a transitive aperture of R.

For example: consider the < (less-than) relation on integers. It’s clear (a) that the transitive closure of < is < itself, and (b) that the successor relation on integers (1 = succ(0), 2 = succ(1), etc.) has the same transitive closure. It’s also clear (intuitively — I haven’t tried a proof) that if you omit anything from the successor relation, its transitive closure won’t be <, so the successor relation is apparently a minimal transitive aperture of <.

Now, in all the examples I’ve thought of so far, it’s been obvious both how to calculate the minimal transitive aperture of a given relation, and that there was only one.

But I’ve been asking myself several questions recently. Can I define the notion of minimal transitive aperture cleanly? Does every relation have one? Is it unique? I.e., does every relation have exactly one? Is there an algorithm for finding a minimal transitive aperture for an arbitrary relation?

I spent some time thinking about this today, while driving to Santa Fe and back, and I believe I know some of the answers. I think the definition above is clean and precise. And I think it’s obvious that there’s an algorithm for finding all minimal transitive apertures for any finite relation. (It’s finite, right? Take all the subsets and see if they are (a) a transitive aperture of R and (b) a minimal transitive aperture — if R is finite, then R has a finite number of subsets, so the algorithm will terminate. I didn’t say it was fast, I just said there’s an algorithm.) I don’t know whether there’s an algorithm for infinite relations. And it’s clear that for some relations, there is more than one minimal transitive aperture. But I don’t have time tonight to sketch any of this. I shouldn’t be writing this in the first place, I should be packing for a trip, but I wanted to get at least the formulation of the problem down.
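
For what it’s worth, here is that brute-force algorithm as a Python sketch (the representation of relations as sets of pairs, and all the names, are mine, not anything standard):

    from itertools import chain, combinations

    def transitive_closure(r):
        # (Positive) transitive closure of a finite relation,
        # represented as a set of ordered pairs.
        closure = set(r)
        while True:
            new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new <= closure:
                return closure
            closure |= new

    def minimal_transitive_apertures(r):
        # All minimal transitive apertures of a finite relation r:
        # subsets of r with the same transitive closure as r, having no
        # proper subset with that closure. Brute force: exponential in |r|.
        r = set(r)
        target = transitive_closure(r)
        subsets = chain.from_iterable(
            combinations(r, k) for k in range(len(r) + 1))
        apertures = [set(s) for s in subsets
                     if transitive_closure(s) == target]
        return [a for a in apertures if not any(b < a for b in apertures)]

    # Less-than on {0, 1, 2, 3}: the one minimal aperture is the
    # successor relation, as the intuition above predicts.
    lt = {(i, j) for i in range(4) for j in range(4) if i < j}
    print(minimal_transitive_apertures(lt))   # [{(0, 1), (1, 2), (2, 3)}]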

The idea can also be formulated in terms of graphs. Take any directed graph G; you have a set of nodes and a relation on that set. Take the transitive closure of that relation, and draw another graph T with the same set of nodes as G, and arcs representing the transitive closure of the arcs of G. Call T the transitive closure of G, or the reachability graph of G. (For any two nodes n and m, there is an arc from n to m in T if and only if m is reachable from n in G.) A transitive aperture of G is a digraph with the same set of nodes, a subset of G’s arcs, and the same reachability graph.

It’s clear, as I say, that there can be multiple minimal apertures with the same transitive closure. Any graph of three nodes a, b, c with a cycle among all three nodes a → b → c → a will do to show this: the cycle’s transitive closure is a complete graph on the three nodes. But the opposite cycle a → c → b → a has the same transitive closure, and both cycles are subsets of that complete graph; so the complete graph on three nodes has at least two distinct minimal transitive apertures. (Each cycle is minimal, because no two arcs can make all three nodes mutually reachable.) QED.
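
Running the brute-force sketch above on the complete graph bears this out:

    # The complete graph on {a, b, c} (loops included, since the closure
    # of a cycle contains them) has several distinct minimal apertures.
    complete = {(x, y) for x in 'abc' for y in 'abc'}
    for aperture in minimal_transitive_apertures(complete):
        print(sorted(aperture))

Besides the two directed cycles, it also finds minimal apertures like a → b, b → a, b → c, c → b (four arcs, no one of which can be dropped); so the minimal transitive apertures of a relation need not even be the same size.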

So it’s clear that for digraphs with cycles, there may be multiple minimal transitive apertures.

And for acyclic digraphs?