XML catalogs vs local caching proxy

[21 May 2008]

I have a senior colleague who has maintained, for several years, that SGML and XML catalogs are a deplorable special-case hack for a problem that should be solved by the more general means of HTTP caches. (Most recently, he was arguing against a proposal that the W3C distribute convenient packages of our most frequently used DTDs and schemas, with a catalog to make them easy to use. How someone so smart can be so deeply wrong-headed, I’m not sure.)

So when I had a network outage the other day that made it hard to get any work done, I thought about setting up a local caching proxy. Why did the outage make it hard to get anything done? Because I do use some software that doesn’t support catalogs, and which reacts to network outages by imposing a thirty-second delay for each DTD fetch (while its network request times out) and then proceeds anyway, without the DTD. Since it does proceed eventually, I can in fact build a new HTML version of the XSD spec (for example); it’s only that the process becomes painfully slow (or rather, even slower and more painful than usual).

But, I thought, the systems guys assure me that it’s not really hard for a user (not the system administrator, just a user) to set up a local caching proxy. So I’ll give it a try.

The upshot so far is: yes, it’s possible, though I wouldn’t call it easy. And managing catalogs still seems an order of magnitude easier and more straightforward. Here’s what I’ve done so far:

1 Apache ships with Mac OS X, it’s already running on my system (I use a local CGI script to log where my time goes), and mod_proxy enables it to serve as a local caching proxy. So I decided to try that, instead of installing squid or something similar. I found instructions for configuring Apache as a local caching proxy on a Mac OS X site; they worked (although the author suggests commenting out the line “Deny all”, in the mistaken belief that otherwise nothing works). I followed his advice and blocked a couple of random sites I can live without, in order to be able to request them and tell, from the resulting failure message, that the proxy service was working.
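For concreteness, the relevant httpd.conf directives look roughly like this (a sketch only, assuming Apache 2.2 with mod_proxy and mod_disk_cache; the module paths and the cache location are illustrative, not copied from the instructions I followed):

    # Load the forward-proxy and disk-cache modules (paths as shipped on Mac OS X).
    LoadModule proxy_module      libexec/apache2/mod_proxy.so
    LoadModule proxy_http_module libexec/apache2/mod_proxy_http.so
    LoadModule cache_module      libexec/apache2/mod_cache.so
    LoadModule disk_cache_module libexec/apache2/mod_disk_cache.so

    ProxyRequests On               # act as a forward proxy
    <Proxy *>
        Order deny,allow
        Deny from all              # no need to comment this out ...
        Allow from 127.0.0.1       # ... just allow the local machine explicitly
    </Proxy>

    CacheEnable disk /             # cache what the proxy fetches (as in Apache's own sample config)
    CacheRoot /var/cache/apache-proxy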

2 In System Preferences / Network / Airport / Proxies, I told the system to use http://localhost:80 as a proxy for HTTP requests.

I had illusions that this was it. At the system level (I fantasized), outgoing HTTP requests would be re-routed to the local Apache.

Ha.

This does suffice for Safari, and possibly for other Apple software (I don’t know, haven’t looked, don’t much care right now). But Firefox must be told separately about the proxy server. And Opera.

And the command-line tools that were the main reason I wanted a caching proxy in the first place? RXP, libxml, Saxon, and so on? Nope, not using the proxy.

3 After some disappointing experiences with the documentation for the tools I’m using (none of the documentation I found says anything at all about how to tell the software to use a proxy server), I learned from oblique references somewhere that setting the environment variable http_proxy works for some Unix tools.

So I tried export http_proxy=http://localhost:80 and curl, at least, started using the proxy server. libxml (and thus xmllint and xsltproc) also started using it, I think, or trying to, but the main symptom of this success was that they started emitting error messages informing me helpfully that

error : Operation in progress

When I stopped Apache, that message went away. When I unset the http_proxy environment variable, it also went away (whether Apache was running or not).
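(For the record, the whole experiment amounts to something like this at the shell; the URL is purely illustrative:)

    export http_proxy=http://localhost:80   # curl honors this variable
    curl -I http://www.w3.org/ | head -1    # this request now goes via the local Apache
    unset http_proxy                        # back to direct connections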

4 Along about this time I decided just to make libxml use my local catalogs. This turned out to be harder than I thought: setting XML_CATALOG_FILES=/Library/SGML/Public/catalog.xml elicited only this laconic message from xsltproc, xmllint, xmlcatalog, and anything else that uses libxml:

/Library/SGML/Public/Misc/catalog.xml:0: Catalog error : File /Library/SGML/Public/Misc/catalog.xml is not an XML Catalog.

But of course it is an XML catalog. I can see that.

I validated it, just to make sure, using both xmllint and rxp. No problems.
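(Concretely, something like this, using the path from the error message; --valid asks xmllint for DTD validation and --noout suppresses the output tree:)

    xmllint --valid --noout /Library/SGML/Public/Misc/catalog.xml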

5 Eventually, it became clear that libxml wanted an explicit namespace declaration in the root element. (I had been relying on the default value given in the DTD.) So <catalog> had to become <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> in all my XML catalogs. (DV, are you listening? The Namespaces Rec is quite explicit that namespace declarations may be defaulted by the DTD. Otherwise I never would have voted for it. RXP gets it right; thank you, Richard!)
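So each catalog now starts along these lines (the system entry shown is only an illustration; the xmlns attribute on the root element is the point):

    <?xml version="1.0"?>
    <!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
        "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
    <!-- libxml insists on the explicit xmlns, although the DTD defaults it -->
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <!-- illustrative entry: map a W3C DTD to a local copy -->
      <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
              uri="xhtml1-strict.dtd"/>
    </catalog>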

6 Eventually, the sick minds of Liam Quin and John Snelson suggested that perhaps I should try a different value for http_proxy: instead of http://localhost:80 I should try export http_proxy=http://127.0.0.1:80/. This eliminated the “error : Operation in progress” messages.

So I now have a local caching proxy working, and some of my tools, at least, are using it when they don’t find what they need using catalogs. I’ll assume that this is a Good Thing. But nothing I’ve seen so far tells me how to configure Apache (or squid, or any other proxy) the way I want to. I want a convenient list of the resources in the local cache, and I want to be able to mark some of them (e.g. the DTDs and the W3C stylesheets I use most often) as “Never ever delete this; ALWAYS have a copy handy; check every few months to see if it needs updating.” From the documentation of Apache and of Squid, I am inclined to believe this is not actually possible. At the very least, it’s not obvious. By default, Apache’s mod_proxy appears to plan to delete everything after 24 hours regardless of its expiration date. And the default size of the cache appears (can this possibly be?!) to be 5 KB.
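For what it’s worth, these appear to be the knobs on offer (a sketch against Apache 2.2’s mod_cache, with illustrative values; none of them amounts to the “never ever delete this” pin I want):

    CacheMaxExpire     2592000    # allow entries to stay fresh up to 30 days, not 24 hours
    CacheDefaultExpire 604800     # assume a week of freshness when the server gives no expiry
    CacheIgnoreNoLastMod On       # cache responses that lack a Last-Modified header
    # Disk usage is pruned by a separate program, not by httpd itself, e.g.
    #   htcacheclean -d60 -p/var/cache/apache-proxy -l512M
    # and nothing here lets me pin an individual resource as permanent.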

So far, then, the caching proxy does not give me the guarantee I want: that the resources I care about will always be available, network or no network.

For catalogs, on the other hand, it would be nice to have some software that would augment the catalog with information about when a particular copy of the resource was fetched, when it was last modified, what its expiration date is (if the server provides one; surprising how few Web servers actually provide useful expiration information), and would check the Web periodically (say, once a month or so) to see whether any of my local copies of Web resources should be updated.
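The checking part, at least, is nearly a one-liner already (a sketch; the URL and paths are illustrative, and a real tool would loop over the catalog’s entries):

    # Fetch only if the server's copy is newer than the local one:
    # --time-cond with a filename sends If-Modified-Since based on that file's date,
    # and --remote-time stamps the download with the server's Last-Modified.
    curl --silent --remote-time \
         --time-cond /Library/SGML/Public/W3C/xhtml1-strict.dtd \
         --output /tmp/xhtml1-strict.dtd.new \
         http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    # a non-empty /tmp/xhtml1-strict.dtd.new means the local copy is out of date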

My interim conclusion is: both catalogs and HTTP caches could use improvement. As a way to ensure that the work I want to do can proceed without the network, however, catalogs are a lot more convenient, straightforward, and functional.

5 thoughts on “XML catalogs vs local caching proxy”

  1. I played around a bit in that space, with Squid:
    http://people.w3.org/~dom/archives/2006/09/offline-web-cache-with-squid/

    I think the configuration I gave for Squid allows you to make sure that nothing gets deleted as long as it fits in the allotted disk space. It also relies on being informed when you’re offline vs online, which is fairly easy to set up on Linux, but might not apply so well on Mac.

    (also, setting the proxy through the UI on Linux will work across all programs, because it indeed sets the http_proxy environment variable :).
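    (Roughly, the relevant squid.conf lines are of this sort; the values here are illustrative, and the real configuration is in the linked post:)

        cache_dir ufs /var/spool/squid 1024 16 256   # 1 GB of disk cache
        maximum_object_size 65536 KB                 # don't refuse large resources
        offline_mode on                              # serve from cache without revalidating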

  2. Thanks for the report!

    I use XML catalog-based entity resolution, and I like to keep local copies of XML-related resources as a mirror of the canonical URLs. That makes it a no-brainer to add new resources, in contrast to the way most packages are structured today. See:

    http://cygwin.com/ml/cygwin-apps/2003-05/msg00175.html

    In addition to the requirements that I listed there, I would add:

    – It should be easy to manage layered entity resolution, so that I could have a general set of rules for a company, some additional rules for each user, and a special set of rules for a specific project.

    This is possible using XML catalogs.
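    (For instance, the layering can be expressed with nextCatalog delegation; the paths here are illustrative:)

        <!-- a project catalog: its own entries are consulted first,
             then the user's catalog, then the company-wide one -->
        <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
          <nextCatalog catalog="file:///home/user/catalog.xml"/>
          <nextCatalog catalog="file:///etc/xml/company-catalog.xml"/>
        </catalog>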

    I don’t quite like the idea of using caching proxies (that will affect all HTTP traffic) just for caching XML resources. Besides, not all resources can reliably be retrieved using HTTP; some never had an HTTP URL to begin with, and I really don’t want them to.

  3. When I explain XML Catalogs, some people suggest that they aren’t really necessary, that the right answer to the problem of access to sporadically available resources is some form of cache. Most recently, I noticed that Mark Baker said it, but I’ve heard others say it too.

  4. Pingback: W3C sperrt Java aus | blog.linkwerk.com
