URI casuistry | Messages in a Bottle

[4 March 2013]

I’ve been working lately on improving a DTD parser I wrote some time ago.

It’s been instructive to work with libcurl, the well known URL library by Daniel Stenberg and others, and with uriparser, a library for parsing and manipulating URIs written by Weijia Song and Sebastian Pipping (to all of whom thanks); I use the latter to absolutize relative references in entity declarations and the former to dereference the resulting absolute URIs.

A couple of interesting problems arise.

Relative references in parameter-entity declarations

When a parameter entity declaration uses a relative reference in its system identifier, you need to resolve that relative reference against the base URI. Section 5 of RFC 3986 is clear that the base URI should come from the URI of the DTD file in which the relative reference is used. So if the DTD http://example.com/2013/doc.dtd contains a parameter entity declaration of the form

<!ENTITY % chapters SYSTEM "chapters.dtd">

then the relative reference chapters.dtd is resolved against the base URI http://example.com/2013/doc.dtd to produce the absolutized form http://example.com/2013/chapters.dtd. This is true even if the reference to the parameter entity chapters occurs not in doc.dtd but in some other file: the logic is, as I understand it, that the relative reference is actually used when the parameter entity is declared, not when it is referenced, and the base URI comes from the place where the relative reference is used. Of course, in many or most cases the declaration and the reference to an external parameter entity will occur in the same resource.

I should qualify my statement; this is what I believe to be correct and natural practice, and what I believe to be implied by RFC 3986. I have occasionally encountered software which behaved differently; I have put it down to bugs, but it may mean that some developers of XML software interpret the base-URI rules of RFC 3986 differently. And one could also argue that use is not the issue; the base URI to be used is the URI of the resource within which the relative reference physically occurs; in this case it amounts to the same thing.

I’m not sure, however, what ought to happen if we add a level of indirection. Suppose …

DTD file http://example.com/a.dtd contains the declaration <!ENTITY % chapdecl '<!ENTITY % chapters SYSTEM "chapters.dtd">'> (not a declaration of the parameter entity chapters, but the declaration of a parameter entity containing the declaration of that parameter entity).
DTD file http://example.com/extras/b.dtd contains a parameter entity reference to %chapdecl; (and thus, logically, it is this DTD file that contains the actual declaration of chapters and the actual use of the relative reference).
DTD http://example.com/others/c.dtd contains a reference to %chapters;.

Which DTD file should provide the base URI for resolving the relative reference? I think the declaration/reference logic rules out the third candidate. If we say that we should take the base URI from the entity in which the relative reference was used, and that the relative reference is used when the parameter entity chapters is declared, then the second choice (b.dtd) seems to have the strongest claim. If we say that we should take the base URI from the entity in which the relative reference appears, and that the relative reference here clearly appears in the declaration of the parameter entity chapdecl, then the first choice (a.dtd) seems to have the strongest claim.

I am sometimes tempted by the one and sometimes by the other. The logic that argues for a.dtd has, however, a serious problem: the declaration of chapdecl might just as easily look like this: <!ENTITY % chapdecl '<!ENTITY % chapters SYSTEM "%fn_pt1;%fn_pt2;.%fn_ext;">'>, with the relative reference composed of references to three parameter entities each perhaps declared in a different resource with a different URI. Where does the relative reference “appear” in that case? So on the whole the second choice (b.dtd) seems the right one. But my code actually chooses a.dtd for the base URI in this case: each parameter entity carries a property that indicates what base URI to use for relative references that occur within the text of that parameter entity, and in the case of internal entities like chapdecl the base URI is inherited from the entity on top of the entity stack when the declaration is read. Here, the top of the stack is chapdecl, which as an internal entity inherits its base-URI-for-children property from a.dtd. Changing to use the base URI of the resource within which the entity declaration appears (to get b.dtd as the answer) would require adding new code to delay the calculation of the base URI: possible, but fiddly, easy to get wrong, and not my top priority. (I am grateful that in fact these cases never appear in the DTDs I care about, though if they did I might have intuitions about what behavior to regard as correct.)

HTTP_REFERER

A similar complication arises when we wish to follow the advice of some commentators on the W3C system team’s blog post on excessive DTD traffic and provide an HTTP_REFERER value that indicates the source of the URI from which we are retrieving a DTD. In the case given above, which URI file should be listed as the source of the reference? Is it a.dtd, b.dtd, or c.dtd?

It may be that answers to this are laid out in a spec somewhere. But … where?

2 thoughts on “URI casuistry”

Jacek Kopecky on 5 March 2013 at 13:22 said:

Hi, maybe it could be viewed as a data typing issue. If I understand it correctly (and I don’t have much confidence in my understanding of DTDs), a.dtd declares basically a string (a data type that doesn’t reference and needs no base), which is put in place in b.dtd and parsed there – that’s when it becomes (is first interpreted as) a URI reference. Therefore it’s b.dtd that does the referencing and should provide both the base and the referer.
John Cowan on 5 March 2013 at 14:12 said:

I suppose it’s this sort of thing that induces people to treat URIs as something other than just strings. In your final example the URI is constituted from a string, but we don’t know exactly when that happens. What about this simpler case: let a1.dtd define prefix as “foo”, a2.dtd define %suffix as “bar”, and b.dtd define chapters as “%prefix;.%suffix”? It seems clear that neither a1 nor a2 is really defining a URI, but we don’t actually know that b defines a URI either until it’s used as one.

I tried to think about how cpp does it, but #include does not allow token-pasting: you can write “#include foo” where foo is a defined token, but you can’t write “#include foo##”.”##bar”.

Comments are closed.