[4 March 2013]
I’ve been working lately on improving a DTD parser I wrote some time ago.
It’s been instructive to work with libcurl, the well-known URL library by Daniel Stenberg and others, and with uriparser, a library for parsing and manipulating URIs written by Weijia Song and Sebastian Pipping (to all of whom thanks). I use the latter to absolutize relative references in entity declarations and the former to dereference the resulting absolute URIs.
A couple of interesting problems arise.
Relative references in parameter-entity declarations
When a parameter entity declaration uses a relative reference in its system identifier, you need to resolve that relative reference against the base URI. Section 5 of RFC 3986 is clear that the base URI should come from the URI of the DTD file in which the relative reference is used. So if the DTD http://example.com/2013/doc.dtd contains a parameter entity declaration of the form
<!ENTITY % chapters SYSTEM "chapters.dtd">
then the relative reference chapters.dtd is resolved against the base URI http://example.com/2013/doc.dtd to produce the absolutized form http://example.com/2013/chapters.dtd. This is true even if the reference to the parameter entity chapters occurs not in doc.dtd but in some other file: the logic is, as I understand it, that the relative reference is actually used when the parameter entity is declared, not when it is referenced, and the base URI comes from the place where the relative reference is used. Of course, in many or most cases the declaration and the reference to an external parameter entity will occur in the same resource.
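As a minimal sketch of this resolution step, the example can be reproduced with Python's standard urllib.parse.urljoin, which implements RFC 3986 reference resolution. Python stands in here for uriparser purely for illustration; the variable names are assumptions, not names from the parser discussed in this post.

```python
from urllib.parse import urljoin

# Base URI: the DTD in which the declaration (and so the relative reference) occurs.
base_uri = "http://example.com/2013/doc.dtd"

# System identifier taken from <!ENTITY % chapters SYSTEM "chapters.dtd">
system_id = "chapters.dtd"

# Resolve the relative reference against the base URI.
absolute = urljoin(base_uri, system_id)
print(absolute)  # http://example.com/2013/chapters.dtd
```

The result depends only on where the declaration was read, not on where %chapters; is later referenced.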
I should qualify my statement; this is what I believe to be correct and natural practice, and what I believe to be implied by RFC 3986. I have occasionally encountered software which behaved differently; I have put it down to bugs, but it may mean that some developers of XML software interpret the base-URI rules of RFC 3986 differently. And one could also argue that use is not the issue; the base URI to be used is the URI of the resource within which the relative reference physically occurs; in this case it amounts to the same thing.
I’m not sure, however, what ought to happen if we add a level of indirection. Suppose …
- DTD file http://example.com/a.dtd contains the declaration <!ENTITY % chapdecl '<!ENTITY % chapters SYSTEM "chapters.dtd">'> (not a declaration of the parameter entity chapters, but the declaration of a parameter entity containing the declaration of that parameter entity).
- DTD file http://example.com/extras/b.dtd contains a parameter entity reference to %chapdecl; (and thus, logically, it is this DTD file that contains the actual declaration of chapters and the actual use of the relative reference).
- DTD http://example.com/others/c.dtd contains a reference to %chapters;.
Which DTD file should provide the base URI for resolving the relative reference? I think the declaration/reference logic rules out the third candidate. If we say that we should take the base URI from the entity in which the relative reference was used, and that the relative reference is used when the parameter entity chapters is declared, then the second choice (b.dtd) seems to have the strongest claim. If we say that we should take the base URI from the entity in which the relative reference appears, and that the relative reference here clearly appears in the declaration of the parameter entity chapdecl, then the first choice (a.dtd) seems to have the strongest claim.
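To make the stakes concrete, here is a small sketch that resolves chapters.dtd against each of the three candidate base URIs. It uses Python's urllib.parse.urljoin rather than uriparser, and the dictionary names are illustrative assumptions.

```python
from urllib.parse import urljoin

# The three DTD files from the scenario, each a candidate base URI.
candidates = {
    "a.dtd": "http://example.com/a.dtd",
    "b.dtd": "http://example.com/extras/b.dtd",
    "c.dtd": "http://example.com/others/c.dtd",
}

# Resolve the same relative reference against each candidate.
resolved = {name: urljoin(base, "chapters.dtd") for name, base in candidates.items()}

for name, uri in resolved.items():
    print(f"base from {name}: {uri}")
```

Each candidate yields a different absolute URI (in the site root, in extras/, or in others/), so the choice determines which resource is actually fetched.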
I am sometimes tempted by the one and sometimes by the other. The logic that argues for a.dtd has, however, a serious problem: the declaration of chapdecl might just as easily look like this: <!ENTITY % chapdecl '<!ENTITY % chapters SYSTEM "%fn_pt1;%fn_pt2;.%fn_ext;">'>, with the relative reference composed of references to three parameter entities each perhaps declared in a different resource with a different URI. Where does the relative reference “appear” in that case? So on the whole the second choice (b.dtd) seems the right one. But my code actually chooses a.dtd for the base URI in this case: each parameter entity carries a property that indicates what base URI to use for relative references that occur within the text of that parameter entity, and in the case of internal entities like chapdecl the base URI is inherited from the entity on top of the entity stack when the declaration is read. Here, the top of the stack is chapdecl, which as an internal entity inherits its base-URI-for-children property from a.dtd. Changing to use the base URI of the resource within which the entity declaration appears (to get b.dtd as the answer) would require adding new code to delay the calculation of the base URI: possible, but fiddly, easy to get wrong, and not my top priority. (I am grateful that in fact these cases never appear in the DTDs I care about, though if they did I might have intuitions about what behavior to regard as correct.)
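The inheritance rule just described can be modelled in a few lines. This is only a sketch under stated assumptions: the Entity class, both helper functions, and the list used as an entity stack are hypothetical and are not taken from the actual parser. The second helper is one conceivable way to arrive at b.dtd, and it glosses over the delayed calculation mentioned above.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass
class Entity:
    name: str
    external: bool
    base_uri: str  # base URI for relative references inside this entity's text


def inherited_base(stack):
    """Rule described above: take the base URI from the entity on top of the stack."""
    return stack[-1].base_uri


def nearest_external_base(stack):
    """Hypothetical alternative: skip internal entities, use the innermost external one."""
    return next(e.base_uri for e in reversed(stack) if e.external)


a = Entity("a.dtd", True, "http://example.com/a.dtd")
b = Entity("b.dtd", True, "http://example.com/extras/b.dtd")

# chapdecl is declared while a.dtd is on top of the stack, so it inherits a.dtd's base URI.
chapdecl = Entity("chapdecl", False, inherited_base([a]))

# b.dtd references %chapdecl;, so while the declaration of chapters is read
# the stack holds b.dtd with chapdecl on top of it.
stack = [b, chapdecl]

as_described = urljoin(inherited_base(stack), "chapters.dtd")
alternative = urljoin(nearest_external_base(stack), "chapters.dtd")
print(as_described)  # http://example.com/chapters.dtd
print(alternative)   # http://example.com/extras/chapters.dtd
```

In this model the only difference between the two answers is whether internal entities on the stack contribute a base URI of their own.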
HTTP_REFERER
A similar complication arises when we wish to follow the advice of some commentators on the W3C system team’s blog post on excessive DTD traffic and provide an HTTP_REFERER value that indicates the source of the URI from which we are retrieving a DTD. In the case given above, which DTD file’s URI should be listed as the source of the reference? Is it a.dtd, b.dtd, or c.dtd?
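Whichever answer is chosen, supplying the value is mechanically simple. The sketch below uses Python's standard urllib.request to prepare a request carrying the header, which HTTP spells Referer (HTTP_REFERER is the name under which CGI scripts see it). The function name and the particular URIs are assumptions for illustration, and a.dtd is used as the referring URI only as an arbitrary choice among the three candidates.

```python
from urllib.request import Request


def build_dtd_request(dtd_uri, referring_uri):
    """Prepare (but do not send) a request for a DTD with a Referer header."""
    return Request(dtd_uri, headers={"Referer": referring_uri})


# Arbitrary illustration: treat a.dtd as the source of the reference.
req = build_dtd_request(
    "http://example.com/chapters.dtd",
    "http://example.com/a.dtd",
)
print(req.get_header("Referer"))
```

Passing the prepared request to urllib.request.urlopen would perform the retrieval; the open question of which URI to name as the source is left to the caller.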
It may be that answers to this are laid out in a spec somewhere. But … where?