URI casuistry

[4 March 2013]

I’ve been working lately on improving a DTD parser I wrote some time ago.

It’s been instructive to work with libcurl, the well known URL library by Daniel Stenberg and others, and with uriparser, a library for parsing and manipulating URIs written by Weijia Song and Sebastian Pipping (to all of whom thanks); I use the latter to absolutize relative references in entity declarations and the former to dereference the resulting absolute URIs.

A couple of interesting problems arise.

Relative references in parameter-entity declarations

When a parameter entity declaration uses a relative reference in its system identifier, you need to resolve that relative reference against the base URI. Section 5 of RFC 3986 is clear that the base URI should come from the URI of the DTD file in which the relative reference is used. So if the DTD http://example.com/2013/doc.dtd contains a parameter entity declaration of the form

<!ENTITY % chapters SYSTEM "chapters.dtd">

then the relative reference chapters.dtd is resolved against the base URI http://example.com/2013/doc.dtd to produce the absolutized form http://example.com/2013/chapters.dtd. This is true even if the reference to the parameter entity chapters occurs not in doc.dtd but in some other file: the logic is, as I understand it, that the relative reference is actually used when the parameter entity is declared, not when it is referenced, and the base URI comes from the place where the relative reference is used. Of course, in many or most cases the declaration and the reference to an external parameter entity will occur in the same resource.

I should qualify my statement; this is what I believe to be correct and natural practice, and what I believe to be implied by RFC 3986. I have occasionally encountered software which behaved differently; I have put it down to bugs, but it may mean that some developers of XML software interpret the base-URI rules of RFC 3986 differently. And one could also argue that use is not the issue; the base URI to be used is the URI of the resource within which the relative reference physically occurs; in this case it amounts to the same thing.

I’m not sure, however, what ought to happen if we add a level of indirection. Suppose …

  • DTD file http://example.com/a.dtd contains the declaration <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "chapters.dtd">'> (not a declaration of the parameter entity chapters, but the declaration of a parameter entity containing the declaration of that parameter entity).
  • DTD file http://example.com/extras/b.dtd contains a parameter entity reference to %chapdecl; (and thus, logically, it is this DTD file that contains the actual declaration of chapters and the actual use of the relative reference).
  • DTD http://example.com/others/c.dtd contains a reference to %chapters;.

Which DTD file should provide the base URI for resolving the relative reference? I think the declaration/reference logic rules out the third candidate. If we say that we should take the base URI from the entity in which the relative reference was used, and that the relative reference is used when the parameter entity chapters is declared, then the second choice (b.dtd) seems to have the strongest claim. If we say that we should take the base URI from the entity in which the relative reference appears, and that the relative reference here clearly appears in the declaration of the parameter entity chapdecl, then the first choice (a.dtd) seems to have the strongest claim.

I am sometimes tempted by the one and sometimes by the other. The logic that argues for a.dtd has, however, a serious problem: the declaration of chapdecl might just as easily look like this: <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "%fn_pt1;%fn_pt2;.%fn_ext;">'>, with the relative reference composed of references to three parameter entities each perhaps declared in a different resource with a different URI. Where does the relative reference “appear” in that case? So on the whole the second choice (b.dtd) seems the right one. But my code actually chooses a.dtd for the base URI in this case: each parameter entity carries a property that indicates what base URI to use for relative references that occur within the text of that parameter entity, and in the case of internal entities like chapdecl the base URI is inherited from the entity on top of the entity stack when the declaration is read. Here, the top of the stack is chapdecl, which as an internal entity inherits its base-URI-for-children property from a.dtd. Changing to use the base URI of the resource within which the entity declaration appears (to get b.dtd as the answer) would require adding new code to delay the calculation of the base URI: possible, but fiddly, easy to get wrong, and not my top priority. (I am grateful that in fact these cases never appear in the DTDs I care about, though if they did I might have intuitions about what behavior to regard as correct.)

HTTP_REFERER

A similar complication arises when we wish to follow the advice of some commentators on the W3C system team’s blog post on excessive DTD traffic and provide an HTTP_REFERER value that indicates the source of the URI from which we are retrieving a DTD. In the case given above, which URI file should be listed as the source of the reference? Is it a.dtd, b.dtd, or c.dtd?

It may be that answers to this are laid out in a spec somewhere. But … where?

Posted in XML

Sequence as source of identity

[19 February 2013]

For reasons not worth going into (it’s a long story) I was once involved in a discussion of how to tell, given a sequence of objects of a given kind and two descriptions (or variable references) denoting items X and Y in the sequence, whether X and Y are identical or distinct. (This topic came up during a discussion of whether we needed clearer notions of identity conditions for this class of objects.)

One of my interlocutors suggested that we did not need to specify any identity conditions at all. Since X and Y were presented to us in a sequence, we could always tell whether X and Y are the same or different by looking at their position in the sequence. If X and Y are both at the same position in the sequence, then X must be identical to Y, and if X and Y are at different positions in the sequence, then X and Y must be distinct from each other.

Now, the first claim (if X and Y are at the same position in the sequence, then X = Y) is obviously true. And the second (if X and Y are at different positions in the sequence, then X ≠ Y) is manifestly false, though sadly my interlocutor never stood still long enough for me to point this out. Let X be “the Fibonacci number F1” and Y be “the Fibonacci number F2“. These occur at different positions (1 and 2) in the Fibonacci series, but both expressions denote the integer 1. So we cannot safely infer, from the fact that X and Y identify things at different positions in a sequence, that X and Y identify different things.

As I discovered the other day, Frege turns out to have a word to say on this topic, too. I wish I had had this quotation ready to hand during that discussion.

… die Stelle in der Reihe kann nicht der Grund des Unterscheidens der Gegenstände sein, weil diese schon irgendworan unterschieden sein müssen, um in einer Reihe geordnet werden zu können.

… their positions in the series cannot be the basis on which we distinguish the objects, since they must already have been distinguished somehow or other, for us to have been able to arrange them in a series.

(Gottlob Frege, Die Grundlagen der Arithmetic, 1884; rpt Stuttgart: Reclam, 1987; tr. by J. L. Austin as The Foundations of arithmetic 1950, rpt. Evanston, Illinois: Northwestern University Press, 1980, § 42.)

Recursive descent parsing in XQuery (and other functional languages)

[7 January 2013; typos in code patterns corrected 8 January and 21 June 2013]

Everyone who designs or builds systems to do interesting work occasionally needs to deal with input in some specialized notation or other. Nowadays a lot of specialized information is in XML, and in that case the need is to deal with vocabularies designed for the particular kind of specialized information involved. But sometimes specialized data comes in its own non-XML syntax — ISBNs, URIs, expressions in symbolic logic, chess-game notation, textual user interface languages, query languages, and so on. Even in the XML universe there are plenty of notations for structured information of various kinds that are not XML: DTDs, XPath expressions, CSS, XQuery.

In these cases, if you want to do anything useful that depends on understanding the structure of the data, you’re going to need to write a parser for its notation. For some simple cases like ISBNs and ISSNs, or URIs, you can get by with regular expressions. (At least, more or less: if you want to validate the check digit in an ISBN or ISSN, regular expressions are going to have a hard time getting the job done, though oddly enough a finite state automaton would have no particular trouble with it.) But many interesting notations are context-free languages, which means regular expressions don’t suffice to distinguish well-formed expressions from other strings of characters, or to identify the internal structure of expressions.

Now, if you’re writing in C and you run into this problem, you can easily use yacc and lex to generate a parser for your language (assuming it satisfies the requirements of yacc). If you’re writing in Java, there are several parser-generator tools to choose from. If you’re writing in a less widely used language, you may find a parser generator, or you may not.

It’s handy, in this situation, to be able to write your own parsers from scratch.

By far the simplest method for hand-written parsing is the one known as recursive descent. For each non-terminal symbol in the grammar, there is a routine whose job it is to read strings in the input which represent that non-terminal. The structure of the routine follows the structure of the grammar rules for that non-terminal in a simple way, which makes recursive-descent parsers feel close to the structure of the information and also to the structure of the parsing process. (The parser generated by yacc, on the other hand, remains a completely opaque black box to me, even after years of using it.)

In his book Compiler Construction (Harlow, England: Addison-Wesley, 1996, tr. from Grundlagen und Techniken des Compilerbaus [Bonn: Addison-Wesley, 1996]), the Swiss computer scientist Niklaus Wirth summarizes the rules for formulating a recursive-descent parser in a table occupying less than half a page. For each construct in the EBNF notation for grammars, Wirth shows a corresponding construct in an imperative programming language (Oberon), so before I show the table I should perhaps review the EBNF notation. In any formal notation for grammars, a grammar is made up of a sequence of grammar productions, and each production (in the case of context-free grammars) consists of a single non-terminal symbol on the left-hand side of the rule, and an expression on the right-hand side of the rule which represents the possible realizations of the non-terminal. The right-hand side expression is made up of non-terminal symbols and terminal symbols (e.g. quoted strings), arranged in sequences, separated as need be by choice operators (for choice, the or-bar | is used), with parentheses, square brackets (which mark optional material), and braces (which mark material that can occur zero or more times).

Wirth’s EBNF for EBNF will serve to illustrate the syntax:

syntax = {production}.
production = identifier “=” expression “.”.
expression = term {“|” term}.
term = factor {factor}.
factor = identifier | string | “(” expression “)” | “[” expression “]” | “{” expression “}”.
identifier = letter {letter | digit}.
string = “”” {character} “””.
letter = “A” | … | “Z”.
digit = “0” | … | “9”.

(It may be worth noting that this formulation achieves its simplicity in part by hand-waving: it doesn’t say anything about whitespace separating identifiers, and the definition of string is not one a machine can be expected to read and understand. But Wirth isn’t writing this grammar for a machine, but for human readers.)

It’s easy to see that the routines in a recursive-descent parser for a grammar in this notation must deal with six constructs on the right-hand side of rules: strings, parenthesized expressions (three kinds), sequences of expressions, and choices between expressions. Wirth summarizes the necessary code in this table with the construct K on the left, and the program for it, Pr(K), on the right. In the code fragments, sym is a global variable representing the symbol most recently read from the input stream, and next is the routine responsible for reading the input stream and updating sym. The meta-expression first(K) denotes the set of symbols which can possibly occur as the first symbol of a string derived from construct K.

k Pr(k)
“x” IF sym = “x” THEN next ELSE error END
(exp) Pr(exp)
[exp] IF sym IN first(exp) THEN Pr(exp) END
{exp} WHILE sym IN first(exp) DO Pr(exp) END
fac0 fac1facn Pr(fac0); Pr(fac1); … ; Pr(facn)
term0 | term1 | … | termn CASE sym of
   first(term0) : Pr(term0)
| first(term1) : Pr(term1)
| …
| first(termn) : Pr(termn)
END

This is easy enough to express in any language that has global variables and assignment statements. But what do we do when we are writing an application in a functional language, like XQuery or XSLT, and need to parse sentences in some context-free language? No assignment statements, and all functions have the property that if you call them several times with the same argument you will always get the same results back. [Addendum, 8 January 2013: XQuery and XSLT users do in fact have access to useful parser generators: see the pointers to Gunther Rademacher’s REx and Dmitre Novatchev’s FXSL provided in the comments. The question does, however, still arise for those who want to write recursive-descent parsers along the lines sketched by Wirth, which is where this post is trying to go.]

I’ve wondered about this for some time (my notes show I was worrying about it a year ago today), and the other day a simple solution occurred to me: each of the functions in a recursive descent parser depends on the state of the input, so in a functional language the state of the input has to passed to the function as an argument. And each function changes the state of the input (by advancing the input pointer), which in a functional language we can represent by having each function return the new state of the input and the new current symbol as its result.

A new table, analogous to Wirth’s, but with XQuery code patterns on the right hand side, looks like this. Here, the common assumption is that each function is passed a parameter named $E0 whose value is an env variable, with two children: sym contains the current symbol and input contains the remainder of the input (which for simplicity I’m going to assume is a string). If an error condition arises, an error element is added to the environment. The job of reading the next token is handled by the function next().

k Pr(k, $E0)
“x” if ($E0/sym = “x”)
  then next($E0)
  else <env>
    <error>expected “x” but did not find it</error>
    {$E0/*}
  </env>
(exp) Pr(exp, $E0)
[exp] if ($E0/sym = first(exp)) then
  Pr(exp, $E0)
else
  $E0
{exp} This requires two translations. For each such sequence exp, we declare a function:

declare function seq_exp(
  $E0 as element(env)
) as element(env) {
  if ($E0/sym = first(exp)) then
    let $E1 := Pr(exp),
        $E2 := seq_exp($E1)
    return $E2
  else
    $E0
};
Inline, we just call that function:

seq_exp($E0)
fac0 fac1facn let $E1 := Pr(fac0, $E0),
    $E2 := Pr(fac1, $E1),
    … ,
    $En + 1 := Pr(facn, $En)
return $En + 1
term0 | term1 | … | termn if ($E0/sym = first(term0)) then
  Pr(term0)
else if ($E0/sym = first(term1)) then
  Pr(term1)

else if ($E0/sym = first(termn)) then
  Pr(termn)

Like Wirth’s, the code shown here produces a recognizer that doesn’t do anything with the input except read it and accept or reject it; like Wirth’s, it can easily be extended to do things with the input. (My first instinct, of course, is to return an XML representation of the input string’s abstract syntax tree, so I can process it further using other XML tools.)

A new look / Death to spies

http://cmsmcq.com/mib/wp-content/uploads/2012/12/glass-bead-game.288×1000.a.4059919398_5791a65aa7_b.jpg
[6 December 2012]

This blog has a new look, courtesy of the crackers who broke in and rendered it necessary to delete the blog, reinstall WordPress from trusted media, re-configure things, change passwords, and re-import the posts and comments. The old theme wasn’t part of the new installation, and while I kind of liked it, I didn’t like it enough to spend any time trying to retrieve it. The old theme was the then-current default theme for WordPress blogs; the new theme is (again) the default for WordPress blogs.

[“Are you kidding me? You can’t even be bothered to change the _____ theme?” hissed my evil twin Enrique at this point. “I did look at alternatives. Really. I just happened to like the default pretty well. It’s not like I’m lazy.” “Not just that you’re lazy, I think you mean?” sighed Enrique. “Oh, hush,” I said.]

The new theme seems to want a header image; I am grateful to Flickr for providing search qualifiers that allow one to search only for photos licensed under Creative Commons and allowing commercial use. The image above is drawn from a photo published on Flickr under the name Glass Bead Game by the photographer Darren Kirby of Edmonton, Alberta, to whom thanks (and an acknowledgement in the footer).

[“Wow,” said Enrique. “His blog photo makes him look way too young to be a reader of Hermann Hesse. Has there been a Hesse resurgence while I wasn’t looking?” “Not sure, but I think I heard that in fact there has been. It’s not the 80s anymore, Enrique.” “Are you sure? The House majority does not agree with you. And haven’t you heard anything about higher education in the UK lately? Sure sounds like Mrs. Thatcher’s Britain to me!” “Oh, be quiet, and leave politics out of this.”]

On the minus side, the links to other blogs have been lost, at least for the moment (I do have a list; it’s just a matter of typing them in again). And of course, I’ve lost a few days’ work (and counting), cleaning up after the Viagra hawkers.

On the plus side, I now have nicer blog backup utilities and more convenient tools for intrusion detection (on the theory that making them more convenient is a good way to increase the likelihood of their being used regularly). It turns out that when a web site is just a checked-out working copy of a Subversion repository which gets updated automatically when things are checked into the repository using a commit hook (as this Web site is), then just logging in to the server and running svn status gives you a nice list of things that have been placed on the server by intruders coming in the back door. So for those who do manage their sites this way, and who have never managed to get around to installing tripwire or similar tools, a local shell script reading, in its entirety, ssh hostname 'for d in *.com ; do (echo; cd $d; pwd; svn status); done' may be able to serve as at least a poor, partial substitute.

Still, while it’s almost always nice to learn new things, there are other things I’d rather have spent the last few days on. I find I have new sympathy for the motto Death to spies.

Another XForms thought experiment: mixed-content editing

[23 July 2012]

The other day, I posted a kind of design challenge for XForms: what is the best way to provide an editing interface for tightly constrained, arbitrarily recursive structures like those found in arithmetic or logical expressions or in languages like XPath or programming languages?

Another topic comes up from time to time as a challenge for XForms interfaces: mixed content.

One view, which seems to be reasonably widely held, is that the best way to handle mixed content in XForms is to provide some sort of ‘rich-text’ editor like the tools provided by some libraries for editing HTML documents in the browser. (I put scare quotes around ‘rich text’ in that phrase because HTML-encoded text doesn’t seem particularly rich by some standards.) Many (all?) of the major XForms implementations have rich-text editors of one kind or another tha work this way.

I think such widgets work well for some applications. But they are almost never what I think I want when I think about handling mixed content in applications I’d like to build with XForms. I think I can distinguish two cases, or classes of cases.

The first case is similar to that handled by the existing rich-text widgets: allow the user to edit a paragraph or a series of paragraphs, and to mark phrases within the paragraphs with phrase-level markup. In this use case, I think I want several things (some of which may be achievable with current extension widgets, with a little work or a lot):

  • I’d like to be able to specify, for each individual instantiation of the control, which phrase-level elements are allowed. (E.g.: bold and italic here, italic-only over there.)
  • I’d like to be able to specify elements in vocabularies other than HTML. (E.g. TEI, TEI Lite, DocBook, JATS, …)
  • I’d like the editing widget to allow crystals as sub-structures. (By crystals I mean sub-elements with fixed internal structure; they may float in a sort of text soup, but they retain their own internal structure while doing so. In XHTML, lists [ordered, unordered, and definition-] are a simple example.)
  • If the editor automatically interprets hard returns as marking paragraph boundaries (as some do), I’d like to be able to specify what the paragraph element is called. And ideally I’d like to allow more than one such top-level element, so the user can use the widget to create a sequence of (for example) p, div, or ab elements.

In the second case (or class of cases), I want to be able to display mixed-content elements in a normal read-only way, with flowed text and font shifts and so on, and I want to be able to allow the user to edit selected aspects of the material, but not all. For example:

  • In some cases I’d want the user to be able to edit the key attribute on the person and place elements in the text, but I would like it to be impossible for the user to change the text of the paragraph.
  • In some cases I’d like to allow the user to change element types (changing a person element to place, or vice versa), and in other cases not.
  • Often I’d like to allow the user to select a contiguous sequence of characters in the paragraph and say that they should be tagged as person or place.
  • And then there is the scenario Henry Thompson once used to introduce me to the idea of padded-cell editors: we are preparing a corpus for linguistic research and we have a provisional segmentation of the text into sentences (or sentence-like objects). We want an interface that will allow a human being to review the segmentation and correct it. They should be able to open a document, split existing s elements, join adjacent s elements, save, or quit without saving.

The same questions apply here as for the earlier thought experiment:

1 Are there any really good ways to implement interfaces of this class today (in XForms 1.1, or with extensions in existing XForms implementations)?

2 What are the possible ways of doing this kind of thing (or: any of these kinds of things) today? (In the absence of a really good to do it, any technically feasible way is worth knowing about; it’s better than nothing.) Extra credit, as always, for sound analysis of the pros and cons and for pointers to examples to illustrate the techniques.

3 What changes to XForms might allow implementations of a future version of the spec to handle this class of problem (more) easily?