Recursive descent parsing in XQuery (and other functional languages)

[7 January 2013; typos in code patterns corrected 8 January and 21 June 2013]

Everyone who designs or builds systems to do interesting work occasionally needs to deal with input in some specialized notation or other. Nowadays a lot of specialized information is in XML, and in that case the need is to deal with vocabularies designed for the particular kind of specialized information involved. But sometimes specialized data comes in its own non-XML syntax — ISBNs, URIs, expressions in symbolic logic, chess-game notation, textual user interface languages, query languages, and so on. Even in the XML universe there are plenty of notations for structured information of various kinds that are not XML: DTDs, XPath expressions, CSS, XQuery.

In these cases, if you want to do anything useful that depends on understanding the structure of the data, you’re going to need to write a parser for its notation. For some simple cases like ISBNs and ISSNs, or URIs, you can get by with regular expressions. (At least, more or less: if you want to validate the check digit in an ISBN or ISSN, regular expressions are going to have a hard time getting the job done, though oddly enough a finite state automaton would have no particular trouble with it.) But many interesting notations are context-free languages, which means regular expressions don’t suffice to distinguish well-formed expressions from other strings of characters, or to identify the internal structure of expressions.

Now, if you’re writing in C and you run into this problem, you can easily use yacc and lex to generate a parser for your language (assuming it satisfies the requirements of yacc). If you’re writing in Java, there are several parser-generator tools to choose from. If you’re writing in a less widely used language, you may find a parser generator, or you may not.

It’s handy, in this situation, to be able to write your own parsers from scratch.

By far the simplest method for hand-written parsing is the one known as recursive descent. For each non-terminal symbol in the grammar, there is a routine whose job it is to read strings in the input which represent that non-terminal. The structure of the routine follows the structure of the grammar rules for that non-terminal in a simple way, which makes recursive-descent parsers feel close to the structure of the information and also to the structure of the parsing process. (The parser generated by yacc, on the other hand, remains a completely opaque black box to me, even after years of using it.)

In his book Compiler Construction (Harlow, England: Addison-Wesley, 1996, tr. from Grundlagen und Techniken des Compilerbaus [Bonn: Addison-Wesley, 1996]), the Swiss computer scientist Niklaus Wirth summarizes the rules for formulating a recursive-descent parser in a table occupying less than half a page. For each construct in the EBNF notation for grammars, Wirth shows a corresponding construct in an imperative programming language (Oberon), so before I show the table I should perhaps review the EBNF notation. In any formal notation for grammars, a grammar is made up of a sequence of grammar productions, and each production (in the case of context-free grammars) consists of a single non-terminal symbol on the left-hand side of the rule, and an expression on the right-hand side of the rule which represents the possible realizations of the non-terminal. The right-hand side expression is made up of non-terminal symbols and terminal symbols (e.g. quoted strings), arranged in sequences, separated as need be by choice operators (for choice, the or-bar | is used), with parentheses, square brackets (which mark optional material), and braces (which mark material that can occur zero or more times).

Wirth’s EBNF for EBNF will serve to illustrate the syntax:

syntax = {production}.
production = identifier “=” expression “.”.
expression = term {“|” term}.
term = factor {factor}.
factor = identifier | string | “(” expression “)” | “[” expression “]” | “{” expression “}”.
identifier = letter {letter | digit}.
string = “”” {character} “””.
letter = “A” | … | “Z”.
digit = “0” | … | “9”.

(It may be worth noting that this formulation achieves its simplicity in part by hand-waving: it doesn’t say anything about whitespace separating identifiers, and the definition of string is not one a machine can be expected to read and understand. But Wirth isn’t writing this grammar for a machine, but for human readers.)

It’s easy to see that the routines in a recursive-descent parser for a grammar in this notation must deal with six constructs on the right-hand side of rules: strings, parenthesized expressions (three kinds), sequences of expressions, and choices between expressions. Wirth summarizes the necessary code in this table with the construct K on the left, and the program for it, Pr(K), on the right. In the code fragments, sym is a global variable representing the symbol most recently read from the input stream, and next is the routine responsible for reading the input stream and updating sym. The meta-expression first(K) denotes the set of symbols which can possibly occur as the first symbol of a string derived from construct K.

k Pr(k)
“x” IF sym = “x” THEN next ELSE error END
(exp) Pr(exp)
[exp] IF sym IN first(exp) THEN Pr(exp) END
{exp} WHILE sym IN first(exp) DO Pr(exp) END
fac0 fac1facn Pr(fac0); Pr(fac1); … ; Pr(facn)
term0 | term1 | … | termn CASE sym of
   first(term0) : Pr(term0)
| first(term1) : Pr(term1)
| …
| first(termn) : Pr(termn)
END

This is easy enough to express in any language that has global variables and assignment statements. But what do we do when we are writing an application in a functional language, like XQuery or XSLT, and need to parse sentences in some context-free language? No assignment statements, and all functions have the property that if you call them several times with the same argument you will always get the same results back. [Addendum, 8 January 2013: XQuery and XSLT users do in fact have access to useful parser generators: see the pointers to Gunther Rademacher’s REx and Dmitre Novatchev’s FXSL provided in the comments. The question does, however, still arise for those who want to write recursive-descent parsers along the lines sketched by Wirth, which is where this post is trying to go.]

I’ve wondered about this for some time (my notes show I was worrying about it a year ago today), and the other day a simple solution occurred to me: each of the functions in a recursive descent parser depends on the state of the input, so in a functional language the state of the input has to passed to the function as an argument. And each function changes the state of the input (by advancing the input pointer), which in a functional language we can represent by having each function return the new state of the input and the new current symbol as its result.

A new table, analogous to Wirth’s, but with XQuery code patterns on the right hand side, looks like this. Here, the common assumption is that each function is passed a parameter named $E0 whose value is an env variable, with two children: sym contains the current symbol and input contains the remainder of the input (which for simplicity I’m going to assume is a string). If an error condition arises, an error element is added to the environment. The job of reading the next token is handled by the function next().

k Pr(k, $E0)
“x” if ($E0/sym = “x”)
  then next($E0)
  else <env>
    <error>expected “x” but did not find it</error>
    {$E0/*}
  </env>
(exp) Pr(exp, $E0)
[exp] if ($E0/sym = first(exp)) then
  Pr(exp, $E0)
else
  $E0
{exp} This requires two translations. For each such sequence exp, we declare a function:

declare function seq_exp(
  $E0 as element(env)
) as element(env) {
  if ($E0/sym = first(exp)) then
    let $E1 := Pr(exp),
        $E2 := seq_exp($E1)
    return $E2
  else
    $E0
};
Inline, we just call that function:

seq_exp($E0)
fac0 fac1facn let $E1 := Pr(fac0, $E0),
    $E2 := Pr(fac1, $E1),
    … ,
    $En + 1 := Pr(facn, $En)
return $En + 1
term0 | term1 | … | termn if ($E0/sym = first(term0)) then
  Pr(term0)
else if ($E0/sym = first(term1)) then
  Pr(term1)

else if ($E0/sym = first(termn)) then
  Pr(termn)

Like Wirth’s, the code shown here produces a recognizer that doesn’t do anything with the input except read it and accept or reject it; like Wirth’s, it can easily be extended to do things with the input. (My first instinct, of course, is to return an XML representation of the input string’s abstract syntax tree, so I can process it further using other XML tools.)

Giants and the KISS principle

[12 January 2011]

From time to time, my evil twin Enrique runs across passages in literature that seem to him to provide useful illustrations of important principles in information management. I don’t know, maybe he’s saving them up for the next time he teaches a class or something.

The other day, he came by and pointed me to a passage in J.K. Rowling’s Harry Potter and the Order of the Phoenix ([New York]: Scholastic; London: Bloomsbury, 2003):

“In any case, giants … — overload ’em with information an’ they’ll kill yeh jus’ to simplify things.”

“That’s nice,” I said. “The KISS principle in a nutshell.”

“What I want to know.” Enrique said, “is: Where were the giants when they were needed in the [WG name suppressed] working group?”

Mystery and mathematics

[2 February 2010]

Lately I’ve been spending some time of an evening reading E. T. Bell’s classic collection of biographies of mathematicians: Men of Mathematics (1937; rpt. New York: Simon and Schuster, 1985). (As the title suggests, Bell is thoroughly pre-feminist; a collection called Women of Mathematics has been published more recently, which may be intended as a kind of pendant to Bell, though sadly it’s nowhere like as much fun to read; its authors are for the most part better historians and much less opinionated.) And something rather puzzling caught my eye the other evening.

Bell writes (p. 357 of the Simon and Schuster paperback reprint) in his sketch of the Irish mathematician William Rowan Hamilton:

So long as there is a shred of mystery attached to any concept that concept is not mathematical.

Granted, Bell was probably tired of having to tell generations of incredulous undergraduates that there is nothing especially imaginary about imaginary numbers; one might grow peevish on the subject over the years. But still: did Bell seriously believe that the natural numbers (say) are not mysterious?

Can anyone not dead to all sense of wonder contemplate the commutative property of integer addition without feeling themselves to be in the presence of mystery?

Bare-bones TEI

[6 January 2010]

The markup vocabulary defined by the Text Encoding Initiative is, for good and sound reasons, rather large and in some ways rather complicated. From time to time, however, it’s useful to have a radically cut-down version of the TEI vocabulary. For people just learning TEI markup (to name just one instance), a cut-down version can simplify initial training and make the TEI feel a little less intimidating.

Many people use TEI Lite, which is much smaller than full TEI. But for some work I’m doing with Yves Marcoux and Claus Huitfeldt, we felt it would be handy to have an even smaller subset of TEI to work with. Years ago (1994), I defined a profile of TEI called ‘Bare Bones TEI’, intended not so much for serious use as for training and for thought experiments.

I had long regarded the full details of bare-bones TEI as lost to history: the documentation has been preserved by Robin Cover at the XML Cover Pages, but it didn’t include the DTD modification file showing the precise changes from the then-current TEI DTD. I had tried a few times, searching both the Web and my hard disk, to find the original data, but had had no luck. But the other day, for reasons I don’t think I can explain, an attempt at one more search eventually found an old copy of the documentation, the modification files, and the full DTD for bare-bones TEI.

I’ve now translated the original documentation to XML, added updated links to the various DTD files, and added parameter entities to the DTD to make it work both for SGML contexts and for XML. The documentation for Bare bones TEI is now available on this site, as is the modified DTD.

The current version of the bare-bones DTD is based on TEI P3, and the translation to XML loses a small amount of information involving the pb (page break) element. Eventually, perhaps, I’ll apply the bare-bones customization to TEI P5 and produce an updated version of the schema and documentation.

Note: in searching for Bare-bones TEI on the TEI Consortium site just now, I discovered that someone has produced a similar profile, called ‘Bare TEI’. [Further research shows that although the page on customizations does not identify the authors, the work was done originally by Laurent Romary and later edited by Syd Bauman, Lou Burnard, and possibly also Sebastian Rahtz.] This may be worth exploring, for those seeking a minimal profile of TEI. Unfortunately, I haven’t found any usable documentation for Bare TEI, so I don’t know the design principles that govern it, and the DTD is full of undefined ghost elements (or ‘zombies’), which renders it unfortunately cumbersome in a syntax-directed editor. So for now I’m sticking with bare-bones TEI.