[16 January 2009]
A while back, ambulance a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, youth health or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.
If the language is regular, troche of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:
But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if itâ€™s regular. Is that language regular? I havenâ€™t looked at it for a while, so Iâ€™m not really sure. At the very least, itâ€™s not obvious that itâ€™s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.
My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.
First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact
refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the
refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).
But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.
There were no solutions; from this I infer that that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.
makes a set
Vs of all terms in either argument position for
refers. This is the set of vertices in the non-terminal reachability graph.
similarly makes a set of expressions of the form
L-R for terms linked by the
refers relation. For example, since the ABNF in the RFC includes the rule
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
Es (edges) will contain
'URI'-fragment. Prolog quotes
URI here to avoid having it taken as a variable.
uses an SWI utility to turn the lists
Es of vertices and edges into an unweighted graph
G, in the form used by the
ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.
G has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.
In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form
X = ... Y ...)
takes the transitive closure of graph
G and assigns it to variable
Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.
The final lines
seek a left-hand side LHS in Gc which has a edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.
Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.
It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:
Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.
A language is regular, if its syntax can be expressed by a single EBNF expression.
(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)
In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.
So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.