[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language *is* regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if itâ€™s regular. Is that language regular? I havenâ€™t looked at it for a while, so Iâ€™m not really sure. At the very least, itâ€™s not obvious that itâ€™s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact `refers(X,Y)`

is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the `refers`

relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

setof(V,X^(refers(V,X);refers(X,V)),Vs), setof(L-R,refers(L,R),Es), vertices_edges_to_ugraph(Vs,Es,G), transitive_closure(G,Gc), member(LHS-RHS,Gc), member(LHS,RHS).

There were no solutions; from this I infer that that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set `Vs`

of all terms in either argument position for `refers`

. This is the set of vertices in the non-terminal reachability graph.

The line

setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form `L-R`

for terms linked by the `refers`

relation. For example, since the ABNF in the RFC includes the rule

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set `Es`

(edges) will contain `'URI'-scheme`

, `'URI'-'hier-part'`

, `'URI'-query`

, and `'URI'-fragment`

. Prolog quotes `URI`

here to avoid having it taken as a variable.

The line

vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists `Vs`

and `Es`

of vertices and edges into an unweighted graph `G`

, in the form used by the `ugraph`

library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

`G`

has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form `X = ... Y ...`

)

The line

transitive_closure(G,Gc),

takes the transitive closure of graph `G`

and assigns it to variable `Gc`

. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines

member(LHS-RHS,Gc), member(LHS,RHS).

seek a left-hand side LHS in Gc which has a edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book

Eine Sprache ist

regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is

regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answer the question. “Well, I just tried the construction, and it worked.“ He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

Errrm… appendix B?

manwithaplan: good point.

But (a) appendix B of RFC 3986 can at most be read as a claim by the authors of the RFC that the grammar in appendix A is regular; it certainly doesn’t offer any proof. And (b) where’s the fun in taking someone else’s word for it?

Further work on the grammar of RFC 3986 now enables a more useful answer to manwithaplan’s comment above. The regex in appendix B of RFC 3986 does not recognize the same language as the grammar in appendix A. As far as I can tell, it is intended to recognize a superset, and does so (but I have not attempted to prove it; I’ve just generated a few hundred strings and failed to find any that are accepted by the grammar of appendix A and not by the regex of appendix B.