</w3c:msm>

[30 January 2009]

Today is my last workday as a staff member at the World Wide Web Consortium.

I have learned a lot during my time here, I’ve enjoyed the work, and I have tried to help make the Web a better place for people who care about information and for the information we care about. But it’s ten years since I joined the staff, and it’s time to move on.

What’s next? I will be doing consulting and contract work through Black Mesa Technologies LLC. If you have interesting problems touching on documents, electronic representation of information (documents or other), validation, XSLT, XQuery, or the like; if you have concerns about the proper application of information technology to the preservation of commercial or cultural-heritage information, then give me a call; I’ll be around.

XPath 1, Enrique 0

[27 January 2009]

My evil twin Enrique came by the other evening, excited and full of himself. “I’ve just found a bug in Saxon!” he announced.

“Really?” I said. “That would be entertaining. Michael Kay keeps finding new bugs in the XSD 1.1 spec; it would be nice to pay him back. Still, I doubt very seriously that you’ve found a bug. What’s the story?”

“I’m working on a stylesheet to generate an SVG image showing a particular hierarchy of objects. And at one point, I have to know how many elements named object there are, descended from the object with name="X", including X itself, if X is a preceding sibling of the current element or one of its ancestors. At another point, I need the same number, but excluding X itself. So I wrote two XPath expressions, like this:

count(preceding::object[@name="X"]//object)
count(preceding::object[@name="X"]/descendant::object)

“I expected both to evaluate to 0, if X is not somewhere on our left, and to some pair of numbers n and n – 1, if it is.”

“I think I see what you’re trying to do,” I said. (And I did; I did something very similar myself, not long ago, working on a new type hierarchy diagram for XSD 1.1 Datatypes.) “And what did you get?”

He pulled out his laptop and ran a stylesheet, from which messages reported that the two expressions evaluated sometimes to 0 and 0, and sometimes to 11 and 11, respectively, but never to 12 and 11, which is what, by inspection, we established was what Enrique wanted.

At this point, dear reader, you may already know what Enrique’s mistake was. If so, I congratulate your perspicuity. I confess that I did not. If you share my uncertainty, it might be rewarding to pause now, before reading on, to figure out for yourself why Enrique was wrong.

“So now do you believe I’ve found a bug in Saxon?” asked Enrique.

“Well, no,” I said.

“What do you mean, no?!” he protested. “Object X has eleven descendants of type object. Right?”

“Right.”

“Plus one for self, so the count of objects descended from X by zero or more steps (i.e. including X itself) is twelve, right?”

“Right.”

“So “preceding::object [@name = "X"] / descendant-or-self :: object’ should be returning 12, not 11! Right?”

“Right.”

“You know that ‘//’ is short for descendant-or-self, right?”

“Right.”

[Well, wrong, actually. See below.]

“So it’s a bug! Why won’t you admit that I’ve found a bug in Saxon?”

“Well, let’s put it this way. I know Michael Kay. And I know you. And if your expectations disagree with his code — even if my expectations disagree with his code — well, I know where I’m putting my money. What does xsltproc say?”

Xsltproc also gave the same value to both expressions.

And both Saxon and xsltproc gave the expected answer of 12 for the expression

count(preceding::object[@name="X"]/descendant-or-self::object)

“A bug in both Saxon and xsltproc?” marveled Enrique. “That’s amazing, I must be brilliant!”

“A bug in both Saxon and xsltproc? That is incredible,” I corrected him. “As in: not believable. Michael can be wrong; that’s possible. Daniel Veillard can be wrong; that’s possible. Michael and DV both wrong, possible, but extremely unlikely. Michael and DV wrong, and you right? Slightly less likely than the spontaneous formation, by a set of atoms, of a working Infinite Improbability Drive.”

[Enrique doesn’t need to be told this, but some readers may need to be reminded that xsltproc is the command-line interface to libxslt, which is written by my friend and former W3C colleague Daniel Veillard, best known to friends as DV. And Michael Kay, of course, is the author of Saxon. If you don’t know what Saxon and libxslt are, dear reader, why on earth are you reading this posting?]

But what was the story?

Eventually, a certain amount of groveling through the prose of section 2.5 of the XPath 1.0 spec showed where Enrique and I had gone wrong. “//” is short for “descendant-or-self” only in a rough and ready way. In particular, “$X//object” is not equivalent to “$X/descendant-or-self::object”, which is clearly what Enrique (and I) had reckoned. Strictly speaking, what XPath says is:

// is short for /descendant-or-self::node()/.

So “$X // object” is equivalent not to “$X/ descendant-or-self :: object”, but to “$X/ descendant-or-self :: node()/ child::object” — or (confusingly, to me, and despite the note in the XPath spec which appears to say differently) to “$X / descendant :: object”. (The note is making a point about how predicates are evaluated, which doesn’t apply in Enrique’s case.)

Enrique was crestfallen; he had been sure that his technical credibility would rise sharply if he had found a bug in Saxon and libxslt. I, on the other hand, was relieved; I now knew how to fix the bug in the stylesheet that generates the SVG image of the XSD 1.1 type hierarchy.

Me, I’m going back to my old habit of just ignoring the abbreviated syntax and using the full syntax all the time: it’s less error prone, because it’s more explicit.

Simple proof that the URI grammar of RFC 3986 defines a regular language

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

setof(V,X^(refers(V,X);refers(X,V)),Vs),
setof(L-R,refers(L,R),Es),
vertices_edges_to_ugraph(Vs,Es,G),
transitive_closure(G,Gc),
member(LHS-RHS,Gc),
member(LHS,RHS).

There were no solutions; from this I infer that that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.

The line

setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line

vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...)

The line

transitive_closure(G,Gc),

takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines

member(LHS-RHS,Gc),
member(LHS,RHS).

seek a left-hand side LHS in Gc which has a edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answer the question. “Well, I just tried the construction, and it worked.“ He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

Baby, you should fix your site

[5 January 2009]

Maybe it was an hallucination, or just an after-effect of seeing the article on T. V. Raman in the Sunday New York Times yesterday. But my evil twin Enrique showed up this evening.

“I’ve written a song,” he said. “You have, have you?” I said, guardedly. “Want to hear it?” he asked. “I doubt it.” But there was no stopping him.

[Before you read this, you may need to be reminded that “WAI” (pronounced “way”) is the W3C’s Web Accessibility Initiative. And “section 508” refers to the section of the U.S. legal code that requires at least some Web sites to be accessible to the disabled. There are analogous requirements in many civilized countries, but the girl in this song appears to be an American.]

I asked a girl what she wanted to do,
And she said ‘Baby, surf the Web, just like you.
I got some software, it’s readin’ my screen,
but your damn page breaks the whole machine!

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.”

I told that girl that my pages were good.
And she said “Listen, babe, they don’t do what they should.
You know, there’s rules, section 508
Fix that page, if you want a date!

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.

[Instrumental.]

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.”

I asked that girl how to start right away/
And she said, “Baby, do Level A.
You won’t stop there, not if you are smart.
But Level A, baby, now that’s a start!

Chorus:
“Baby, you can make it right.
WAI can help you fix your site!
Baby, if you see the light,
Then, baby, I’ll love you!”

I hate to encourage Enrique in his song-writing career, and especially I hate to encourage further songs using Beatles tunes (if you didn’t recognize “Baby, you can drive my car” here, then you are a lot younger than I am), but well, you know, that girl has got a point.

If your site is not accessible, you really should fix it.

My management asks me to point out that we do not warrant that making your Web site more accessible will actually get you more dates. But it’s still the right thing to do.

Spam Karma 2, again

[2 January 2009]

As noted earlier, I hesitated to try to install Spam Karma 2 in the new location of this blog, since I wasn’t sure it would work with WordPress 2.6.5.

But after a couple of weeks with only Bad Behavior minding the fort, I became desperate. (BB does filter out a lot — my only problem was that almost everything it let through for me to moderate was in fact spam. I have better things to do than get ten or twenty messages in a day asking me to moderate comment spam.)

So I’ve installed Spam Karma 2 again, and so far it appears to work well with WordPress 2.6.5.

And my life is a bit quieter. (Actually, it would be quieter even without SK2, since the day with twenty comment spams appears to have been a fluke; SK2 hasn’t been catching twenty a day since its installation. But the next time the bots find me, SK2 will be in place.)