[22 March 2008]

My evil twin Enrique [formerly known as Skippy] dropped by again the other day.

“I’ve written the world’s smallest conforming XSD 1.0 implementation, #x201D; he announced.


“And it’s open source. Wanna see it? I call it Cheney. (Think of it as a kind of hommage.)”

I began to have a bad feeling about this. “Cheney?”

“Yeah. He da man, when it comes to fast schema validation.”

“Tell me about it.”

“Well, you know how proud the XML Schema working group is that the spec doesn’t require that XSD processors accept schema documents or input that’s actually in XML?”

“Er, some of the working group, yes.”

“And you know how carefully 1.0 was drafted so as to ensure that processors can be written for very specialized environments? For example, highly security-conscious ones. So there’s no required output format, no required API — nothing is required.”

“Ah, yes, right.”

“And you will recall, of course, how strongly some in the working group have resisted any suggestion in the spec that implementations be required to document anything.”

“I remember.”

“Because that would be telling people how to write their documentation.” He was smiling.

I refused to take that bait.

“And the working group is so proud of achieving perfect interoperability in XSD 1.0.”

As it happens, I prefer not to talk about that, especially not with Enrique when he is smiling like a cat working with a mouse. So I contented myself with asking “So tell me about Cheney. You mentioned security a moment ago. Is that the connection to the Vice President?”

“Yes, Cheney is intended for deployment in high-security situations. It’s got rock-solid security, and it’s fully conforming to XSD 1.0. Oh, and it’s highly optimized, so it’s blazingly fast.”

“How does it work?”

He handed me a piece of paper on which was written:

echo "You are not authorized for that information."

“That’s it?” I asked.

“Yes. Sweet, isn’t it? And fast. I figure I call sell this for a bundle, to people who need a support contract.”

“Uh, you think this is conforming?”

“Sure is,” he said.

“But it doesn’t seem to do anything.”

“Well, I’m not required to expose all of the PSVI, or any particular part of it. I’m not required to expose any of the PSVI. So I don’t.”

“But —”

“And if it can be proven that some particular part of the PSVI (a) isn’t exposed by a given implementation, and (b) doesn’t affect the parts of the PSVI that are exposed by that implementation, then I can optimize by not actually performing the calculation that would otherwise be necessary. Since I don’t expose anything, I get some pretty good optimization.”

“So I see. But isn’t — well, but having a schema validator that doesn’t actually do anything — isn’t that kind of useless?”


Long pause.

“My customers show up with a checklist saying they need a conforming processor, and I give it to them. They have been trained to think that conformance or non-conformance is a useful way to describe processors. Most of them think it’s the most important way. And you can understand why. Lots of specs go to great lengths to try to ensure that conforming processors are useful. For those specs, asking first of all about conformance is a very useful way to find useful software. Is it my fault that the XSD 1.0 spec doesn’t bother to make sure that conformance is a useful idea? Eventually, customers catch on; I don’t get much repeat business. But they don’t usually catch on until after their checks have cleared.”

“Isn’t that cheating?”

“Cry me a river, liberal. If you had wanted conforming processors to be useful, and to give the same results as other conforming processors, then you would have done a better job defining conformance, wouldn’t you?”

He paused and then declaimed in the tones of a television advertisement: “Cheney. What you don’t know, we won’t tell you. Interoperate with that.”

Grune and Jacobs, Parsing Techniques, Second Edition

[18 March 2008]

Good news.

The second edition of Parsing techniques: A practical guide, by Dick Grune and Ceriel J. H. Jacobs, has now appeared. I found this out not long ago and ordered a copy, and it was here waiting for me when I came home from my most recent trip, and I’ve now had a day or so to look at it.

The first edition has been out of print for a while, which was frustrating at first since I liked recommending the book. But then the authors got the rights back from the publisher, and made Postscript and PDF of the first edition available on Dick Grune’s web site. A few years ago Dick Grune told me a second edition was planned, and I have been waiting rather impatiently ever since.

It was worth the wait. Everything that made the first edition so useful and fun to read has been retained.

Well, almost; there is one minor exception. The second edition has 662 pages, compared to the first edition’s 322, and while Springer has done a very good job of keeping the book compact and handy, it does weigh more, so that when I sit on the couch to read it, eventually the weight makes my little finger hurt. But the new material is, well, worth the weight. (Sorry!)

The best way to give a feel for the book is to quote from the preface to the first edition:

This book does not address a strictly uniform audience. On the contrary, while writing this book, we have consistently tried to imagine giving a course on the subject to a diffuse mixture of students and faculty members of assorted faculties, sophisticated laymen, the avid readers of the science supplement of the large newspapers, etc. Such a course was never given; a diverse audience like that would be too uncoordinated to convene at regular intervals, which is why we wrote this book, to be read, studied, perused or consulted wherever or whenever desired.

For myself, I’ve read, studied, and perused the first edition repeatedly since I bought the first edition at Kroch’s and Brentano’s in downtown Chicago in 1992 (back when their computer science department had computer science books, not just dummies books about popular computer programs). And I’ve consulted it for reference material many times.

They give a less formal account of automata and the key theorems (pumping lemma and so on) than, say, Hopcroft and Ullman, but nevertheless a clear one, and incomparably more accessible. And they give full, interesting accounts of a lot of topics not well covered in other books on parsing I’ve looked at. Not everyone knows that parsing does not necessarily proceed left to right, but can run right to left, or even non-directionally, but anyone who has read this book will have a very clear understanding of the matter. (This came up once in connection with an obscure passage of ISO 8879, which defines SGML: the spec says that certain attributes default to the “most recently specified” value. But most recent on what time axis? It turned out that the spec was written in the belief that constructs in a language must necessarily be recognized left to right; the editor expressed disbelief when I told him that other recognition orders were possible.)

The first edition had a very good treatment of van Wijngaarden grammars, affix grammars, and attribute grammars; the second edition expands the treatment into a full chapter on non-Chomsky grammars which also describes tree-adjoining grammars (which I now understand in principle, though I still don’t know how to write a parser for them), coupled grammars, ordered grammars, ‘recognition systems’, and boolean grammars. I had just noticed that there’s no discussion of adaptive grammars and was feeling just a little smug (“wow, there’s an unusual approach to parsing that I know about that Grune and Jacobs don’t mention; I must be hot stuff”) when I found a paragraph that says

One interesting possibility not explored here is to modify the CF grammar under the influence of parser actions. This leads to dynamic grammars, also called modifiable grammars, or adaptable grammars.

And the web-based bibliography for the second edition lists publications on the topic going back to 1963.

The book is much more than “a reservation for endangered parsers” (in the words of Kees van Reeuwijk, quoted in the preface): the descriptions of the standard algorithms are clear and easy to follow, and the authors provide a clear sense of the varying points of view and value schemes of mathematicians studying formal languages, computer scientists, and users from other fields seeking to apply grammars and parsing techniques in those fields.

If you have any interest at all in formal languages, grammars, or parsing, this book should be on your shelf.

A kind of hommage

[12 March 2008]

I’m traveling to Germany, so I had an overnight flight last night. A long time ago I swore off red-eye flights from California to the East Coast, but for travel to Europe there seem to be no alternatives. Arriving in Amsterdam and then taking the train further seemed like a civilized way to organize things: it doesn’t make that much difference in the final arrival time, and trains are almost always more comfortable to travel in than airplanes.

I forgot to ask whether there were direct connections. There aren’t.

I had dimly imagined climbing into a train, finding a seat, and traveling further in a half sleep for a few hours before being deposited at my destination. The wait in Amsterdam seemed normal enough — I like Schiphol, it feels like one of the more civilized of European airports I travel through. But the change in Utrecht was a bit of a strain, and by noon and my second change of trains in Duisburg I had begun to feel as if I were in a surrealist movie.

Somewhere along the way, by a process that seems to have resembled automatic writing, the following account of an encounter with my evil twin appeared in my notebook.

My evil twin dropped by again the other day. He was not happy that in a couple of recent posts I had given him the pseudonym “Skippy”.

“Skippy!? What kind of name is that for an evil twin? I want a new name.”

“Look, I’m sorry if you didn’t like it. I was in kind of a hurry and it was the name that came to mind.”

“‘Skippy!’” He sulked a little more. “It sounds like the kind of evil twin George H. W. Bush would have.”

“Well, exactly,” I said. “It’s a reference to Garry Trudeau’s strips about Bush 41. Think of it as a kind of hommage.”

“Hommage, hell, you just got lazy. I don’t want to be Skippy, even as a cover name. Call me … Enrique.”


“Yes. Don’t you recognize it? It’s a reference to the Incredible Zambini Brothers.”

“Who? Sounds like an obscure San Francisco band.”

“No, you’re probably thinking of the Sons of Champlin. But it’s true, the Zambini brothers did have a cult classic once — The Incredible Zambini Brothers, All-Stars Again. It was a kind of combination jam session, cookbook, literary anthology, and football playbook. Riveting, really, if you have the right kind of chemical enhancements. So yeah, think of it as a kind of hommage.”

I have got to start getting more sleep when I do trans-Atlantic flights.

Skippy returns (XML documents and infosets)

[5 March 2008]

My evil twin Skippy continues his reaction to the issue Henry Thompson has raised against the SML specification. He has lost the first point, in which he claimed that the XML spec doesn’t actually say that XML documents are character sequences.

Well, OK. But even if only character sequences can be XML documents strictu sensu, how does referring to information sets instead of XML documents actually help? Any system which handles XML documents in their serial form can also, if it wishes to, handle other representations of the same information without any loss of conformance.

At this point, I interrupt to remind Skippy that some people, at least, believe that if a spec requires that its implementations work upon XML documents, then by that same rule it forbids them from accepting DOM objects, or SAX event streams, and working on them. This is roughly how the XSD spec came to describe an XML syntax for schema documents, while carefully denying that its input needed to be in XML.

Skippy responds:

Suppose I claim to be a conforming implementation of the XYZ spec, which defines document the way SML does, as well-formed XML, instead of as XSD does, as “an information set as defined in [XML-Infoset]”.

Actually, I interject, this isn’t offered as a definition of document, just as a characterization of the input required for schema-validity assessment. Skippy points out, in return, that “as defined in [XML-Infoset]” is more hand-waving, since in fact the Infoset spec does not offer any definition of the term information set.

Eventually, we stop squabbling and I let him continue his argument.

I allow people to define new documents through a DOM or SAX interface, and I work with those documents just as I work with documents I acquire through a sequence of characters conforming to the grammar of the XML spec.

Now, suppose someone who believes that angle brackets are the true test of XMLness decides to challenge my claim to conformance. They will claim, I believe, that I am working with things that are not really XML documents, and thus violating the XYZ spec, which says that they MUST be XML documents. My response will be to ask why they think that. “Show us the data!” they will cry. And I will provide a sequence of characters which is indisputably an XML document.

The claim that I am not conforming is necessarily based on a claim about what I am doing ‘under the covers’, rather than at my interfaces to the outside world. But precisely because it’s a question of what I do “inside”, there is no way for an outside observer to know whether I am working directly from the DOM data structure or SAX event stream, or am making a sequence of characters internally, and working from that. One reason there is so little problem in reality about the distinction between serialized XML documents and other representations, even in arguments about conformance, is that in reality, if XML is accepted at, and produced at, the interfaces when requested, there is no way to make any argument that forbids other forms from being accepted at interfaces.

I have always thought of myself as a reasonably good language lawyer, but I am not sure how to refute Skippy’s claim. Can any readers suggest ways to prove that it matters? Or is Skippy right that SML will do better to refer to the XML spec, to describe its inputs, than to the Infoset spec?

Are XML documents character sequences?

[5 March 2008]

Henry Thompson has raised an issue against the SML spec, asking why SML defines the term document as

A well-formed XML document, as defined in [XML].

Henry suggests:

‘document’ as defined in the XML spec. is a character sequence — this seems unnecessarily restrictive — surely a reference to the Infoset spec. would be more appropriate here?

At this point, a voice which I identify as that of my evil twin Skippy sounds in my head, to wonder whether it really makes a difference or not.

I disagree with the premise that the XML spec defines “an XML document” solely as a character sequence. What I remember from the XML spec about the fundamental nature of XML documents is mostly energetic hand-waving.

Hmm. What is the truth of the matter?

I think Skippy has a point here, at least about the hand-waving. Both the first edition of the XML spec, with which I am probably most familiar, and the current (fourth) edition, say “A textual object is a well-formed XML document if” certain conditions are met. Surely the phrase “textual object” begs the question.

Both versions of the spec also hyperlink the words “XML document” to what seems to be a definition of sorts:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

The phrase “data object” may be ever so slightly less suspect than “textual object”, but I’m feeling quite a breeze here.

Certainly this is consistent with my recollection that I actively resisted Dan Connolly’s repeated request to nail down, once and for all, what counted as an “XML document”. Most SGML users, I think, were inclined to regard the terms document and SGML document as denoting not just sequences of characters, but also any data structures with recognizable elements, attributes, parents, siblings, and children, and so forth that one might choose to represent the same information in. And I certainly planned to do the same for the term XML document. If anyone had told me that a DOM or SAX interface does not expose an XML document but something which must be religiously distinguished from XML documents, I would have said “Bullshit. The document production defines the serialization of a document, not the document.”

Some people might say, given the level of abstraction implicit in the XML WG’s usage of the term document (or at least in mine), that we regarded the term document as denoting any instance of the data model. (Of course, others will immediately chime in to say that SGML and XML don’t have data models. But as I’ve said elsewhere, at this point I in the discussion I don’t have any idea what the second group means by the words “have a data model”, so I can’t say one way or another.) Speaking for myself, I think I would have been horrified by the suggestion that once one had read a document into memory and begun working with it, one was no longer working with an XML document.

But nevertheless, the XML spec may have nailed things down in just the way I thought I had successfully resisted doing. The conditions imposed on a data object, in order for it to be an XML document, include that it be well-formed, which in turn entails that

Taken as a whole, it matches the production labeled document.

It’s embarrassing for those who like to think of XML documents in a slightly more abstract way, but it’s not clear to me how anything but a character sequence can match the ‘document’ production.

Sorry, Skippy, but while I think Henry’s note over-simplified things, he is probably more nearly right than you.

it probably won’t change by one whit the way I use the term XML document. But it’s nice to learn some new things now and then, even if what you learn is that you failed at something you attempted, ten years ago, and that as a consequence, you’ve been wrong about a related question ever since.