March | 2008 | Messages in a Bottle

The recent publication of XML 1.0 Fifth Edition, and the ensuing discussion of how best to define the set of name characters, has made me think about a proposal that came up during the original development of XML 1.0.

We had taken as our task the definition of a grammar production to select all and only the Unicode characters suitable for use in identifiers. For our SGML-influenced minds, suitability essentially meant being a letter or something letter-like, by analogy with the characters A through Z and a through z allowed in SGML’s reference concrete syntax, or a numeric symbol, by analogy with 0 through 9. Now, it is an empirical fact that any such grammar production will be complicated, especially if it excludes not just unsuitable Unicode characters but unassigned code points.
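To give a rough sense of what such a production looks like once it reaches a lexer, here is a minimal sketch of a range-table lookup. It assumes the much simpler NameStartChar ranges of the Fifth Edition, transcribed from memory and worth checking against the spec, rather than the longer lists in productions [84] through [89]. The names `NAME_START_RANGES` and `is_name_start` are invented for illustration.

```python
from bisect import bisect_right

# Inclusive code point ranges, sorted by start. These are assumed to match
# the NameStartChar production of XML 1.0 Fifth Edition; verify against the
# spec before relying on them.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6),
    (0xD8, 0xF6),
    (0xF8, 0x2FF),
    (0x370, 0x37D),
    (0x37F, 0x1FFF),
    (0x200C, 0x200D),
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Start points only, so that bisect can find the candidate range.
_STARTS = [lo for lo, _ in NAME_START_RANGES]


def is_name_start(ch):
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp, or -1 if there is none.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= NAME_START_RANGES[i][1]
```

Even this simplified table needs sixteen ranges and a search. The original productions, which tried to exclude unassigned code points as well, needed far more.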

As we were contemplating the unappetizing lists of ranges and properties that became productions [84] through [89] of the XML spec, someone — my memory says it was Tim Bray, but I have not tried to verify this memory against the historical record, and I don’t remember whether it was in email or on a phone call — said “Well, I realize I risk being labeled a xenophobe for suggesting this, but you know, the lexer has to have character tables for recognizing names and other tokens, and this rule is going to make those character tables just huge. We all know that in the ideal, normal case, the user is not going to be eyeballing XML source directly, you’re going to want to have some sort of presentation interface. So the element names are just codes, not utterances in natural language. They might as well be E001, E002, E003, and so on, as far as the parser is concerned. So why don’t we just do the same thing as SGML’s reference concrete syntax and restrict names to ASCII characters? That would make the tables you need for the lexical scanner a lot smaller. Any UCS character can go in the data, and the data is all the human reader needs to care about anyway. It’s only in the element and attribute names that we make the restriction, and the proposal is not really Anglo-centric since element and attribute names don’t have to be natural-language words; they shouldn’t be part of the user interface in the first place.”

I remember having a sinking heart at this point. The analysis is so plausible, and yet wrong in so many ways. One of the design goals of XML is that it should be legible to humans, and having identifiers that take the form of words in a natural language is an important tool in making markup legible. E001, E002, and so on just don’t cut it. (A librarian did suggest once in all seriousness that SGML vocabularies should have numeric identifiers, like the numeric field identifiers in MARC records. 245 means main title, everyone knows that, but unlike “main title” it doesn’t have an Anglophone bias. But as a design principle for actual vocabularies, this seemed to me too much like ensuring that all languages are treated equally by ensuring that the vocabulary is hard to understand for everyone.)

Also, while I have seen some very nice interfaces to marked up text, I’ve spent most of my writing life over the last twenty years working with SGML and XML source, mostly in emacs, and no interface that hides the markup has ever made me want to change. (For a while I did use the hide-tags mode in XMetal for rereading and copy editing, but I could never draft in it, and when I upgraded my operating system and XMetal ceased to run, I didn’t actually spend a lot of time looking for a new copy, or a new GUI.) I don’t want to hide the tags. The tags are what make things work. There is a story that a worker came into Bertolt Brecht’s apartment in Berlin once to install some curtains, and after doing so he started to put up a valance to hide the curtain rod and the hooks and strings. Brecht came into the room and made him stop and take it back down, with (as I remember it — I have not spent any time looking for the source of this story to make sure I’m getting the details right) the words “Never hide the mechanism!” When it comes to markup, at least, I’m with Brecht.

So for me, and anyone like me, the claim “no human really ever looks at this stuff” is false (and also conveys indirectly the suggestion that we counter-examples aren’t really human and thus don’t need to be accounted for). Even if everyone in the world used tag-hiding editors, the programmers who create those editors and the stylesheet authors who specify the mapping from tags to display form will spend time looking at the raw markup. And last I looked, they were all humans, too.

But none of those arguments were going to go anywhere against that brutally simple argument about the size and complexity of the scanner tables, and the seductive suggestion that it really would be inhumane to require users to use tag-visible interfaces. So for a while I was really worried that the proposal for ASCII-only identifiers was going to carry the day.

I’m not sure whether my memory of what happened next is what happened in reality, or whether I’ve just persuaded myself to accept as real a fantasy of what could have happened, if only someone had had the wit to think of it, and the nerve to say it, at the time.

The way I remember it is this. Someone (I don’t remember who) responds to the suggestion carefully, thoughtfully, and without screaming. They agree that it’s important to keep the lexical scanner simple, and that identifiers really don’t need to be used to represent arbitrary natural-language words or phrases. And they conclude by saying “So I think I’m willing to support this proposal, if we can improve it a little bit. Right now, the natural way to represent the scanner table for ASCII-only identifiers is to use a table 128 octets wide. But that’s a lot bigger than we need. So I propose that we don’t allow all the characters A through Z in both upper and lower case. If we restrict ourselves to A through O, uppercase only, then we can get by with a much smaller table. So I’ll support the proposal if we restrict identifiers not to any ASCII letter, but to A through O uppercase only, i.e. U+0041 through U+004F. And since no one wants to use natural-language words for identifiers, that’s not a problem, right?”

And then, silence. You can just about hear the proponents of ASCII-only identifiers starting to say “but then you couldn’t use ‘q’ or ‘quote’ or even ‘html’ or ‘body’ as element names,” before biting their tongues as they realize that they can’t say that without giving away the store.

Whether it happened the way I remember it or in some other less dramatic and memorable way, the end is the same: after people thought about it for a while, the proposal for ASCII-only identifiers evaporated, like dew in bright sunlight.
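The table-size arithmetic behind the counter-proposal can be sketched as follows. This is only an illustration under stated assumptions: it considers letters only (no digits or punctuation), and the names `ASCII_TABLE`, `is_ascii_letter`, `is_a_through_o`, and `name_allowed` are hypothetical, not taken from any real parser.

```python
# ASCII-only identifiers: a lookup table of 128 one-byte entries,
# one per ASCII code point, with 1 marking a permitted letter.
ASCII_TABLE = bytearray(128)
for cp in list(range(ord("A"), ord("Z") + 1)) + list(range(ord("a"), ord("z") + 1)):
    ASCII_TABLE[cp] = 1


def is_ascii_letter(ch):
    """Table lookup for the ASCII-only proposal (letters only, by assumption)."""
    cp = ord(ch)
    return cp < 128 and ASCII_TABLE[cp] == 1


def is_a_through_o(ch):
    """The 'improved' proposal: U+0041 through U+004F, no table needed at all."""
    return 0x41 <= ord(ch) <= 0x4F


def name_allowed(name, is_name_char):
    """A name is allowed if it is non-empty and every character passes the test."""
    return bool(name) and all(is_name_char(c) for c in name)
```

Under the first rule `name_allowed("body", is_ascii_letter)` holds. Under the second, neither "body" nor "q" nor "html" survives, though "FOO" does. The scanner gets smaller in exactly the way the original argument asked for, and the vocabulary becomes unusable in exactly the way the original argument said did not matter.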


CMSMcQ's klog


Scenes from a Recommendation 4: name characters