Allowing case-insensitive language tags in XSD

[30 July 2008]

I found a note to myself today, while going through some old papers, reminding me to write up an idea the i18n Working Group and I had when we were discussing the problem of case (in)sensitivity and language tags, some time ago. Here it is, for the record.

The discussion of the language datatype in XSD 1.1 includes a note reading:

Note: [BCP 47] specifies that language codes “are to be treated as case insensitive; there exist conventions for capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.” Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals ‘MN’ and ‘mn’ (for Mongolian) therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.

The same is true of XSD 1.0, even if it doesn’t point out the problem as clearly.

As can be imagined, there have been requests that XSD 1.1 define a new variant of xsd:string which is case-insensitive, to allow language tags to be specified properly as a subtype of case-insensitive string and not as a subtype of the existing string type. Or perhaps language tags need to be a primitive type (as John Cowan has argued) instead of a subtype of string. The former opens a large can of worms, the same one that led XML 1.0 to be case-sensitive in the first place. (If you haven’t tried to work out a really good locale-insensitive internationalized rule for case folding, try it sometime when your life is too placid and simple and all your problems are too tractable; if the difference between metropolitan French and Quebecois doesn’t make you give up, remember to explain how you’re going to handle Turkish and English in the same rule.) The latter doesn’t open that can of worms (for the restricted character set allowed in language tags, case folding is well behaved and well defined), but it does open others. I’ve talked about language codes as a subtype of string before, so I won’t repeat it here.

In some cases the case-sensitivity of xsd:language is not a serious problem: we can write our schema to enforce the usual case conventions, by which language tags should be lower-case and country codes should be uppercase, and we can specify that a particular element should have a language tag of “en”, “fr”, “ja”, or “de” by using an enumeration in the usual way:

<xsd:simpleType name="langtag">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="de"/>
  </xsd:restriction>
</xsd:simpleType>

This datatype will accept any of the four language codes indicated, but only if they are written in lower case.

But what if we want a more liberal schema, which allows case-insensitive language tags? We want to accept not just “en” but also “EN”, “En”, and (because we are determined to do the thing properly) even “eN”.

We could add those to the enumeration: for each language, specify all four possible forms. No one seems to like this idea: it makes the declaration four times as big and much less clear. When I suggested it to the i18n WG, they just groaned and François Yergeau looked at me as if I had emitted an indelicate noise he didn’t want to call attention to.

We were all happier when a different idea occurred to us. First note that the datatype definition given above can easily be reformulated using a pattern facet instead of an enumeration:

<xsd:simpleType name="langtag2">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag2</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="en|fr|ja|de"/>
  </xsd:restriction>
</xsd:simpleType>

This definition can be adjusted to make it case insensitive in a relatively straightforward way:

<xsd:simpleType name="langtag3">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag3</code>
      type lists the language codes that Amalgamated Interkluge
      accepts, in any combination of upper and lower case:
      English (en), French (fr), Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="[eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE]"/>
  </xsd:restriction>
</xsd:simpleType>

Voilà, case-insensitive language tags. The pattern is not quite four times larger than the old pattern, but the declaration is still smaller than the first one using enumerations.

A side benefit of using the pattern instead of the enumeration is that it’s easier to allow for subtags (so we can accept “en-US” and “en-GB”, etc., as well as just “en”) by expanding on the pattern:

<xsd:simpleType name="langtag4">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag4</code>
      type lists the language codes that Amalgamated Interkluge
      accepts, in any combination of upper and lower case and
      optionally followed by subtags:  English (en), French (fr),
      Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern
      value="([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*"/>
  </xsd:restriction>
</xsd:simpleType>

In a perfect system, there would be some way to signal that the four upper-, lower-, and mixed-case forms of “en” all mean the same thing and map to the same value. This technique does not provide that. But then, I don’t know any good way to provide it. (I do know ways to provide it, just none that I think are good.) In an imperfect world, if I want case-insensitive language tags, I suppose I should be happy that I can find a way to define them without much inconvenience. And that, this technique provides.
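
For readers working with XSD 1.1, an assertion offers another way to write the same constraint. What follows is a minimal sketch, not a tested schema: it assumes a processor that implements the xsd:assertion facet and XPath 2.0 as XSD 1.1 defines them, and the type name langtag5 is invented here for illustration. Like the pattern, it changes only what is accepted; ‘EN’ and ‘en’ are still distinct values.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- langtag5 is a hypothetical name, continuing the numbering
       of the earlier examples. Requires an XSD 1.1 processor. -->
  <xsd:simpleType name="langtag5">
    <xsd:restriction base="xsd:language">
      <!-- $value is the value being validated. lower-case() folds
           the letters of the tag, so en, EN, En, and eN all pass;
           the comparison succeeds if any listed code matches. -->
      <xsd:assertion
        test="lower-case($value) = ('en', 'fr', 'ja', 'de')"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

To allow subtags as well, the test could instead be matches($value, '^(en|fr|ja|de)(-[a-z0-9]{1,8})*$', 'i'), which relies on the case-insensitive flag of the XPath 2.0 matches function.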

XSD 1.1 is in Last Call

Yesterday the World Wide Web Consortium published new drafts of its XML Schema Definition Language (XSD) 1.1, as ‘last-call’ drafts.

The idiom has an obscure history, but is clearly related to the last call for orders in pubs which must close by a certain hour. The working group responsible for a specification labels it ‘last call’, as in ‘last call for comments’, to indicate that the working group believes the spec is finished and ready to move forward. If other working groups or external readers have been waiting to review the document, thinking “there’s no point reviewing it now because they are still changing things”, the last call is a signal that the responsible working group has stopped changing things, so if you want to review it, it’s now or never.

The effect, of course, can be to evoke a lot of comments that require significant rework of the spec, so that in fact it would be foolish for a working group to believe they are essentially done when they reach last call. (Not that it matters what the WG thinks: a working group that believes last call is the end of the real work will soon be taught better.)

In the case of XSD 1.1, this is the second last call publication both for the Datatypes spec and for the Structures spec (published previously as last-call working drafts in February 2006 and in August 2007, respectively). Each elicited scores of comments: by my count there are 126 Bugzilla issues on Datatypes opened since 17 February 2006, and 96 issues opened against Structures since 31 August 2007. We have closed all of the substantive comments, most by fixing the problem and a few (sigh) by discovering either that we could not reach consensus on what to do about the problem (or in some cases could not reach consensus about whether there was really a problem before us) or that we could not make the requested change without more delay than seemed warrantable. There are still a number of ‘editorial’ issues open, which are expected not to affect the conformance requirements for the spec or to change the results of anyone’s review of the spec, and which we therefore hope to be able to close after going to last call.

XSD 1.1 is, I think, somewhat improved over XSD 1.0 in a number of ways, ranging from the very small but symbolically very significant to much larger changes. On the small but significant side: the spec has a name now (XSD) that is distinct from the generic noun phrase used to describe the subject matter of the spec (XML schemas), which should make it easier for people to talk about XML schema languages other than XSD without confusing some listeners. On the larger side:

  • XSD 1.1 supports XPath 2.0 assertions on complex and simple types. The subset of XPath 2.0 defined for assertions in earlier drafts of XSD 1.1 has been dropped; processors are expected to support all of XPath 2.0 for assertions. (There is, however, a subset defined for conditional type assignment, although here too schema authors are allowed to use, and processors are allowed to support, full XPath.)
  • ‘Negative’ wildcards are allowed, that is wildcards which match all names except some specified set. The excluded names can be listed explicitly, or can be “all the elements defined in the schema” or “all the elements present in the content model”.
  • The xs:redefine element has been deprecated, and a new xs:override element has been defined which has clearer semantics and is easier to use.
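
As a rough illustration of the three features just listed, here is a sketch of a schema document that uses each of them. It follows the syntax of the XSD 1.1 Recommendation, which may differ in detail from the drafts discussed here; the namespace http://example.org/ns, the file base.xsd, and the names quantity, range, min, max, and secret are all hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:ex="http://example.org/ns"
           targetNamespace="http://example.org/ns"
           elementFormDefault="qualified">

  <!-- xs:override: pull in base.xsd (assumed to share this target
       namespace), replacing its definition of 'quantity' with the
       one given here. -->
  <xs:override schemaLocation="base.xsd">
    <xs:simpleType name="quantity">
      <xs:restriction base="xs:positiveInteger"/>
    </xs:simpleType>
  </xs:override>

  <xs:complexType name="range">
    <xs:sequence>
      <xs:element name="min" type="xs:integer"/>
      <xs:element name="max" type="xs:integer"/>
      <!-- Negative wildcard: matches any element except ex:secret
           and except elements declared globally in the schema
           (##defined). ##definedSibling would instead exclude the
           elements already present in this content model. -->
      <xs:any minOccurs="0" maxOccurs="unbounded"
              processContents="lax"
              notQName="ex:secret ##defined"/>
    </xs:sequence>
    <!-- Assertion on a complex type, written in XPath 2.0 and
         evaluated with the element being validated as context. -->
    <xs:assert test="ex:min le ex:max"/>
  </xs:complexType>

</xs:schema>
```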

Some changes vis-a-vis 1.0 were already visible in earlier drafts of 1.1:

  • The rules requiring deterministic content models have been relaxed to allow wildcards to compete with elements (although the determinism rule has not been eliminated completely, as some would prefer).
  • XSD 1.1 supports both XML 1.0 and XML 1.1.
  • A conditional inclusion mechanism is defined for schema documents, which allows schema authors to write schema documents that will work with multiple versions of XSD. (This conditional inclusion mechanism is not part of XSD 1.0, and cannot be added to it by an erratum, but there is no reason a conforming XSD 1.0 processor cannot support it, and I encourage makers of 1.0 processors to add support for it.)
  • Schema authors can specify various kinds of ‘open content’ for content models; this can make it easier to produce new versions of a vocabulary with the property that any document valid against the new vocabulary will also be valid against the old.
  • The Datatypes spec includes a precisionDecimal datatype intended to support the IEEE 754R floating-point decimal specification recently approved by IEEE.
  • Processors are allowed to support primitive datatypes, and datatype facets, additional to those defined in the specification.
  • We have revised many, many passages in the spec to try to make them clearer. It has not been easy to rewrite for clarity while retaining the kind of close correspondence to 1.0 that allows the working group and implementors to be confident that the rewrite has not inadvertently changed the conformance criteria. Some readers will doubtless wish that the working group had done more in this regard. But I venture to hope that many readers will be glad for the improvements in wording. The spec is still complex and some parts of it still make for hard going, but I think the changes are trending in the right direction.
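
Two of the items above, conditional inclusion and open content, may be easier to picture with a sketch. The syntax is that of the XSD 1.1 Recommendation and may differ in detail from the drafts discussed here; the type name note and its children title and body are hypothetical. The sketch works as intended only with processors that understand the vc: attributes, which is why support for them in 1.0 processors matters.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

  <!-- Used by processors implementing XSD 1.1 or later. -->
  <xs:complexType name="note" vc:minVersion="1.1">
    <!-- Open content: elements from other namespaces may be
         interleaved anywhere among the declared children. -->
    <xs:openContent mode="interleave">
      <xs:any namespace="##other" processContents="lax"/>
    </xs:openContent>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Fallback for versions below 1.1 (vc:maxVersion is exclusive):
       extensions are allowed only after the declared children. -->
  <xs:complexType name="note" vc:maxVersion="1.1">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```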

If you have any interest in XSD, or in XML schema languages in general, I hope you will take the time to read and comment on XSD 1.1. The comment period runs through 12 September 2008. The specs may be found on the W3C Technical Reports index page.

Caspar

[8 April 2008]

I had another bad dream last night. Enrique came back.

“I don’t think you liked Cheney very much. Or Eric van der Vlist, either, judging from his comment. So I’ve written another conforming XSD 1.0 processor, in a different vein.” He handed me a piece of paper which read:

#!/bin/sh
echo "Input not accepted; unknown format."

“This is Caspar.”

“Caspar as in Weinberger?”

“Hauser. As you can see, Caspar doesn’t have the security features of Cheney, but it’s also conforming.”

I should point out for readers who have not encountered the figure of Kaspar Hauser that he was a mysterious young man found in Nuremberg in the 1820s, raised allegedly without language or human contact, who never quite found a way to fit into society, and is thus a convenient focal point for meditations on language and civilization by artists as diverse as Werner Herzog, Paul Verlaine, and Elizabeth Swados.

“Conforming? I don’t think I want to know why you think so.”

“Oh, sure you do. Don’t be such a wet blanket.”

“I gather you’re going to tell me anyway. OK, if there’s no way out of it, let’s get this over with. Why do you think Caspar is a conforming XSD processor?”

“I’m not required to accept XML, right? Because that, in the logic of the XSD 1.0 spec, would impede my freedom to accept input from a DOM or from a sequence of SAX events. And I’m not required to accept any other specific representation of the infoset. And I’m also not required to document the formats I do accept.”

“There don’t seem to be any conditionals here: it looks at first glance as if Caspar doesn’t support any input formats.” (Enrique, characteristically sloppy, seems to have thought that Hauser was completely alingual and thus could not understand anything at all. That’s not so. But I didn’t think it was worth my time, even during a bad dream, to argue with Enrique over the name of his program.)

“Right. Caspar understands no input formats at all. I love XSD 1.0; I could write conforming XSD 1.0 processors all day; I don’t understand why some people find it difficult!”

Cheney

[22 March 2008]

My evil twin Enrique [formerly known as Skippy] dropped by again the other day.

“I’ve written the world’s smallest conforming XSD 1.0 implementation,” he announced.

“Really?”

“And it’s open source. Wanna see it? I call it Cheney. (Think of it as a kind of hommage.)”

I began to have a bad feeling about this. “Cheney?”

“Yeah. He da man, when it comes to fast schema validation.”

“Tell me about it.”

“Well, you know how proud the XML Schema working group is that the spec doesn’t require that XSD processors accept schema documents or input that’s actually in XML?”

“Er, some of the working group, yes.”

“And you know how carefully 1.0 was drafted so as to ensure that processors can be written for very specialized environments? For example, highly security-conscious ones. So there’s no required output format, no required API — nothing is required.”

“Ah, yes, right.”

“And you will recall, of course, how strongly some in the working group have resisted any suggestion in the spec that implementations be required to document anything.”

“I remember.”

“Because that would be telling people how to write their documentation.” He was smiling.

I refused to take that bait.

“And the working group is so proud of achieving perfect interoperability in XSD 1.0.”

As it happens, I prefer not to talk about that, especially not with Enrique when he is smiling like a cat working with a mouse. So I contented myself with asking “So tell me about Cheney. You mentioned security a moment ago. Is that the connection to the Vice President?”

“Yes, Cheney is intended for deployment in high-security situations. It’s got rock-solid security, and it’s fully conforming to XSD 1.0. Oh, and it’s highly optimized, so it’s blazingly fast.”

“How does it work?”

He handed me a piece of paper on which was written:

#!/bin/sh
echo "You are not authorized for that information."

“That’s it?” I asked.

“Yes. Sweet, isn’t it? And fast. I figure I can sell this for a bundle, to people who need a support contract.”

“Uh, you think this is conforming?”

“Sure is,” he said.

“But it doesn’t seem to do anything.”

“Well, I’m not required to expose all of the PSVI, or any particular part of it. I’m not required to expose any of the PSVI. So I don’t.”

“But —”

“And if it can be proven that some particular part of the PSVI (a) isn’t exposed by a given implementation, and (b) doesn’t affect the parts of the PSVI that are exposed by that implementation, then I can optimize by not actually performing the calculation that would otherwise be necessary. Since I don’t expose anything, I get some pretty good optimization.”

“So I see. But isn’t — well, but having a schema validator that doesn’t actually do anything — isn’t that kind of useless?”

“So?”

Long pause.

“My customers show up with a checklist saying they need a conforming processor, and I give it to them. They have been trained to think that conformance or non-conformance is a useful way to describe processors. Most of them think it’s the most important way. And you can understand why. Lots of specs go to great lengths to try to ensure that conforming processors are useful. For those specs, asking first of all about conformance is a very useful way to find useful software. Is it my fault that the XSD 1.0 spec doesn’t bother to make sure that conformance is a useful idea? Eventually, customers catch on; I don’t get much repeat business. But they don’t usually catch on until after their checks have cleared.”

“Isn’t that cheating?”

“Cry me a river, liberal. If you had wanted conforming processors to be useful, and to give the same results as other conforming processors, then you would have done a better job defining conformance, wouldn’t you?”

He paused and then declaimed in the tones of a television advertisement: “Cheney. What you don’t know, we won’t tell you. Interoperate with that.”

A little formalism (variable names)

Many readers associate the use of variables with mathematics and feel threatened by paragraphs that begin “Let E be … and F be …. Then …” And similarly with technical terms: when a text defines and uses a lot of technical terms, it can be very daunting to the first-time reader (and many others).

So it’s understandable that sometimes, in trying to keep a text accessible to the reader, one works hard to avoid having to introduce variables to refer to things, and to avoid relying on technical terms with special meanings.

But sometimes such efforts backfire. In the XSD (XML Schema Definition Language) 1.0 spec, you end up with rules that read like this:

Validation Rule: Element Locally Valid (Type)

For an element information item to be locally ·valid· with respect to a type definition all of the following must be true:

1 The type definition must not be ·absent·;
2 It must not have {abstract} with value true.
3 The appropriate case among the following must be true:

3.1 If the type definition is a simple type definition, then all of the following must be true:

3.1.1 The element information item’s [attributes] must be empty, excepting those whose [namespace name] is identical to http://www.w3.org/2001/XMLSchema-instance and whose [local name] is one of type, nil, schemaLocation or noNamespaceSchemaLocation.
3.1.2 The element information item must have no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4) did not apply, then the ·normalized value· must be ·valid· with respect to the type definition as defined by String Valid (§3.14.4).
3.2 If the type definition is a complex type definition, then the element information item must be ·valid· with respect to the type definition as per Element Locally Valid (Complex Type) (§3.4.4);

I would say “Maybe it’s just me, but I find that kind of hard to read,” but that would be disingenuous. There is ample evidence from the last eight or nine years that I am not the only reader of the XSD 1.0 spec who finds parts of it hard to read. This is a relatively mild example, as the XSD spec goes. But if we can overcome our fear of formality, the text can become a bit simpler. Two changes in particular seem useful here.

  • Introduce the names E for the element and T for the type, and use them.
  • Follow the example of most specs that define and use namespaces: specify and use a conventional prefix to represent a given namespace, and say once and for all, when that prefix is identified, that in practice the user can use any prefix they wish (or none). Then just use the QNames, rather than writing out the namespace in full each time you have to talk about names in that namespace.

Applying these rules to the fragment just given, we get something a bit easier to read.

Validation Rule: Element Locally Valid (Type)

For an element information item E to be locally ·valid· with respect to a type definition T all of the following must be true:

1 T is not ·absent·;
2 T does not have {abstract} with value true.
3 The appropriate case among the following is true:

3.1 If T is a simple type definition, then all of the following are true:

3.1.1 E’s [attributes] are empty, excepting those named xsi:type, xsi:nil, xsi:schemaLocation, or xsi:noNamespaceSchemaLocation.
3.1.2 E has no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4.3) did not apply, then the ·normalized value· is ·valid· with respect to T as defined by String Valid (§3.16.4).
3.2 If T is a complex type definition, then E is ·valid· with respect to T as per Element Locally Valid (Complex Type) (§3.4.4.2);
4 If E has an xsi:type [attribute] and does not have a ·governing element declaration·, then the ·actual value· of xsi:type ·resolves· to T.

I won’t claim that the text has become easy to read and follow, but I think there is one salient difference: in the first text above, my first difficulty as a reader is understanding what the text is trying to say, and once I have figured that out, I may or may not have energy left to try to understand why it’s saying that. In the second text, it’s easier (I think) to understand what the individual clauses are saying. The reader still has the task of understanding why, but at least the difficulties of comprehension are now those related to the intrinsic difficulty of the topic, without the additional barrier of complex syntax.

Another tactic adopted by some in trying to make difficult material easier to read is to avoid defining technical terms. The XSD 1.0 spec raises this to a fine art; often, the easiest way to understand how a given rule came to be formulated as it is, is to imagine that it was first written in a simple, straightforward clause using technical terms, and then the technical terms were eliminated and their definitions inserted inline. And then the process was repeated once, or twice, or more. The result is mostly devoid of difficult or obscure technical usages, but it’s often also a sentence only an eighth-grade English teacher teaching the unit on sentence diagramming could love.

If we re-introduce appropriate technical terms, this process can be reversed. Sometimes the introduction of even a single technical term can do a surprising amount of good.

Take the following example from the XSD spec:

2.3.1 The element declaration is local (i.e. its {scope} must not be global), its {abstract} is false, the element information item’s [namespace name] is identical to the element declaration’s {target namespace} (where an ·absent· {target namespace} is taken to be identical to a [namespace name] with no value) and the element information item’s [local name] matches the element declaration’s {name}.

In this case the element declaration is the ·context-determined declaration· for the element information item with respect to Schema-Validity Assessment (Element) (§3.3.4) and Assessment Outcome (Element) (§3.3.5).

This is followed by another clause with almost identical wording, covering global elements.

If we make use of the term expanded names, defined by the Namespaces in XML recommendation, and refer to the expanded names of the declaration and element instead of inlining the definition of expanded name by referring to namespace name + local name pairs (this entails defining the term expanded name as it applies to schema components), and supply the obvious variable names for element and declaration, then it’s easier to see that this rule for local element declarations can be merged with the following rule for global element declarations, since the two do exactly the same thing. So we can replace both the rule above and the rule that follows it in the spec with:

If I’m smiling this evening, it’s because this morning the XML Schema working group agreed to these changes, and scores of other similar changes, to the text of the XSD 1.1 spec. The design of the language, I admit, is still very complex. The exposition, I concede, still has a sub-optimal structure. But the third source of difficulty, namely the complexity of individual sentences in the validation rules and constraints on schema components, is somewhat diminished by this change.

Variable names as a short-hand for complex noun phrases; technical terms to capture frequently needed concepts; conventions to allow things to be said simply instead of in convoluted clauses: it’s almost enough to make you think that mathematical writing is the way it is, in order to make things easier to read, instead of harder to read. Food for thought.