[30 July 2008]
I found a note to myself today, while going through some old papers, reminding me to write up an idea the i18n Working Group and I had when we were discussing the problem of case (in)sensitivity and language tags, some time ago. Here it is, for the record.
The discussion of the language datatype in XSD 1.1 includes a note reading:
Note: [BCP 47] specifies that language codes “are to be treated as case insensitive; there exist conventions for capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.” Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals ‘MN’ and ‘mn’ (for Mongolian) therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.
The same is true of XSD 1.0, even if it doesn’t point out the problem as clearly.
As can be imagined, there have been requests that XSD 1.1 define a new variant of xsd:string which is case-insensitive, to allow language tags to be specified properly as a subtype of case-insensitive string and not as a subtype of the existing string type. Or perhaps language tags need to be a primitive type (as John Cowan has argued) instead of a subtype of string. The former opens a large can of worms, the same one that led XML 1.0 to be case-sensitive in the first place. (If you haven’t tried to work out a really good locale-insensitive internationalized rule for case folding, try it sometime when your life is too placid and simple and all your problems are too tractable; if the difference between metropolitan French and Quebecois doesn’t make you give up, remember to explain how you’re going to handle Turkish and English in the same rule.) The latter doesn’t open that can of worms (for the restricted character set allowed in language tags, case folding is well behaved and well defined), but it does open others. I’ve talked about language codes as a subtype of string before, so I won’t repeat it here.
In some cases the case-sensitivity of xsd:language is not a serious problem: we can write our schema to enforce the usual case conventions, by which language tags should be lower-case and country codes should be uppercase, and we can specify that a particular element should have a language tag of “en”, “fr”, “ja”, or “de” by using an enumeration in the usual way:
<xsd:simpleType name="langtag">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code> type lists the language codes that Amalgamated Interkluge accepts: English (en), French (fr), Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="de"/>
  </xsd:restriction>
</xsd:simpleType>
This datatype will accept any of the four language codes indicated, but only if they are written in lower case.
But what if we want a more liberal schema, which allows case-insensitive language tags? We want to accept not just “en” but also “EN”, “En”, and (because we are determined to do the thing properly) even “eN”.
We could add those to the enumeration: for each language, specify all four possible forms. No one seems to like this idea: it makes the declaration four times as big but much less clear. When I suggested it to the i18n WG, they just groaned and François Yergeau looked at me as if I had emitted an indelicate noise he didn’t want to call attention to.
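To make the cost of that approach concrete, here is a small Python sketch (outside XSD entirely; the helper name `case_variants` is invented for this illustration) that lists every literal the enlarged enumeration would have to contain:

```python
from itertools import product

def case_variants(tag):
    """Return every upper/lower-case spelling of an ASCII tag (hypothetical helper)."""
    # For each letter offer its lower- and upper-case form; keep other characters as-is
    choices = [(ch.lower(), ch.upper()) if ch.isalpha() else (ch,) for ch in tag]
    return ["".join(combo) for combo in product(*choices)]

codes = ["en", "fr", "ja", "de"]
# One enumeration value per spelling of each code
enumeration = [variant for code in codes for variant in case_variants(code)]
print(len(enumeration), enumeration)
```

Four two-letter codes already require sixteen enumeration values, and a tag with n letters has 2^n spellings, so the approach gets worse quickly once subtags are involved.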
We were all happier when a different idea occurred to us. First note that the datatype definition given above can easily be reformulated using a pattern facet instead of an enumeration:
<xsd:simpleType name="langtag2">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag2</code> type lists the language codes that Amalgamated Interkluge accepts: English (en), French (fr), Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="en|fr|ja|de"/>
  </xsd:restriction>
</xsd:simpleType>
This definition can be adjusted to make it case insensitive in a relatively straightforward way:
<xsd:simpleType name="langtag3">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag3</code> type lists the language codes that Amalgamated Interkluge accepts: English (en), French (fr), Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="[eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE]"/>
  </xsd:restriction>
</xsd:simpleType>
Voilà, case-insensitive language tags. The pattern is not quite four times larger than the old pattern, but the declaration is still smaller than the first one using enumerations.
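For longer lists of codes, the bracketed pattern need not be typed by hand. The following Python sketch generates it from lower-case codes and checks a few literals; the function name `case_insensitive_pattern` is an assumption of this example, and Python's `re.fullmatch` is used only as a stand-in for the whole-literal matching that XSD patterns perform, not as a real schema validator.

```python
import re

def case_insensitive_pattern(codes):
    """Build an alternation such as [eE][nN]|[fF][rR] from ASCII codes (hypothetical helper)."""
    def bracket(code):
        # Turn each letter into a two-character class: e -> [eE]
        return "".join(f"[{ch.lower()}{ch.upper()}]" for ch in code)
    return "|".join(bracket(code) for code in codes)

pattern = case_insensitive_pattern(["en", "fr", "ja", "de"])
print(pattern)

# XSD patterns must match the entire literal, so fullmatch is the closest analogue
for literal in ["en", "EN", "En", "eN", "es", "eng"]:
    print(literal, bool(re.fullmatch(pattern, literal)))
```

The generated string is the same one used in the pattern facet above, so it can be pasted into the schema as is.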
A side benefit of using the pattern instead of the enumeration is that it’s easier to allow for subtags (so we can accept “en-US” and “en-UK”, etc., as well as just “en”) by expanding on the pattern:
<xsd:simpleType name="langtag4">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag4</code> type lists the language codes that Amalgamated Interkluge accepts: English (en), French (fr), Japanese (ja), or German (de), each optionally followed by subtags.</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*"/>
  </xsd:restriction>
</xsd:simpleType>
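A quick way to see which literals the langtag4 pattern lets through is to try it outside the schema. This Python sketch copies the pattern verbatim and, as an approximation of XSD's whole-literal matching, applies it with `fullmatch`; it checks only the pattern facet, not the rest of what a schema processor does, and the helper `accepts` is invented for the example.

```python
import re

# The pattern from the langtag4 declaration, copied verbatim
LANGTAG4 = re.compile(r"([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*")

def accepts(literal):
    """True if the whole literal matches the langtag4 pattern (hypothetical helper)."""
    return LANGTAG4.fullmatch(literal) is not None

samples = ["en", "en-US", "FR-ca", "de-Latn-DE", "es-ES", "en-", "en-toolongsubtag"]
for literal in samples:
    print(literal, accepts(literal))
```

Note that the pattern constrains only the primary language subtag; anything shaped like a subtag is accepted after it, whether or not it is a registered one.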
In a perfect system, there would be some way to signal that the four upper-, lower-, and mixed-case forms of “en” all mean the same thing and map to the same value. This technique does not provide that. But then, I don’t know any good way to provide it. (I do know ways to provide it, just not any ones I think are good.) In an imperfect world, if I want case-insensitive language tags, I suppose I should be happy that I can find a way to define them without much inconvenience. And that, this technique provides.
Perhaps a user-supplied way to map values to canonical forms might work?
I can imagine a property alongside facets—hmm, maybe two of them, it’s already too complicated—that would be a regular expression substitution to be performed before validation; the second would be applied afterwards and is redundant, so let’s not mention it.
You could then say, s/([[:lower:]]+)/\U\1/g to map En, eN, en, EN all into EN.
Maybe this is a chainsaw trying to whittle a cotton reel. It doesn’t work for all values in all locales, but for language codes it doesn’t need to.
The alternative of having to specify a language and region when doing case mapping, and requiring every implementation to have a complete and up-to-date table of case mapping, seems crazy.
Of course, one could also do the substitution with XSLT before validating with schema. An up side is you can do it fairly simply today. A down side is that the error messages you get from the schema validation will probably have different line numbers and different text from the user’s input, making debugging feel like landing a spaceship by instructing a robot via a teletype, L for go Left, R for Right; XSD Moonlander anyone?
So if it’s to be done then I think I’m in favour of “user empowerment” here 🙂
Liam, yes, a language for mapping literals to values (or to canonical lexical forms of values, which amounts to the same thing) is one of the ways I can see to capture the idea that en and En and EN and eN all map to the same value. But I think the nature of things is that any such language will start simple, to handle just simple cases, and people will then ask why can’t it handle these slightly harder cases, and it will grow, until eventually it is Turing-complete and the ability to reason about the types being defined, or indeed in principle to decide whether a literal maps to a value, becomes an endangered property.
A user with a Turing complete language and a chain saw may be maximally empowered. But personally, I’d prefer to keep both the chain saw and the Turing-completeness out of XSD.
I notice a few things in your blog that require some attention :-). The least of these is that “UK” isn’t a valid subtag (your example should be en-GB).
The larger problem is that BCP 47 defines a number of subtags, not just language codes and region codes. Script subtags use titlecase. Variant subtags are all lowercase. There is, in RFC 4646, significant text about case folding tags (http://www.inter-locale.com/ID/draft-ietf-ltru-registry-14.html#canonical). The pending update to 4646 (to add ISO 639-3 codes to the mix) has even more on the topic (http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-16.html#casing)
In any case, mapping everything to lowercase using default case folding would probably work okay (much better than very elaborate patterns, when we start to consider baroque tags such as “sl-Latn-IT-rozaj”).
In any event, we’d be much better off, IMHO, if xs:language weren’t an xs:string but actually *meant* BCP 47 “tags for the identification of languages”. Too many XML technologies are bumping up against the need to match, process, select, etc. content based on the language. Unlike other XML type values, language tags depend on the internal structure of the language tags themselves. Simple string equality isn’t enough in these contexts, to everyone’s chagrin.
~Addison
Ur… and I should have said in the previous comment that script and variant subtags are *recommended* to use various casing conventions. The tags may use whatever case they so desire 🙂
Thank you very much for the clarifications! 🙂