Allowing case-insensitive language tags in XSD

[30 July 2008]

I found a note to myself today, while going through some old papers, reminding me to write up an idea the i18n Working Group and I had when we were discussing the problem of case (in)sensitivity and language tags, some time ago. Here it is, for the record.

The discussion of the language datatype in XSD 1.1 includes a note reading:

Note: [BCP 47] specifies that language codes “are to be treated as case insensitive; there exist conventions for capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.” Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals ‘MN’ and ‘mn’ (for Mongolian) therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.

The same is true of XSD 1.0, even if it doesn’t point out the problem as clearly.

As can be imagined, there have been requests that XSD 1.1 define a new variant of xsd:string which is case-insensitive, to allow language tags to be specified properly as a subtype of case-insensitive string and not as a subtype of the existing string type. Or perhaps language tags need to be a primitive type (as John Cowan has argued) instead of a subtype of string. The former opens a large can of worms, the same one that led XML 1.0 to be case-sensitive in the first place. (If you haven’t tried to work out a really good locale-insensitive internationalized rule for case folding, try it sometime when your life is too placid and simple and all your problems are too tractable; if the difference between metropolitan French and Quebecois doesn’t make you give up, remember to explain how you’re going to handle Turkish and English in the same rule.) The latter doesn’t open that can of worms (for the restricted character set allowed in language tags, case folding is well behaved and well defined), but it does open others. I’ve talked about language codes as a subtype of string before, so I won’t repeat it here.

In some cases the case-sensitivity of xsd:language is not a serious problem: we can write our schema to enforce the usual case conventions, by which language codes should be lower-case and country codes upper-case, and we can specify that a particular element should have a language tag of “en”, “fr”, “ja”, or “de” by using an enumeration in the usual way:

<xsd:simpleType name="langtag">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="de"/>
  </xsd:restriction>
</xsd:simpleType>

This datatype will accept any of the four language codes indicated, but only if they are written in lower case.

But what if we want a more liberal schema, which allows case-insensitive language tags? We want to accept not just “en” but also “EN”, “En”, and (because we are determined to do the thing properly) even “eN”.

We could add those to the enumeration: for each language, specify all four possible forms. No one seems to like this idea: it makes the declaration four times as big but much less clear. When I suggested it to the i18n WG, they just groaned and François Yergeau looked at me as if I had emitted an indelicate noise he didn’t want to call attention to.
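
For comparison, the brute-force version would look roughly like the sketch below. The type name langtag-allcases is made up for illustration, and the xsd prefix is assumed to be bound to the XML Schema namespace as in the other fragments. The sixteen values are simply the four case variants of each of the four codes.

```xml
<xsd:simpleType name="langtag-allcases">
  <xsd:restriction base="xsd:language">
    <!-- English -->
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="EN"/>
    <xsd:enumeration value="En"/>
    <xsd:enumeration value="eN"/>
    <!-- French -->
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="FR"/>
    <xsd:enumeration value="Fr"/>
    <xsd:enumeration value="fR"/>
    <!-- Japanese -->
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="JA"/>
    <xsd:enumeration value="Ja"/>
    <xsd:enumeration value="jA"/>
    <!-- German -->
    <xsd:enumeration value="de"/>
    <xsd:enumeration value="DE"/>
    <xsd:enumeration value="De"/>
    <xsd:enumeration value="dE"/>
  </xsd:restriction>
</xsd:simpleType>
```

The list doubles with every additional letter, so a tag with a two-letter region subtag (such as “en-US”) would already need sixteen entries per language.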

We were all happier when a different idea occurred to us. First note that the datatype definition given above can easily be reformulated using a pattern facet instead of an enumeration:

<xsd:simpleType name="langtag2">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag2</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="en|fr|ja|de"/>
  </xsd:restriction>
</xsd:simpleType>

This definition can be adjusted to make it case-insensitive in a relatively straightforward way:

<xsd:simpleType name="langtag3">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag3</code>
      type lists the language codes that Amalgamated Interkluge
      accepts, in any mixture of upper and lower case:  English (en),
      French (fr), Japanese (ja), or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="[eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE]"/>
  </xsd:restriction>
</xsd:simpleType>

Voilà, case-insensitive language tags. The pattern is not quite four times larger than the old pattern, but the declaration is still smaller than the first one using enumerations.

A side benefit of using the pattern instead of the enumeration is that it’s easier to allow for subtags (so we can accept “en-US” and “en-GB”, etc., as well as just “en”) by expanding on the pattern:

<xsd:simpleType name="langtag4">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag4</code>
      type lists the language codes that Amalgamated Interkluge
      accepts, in any mixture of upper and lower case and with
      optional subtags:  English (en), French (fr), Japanese (ja),
      or German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern
      value="([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*"/>
  </xsd:restriction>
</xsd:simpleType>
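
To make the effect of the pattern concrete, here is a sketch of how the type might be used and which values it would let through. The message element and its lang attribute are hypothetical, and the unprefixed reference to langtag4 assumes a schema with no target namespace. The first fragment belongs in the schema; the rest are instance fragments.

```xml
<!-- Schema fragment: a hypothetical element whose lang attribute uses langtag4. -->
<xsd:element name="message">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="lang" type="langtag4" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>

<!-- Instance fragments that should be accepted: -->
<message lang="en">Hello</message>
<message lang="EN-us">Howdy</message>
<message lang="Fr-CA">Bonjour</message>
<message lang="de-CH-1996">Grüezi</message>

<!-- Instance fragments that should be rejected: -->
<message lang="es">Hola</message>      <!-- not one of the four languages -->
<message lang="eng">Hello</message>    <!-- a pattern must match the whole value -->
<message lang="en_US">Hello</message>  <!-- underscore is not a subtag separator -->
```

Note that the subtag part of the pattern checks only the shape of each subtag (one to eight letters or digits), not whether it is a registered region, script, or variant.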

In a perfect system, there would be some way to signal that the four upper-, lower-, and mixed-case forms of “en” all mean the same thing and map to the same value. This technique does not provide that. But then, I don’t know any good way to provide it. (I do know ways to provide it, just none that I think are good.) In an imperfect world, if I want case-insensitive language tags, I suppose I should be happy that I can find a way to define them without much inconvenience. And that, this technique provides.
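
The XSD note quoted earlier leaves it to applications to “make appropriate adjustments”. One such adjustment, outside the schema, is to normalize the tags after validation so that later processing sees a single spelling. The following is only a sketch under stated assumptions: it presumes an XSLT 2.0 processor, a hypothetical unqualified lang attribute, and an arbitrary choice of all-lower-case as the normal form. It does not change the value space of the schema type, so it does not provide the missing mapping at the datatype level.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrite the (hypothetical) lang attribute in lower case, so that
       "EN", "En", "eN", and "en" all come out as "en". -->
  <xsl:template match="@lang">
    <xsl:attribute name="lang" select="lower-case(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

Because language tags are restricted to ASCII letters, digits, and hyphens, lower-casing them avoids the locale-dependent case-folding problems mentioned above.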

Namespace documents (kudos to XHTML)

[28 July 2008]

Lately I’ve had occasion to spend some time dereferencing namespaces and looking at what you get when you do so. If, for example, you have encountered some qualified name and want to know what it might mean, the “follow-your-nose” principle says you should be able to find out by dereferencing the namespace name. (The follow-your-nose principle was introduced to me under that name by Dan Connolly, but I think he’d prefer to think of it as a general principle of Web architecture rather than as an invention of his own. And indeed the Architecture of the World Wide Web, as documented by the W3C’s Technical Architecture Group, explicitly recommends that namespace documents be provided for all namespaces.)

The upshot of my recent examinations is that for some namespaces, even otherwise exemplary applications and demos fail to provide namespace documents. For others, the only namespace document is a machine-readable document (e.g. an OWL ontology) without any human-comprehensible documentation of what the terms in the namespace are intended to mean; for still others, there is useful human-readable description (sometimes only in a comment, but it’s there) if you can find it. And for a few, there is something approaching a document intended to be accessible to a human reader.

So far, however, the best namespace document I’ve seen recently is the one produced by the XHTML Working Group for the namespace http://www.w3.org/1999/xhtml/vocab, which is human-readable and reasonably clear. Not perfect (no document date? no description of whether the vocabulary is subject to change?) but far, far better than average.

Kudos to the XHTML Working Group!

RDF and Wittgenstein

[27 July 2008]

The more I think about RDF’s goal of providing a monotonic semantics for RDF graphs (under pressure in part from my colleague Thomas Roessler, to whom thanks), the more the RDF triple seems to be an attempt to operationalize Wittgenstein’s notion of atomic fact, with all the advantages and disadvantages that that entails. Is this an insight, or just a blazing truism? Or false?

Interesting that this possibility seems to run counter to Steve Pepper’s remark “RDF/OWL is to Aristotle as Topic Maps is to Wittgenstein.” Perhaps SP has the Wittgenstein of the Untersuchungen in mind?

Why …

Why do so many proponents of new technologies spend so much time misrepresenting existing technologies and spreading misinformation about them?

Do they misrepresent the existing technologies in order to make the new technology they are selling look better?

Or did they get involved in the new technology only because they failed to understand the existing technology and how to use it well?

Daniel Boone meets the consistent Web

[22 July 2008]

My colleague Thomas Roessler writes:

[The monotonic semantics of RDF] guarantee that you won’t run into a world of inconsistency when you discover additional information, and they also guarantee that you can learn things about the world piece by piece.

My evil twin Enrique responds: So let us start with the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is identical to the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which I assume I can express using some predicate like the OWL sameAs.

And now let us discover additional information in another triple store, which contains the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is distinct from the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which it expresses using some predicate like the OWL differentFrom.

I’m having trouble understanding (concludes Enrique) how we can do this without either running into a world of inconsistency (a small world, perhaps, bounded in a nutshell, but still a world big enough for joe and Josephus to be both the same and different), or else running into a world in which we find that “inconsistency” has been defined to have a highly technical meaning under which the two triples just described are not actually inconsistent in the technical sense (why do I expect someone to start lecturing me about Herbrand models any moment now?), even though any application relying on the usual notions of identity and difference may find itself at a loss as to what to make of seeing them both in the same graph.

I reminded Enrique of the American pioneer Daniel Boone, who proudly claimed that he had never been lost in his life. Never? Never. [Pause.] “But I was a mite bewildered once for three days.” [Rimshot.]