Primitives and non-primitives in XSDL

John Cowan asks, in a comment on another post here, what possible rationale could have governed the decisions in XSDL 1.0 about which types to define as primitives and which to derive from other types.

I started to reply in a follow-up comment, but my reply grew too long for that genre, so I’m promoting it to a separate post.

The questions John asks are good ones. Unfortunately, I don’t have good answers. In all the puzzling cases he notes, my explanation of why XSDL is as it is begins with the words “for historical reasons …”.

Why are floats, doubles, and decimals all primitive rather than being derived from a primitive notion of number?

[Warning: I am not a mathematician, and have not recently reviewed the IEEE specs or any of the numerous explanations of them on the Web. What follows may contain howlers; the mathematically literate should not read it while drinking coffee or other liquids near their keyboards.]

The various numbers are now primitives because when we tried to derive them all from the same root we proved unable to identify any plausible set of facets with which to perform the derivation. Of all the real numbers in the universe, why is it just this particular subset that forms the value space of float? Is it because of some simple and obvious numeric property they all share? Something like the property of having no fractional part, which distinguishes the integers as a subset of xs:decimal?

Well, no. Certainly, one can define a facet with values that select exactly the subsets representable in IEEE float and double. And one can define it with reference to mathematical properties of those numbers. But a simple and obvious mathematical property, something that feels like a natural choice for a restriction facet? Hardly.

What exactly is the property selected by a facet to select doubles and floats out of the universe of real numbers? Hmm. It’s the property of being representable by a mathematical expression in a rather particular (and not wholly unarbitrary) form, which as it happens makes perfect sense if (but pretty much only if) you think about the properties of a fixed-width floating-point binary representation of real numbers, in particular a representation which looks a lot like … the one specified by IEEE.

One could define a facet, or a set of facets, that would allow slightly less ad-hoc selections of reals. Instead of assuming base two, and the particular widths for exponent and mantissa required by IEEE, one could have facets to specify the base, and the width of the exponent part, and the width of the mantissa. If the work were done right, the result would look a lot like a specification for radix-independent floating-point numbers, much like the one IEEE was developing at the time. One could use such facets to define types similar to float and double, but different. One might perhaps define a type corresponding not to IEEE floating point numbers, but to other historically important floating-point formats. That would be fun.
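To make the "rather particular form" concrete, here is a minimal Python sketch of such a membership test. It assumes the characterization of the float value space as numbers of the form m × 2^e, with |m| < 2^24 and e between -149 and 104 inclusive; the function name and its parameters (mantissa_bits, e_min, e_max) are invented for illustration and are not facets defined by XSDL. The base is fixed at two, and the special values (infinities, NaN) are ignored.

```python
from fractions import Fraction

def in_binary_float_space(x, mantissa_bits=24, e_min=-149, e_max=104):
    """True if x can be written exactly as m * 2**e with
    |m| < 2**mantissa_bits and e_min <= e <= e_max (hypothetical 'facets')."""
    x = Fraction(x)
    if x == 0:
        return True
    n, d = abs(x.numerator), x.denominator
    # In lowest terms the denominator must be a power of two.
    if d & (d - 1):
        return False
    e = -(d.bit_length() - 1)        # x == n * 2**e so far
    while n % 2 == 0:                # make n odd, pushing factors of 2 into e
        n //= 2
        e += 1
    if e < e_min:                    # too small: e can only be lowered further
        return False
    shift = max(0, e - e_max)        # lower e to e_max by enlarging m
    return (n << shift) < 2 ** mantissa_bits

print(in_binary_float_space(Fraction(1, 2)))    # True
print(in_binary_float_space(Fraction(1, 10)))   # False
```

Under the same assumptions, passing mantissa_bits=53, e_min=-1075, e_max=970 would select the doubles instead, which is about as close as this sketch gets to the parameterized facets imagined above.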

Some cynics in the Working Group suggested that implementors might feel some reluctance, based on their perception of user needs, to expend the effort necessary to build the machinery for supporting floating-point numbers in (say) base 7, with 49 heptary digits for the mantissa and 14 for the exponent, or for supporting variants of floating-point decimal that the relevant IEEE working group had perhaps considered but decided against. So if we did specify a non-arbitrary set of facets capable of deriving float and double from the set of reals, there would be a certain amount of pressure to ensure that users never actually used any of those facets.

It seemed simpler, less error-prone, and more honest just to define float and double as primitives.

The various date and time types, on the other hand (about which John did not complain, although plenty of others have), are primitives only because the W3C Internationalization Working Group objected strongly to their being derived from a single primitive. (Don’t ask me what this has to do with internationalization.) Eventually the discussion became surreal enough that it seemed best to capitulate on this particular point, so we went from one primitive type with a set of fairly well-defined and clean facets to develop the other types, to eight primitives formally unrelated to each other.

Why are URIs primitive (because they represent web resources?) while language tags are not (they represent language varieties, after all)?

Whether anyURI should be derived from string or not engendered a good deal of often heated discussion in the XML Schema Working Group and elsewhere. Some argued for what I can only describe as a kind of metaphysical difference: URIs, in their view, are not essentially strings. (I do not believe any of them would accept this paraphrase of their argument, but I can’t do better because I never really understood more of it than that.) Others pointed out that the URI spec (or: one of the ever-growing number of specs casually referred to by outsiders as “the URI spec”) actually defines URIs as sequences of characters, i.e. as strings, but this argument made no headway. As far as I could understand, the answer to this observation was, essentially, that “the URI-nature which can be reduced to mere string-nature is not the True URI-nature”. If you want to discuss this further, I suggest you take it up with the TAG.

A rather different argument eventually persuaded me that it would be just as well to make xs:anyURI primitive. It seemed like a good idea to define the value space as the set of strings which (after XLink-style escaping) belong to the language defined by the grammar in the URI spec (there’s that phrase again). That wouldn’t capture any scheme-specific rules (like those governing the internal structure of HTTP URIs), but the only feasible alternative seemed to be to say, in effect, that any string is legal as an anyURI value.

I spent some time thinking that the grammar of the RFC was so loose that we might as well have said that any string was legal; one can hear normally well-informed people say that the only constraint enforced by the scheme-independent grammar of RFC 2396 (or nowadays by RFC 3987) is that you can’t have two hash marks. (Others equally well-informed deny that it enforces any such constraint. Me, I have given up trying to understand what the RFCs do and don’t say: they don’t seem to be interested in providing a crisp answer to the question, for some arbitrary string, “is this a legal URI or not?”.) But it turns out I was wrong: I once spent an instructive hour generating 10,000 or so random ASCII strings and parsing them against the grammar of RFC 2396, and the large majority were not legal, if only because they used illegal characters. So I now believe it is useful to enforce that grammar, even if one doesn’t enforce all the relevant scheme-specific rules. (XSDL can’t require that, because we don’t want to get in the way of new schemes.)
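An experiment of that general kind can be approximated in a few lines of Python. The sketch below is not the original experiment and does not parse against the grammar of RFC 2396: it applies only a necessary condition, namely that every character belongs to the repertoire RFC 3986 allows in a URI and that every % begins a two-digit hexadecimal escape. The function names, string length, and seed are arbitrary choices made for illustration.

```python
import random
import re
import string

# Characters RFC 3986 permits somewhere in a URI: unreserved, gen-delims,
# sub-delims, plus '%' for escapes. Position rules are NOT checked here.
ALLOWED = set(string.ascii_letters + string.digits + "-._~"
              + ":/?#[]@" + "!$&'()*+,;=" + "%")
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def passes_character_check(s):
    """Necessary (not sufficient) condition for s to be a legal URI."""
    return all(c in ALLOWED for c in s) and not BAD_PERCENT.search(s)

def fraction_rejected(n=10000, length=20, seed=0):
    """Share of n random printable-ASCII strings failing the check."""
    rng = random.Random(seed)
    alphabet = [chr(c) for c in range(0x20, 0x7F)]
    rejected = 0
    for _ in range(n):
        s = "".join(rng.choice(alphabet) for _ in range(length))
        if not passes_character_check(s):
            rejected += 1
    return rejected / n

print(fraction_rejected())
```

Since this test is weaker than the full grammar, any string it rejects would also be rejected by a real parser, so the share it reports is only a lower bound on the share of illegal strings.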

So: we want to specify that anyURI values must at least be in the language defined by the generic URI grammar in the relevant RFC.

But that language can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

The set of legal language tags, by contrast, is obviously a regular language and can be described conveniently by a regular expression.
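For comparison, the pattern that XSDL 1.0 uses to constrain xs:language fits on one line. The Python sketch below wraps that pattern in a helper; the helper's name is invented here, and the pattern checks only the shape of a tag, not whether its subtags are registered anywhere.

```python
import re

# Pattern facet of xs:language as given in XSDL 1.0.
LANGUAGE_TAG = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

def looks_like_language_tag(s):
    """True if the whole string matches the xs:language pattern."""
    return LANGUAGE_TAG.fullmatch(s) is not None

for tag in ["en", "de-CH", "x-klingon", "en_US", "toolongsubtag"]:
    print(tag, looks_like_language_tag(tag))
```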

For the same reason, if a datatype for legal XPath were ever introduced, it would need to be primitive, because the lexical space cannot be described by a regular expression.

Why does that conclusion follow?

Because however erratic the design of the XSDL datatype system has been, there is at least one basic principle to which the design has adhered fairly faithfully: There is no magic in derived types. All magic (by which I mean: everything relevant to validity that has to be explained in natural-language prose because it cannot be expressed by the formalism acting alone) is in the primitives. The lexical and value spaces of any type derived by restriction are precisely those of the base type, as restricted by the facets used in the derivation, and without other restrictions.
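The principle can be modeled in a few lines of Python, offered here only as an illustrative toy and not as a description of how any schema processor works. A type is represented by a membership predicate, and restriction simply conjoins facet predicates with the base; the helper names (restrict, fraction_digits, min_inclusive) are hypothetical stand-ins loosely modeled on the corresponding XSDL facets.

```python
from decimal import Decimal

def restrict(base, *facets):
    """Derived type = base type narrowed by facets, and nothing else."""
    def member(v):
        return base(v) and all(f(v) for f in facets)
    return member

def is_decimal(v):
    # Stand-in for the primitive: this is where the "magic" lives.
    return isinstance(v, Decimal) and v.is_finite()

def fraction_digits(n):
    # v has at most n digits after the decimal point.
    return lambda v: (v * 10 ** n) % 1 == 0

def min_inclusive(bound):
    return lambda v: v >= bound

integer = restrict(is_decimal, fraction_digits(0))
non_negative_integer = restrict(integer, min_inclusive(0))

print(integer(Decimal("42")), integer(Decimal("4.2")))
print(non_negative_integer(Decimal("-1")))
```

Everything a derived type accepts or rejects can be read off its list of facets; there is no place in this model to hide an extra rule.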

It is for this reason that it is not possible in XSDL to define (for example) a datatype whose value space is the set of prime numbers. In order to define a type of that kind, one must either define it as a new primitive type (which would not only seem eccentric, since the primes really are a subset of the integers, but would also be a case of lying to the processor), or else define a new kind of facet.

Henry Thompson has (in private conversation) suggested the facet maximum-number-of-divisors, set to the value 2. (A straight number-of-divisors facet set to the value 2 would be simpler, he said, but it would exclude the value 1, which many people wish to include among the primes. I note however that Oystein Ore’s Number theory and its history, which happens still to be on my desk, defines primes as integers p > 1 whose only divisors are the trivial ones ± 1 and ± p. So when the day comes, I vote for the number-of-divisors facet, set to the value 4, with a note in the annotation reading “Remember that -1 is a divisor!”)
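For the curious, the proposed facet is easy to mock up in Python. No such facet exists in XSDL; the function names below are invented, and the sketch assumes the facet would be paired with a lower bound (in the manner of minExclusive) to capture the p > 1 in Ore’s definition, since -p has exactly as many divisors as p.

```python
def divisors(n):
    """All integer divisors of a nonzero n, negative ones included."""
    n = abs(n)
    positive = [d for d in range(1, n + 1) if n % d == 0]
    return [-d for d in positive] + positive

def is_prime_by_facets(n, number_of_divisors=4, min_exclusive=1):
    """Hypothetical facets: n > min_exclusive and exactly
    number_of_divisors divisors, counting -1 and -n."""
    return n > min_exclusive and len(divisors(n)) == number_of_divisors

print([n for n in range(-10, 20) if is_prime_by_facets(n)])
```

Counting divisors by trial division is hopeless for large values, of course; the point is only that the value space is picked out by facet values and nothing else.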

We can live without a value space of Gaussian integers, most of the time; it’s utterly obnoxious to have no way of mapping RFC 822 dates to the same underlying value space as ISO 8601 dates.

Here, too, I sympathize with your point. We decided, at one point, to define mirror images of all of the current built-ins as abstract types, with the existing concrete built-ins derived from them by specifying a particular lexical mapping. (For this, I think we would have used keywords, to avoid having to design a lot of machinery for defining lexical mapping without magic.) The idea was that if someone wanted to define an rfc2822:date type for email dates, they wouldn’t be able to define the lexical mapping in an interoperable way, but they could at least say clearly that the value space consisted of dates.

The editors of the spec looked at this decision, struggled with it for a while, and eventually told the Working Group “If you want this to happen, you’re going to have to explain to us how to build a type hierarchy in which xs:integer is derived from xs:abstractInteger, not from xs:decimal, but is nevertheless substitutable for xs:decimal.” The editors had been unsympathetic to the idea, so some Working Group members may have suspected them of not trying hard enough to work out the kinks. But no one ever stepped forward with a solution to the problem.

And it’s only at this moment that I see the solution: don’t define an abstract analogue to every built-in, define them only for the primitives. Oh, well. About eight years too late, that one.

One can argue that if the explanation for why this type is primitive and that one is derived begins “Well, for historical reasons …”, then it’s a sign that the design is insufficiently clean. And that argument may be right. But I am unmoved by it. I like mathematics and admire the idea of mathematical purity, but I don’t confuse spec-making with mathematics.

In all of this business of deciding what’s primitive and what’s not, some members of the Working Group may also have been influenced by the belief that in any system like this the choice of primitives, while not wholly unmotivated by general principles, is seldom entirely free from a certain arbitrariness. Certainly I was, and am. One can do better or worse, and it’s worth a certain amount of effort to do better. But, you know, even the axioms of Russell and Whitehead (or any modern logical system) look a bit accidental. These seven axioms? And not some other set? Surely you’re joking; why these? Well, because they work. But other sets of axioms also work. If Russell and Whitehead’s choice of primitives feels arbitrary, then a Working Group which has failed to find a wholly non-arbitrary principle on which to divide primitives from other types can at least feel that it has failed in good company.

6 thoughts on “Primitives and non-primitives in XSDL”

John Cowan on 19 January 2008 at 07:48 said:

Hmm.

I grant the arbitrariness of the hasTheIEEENature facet, but the result is that not only aren’t floats and doubles interoperable with decimals of various sorts, they aren’t even interoperable with each other — and defining float as a restriction of double in both maximum value and precision would be comfortably non-magic.

My real complaint with the gHorribleKludge types goes to the underlying ISO 8601 standard, which is that it’s simply silly to treat T1159Z as *every* 11:59 AM in the UTC time zone; it should be *some* unspecified 11:59, not a repeating event. When I see 11:59 in a log record, I don’t suppose that the same event recurred daily, but rather that the day information is implicit somewhere else (perhaps there is a new log every day).

In any event, duration is clearly different from the rest.

The argument from regular languages proves too much, because if URIs are primitive because they can’t (perspicuously) be defined by a regular expression whereas language tags can, what is to say about decimals, which are defined by a singularly trivial regular expression? That’s when the spec goes all semantic-y on us and says huffily “Well, numbers are not strings, surely you see *that*.”

Well, numerals are digit strings and numbers are not strings, but the language-tags are strings but languages are not strings, by the same token. It’s this two-facedness that really distresses me, this sense that whether something is a primitive only *sometimes* depends on whether it is really a different value space from the other primitives or not.

We now have URIs (with lots of illegal characters), IRIs (with the same set of illegal ASCII characters but allowing all non-ASCII characters), and soon we will have LEIRIs (with almost no illegal characters), these last being the format of things like system identifiers and XLink hrefs.

I may be wrong, but I think anyURIs are also LEIRIs (that is, a < in a LEIRI gets escaped to %3C, although they are not allowed in URIs or IRIs). Anyhow, one thing is clear, “%%%” or other violations of the %-escaping convention are illegal everywhere.

Abstract supertypes of the primitive types would have been cool.

Lastly, I learned first-order logic by the method of natural deduction, which has what seems to me the only correct number of axioms, viz. zero; and when I first came across axiomatic logic, I immediately gagged on the horrid arbitrariness of the axioms. True, some might complain that ND has rather too many inference rules, but we can’t have everything, and at least the inference rules are tolerably obvious, whereas the axiom sets that have been (indecently) exposed to me are not.

cmsmcq on 19 January 2008 at 09:39 said:

W.r.t. language – if one thinks of the xs:language type as having a value space of languages, then yes, one will feel a painful asymmetry in the treatment of xs:language and xs:decimal. If one thinks of the value space as being one of language tags or language codes, which is the point of view encouraged (in this reader, at least) both by the ISO 639 and 10639 specs and by the series of RFCs which define (in the words of RFC 1766) “tags for the identification of languages”, then the asymmetry is not only not painful, but well motivated.

When one stores (representations of) dates or numbers in a computer, the operations one typically wants to perform on those representations are analogues of operations applicable to (real, abstract) dates and numbers: addition, comparison to upper or lower bounds, etc. But few people, perhaps, expect to write “en” or “de” and then to perform operations applicable to the languages English or German, using the machine. There are few libraries of functions that allow one to inquire of “en” or “de”, for example “Is the language SVO?” “… SOV?” or which allow one to inquire into the number of cases or into the presence or absence of other grammatical categories.

It’s convenient to ignore, much of the time, the distinction between “integer” and “binary twos-complement representation of integer”, and think of a variable in a program as holding, or a function as returning, an integer, rather than forcing oneself always to think “no, not an integer but a binary twos-complement representation of an integer”, or “… a packed binary-coded decimal representation of an integer”. That is one of the advantages of high-level languages.

But we don’t have representations of human languages, or even of artificial languages, that make it easy or convenient to ignore that thing/representation distinction. And so it seems natural to me, at least, to regard “en” and “de” not as representations of languages, but only as representations of codes / identifiers of languages. YMMV.

David Carlisle on 19 January 2008 at 16:12 said:

> It seemed simpler, less error-prone, and more honest just to define float and double as primitives.

Trouble is really it’s simpler for the Schema WG but harder for everyone else who has to work around these problems.
It’s especially noticeable, for example, in XPath/XQuery, which has to invent a lot of extra machinery of type promotion amongst numeric types (and between string/uri) that simply would not be necessary if these types fitted into a hierarchy.

> It’s convenient to ignore, much of the time, the distinction between “integer” and “binary twos-complement representation

True for integers (which are stored exactly) but not usually true for float, which are a consequence of the fact that the abstract real numbers can’t be stored exactly. So for integers the internal storage schema is irrelevant and invisible to higher level operations, for float/double etc this is not the case at all, the visible values returned by all operations can only be explained by reference to the particular storage layout.

Toby White on 21 January 2008 at 05:02 said:

“[…] what is to say about decimals, which are defined by a singularly trivial regular expression? That’s when the spec goes all semantic-y on us and says huffily “Well, numbers are not strings, surely you see *that*.”

I ran into this irritation in trying to use XML to store floating point numbers in the http://cmlcomp.org language.

Well, actually, I think the irritation was initially due to XPath 1.0 refusing to understand exponential notation, which then led me to wonder what on earth an XML standard of all things was doing trying to describe numerical models in this way.

If I had my druthers, all any XML standard would have to say about numbers is “you know what a number looks like (here is the obvious regex). If you are an implementation wanting to deal with numbers, convert it to whatever internal representation you prefer to use, but do be sensible.”

Indeed, that’s precisely what I ended up doing for CMLComp which specifically doesn’t use any of the xsd datatypes, but works on the lexical space defined by:

http://uszla.me.uk/gitweb/CMLComp.git/master:XMLFP.rnc

A longer explanation of the background and design choice is at:
http://cmlcomp.org/t/wiki/FloatingPointXml (I am quite willing to be told I’m talking nonsense though)

Pingback: Messages in a bottle » Blog Archive » Allowing case-insensitive language tags in XSD
Pingback: Messages in a Bottle » Blog Archive » Simple proof that the URI grammar of RFC 3986 defines a regular language