John Cowan asks, in a comment on another post here, what possible rationale could have governed the decisions in XSDL 1.0 about which types to define as primitives and which to derive from other types.
I started to reply in a follow-up comment, but my reply grew too long for that genre, so I’m promoting it to a separate post.
The questions John asks are good ones. Unfortunately, I don’t have good answers. In all the puzzling cases he notes, my explanation of why XSDL is as it is begins with the words “for historical reasons …”.
Why are floats, doubles, and decimals all primitive rather than being derived from a primitive notion of number?
[Warning: I am not a mathematician, and have not recently reviewed the IEEE specs or any of the numerous explanations of them on the Web. What follows may contain howlers; the mathematically literate should not read it while drinking coffee or other liquids near their keyboards.]
The various numbers are now primitives because when we tried to derive them all from the same root, we proved unable to identify any plausible set of facets with which to perform the derivation. Of all the real numbers in the universe, why is it just this particular subset that forms the value space of float? Is it because of some simple and obvious numeric property they all share? Something like the property of having no fractional part, which distinguishes the integers as a subset of the reals?
Well, no. Certainly, one can define a facet with values that select exactly the subsets representable in IEEE float and double. And one can define it with reference to mathematical properties of those numbers. But a simple and obvious mathematical property, something that feels like a natural choice for a restriction facet? Hardly.
What exactly is the property by which such a facet would select doubles and floats out of the universe of real numbers? Hmm. It’s the property of being representable by a mathematical expression of a rather particular (and not wholly unarbitrary) form, which, as it happens, makes perfect sense if (but pretty much only if) you think about the properties of a fixed-width floating-point binary representation of real numbers, in particular a representation which looks a lot like … the one specified by IEEE.
One could define a facet, or a set of facets, that would allow slightly less ad-hoc selections of reals. Instead of assuming base two, and the particular widths for exponent and mantissa required by IEEE, one could have facets to specify the base, and the width of the exponent part, and the width of the mantissa. If the work were done right, the result would look a lot like a specification for radix-independent floating-point numbers, much like the one IEEE was developing at the time. One could use such facets to define types similar to float and double, but different. One might perhaps define a type corresponding not to IEEE floating point numbers, but to other historically important floating-point formats. That would be fun.
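To make the thought experiment concrete, here is a minimal sketch in Python of what such radix-and-width facets would amount to. The three parameters play the roles of the imagined base, mantissa-width, and exponent-range facets; the function and its name are my illustration, not anything XSDL defines.

```python
from fractions import Fraction

def representable(base, mantissa_digits, exp_min, exp_max):
    """Enumerate the non-negative values of a toy floating-point type
    picked out by three hypothetical facets: the radix (base), the
    mantissa width, and the exponent range.  Each value has the form
    m * base**e with 0 <= m < base**mantissa_digits."""
    values = set()
    for m in range(base ** mantissa_digits):
        for e in range(exp_min, exp_max + 1):
            values.add(Fraction(m) * Fraction(base) ** e)
    return values

# A binary format with a 3-digit mantissa and exponents from -1 to 1:
# 7/2 is representable (0b111 * 2**-1), but 1/4 is not, because it
# would need an exponent below the minimum.
vals = representable(2, 3, -1, 1)
```

Varying the first argument gives exactly the sort of base-7 monstrosity the cynics worried about, with no extra machinery.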
Some cynics in the Working Group suggested that implementors might feel some reluctance, based on their perception of user needs, to expend the effort necessary to build the machinery for supporting floating-point numbers in (say) base 7, with 49 heptary digits for the mantissa and 14 for the exponent, or for supporting variants of floating-point decimal that the relevant IEEE working group had perhaps considered but decided against. So if we did specify a non-arbitrary set of facets capable of deriving float and double from the set of reals, there would be a certain amount of pressure to ensure that users never actually used any of those facets.
It seemed simpler, less error-prone, and more honest just to define float and double as primitives.
The various date and time types, on the other hand (about which John did not complain, although plenty of others have), are primitives only because the W3C Internationalization Working Group objected strongly to their being derived from a single primitive. (Don’t ask me what this has to do with internationalization.) Eventually the discussion became surreal enough that it seemed best to capitulate on this particular point, so we went from one primitive type, with a set of fairly well-defined and clean facets from which to derive the other types, to eight primitives formally unrelated to each other.
Why are URIs primitive (because they represent web resources?) while language tags are not (they represent language varieties, after all)?
Whether anyURI should be derived from string or not engendered a good deal of often heated discussion in the XML Schema Working Group and elsewhere. Some argued for what I can only describe as a kind of metaphysical difference: URIs, in their view, are not essentially strings. (I do not believe any of them would accept this paraphrase of their argument, but I can’t do better because I never really understood more of it than that.) Others pointed out that the URI spec (or: one of the ever-growing number of specs casually referred to by outsiders as “the URI spec”) actually defines URIs as sequences of characters, i.e. as strings, but this argument made no headway. As far as I could understand, the answer to this observation was, essentially, that “the URI-nature which can be reduced to mere string-nature is not the True URI-nature”. If you want to discuss this further, I suggest you take it up with the TAG.
A rather different argument eventually persuaded me that it would be just as well to make xs:anyURI primitive. It seemed like a good idea to define the value space as the set of strings which (after XLink-style escaping) belong to the language defined by the grammar in the URI spec (there’s that phrase again). That wouldn’t capture any scheme-specific rules (like those governing the internal structure of HTTP URIs), but the only feasible alternative seemed to be to say, in effect, that any string is legal as an anyURI value.
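A rough illustration of the “escape, then check against the grammar” idea, as a Python sketch: XLink section 5.4 specifies the exact set of characters to be escaped, and the `safe` string below is only my approximation of the characters the generic URI grammar allows.

```python
from urllib.parse import quote

# Characters we leave unescaped, approximating RFC 2396's reserved
# set plus '%' and '#'.  (Letters, digits, and '-', '_', '.', '~'
# are never escaped by quote.)  An illustration, not the normative
# XLink list.
URI_SAFE = ":/?#@!$&'()*+,;="

def xlink_style_escape(s):
    """Percent-encode characters disallowed by the generic URI grammar,
    so the result can then be tested against that grammar."""
    return quote(s, safe=URI_SAFE)
```

So a string like `http://example.org/a b` is first escaped to `http://example.org/a%20b` and only then checked for membership in the language of the RFC’s grammar.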
I spent some time thinking that the grammar of the RFC was so loose that we might as well have said that any string was legal; one can hear normally well-informed people say that the only constraint enforced by the scheme-independent grammar of RFC 2396 (or nowadays by RFC 3987) is that you can’t have two hash marks. (Others equally well-informed deny that it enforces any such constraint. Me, I have given up trying to understand what the RFCs do and don’t say: they don’t seem to be interested in providing a crisp answer to the question, for some arbitrary string, “is this a legal URI or not?”.) But it turns out I was wrong: I once spent an instructive hour generating 10,000 or so random ASCII strings and parsing them against the grammar of RFC 2396, and the large majority were not legal, if only because they used illegal characters. So I now believe it is useful to enforce that grammar, even if one doesn’t enforce all the relevant scheme-specific rules. (XSDL can’t require that, because we don’t want to get in the way of new schemes.)
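The experiment is easy to reproduce. A character-level check is only a necessary condition for matching the generic grammar, not a sufficient one, but it already rejects the large majority of random strings. The character class below is my approximation of the characters RFC 2396 allows, including ‘%’ for escape sequences.

```python
import random
import re
import string

# Approximate set of characters RFC 2396 allows anywhere in a URI:
# unreserved and reserved characters, plus '%' for escapes.
URI_CHARS = re.compile(r"^[A-Za-z0-9\-_.!~*'();/?:@&=+$,%#]*$")

def could_be_uri(s):
    """Necessary (not sufficient) condition for s to be a legal URI."""
    return bool(URI_CHARS.match(s))

random.seed(0)
printable = string.printable[:95]          # printable ASCII incl. space
samples = ["".join(random.choices(printable, k=20)) for _ in range(10000)]
rejected = sum(not could_be_uri(s) for s in samples)
# The vast majority of the samples fail on character grounds alone.
```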
So: we want to specify that anyURI values must at least be in the language defined by the generic URI grammar in the relevant RFC.
But in XSDL that language can be defined for a derived type only if it’s regular, since the pattern facet, which uses regular expressions, is the only tool available for constraining the lexical space. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC to a regular expression would be error-prone and unlikely to produce a perspicuous result.
The set of legal language tags, by contrast, is obviously a regular language and can be described conveniently by a regular expression.
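For instance, using the RFC 3066-style pattern that xs:language itself settled on (as amended; the original 1.0 pattern was slightly different):

```python
import re

# RFC 3066-style language tags: a primary subtag of 1-8 letters,
# followed by any number of subtags of 1-8 letters or digits.
LANGUAGE_TAG = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

def is_language_tag(s):
    return LANGUAGE_TAG.fullmatch(s) is not None
```

One short regular expression covers the whole lexical space, which is exactly what a derivation by restriction from string requires.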
For the same reason, if a datatype for legal XPath were ever introduced, it would need to be primitive, because the lexical space cannot be described by a regular expression.
Why does that conclusion follow?
Because however erratic the design of the XSDL datatype system has been, there is at least one basic principle to which the design has adhered fairly faithfully: There is no magic in derived types. All magic (by which I mean: everything relevant to validity that has to be explained in natural-language prose because it cannot be expressed by the formalism acting alone) is in the primitives. The lexical and value spaces of any type derived by restriction are precisely those of the base type, as restricted by the facets used in the derivation, and without other restrictions.
It is for this reason that it is not possible in XSDL to define (for example) a datatype whose value space is the set of prime numbers. In order to define a type of that kind, one must either define it as a new primitive type (which would seem not only eccentric, since the primes really are a subset of the integers, but would be a case of lying to the processor), or else define a new kind of facet.
Henry Thompson has (in private conversation) suggested the facet maximum-number-of-divisors, set to the value 2. (A straight number-of-divisors facet set to the value 2 would be simpler, he said, but it would exclude the value 1, which many people wish to include among the primes.) I note, however, that Øystein Ore’s Number theory and its history, which happens still to be on my desk, defines primes as integers p > 1 whose only divisors are the trivial ones ±1 and ±p. So when the day comes, I vote for the number-of-divisors facet, set to the value 4, with a note in the annotation reading “Remember that −1 is a divisor!”
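In the spirit of the joke, the imagined facet is easy to operationalize. A Python sketch, where number-of-divisors is an invented facet, not anything XSDL defines:

```python
def divisors(n):
    """All integer divisors of n, negative as well as positive."""
    pos = [d for d in range(1, abs(n) + 1) if n % d == 0]
    return [-d for d in reversed(pos)] + pos

def is_ore_prime(n):
    # Combine the invented facet (number-of-divisors = 4) with a
    # minExclusive-style restriction to integers greater than 1.
    # With the value 4, this picks out exactly the primes in Ore's
    # sense: only the trivial divisors +/-1 and +/-p.  Remember
    # that -1 is a divisor!
    return n > 1 and len(divisors(n)) == 4
```

So 7 qualifies (divisors −7, −1, 1, 7) while 1 and 6 do not; the extra negative divisors are why the facet value is 4 rather than 2.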
We can live without a value space of Gaussian integers, most of the time; it’s utterly obnoxious to have no way of mapping RFC 822 dates to the same underlying value space as ISO 8601 dates.
Here, too, I sympathize with your point. We decided, at one point, to define mirror images of all of the current built-ins as abstract types, with the existing concrete built-ins derived from them by specifying a particular lexical mapping. (For this, I think we would have used keywords, to avoid having to design a lot of machinery for defining lexical mappings without magic.) The idea was that if someone wanted to define an rfc2822:date type for email dates, they wouldn’t be able to define the lexical mapping in an interoperable way, but they could at least say clearly that the value space consisted of dates.
The editors of the spec looked at this decision, struggled with it for a while, and eventually told the Working Group “If you want this to happen, you’re going to have to explain to us how to build a type hierarchy in which xs:integer is derived from xs:abstractInteger, not from xs:decimal, but is nevertheless substitutable for xs:decimal.” The editors had been unsympathetic to the idea, so some Working Group members may have suspected them of not trying hard enough to work out the kinks. But no one ever stepped forward with a solution to the problem.
And it’s only at this moment that I see the solution: don’t define an abstract analogue to every built-in, define them only for the primitives. Oh, well. About eight years too late, that one.
One can argue that if the explanation for why this type is primitive and that one is derived begins “Well, for historical reasons …”, then it’s a sign that the design is insufficiently clean. And that argument may be right. But I am unmoved by it. I like mathematics and admire the idea of mathematical purity, but I don’t confuse spec-making with mathematics.
In all of this business of deciding what’s primitive and what’s not, some members of the Working Group may also have been influenced by the belief that in any system like this the choice of primitives, while not wholly unmotivated by general principles, is seldom entirely free from a certain arbitrariness. Certainly I was, and am. One can do better or worse, and it’s worth a certain amount of effort to do better. But, you know, even the axioms of Russell and Whitehead (or any modern logical system) look a bit accidental. These seven axioms? And not some other set? Surely you’re joking; why these? Well, because they work. But other sets of axioms also work. If Russell and Whitehead’s choice of primitives feels arbitrary, then a Working Group which has failed to find a wholly non-arbitrary principle on which to divide primitives from other types can at least feel that it has failed in good company.