Allowing ‘extension primitives’ in XML Schema?

In issue 3251 against XSDL 1.1 (aka ‘XML Schema 1.1’ to those who haven’t internalized the new acronym), Michael Kay suggests that XSDL, like other languages, allow implementations to define primitive types additional to those described in the XSDL spec.

I’ve been thinking about this a fair bit recently.

The result is too much information to put comfortably into a single post, so I’ll limit this post to a discussion of the rationale for the idea.

‘User-defined’ primitives?

Michael is not the first to suggest allowing the set of primitives to be extended without revving the XSDL spec. I’ve heard others, too, express a desire for this or for something similar (see below). In one memorable exchange at Extreme Markup Languages a couple years ago, Ann Wrightson noted that in some projects she has worked on, the need for additional primitives is keenly felt. In the heat of the moment, she spoke feelingly of the “arrogance” of languages that don’t allow users to define their own primitives. I remember it partly because that word stung; I doubt that my reply was as calmly reasoned and equable as I would have liked.

Strictly speaking, of course, the notion of ‘user-defined primitives’ is a contradiction in terms. If a user can define something meaningfully (as opposed to just declaring a sort of black box), then it seems inevitable that that definition will appeal to some other concepts, in terms of which the new thing is to be understood and processed. Those concepts, in turn, either have formal definitions in terms of yet other concepts, or they have no formal definition but must just be known to the processor or understood by human readers by other means. The term primitive denotes this last class of things, that aren’t defined within the system. Whatever a user can define, in the nature of things it can’t very well be a primitive in this sense of the word.

Defining types without lying to the processor

But if one can keep one’s pedantry under control long enough, it’s worth while trying to understand what is desired, before dismissing the expression of the desire as self-contradictory. It’s suggestive that some people point to DTLL (the Datatype Library Language being designed by Jeni Tennison) as an instance of the kind of thing needed. I have seen descriptions of DTLL that claimed that it had no primitives, or that it allowed user-defined primitives (thus falling into the logical trap just mentioned), but I believe that in public discussions Jeni has been more careful.

In practice, DTLL does have primitives, in the sense of basic concepts not defined in terms of other concepts still more basic. In the version Jeni presented at Extreme Markup Languages 2006, the primitive notions in terms of which all types are defined are (a) the idea of tuples with named parts and (b) the four datatypes of XPath 1.0. (Note: I specify the version not because I know that DTLL has since changed, but only because I don’t know that it has not changed; DTLL, alas, is still on the ‘urgent / important / read this soon’ pile, which means it’s continually being shoved aside by things labeled ‘super-urgent / life-threatening / read this yesterday’. It also suffers from my expectation that I’ll have to concentrate and think hard; surfing the Web and reading people’s blogs seems less strenuous.)

But DTLL does not have primitives, in the sense of a set of semantically described types from which all other types are to be constructed. All types (if I remember correctly) are in the same pool, and there is no formal distinction between primitive types and derived types.

Of course, the distinction between primitives and derived types has no particular formal significance in XSDL, either. What you can do with one, you can do with the other. The special types, on the other hand, are, well, special: you can derive new types from the primitives, but you can’t derive new types from the specials, like xs:anySimpleType or (in XSDL 1.1) xs:anyAtomicType. Such derivations require magic (which is one way of saying there isn’t any formal method of defining the nature of a type whose base type is xs:anyAtomicType: it requires [normative] prose).

But XSDL 1.0 and the current draft of XSDL 1.1 do require that all user-defined types be defined in terms of other known types. And this has a couple of irritating corollaries.

If you want to define an XSDL datatype for dates in the form defined by RFC 2822 (e.g. “18 Jan 2008”), or for lengths in any of the forms defined by (X)HTML or CSS (which accepts “50%”, and “50” [pixels], and “50*”, and “50em”, and so on), you can do so if you have enough patience to work out the required regular expressions. But you can’t derive the new rfc2822:date type from xs:date (as you might wish to do, to signal that they share the same value space). You must lie to the processor, and say that really, the set of RFC 2822 dates is a subtype of xs:string.

Exercise for those gifted at casuistry: write a short essay explaining that what is really being defined here is the set of RFC 2822 date expressions, which really are strings, so this is actually all just fine and nothing to complain about.

Exercise for those who care about having notations humans can actually use: read the essay produced by your casuist colleague and refrain from punching the author in the eye.

Lying to the processor is always dangerous, and usually a bad idea; designing a system that requires lying to the processor in order to do something useful is similarly a bad idea (and can lead to heated remarks about the arrogance of the system). When the schema author is forced to pretend that the value space of email dates is the value space of strings (i.e. sequences of UCS characters), it not only does violence to the soul of the schema author and makes the processor miss the opportunity to use a specially optimized storage format for the values, but it also makes it impossible to derive further types by imposing lower and upper bounds on the value space (e.g. a type for plausible email dates: email dated before 1950? probably not). And you can’t expect the XSDL validator to understand the relation among pixels and points and picas and ems and inches, so just forget about the idea of restricting the length type with an upper bound of “2 pc” (two picas) and having the software know that “30pt” (thirty points) exceeds that upper bound.

What about NOTATION?

As suggested in the examples just above, there are a lot of specialized notations that could usefully be defined as extension primitives in an XSDL context. Different date formats are a rich vein in themselves, but any specialized form of expression used to capture specialized information concisely is a candidate. Lengths, in a document formatting context. Rational numbers. Colors (again in a display context). Read the sections on data types in the HTML and CSS specs. Read the section on abstract data types in the programming textbook of your choice. Lots of possibilities.

One might argue that the correct way to handle these is to declare them for what they are: specialized notations, which may be processed and validated by a specialized processor (called, perhaps, as a plugin by the validator) but which a general-purpose markup validator needn’t be expected to know about.

In principle, this could work, I guess. And it may be (related to) what the designers of ISO 8879 (the SGML spec) had in mind when they defined NOTATION. But I don’t think NOTATION will fly as a solution for this kind of problem today, for several reasons:

  • There is no history or established practice of SGML or XML validators calling subvalidators to validate the data declared as being in a particular notation. So the ostensible reason for using NOTATION (“It’s already there! you don’t need to do anything!”) doesn’t really hold up: using declared notations to guide validation would be plowing new ground.
  • Almost no one ever really wants to use notations. In part this is because few people ever feel they really understand what notations are good for, and usually not for long, and they don’t always agree. So software developers never really know what to do with them, and end up doing nothing much with them.
  • If RFC 2822 dates are a ‘notation’ rather than a ‘datatype’, then surely the same applies to ISO 8601 dates. Why treat some date expressions as lexical representations of dates, and others as magic cookies for a black-box process? If notations were intended to keep software that works with SGML and XML markup from having to understand things like integers and dates, then the final returns are now in, and we can certify that that attempt didn’t succeed. (Some document-oriented friends of mine like to tell me that all this datatype stuff was foisted on XSDL by data-heads and does nothing at all for documents. I keep having to remind them that they spent pretty much the entire period from 1986 to 1998 warning their students that neither content models nor attribute declarations allowed you to specify, for example, that the content of a quantity element, or the value of a quantity attribute, ought to be [for example] a quantity — i.e. an integer, preferably restricted to a plausible range. XSDL may have given you things you never asked for, and it may give you things you asked for in a form you didn’t expect and don’t much like. But don’t claim you never asked for datatypes. I was there, and while I don’t have videotapes, I do remember.)

Who defines the new primitives?

Some people are nervous at the idea of trying to allow users to define new kinds of dates, or new kinds of values, in part because the attempt to allow the definition of arbitrary value spaces, in a form that can actually be used to check the validity of data, seems certain to end up by putting a Turing complete language into the spec, or by pointing to some existing programming language and requiring that people use that language to define their new types (and requiring all schema processors to include an interpreter for that language). And the spec ends up defining not just a schema language, but a set of APIs.

However you cut it, it seems a quick route to a language war; include me out.

Michael Kay has pointed out that there is a much simpler way. Don’t try to provide for user-defined types derived by restriction from xs:anyAtomicType in some interoperable way. That would require a lot of machinery, and would be difficult to reach consensus on.

Michael proposes: just specify that implementations may provide additional implementation-defined primitive types. In the nature of things, an implementation can do this however it wants. Some implementors will code up email dates and CSS lengths the same way they code the other primitives. Fine. Some implementors will expose the API that their existing primitive types use, so they choose, at the appropriate moment, to link in a set of extension types, or not. Some will allow users to provide implementations of extension types, using that API, and link them at run time. Some may provide extension syntax to allow users to describe new types in some usable way (DTLL, anyone?) without having to write code in Java or C or [name of language here].

That way, all the burden of designing the way to allow user-specified types to interface with the rest of the implementation falls on the implementors, if they wish to carry it, and not on the XSDL spec. (Hurrah, cries the editor.)

If enough implementations do that, and enough experience is gained, the relevant community might eventually come to some consensus on a good way to do it itneroperably. And at that point, it could go into the spec for some schema language.

This has worked tolerably well for the analogous situation with implementation-defined / user-defined XPath functions in XSLT 1.0. XSLT 1.0 doesn’t provide syntax for declaring user-defined functions; it merely allows implementations to suport additional functions in XPath expressions. Some implementations did so, either their own functions, or functions defined by projects like EXSLT. And some implementations did in fact provide syntax for allowing users to define functions without writing Java or C code and running the linker. (And lo! the experience thus gained has made it possible to get consensus in XSLT 2.0 for standardized syntax for user-defined functions.)

But with that thought, I have reached a topic perhaps better separated out into another post. Other languages, specs, and technologies have faced similar decisions (should we allow implementations to add new blorts to our language, or should we restrict them to using the standard blorts?); what has the experience been like? What can we learn from the experience?

3 thoughts on “Allowing ‘extension primitives’ in XML Schema?

  1. What makes XSDL datatypes confusing to me is the lack of rationale for what is a top-level type (and thus semantically distinct from all other top-level types) and what is not:

    Why are floats, doubles, and decimals all primitive rather than being derived from a primitive notion of number?

    Why are URIs primitive (because they represent web resources?) while language tags are not (they represent language varieties, after all)?

    The lack of extensibility hurts in two ways, which must be distinguished: there is no way to introduce a new value type, and there is no way to introduce a novel representation of existing value types. The former is tolerable, perhaps unavoidable; the latter is not.

    We can live without a value space of Gaussian integers, most of the time; it’s utterly obnoxious to have no way of mapping RFC 822 dates to the same underlying value space as ISO 8601 dates.

    On notations: The authors of the XML Rec should have made clear that the purpose of a NOTATION attribute is to specify the datatype of the element to which it’s attached. Then it wouldn’t have been necessary to massively reinvent the wheel. (While I’m at it, newline normalization in attributes and the use of URIs as system ids rather than public ids were also mistakes.)

  2. John,

    thank you; good questions. My answer has grown into a separate post; see it for one Working Group member’s take on the reasons for the puzzling phenomena you mention.

Comments are closed.