VC-Filter

Posted on 6 May 2009 by cmsmcq

[6 May 2009]

Short version: I’ve made a new toy for playing with one aspect of XSD 1.1, namely the conditional inclusion of elements in schema documents, controlled by attributes in the vc (version control) namespace. The experience has reminded me of reasons to like the vc:* design and of reasons to like (and hate) Javascript in the browser.


XSD 1.1 is a Candidate Recommendation

Posted on 4 May 2009 by cmsmcq

[4 May 2009; some typos corrected and phrases tweaked 5, 6, and 7 May]

The World Wide Web Consortium has published XSD 1.1 Part 1: Structures and Part 2: Datatypes as Candidate Recommendations, and issued a call for implementation.

As the version number is intended to suggest, XSD 1.1 is mostly very similar to XSD 1.0 and restricts itself to relatively modest changes to the spec.

[At this point, Enrique snorted loudly enough to break my concentration. “If it’s just modest changes, why did it take so long? Let’s see, when did you start? XSD 1.0 was 2001, so …”

“Well, we didn’t start on 1.1 right away,” I hurriedly interjected. “But, well, I guess you’re right. It did take a lot longer than you would have expected.”

“Why? What could possibly take that long?”

“Well, different members of the working group turned out to entertain rather different views of what counts as a modest change. So we spent a lot of the last several years arguing about the relative importance of compatibility, of fixing problems in the spec, and of making the spec more useful for users. And then, on the next issue, arguing about them again. And again. And again.”

“And again?”

“And again. You know, some people say you can be a success in committee work in several different ways: being smarter than everyone else, —”

“You mean, the James Clark approach?”

“Yeah — only that doesn’t always work for people who aren’t James Clark. Or by working harder than everyone else,”

“Paul Cotton always used to talk about how much leverage you have to influence a group if you are the one who always does the minutes. I always thought he was just trying to find a sucker.”

“Well, maybe. But I think he also meant it; it really can be an important role.”

“Then why are members of the W3C Team so strongly encouraged not to do it?”

“Long story; another time, perhaps. Or, third alternative, you can just have more endurance than everyone else.”

“The ‘Iron Butt Rule’?”

“Exactly. The XML Schema working group had several members who seemed determined to try their hand at that technique.”

“Well, there’s you, of course. That would be your only option, really, wouldn’t it? I mean, the other methods …. But you mean, others tried to play the Iron Butt card, too?”

“Hush. I was going to talk about what 1.1 has that 1.0 doesn’t have.”

“So who’s stopping you?”]

XSD 1.1 is mostly similar to 1.0, I was saying before being interrupted. But it does have a number of improvements that can make a difference.

XSD 1.1 supports XML 1.1 and XML 1.0 Fifth Edition. (That last does not distinguish it, in my view, from XSD 1.0. But some people believe that 1.0 requires old versions of its normative dependencies, because the working group did not instruct the editors to say explicitly that of course newer editions can be used. Some things should go without saying, you know?)
This constitutes a significant improvement from the point of view of internationalization.
There’s a conditional inclusion mechanism (the vc:* attributes) for allowing a schema document to provide multiple versions of a declaration and select the right one at schema construction time based on which version of XSD the processor supports, what spec- and implementation-defined datatypes are automatically available, and so on.
This mechanism should make it much easier to produce new versions of XSD without being tied in knots over questions of what back-level processors will make of schema documents which use new constructs. (If XSD 1.0 had had such a mechanism, we could probably have done a better 1.1 in half the time. But we did not learn enough, when doing 1.0, from the example of, say, XSLT 1.0.)
Elements can now be declared with a form of conditional type assignment that makes the type assigned in an instance depend on the values of its attributes; this allows a variety of co-occurrence constraints on attributes and content to be expressed.
Assertions can be associated with complex and simple types. This also makes it easier (or in some cases possible for the first time) to express certain co-occurrence constraints on attributes and content.
The assertions of XSD 1.1 are less powerful than the assertions of Schematron, in that they cannot refer to anything outside the element being validated. They will in some cases be less convenient to express. (Ask about the HTML input rule, for example.) But they preserve the context-independence of type validity and an aggressive optimizer should be able to check them in a streaming context, which is not true in general of Schematron assertions.
Attributes can be marked inherited; inherited values are written into the XDM data model instance before assertions and conditional type assignment evaluate any XPath expressions, which means that inherited attributes like xml:lang can be consulted in conditional type assignment and assertions.
I’m proud of this not only because it helps handle internationalization better, but because it aligns the principle of context-free validation better with some of the core idioms of XML.
A precisionDecimal datatype has been added, which is intended to mirror the new IEEE 754-2008 specification of floating-point decimal.
This one is controversial: some members of the XSL and XML Query working groups are vocal in saying it’s a bad idea, it will complicate their type hierarchy and type coercion rules yet again, and we shouldn’t support it.

[“Of course, some of the same members of QT also predicted that the IEEE spec would never be finished at all, and that the sky would fall, hell would freeze over, and Intel would fall into the Pacific Ocean before supporting it, didn’t they?” said Enrique. “But the spec was published, and Intel is supporting it. So …” “Hush,” I said. “They’ll hear you.” But it doesn’t matter: they don’t much care what Enrique thinks.]
The xsd:redefine construct has been deprecated.
This is a disappointment to some people, who believe that it had great promise. And they are right: it did have great promise. But the 1.0 spec is vague (to put it charitably) on some points; interoperability problems in 1.0 implementations have been reported and the working group has been unable to agree on the correct interpretation of the 1.0 spec.
A simpler mechanism for reusing an existing schema document while changing it selectively is now provided under the name xsd:override. For the situations where redefine turns out to be under- (or over-) specified, override provides relatively clear, straightforward answers.
The rules for restriction have been made much simpler and more correct. It is no longer possible to use xsi:type with the name of a member type in order to evade facet restrictions on a union.
The determinism rule (the so-called “unique particle attribution” constraint) has been relaxed. It’s now legal for wildcards to compete with element declarations; elements win.
It’s easier to specify ‘open content’ and effectively insert wildcards everywhere, without cluttering up your content models.
Wildcards can now say, in effect, “any of these, except for those.” Some people call these “negative wildcards”.
All-groups can now contain wildcards, the elements and wildcards in all-groups can now have maxOccurs greater than one, and all-groups can be extended.
To align better with XPath 2.0 and related specs, the simple type hierarchy now includes an xsd:anyAtomicType. Also, the two totally ordered subtypes of duration defined for XPath 2.0 and related specs have (with the cooperation of the XML Query and XSL working groups) been integrated into the XML Schema namespace.
A new facet has been added for requiring the timezone to be present (or absent) in datatypes derived by restriction from any of the date/time types; a dateTimeStamp datatype which requires a timezone has been added, at the suggestion of the OWL working group.
Lists and unions constructed from ID and IDREF retain the ID- and IDREFness of the ID and IDREF values. Also, you can have more than one ID on an element, which means it’s now a lot easier to support xml:id without having to whack the rest of your vocabulary.
Much of the spec has been rewritten, sentence by sentence and phrase by phrase. It was not possible to reorganize the exposition from the ground up (although I agree with those who believe the spec could use it), but while retaining the same organization we were able to make individual paragraphs and sentences easier to follow and understand. More liberal use of technical terms, variable notation, and section headings may seem like trivial changes, but empirically they appear to have a perceptible effect on the readability of the spec.
Most users, of course, don’t read the spec, even power users. But implementors do, members of the working group do, members of other working groups who need to layer their stuff on top of XSD do. And some users do. I wish we could do more to make the spec more welcoming and legible for them. But while there is a lot of room for further improvement, I think (if I say so myself) that 1.1 is somewhat easier to read than 1.0. It benefits, of course, from being the second go at formulating these things.
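
To make the conditional inclusion mechanism in the list above a little more concrete, here is a minimal Python sketch of the version test only. It assumes the vc namespace name http://www.w3.org/2007/XMLSchema-versioning and the reading that an element survives when vc:minVersion <= V < vc:maxVersion, where V is the version the processor supports. The function names and the sample schema document are invented for illustration, and the other vc:* attributes (those that test for available types and facets) are ignored.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

XS = "http://www.w3.org/2001/XMLSchema"
VC = "http://www.w3.org/2007/XMLSchema-versioning"  # assumed namespace name


def is_included(elem, version):
    """Return True if elem passes the (assumed) vc:minVersion/vc:maxVersion test."""
    lo = elem.get("{%s}minVersion" % VC)
    hi = elem.get("{%s}maxVersion" % VC)
    if lo is not None and version < Decimal(lo):
        return False  # processor is older than the element requires
    if hi is not None and version >= Decimal(hi):
        return False  # maxVersion is treated as exclusive here
    return True


def vc_filter(elem, version):
    """Drop, in place, every descendant of elem that fails the version test."""
    for child in list(elem):  # copy the child list: we mutate while walking
        if is_included(child, version):
            vc_filter(child, version)
        else:
            elem.remove(child)
    return elem


# Hypothetical schema document offering two versions of one declaration.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">
  <xs:element name="stamp" type="xs:dateTime" vc:maxVersion="1.1"/>
  <xs:element name="stamp" type="xs:dateTimeStamp" vc:minVersion="1.1"/>
</xs:schema>"""


def surviving_types(version):
    """List the type attributes of the declarations left after filtering."""
    root = vc_filter(ET.fromstring(SAMPLE), Decimal(version))
    return [e.get("type") for e in root.findall("{%s}element" % XS)]
```

With this sample, a 1.0 processor would be left with the xs:dateTime declaration and a 1.1 processor with the xs:dateTimeStamp one, which is the point of the design: both versions can live in one schema document.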

It has been a long, hard slog — I lied to Enrique, we actually did start on it in 2001, though we also were doing a lot of other things at the same time — and I think we would not have made it without the perseverance of the chair, David Ezell of Verifone, representing the National Association of Convenience Stores (to both of whom thanks for seconding David to the group and supporting the time he spends on XSD), and the hard work of Sandy Gao of IBM on the Structures spec and Dave Peterson of SGMLWorks! (who serves as an invited expert) on the Datatypes spec. XSD 1.1 is not a perfect spec, by any means. But it’s an improvement on 1.0, and it’s worth pushing forward for that reason. And without David, and Sandy, and Dave, it would not be happening. Anyone interested in the validation of XML owes these three a debt of gratitude.

The long hard slog is by no means over. Publication as a Candidate Recommendation means the W3C has now called for implementations. If you are a programmer looking for a challenge, I challenge you to implement XSD 1.1! If you are a user, not a provider, of XSD software, urge the supplier of your software to implement XSD 1.1, and test their implementation! The more you push on the implementations now, the stronger they will be when the time comes to demonstrate implementation experience and progress the spec to Proposed Recommendation. And the more experience we will have gained towards the goal of having a broadly supported validation language which supports the full spectrum of XML usage.

[“Wow!” said Enrique. “Did you know that perseverance is a theological term? ‘continuance in a state of grace leading to a state of glory’!” “In other words,” I said, “you looked it up because you didn’t think I knew how to spell it correctly, did you?” “Oh, hush,” he said.]

Base 64 binary, hex binary, and bit sequences

Posted on 26 April 2009 by cmsmcq

[26 April 2009]

I had an interesting exchange with a friend recently. He wrote

Presumably a binary octet is a sequence of bits. A sequence of sequences of bits is not a sequence of bits, since a bit is not a sequence of bits. (Please, let’s not violate the axiom of regularity!) Therefore, a finite-length sequence of zero or more binary octets is not a sequence of bits.

The upshot, he said, was that the description of the value space of the base64Binary and hexBinary datatypes was wrong in both XSD 1.0 and XSD 1.1. The 1.0 spec says (in section 3.2.16):

The value space of base64Binary is the set of finite-length sequences of binary octets.

XSD 1.1 says, very similarly (but with pedantic attention to details on which people have professed uncertainty):

The value space of base64Binary is the set of finite-length sequences of zero or more binary octets. The length of a value is the number of octets.

But if my friend is right, and the binary datatypes are intended to have value spaces which are sequences of bits, then those descriptions are wrong. So, my friend continues, the description of the value space really ought to be:

… the set of finite-length concatenations of sequences of zero or more binary octets.

Shouldn’t it?

This sounds plausible at first glance but the answer is no, it shouldn’t. The two binary datatypes can certainly be used, in fairly obvious ways, to encode sequences of bits, but their value space is not the set of bit sequences, not even the set of bit sequences whose length is an integer multiple of eight (and whose length for purposes of the XSD length facet would be their bit length divided by eight).
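
A small Python sketch may help fix the idea that the values are octet sequences and that length counts octets. It uses only the standard library; the helper names are made up for this example, and the strict `validate=True` decoding is a simplification that ignores the whitespace the XSD lexical space allows.

```python
import base64


def octets_from_base64(lexical):
    """Map a base64 lexical form to a value: a sequence of octets (bytes)."""
    return base64.b64decode(lexical, validate=True)


def octets_from_hex(lexical):
    """Map a hex lexical form to a value: a sequence of octets (bytes)."""
    return bytes.fromhex(lexical)


# Two different lexical forms, one and the same value.
from_b64 = octets_from_base64("AQID")
from_hex = octets_from_hex("010203")

# The length facet would count octets, not bits and not characters.
length_in_octets = len(from_b64)

# The empty sequence is a legitimate value ("zero or more octets").
empty = octets_from_base64("")
```

Nothing in either function ever looks at an individual bit; the unit of the value space is the octet, treated here simply as an integer from 0 to 255.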

My friend’s argument suffers, I think, from two faults. Most important, he seems to assume

That an octet is a sequence of bits.
That the purpose of base64Binary (and of the Base 64 Encoding on which it is based) is to encode sequences of bits.

I don’t think either of these is true in detail.

Are octets sequences of bits?

Is an octet a sequence of bits? Certainly it’s often thought of that way (e.g. in the Wikipedia article on ‘octet’). But strictly speaking I think the term ‘octet’ is best taken as denoting a group of bits, without assuming a geometry, in which (for purposes of most network transmission protocols) each bit is associated with a power of two.

But if we number the bits with their powers as 0 .. 7, is the octet the sequence of b0 b1 b2 b3 b4 b5 b6 b7? or b7 b6 b5 b4 b3 b2 b1 b0? Or some other sequence? On architectures where the byte is the smallest addressable unit, there is no requirement that the bits be thought of as being in any particular order, although the design of big- and little-endian machines makes better intuitive sense if we assume least-significant-bit-first order for little-endian, and most-significant-first for big-endian. I believe that some serial-port protocols specify least-first, others greatest-first, order (with least-first most common). But I suspect that most networking protocols today (and for a long time now) assume parallel transmission of bits, in which asking about the sequence of bits within an octet is nothing but a category error.

But IANAEE. Am I wrong?

Does base 64 encoding encode bits?

RFC 3548, which defines Base 64 encoding, says

The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.

It uses similar wording for the base 32 and base 16 encodings it also defines.

Note the choice of words: octets, not bits.

I wasn’t there when it was developed, but I’m guessing that base 64 notation is carefully formulated to be agnostic on the sequence of bits within an octet, and to be equally implementable on big-endian, little-endian, and other machines. That would be one reason for it to talk about encoding sequences of octets, not sequences of bits.

One consequence of such agnosticism is that if one wanted to specify a sequence of bits using base64Binary, or hexBinary, one would need to specify what sequence to assign to the eight bits of each octet. And indeed RFC 2045 specifies that “When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first.” I believe that that stipulation is necessary precisely because it doesn’t follow from the nature of octets and thus doesn’t go without saying. The RFCs also don’t say, when you are encoding a sequence of six bits, for example, whether it should be left-aligned or right-aligned in the octet.
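
The need for such a stipulation can be seen in a short Python sketch. The function `pack_bits` is hypothetical, written for this illustration only: it turns a list of bits into octets under either ordering convention, and the same bit stream yields different octets, and so different base 64 text, depending on which convention is chosen.

```python
import base64


def pack_bits(bits, msb_first=True):
    """Pack a list of 0/1 values (length a multiple of 8) into octets.

    msb_first=True follows the convention RFC 2045 stipulates; False is the
    opposite convention, shown only for contrast.
    """
    if len(bits) % 8 != 0:
        raise ValueError("this sketch only handles whole octets")
    out = bytearray()
    for i in range(0, len(bits), 8):
        group = bits[i:i + 8]
        if not msb_first:
            group = group[::-1]  # first bit in the stream becomes the low bit
        value = 0
        for bit in group:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


stream = [1, 0, 0, 0, 0, 0, 0, 0]

msb_octets = pack_bits(stream, msb_first=True)    # b'\x80'
lsb_octets = pack_bits(stream, msb_first=False)   # b'\x01'

msb_text = base64.b64encode(msb_octets).decode("ascii")  # 'gA=='
lsb_text = base64.b64encode(lsb_octets).decode("ascii")  # 'AQ=='
```

The sketch refuses anything that is not a whole number of octets, which sidesteps the alignment question raised above rather than answering it.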

Bottom line: I think the XSD spec is right to say the value spaces of the binary types are sequences of octets, not of bits.

[In preparing this post, I notice (a) that RFC 3548 appears to have been obsoleted by RFC 4648, and (b) that XSD 1.1 still cites RFC 2045. I wonder if anyone cares.]

Such a little thing, to provoke so many thoughts

Posted on 3 February 2009 by cmsmcq

[3 February 2009]

I saw an interesting bit of stray XML this morning, which raises a number of questions worth mulling over.

The New York Times sends me email every morning with headlines from the day’s papers; this is one of the perqs, I think, of being a subscriber. Given the choice, I asked for the ASCII-only version (like many people I know, I don’t much like HTML email; I’m not sure this is a well-founded prejudice, but there it is).

In today’s headlines, I find the following fragment:

- QUOTATION OF THE DAY -

"Oh, you're one of <em style="i">them."
- IRIS CHAU, recounting an acquaintance's reaction when she said she
worked at a banking company.

My eye was caught (perhaps this is just a déformation professionnelle) by the unmatched start-tag for an ’em’ element, appearing in what was intended to be a markup-free context. This seems to tell us several things about the internal system used by the Times, and to invite several questions:

They seem to be using markup (perhaps HTML, perhaps some other XML vocabulary), not just for delivery of the headlines but in the system for preparing features like the quotation of the day.
Their XML vocabulary seems to require an inline specification of rendering information. This seems odd; wouldn’t it be natural for a stylesheet to say that ’em’ elements should be italic? How many other renderings of emphasis are there likely to be in the main story block, in a broadsheet? (Oh, well; not my design, and I don’t know what constraints the vocabulary designer was working under.)
Assuming that this quotation was cut and pasted out of the story in which it appears (which does not appear to be ill-formed), could we infer that the journalist who did the cut and paste was using an editing tool that had no markup awareness? Would a tool with better awareness of markup have helped here? (E.g. by picking up the end-tag of the ’em’ element, as soon as it picks up the start-tag, as SoftQuad’s Author/Editor used to do [and presumably still does in its current incarnation], or by dropping the start-tag.)
Could a better validator somewhere in the system that generates the ASCII headlines have caught this error? Could it have been fixed automatically, or with minimal-cost human intervention?
What properties would a validator for that process need to have, to induce people to view it as an asset instead of a nuisance?
What properties would a schema language need to have, in order to make it easier to write a validator with those properties? And how might the schema language go about encouraging implementors to write such validators?
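
As a very rough illustration of the cheapest check the questions above might have in mind, here is a Python sketch that tests whether a pasted fragment is well-formed when wrapped in a dummy element. It says nothing about how the Times system actually works; the function name and the wrapper element are invented for the example, and a real check would also have to decide what to do about stray ampersands and the like.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(text):
    """Return True if text parses as XML content inside a dummy wrapper element."""
    try:
        ET.fromstring("<wrapper>" + text + "</wrapper>")
    except ET.ParseError:
        return False
    return True


# The fragment from the email, with its unmatched start-tag.
pasted = '"Oh, you\'re one of <em style="i">them."'

# What a markup-aware cut and paste might have produced instead.
repaired = '"Oh, you\'re one of <em style="i">them</em>."'
```

A check like this can flag the problem but cannot choose between the two obvious repairs (supply the end-tag, or drop the start-tag); that choice is where the human, or a smarter tool, comes in.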

Hmmmmm.

Simple proof that the URI grammar of RFC 3986 defines a regular language

Posted on 16 January 2009 by cmsmcq

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

setof(V,X^(refers(V,X);refers(X,V)),Vs),
setof(L-R,refers(L,R),Es),
vertices_edges_to_ugraph(Vs,Es,G),
transitive_closure(G,Gc),
member(LHS-RHS,Gc),
member(LHS,RHS).

There were no solutions; from this I infer that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.

The line

setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line

vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...)

The line

transitive_closure(G,Gc),

takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines

member(LHS-RHS,Gc),
member(LHS,RHS).

seek a left-hand side LHS in Gc which has an edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS; in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.
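
For readers who don’t keep a Prolog system handy, the same check can be sketched in Python. This is a translation of the idea, not of the ugraph library: the closure is computed by naive iteration, the function names are invented, and the edge list is only a small hand-copied subset of the refers facts for RFC 3986, with a toy recursive grammar added for contrast.

```python
def transitive_closure(edges):
    """Map each node to the set of nodes reachable from it in one or more steps."""
    reach = {}
    for lhs, rhs in edges:
        reach.setdefault(lhs, set()).add(rhs)
        reach.setdefault(rhs, set())
    changed = True
    while changed:  # repeat until no set grows any further
        changed = False
        for targets in reach.values():
            extra = set()
            for t in targets:
                extra |= reach[t]
            if not extra <= targets:
                targets |= extra
                changed = True
    return reach


def recursive_nonterminals(edges):
    """Return the non-terminals that can reach themselves, i.e. are recursive."""
    closure = transitive_closure(edges)
    return sorted(n for n, targets in closure.items() if n in targets)


# A few refers(X, Y) pairs from the RFC 3986 ABNF (subset, for illustration).
URI_EDGES = [
    ("URI", "scheme"), ("URI", "hier-part"), ("URI", "query"), ("URI", "fragment"),
    ("hier-part", "authority"), ("hier-part", "path-abempty"),
    ("authority", "userinfo"), ("authority", "host"), ("authority", "port"),
    ("host", "IP-literal"), ("host", "IPv4address"), ("host", "reg-name"),
]

# A toy grammar that IS recursive: expr -> term -> factor -> expr.
TOY_EDGES = [("expr", "term"), ("term", "factor"), ("factor", "expr")]
```

An empty result from `recursive_nonterminals` plays the role of “no solutions” in the Prolog query; on the subset shown it is empty, while the toy grammar reports all three of its non-terminals.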

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the language is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.
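
The construction is just repeated substitution: because no non-terminal refers to itself, replacing each reference by the body of its rule must terminate. Here is a minimal Python sketch of that idea on two rules from the RFC, scheme and port. The `<name>` placeholder notation, the `RULES` table, and the `expand` function are all invented for the illustration, and the output is a Python regex rather than an XSD one.

```python
import re

# Rule bodies written as regex fragments; <name> marks a reference to another
# rule. Hand-translated from: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# and port = *DIGIT.
RULES = {
    "ALPHA": "[A-Za-z]",
    "DIGIT": "[0-9]",
    "scheme": r"<ALPHA>(?:<ALPHA>|<DIGIT>|[+.\-])*",
    "port": "<DIGIT>*",
}


def expand(name, rules):
    """Replace every <ref> in a rule body by the expansion of that rule.

    Terminates only because the grammar is non-recursive; on a recursive
    grammar this would recurse until Python gives up.
    """
    return re.sub(
        r"<([A-Za-z-]+)>",
        lambda m: "(?:" + expand(m.group(1), rules) + ")",
        rules[name],
    )


SCHEME_RE = re.compile(expand("scheme", RULES))
PORT_RE = re.compile(expand("port", RULES))
```

Even on these two tiny rules the expanded expression is noticeably less readable than the ABNF, which suggests what the full expression for URI is likely to look like.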

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answering the question. “Well, I just tried the construction, and it worked.” He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

Messages in a Bottle

CMSMcQ's klog

Category Archives: schema languages
