Archive for the ‘schema languages’ Category

Xerophily, a parser for XSD regular expressions

Wednesday, December 30th, 2009

[30 December 2009]

A while back, in connection with the work of the W3C XML Schema working group, I wrote a parser, in Prolog, for the regular-expression language defined by the XSD spec. This has been on the Web for some time, at http://www.w3.org/XML/2008/03/xsdl-regex/.

Not long ago, though, I received an inquiry from Jan Wielemaker, the maintainer of SWI Prolog (and ipso facto someone to whom I owe a great debt, along with every other user of that well maintained system), asking if it wouldn’t be possible to re-issue the code under the Lesser Gnu Public License, so that he could use parts of it in some work he’s doing on support for the simple types of XSD. (Strictly speaking, he could do that under the W3C license, but that would mean he has to manage not one license but two for the same library: too complicated.)

Fine with me, but since I wrote it while working at W3C, I don’t own the copyright. To make a long story short, after some cavilling and objections and assorted mutual misunderstandings, agreement was reached, and W3C gave their blessing.

The LGPL boilerplate assumes that the package being issued has a name, so I have given the XSD regex parser the name Xerophily.

[“Xerophily? What on earth is that?” asked Enrique. The word, I explained to him, denotes the property possessed by plants well adapted to growing in dry, especially hot and dry, conditions. “Well, that makes it apt for code produced in New Mexico, I guess,” he allowed. “And your code, like those plants, is a little scraggly and scrawny. But what name are you going to use for your next program?” “It's also one of the few words I could find containing X (for XSD), R (for regular expression) and P (for parser, and for Prolog).” “Oh.”]

A word of caution: Xerophily was written to support some very dry work on the grammar of the regex language; it doesn’t really have a user interface; and it doesn’t actually perform regex matching against strings. It just parses the regex into an abstract syntax tree (AST). I do have code, or I think I did some years back, to do the matching, but the shape of the AST has changed since then. In other words, Jan Wielemaker may be the only person on earth who could have any plausible reason for wanting to use or read this code.

But for what it’s worth, as of a few minutes ago, it’s on the Web, under the LGPL, at the address http://www.w3.org/XML/2008/03/xsdl-regex/LGPL/. The W3C-licensed version remains available, at the address given above.

Automata and infinite strings

Tuesday, December 15th, 2009

[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you're using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it's not clear that that's true. Instead of the sequence "3.141592...", store an algorithm for calculating, on demand, the nth digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn't what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “finite” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that in principle automata theorists have historically not restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said, without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: there is no final state.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

  1. Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
  2. Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer).
Do these rules for acceptance of strings define sets of automata with different discriminating power?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.

[Enrique tells me I should pause to draw pictures of these automata, but I'm not going to take the time just yet. Apologies to those who find it hard to visualize what I'm talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation,1/7 has an infinitely repeating sequence of digits (142857, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by an automaton with finite states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

  1. Accept if at every point in the sequence of states there are an infinite number of accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second story about termination, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed with other symbols, will not be accepted by this automaton (nor by any other, as far as I can tell).

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in Introduction to automata theory, languages, and computation (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: a practical guide, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages also doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, The vastness of natural languages (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in Automata studies, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

XML as a sort-of open format

Thursday, December 3rd, 2009

[3 December 2009]

I just encountered the following statements in technical documentation for a family of products which I’ll leave nameless.

This document does not describe the complete XML schema for either [Application 1] or [Application 2]. The complete XML schema for both applications is not available and will not be made public.

Perhaps there can be good reasons for such a situation. Perhaps the developers really don’t know how to use any existing schema language to describe the set of documents they actually accept; perhaps only a Turing machine can actually identify set of documents accepted, and the developers were unwilling to work with a simpler set whose membership could be more cheaply decided. (Well, wait, those may be reasons, but they don’t actually qualify as “good”.)

I wonder whether this is an insidious attempt to look like the products have an open format (See? it’s XML! How much more open can you get?) while ensuring that the commercial products in question remain the only arbiters of acceptable documents? Or whether the programmers in question were just too lazy to specify a clean vocabulary and ensure that their software handles all documents which meet some standard of validity that does not require Turing completeness?

Having a partially defined XML format is, at least for me, still a great deal more convenient than having the format be binary and completely undocumented. But it certainly seems to fall a long distance short of what XML could make possible.

VC-Filter

Wednesday, May 6th, 2009

[6 May 2009]

Short version: I’ve made a new toy for playing with one aspect of XSD 1.1, namely the conditional inclusion of elements in schema documents, controlled by attributes in the vc (version control) namespace. The experience has reminded me of reasons to like the vc:* design and of reasons to like (and hate) Javascript in the browser.

(more…)

XSD 1.1 is a Candidate Recommendation

Monday, May 4th, 2009

[4 May 2009; some typos corrected and phrases tweaked 5, 6, and 7 May]

The World Wide Web Consortium has published XSD 1.1 Part 1: Structures and Part 2: Datatypes as Candidate Recommendations, and issued a call for implementation.

As the version number is intended to suggest, XSD 1.1 is mostly very similar to XSD 1.0 and restricts itself to relatively modest changes to the spec.


[At this point, Enrique snorted loudly enough to break my concentration. “If it's just modest changes, why did it take so long? Let's see, when did you start? XSD 1.0 was 2001, so ...”

“Well, we didn't start on 1.1 right away,” I hurriedly interjected. “But, well, I guess you're right. It did take a lot longer than you would have expected.”

“Why? What could possibly take that long?”

“Well, different members of the working group turned out to entertain rather different views of what counts as a modest change. So we spent a lot of the last several years arguing about the relative importance of compatibility, of fixing problems in the spec, and of making the spec more useful for users. And then, on the next issue, arguing about them again. And again. And again.”

“And again?”

“And again. You know, some people say you can be a success in committee work in several different ways: being smarter than everyone else, —”

“You mean, the James Clark approach?”

“Yeah — only that doesn't always work for people who aren't James Clark. Or by working harder than everyone else,”

“Paul Cotton always used to talk about how much leverage you have to influence a group if you are the one who always does the minutes. I always thought he was just trying to find a sucker.”

“Well, maybe. But I think he also meant it; it really can be an important role.”

“Then why are members of the W3C Team so strongly encouraged not to do it?”

“Long story; another time, perhaps. Or, third alternative, you can just have more endurance than everyone else.”

“The ‘Iron Butt Rule’?”

“Exactly. The XML Schema working group had several members who seemed determined to try their hand at that technique.”

“Well, there's you, of course. That would be your only option, really, wouldn't it? I mean, the other methods .... But you mean, others tried to play the Iron Butt card, too?”

“Hush. I was going to talk about what 1.1 has that 1.0 doesn't have.”

“So who's stopping you?”]


XSD 1.1 is mostly similar to 1.0, I was saying before being interrupted. But it does have a number of improvements that can make a difference.

  • XSD 1.1 supports XML 1.1 and XML 1.0 Fifth Edition. (That last does not distinguish it, in my view, from XSD 1.0. But some people believe that 1.0 requires old versions of its normative dependencies, because the working group did not instruct the editors to say explicitly that of course newer editions can be used. Some things should go without saying, you know?)

    This constitutes a significant improvement from the point of view of internationalization.

  • There’s a conditional inclusion mechanism (the vc:* attributes) for allowing a schema document to provide multiple versions of a declaration and select the right one at schema construction time based on which version of XSD the processor supports, what spec- and implementation-defined datatypes are automatically available, and so on.

    This mechanism should make it much easier to produce new versions of XSD without being tied in knots over questions of what back-level processors will make of schema documents which use new constructs. (If XSD 1.0 had had such a mechanism, we could probably have done a better 1.1 in half the time. But we did not learn enough, when doing 1.0, from the example of, say, XSLT 1.0.)

  • Elements can now be declared with a form of conditional type assignment that makes the type assigned in an instance depend on the values of its attributes; this allows a variety of co-occurrence constraints on attributes and content to be expressed.
  • Assertions can be associated with complex and simple types. This also makes it easier (or in some cases possible for the first time) to express certain co-occurrence constraints on attributes and content.

    The assertions of XSD 1.1 are less powerful than the assertions of Schematron, in that they cannot refer to anything outside the element being validated. They will in some cases be less convenient to express. (Ask about the HTML input rule, for example.) But they preserve the context-independence of type validity and an aggressive optimizer should be able to check them in a streaming context, which is not true in general of Schematron assertions.

  • Attributes can be marked inherited; inherited values are written into the XDM data model instance before assertions and conditional type assignment evaluate any XPath expressions, which means that inherited attributes like xml:lang can be consulted in conditional type assignment and assertions.

    I’m proud of this not only because it helps handle internationalization better, but because it aligns the principle of context-free validation better with some of the core idioms of XML.

  • A precisionDecimal datatype has been added, which is intended to mirror the new IEEE 754-2008 specification of floating-point decimal.

    This one is controversial: some members of the XSL and XML Query working groups are vocal in saying it’s a bad idea, it will complicate their type hierarchy and type coercion rules yet again, and we shouldn’t support it.

    [“Of course, some of the same members of QT also predicted that the IEEE spec would never be finished at all, and that the sky would fall, hell would freeze over, and Intel would fall into the Pacific Ocean before supporting it, didn't they?” said Enrique. “But the spec was published, and Intel is supporting it. So ...” “Hush,” I said. “They'll hear you.” But it doesn't matter: they don't much care what Enrique thinks.]

  • The xsd:redefine construct has been deprecated.

    This is a disappointment to some people, who believe that it had great promise. And they are right: it did have great promise. But the 1.0 spec is vague (to put it charitably) on some points; interoperability problems in 1.0 implementations have been reported and the working group has been unable to agree on the correct interpretation of the 1.0 spec.

  • A simpler mechanism for reusing an existing schema document while changing it selectively is now provided under the name xsd:override. For the situations where redefine turns out to be under- (or over-) specified, override provides relatively clear, straightforward answers.
  • The rules for restriction have been made much simpler and more correct. It is no longer possible to use xsi:type with the name of a member type in order to evade facet restrictions on a union.
  • The determinism rule (the so-called “unique particle attribution” constraint) has been relaxed. It’s now legal for wildcards to compete with element declarations; elements win.
  • It’s easier to specify ‘open content’ and effectively insert wildcards everywhere, without cluttering up your content models.
  • Wildcards can now say, in effect, “any of these, except for those.” Some people call these “negative wildcards”.
  • All-groups can now contain wildcards, the elements and wildcards in all-groups can now have maxOccurs greater than one, and all-groups can be extended.
  • To align better with XPath 2.0 and related specs, the simple type hierarchy now includes an xsd:anyAtomicType. Also, the two totally ordered subtypes of duration defined for XPath 2.0 and related specs have (with the cooperation of the XML Query and XSL working groups) been integrated into the XML Schema namespace.
  • A new facet has been added for requiring the timezone to be present (or absent) in datatypes derived by restriction from any of the date/time types; a dateTimeStamp datatype which requires a timezone has been added, at the suggestion of the OWL working group.
  • Lists and unions contructed from ID and IDREF retain the ID- and IDREFness of the ID and IDREF values. Also, you can have more than one ID on an element, which means it’s now a lot easier to support xml:id without having to whack the rest of your vocabulary.
  • Much of the spec has been rewritten, sentence by sentence and phrase by phrase. It was not possible to reorganize the exposition from the ground up (although I agree with those who believe the spec could use it), but while retaining the same organization we were able to make individual paragraphs and sentences easier to follow and understand. More liberal use of technical terms, variable notation, and section headings may seem like trivial changes, but empirically they appear to have a perceptible effect on the readability of the spec.

    Most users, of course, don’t read the spec, even power users. But implementors do, members of the working group do, members of other working groups who need to layer their stuff on top of XSD do. And some users do. I wish we could do more to make the spec more welcoming and legible for them. But while there is a lot of room for further improvement, I think (if I say so myself) that 1.1 is somewhat easier to read than 1.0. It benefits, of course, from being the second go at formulating these things.

It has been a long, hard slog — I lied to Enrique, we actually did start on it in 2001, though we also were doing a lot of other things at the same time — and I think we would not have made it without the perseverance of the chair, David Ezell of Verifone, representing the National Association of Convenience Stores (to both of whom thanks for seconding David to the group and supporting the time he spends on XSD), and the hard work of Sandy Gao of IBM on the Structures spec and Dave Peterson of SGMLWorks! (who serves as an invited expert) on the Datatypes spec. XSD 1.1 is not a perfect spec, by any means. But it’s an improvement on 1.0, and it’s worth pushing forward for that reason. And without David, and Sandy, and Dave, it would not be happening. Anyone interested in the validation of XML owes these three a debt of gratitude.

The long hard slog is by no means over. Publication as a Candidate Recommendation means the W3C has now called for implementations. If you are a programmer looking for a challenge, I challenge you to implement XSD 1.1! If you are a user, not a provider, of XSD software, urge the supplier of your software to implement XSD 1.1, and test their implementation! The more you push on the implementations now, the stronger they will be when the time comes to demonstrate implementation experience and progress the spec to Proposed Recommendation. And the more experience we will have gained towards the goal of having a broadly supported validation language which supports the full spectrum of XML usage.

[“Wow!” said Enrique. “Did you know that perseverance is a theological term? “‘continuance in a state of grace leading to a state of glory’!” “In other words,” I said, “you looked it up because you didn't think I knew how to spell it correctly, did you?” “Oh, hush,” he said.]

Base 64 binary, hex binary, and bit sequences

Sunday, April 26th, 2009

[26 April 2009]

I had an interesting exchange with a friend recently. He wrote

Presumably a binary octet is a sequence of bits. A sequence of sequences of bits is not a sequence of bits, since a bit is not a sequence of bits. (Please, let’s not violate the axiom of regularity!) Therefore, a finite-length sequence of zero or more binary octets is not a sequence of bits.

The upshot, he said, was that the description of the value space of the base64Binary and hexBinary datatypes was wrong in both XSD 1.0 and XSD 1.1. The 1.0 spec says (in section 3.2.16):

The value space of base64Binary is the set of finite-length sequences of binary octets.

XSD 1.1 says, very similarly (but with pedantic attention to details on which people have professed uncertainty):

The value space of base64Binary is the set of finite-length sequences of zero or more binary octets. The length of a value is the number of octets.

But if my friend is right, and the binary datatypes are intended to have value spaces which are sequences of bits, then those descriptions are wrong. So, my friend continues, the description of the value space really ought to be:

… the set of finite-length concatenations of sequences of zero or more binary octets.

Shouldn’t it?

This sounds plausible at first glance but the answer is no, it shouldn’t. The two binary datatypes can certainly be used, in fairly obvious ways, to encode sequences of bits, but their value space is not the set of bit sequences, not even the set of bit sequences whose length is an integer multiple of eight (and whose length for purposes of the XSD length facet would be their bit length divided by eight).

My friend’s argument suffers, I think, from two faults. Most important, he seems to assume

  1. That an octet is a sequence of bits.
  2. That purpose of base64Binary (and of the Base 64 Encoding on which it is based) is to encode sequences of bits.

I don’t think either of these is true in detail.

Are octets sequences of bits?

Is an octet a sequence of bits? Certainly it’s often thought of that way (e.g. in the Wikipedia article on ‘octet’). But strictly speaking I think the term ‘octet’ is best taken as denoting a group of bits, without assuming a geometry, in which (for purposes of most network transmission protocols) each bit is asociated with a power of two.

But if we number the bits with their powers as 0 .. 7, is the octet the sequence of b0 b1 b2 b3 b4 b5 b6 b7? or b7 b6 b5 b4 b3 b2 b1 b0? Or some other sequence? On architectures where the byte is the smallest addressable unit, there is no requirement that the bits be thought of as being in any particular order, although the design of big- and little-endian machines makes better intuitive sense if we assume least-significant-bit first order for little-endian, and most-significant-first for big-endian. I believe that some protocols for serial port protocols specify least-first, others greatest-first, order (with least-first most common). But I suspect that most networking protocols today (and for a long time since) assume parallel transmission of bits, in which asking about the sequence of bits within an octet is nothing but a category error.

But IANAEE. Am I wrong?

Does base 64 encoding encode bits?

RFC 3548, which defines Base 64 encoding, says

The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.

It uses similar wording for the base 32 and base 16 encodings it also defines.

Note the choice of words: octets, not bits.

I wasn’t there when it was developed, but I’m guessing that base 64 notation is carefully formulated to be agnostic on the sequence of bits within an octet, and to be equally implementable on big-endian, little-endian, and other machines. That would be one reason for it to talk about encoding sequences of octets, not sequences of bits.

One consequence of such agnosticism is that if one wanted to specify a sequence of bits using base64Binary, or hexBinary, one would need to specify what sequence to assign to the eight bits of each octet. And indeed RFC 2045 specifies that “When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first.” I believe that that stipulation is necessary precisely because it doesn’t follow from the nature of octets and thus doesn’t go without saying. The RFCs also don’t say, when you are encoding a sequence of six bits, for example, whether it should be left-aligned or right-aligned in the octet.

Bottom line: I think the XSD spec is right to say the value spaces of the binary types are sequences of octets, not of bits.

[In preparing this post, I notice (a) that RFC 3548 appears to have been obsoleted by RFC 4648, and (b) that XSD 1.1 still cites RFC 2045. I wonder if anyone cares.]

Such a little thing, to provoke so many thoughts

Tuesday, February 3rd, 2009

[3 February 2009]

I saw an interesting bit of stray XML this morning, which raises a number of questions worth mulling over.

The New York Times sends me email every morning with headlines from the day’s papers; this is one of the perqs, I think, of being a subscriber. Given the choice, I asked for the ASCII-only version (like many people I know, I don’t much like HTML email; I’m not sure this is a well-founded prejudice, but there it is).

In today’s headlines, I find the following fragment:

- QUOTATION OF THE DAY -

"Oh, you're one of <em style="i">them."
- IRIS CHAU, recounting an acquaintance's reaction when she said she
worked at a banking company.

My eye was caught (perhaps this is just a deformation professionelle) by the unmatched start-tag for an ‘em’ element, appearing in what was intended to be a markup-free context. This seems to tell us several things about the internal system used by the Times, and to invite several questions:

  • They seem to be using markup (perhaps HTML, perhaps some other XML vocabulary), not just for delivery of the headlines but in the system for preparing features like the quotation of the day.
  • Their XML vocabulary seems to require an inline specification of rendering information. This seems odd; wouldn’t it be natural for a stylesheet to say that ‘em’ elements should be italic? How many other renderings of emphasis are there likely to be in the main story block, in a broadsheet? (Oh, well; not my design, and I don’t know what constraints the vocabulary designer was working under.)
  • Assuming that this quotation was cut and pasted out of the story in which it appears (which does not appear to be ill-formed), could we infer that the journalist who did the cut and paste could was using an editing tool that had no markup awareness? Would a tool with better awareness of markup have helped here? (E.g. by picking up the end-tag of the ‘em’ element, as soon as it picks up the start-tag, as SoftQuad’s Author/Editor used to do [and presumably still does in its current incarnation], or by dropping the start-tag.)
  • Could a better validator somewhere in the system that generates the ASCII headlines have caught this error? Could it have been fixed automatically, or with minimal-cost human intervention?
  • What properties would a validator for that process need to have, to induce people to view it as an asset instead of a nuisance?
  • What properties would a schema language need to have, in order to make it easier to write a validator with those properties? And how might the schema language go about encouraging implementors to write such validators?

Hmmmmm.

Simple proof that the URI grammar of RFC 3986 defines a regular language

Friday, January 16th, 2009

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

   setof(V,X^(refers(V,X);refers(X,V)),Vs),
   setof(L-R,refers(L,R),Es),
   vertices_edges_to_ugraph(Vs,Es,G),
   transitive_closure(G,Gc),
   member(LHS-RHS,Gc),
   member(LHS,RHS).

There were no solutions; from this I infer that that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

   setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.

The line

   setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line

   vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,...], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...)

The line

   transitive_closure(G,Gc),

takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines

   member(LHS-RHS,Gc),
   member(LHS,RHS).

seek a left-hand side LHS in Gc which has a edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answer the question. “Well, I just tried the construction, and it worked.“ He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

An ineffable moment

Wednesday, December 31st, 2008

[2008-12-31+07:00 / 2009-01-01Z]

If we add it all up, the XML Schema Working Group has spent a lot of time, since the beginning of the group, worrying about leap seconds.

XSD 1.0 attempts to accommodate them in its descriptions of the date/time types, but leaves some aspects of behavior unspecified. Accordingly, implementations of XSD 1.0 vary wildly in how carefully they handle leap seconds; not every implementation comes with built-in knowledge of when leap seconds have occurred in the past, and not every implementation enforces the rules which specify that, when they occur in the future, leap seconds will occur only at midnight, UTC, at the end of a month. You can read some things on the Web that seem to imply leap seconds can only occur at the end of June or December, or possibly also March and September, but that’s not what the relevant spec says. The quarter days are to be preferred, but in principle a leap second could occur at the end of any month. You can also read things on the Web that suggest many people are uncertain whether the responsible authorities could in principle insert (or delete) two leap seconds at a time, or even more. I was unsure myself, until some time after I read the relevant specification it finally dawned on me that the answer is no.

[“The relevant spec?” asked Enrique. “Which is that?” “Well, if you want to be precise, it's Recommendation ITU-R TF.460-6: Standard-frequency and time-signal emissions, published by the International Telecommunications Union (Geneva: ITU, February 2002). I don't remember how I managed to acquire a copy.” “And how can you be sure there can never be adjacent leap seconds, if it doesn't say that flat out?” “What it says is that leap seconds are to be added at the end of some month, UTC, in order to keep UTC within 0.8 seconds of the appropriate solar time measure (maybe UT1 or UT2, but it's been a while and I forget the details). If after adding two leap seconds, UTC is within 0.8 seconds of solar time, then before they were added it must have been more than 0.8 seconds off. But it's not allowed to be more than 0.8 seconds off — that's the point of adding or deleting leap seconds.” “What if there was a large change between January and June?” insisted Enrique. “Then the spec implies that a leap second should be added between January and June. The spec does not limit the insertion of leap seconds to December 31 and June 30, it just says to prefer those dates. I think the implication is pretty clear that if you need to add a leap second at the end of May, you are supposed to do so.”

[“Yeah, but what if the world slowed down by two seconds in the course of a single month? Isn't that logically possible?” “Logically possible, yeah, I guess so. But astronomically implausible. If the rotation of the earth starts to vary that much, it's likely to be because a large asteroid just hit us, or something. Under those circumstances, schema-validity is likely to be the least of our worries” “Well, my point exactly,” said Enrique. “If the world is falling apart, that's the last time you want your systems to start failing because the schema validator doesn't like your time stamps. There will be more important things to be worrying about!”]

In developing XSD 1.1, we spent a lot of time trying to nail things down better, but ultimately reached the conclusion that there just was no good way to allow all real leap seconds and only real leap seconds, to handle validation of dateTime values for the future, and to maintain the principle that a document’s schema-validity against a given schema is the same today and in the future; it should not change from day to day depending on decisions made by the managers of Universal Coordinated Time. In the end, we said that XSD 1.1 processors just don’t handle leap seconds at all: the moments in the global time-line which are occupied by leap seconds do not correspond to values in the xsd:dateTime value space.

It’s an important principle of schema design (and of the use of other formalisms as well, I think) that in the general case, what the formal notation can express may be only an approximation to the reality you are modeling. Some things may exist without being able to be spoken. Mostly we mean by that that specific rules that apply in a given context may not be expressible in a given formal notation, since the expressive power of the formalism may be hobbled in order to preserve its tractability. It’s nice, I think, that the principle is also instantiated by the dateTime type: there are some moments of UTC time that cannot be captured as values in the dateTime value space.

All this is on my mind, of course, because one of those moments is scheduled to occur today. At midnight UTC. Any moment, now, in fact.

[Pause.]

[“Shouldn't it be midnight local time?” hissed Enrique. “No, you're thinking of shifting to and from Daylight Savings Time. Leap seconds are inserted at the same moment all around the globe. Hush, now, don't spoil it. Just wait and watch.”]

And there it went. Midnight UTC has passed, and the sequence of seconds shown by the applet at http://www.time.gov for Mountain Time went:

  • 16:59:56
  • 16:59:57
  • 16:59:58
  • 16:59:59
  • 16:59:60 [That's it! That's it!! “Hey, come look at this!” I wanted to call to my wife.]
  • 17:00:00 [“No, wait, never mind. It’s over already.”
  • 17:00:01

All of these past weeks, as events in W3C and in the economy and in the world have gone from bad to worse, I’ve been waiting impatiently to shake the dust of this year from my feet, and yearning for 2009 and a new leaf. The new year will surely be hard in many ways, I tell myself, but it cannot be as bad as the year just ending. As far as I can tell, I am not alone; 2009 has a heavy freight of hope and expectations to carry. A heavier freight than it’s fair to ask any year to bear.

So I like the idea that between the old year, so widely and deservingly anathematized, and the new one which carries so much fragile hope, time paused, just for a second, to gather its forces before picking up its burden and marching forward again.

Happy New Year, o my readers. Happy New Year.

Another way to define and validate overlapping structures

Friday, October 17th, 2008

[17 October 2008]

At a meeting in Edinburgh organized by the e-Science center here, I encountered for the first time a piece of work done some years ago by Siu-wai Leung and others (Siu-wai Leung, Chris Mellish, and Dave Robertson, “Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences,” Bioinformatics 17.3 [2001]: 226-236). The ‘Basic Gene Grammar’ formalism introduced by Leung distinguishes several kinds of arrows separating the left-hand and right-hand sides of grammar rules, so:

  A ---> B, C, D

means, just as in conventional grammar notations, that the lexical form of the non-terminal A is the concatenation of lexical forms of B, C, and D, in order. The rules

  X ===> B, C, D
  Y <=== B, C, D

mean (if I have correctly understood the paper, which is not given) that the lexical form of an X and the lexical form of a Y are regions on characters in which a B, a C, and a D occur, but in which the B, the C, and the D may overlap. The arrow ===> means that the B must start before the C, and the C must start before the D; the arrow <=== means the ends of the B, C, and D regions must occur in that order. (I don’t know how to say that both the beginnings and the ends must occur in order, without using additional non-grammatical constraints, or that the order is unconstrained; I’ll have to buttonhole Mr. Leung during a break and ask.)

[Answer: yes, use the mechanism for additional constraints. To say that A, B, and C can occur in any order, use multiple rules.]

This means I now know of three different formalisms for specifying and validating overlapping structures: the Basic Gene Grammars of Leung’s MSc thesis (1993, and the 2001 paper cited above), the Creole notation described by Jeni Tennison at XTech 2007, and the rabbit/duck grammars I described at Extreme Markup Languages 2006. (There are obviously other possibilities for specification and validation of overlap, but I don’t know of worked-out proposals.)

Several questions arise:

  1. What classes of languages (viewed as string sets) do these mechanisms describe?
  2. What classes of languages (viewed as tree sets) do these mechanisms describe?
  3. How do the expressive poweres of the different notations relate to each other? Are they the same? Do they overlap? Are there subset/superset relations?
  4. For various common textual phenomena (e.g. page structure and formal structure of the text, or verse structure and dramatic structure of verse drama), what would a definition of the structure look like in each of these formalisms?
  5. Are there obvious correspondences between idioms in the different languages?
  6. Are there differences in the likely or possible performance of parsing / validation algorithms for the different formalisms? (The Basic Gene Grammar mechanism is easy to understand in terms of the underlying chart parser, but it would be nice to have a less expensive method of validation than exhaustive checking of all contiguous subsequences in the input. I guess that for markup applications, the use of explicit start- and end-tags might reduce the expected cost of a naive implementation: bracketed languages are cheaper to parse. But I haven’t worked it out clearly.)

All of the talks at this meeting have been interesting and informative, but even had they not been, learning about this formalism would have made me glad to have come to the meeting.