Flattening structure without losing all rules

[25 November 2013, typos corrected 26 November; spam links removed 27 April 2017]

The other day during the question-and-answer session after my talk at Markupforum 2013, Marko Hedler speculated that some authors (and perhaps some others) might prefer to think of text as having a rather flat structure, not a hierarchical nesting structure: a sequence of paragraphs, strewn about with the occasional heading, rather than a collection of hierarchically nesting sections, each with a heading of the appropriate level.

One possible argument for this would be how well the flat model agrees with the text model implicit in word processors. Whether word processors use a flat model because their developers have correctly intuited what authors left to their own devices think, or whether authors who want a flat structure think as they do because their minds have been poisoned by exposure to word processors, we need not now decide.

In a hallway conversation, Prof. Hedler wondered if it might not be possible to invent some sort of validation technique to define such a flat sequence. To this, the immediate answer is yes (we can validate such a sequence) and no (no new invention is needed): as the various schemas for HTML demonstrate, it’s easy enough to define a text body as a sequence of paragraphs and headings:

<!ELEMENT body (p | ul | ol | blockquote
| h1 | h2 | h3 ...)*>

What, we wondered then, if we want to enforce the usual rules about heading levels? We need to check the sequence of headings and ensure that any heading is at most one level deeper than the preceding heading, so a level-one heading may be followed only by another level-one heading or by a level-two heading, not by a heading of level three or greater, while (for example) a fourth-level heading can be immediately followed by a fifth-level heading, but not by a sixth-level heading. And so forth.

The discussion having begun with the conjecture that conventional vocabularies (like TEI or JATS) use nesting sections because nesting sections are the kinds of things we can specify using context-free grammars, we were inclined to believe that it might be hard to perform this level check with a conventional grammar-based schema language. In another talk at the conference, Gerrit Imsieke sketched a solution in Schematron (from memory, something like this rule for an h1: following-sibling::* [matches(local-name(), '^hd')] [1] /self::*[self::h1 or self::h2] — that can’t be right as it stands, but you get the idea).
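
For the record, a fuller version of that rule might run roughly as follows (my reconstruction, untested; it assumes a Schematron schema with queryBinding="xslt2", so that matches() and empty() are available, and it generalizes the check to all six heading levels):

<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="h1 | h2 | h3 | h4 | h5 | h6">
    <sch:let name="level" value="number(substring(local-name(), 2))"/>
    <sch:let name="next-head"
             value="following-sibling::*[matches(local-name(), '^h[1-6]$')][1]"/>
    <!-- the next heading sibling, if any, may be at most one level deeper -->
    <sch:assert test="empty($next-head)
                      or number(substring(local-name($next-head), 2)) le $level + 1">
      A heading may be followed only by a heading at most one level
      deeper than itself.
    </sch:assert>
  </sch:rule>
</sch:pattern>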

It turns out we were wrong. It’s not only not impossible to check heading levels in a flat structure using a grammar-based schema language, it’s quite straightforward. Here is a solution in DTD syntax:

<!ENTITY p-level 'p | ul | ol | blockquote' >
<!ENTITY pseq '(%p-level;)+' >
<!ENTITY h6seq 'h6, %pseq;' >
<!ENTITY h5seq 'h5, (%pseq;)?, (%h6seq;)*'>
<!ENTITY h4seq 'h4, (%pseq;)?, (%h5seq;)*'>
<!ENTITY h3seq 'h3, (%pseq;)?, (%h4seq;)*'>
<!ENTITY h2seq 'h2, (%pseq;)?, (%h3seq;)*'>
<!ENTITY h1seq 'h1, (%pseq;)?, (%h2seq;)*'>

<!ELEMENT body ((%pseq;)?, (%h1seq;)*)>

Note: as defined, these fail (for simplicity’s sake) to guarantee that a section at any level contains either paragraph-level elements or subsections (deeper headings) or both; a purely mechanical reformulation can make that guarantee:

<!ENTITY h1seq 'h1,
((%pseq;, (%h2seq;)*)
| (%h2seq;)+)' >
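
To see the effect: a flat body like the following satisfies the content model (in either formulation), since each heading is at most one level deeper than the heading before it, while the same body with the h2 elements changed to h3 (so that an h3 follows the h1 with no intervening h2) is rejected:

<body>
  <p>...</p>
  <h1>...</h1>
  <p>...</p>
  <h2>...</h2>
  <p>...</p>
  <h2>...</h2>
  <p>...</p>
</body>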

Vocabulary specialization and generalization, use and interchange: things our schema languages should make easier than they do

[23 January 2011]

Perhaps it’s because the call for participation in the 2011 Balisage Pre-conference Symposium on XML Document Interchange has just come out, or perhaps it’s for other reasons, but I found myself thinking today about the problem of specialized uses of generalized vocabularies.

There are lots of vocabularies defined in fairly general terms: HTML, TEI, DocBook, the NLM article tag set; you can surely think of plenty more yourself.

Often, for a specific purpose in a specific organization or project, it would be handy to have a much tighter, much more specific vocabulary (and thus one that’s semantically richer, easier to process, and easier to validate tightly). For example, consider writing and managing an issues list (or a list of use cases, or any other list containing items of a specialized genre) in a generic vocabulary. It’s easy enough: you just have a section for each issue, and within that section you have standard subsections on where the issue came from, what part of the project it relates to, its current status, and the history of your work on it. Easy enough. And if you’re the kind of person who writes macros in whatever editor you use, you can write a macro to set up a new issue by adding a section of type ‘issue’ with subsections with appropriate types and headings. But isn’t that precisely what a markup-aware editor typically does? Well, yes: any schema-aware editor can look at the schema, and as soon as you say “add a new issue” it can populate the new element with all of the required subelements. Or rather, it could, if you had an element type called ‘issue’, with appropriately named sub-elements. If instead you are using a generic ‘div’ element, your editor is going to have a hard time helping you, because you haven’t said what you really mean. You want an issue, but what you’ve said is ‘add a div’.
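
To make the contrast concrete, here is roughly what the two versions of an issue might look like (the element and attribute names are invented for illustration, not taken from any particular vocabulary):

<!-- generic vocabulary: all the editor has been told is that this is a div -->
<div type="issue">
  <head>...</head>
  <div type="source">...</div>
  <div type="relates-to">...</div>
  <div type="status">...</div>
  <div type="history">...</div>
</div>

<!-- specialized vocabulary: the schema itself says what an issue contains -->
<issue>
  <head>...</head>
  <source>...</source>
  <relates-to>...</relates-to>
  <status>...</status>
  <history>...</history>
</issue>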

Some schemas, and some schema languages, try to make this easier by allowing you to say, essentially, that an issue element is a kind of div, and that the content model for issue is a specialization of that for div (and so on). This is better than nothing, but I’m probably not the only person who fails to use these facilities in all the cases where they would be helpful. And when I do use them, I have to extend the standard stylesheets for my generic vocabulary to handle my new elements, because even when the stylesheet language supports the specialization mechanisms of the schema language (as XSLT 2.0 supports element substitution groups in XSD), most stylesheets are not written to take advantage of it. And if I’m exchanging documents with someone else, they may or may not want to have to deal with my extensions to the schema.
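
In XSD, for instance, the ‘issue is a kind of div’ story is told with substitution groups and type derivation, along these lines (a bare sketch, with invented type names):

<xs:element name="div" type="divType"/>
<xs:element name="issue" type="issueType" substitutionGroup="div"/>
<!-- issueType must be derived, by restriction or extension, from divType,
     so that issue may appear wherever the content models allow div -->

A schema-aware XSLT 2.0 stylesheet can then match schema-element(div) and catch issue as well; but, as I say, most stylesheets for the generic vocabularies aren’t written that way.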

I wonder if we might get a better answer if (a) in our schema languages it were as easy to write a rule for div type='issue' as for issue, and (b) in our validation tools it were as easy to apply multiple grammars to a document as a single grammar, and to specify that the class of documents we are interested in is given by the intersection of the grammars, or by their union, or (for grammars A, B, C) by A ∪ (B ∩ ¬ C). Also (c) if for any schema extension mechanism it were easy to generate a transformation to take documents in the extended schema into the base schema, and vice versa.

Perhaps NVDL is in a position to help with (b), though I’ve never learned it well enough to know, and it seems to be more heavily freighted with unacknowledged assumptions about schema languages and validation than I’d like.

And perhaps Relax NG already can handle both (a) and (b).
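
For (a), at any rate, something like the following pattern (a quick, untested sketch in Relax NG’s XML syntax, with invented pattern names) seems to give div type='issue' a content model of its own; how cleanly such a pattern coexists with a generic div pattern in the same grammar I haven’t checked:

<define name="issue-div" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="div">
    <attribute name="type">
      <value>issue</value>
    </attribute>
    <ref name="head"/>
    <ref name="issue-source"/>
    <ref name="issue-status"/>
    <ref name="issue-history"/>
  </element>
</define>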

Homework to do.

Xerophily, a parser for XSD regular expressions

[30 December 2009]

A while back, in connection with the work of the W3C XML Schema working group, I wrote a parser, in Prolog, for the regular-expression language defined by the XSD spec. This has been on the Web for some time, at http://www.w3.org/XML/2008/03/xsdl-regex/.

Not long ago, though, I received an inquiry from Jan Wielemaker, the maintainer of SWI-Prolog (and ipso facto someone to whom I owe a great debt, along with every other user of that well maintained system), asking if it wouldn’t be possible to re-issue the code under the GNU Lesser General Public License, so that he could use parts of it in some work he’s doing on support for the simple types of XSD. (Strictly speaking, he could do that under the W3C license, but that would mean he has to manage not one license but two for the same library: too complicated.)

Fine with me, but since I wrote it while working at W3C, I don’t own the copyright. To make a long story short, after some cavilling and objections and assorted mutual misunderstandings, agreement was reached, and W3C gave their blessing.

The LGPL boilerplate assumes that the package being issued has a name, so I have given the XSD regex parser the name Xerophily.

[“Xerophily? What on earth is that?” asked Enrique. The word, I explained to him, denotes the property possessed by plants well adapted to growing in dry, especially hot and dry, conditions. “Well, that makes it apt for code produced in New Mexico, I guess,” he allowed. “And your code, like those plants, is a little scraggly and scrawny. But what name are you going to use for your next program?” “It’s also one of the few words I could find containing X (for XSD), R (for regular expression) and P (for parser, and for Prolog).” “Oh.”]

A word of caution: Xerophily was written to support some very dry work on the grammar of the regex language; it doesn’t really have a user interface; and it doesn’t actually perform regex matching against strings. It just parses the regex into an abstract syntax tree (AST). I do have code, or I think I did some years back, to do the matching, but the shape of the AST has changed since then. In other words, Jan Wielemaker may be the only person on earth who could have any plausible reason for wanting to use or read this code.

But for what it’s worth, as of a few minutes ago, it’s on the Web, under the LGPL, at the address http://www.w3.org/XML/2008/03/xsdl-regex/LGPL/. The W3C-licensed version remains available, at the address given above.

Automata and infinite strings

[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you’re using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it’s not clear that that’s true. Instead of the sequence "3.141592...", store an algorithm for calculating, on demand, the nth digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn’t what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “finite” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that in principle automata theorists have historically not restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said, without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: there is no final state.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

  1. Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
  2. Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer). Do these rules for acceptance of strings define sets of automata with different discriminating power?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will eventually spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.

[Enrique tells me I should pause to draw pictures of these automata, but I’m not going to take the time just yet. Apologies to those who find it hard to visualize what I’m talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation, 1/7 has an infinitely repeating sequence of digits (142857, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by an automaton with finite states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

  3. Accept if at every point in the sequence of states there are infinitely many accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.
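
Putting the three rules more compactly (writing the run of the automaton on the input as the state sequence q_0 q_1 q_2 …, and F for the set of accept states):

  1. ∃ i, ∃ q ∈ F, ∀ j ≥ i : q_j = q (the run eventually settles forever into one accept state).
  2. ∃ i, ∀ j ≥ i : q_j ∈ F (from some point on, every state in the run is an accept state).
  3. ∀ i, ∃ j ≥ i : q_j ∈ F (accept states recur infinitely often in the run).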

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second story about acceptance, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed (infinitely often) with other symbols, will not be accepted under that rule by this automaton (nor by any other, as far as I can tell); under the third rule, by contrast, this same automaton does accept such a string, since it is in state B infinitely often.

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in Introduction to automata theory, languages, and computation (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: a practical guide, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, The vastness of natural languages (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in Automata studies, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

XML as a sort-of open format

[3 December 2009]

I just encountered the following statements in technical documentation for a family of products which I’ll leave nameless.

This document does not describe the complete XML schema for either [Application 1] or [Application 2]. The complete XML schema for both applications is not available and will not be made public.

Perhaps there can be good reasons for such a situation. Perhaps the developers really don’t know how to use any existing schema language to describe the set of documents they actually accept; perhaps only a Turing machine can actually identify the set of documents accepted, and the developers were unwilling to work with a simpler set whose membership could be more cheaply decided. (Well, wait, those may be reasons, but they don’t actually qualify as “good”.)

I wonder whether this is an insidious attempt to look like the products have an open format (See? it’s XML! How much more open can you get?) while ensuring that the commercial products in question remain the only arbiters of acceptable documents? Or whether the programmers in question were just too lazy to specify a clean vocabulary and ensure that their software handles all documents which meet some standard of validity that does not require Turing completeness?

Having a partially defined XML format is, at least for me, still a great deal more convenient than having the format be binary and completely undocumented. But it certainly seems to fall a long distance short of what XML could make possible.