[25 November 2013, typos corrected 26 November; spam links removed 27 April 2017]
The other day during the question-and-answer session after my talk at Markupforum 2013, Marko Hedler speculated that some authors (and perhaps some others) might prefer to think of text as having a rather flat structure, not a hierarchical nesting structure: a sequence of paragraphs, strewn about with the occasional heading, rather than a collection of hierarchically nesting sections, each with a heading of the appropriate level.
One possible argument for this would be how well the flat model agrees with the text model implicit in word processors. Whether word processors use a flat model because their developers have correctly intuited what authors, left to their own devices, think, or whether authors who want a flat structure think as they do because their minds have been poisoned by exposure to word processors, we need not now decide.
In a hallway conversation, Prof. Hedler wondered if it might not be possible to invent some sort of validation technique to define such a flat sequence. To this, the immediate answer is yes (we can validate such a sequence) and no (no new invention is needed): as the various schemas for HTML demonstrate, it’s easy enough to define a text body as a sequence of paragraphs and headings:
<!ELEMENT body (p | ul | ol | blockquote | h1 | h2 | h3 ...)*>
What, we wondered then, if we want to enforce the usual rules about heading levels? We need to check the sequence of headings and ensure that any heading is at most one level deeper than the preceding heading, so a level-one heading is always followed by another level-one heading or by a level-two heading, but not by a heading of level three or greater, while (for example) a fourth-level heading can be immediately followed by a fifth-level heading, but not a sixth-level heading. And so forth.
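To make the rule concrete, here is a minimal sketch in Python of the check just described. It is only an illustration, not anything proposed at the conference: the function name heading_level_errors is hypothetical, and the sketch assumes the body has already been reduced to a flat list of element names. It looks only at consecutive pairs of headings, so it says nothing about which level the first heading must have.

```python
import re

# Element names h1 through h6 are headings; everything else is
# treated as a paragraph-level element and skipped.
HEADING = re.compile(r"^h([1-6])$")


def heading_level_errors(names):
    """Return (index, previous_level, level) for every heading that is
    more than one level deeper than the heading before it."""
    errors = []
    previous = None  # level of the most recent heading, if any
    for index, name in enumerate(names):
        match = HEADING.match(name)
        if match is None:
            continue  # not a heading, so it cannot break the rule
        level = int(match.group(1))
        if previous is not None and level > previous + 1:
            errors.append((index, previous, level))
        previous = level
    return errors


# An h1 followed by an h2 is fine; an h1 followed by an h3 is not.
print(heading_level_errors(["h1", "p", "h2", "p", "h1"]))  # []
print(heading_level_errors(["h1", "p", "h3"]))             # [(2, 1, 3)]
```

Returning to a shallower level is always allowed under this rule, which is why only the case of going deeper is tested.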
The discussion having begun with the conjecture that conventional vocabularies (like TEI or JATS) use nesting sections because nesting sections are the kinds of things we can specify using context-free grammars, we were inclined to believe that it might be hard to perform this level check with a conventional grammar-based schema language. In another talk at the conference, Gerrit Imsieke sketched a solution in Schematron. From memory, it was something like this rule for an h1:
following-sibling::*[matches(local-name(), '^hd')][1]/self::*[self::h1 or self::h2]
That can’t be right as it stands, but you get the idea.
It turns out we were wrong. It’s not only not impossible to check heading levels in a flat structure using a grammar-based schema language, it’s quite straightforward. Here is a solution in DTD syntax:
<!ENTITY % p-level 'p | ul | ol | blockquote' >
<!ENTITY % pseq '(%p-level;)+' >
<!ENTITY % h6seq 'h6, %pseq;' >
<!ENTITY % h5seq 'h5, (%pseq;)?, (%h6seq;)*' >
<!ENTITY % h4seq 'h4, (%pseq;)?, (%h5seq;)*' >
<!ENTITY % h3seq 'h3, (%pseq;)?, (%h4seq;)*' >
<!ENTITY % h2seq 'h2, (%pseq;)?, (%h3seq;)*' >
<!ENTITY % h1seq 'h1, (%pseq;)?, (%h2seq;)*' >
<!ELEMENT body ((%pseq;)?, (%h1seq;)*) >
Note: as defined, these fail (for simplicity’s sake) to guarantee that a section at any level must contain either paragraph-level elements or subsections (lower-level headings) or both; a purely mechanical reformulation can make that guarantee:
<!ENTITY % h1seq 'h1, ((%pseq;, (%h2seq;)*) | (%h2seq;)+)' >
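For readers who would like to experiment without a DTD validator at hand, the content model can be transcribed fairly directly into a regular expression over element names. The following Python sketch does that. It is an illustration under stated assumptions, not a substitute for real validation: the function names build_body_pattern and is_valid_body are hypothetical, the body is assumed to be given as a flat list of child element names, and the strict option assumes that the reformulation shown for h1seq is applied in the same way to h2seq through h5seq.

```python
import re

# Paragraph-level element names, as in the p-level entity.
P_LEVEL = ("p", "ul", "ol", "blockquote")


def build_body_pattern(strict=False):
    """Transcribe the content model of body into a regular expression.

    Each element name is encoded as the name plus one trailing space,
    so a sequence of children becomes a single string.
    """
    pseq = "(?:(?:" + "|".join(P_LEVEL) + ") )+"
    # h6seq: an h6 followed by at least one paragraph-level element.
    seq = "h6 " + pseq
    # Build h5seq, h4seq, ... h1seq, each in terms of the level below.
    for level in range(5, 0, -1):
        if strict:
            # Assumed generalisation of the stricter h1seq: a section
            # must contain paragraphs, subsections, or both.
            seq = f"h{level} (?:{pseq}(?:{seq})*|(?:{seq})+)"
        else:
            seq = f"h{level} (?:{pseq})?(?:{seq})*"
    # body: optional leading paragraphs, then any number of h1 sections.
    return re.compile(f"(?:{pseq})?(?:{seq})*")


def is_valid_body(names, strict=False):
    """Check a flat list of child element names against the model."""
    text = "".join(name + " " for name in names)
    return build_body_pattern(strict).fullmatch(text) is not None


print(is_valid_body(["p", "h1", "p", "h2", "p", "h3", "p"]))  # True
print(is_valid_body(["h1", "p", "h3", "p"]))                  # False
print(is_valid_body(["h1"]))                                  # True
print(is_valid_body(["h1"], strict=True))                     # False
```

That the transcription works at all is the point: since the depth of headings is bounded at six, the flat model with its level check is a regular language, and no nesting in the document is needed to express it.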