[23 January 2011]
Perhaps it’s because the call for participation in the 2011 Balisage Pre-conference Symposium on XML Document Interchange has just come out, or perhaps it’s for other reasons, but I found myself thinking today about the problem of specialized uses of generalized vocabularies.
There are lots of vocabularies defined in fairly general terms: HTML, TEI, DocBook, the NLM article tag set; you can surely think of plenty yourself.
Often, for a specific purpose in a specific organization or project, it would be handy to have a much tighter, much more specific vocabulary (and thus one that’s semantically richer, easier to process, and easier to validate tightly).
For example, consider writing and managing an issues list (or a list of use cases, or any other list containing items of a specialized genre) in a generic vocabulary. It’s easy enough: you just have a section for each issue, and within that section you have standard subsections on where the issue came from, what part of the project it relates to, its current status, and the history of your work on it. And if you’re the kind of person who writes macros in whatever editor you use, you can write a macro to set up a new issue by adding a section of type ‘issue’ with subsections with appropriate types and headings.
But isn’t that precisely what a markup-aware editor typically does? Well, yes, typically: any schema-aware editor can look at the schema, and as soon as you say “add a new issue” it can populate the new element with all of the required subelements. Or it could, if you had an element type called ‘issue’, with appropriately named subelements. If instead you are using a generic ‘div’ element, your editor is going to have a hard time helping you, because you haven’t said what you really mean. You want an issue, but what you’ve said is ‘add a div’.
Some schemas, and some schema languages, try to make this easier by allowing you to say, essentially, that an issue element is a kind of div, and that the content model for issue is a specialization of that for div (and so on). This is better than nothing, but I’m probably not the only person who fails to use these facilities in all the cases where they would be helpful. And when I do use them, I have to extend the standard stylesheets for my generic vocabulary to handle my new elements, because even when the stylesheet language supports the specialization mechanisms of the schema language (as XSLT 2.0 supports element substitution groups in XSD), most stylesheets are not written to take advantage of it. And if I’m exchanging documents with someone else, they may or may not want to have to deal with my extensions to the schema.
I wonder if we might get a better answer if (a) in our schema languages it were as easy to write a rule for div type='issue' as for issue, and (b) in our validation tools it were as easy to apply multiple grammars to a document as a single grammar, and to specify that the class of documents we are interested in is given by the intersection of the grammars, or by their union, or (for grammars A, B, C) by A ∪ (B ∩ ¬ C). Also (c) if for any schema extension mechanism it were easy to generate a transformation to take documents in the extended schema into the base schema, and vice versa.
Perhaps NVDL may be in a position to help with (b), though I’ve never learned it well enough to know and it seems to be more heavily freighted with unacknowledged assumptions about schema languages and validation than I’d like.
And perhaps Relax NG already can handle both (a) and (b).
Homework to do.
RNG can help partially with both (a) and (b). It is indeed as easy to write a pattern for <div type="issue"> as for <issue>, and indeed to have a pattern for <div> distinct from the above. What is not readily written down is a pattern for cases in which @type has an unexpected value; you can perhaps painfully construct a regular expression to match all such values, but I doubt if this can be done in the general case. At least, I see no obvious recipe for doing it.
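To make the first point concrete, here is a minimal sketch in RELAX NG compact syntax. Only div, its type attribute, and issue come from the discussion above; doc, head, source, status, history, and p are names assumed purely for illustration.

```rnc
# Sketch only: all names except div, @type, and issue are assumptions.
start = element doc { (issue-div | issue | plain-div)* }

# (1) A generic div that declares itself to be an issue.
issue-div = element div {
  attribute type { "issue" },
  issue-content
}

# (2) The same structure under a dedicated element name.
issue = element issue { issue-content }

# (3) A plain div with no type attribute, distinct from (1).
plain-div = element div {
  element head { text },
  (para | plain-div | issue-div)*
}

# Content shared by (1) and (2).
issue-content =
  element head { text },
  element source { para+ },
  element status { "open" | "closed" | "deferred" },
  element history { para* }

para = element p { text }
```

Because both forms refer to the same named pattern, writing the rule for the typed div costs exactly one attribute pattern more than writing it for the dedicated element. Note that this sketch simply rejects a div whose type has any other value; it does not describe such divs.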
The good news is that RNG is closed under union, and given grammars A and B, the grammar A | B is valid RNG and means the union of A and B. The bad news is that RNG is not, in general, closed under intersection or negation. The mildly good news is that RNG provides A - B (read “A except B”) where both A and B are simple types with or without facets (e.g. xsd:string - xsd:integer), and where both are names or name classes (e.g. attribute * - foo, element * - xslt:*), but not otherwise.
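As an illustration of the name-class form of “except”, the following compact-syntax sketch uses both examples just mentioned. The wrapper element doc and the pattern names are assumptions made for the example.

```rnc
namespace xslt = "http://www.w3.org/1999/XSL/Transform"

# doc and the pattern names below are hypothetical.
start = element doc { not-foo-attrs, non-xslt* }

# Name-class subtraction on attributes: any attribute except one named foo.
# An attribute pattern with a wildcard name has to be repeated, hence the *.
not-foo-attrs = attribute * - foo { text }*

# Name-class subtraction on elements: any element outside the XSLT
# namespace, with the same attribute rule applied recursively.
non-xslt = element * - xslt:* { not-foo-attrs, (text | non-xslt)* }
```

The subtraction here operates on names only; it says nothing about the content of the excluded elements or attributes.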
Actually, my second paragraph is incomplete: A - B works if B is a typed literal value as well as a type, or a choice thereof. So you can write xsd:string - ("foo" | "bar" | "baz"), which effectively eliminates the restriction I mentioned in the first paragraph.
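Applied to the issues-list case, that subtraction might look like the sketch below (compact syntax again). The names doc, head, status, and p are assumptions; the point is only the except on the type attribute.

```rnc
# Sketch only: doc, head, status, and p are hypothetical names.
start = element doc { (issue-div | other-div)* }

# A div typed as an issue must have the issue substructure.
issue-div = element div {
  attribute type { "issue" },
  element head { text },
  element status { "open" | "closed" },
  para*
}

# Any other div: type is absent, or has any value except "issue".
# A parenthesized choice of literals works in the same position.
other-div = element div {
  attribute type { xsd:string - "issue" }?,
  element head { text }?,
  (para | issue-div | other-div)*
}

para = element p { text }
```

With this arrangement a div carrying type="issue" can only match issue-div, so it cannot slip through validation under the looser generic rule.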
I’m surprised to hear you say Relax NG isn’t closed under negation or (content model) subtraction; I thought I remembered one of Murata-san’s favorite examples of the utility of closure as being the ability to make a schema which accepts all and only the v1.0 documents which are not valid against v2.0 (and which therefore must now all be edited) — a sort of schema-as-complicated-query application.
Looks like you are right and I am wrong: this paper by Haruo Hosoya and Makoto Murata contains a proof of closure under intersection and negation, at least for elements and attributes. Intersection and difference of full RELAX NG with its simple data types may be another story, however.
Well, I hope they come up with something that has legacy support and portability/expandability for the future. It is being “locked in” to a certain schema where things get messy. And it is, 99% of the time, unnecessary.