Using our own tools / means and ends

[20 April 2008]

One often hears, in information technology and in spec development, the injunction to eat one’s own dog food, meaning to use, oneself, the technologies one is developing. By confronting the technology as a user, the designer will become aware sooner of flaws in the implementation, gaps in the functionality, or other usability problems. I hate the metaphor, because I don’t think of the technology I work on as dog food, but it’s good advice.

I’ve been thinking lately that I should be making a lot more use, in my work, of the technologies I’m responsible for. There is a radical difference between the attitude of someone who has designed a system, or engaged with it as an object of study, but never much used it, and someone who has actually used the system to do things they wanted to do, treating the system as a means not an end in itself.

I once sat in working group meetings listening to brilliant computer scientists telling us how the working group should approach various problems, and being struck by the fact that if they showed five XML examples of two to ten lines each, at least two of them would be ill-formed. They had a pretty good theoretical grasp of XML, although they didn’t grok its essence. But their mistakes showed clearly that they had never spent as much as thirty minutes writing XML and running it through a parser. It was no wonder that their view of markup was so different from the view of those of us who worked with markup during most of the hours of our working lives. To them, XML was an object of study. To those of us who actively used it, it was no less an object of study, but it was also a means to achieve other ends. I mentioned to one of the computer scientists that the well-formedness errors in his examples made it hard for me to take his proposals seriously, and to his great credit he did address the problem. He never showed an example in XML notation again; instead he wrote everything down in Haskell. (And since he used Haskell more or less constantly, he didn’t make dumb syntax errors in it.)

In practice, I think the ‘use your own tools’ principle means I should spend some time trying to upgrade my personal tool chain to make use of technologies like XProc and SML. (There are some other technologies I should make more use of, too, but I’m not prepared to identify them in public.)

At the same time, I’m also acutely conscious of the difference between experimental systems and production systems. Experimental systems need to be able to change rapidly and radically as we learn. Production systems need to be stable and reliable, and cannot usually change more quickly than the slowest user who relies on them. The maintainers of a production system seldom have the ability to force their users to upgrade or make other changes. (And if they succeed in acquiring it, they will be regarded by their users as tyrannical oppressors to be deceived and evaded whenever possible.)

Specs in development sometimes need to be experimental systems.

Some people will say no, never standardize anything that requires experimentation: only standardize what is already existing practice. The example that sticks in my head from long-ago reading on the theory of standardization is machine-tool tolerances: if no one sells or buys tools with a given tolerance, it’s because either there’s no market for them (so no need to standardize) or no understanding of how to achieve that tolerance (so a standard would be pointless and might get crucial bits wrong out of ignorance); standardize the tolerances people are actually using, and you’ll produce a standard that is useful and based on good knowledge of the domain. This principle may well work for machine tools; I am not a mechanical engineer. But if you wait until a given piece of information technology is already common practice, then by and large you will be looking at a market that has already consolidated into a monopoly. If you’re looking for a non-proprietary standard to provide a level playing field for competing implementations, you’re too late.

In information technology, going out in front of existing practice appears to be the only way to produce a standard that actually defines a non-proprietary technology. Ideally, you don’t want to be too far out in front of existing practice, but you can’t really be behind it, either.

If you’re out in front of well established practice, the spec needs in some sense to be experimental. If the responsible working group finds a new and better way to say or do something, after issuing their second public draft but before the spec is finished, they need to be free to adopt it in the third working draft.

If I rebuild the tool chain for the XSD 1.1 spec to use XProc, for example, that would be interesting, the re-engineering would probably give us a cleaner tool chain, and it might provide useful feedback for the XProc working group. But when the XProc working group changes its mind about something, and the implementation I’m using changes with it, then my tool chain breaks, and not necessarily at a time when it’s convenient to spend time rebuilding it. (Long ago, my brother was trying to persuade me I should be interested in personal computers, which I regarded as toys compared to the mainframe I did my work on. Nothing he said made any dent until he said “Having a personal computer means you get to upgrade your software when you want to, not when the computer center finds it convenient.” That sold me; we bought a personal computer as soon after that as we could afford one.)

Is there a way to manage the tension between a desire to use the tools one is building and the need for the tool chains one uses in production work to be stable?

I don’t know; but I hope, in the coming months, to find out.

Dumbing grammars down

[15 April 2008]

For reasons too complicated to go into just now (I tried, and it takes paragraphs and paragraphs), I have found myself thinking lately about a grammar manipulation process well known to be possible and wondering how to do it in fact.

Background

Sometimes what you have is a context-free language L, but what you need is a related regular language. Or, more precisely, what you have is a context-free grammar for L, and what you need is a regular grammar, an FSA, or a regular expression that recognizes either the largest practicable regular sub-language of L or the smallest regular super-language of L. You might need this because the context-free grammar, in all its hairy glory, is too difficult to handle with the limited resources at your disposal. (The tractable-sublanguage approach was pioneered by students of natural languages, back when they were trying to make the equivalent of an 80386 parse English.) Or you might need it because you want to define an element or attribute whose values are expressions in some language defined by a context-free grammar, and you want to describe those values with an XSD simple type, or with Simon St. Laurent’s Regular Fragmentations or similar tools; but those tools (like the XSD pattern facet) deal only with regular expressions, not context-free grammars.

Since regular grammars are weaker (or dumber) than context-free grammars, the question turns into: how can we most usefully go about dumbing down a context-free grammar to make it into a regular grammar?

The largest practicable regular sub-language

Now, it’s reasonably well known that for any context-free language which allows arbitrarily deep nesting of structures (that is, for any language which is context-free and not merely regular: a context-free language that does not allow arbitrarily deep nesting is in fact regular), it’s possible to describe a related regular language which allows some bounded amount of nesting.

For example: the language of arithmetic expressions defined by this grammar is context-free:

expression = term (add-op term)*
term = factor (mul-op factor)*
factor = NUMBER | VAR-REF | '(' expression ')'
add-op = '+' | '-'
mul-op = '*' | '/'

A similar language that allows nesting of parenthesized expressions only up to some fixed depth is regular: it can be recognized with a finite-state automaton or described with a regular expression. And every sentence in the regular language is a sentence in the original context-free language.
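
To make the bounded-depth version concrete, here is a minimal Python sketch (mine; Grune and Jacobs give only the high-level description) which builds such a regular expression for this grammar, with nesting allowed only to depth k. It simplifies NUMBER and VAR-REF to runs of digits, a simplification I’ll also use below.

import re

def depth_bounded_regex(k):
    # Regular expression for the arithmetic-expression grammar above,
    # with parenthesis nesting allowed only to depth k; NUMBER and
    # VAR-REF are both written as runs of digits.
    expr = r"\d+(?:[-+*/]\d+)*"          # depth 0: no parentheses at all
    for _ in range(k):
        factor = rf"(?:\d+|\({expr}\))"  # one more level of nesting
        expr = rf"{factor}(?:[-+*/]{factor})*"
    return expr

two_deep = re.compile(depth_bounded_regex(2) + r"\Z")
print(bool(two_deep.match("(3+4)*(2-(1+5))")))   # True: nesting depth 2
print(bool(two_deep.match("(((1)))")))           # False: nesting depth 3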

This fact has been exploited before now to make certain parsing and grammar-management tasks more tractable. I read about it first (if memory serves) in Grune and Jacobs, Parsing techniques: A practical guide, who cite a number of papers, most of which I have not read. But while they provide a clear high-level description of how the necessary grammar transformation works, they don’t go into any detail about how to perform the transformation in practice for arbitrary grammars. So it’s been in the back of my mind for some time to think it through and see if I could reduce it to code.

The smallest regular super-language

Sometimes, the regular language you want is not a sub-language, but a super-language. Suppose we wanted to write a regular expression that would accept all legal arithmetic expressions and as few other strings as possible. (If you’re defining a simple type for arithmetic expressions, you don’t want it rejecting things just because the parentheses are nested too deep. So you need a superlanguage, not a sub-language.) The nature of the context-free / regular distinction is that such a super-language would not require parentheses to match up correctly. But it should be possible to enforce some of the other rules relating to parentheses: we want strings like “(3 +)” or “4 (* x)” to be rejected, because in the original grammar operators are not allowed to be adjacent to parentheses in this way, and you don’t need a context-free grammar to enforce those rules.

As it happens, the regular expression \(*d+(\)|[+-*/]\(*d+)* comes pretty close to capturing the rules. (For simplicity, I have just written “d+” for the places where the original grammar has NUMBER or VAR-REF.) That is, unless I have botched something somewhere, any string legal against the original grammar will be legal against this regular expression, and the only other strings it accepts are strings which would be legal if their parentheses matched up properly.
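
For the skeptical, here is a quick Python sanity check of that expression (again with operands written as runs of digits; the test strings are my own):

import re

# The regular expression above, with d+ spelled as runs of digits.
super_re = re.compile(r"\(*\d+(?:\)|[-+*/]\(*\d+)*\Z")

print(bool(super_re.match("(3+4)*2")))   # True: a legal expression
print(bool(super_re.match("3+(4*5")))    # True: only the parens are off
print(bool(super_re.match("(3+)")))      # False: '+' adjacent to ')'
print(bool(super_re.match("4(*2)")))     # False: '(' right after an operand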

Towards a method

Given a particular context-free grammar, it’s generally possible to sit down with it and cipher out how to dumb it down into a regular grammar for fixed depth, or a regular grammar for arbitrary depth but no checks on parenthesis nesting. But how do you do it for arbitrary grammars, methodically and mechanically? How do you reduce it to code?

I daresay it’s in the literature, but I haven’t read much of the relevant literature; I expect to, when I can, but it was more fun to try to solve the problem from scratch, armed only with Grune and Jacobs’ concise description:

Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:

MainClause -->
SubjectSingular VerbSingular Object
| SubjectPlural VerbPlural Object

and duplicate the rest of the grammar accordingly. The result is still regular…. We replicate the grammar the desired number of times and remove the possibility of further levels from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. The resulting grammar will be huge but regular and will be able to profit from all simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be done by a program.

I don’t have code yet. But I do have a sketch of a methodical approach. Some of these steps probably need further explanation, which I do not have time to provide today.

  1. Find the set of non-terminals in the grammar that are reachable from themselves (call these cyclic non-terminals); this step is sketched in code below.
  2. To each production for a relevant non-terminal (details below), add a synthesized attribute to the grammar, to show the nesting level for that non-terminal. Non-terminals are relevant if one of the cyclic non-terminals is reachable from them.
  3. Draw the CFG as an augmented transition network (this amounts to drawing each rule as an FSA, with terminals or non-terminals labeling the arcs).
  4. Using (an adaptation of) the Thompson algorithm, combine the FSAs from the transition network into a single FSA with epsilon transitions as necessary, and only terminals labeling the arcs. Keep track of the attributes using whatever additions to the usual tools your wit can provide. Me, I use counting FSAs as I described them in 2004; I still can’t believe that work is new, but I still haven’t found anything else in that vein.
  5. Eliminate epsilon transitions in the usual way, to clean things up. If your heart is strong, you may also be able to merge states that clearly can be merged safely, but I don’t know how to define that test formally just yet.
  6. You now have a fairly standard counting FSA; it can readily be re-transcribed into a counting regular attribute grammar (CRAG).
  7. To generate a regular sub-language, assign a maximum value to each counter; you now have an attribute grammar over a finite lattice and can generate a normal (non-attributed) grammar in the obvious way.
  8. To generate a regular super-language, just ignore (or: delete) the guards and counters.

Strictly speaking, you can generate both the sub- and the super-language as soon as you have the attribute grammar of step 6.
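
I don’t have the whole method in code, but step 1 at least is easy to mechanize. Here is a minimal Python sketch; the grammar representation (a map from each non-terminal to the set of symbols appearing on the right-hand sides of its productions, which is all that reachability needs) is my own improvisation:

# Right-hand-side symbols for each non-terminal of the arithmetic
# grammar above; repetition operators are irrelevant for reachability.
rhs = {
    "expression": {"term", "add-op"},
    "term":       {"factor", "mul-op"},
    "factor":     {"expression"},   # NUMBER, VAR-REF, parens are terminals
    "add-op":     set(),
    "mul-op":     set(),
}

def reachable(g, start):
    # All non-terminals reachable from start in one or more steps.
    seen, stack = set(), [start]
    while stack:
        for sym in g[stack.pop()]:
            if sym in g and sym not in seen:
                seen.add(sym)
                stack.append(sym)
    return seen

cyclic = {n for n in rhs if n in reachable(rhs, n)}
print(sorted(cyclic))   # ['expression', 'factor', 'term']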

Let me illustrate using the arithmetic expression grammar above.

First, note that expression, term, and factor are all cyclic, and nothing else is. In this particular case, we actually only need one counter, but it does no harm to add all three. For brevity, I’ll abbreviate the non-terminals to single letters.

e(i,j,k) = t(i+1,j,k) (a t(i+1,j,k))*
t(i,j,k) = f(i,j+1,k) (m f(i,j+1,k))*
f(i,j,k) = NUMBER | VAR-REF | '(' e(i,j,k+1) ')'
a = '+' | '-'
m = '*' | '/'

The result of drawing a transition network (not shown) and then mushing it all together, after minimization, with guards and counter operations, is something like this: [figure: FSA for the superlanguage]

A counting regular attribute grammar for this, with e(0) the start symbol, is:

e(n) = '(' e(n+1)
e(n) = NUM q(n)
e(n) = VAR-REF q(n)
q(n) = [+-*/] e(n)
q(n) = {n > 0} ')' q(n-1)
q(n) = {n = 0} ''

(The last line could of course be written “q(0) = ''”, but for reasons I haven’t figured out the form with the explicit and verbose guard seems better, perhaps clearer. Maybe it is.)
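
The counting FSA behind this grammar is easy to simulate directly. Here is a rough Python sketch (token handling simplified: operands arrive pre-tokenized as the single character "d"); bounding the counter gives the sub-language of step 7, and ignoring the guards gives the super-language of step 8:

def accepts(tokens, max_n=None, guards=True):
    # Direct simulation of the counting grammar: state 'e' expects an
    # operand (after optional open parens), state 'q' follows an
    # operand; n is the counter.
    state, n = "e", 0
    for t in tokens:
        if state == "e":
            if t == "(" and (max_n is None or n < max_n):
                n += 1
            elif t == "d":               # stands for NUM or VAR-REF
                state = "q"
            else:
                return False
        else:                            # state 'q'
            if t in "+-*/":
                state = "e"
            elif t == ")" and (not guards or n > 0):
                n = max(n - 1, 0)
            else:
                return False
    return state == "q" and (not guards or n == 0)

print(accepts(list("(d+d)*d")))                  # True
print(accepts(list("((d))"), max_n=1))           # False: nested too deep
print(accepts(list("(d+d"), guards=False))       # True in the superlanguage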

This can be multiplied out in the obvious way to produce an unattributed grammar which accepts two levels of parenthesized expressions:

e = '(' e1
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ''
e1 = '(' e2
e1 = NUM q1
e1 = VAR-REF q1
q1 = [+-*/] e1
q1 = ')' q
e2 = NUM q2
e2 = VAR-REF q2
q2 = [+-*/] e2
q2 = ')' q1
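
The multiplying out is itself mechanical. A small Python sketch (the naming conventions are mine) reproduces the grammar just given, modulo the order of the rules:

def multiply_out(max_n):
    # Instantiate the counting grammar for counter values 0 .. max_n;
    # the non-terminal e with n = 0 is written plain 'e', and so on.
    def name(base, n):
        return base if n == 0 else "%s%d" % (base, n)
    rules = []
    for n in range(max_n + 1):
        e, q = name("e", n), name("q", n)
        if n < max_n:                    # no nesting past the deepest level
            rules.append("%s = '(' %s" % (e, name("e", n + 1)))
        rules.append("%s = NUM %s" % (e, q))
        rules.append("%s = VAR-REF %s" % (e, q))
        rules.append("%s = [+-*/] %s" % (q, e))
        if n > 0:                        # the guard n > 0 for ')'
            rules.append("%s = ')' %s" % (q, name("q", n - 1)))
    rules.append("q = ''")               # the guard n = 0 at the end
    return rules

print("\n".join(multiply_out(2)))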

And the smallest regular superlanguage appears to be

e = '(' e
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ')' q
q = ''

Some open questions and follow-on topics:

  • Can a formally satisfactory definition be given which explicates the intuitive notion of the ‘smallest regular super-language’?
  • If so, can it be proven that the construction above produces a grammar (an FSA, a regular expression) which in fact accepts the smallest regular superlanguage?
  • To check this, I really ought to turn the method into code; if any readers of this have a use for such code in Prolog or XSLT, let me know.
  • It would be interesting to use this technique to write simple types for approximations of arithmetic expressions, and XPath expressions, and various other specialized types that in principle require context-free grammars. I foresee a union type with first a large regular subset, then the smallest regular superset, then arbitrary strings.
  • This whole thing came up because I was worrying about issues relating to test cases for grammars; it would be nice to try to apply this work there.

Caspar

[8 April 2008]

I had another bad dream last night. Enrique came back.

“I don’t think you liked Cheney very much. Or Eric van der Vlist, either, judging from his comment. So I’ve written another conforming XSD 1.0 processor, in a different vein.” He handed me a piece of paper which read:

#!/bin/sh
echo "Input not accepted; unknown format."

“This is Caspar,” he said.

“Caspar as in Weinberger?”

“Hauser. As you can see, Caspar doesn’t have the security features of Cheney, but it’s also conforming.”

I should point out for readers who have not encountered the figure of Kaspar Hauser that he was a mysterious young man found in Nuremberg in the 1820s, raised allegedly without language or human contact, who never quite found a way to fit into society, and is thus a convenient focal point for meditations on language and civilization by artists as diverse as Werner Herzog, Paul Verlaine, and Elizabeth Swados.

“Conforming? I don’t think I want to know why you think so.”

“Oh, sure you do. Don’t be such a wet blanket.”

“I gather you’re going to tell me anyway. OK, if there’s no way out of it, let’s get this over with. Why do you think Caspar is a conforming XSD processor?”

“I’m not required to accept XML, right? Because that, in the logic of the XSD 1.0 spec, would impede my freedom to accept input from a DOM or from a sequence of SAX events. And I’m not required to accept any other specific representation of the infoset. And I’m also not required to document the formats I do accept.”

“There don’t seem to be any conditionals here: it looks at first glance as if Caspar doesn’t support any input formats.” (Enrique, characteristically sloppy, seems to have thought that Hauser was completely alingual and thus could not understand anything at all. That’s not so. But I didn’t think it was worth my time, even during a bad dream, to argue with Enrique over the name of his program.)

“Right. Caspar understands no input formats at all. I love XSD 1.0; I could write conforming XSD 1.0 processors all day; I don’t understand why some people find it difficult!”

Balisage — the paper deadline nears

Two weeks to go before the Balisage paper submission deadline. That’s far enough away that with a little luck and some late nights, you can still get a draft paper put together that is complete enough for review.

But it’s close enough that you have to start now. The Balisage review process has a very tight schedule and long extensions are not an option.

(Enrique interrupts me at this point to remind me that not everyone in the world — not even every reader of this blog — will necessarily know what Balisage is. He’s right.)

Balisage is a new conference on markup and its applications, put on by a group which has, in past years, worked together as the organizing committee for Extreme Markup Languages and, before that, for a variety of other conferences run by IDEAlliance (and the Graphic Communications Association). It will take place this August in Montréal, 12-15 August to be exact, with a pre-conference symposium on 11 August.

Any topic related to the theory or practice of markup (preferably both! there is nothing as practical as a good theory! there is nothing so theoretically fruitful as thoughtful practice!) is in scope. We are hoping for papers on vocabulary design, XQuery, XSLT, SGML, XML, alternatives to XML, overlapping hierarchies, general problems of knowledge engineering and information management, topic maps, OWL, RDF, UBL, validation, the futility of validation, the utility of validation despite its futility, using markup for graphics, SVG, XSL-FO, ontologies, content management, markup theory, information structure, and more. Theory, case studies, developer reports, all are in scope.

Last year, the day before Extreme Markup Languages, we organized a one-day symposium on the handling of overlapping structures, which was hugely informative. This year, there will be a one-day symposium on versioning, covering any and all issues relating to revision, variation, and managing change in XML documents and vocabularies. I confidently expect to learn a lot, and I hope you will, too. But for that to happen, you have to attend.

I’m one of the organizers, so I have a vested interest in making you, dear Reader, think that Balisage is an interesting and informative conference to attend, and a cool place to be, and a wonderful place to get your interesting new idea heard by smart people who care and know about markup.

So don’t take my word for it: check out the Web site and ask around, and decide for yourself. More info at http://www.balisage.net.

I hope to hear from you soon.