Writing tight

[13 November 2008]

Dimitre Novatchev has called attention to a recent question on the
stackoverflow programming Q and A web site:

I have an XPath expression which provides me a sequence of values like the one below:

1 2 2 3 4 5 5 6 7

It is easy to convert this to a set of unique values “1 2 3 4 5 6 7” using the distinct-values function. However, what I want to extract is the list of duplicate values = “2 5”. I can’t think of an easy way to do this. Can anyone help?

Dimitre’s solution is beautiful and concise: 21 characters long (longer if you use variables with names longer than singler characters), and well worth the five or ten minutes it took me to work out why it works. I won’t explain it to you, dear reader; you deserve the brief puzzlement and Aha-moment of understanding it.

Despite being terse, it’s not the kind of thing you’d enter in an Obfuscated XPath contest, it just uses an idiom I haven’t seen before. I’ll see it again, though, because I’ll use it myself; as I say, it’s beautiful. (I do confess to a certain curiosity about how he would modify it if, as he says, efficiency needed to be addressed.

Dmitre gets my vote for this month’s best programming-language application of Strunk and White’s rule: “Omit needless words!”

Thinking about test cases for grammars

[19 June 2008]

I’ve been thinking about testing and test cases a lot recently.

I don’t have time to write it all up, and it wouldn’t fit comfortably in a single post anyway. But several questions have turned out to provide a lot of food for thought.

The topic first offered itself in connection with several bug reports against the grammar for regular expressions in XSD Part 2: Datatypes, and with the prospect of revising the grammar to resolve the issues. When revising a grammar, it would be really useful to be confident that the changes one is making change the parts of the language one wants to change, and leave the rest of the language untouched. In the extreme case, perhaps we don’t want to change the language at all, just to reformulate the grammar to be easier to follow or to provide semantics for. How can we be confident that the old grammar and the new one describe the same language?

For regular languages, I believe the problem is fairly straightforward. You can calculate the minimal FSA for each language and check whether the two minimal FSAs are isomorphic. Or you can calculate both set differences (L1 – L2 and L2 – L1) and check that both of them are the empty set. And there are tools like Grail that can help you perform the check, although the notation Grail uses is just different enough from the notation XSD uses to make confusion and translation errors possible (but similar enough that you think it would be over-engineering to try to automate the translation).

Buyt for context-free languages, the situation is not so good. In principle, the equivalence of context-free languages is decidable, but I would have to spend time rereading Hopcroft and Ullman, or Grune and Jacobs, to figure out how to go about it. And I don’t know of any good grammar-analysis tools. (When I ask people, they say the closest thing they know of to a grammar analysis tool are the error messages from yacc and its ilk.) So even if one did buckle down and try to prove the original form of the grammar and the new form equivalent, the possibility of errors in the proof is quite real and it would be nice to have a convenient way of generating a thorough set of test cases.

I can think of two ways to generate test cases:

  • Generation of random or pseudo-random strings; let’s call this Monte Carlo testing.
  • Careful systematic creation of test cases. I.e., hard work, either in the manual construction of tests or in setting things up for automatic test generation.

Naturally my first thought was how to avoid hard work by generating useful test cases with minimal labor.

The bad news is that this only led to other questions, like “what do you mean by useful test cases?”

The obvious answer is that in the grammar comparison case, one wants to generate test cases which will expose differences in the languages defined by the two grammars, just as in the case of software one wants test cases which will expose errors in the program. The parallel suggests that one might learn something useful by attempting to apply general testing principles to grammars and to specifications based on grammars.

So I’ve been thinking about some questions which arise from that idea. In much of this I am guided by Glenford J. Myers, The art of software testing (New York: Wiley, 1979). If I had no other reasons for being grateful to Paul Cotton, his recommending Myers to me would still put me in his debt.

  • For the various measures of test ‘coverage’ defined by Myers (statement coverage, decision coverage, condition coverage, decision/condition coverage, multiple condition coverage), what are the corresponding measures for grammars?
  • If one generates random strings to use as test cases, how long does the string need to be in order to be useful? (For some meaning of “useful” — for example, in order to ensure that all parts of the grammar can be exercised.)
  • How long can the strings get before they are clearly not testing anything that shorter strings haven’t already tested adequately (for some meaning of “adequate”)?
  • From earlier experience generating random strings as test cases, I know that for pretty much any definition of “interesting test case”, the large majority of random test cases are not “interesting”. Is there a way to increase the likelihood of a test case being interesting? A way that doesn’t involve hard work, I mean.
  • How good a job can we do at generating interesting test cases with only minimal understanding of the language, or with only minimal analysis of its grammar or FSA?
  • What kinds of analysis offer the best bang for the buck in terms of improving our ability to generate test cases automatically?

Transitive apertures

Sometimes ignorance really is bliss.

If I had normal mathematical training, I would probably know the answers to some of the questions that have been puzzling me lately. But I would have missed the fun of figuring out the answers for myself.

Consider a relation R on some domain, and consider its transitive closure R*. Or, for some purposes, consider its positive transitive closure R+.

Now, sometimes when one is interested both in R and in R*, it seems convenient to have R be as small as it is possible to make it, as long as the transitive closure remains the same. So if there are any ‘redundant’ members of R — members that we can remove without changing R*, we would like to remove them. Since this operation feels like the opposite of transitive closure, I have privately called it “transitive aperture” since I first encountered it nine or ten years ago.

Given two relations R and R’, R’ is a transitive aperture of R if and only if R’ is a subset of R and R’* = R*. R’ is a minimal transitive aperture of R if and only if R’ has no subset R” such that R”* = R’* = R*.

For example: consider the < (less-than) relation on integers. It’s clear (a) that the transitive closure of < is < itself, and (b) that the successor relation on integers (1 = succ(0), 2 = succ(1), etc.) has the same transitive closure. It’s also clear (intuitively — I haven’t tried a proof) that if you omit anything from the successor relation, its transitive closure won’t be <, so the successor relation is apparently a minimal transitive aperture of <.

Now, in all the examples I’ve thought of so far, it’s been obvious both how to calculate the minimal transitive aperture of a given relation, and that there was only one.

But I’ve been asking myself several questions recently. Can I define the notion of minimal transitive aperture cleanly? Does every relation have one? Is it unique? I.e., does every relation have exactly one? Is there an algorithm for finding a minimal transitive aperture for an arbitrary relation?

I spent some time thinking about this today, while driving to Santa Fe and back, and I believe I know some of the answers. I think the definition above is clean and precise. And I think it’s obvious that there’s an algorithm for finding all minimal transitive apertures for any finite relation. (It’s finite, right? Take all the subsets and see if they are (a) a transitive aperture of R and (b) a minimal transitive aperture — if R is finite, then R has a finite number of subsets, so the algorithm will terminate. I didn’t say it was fast, I just said there’s an algorithm.) I don’t know whether there’s an algorithm for infinite relations. And it’s clear that for some relations, there is more than one minimal transitive aperture. But I don’t have time tonight to sketch any of this. I shouldn’t be writing this in the first place, I should be packing for a trip, but I wanted to get at least the formulation of the problem down.

The idea can also be formulated in terms of graphs. Take any directed graph G; you have a set of nodes and a relation on that set. Take the transitive closure of that relation, and draw another graph T with the same set of nodes as G, and arcs representing the transitive closure of the arcs of G. Call T the transitive closure of G, or the reachability graph of G. (For any two nodes n and m, there is an arc from n to m in T if and only if m is reachable from n in G.) A transitive aperture of G is a digraph with the same set of nodes, a subset of G’s arcs, and the same reachability graph.

It’s clear, as I say, that there can be multiple minimal apertures that have the same transitive closure. Any graph of three nodes a, b, c with a cycle among all three nodes a → b → c → a will do to show this: its transitive closure is a complete graph of three nodes. But a → c → b → a has the same transitive closure. QED.

So it’s clear that for digraphs with cycles, there may be multiple minimal transitive apertures.

And for acyclic digraphs?

Dumbing grammars down

[15 April 2008]

For reasons too complicated to go into just now (I tried, and it takes paragraphs and paragraphs), I have found myself thinking lately about a grammar manipulation process well known to be possible and wondering how to do it in fact.


Sometimes what you have is a context-free language L, but what you need is a related regular language. Or, more precisely, what you have is a context-free grammar for L, and what you need is a regular grammar or an FSA or a regular expression that recognizes either the largest practicable regular sub-language of L or the smallest regular super-language of L. You might need this because the context-free grammar, in all its hairy glory, is too difficult to handle with the limited resources at your disposal. (The tractable-sublanguage approach was pioneered by students of natural languages, back when they were trying to make the equivalent of an 80386 parse English.) Or you might need it because you want to define an element or attribute for expressions of some kind defined by a context-free grammar, and you want to define a simple type for them, for use with XSD, or you want to use Simon St. Laurent’s regular fragmentation or similar tools to describe them, but the tools (like the XSD pattern facet) only deal with regular expressions, not context-free grammars.

Since regular grammars are weaker (or dumber) than context-free grammars, the question turns into: how can we most usefully go about dumbing down a context-free grammar to make it into a regular grammar?

The largest practicable regular sub-language

Now, it’s reasonably well known that for any context-free language which allows arbitrarily deep nesting of structures (that is, for any context-free language at all — if it doesn’t allow arbitrarily deep nesting, the language isn’t context-free), it’s possible to describe a related regular language which allows some bounded amount of nesting.

For example: the language of arithmetic expressions defined by this grammar is context-free:

expression = term (add-op term)*
term = factor (mul-op factor)*
factor = NUMBER | VAR-REF | '(' expression ')'
add-op = '+' | '-'
mul-op = '*' | '/'

A similar language that allows nesting of parenthesized expressions only up to some fixed depth is regular, not context-free, and can be recognized with a finite-state automaton or described with a regular expression. And every sentence in the regular language is a sentence in the original context-free language.

This fact has been exploited before now to make certain parsing and grammar-management tasks more tractable. I read about it first (if memory serves) in Grune and Jacobs, Parsing techniques: A practical guide, who cite a number of papers, most of which I have not read. But while they provide a clear high-level description of how the necessary grammar transformation works, they don’t go into any detail about how to perform the transformation in practice for arbitrary grammars. So it’s been in the back of my mind for some time to think it through and see if I could reduce it to code.

The smallest regular super-language

Sometimes, the regular language you want is not a sub-language, but a super-language. Suppose we wanted to write a regular expression that would accept all legal arithmetic expressions and as few other strings as possible. (If you’re defining a simple type for arithmetic expressions, you don’t want it rejecting things just because the parentheses are nested too deep. So you need a superlanguage, not a sub-language.) The nature of the context-free / regular distinction is that such a super-language would not require parentheses to match up correctly. But it should be possible to enforce some of the other rules relating to parentheses: we want strings like “(3 +)” or “4 (* x)” to be rejected, because in the original grammar operators are not allowed to be adjacent to parentheses in this way, and you don’t need a context-free grammar to enforce those rules.

As it happens, the regular expression (*d+()|[+-*/]d+)* comes pretty close to capturing the rules. (For simplicity, I have just written “d+” for the places where the original grammar has NUM or VAR-REF.) That is, unless I have botched something somewhere, any string legal against the original grammar will be legal against this regular expression, and nothing else will except strings that would be correct except that their parentheses are messed up.

Towards a method

Given a particular context-free grammar, it’s generally possible to sit down with it and cipher out how to dumb it down into a regular grammar for fixed depth, or a regular grammar for arbitrary depth but no checks on parenthesis nesting. But how do you do it for arbitrary grammars, methodically and mechanically? How do you reduce it to code?

I daresay it’s in the literature, but I haven’t read much of the relevant literature; I expect to, when I can, but it was more fun to try to solve the problem from scratch, armed only with Grune and Jacobs’ concise description

Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:

MainClause -->
SubjectSingular VerbSingular Object
| SubjectPlural VerbPlural Object

and duplicate the rest of the grammar accordingly. The result is still regular…. We replicate the grammar the desired number of times and remove the possibility of further levels from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. the resulting grammar will be huge but regular and will be able to profit fom all simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be one by a program.

I don’t have code yet. But I do have a sketch of a methodical approach. Some of these steps probably need further explanation, which I do not have time to provide today.

  1. Find the set of non-terminals in the grammar that are reachable from themselves (call these cyclic non-terminals).
  2. To each production for a relevant non-terminal (details below), add a synthesized attribute to the grammar, to show the nesting level for that non-terminal. Non-terminals are relevant if one of the cyclic non-terminals is reachable from them.
  3. Draw the CFG as an augmented transition network (this amounts to drawing each rule as an FSA, with terminals or non-terminals labeling the arcs).
  4. Using (an adaptation of) the Thompson algorithm, combine the FSAs from the transition network into a single FSA with epsilon transitions as necessary, and only terminals labeling the arcs. Keep track of the attributes using whatever additions to the usual tools your wit can provide. Me, I use counting FSAs as I described them in 2004; I still can’t believe that work is new, but I still haven’t found anything else in that vein.
  5. Eliminate epsilon transitions in the usual way, to clean things up. If your heart is strong, you may also be able to merge states that clearly can be merged safely, but I don’t know how to define that test formally just yet.
  6. You now have a fairly standard counting FSA; it can readily be re-transcribed into a counting regular attribute grammar (CRAG).
  7. To generate a regular sub-language, assign a maximum value to each counter; you now have an attribute grammar over a finite lattice and can generate a normal (non-attributed) grammar in the obvious way.
  8. To generate a regular super-language, just ignore (or: delete) the guards and counters.

Strictly speaking, you can generate both the sub- and the super-language as soon as you have the attribute grammar of step

Let me illustrate using the arithmetic expression grammar above.

First, note that expression, term, and factor are all cyclic, and nothing else is. In this particular case, we actually only need one counter, but it does no harm to add all three. For brevity, I’ll abbreviate the non-terminals to single letters.

e(i,j,k) = t(i+1,j,k) (a t(i+1,j,k))*
t(i,j,k) = f(i,j+1,k) (m f(i,j+1,k))*
f(i,j,k) = NUMBER | VAR-REF | '(' e(i,j,k+1) ')'
a = '+' | '-'
m = '*' | '/'

The result of drawing a transition network (not shown) and then mushing it all together, after minimization, with guards and counter operations, is something like this: FSA for the superlanguage

A counting regular attribute grammar for this, with e(0) the start symbol, is:

e(n) = '(' e(n+1)
e(n) = NUM q(n)
e(n) = VAR-REF q(n)
q(n) = [+-*/] e(n)
q(n) = {n > 0} ')' q(n-1)
q(n) = {n = 0} ''

(The last line could of course be written “q(0) = ''”, but for reasons I haven’t figured out the form with the explicit and verbose guard seems better, perhaps clearer. Maybe it is.)

This can be multiplied out in the obvious way to produce an unattributed grammar which accepts two levels of parenthesized expressions:

e = '(' e1
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ''
e1 = '(' e2
e1 = NUM q1
e1 = VAR-REF q1
q1 = [+-*/] e1
q1 = ')' q
e2 = NUM q2
e2 = VAR-REF q2
q2 = [+-*/] e2
q2 = ')' q1

And the smallest regular superlanguage appears to be

e = '(' e
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ')' q
q = ''

Some open questions and follow-on topics:

  • Can a formally satisfactory definition be given which explicates the intuitive notion of the ‘smallest regular super-language’?
  • If so, can it be proven that the construction above produces a grammar (an FSA, a regular expression) which in fact accepts the smallest regular superlanguage?
  • To check this, I really ought to turn this into code; if any readers of this have a use for such code in Prolog or XSLT, let me know.
  • It would be interesting to use this technique to write simple types for approximations of arithmetic expressions, and XPath expressions, and various other specialized types that in principle require context-free grammars. I foresee a union type with first a large regular subset, then the smallest regular superset, then arbitrary strings.
  • This whole thing came up because I was worrying about issues relating to test cases for grammars; it would be nice to try to apply this work there.