A way to allow both extensibility and interoperability (Balisage report 2)

[begun 11 August 2008, finished 19 August 2008]

At the versioning symposium the day before the Balisage 2008 conference, my colleague Sandro Hawke talked about a method he has developed for allowing a core language to be extended in a variety of ways while still retaining reasonably good interoperability. The details of his implementation are interesting, but what strikes me most forcibly is the underlying model of interoperability, which seems to me to bring some very useful new ideas to bear on the problem. There may be — I mean there must be — situations where these new ideas don’t help. It’s not actually clear to me how best to characterize the situations where they do help, and the situations where they don’t, but it does seem to me clear that in the situations where they do help, they can help a lot.

It took me a bit of effort to grasp the basic idea, and in fact I didn’t really get it until my evil twin Enrique started to explain it to me. The summary of the basic ideas that follows is Enrique’s, not Sandro’s, and Sandro should not be held responsible for it.

1 Start at the beginning.

“It’s a very good place to start,” I murmured. “What was that?” “Nothing.” “One more Sound of music reference out of you and this explanation is over, got that?”

It’s obvious, and thus well understood by at least some people, that interchange is easy when everyone involves is speaking exactly the same language.

“The Lord said, ‘If as one people speaking one language they have begun to do this, then nothing they plan to do will be impossible for them.’“ “And right he was. Sometimes Yahweh really knows what he’s talking about.”

2 But we have a lot of situations where we don’t want to, or can’t, speak exactly the same language.

3 In some cases, with unrelated languages, the only option seems to be translation, or non-communication.

“Non-communication?” I asked. “Always a popular option,” said Enrique. “Popular?” “Or at least, a common result. I assume it must be popular, given how many normally intelligent working group members engage in it. Do you really think that in the average working group argument, people are speaking the same language? Or even trying to?” “Well, of course they are. People working together in good faith …” “Good faith? working together? When was the last time you actually listened to what people were saying in a working group meeting?” “Don’t say that where my Working Group chairs can hear you, OK?” “You think they don’t know? Stick around, kid. Your naïvete is refreshing.”

4 An important case: We have a lot of situations where we want to support multiple versions / variants of the same vocabulary: a sequence of version 1, version 2, and version 3 is a simple example.

Another example is the definition of a core architecture and various competing (or complementary) extensions. For example, in the context of the W3C Rule Interchange Format, Sandro’s slide imagines a Core RIF dialect, extended separately by adding actions (to create Production Rule Dialect) and by adding equality and functions (to create a Basic Logic Dialect), the Basic Logic Dialect in turn being extended by a negation-as-failure feature and a classical-negation feature (producing, respectively, a logic-programming dialect and a version of first order logic).

“Is it an important pattern here,” I asked, “that we are talking about different dialects with overlapping expressive power?” “Possibly. What do you mean?” “The examples you mention seem to involve different dialects with semantics or functionality in common, or partially in common: you can approximate negation-as-failure with classical negation, or vice versa, if you translate direct from the one dialect to the other. If you had to translate them down into a common core language without any negation at all, you wouldn’t be able to approximate them nearly so well.” “You’re getting ahead of yourself here, but yes, that’s true of the examples so far.”

Another example is the definition of a central reference architecture and various extensions or specializations.

“You mean like HL7 and different specializations of it for general practitioners and for hospitals and other health-care institutions?” “Possibly.”

5 A common approach to such situations is a kind of hub and spoke model: specify an interchange language, an interlingua. Everybody translates into it, everybody translates out of it. So you have a 2×n problem, not an n-squared problem.

6 But in some situations, e.g. core language + multiple independent extensions, the only feasible interlingua is the core language without extensions. (Example: consider any attempt to write portable C, or portable SQL. Basic rule: stay away from vendor extensions at all costs, even if they would be a lot of help. Secondary rule: use them, but put ifdefs and so on around them.)

I.e. the interlingua is often not as expressive as the specific variants and extensions people are using. (That’s why they’re using the variants and not the interlingua.)

Put yet another way: going from the variants to the interlingua can be lossy, and often is.

7 Two things are wrong / can be wrong with an interlingua like that (or ANY interlingua solution), especially when a round trip through the interlingua produces something less useful or idiomatic than the original. (This is not logically necessary, it just almost always seems to be the case.)

Well, three.

(a) If two people (say, Ana Alicia and Ignacio) are doing blind interchange, and using the same extensions, it seems dumb for Ana Alicia to filter out all the extensions she uses, and for Ignacio not to get them. They would not have hurt him, they would have been helpful to him. The same applies if Ana and Ignacio are using similar but distinct extensions: if it’s possible to translate Ana’s extensions into a form that Ignacio can understand, that may be better than losing them entirely.

But that doesn’t work well if the interlingua can’t handle Ana Alicia’s extensions.

Sandro gave, among others, the example of the blink tag. The rules of HTML say that a browser that doesn’t understand the blink element should ignore its start- and end-tags. That makes the defined tags a sort of interlingua, and prescribes a simple (brutally simple, and also brutal) translation into the interlingua. But for text that should ideally be presented as blinking, it would be better to fall back to strong or some other form of strong highlighting, rather than rendering the contents using the same style as the surrounding text.

(b) Having extensions Ignacio doesn’t understand may or may not be a problem, depending on what Ignacio is planning on doing with the data.

(c) Even more important than (b): the best lossy translation to use may vary with Ignacio’s use of the data.

If one translation preserves the logical semantics of a set of logical rules (e.g. in RIF), but makes the rules humanly illegible, and another preserves human legibility but damages the logical content, then it matters whether Ignacio is running an inference engine or just displaying rules to users without doing any inference.

8 One advantage of direct translation pairs (the n-squqred approach instead of 2×n with an interlingua) is that the output is often higher quality. This is true both for artificial and for natural languages.

“Really?” I asked. “Sure,” said Enrique. “Look at the secret history of any of the European machine-translation projects. Even the ones based on having an interlingua ended up with annotations that say in effect ‘if you’re going from Spanish to Italian, do this; if from Spanish to German, do that.’’ “Where did you read that?” “Somewhere. No, I can’t cite a source. But I bet it’s true.”

But when we don’t know in advance what pairs are needed? and how many people are going to be playing? And thus when we cannot prepare all n-squared translation pairs in advance? What do we do then?

9 Could we manage a system where the pairwise translation is selected dynamically, based on simpler parts (so the total cost is not quadratic but less)?

10 Yes. Here’s how, in principle.

(a) Identify one or more base interlingua notations.

“That’s got to be a centralized function, surely?” I asked. “Probably; if it’s not centralized you’re unlikely to reach critical mass. But when you’re thinking of this technique as a way to provide for the extensibility of a particular language, then it’s not likely to be an issue in practice, is it? That language itself will be the candidate interlingua.” “Then why say ‘one more more’?” I asked. “Because in principle you could have more than one. Some widely deployed language in the application domain might be a second interlingua.”

(b) Provide names for various extensions / variants / versions of the base notation(s).

This means you can describe any end point in the process as accepting a particular base language with zero or more named extensions.

“Does this need to be centralized?” I asked. “Not at all; just use normal procedures to avoid name conflicts between extensions named separately.” “You mean, use URIs?” “Yes, use URIs. Sometimes you’re not as dumb as you look, you know?”

(c) Identify well known purposes for the use of the data (display to user, logical reasoning, etc.). This can if necessary be done in a distributed way, although it’s probably most effective if at least the basic purposes are agreed upon and named up front.

(d) For each extension, say how (and whether) it can be translated into some other target language — whether the target is one of the interlinguas identified in step (a) or is an interlingua plus a particular set of extensions.

Define the translation. And quantify (somehow, probably using some subjective quality measure) how lossy it is, how good the output is, for each of the specific well known purposes.

“This,” said Enrique, “ is why it’s better to centralize the naming of well known purposes. If you don’t have an agreed upon set of purposes, you end up with spotty, incomplete characterizations of various translations.” “And how is the translation to be specified?” I asked. “Any way you like. XTAN and GRDDL both use XSLT, but at this level of abstraction that’s an implementation detail. I take back what I said about your looks.”

(e) Now we have a graph of language variants, each defined as an interlingua plus extensions. Each node is a language variant. Each arc is a translation from one variant to another. Each arc has a cost associated with each well known purpose.

(f) To get from Ana Alicia’s language variant to Ignacio’s language variant, all Ignacio has to do (or Ana Alicia — whoever is taking charge of the translation) is find a path from Ana’s node in the graph

“The node representing Ana Alicia’s dialect, you mean.” “Pedant! Yes, of course that’s what I mean.”

to Ignacio’s. It need not be direct; it may involve several arcs.

(g) If there’s more than one path, Ignacio (let’s assume he is doing the translation) can choose the path that produces the highest quality output for his purposes.

Finding the optimal path through a weighted graph is not without its exciting points —

“You mean it’s hard” “Well, of course I do. What did you think exciting meant in this context?! But only when the graph is large.”

— not (I was saying) without its exciting points, but it’s also a reasonably well understood problem, and perfectly tractable in a small graph of the kind you’re going to have in applications of this technique. Sandro’s implementation uses logical rules here to good effect.

(h) If there’s only one path, Ignacio can decide whether the output is going to do him any good, or if he should just bag the idea of reusing Ana Alicia’s data.

11 The description above is how it works in principle. But can it actually be implemented? Yes.

But for details, you’re going to have to look at Sandro’s talk on XTAN, or on other writeups he has done (cited in the talk).

More questions about Monte Carlo test-case generation

[9 August 2008]

Dave Pawson’s response to the previous post here has made me wonder about ways to tweak the Monte Carlo methods for generating test strings using random number generators — in particular, ways to make them more systematic (and also, perhaps, more automatic).

At the moment, I don’t have answers, but a couple of questions have occurred to me, and if I write them down on a piece of paper I’m just going to lose the paper. So I’ll write them down here.

(1) Can we measure the effectiveness of different methods of test generation by counting how many random strings must be generated in order to obtain a certain number of ‘interesting’ test cases?

(2) One way to define ‘interesting’: given two grammars for similar but not identical languages, how many random strings must we generate, using different methods, in order to find ten strings that exercise the difference between the languages? (I.e. ten strings that are members of L1 but not of L2, or vice versa.)

If we think we know that L1 and L2 differ in exactly three ways, we can also ask: how many test cases must we generate in order to find examples (or: ten examples) of each of the three differences between the languages? (A method that quickly exercises the first two differences but never finds the third would not be our best choice for work with languages where we don’t know the answers in advance.)

(3) Another way: given any grammar, how many test strings must we generate, using a given method, in order to find three different strings which match each non-terminal in the grammar?

How many other ways of defining ‘interesting’ test cases can we find? How many other ways can we find of measuring the effectiveness of methods of generating test cases?

(4) At the moment, I know three methods for generating pseudo-random strings for testing grammars:

  1. Run the grammar itself as a generator, possibly using probabilities assigned to different branches at each choice point. John Cowan described this method in a comment on an earlier post here; Liam Quin also described essentially this method to me, in private conversation. If I need a name for this method, I’ll call it probabilistic sentence generation.
  2. For a given length of string n (given as a parameter, or itself chosen randomly), pick n characters randomly from the character set and concatenate them. I’ll call this random character concatenation.
  3. For a given number n, choose n items randomly from an array of strings constructed to include strings that match each non-terminal and terminal in the grammar’s vocabulary, and concatenate them. Since here the random selection is from strings representing the terminal + non-terminal vocabulary, I’ll call this random TNT concatenation.

Question: are there other ways to use random-number generators to generate test strings for grammars?

(5) Can the array of strings used in the TNT method be constructed automatically, by using random character concatenation until you have matches for each terminal and non-terminal, and the array is fully populated?

(Here, Enrique dug his elbow into my rib. “Well, obviously yes it can,” he hissed. “What kind of idiotic question is that?!” “Hey, I’m just brainstorming. Gimme a break. Besides, this helps set up the two questions that follow.”)

I should mention here that I have a strong gut feeling that it’s useful, in practical cases, to augment the grammar with a terminal class for characters (or strings) which can never occur in sentences of the language. For a grammar for URIs, that is, the class of characters illegal in URIs definitely needs to be represented in the TNT array; for a grammar for numbers in scientific notation, you definitely want to have a few entries in the TNT array for characters like space, or #, or letters of the alphabet other than ‘e’ and ‘E’, which occur in the coded character sets your users will be using but are not part of the alphabet of the language in the formal sense. I have no good formal grip on this gut feeling, and no good formal way to describe it; I’ll work on that. But for now, understand that when I talk about strings matching terminals, I also intend to include some characters that don’t match any terminal. Otherwise you will surely fail to exercise some error detection paths in your system.

(6) Can the TNT array be populated by a bootstrapping method? Would this work?

  1. Start with an array that just includes substrings for the terminals.
  2. Select a production rule whose right-hand side contains only terminals (or, later on, one whose right hand side contains only items for which the TNT array has substrings).
  3. Generate random strings until you have a match, or two matches, for that non-terminal. (N.B. for recursive rules you’ll want to treat each choice in the right-hand side as a separate production rule.)
  4. Add the new substrings to the TNT array.
  5. Repeat (i.e. return to step 2), until you have strings in the TNT array for each production rule (not just each non-terminal and terminal) in the grammar.

(7) Another possible bootstrapping method:

  1. Begin with a TNT array for all the terminals, as just described.
  2. Also begin with a count table showing the non-terminals (or better, the production rules) of the grammar, and how many substrings they have in the TNT array; to begin with, every production rule in the count table will have zero matches.
  3. Generate a random string from the TNT array.
  4. Check the string against each rule in the grammar table for which you have fewer than three strings in the TNT array; when you find a match, add that string to the TNT array and increment the number for that rule in the count table. If the string doesn’t match any rules (or any rules that aren’t already at 3), discard it and generate another (i.e. go back to step 3). If every entry in the count table has a count of 3, stop.

The description just given assumes that you want three strings in the TNT table for each rule in the grammar. That’s an arbitrary choice; it could be two, or one, or ten. And if there is reason to want especially thorough testing in some parts of the grammar, we could assign different quotas to different non-terminals. (If I’m particularly worried about character classes, then I want a lot of test cases that exercise that part of the grammar; I can get more by giving the non-terminals in that part of the grammar higher representation in the TNT table.)

(8) When I originally sketched out the TNT method I assumed that what I wanted was a few strings for each non-terminal (and terminal) in the grammar. In the course of working through the questions above I have come to believe that what we want is a few strings for each rule in the grammar: non-terminals with three rules need more strings than non-terminals with just one rule. Which is true? If the measurement methods described above work, it should be possible to measure the effeciveness of different approaches quantitatively.

(9) In setting up the TNT array, is the best default behavior going to be to provide the same number of strings for each item in the terminal + non-terminal vocabulary? Or would we do better to count the number of times each TNT vocabulary item is actually mentioned in the grammar, and assign slots proportionally? One mention: two strings. Eight mentions: sixteen strings. (Or should we do it logarithmically? One mention, two strings; eight mentions, three times two strings; n mentions, 2 times log2 n strings?)

(10) So far, this has all been about sequences of characters. How many of these methods can be transferred conveniently to the task of generating XML documents as test cases for document grammars?

Two styles of Monte Carlo testing for grammars

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only thing strings that grammar won’t accept are strings with two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI value against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127, and then concatenating the corresponding characters, then you generate an awful lot of uninteresting test cases, and you can go a long long time before you actually exercise a specific part of your grammar.

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of string. For example, “[92, 21, 48, 6, 41, 69, 47”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings ”}”, ”a”, ”2”, ”)”, ”|xyz”, ”|”, ”[\p{IsBasicLatin}-[aeiou]]”, and ”(”.
    Together, these yield the test string “ }a2)|xyz|[\p{IsBasicLatin}-[aeiou]](

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.

If I were a better mathematician, I might be able to characterize the difference. But in a purely informal way, I think one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, for most non-terminals some or most random character sequences will fail to match the non-terminal. Test cases generated by selecting characters at random will have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By making a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick and dirty test generator, the code for the test generator is at http://www.w3.org/XML/2008/03/xsdl-regex/testgenerator1.pl; the parser for the grammars I was testing is in the other files of the same directory.

Thinking about test cases for grammars

[19 June 2008]

I’ve been thinking about testing and test cases a lot recently.

I don’t have time to write it all up, and it wouldn’t fit comfortably in a single post anyway. But several questions have turned out to provide a lot of food for thought.

The topic first offered itself in connection with several bug reports against the grammar for regular expressions in XSD Part 2: Datatypes, and with the prospect of revising the grammar to resolve the issues. When revising a grammar, it would be really useful to be confident that the changes one is making change the parts of the language one wants to change, and leave the rest of the language untouched. In the extreme case, perhaps we don’t want to change the language at all, just to reformulate the grammar to be easier to follow or to provide semantics for. How can we be confident that the old grammar and the new one describe the same language?

For regular languages, I believe the problem is fairly straightforward. You can calculate the minimal FSA for each language and check whether the two minimal FSAs are isomorphic. Or you can calculate both set differences (L1 – L2 and L2 – L1) and check that both of them are the empty set. And there are tools like Grail that can help you perform the check, although the notation Grail uses is just different enough from the notation XSD uses to make confusion and translation errors possible (but similar enough that you think it would be over-engineering to try to automate the translation).

Buyt for context-free languages, the situation is not so good. In principle, the equivalence of context-free languages is decidable, but I would have to spend time rereading Hopcroft and Ullman, or Grune and Jacobs, to figure out how to go about it. And I don’t know of any good grammar-analysis tools. (When I ask people, they say the closest thing they know of to a grammar analysis tool are the error messages from yacc and its ilk.) So even if one did buckle down and try to prove the original form of the grammar and the new form equivalent, the possibility of errors in the proof is quite real and it would be nice to have a convenient way of generating a thorough set of test cases.

I can think of two ways to generate test cases:

  • Generation of random or pseudo-random strings; let’s call this Monte Carlo testing.
  • Careful systematic creation of test cases. I.e., hard work, either in the manual construction of tests or in setting things up for automatic test generation.

Naturally my first thought was how to avoid hard work by generating useful test cases with minimal labor.

The bad news is that this only led to other questions, like “what do you mean by useful test cases?”

The obvious answer is that in the grammar comparison case, one wants to generate test cases which will expose differences in the languages defined by the two grammars, just as in the case of software one wants test cases which will expose errors in the program. The parallel suggests that one might learn something useful by attempting to apply general testing principles to grammars and to specifications based on grammars.

So I’ve been thinking about some questions which arise from that idea. In much of this I am guided by Glenford J. Myers, The art of software testing (New York: Wiley, 1979). If I had no other reasons for being grateful to Paul Cotton, his recommending Myers to me would still put me in his debt.

  • For the various measures of test ‘coverage’ defined by Myers (statement coverage, decision coverage, condition coverage, decision/condition coverage, multiple condition coverage), what are the corresponding measures for grammars?
  • If one generates random strings to use as test cases, how long does the string need to be in order to be useful? (For some meaning of “useful” — for example, in order to ensure that all parts of the grammar can be exercised.)
  • How long can the strings get before they are clearly not testing anything that shorter strings haven’t already tested adequately (for some meaning of “adequate”)?
  • From earlier experience generating random strings as test cases, I know that for pretty much any definition of “interesting test case”, the large majority of random test cases are not “interesting”. Is there a way to increase the likelihood of a test case being interesting? A way that doesn’t involve hard work, I mean.
  • How good a job can we do at generating interesting test cases with only minimal understanding of the language, or with only minimal analysis of its grammar or FSA?
  • What kinds of analysis offer the best bang for the buck in terms of improving our ability to generate test cases automatically?

Dumbing grammars down

[15 April 2008]

For reasons too complicated to go into just now (I tried, and it takes paragraphs and paragraphs), I have found myself thinking lately about a grammar manipulation process well known to be possible and wondering how to do it in fact.


Sometimes what you have is a context-free language L, but what you need is a related regular language. Or, more precisely, what you have is a context-free grammar for L, and what you need is a regular grammar or an FSA or a regular expression that recognizes either the largest practicable regular sub-language of L or the smallest regular super-language of L. You might need this because the context-free grammar, in all its hairy glory, is too difficult to handle with the limited resources at your disposal. (The tractable-sublanguage approach was pioneered by students of natural languages, back when they were trying to make the equivalent of an 80386 parse English.) Or you might need it because you want to define an element or attribute for expressions of some kind defined by a context-free grammar, and you want to define a simple type for them, for use with XSD, or you want to use Simon St. Laurent’s regular fragmentation or similar tools to describe them, but the tools (like the XSD pattern facet) only deal with regular expressions, not context-free grammars.

Since regular grammars are weaker (or dumber) than context-free grammars, the question turns into: how can we most usefully go about dumbing down a context-free grammar to make it into a regular grammar?

The largest practicable regular sub-language

Now, it’s reasonably well known that for any context-free language which allows arbitrarily deep nesting of structures (that is, for any context-free language at all — if it doesn’t allow arbitrarily deep nesting, the language isn’t context-free), it’s possible to describe a related regular language which allows some bounded amount of nesting.

For example: the language of arithmetic expressions defined by this grammar is context-free:

expression = term (add-op term)*
term = factor (mul-op factor)*
factor = NUMBER | VAR-REF | '(' expression ')'
add-op = '+' | '-'
mul-op = '*' | '/'

A similar language that allows nesting of parenthesized expressions only up to some fixed depth is regular, not context-free, and can be recognized with a finite-state automaton or described with a regular expression. And every sentence in the regular language is a sentence in the original context-free language.

This fact has been exploited before now to make certain parsing and grammar-management tasks more tractable. I read about it first (if memory serves) in Grune and Jacobs, Parsing techniques: A practical guide, who cite a number of papers, most of which I have not read. But while they provide a clear high-level description of how the necessary grammar transformation works, they don’t go into any detail about how to perform the transformation in practice for arbitrary grammars. So it’s been in the back of my mind for some time to think it through and see if I could reduce it to code.

The smallest regular super-language

Sometimes, the regular language you want is not a sub-language, but a super-language. Suppose we wanted to write a regular expression that would accept all legal arithmetic expressions and as few other strings as possible. (If you’re defining a simple type for arithmetic expressions, you don’t want it rejecting things just because the parentheses are nested too deep. So you need a superlanguage, not a sub-language.) The nature of the context-free / regular distinction is that such a super-language would not require parentheses to match up correctly. But it should be possible to enforce some of the other rules relating to parentheses: we want strings like “(3 +)” or “4 (* x)” to be rejected, because in the original grammar operators are not allowed to be adjacent to parentheses in this way, and you don’t need a context-free grammar to enforce those rules.

As it happens, the regular expression (*d+()|[+-*/]d+)* comes pretty close to capturing the rules. (For simplicity, I have just written “d+” for the places where the original grammar has NUM or VAR-REF.) That is, unless I have botched something somewhere, any string legal against the original grammar will be legal against this regular expression, and nothing else will except strings that would be correct except that their parentheses are messed up.

Towards a method

Given a particular context-free grammar, it’s generally possible to sit down with it and cipher out how to dumb it down into a regular grammar for fixed depth, or a regular grammar for arbitrary depth but no checks on parenthesis nesting. But how do you do it for arbitrary grammars, methodically and mechanically? How do you reduce it to code?

I daresay it’s in the literature, but I haven’t read much of the relevant literature; I expect to, when I can, but it was more fun to try to solve the problem from scratch, armed only with Grune and Jacobs’ concise description

Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:

MainClause -->
SubjectSingular VerbSingular Object
| SubjectPlural VerbPlural Object

and duplicate the rest of the grammar accordingly. The result is still regular…. We replicate the grammar the desired number of times and remove the possibility of further levels from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. the resulting grammar will be huge but regular and will be able to profit fom all simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be one by a program.

I don’t have code yet. But I do have a sketch of a methodical approach. Some of these steps probably need further explanation, which I do not have time to provide today.

  1. Find the set of non-terminals in the grammar that are reachable from themselves (call these cyclic non-terminals).
  2. To each production for a relevant non-terminal (details below), add a synthesized attribute to the grammar, to show the nesting level for that non-terminal. Non-terminals are relevant if one of the cyclic non-terminals is reachable from them.
  3. Draw the CFG as an augmented transition network (this amounts to drawing each rule as an FSA, with terminals or non-terminals labeling the arcs).
  4. Using (an adaptation of) the Thompson algorithm, combine the FSAs from the transition network into a single FSA with epsilon transitions as necessary, and only terminals labeling the arcs. Keep track of the attributes using whatever additions to the usual tools your wit can provide. Me, I use counting FSAs as I described them in 2004; I still can’t believe that work is new, but I still haven’t found anything else in that vein.
  5. Eliminate epsilon transitions in the usual way, to clean things up. If your heart is strong, you may also be able to merge states that clearly can be merged safely, but I don’t know how to define that test formally just yet.
  6. You now have a fairly standard counting FSA; it can readily be re-transcribed into a counting regular attribute grammar (CRAG).
  7. To generate a regular sub-language, assign a maximum value to each counter; you now have an attribute grammar over a finite lattice and can generate a normal (non-attributed) grammar in the obvious way.
  8. To generate a regular super-language, just ignore (or: delete) the guards and counters.

Strictly speaking, you can generate both the sub- and the super-language as soon as you have the attribute grammar of step

Let me illustrate using the arithmetic expression grammar above.

First, note that expression, term, and factor are all cyclic, and nothing else is. In this particular case, we actually only need one counter, but it does no harm to add all three. For brevity, I’ll abbreviate the non-terminals to single letters.

e(i,j,k) = t(i+1,j,k) (a t(i+1,j,k))*
t(i,j,k) = f(i,j+1,k) (m f(i,j+1,k))*
f(i,j,k) = NUMBER | VAR-REF | '(' e(i,j,k+1) ')'
a = '+' | '-'
m = '*' | '/'

The result of drawing a transition network (not shown) and then mushing it all together, after minimization, with guards and counter operations, is something like this: FSA for the superlanguage

A counting regular attribute grammar for this, with e(0) the start symbol, is:

e(n) = '(' e(n+1)
e(n) = NUM q(n)
e(n) = VAR-REF q(n)
q(n) = [+-*/] e(n)
q(n) = {n > 0} ')' q(n-1)
q(n) = {n = 0} ''

(The last line could of course be written “q(0) = ''”, but for reasons I haven’t figured out the form with the explicit and verbose guard seems better, perhaps clearer. Maybe it is.)

This can be multiplied out in the obvious way to produce an unattributed grammar which accepts two levels of parenthesized expressions:

e = '(' e1
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ''
e1 = '(' e2
e1 = NUM q1
e1 = VAR-REF q1
q1 = [+-*/] e1
q1 = ')' q
e2 = NUM q2
e2 = VAR-REF q2
q2 = [+-*/] e2
q2 = ')' q1

And the smallest regular superlanguage appears to be

e = '(' e
e = NUM q
e = VAR-REF q
q = [+-*/] e
q = ')' q
q = ''

Some open questions and follow-on topics:

  • Can a formally satisfactory definition be given which explicates the intuitive notion of the ‘smallest regular super-language’?
  • If so, can it be proven that the construction above produces a grammar (an FSA, a regular expression) which in fact accepts the smallest regular superlanguage?
  • To check this, I really ought to turn this into code; if any readers of this have a use for such code in Prolog or XSLT, let me know.
  • It would be interesting to use this technique to write simple types for approximations of arithmetic expressions, and XPath expressions, and various other specialized types that in principle require context-free grammars. I foresee a union type with first a large regular subset, then the smallest regular superset, then arbitrary strings.
  • This whole thing came up because I was worrying about issues relating to test cases for grammars; it would be nice to try to apply this work there.