Archive for the ‘Automata’ Category

Automata and infinite strings

Tuesday, December 15th, 2009

[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you're using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it's not clear that that's true. Instead of the sequence "3.141592...", store an algorithm for calculating, on demand, the nth digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn't what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “finite” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that in principle automata theorists have historically not restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said, without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: there is no final state.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

  1. Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
  2. Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer).
Do these rules for acceptance of strings define sets of automata with different discriminating power?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.

[Enrique tells me I should pause to draw pictures of these automata, but I'm not going to take the time just yet. Apologies to those who find it hard to visualize what I'm talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation,1/7 has an infinitely repeating sequence of digits (142857, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by an automaton with finite states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

  1. Accept if at every point in the sequence of states there are an infinite number of accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second story about termination, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed with other symbols, will not be accepted by this automaton (nor by any other, as far as I can tell).

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in Introduction to automata theory, languages, and computation (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: a practical guide, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages also doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, The vastness of natural languages (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in Automata studies, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

Grail for regular languages

Friday, December 11th, 2009

[11 December 2009]

Every now and then — not constantly, but recurrently — I experience a strong need to have a running copy of Grail, a software package first written by Derick Wood and Darrell Raymond and described by its documentation as “a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.”

Among other things, Grail is handy for answering questions about the equivalence or non-equivalence of regular expressions, or about subset/superset relations holding between the languages recognized by them. A few years ago, for example, the W3C XML Schema Working Group found itself in possession of two different descriptions of the lexical space of the XSD duration type. The working group wished, not unreasonably, to check that the two really were equivalent.

The first description provided three regular expressions, and said the lexical space of duration included all the strings which matched all three expressions:

  • -?P([0-9]+Y)?([0-9]+M)?([0-9]+D)?(T([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)? (strings in which the fields of an ISO 8601 duration appear in the correct order, and in which each field appears only if it has at least one digit present)
  • .*[YMDHS].* (strings in which at least one field is present)
  • [^T]+(T[^HMS]+[HMS].*)? (if the character T appears, it must be followed by one of the time-related fields)

The second description translated the context-free grammar into regular-expression form (I’ve introduced white space for legibility):

-?P(([0-9]+Y)([0-9]+M)?([0-9]+D)?
 (T(([0-9]+H)([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
            |([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
                    |([0-9]+(\.[0-9]+)?S)))?
|([0-9]+M)([0-9]+D)?
 (T(([0-9]+H)([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
            |([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
                    |([0-9]+(\.[0-9]+)?S)))?
|([0-9]+D)?
 (T(([0-9]+H)([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
            |([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
                      |([0-9]+(\.[0-9]+)?S)))?
|T(([0-9]+H)([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
           |([0-9]+M)?([0-9]+(\.[0-9]+)?S)?
                     |([0-9]+(\.[0-9]+)?S)))

Easy enough to eyeball, for some people, I guess, but the working group wanted a more reliable method.

After a few hours trying vainly to compile Grail for my Linux box, I found an RPM that worked for me, and in ten minutes or so I had used Grail to establish that the two descriptions are equivalent.

Today I realized that another problem I face could best be solved by using Grail, but I no longer have a Linux box (and have not, in any case, found that old RPM). Grail 2.5 is dated March 1996, and the C++ in which it is written does not seem to please GCC 4.0.1. Grail+ 3.0, a successor project in other hands, may have been touched as recently as 2002 or 2004, but most of the dates appear to be in summer or fall 1998. GCC doesn’t like it, either.

So I have thus far been unable to recompile this very helpful tool.

If anyone out there knows of anyone who has either massaged the source of Grail into a form more like what modern C++ compilers will compile, or found out what combination of compile-time flags will persuade GCC to put itself in a more forgiving frame of mind and compile the thing, please get in touch. (And no, -Wno-deprecated does not suffice to do the trick.)

And any C++ proficients looking for interesting and useful projects to undertake could do a lot worse for themselves and for the world than to bring Grail into the twenty-first century.

Dumbing grammars down, cont’d

Saturday, May 30th, 2009

[30 May 2009; diagrams added 31 May 2009]

Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.

I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)

So I can now give a slightly simpler account of the FSA construction.

  1. Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
  2. For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
  3. For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
  4. Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
  5. The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.

I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.

The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that for long enough sentences in a language, the sentence s can be partitioned into subsequences u, v, w, x, and y, such that s is the concatenation uvwxy, and such that for any positive integer n, the string uvnwxny is also in the language. The automaton recognizes the slightly larger set of sentences uvnwxmy for positive integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.

To take a very simple example, consider the grammar:

  • s ::= ‘a’
  • s ::= ‘(’ s ‘)’
  • s ::= ‘[' s ']‘

The automaton has ten states, which can be represented in the style of Earley items:

  1. s ::= · ‘a’
  2. s ::= ‘a’ ·
  3. s ::= · ‘(’ s ‘)’
  4. s ::= ‘(’ · s ‘)’
  5. s ::= ‘(’ s · ‘)’
  6. s ::= ‘(’ s ‘)’ ·
  7. s ::= · ‘[' s ']‘
  8. s ::= ‘[' · s ']‘
  9. s ::= ‘[' s · ']‘
  10. s ::= ‘[' s ']‘ ·

Plus a new starting state I’ll call 0.

Arcs:

  • 0: λ → 1
  • 0: λ → 3
  • 0: λ → 7
  • 1: “a” → 2
  • 2: λ → 5
  • 2: λ → 9
  • 3: “(” → 4
  • 4: λ → 1
  • 4: λ → 3
  • 4: λ → 7
  • 5: “)” → 6
  • 6: λ → 5
  • 6: λ → 9
  • 7: “[” → 8
  • 8: λ → 1
  • 8: λ → 3
  • 8: λ → 7
  • 9: “]” → 10
  • 10: λ → 5
  • 10: λ → 9

Accepting states are 2, 6, and 10.

Or in pictures:

Drawing of the finite-state automaton described in the text, with circles for the states and arrows for the transitions.

Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:

  • 1: “a” → 2
  • 1: “(” → 1
  • 1: “[” → 1
  • 2: “)” → 2
  • 2: “]” → 2

Drawing of the two-state finite-state automaton described in the text, with circles for the states and arrows for the transitions.

The regular expression “[\[(]*a[\])]*” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.

Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):

  • Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
  • Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
  • The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
  • Prove that the language recognized by the FSA is essentially the language uvnwxmy for every uvwxy decomposition of sentences in the original language.
  • Can the largest practicable regular sublanguage be constructed using this FSA?
  • Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?

Simple proof that the URI grammar of RFC 3986 defines a regular language

Friday, January 16th, 2009

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

   setof(V,X^(refers(V,X);refers(X,V)),Vs),
   setof(L-R,refers(L,R),Es),
   vertices_edges_to_ugraph(Vs,Es,G),
   transitive_closure(G,Gc),
   member(LHS-RHS,Gc),
   member(LHS,RHS).

There were no solutions; from this I infer that that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

   setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.

The line

   setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line

   vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,...], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...)

The line

   transitive_closure(G,Gc),

takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines

   member(LHS-RHS,Gc),
   member(LHS,RHS).

seek a left-hand side LHS in Gc which has a edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then it is clearly regular.

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answer the question. “Well, I just tried the construction, and it worked.“ He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

A way to allow both extensibility and interoperability (Balisage report 2)

Tuesday, August 19th, 2008

[begun 11 August 2008, finished 19 August 2008]

At the versioning symposium the day before the Balisage 2008 conference, my colleague Sandro Hawke talked about a method he has developed for allowing a core language to be extended in a variety of ways while still retaining reasonably good interoperability. The details of his implementation are interesting, but what strikes me most forcibly is the underlying model of interoperability, which seems to me to bring some very useful new ideas to bear on the problem. There may be — I mean there must be — situations where these new ideas don’t help. It’s not actually clear to me how best to characterize the situations where they do help, and the situations where they don’t, but it does seem to me clear that in the situations where they do help, they can help a lot.

It took me a bit of effort to grasp the basic idea, and in fact I didn’t really get it until my evil twin Enrique started to explain it to me. The summary of the basic ideas that follows is Enrique’s, not Sandro’s, and Sandro should not be held responsible for it.

1 Start at the beginning.

“It’s a very good place to start,” I murmured. “What was that?” “Nothing.” “One more Sound of music reference out of you and this explanation is over, got that?”

It’s obvious, and thus well understood by at least some people, that interchange is easy when everyone involves is speaking exactly the same language.

“The Lord said, ‘If as one people speaking one language they have begun to do this, then nothing they plan to do will be impossible for them.’“ “And right he was. Sometimes Yahweh really knows what he’s talking about.”

2 But we have a lot of situations where we don’t want to, or can’t, speak exactly the same language.

3 In some cases, with unrelated languages, the only option seems to be translation, or non-communication.

“Non-communication?” I asked. “Always a popular option,” said Enrique. “Popular?” “Or at least, a common result. I assume it must be popular, given how many normally intelligent working group members engage in it. Do you really think that in the average working group argument, people are speaking the same language? Or even trying to?” “Well, of course they are. People working together in good faith …” “Good faith? working together? When was the last time you actually listened to what people were saying in a working group meeting?” “Don’t say that where my Working Group chairs can hear you, OK?” “You think they don’t know? Stick around, kid. Your naïvete is refreshing.”

4 An important case: We have a lot of situations where we want to support multiple versions / variants of the same vocabulary: a sequence of version 1, version 2, and version 3 is a simple example.

Another example is the definition of a core architecture and various competing (or complementary) extensions. For example, in the context of the W3C Rule Interchange Format, Sandro’s slide imagines a Core RIF dialect, extended separately by adding actions (to create Production Rule Dialect) and by adding equality and functions (to create a Basic Logic Dialect), the Basic Logic Dialect in turn being extended by a negation-as-failure feature and a classical-negation feature (producing, respectively, a logic-programming dialect and a version of first order logic).

“Is it an important pattern here,” I asked, “that we are talking about different dialects with overlapping expressive power?” “Possibly. What do you mean?” “The examples you mention seem to involve different dialects with semantics or functionality in common, or partially in common: you can approximate negation-as-failure with classical negation, or vice versa, if you translate direct from the one dialect to the other. If you had to translate them down into a common core language without any negation at all, you wouldn’t be able to approximate them nearly so well.” “You’re getting ahead of yourself here, but yes, that’s true of the examples so far.”

Another example is the definition of a central reference architecture and various extensions or specializations.

“You mean like HL7 and different specializations of it for general practitioners and for hospitals and other health-care institutions?” “Possibly.”

5 A common approach to such situations is a kind of hub and spoke model: specify an interchange language, an interlingua. Everybody translates into it, everybody translates out of it. So you have a 2×n problem, not an n-squared problem.

6 But in some situations, e.g. core language + multiple independent extensions, the only feasible interlingua is the core language without extensions. (Example: consider any attempt to write portable C, or portable SQL. Basic rule: stay away from vendor extensions at all costs, even if they would be a lot of help. Secondary rule: use them, but put ifdefs and so on around them.)

I.e. the interlingua is often not as expressive as the specific variants and extensions people are using. (That’s why they’re using the variants and not the interlingua.)

Put yet another way: going from the variants to the interlingua can be lossy, and often is.

7 Two things are wrong / can be wrong with an interlingua like that (or ANY interlingua solution), especially when a round trip through the interlingua produces something less useful or idiomatic than the original. (This is not logically necessary, it just almost always seems to be the case.)

Well, three.

(a) If two people (say, Ana Alicia and Ignacio) are doing blind interchange, and using the same extensions, it seems dumb for Ana Alicia to filter out all the extensions she uses, and for Ignacio not to get them. They would not have hurt him, they would have been helpful to him. The same applies if Ana and Ignacio are using similar but distinct extensions: if it’s possible to translate Ana’s extensions into a form that Ignacio can understand, that may be better than losing them entirely.

But that doesn’t work well if the interlingua can’t handle Ana Alicia’s extensions.

Sandro gave, among others, the example of the blink tag. The rules of HTML say that a browser that doesn’t understand the blink element should ignore its start- and end-tags. That makes the defined tags a sort of interlingua, and prescribes a simple (brutally simple, and also brutal) translation into the interlingua. But for text that should ideally be presented as blinking, it would be better to fall back to strong or some other form of strong highlighting, rather than rendering the contents using the same style as the surrounding text.

(b) Having extensions Ignacio doesn’t understand may or may not be a problem, depending on what Ignacio is planning on doing with the data.

(c) Even more important than (b): the best lossy translation to use may vary with Ignacio’s use of the data.

If one translation preserves the logical semantics of a set of logical rules (e.g. in RIF), but makes the rules humanly illegible, and another preserves human legibility but damages the logical content, then it matters whether Ignacio is running an inference engine or just displaying rules to users without doing any inference.

8 One advantage of direct translation pairs (the n-squqred approach instead of 2×n with an interlingua) is that the output is often higher quality. This is true both for artificial and for natural languages.

“Really?” I asked. “Sure,” said Enrique. “Look at the secret history of any of the European machine-translation projects. Even the ones based on having an interlingua ended up with annotations that say in effect ‘if you’re going from Spanish to Italian, do this; if from Spanish to German, do that.’’ “Where did you read that?” “Somewhere. No, I can’t cite a source. But I bet it’s true.”

But when we don’t know in advance what pairs are needed? and how many people are going to be playing? And thus when we cannot prepare all n-squared translation pairs in advance? What do we do then?

9 Could we manage a system where the pairwise translation is selected dynamically, based on simpler parts (so the total cost is not quadratic but less)?

10 Yes. Here’s how, in principle.

(a) Identify one or more base interlingua notations.

“That’s got to be a centralized function, surely?” I asked. “Probably; if it’s not centralized you’re unlikely to reach critical mass. But when you’re thinking of this technique as a way to provide for the extensibility of a particular language, then it’s not likely to be an issue in practice, is it? That language itself will be the candidate interlingua.” “Then why say ‘one more more’?” I asked. “Because in principle you could have more than one. Some widely deployed language in the application domain might be a second interlingua.”

(b) Provide names for various extensions / variants / versions of the base notation(s).

This means you can describe any end point in the process as accepting a particular base language with zero or more named extensions.

“Does this need to be centralized?” I asked. “Not at all; just use normal procedures to avoid name conflicts between extensions named separately.” “You mean, use URIs?” “Yes, use URIs. Sometimes you’re not as dumb as you look, you know?”

(c) Identify well known purposes for the use of the data (display to user, logical reasoning, etc.). This can if necessary be done in a distributed way, although it’s probably most effective if at least the basic purposes are agreed upon and named up front.

(d) For each extension, say how (and whether) it can be translated into some other target language — whether the target is one of the interlinguas identified in step (a) or is an interlingua plus a particular set of extensions.

Define the translation. And quantify (somehow, probably using some subjective quality measure) how lossy it is, how good the output is, for each of the specific well known purposes.

“This,” said Enrique, “ is why it’s better to centralize the naming of well known purposes. If you don’t have an agreed upon set of purposes, you end up with spotty, incomplete characterizations of various translations.” “And how is the translation to be specified?” I asked. “Any way you like. XTAN and GRDDL both use XSLT, but at this level of abstraction that’s an implementation detail. I take back what I said about your looks.”

(e) Now we have a graph of language variants, each defined as an interlingua plus extensions. Each node is a language variant. Each arc is a translation from one variant to another. Each arc has a cost associated with each well known purpose.

(f) To get from Ana Alicia’s language variant to Ignacio’s language variant, all Ignacio has to do (or Ana Alicia — whoever is taking charge of the translation) is find a path from Ana’s node in the graph

“The node representing Ana Alicia’s dialect, you mean.” “Pedant! Yes, of course that’s what I mean.”

to Ignacio’s. It need not be direct; it may involve several arcs.

(g) If there’s more than one path, Ignacio (let’s assume he is doing the translation) can choose the path that produces the highest quality output for his purposes.

Finding the optimal path through a weighted graph is not without its exciting points —

“You mean it’s hard” “Well, of course I do. What did you think exciting meant in this context?! But only when the graph is large.”

— not (I was saying) without its exciting points, but it’s also a reasonably well understood problem, and perfectly tractable in a small graph of the kind you’re going to have in applications of this technique. Sandro’s implementation uses logical rules here to good effect.

(h) If there’s only one path, Ignacio can decide whether the output is going to do him any good, or if he should just bag the idea of reusing Ana Alicia’s data.

11 The description above is how it works in principle. But can it actually be implemented? Yes.

But for details, you’re going to have to look at Sandro’s talk on XTAN, or on other writeups he has done (cited in the talk).

More questions about Monte Carlo test-case generation

Saturday, August 9th, 2008

[9 August 2008]

Dave Pawson’s response to the previous post here has made me wonder about ways to tweak the Monte Carlo methods for generating test strings using random number generators — in particular, ways to make them more systematic (and also, perhaps, more automatic).

At the moment, I don’t have answers, but a couple of questions have occurred to me, and if I write them down on a piece of paper I’m just going to lose the paper. So I’ll write them down here.

(1) Can we measure the effectiveness of different methods of test generation by counting how many random strings must be generated in order to obtain a certain number of ‘interesting’ test cases?

(2) One way to define ‘interesting’: given two grammars for similar but not identical languages, how many random strings must we generate, using different methods, in order to find ten strings that exercise the difference between the languages? (I.e. ten strings that are members of L1 but not of L2, or vice versa.)

If we think we know that L1 and L2 differ in exactly three ways, we can also ask: how many test cases must we generate in order to find examples (or: ten examples) of each of the three differences between the languages? (A method that quickly exercises the first two differences but never finds the third would not be our best choice for work with languages where we don’t know the answers in advance.)

(3) Another way: given any grammar, how many test strings must we generate, using a given method, in order to find three different strings which match each non-terminal in the grammar?

How many other ways of defining ‘interesting’ test cases can we find? How many other ways can we find of measuring the effectiveness of methods of generating test cases?

(4) At the moment, I know three methods for generating pseudo-random strings for testing grammars:

  1. Run the grammar itself as a generator, possibly using probabilities assigned to different branches at each choice point. John Cowan described this method in a comment on an earlier post here; Liam Quin also described essentially this method to me, in private conversation. If I need a name for this method, I’ll call it probabilistic sentence generation.
  2. For a given length of string n (given as a parameter, or itself chosen randomly), pick n characters randomly from the character set and concatenate them. I’ll call this random character concatenation.
  3. For a given number n, choose n items randomly from an array of strings constructed to include strings that match each non-terminal and terminal in the grammar’s vocabulary, and concatenate them. Since here the random selection is from strings representing the terminal + non-terminal vocabulary, I’ll call this random TNT concatenation.

Question: are there other ways to use random-number generators to generate test strings for grammars?

(5) Can the array of strings used in the TNT method be constructed automatically, by using random character concatenation until you have matches for each terminal and non-terminal, and the array is fully populated?

(Here, Enrique dug his elbow into my rib. “Well, obviously yes it can,” he hissed. “What kind of idiotic question is that?!” “Hey, I’m just brainstorming. Gimme a break. Besides, this helps set up the two questions that follow.”)

I should mention here that I have a strong gut feeling that it’s useful, in practical cases, to augment the grammar with a terminal class for characters (or strings) which can never occur in sentences of the language. For a grammar for URIs, that is, the class of characters illegal in URIs definitely needs to be represented in the TNT array; for a grammar for numbers in scientific notation, you definitely want to have a few entries in the TNT array for characters like space, or #, or letters of the alphabet other than ‘e’ and ‘E’, which occur in the coded character sets your users will be using but are not part of the alphabet of the language in the formal sense. I have no good formal grip on this gut feeling, and no good formal way to describe it; I’ll work on that. But for now, understand that when I talk about strings matching terminals, I also intend to include some characters that don’t match any terminal. Otherwise you will surely fail to exercise some error detection paths in your system.

(6) Can the TNT array be populated by a bootstrapping method? Would this work?

  1. Start with an array that just includes substrings for the terminals.
  2. Select a production rule whose right-hand side contains only terminals (or, later on, one whose right hand side contains only items for which the TNT array has substrings).
  3. Generate random strings until you have a match, or two matches, for that non-terminal. (N.B. for recursive rules you’ll want to treat each choice in the right-hand side as a separate production rule.)
  4. Add the new substrings to the TNT array.
  5. Repeat (i.e. return to step 2), until you have strings in the TNT array for each production rule (not just each non-terminal and terminal) in the grammar.

(7) Another possible bootstrapping method:

  1. Begin with a TNT array for all the terminals, as just described.
  2. Also begin with a count table showing the non-terminals (or better, the production rules) of the grammar, and how many substrings they have in the TNT array; to begin with, every production rule in the count table will have zero matches.
  3. Generate a random string from the TNT array.
  4. Check the string against each rule in the grammar table for which you have fewer than three strings in the TNT array; when you find a match, add that string to the TNT array and increment the number for that rule in the count table. If the string doesn’t match any rules (or any rules that aren’t already at 3), discard it and generate another (i.e. go back to step 3). If every entry in the count table has a count of 3, stop.

The description just given assumes that you want three strings in the TNT table for each rule in the grammar. That’s an arbitrary choice; it could be two, or one, or ten. And if there is reason to want especially thorough testing in some parts of the grammar, we could assign different quotas to different non-terminals. (If I’m particularly worried about character classes, then I want a lot of test cases that exercise that part of the grammar; I can get more by giving the non-terminals in that part of the grammar higher representation in the TNT table.)

(8) When I originally sketched out the TNT method I assumed that what I wanted was a few strings for each non-terminal (and terminal) in the grammar. In the course of working through the questions above I have come to believe that what we want is a few strings for each rule in the grammar: non-terminals with three rules need more strings than non-terminals with just one rule. Which is true? If the measurement methods described above work, it should be possible to measure the effeciveness of different approaches quantitatively.

(9) In setting up the TNT array, is the best default behavior going to be to provide the same number of strings for each item in the terminal + non-terminal vocabulary? Or would we do better to count the number of times each TNT vocabulary item is actually mentioned in the grammar, and assign slots proportionally? One mention: two strings. Eight mentions: sixteen strings. (Or should we do it logarithmically? One mention, two strings; eight mentions, three times two strings; n mentions, 2 times log2 n strings?)

(10) So far, this has all been about sequences of characters. How many of these methods can be transferred conveniently to the task of generating XML documents as test cases for document grammars?

Two styles of Monte Carlo testing for grammars

Friday, August 8th, 2008

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only thing strings that grammar won’t accept are strings with two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI value against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127, and then concatenating the corresponding characters, then you generate an awful lot of uninteresting test cases, and you can go a long long time before you actually exercise a specific part of your grammar.

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[\p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of string. For example, “[92, 21, 48, 6, 41, 69, 47”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings ”}”, ”a”, ”2”, ”)”, ”|xyz”, ”|”, ”[\\p{IsBasicLatin}-[aeiou]]”, and ”(”.
    Together, these yield the test string “ }a2)|xyz|[\\p{IsBasicLatin}-[aeiou]](

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.

If I were a better mathematician, I might be able to characterize the difference. But in a purely informal way, I think one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, for most non-terminals some or most random character sequences will fail to match the non-terminal. Test cases generated by selecting characters at random will have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By making a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick and dirty test generator, the code for the test generator is at http://www.w3.org/XML/2008/03/xsdl-regex/testgenerator1.pl; the parser for the grammars I was testing is in the other files of the same directory.

Thinking about test cases for grammars

Thursday, June 19th, 2008

[19 June 2008]

I’ve been thinking about testing and test cases a lot recently.

I don’t have time to write it all up, and it wouldn’t fit comfortably in a single post anyway. But several questions have turned out to provide a lot of food for thought.

The topic first offered itself in connection with several bug reports against the grammar for regular expressions in XSD Part 2: Datatypes, and with the prospect of revising the grammar to resolve the issues. When revising a grammar, it would be really useful to be confident that the changes one is making change the parts of the language one wants to change, and leave the rest of the language untouched. In the extreme case, perhaps we don’t want to change the language at all, just to reformulate the grammar to be easier to follow or to provide semantics for. How can we be confident that the old grammar and the new one describe the same language?

For regular languages, I believe the problem is fairly straightforward. You can calculate the minimal FSA for each language and check whether the two minimal FSAs are isomorphic. Or you can calculate both set differences (L1 - L2 and L2 - L1) and check that both of them are the empty set. And there are tools like Grail that can help you perform the check, although the notation Grail uses is just different enough from the notation XSD uses to make confusion and translation errors possible (but similar enough that you think it would be over-engineering to try to automate the translation).

Buyt for context-free languages, the situation is not so good. In principle, the equivalence of context-free languages is decidable, but I would have to spend time rereading Hopcroft and Ullman, or Grune and Jacobs, to figure out how to go about it. And I don’t know of any good grammar-analysis tools. (When I ask people, they say the closest thing they know of to a grammar analysis tool are the error messages from yacc and its ilk.) So even if one did buckle down and try to prove the original form of the grammar and the new form equivalent, the possibility of errors in the proof is quite real and it would be nice to have a convenient way of generating a thorough set of test cases.

I can think of two ways to generate test cases:

  • Generation of random or pseudo-random strings; let’s call this Monte Carlo testing.
  • Careful systematic creation of test cases. I.e., hard work, either in the manual construction of tests or in setting things up for automatic test generation.

Naturally my first thought was how to avoid hard work by generating useful test cases with minimal labor.

The bad news is that this only led to other questions, like “what do you mean by useful test cases?”

The obvious answer is that in the grammar comparison case, one wants to generate test cases which will expose differences in the languages defined by the two grammars, just as in the case of software one wants test cases which will expose errors in the program. The parallel suggests that one might learn something useful by attempting to apply general testing principles to grammars and to specifications based on grammars.

So I’ve been thinking about some questions which arise from that idea. In much of this I am guided by Glenford J. Myers, The art of software testing (New York: Wiley, 1979). If I had no other reasons for being grateful to Paul Cotton, his recommending Myers to me would still put me in his debt.

  • For the various measures of test ‘coverage’ defined by Myers (statement coverage, decision coverage, condition coverage, decision/condition coverage, multiple condition coverage), what are the corresponding measures for grammars?
  • If one generates random strings to use as test cases, how long does the string need to be in order to be useful? (For some meaning of “useful” — for example, in order to ensure that all parts of the grammar can be exercised.)
  • How long can the strings get before they are clearly not testing anything that shorter strings haven’t already tested adequately (for some meaning of “adequate”)?
  • From earlier experience generating random strings as test cases, I know that for pretty much any definition of “interesting test case”, the large majority of random test cases are not “interesting”. Is there a way to increase the likelihood of a test case being interesting? A way that doesn’t involve hard work, I mean.
  • How good a job can we do at generating interesting test cases with only minimal understanding of the language, or with only minimal analysis of its grammar or FSA?
  • What kinds of analysis offer the best bang for the buck in terms of improving our ability to generate test cases automatically?

Dumbing grammars down

Tuesday, April 15th, 2008

[15 April 2008]

For reasons too complicated to go into just now (I tried, and it takes paragraphs and paragraphs), I have found myself thinking lately about a grammar manipulation process well known to be possible and wondering how to do it in fact.

Background

Sometimes what you have is a context-free language L, but what you need is a related regular language. Or, more precisely, what you have is a context-free grammar for L, and what you need is a regular grammar or an FSA or a regular expression that recognizes either the largest practicable regular sub-language of L or the smallest regular super-language of L. You might need this because the context-free grammar, in all its hairy glory, is too difficult to handle with the limited resources at your disposal. (The tractable-sublanguage approach was pioneered by students of natural languages, back when they were trying to make the equivalent of an 80386 parse English.) Or you might need it because you want to define an element or attribute for expressions of some kind defined by a context-free grammar, and you want to define a simple type for them, for use with XSD, or you want to use Simon St. Laurent’s regular fragmentation or similar tools to describe them, but the tools (like the XSD pattern facet) only deal with regular expressions, not context-free grammars.

Since regular grammars are weaker (or dumber) than context-free grammars, the question turns into: how can we most usefully go about dumbing down a context-free grammar to make it into a regular grammar?

The largest practicable regular sub-language

Now, it’s reasonably well known that for any context-free language which allows arbitrarily deep nesting of structures (that is, for any context-free language at all — if it doesn’t allow arbitrarily deep nesting, the language isn’t context-free), it’s possible to describe a related regular language which allows some bounded amount of nesting.

For example: the language of arithmetic expressions defined by this grammar is context-free:

expression = term (add-op term)*
term = factor (mul-op factor)*
factor = NUMBER | VAR-REF | '(' expression ')'
add-op = '+' | '-'
mul-op = '*' | '/'

A similar language that allows nesting of parenthesized expressions only up to some fixed depth is regular, not context-free, and can be recognized with a finite-state automaton or described with a regular expression. And every sentence in the regular language is a sentence in the original context-free language.

This fact has been exploited before now to make certain parsing and grammar-management tasks more tractable. I read about it first (if memory serves) in Grune and Jacobs, Parsing techniques: A practical guide, who cite a number of papers, most of which I have not read. But while they provide a clear high-level description of how the necessary grammar transformation works, they don’t go into any detail about how to perform the transformation in practice for arbitrary grammars. So it’s been in the back of my mind for some time to think it through and see if I could reduce it to code.

The smallest regular super-language

Sometimes, the regular language you want is not a sub-language, but a super-language. Suppose we wanted to write a regular expression that would accept all legal arithmetic expressions and as few other strings as possible. (If you’re defining a simple type for arithmetic expressions, you don’t want it rejecting things just because the parentheses are nested too deep. So you need a superlanguage, not a sub-language.) The nature of the context-free / regular distinction is that such a super-language would not require parentheses to match up correctly. But it should be possible to enforce some of the other rules relating to parentheses: we want strings like “(3 +)” or “4 (* x)” to be rejected, because in the original grammar operators are not allowed to be adjacent to parentheses in this way, and you don’t need a context-free grammar to enforce those rules.

As it happens, the regular expression \(*\d+(\)|[+\-*/]\d+)* comes pretty close to capturing the rules. (For simplicity, I have just written “\d+” for the places where the original grammar has NUM or VAR-REF.) That is, unless I have botched something somewhere, any string legal against the original grammar will be legal against this regular expression, and nothing else will except strings that would be correct except that their parentheses are messed up.

Towards a method

Given a particular context-free grammar, it’s generally possible to sit down with it and cipher out how to dumb it down into a regular grammar for fixed depth, or a regular grammar for arbitrary depth but no checks on parenthesis nesting. But how do you do it for arbitrary grammars, methodically and mechanically? How do you reduce it to code?

I daresay it’s in the literature, but I haven’t read much of the relevant literature; I expect to, when I can, but it was more fun to try to solve the problem from scratch, armed only with Grune and Jacobs’ concise description

Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:

  MainClause -->
        SubjectSingular VerbSingular Object
      | SubjectPlural VerbPlural Object

and duplicate the rest of the grammar accordingly. The result is still regular…. We replicate the grammar the desired number of times and remove the possibility of further levels from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. the resulting grammar will be huge but regular and will be able to profit fom all simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be one by a program.

I don’t have code yet. But I do have a sketch of a methodical approach. Some of these steps probably need further explanation, which I do not have time to provide today.

  1. Find the set of non-terminals in the grammar that are reachable from themselves (call these cyclic non-terminals).
  2. To each production for a relevant non-terminal (details below), add a synthesized attribute to the grammar, to show the nesting level for that non-terminal. Non-terminals are relevant if one of the cyclic non-terminals is reachable from them.
  3. Draw the CFG as an augmented transition network (this amounts to drawing each rule as an FSA, with terminals or non-terminals labeling the arcs).
  4. Using (an adaptation of) the Thompson algorithm, combine the FSAs from the transition network into a single FSA with epsilon transitions as necessary, and only terminals labeling the arcs. Keep track of the attributes using whatever additions to the usual tools your wit can provide. Me, I use counting FSAs as I described them in 2004; I still can’t believe that work is new, but I still haven’t found anything else in that vein.
  5. Eliminate epsilon transitions in the usual way, to clean things up. If your heart is strong, you may also be able to merge states that clearly can be merged safely, but I don’t know how to define that test formally just yet.
  6. You now have a fairly standard counting FSA; it can readily be re-transcribed into a counting regular attribute grammar (CRAG).
  7. To generate a regular sub-language, assign a maximum value to each counter; you now have an attribute grammar over a finite lattice and can generate a normal (non-attributed) grammar in the obvious way.
  8. To generate a regular super-language, just ignore (or: delete) the guards and counters.

Strictly speaking, you can generate both the sub- and the super-language as soon as you have the attribute grammar of step

Let me illustrate using the arithmetic expression grammar above.

First, note that expression, term, and factor are all cyclic, and nothing else is. In this particular case, we actually only need one counter, but it does no harm to add all three. For brevity, I’ll abbreviate the non-terminals to single letters.

e(i,j,k) = t(i+1,j,k) (a t(i+1,j,k))*
t(i,j,k) = f(i,j+1,k) (m f(i,j+1,k))*
f(i,j,k) = NUMBER | VAR-REF | '(' e(i,j,k+1) ')'
a = '+' | '-'
m = '*' | '/'

The result of drawing a transition network (not shown) and then mushing it all together, after minimization, with guards and counter operations, is something like this: FSA for the superlanguage

A counting regular attribute grammar for this, with e(0) the start symbol, is:

e(n) = '(' e(n+1)
e(n) = NUM q(n)
e(n) = VAR-REF q(n)
q(n) = [+\-*/] e(n)
q(n) = {n > 0} ')' q(n-1)
q(n) = {n = 0} ''

(The last line could of course be written “q(0) = ''”, but for reasons I haven’t figured out the form with the explicit and verbose guard seems better, perhaps clearer. Maybe it is.)

This can be multiplied out in the obvious way to produce an unattributed grammar which accepts two levels of parenthesized expressions:

e = '(' e1
e = NUM q
e = VAR-REF q
q = [+\-*/] e
q = ''
e1 = '(' e2
e1 = NUM q1
e1 = VAR-REF q1
q1 = [+\-*/] e1
q1 = ')' q
e2 = NUM q2
e2 = VAR-REF q2
q2 = [+\-*/] e2
q2 = ')' q1

And the smallest regular superlanguage appears to be

e = '(' e
e = NUM q
e = VAR-REF q
q = [+\-*/] e
q = ')' q
q = ''

Some open questions and follow-on topics:

  • Can a formally satisfactory definition be given which explicates the intuitive notion of the ‘smallest regular super-language’?
  • If so, can it be proven that the construction above produces a grammar (an FSA, a regular expression) which in fact accepts the smallest regular superlanguage?
  • To check this, I really ought to turn this into code; if any readers of this have a use for such code in Prolog or XSLT, let me know.
  • It would be interesting to use this technique to write simple types for approximations of arithmetic expressions, and XPath expressions, and various other specialized types that in principle require context-free grammars. I foresee a union type with first a large regular subset, then the smallest regular superset, then arbitrary strings.
  • This whole thing came up because I was worrying about issues relating to test cases for grammars; it would be nice to try to apply this work there.

Grune and Jacobs, Parsing Techniques, Second Edition

Tuesday, March 18th, 2008

[18 March 2008]

Good news.

The second edition of Parsing techniques: A practical guide, by Dick Grune and Ceriel J. H. Jacobs, has now appeared. I found this out not long ago and ordered a copy, and it was here waiting for me when I came home from my most recent trip, and I’ve now had a day or so to look at it.

The first edition has been out of print for a while, which was frustrating at first since I liked recommending the book. But then the authors got the rights back from the publisher, and made Postscript and PDF of the first edition available on Dick Grune’s web site. A few years ago Dick Grune told me a second edition was planned, and I have been waiting rather impatiently ever since.

It was worth the wait. Everything that made the first edition so useful and fun to read has been retained.

Well, almost; there is one minor exception. The second edition has 662 pages, compared to the first edition’s 322, and while Springer has done a very good job of keeping the book compact and handy, it does weigh more, so that when I sit on the couch to read it, eventually the weight makes my little finger hurt. But the new material is, well, worth the weight. (Sorry!)

The best way to give a feel for the book is to quote from the preface to the first edition:

This book does not address a strictly uniform audience. On the contrary, while writing this book, we have consistently tried to imagine giving a course on the subject to a diffuse mixture of students and faculty members of assorted faculties, sophisticated laymen, the avid readers of the science supplement of the large newspapers, etc. Such a course was never given; a diverse audience like that would be too uncoordinated to convene at regular intervals, which is why we wrote this book, to be read, studied, perused or consulted wherever or whenever desired.

For myself, I’ve read, studied, and perused the first edition repeatedly since I bought the first edition at Kroch’s and Brentano’s in downtown Chicago in 1992 (back when their computer science department had computer science books, not just dummies books about popular computer programs). And I’ve consulted it for reference material many times.

They give a less formal account of automata and the key theorems (pumping lemma and so on) than, say, Hopcroft and Ullman, but nevertheless a clear one, and incomparably more accessible. And they give full, interesting accounts of a lot of topics not well covered in other books on parsing I’ve looked at. Not everyone knows that parsing does not necessarily proceed left to right, but can run right to left, or even non-directionally, but anyone who has read this book will have a very clear understanding of the matter. (This came up once in connection with an obscure passage of ISO 8879, which defines SGML: the spec says that certain attributes default to the “most recently specified” value. But most recent on what time axis? It turned out that the spec was written in the belief that constructs in a language must necessarily be recognized left to right; the editor expressed disbelief when I told him that other recognition orders were possible.)

The first edition had a very good treatment of van Wijngaarden grammars, affix grammars, and attribute grammars; the second edition expands the treatment into a full chapter on non-Chomsky grammars which also describes tree-adjoining grammars (which I now understand in principle, though I still don’t know how to write a parser for them), coupled grammars, ordered grammars, ‘recognition systems’, and boolean grammars. I had just noticed that there’s no discussion of adaptive grammars and was feeling just a little smug (“wow, there’s an unusual approach to parsing that I know about that Grune and Jacobs don’t mention; I must be hot stuff”) when I found a paragraph that says

One interesting possibility not explored here is to modify the CF grammar under the influence of parser actions. This leads to dynamic grammars, also called modifiable grammars, or adaptable grammars.

And the web-based bibliography for the second edition lists publications on the topic going back to 1963.

The book is much more than “a reservation for endangered parsers” (in the words of Kees van Reeuwijk, quoted in the preface): the descriptions of the standard algorithms are clear and easy to follow, and the authors provide a clear sense of the varying points of view and value schemes of mathematicians studying formal languages, computer scientists, and users from other fields seeking to apply grammars and parsing techniques in those fields.

If you have any interest at all in formal languages, grammars, or parsing, this book should be on your shelf.