[30 May 2009; diagrams added 31 May 2009]
Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.
I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)
So I can now give a slightly simpler account of the FSA construction.
- Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
- For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
- For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
- Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
- The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.
I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.
The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that for long enough sentences in a language, the sentence s can be partitioned into subsequences u, v, w, x, and y, such that s is the concatenation uvwxy, and such that for any positive integer n, the string uvnwxny is also in the language. The automaton recognizes the slightly larger set of sentences uvnwxmy for positive integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.
To take a very simple example, consider the grammar:
- s ::= ‘a’
- s ::= ‘(‘ s ‘)’
- s ::= ‘[‘ s ‘]’
The automaton has ten states, which can be represented in the style of Earley items:
- s ::= · ‘a’
- s ::= ‘a’ ·
- s ::= · ‘(‘ s ‘)’
- s ::= ‘(‘ · s ‘)’
- s ::= ‘(‘ s · ‘)’
- s ::= ‘(‘ s ‘)’ ·
- s ::= · ‘[‘ s ‘]’
- s ::= ‘[‘ · s ‘]’
- s ::= ‘[‘ s · ‘]’
- s ::= ‘[‘ s ‘]’ ·
Plus a new starting state I’ll call 0.
Arcs:
- 0: λ → 1
- 0: λ → 3
- 0: λ → 7
- 1: “
a
” → 2 - 2: λ → 5
- 2: λ → 9
- 3: “
(
” → 4 - 4: λ → 1
- 4: λ → 3
- 4: λ → 7
- 5: “
)
” → 6 - 6: λ → 5
- 6: λ → 9
- 7: “
[
” → 8 - 8: λ → 1
- 8: λ → 3
- 8: λ → 7
- 9: “
]
” → 10 - 10: λ → 5
- 10: λ → 9
Accepting states are 2, 6, and 10.
Or in pictures:
Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:
- 1: “
a
” → 2 - 1: “
(
” → 1 - 1: “
[
” → 1 - 2: “
)
” → 2 - 2: “
]
” → 2
The regular expression “[[(]*a[])]*
” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.
Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):
- Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
- Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
- The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
- Prove that the language recognized by the FSA is essentially the language uvnwxmy for every uvwxy decomposition of sentences in the original language.
- Can the largest practicable regular sublanguage be constructed using this FSA?
- Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?
Sketch of a proof strategy (further elaboration needed).
An Earley parser works by generating sets of Earley items for each position p in the input; each item in those item-sets corresponds to exactly one state in the finite state automaton. The accept configurations of the Earley parser similarly correspond to the accept states of the FSA. If we can show that an Earley parser adds an item to the item-set for any position p only if the FSA can reach the corresponding state after p symbols of the input, then it will follow that an Earley parser working for some grammar G will accept a string as a sentence of L(G) only if the FSA constructed from G accepts the same string as a member of L(FSA(G)). So the superset relation is ensured.
The proof will rely on the fact that if an Earley parser examines an item i1 in the item-set for some position p and handles that item by adding a new item i2 to the item-set for p or p+1, then the FSA will invariably have an arc connecting the corresponding states s1 → s2. If item i1 is a shift item, then the arc corresponds to the labeled arc added by rule 2 for the terminal symbol predicted by the item. If i1 is a prediction item (i.e. the symbol at its position is a non-terminal), then the FSA has an epsilon transition introduced by rule 3(a). If i1 is a reduce item, then the FSA has an epsilon transition introduced by rule 3(b).
I don’t know anything about Earley grammars, but I think I can follow your high quality exposition.
It wasn’t obvious to me why counters wouldn’t work, but eventually I saw that they would accept bad input like “([a)]”. I imagine a stack (of corresponding rules entered, visited or exited) would fix it, or is that defeating the purpose of the state machine?
Oops. The post claims that the automata illustrated recognize “the smallest regular superlanguage” of the language L defined by the context-free grammar we started with.
That’s not true.
Call the language recognized by the automata L′. It’s regular, and it’s a superset of L. So far, so good. But now consider language L″, defined by the following regular grammar:
s = ‘
a
’ | ‘(
’ L′ ‘)
’ | ‘[
’ L′ ‘]
’Since L′ is regular, L″ is clearly also regular. It’s also clearly a superset of L. But L″ is strictly smaller than L′, since it excludes the strings (a] and [a). So L′ cannot be the smallest regular superset of L.
For strings of length up to three, the members of L and L″ are the same; L″ includes strings that are not members of L only if they have more than one pair of brackets.
Informally, we can say that L″ keeps track of one level of brackets, and loses track after that. And now it’s easy to see that there is a language L″′ which keeps track of one or two levels of brackets and loses track after that. Indeed, for any natural number n, there is a language which matches L for up to n pairs of brackets, and includes L as a subset for more than n pairs. So it’s clear (a) that there are an infinite number of regular superset languages which differ only in the number of pairs of brackets they keep track of [or: in the level of nesting in the original grammar they replicate], and (b) that there is no smallest regular superset language.
The language L′ defined in the post is, perhaps, the smallest regular superlanguage that doesn’t count brackets at all. (But I don’t think I know how to distinguish formally between automata that count and automata that don’t count. I did see, some time ago, a monograph on Counter-free automata by Robert McNaughton and Seymour Papert (Cambridge: MIT Press, 1971; MIT research monograph no. 65), but I can’t say I understood it. More unfinished business.)