[30 May 2009; diagrams added 31 May 2009]
Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.
I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)
So I can now give a slightly simpler account of the FSA construction.
- Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
- For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
- For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
- Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
- The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.
I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.
The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that for long enough sentences in a language, the sentence s can be partitioned into subsequences u, v, w, x, and y, such that s is the concatenation uvwxy, and such that for any positive integer n, the string uvnwxny is also in the language. The automaton recognizes the slightly larger set of sentences uvnwxmy for positive integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.
To take a very simple example, consider the grammar:
- s ::= ‘a’
- s ::= ‘(‘ s ‘)’
- s ::= ‘[‘ s ‘]’
The automaton has ten states, which can be represented in the style of Earley items:
- s ::= · ‘a’
- s ::= ‘a’ ·
- s ::= · ‘(‘ s ‘)’
- s ::= ‘(‘ · s ‘)’
- s ::= ‘(‘ s · ‘)’
- s ::= ‘(‘ s ‘)’ ·
- s ::= · ‘[‘ s ‘]’
- s ::= ‘[‘ · s ‘]’
- s ::= ‘[‘ s · ‘]’
- s ::= ‘[‘ s ‘]’ ·
Plus a new starting state I’ll call 0.
Arcs:
- 0: λ → 1
- 0: λ → 3
- 0: λ → 7
- 1: “
a
” → 2 - 2: λ → 5
- 2: λ → 9
- 3: “
(
” → 4 - 4: λ → 1
- 4: λ → 3
- 4: λ → 7
- 5: “
)
” → 6 - 6: λ → 5
- 6: λ → 9
- 7: “
[
” → 8 - 8: λ → 1
- 8: λ → 3
- 8: λ → 7
- 9: “
]
” → 10 - 10: λ → 5
- 10: λ → 9
Accepting states are 2, 6, and 10.
Or in pictures:
Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:
- 1: “
a
” → 2 - 1: “
(
” → 1 - 1: “
[
” → 1 - 2: “
)
” → 2 - 2: “
]
” → 2
The regular expression “[[(]*a[])]*
” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.
Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):
- Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
- Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
- The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
- Prove that the language recognized by the FSA is essentially the language uvnwxmy for every uvwxy decomposition of sentences in the original language.
- Can the largest practicable regular sublanguage be constructed using this FSA?
- Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?