n-gram Markov models as regular approximations?

[16 September 2016][cleared out Viagra spam 27 April 2017]

By construction, Markov models can generate, or recognize, regular languages. This follows from the fact that they are essentially weighted finite state automata.

If we take a body of material that conforms to a context-free grammar G, segment it into tokens the way a lexical scanner would do, and then construct an n-gram Markov model on the basis of the material, we will have a weighted finite state automaton (FSA) that produces or recognizes a regular approximation of the context-free language of G.
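A minimal sketch of that construction, for illustration only. The training material here is a handful of sentences from a toy bracket grammar (s ::= 'a' | '(' s ')' | '[' s ']'), each character is treated as one token, and the function names and the `<s>` / `</s>` boundary markers are assumptions of the sketch rather than anything prescribed above.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed sentence-boundary markers

def train_bigrams(sentences):
    """Count token-to-token transitions; the counts are the weights of the FSA."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [START, *tokens, END]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts

def probability(counts, tokens):
    """Product of the transition probabilities; 0.0 if any transition was never seen."""
    p = 1.0
    padded = [START, *tokens, END]
    for prev, cur in zip(padded, padded[1:]):
        total = sum(counts[prev].values())
        if total == 0 or counts[prev][cur] == 0:
            return 0.0
        p *= counts[prev][cur] / total
    return p

# Toy corpus: every string is a sentence of the bracket grammar.
corpus = [list(s) for s in ["a", "(a)", "[a]", "((a))", "([a])", "[(a)]"]]
model = train_bigrams(corpus)

print(probability(model, list("(a)")) > 0)   # a sentence of the grammar
print(probability(model, list("((a)")) > 0)  # unbalanced, but every bigram was seen
print(probability(model, list("a(")) > 0)    # the bigram a-( never occurs in the corpus
```

The bigram model accepts the unbalanced string because each adjacent pair of tokens occurs somewhere in the corpus; that is the sense in which it is only a regular approximation of the context-free language.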

How will that regular approximation, and any regular grammar that expresses it (with or without the weights), relate to the original context-free grammar G? How will it relate to other regular approximations of L(G), the kind that one gets by manipulating the grammar G directly?

This is mostly an abstract question, but regular approximations have many uses.

  • Recognizing regular languages is often faster than recognizing context-free languages, and requires fewer resources.
  • Schema languages like XSD and Relax NG can use regular expressions to restrict strings, but not context-free grammars. (Amelia Lewis told me once she’d like to design a schema language which allowed context-free grammars for constraining strings; I think that would be an interesting schema language.)
  • Language-specific editing modes (e.g. c-mode in Emacs) must often treat the language in question as if it were regular (i.e. you don’t really have access to the context-free grammar or a parse tree when deciding how much to indent a line: you have to decide based on whether the preceding line ends with an open brace or not, and so on). I’ve never found descriptions of how to write language modes in Emacs easy to follow, perhaps because they don’t explain how to stop thinking about a language in terms of its context-free grammar and how to think about it instead as a regular language.

But probably I’m thinking about this today because I’ve been thinking about part-of-speech tagging a lot recently. The easiest way to build a reasonably good part-of-speech tagger nowadays appears to be to build a hidden Markov model on the basis of some training corpus. (There are other more sophisticated approaches, which fight it out among themselves for the odd tenth of a percentage point in their correctly-tagged rates. They don’t appear to be nearly as easy to understand and build.)

I conjecture that one of the reasons bigram and trigram part-of-speech taggers do as well as they do is that they apply information about what can occur where in a sentence, in a way that has been dumbed down to the point where it can fit into (be expressed by) a finite state automaton. I wonder if a systematic way to go from a context-free grammar to an FSA could help build better / smarter taggers. Could it capture more information about grammatical context? Transmit that information over longer distances? Provide guidance in cases where the available training data would otherwise be too sparse?

One reason trigrams can do better than bigrams is that they provide more information for the decision about how to tag each word: the choice of tag depends not just on the preceding tag but on the preceding two tags. One reason trigrams can do worse than bigrams is that there are a lot more potential trigrams for any set of part-of-speech tags than there are bigrams, so trigrams are apt to suffer from sparse-data problems unless one has “a lot” of training data (for some meaning of “a lot”); 4-grams, 5-grams, etc., appear to be non-starters because of the sparse-data problem. Could a systematic derivation of a Markov model from a CFG help? Could we judiciously tweak the underlying FSA to carry information we know (or suspect) will be useful? That would provide the same advantages as n-grams with larger n. Could generating the FSA from a grammar help provide guidance for distinguishing grammatical-but-infrequent turns of phrase from gibberish? That would help minimize the sparse-data problem for n-grams with larger n.
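To put rough numbers on the sparse-data problem: with a tag set of size t there are t^n possible n-grams of tags. The sketch below assumes a tag set of 40 tags and a training corpus of one million tokens; both figures are made up for illustration and are not taken from any particular tagger or corpus.

```python
def possible_ngrams(tagset_size, n):
    """Number of distinct tag n-grams over a tag set of the given size."""
    return tagset_size ** n

TAGSET_SIZE = 40             # assumed, for illustration only
TRAINING_TOKENS = 1_000_000  # assumed, for illustration only

for n in range(2, 6):
    total = possible_ngrams(TAGSET_SIZE, n)
    # A corpus of N tokens holds fewer than N n-gram instances, so this ratio
    # is an upper bound on the average number of observations per possible n-gram.
    print(n, total, TRAINING_TOKENS / total)
```

Under these assumptions the average number of observations available per possible n-gram falls below one somewhere between trigrams and 4-grams.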

It’s the 16th of September — Happy Independence Day, neighbors! Viva Hidalgo!

Automata and infinite strings

[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you’re using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it’s not clear that that’s true. Instead of the sequence "3.141592...", store an algorithm for calculating, on demand, the nth digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn’t what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “finite” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that in principle automata theorists have historically not restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said, without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: there is no final state.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

  1. Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
  2. Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer). Do these rules for acceptance of strings define sets of automata with different discriminating power?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.

[Enrique tells me I should pause to draw pictures of these automata, but I’m not going to take the time just yet. Apologies to those who find it hard to visualize what I’m talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation, 1/7 has an infinitely repeating sequence of digits (142857, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by an automaton with finite states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)
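A small sketch of the 1/7 case, under assumptions of my own: an infinite string that eventually repeats is represented as a finite prefix plus a non-empty cycle repeated forever, and the state names ("int", "point", "q0" to "q5", "err") are invented for the example. For a deterministic automaton reading such a string, the states entered infinitely often can be found by running passes over the cycle until the state at the start of a pass recurs.

```python
def states_visited_forever(step, start, prefix, cycle):
    """States entered infinitely often when reading prefix + cycle + cycle + ..."""
    state = start
    for symbol in prefix:
        state = step(state, symbol)
    first_pass_from = {}  # state at the start of a pass -> index of that pass
    passes = []           # states entered during each pass over the cycle
    while state not in first_pass_from:
        first_pass_from[state] = len(passes)
        visited = []
        for symbol in cycle:
            state = step(state, symbol)
            visited.append(state)
        passes.append(visited)
    # From the recurring pass onward, the same passes repeat forever.
    return {s for one_pass in passes[first_pass_from[state]:] for s in one_pass}

# Automaton for the decimal expansion of 1/7: "0." followed by 142857 repeated.
DELTA = {("int", "0"): "point", ("point", "."): "q0"}
for i, digit in enumerate("142857"):
    DELTA[(f"q{i}", digit)] = f"q{(i + 1) % 6}"
ACCEPT = {f"q{i}" for i in range(6)}

def step(state, symbol):
    # Any transition not listed leads to (and stays in) the error state.
    return DELTA.get((state, symbol), "err")

def rule1(forever):
    """Ends up in a single state it never leaves, and that state accepts."""
    return len(forever) == 1 and forever <= ACCEPT

def rule2(forever):
    """From some point on, every state is an accept state."""
    return forever <= ACCEPT

one_seventh = states_visited_forever(step, "int", "0.", "142857")
one_third = states_visited_forever(step, "int", "0.", "3")
print(rule1(one_seventh), rule2(one_seventh))  # six states recur, all accepting
print(rule1(one_third), rule2(one_third))      # stuck in the error state
```

The same expansion written with a different split, say prefix "0.14" and cycle "285714", yields the same set of recurring states, so the verdict does not depend on how the repeating part is presented.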

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

  3. Accept if at every point in the sequence of states there are an infinite number of accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second rule for acceptance, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed with other symbols, will not be accepted by this automaton (nor by any other, as far as I can tell).

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.
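The two-state automaton can be tried out on strings that eventually repeat. As a sketch, and only that: an infinite string is given as a finite prefix plus a non-empty cycle repeated forever, the start state is assumed to be A (the text does not say), and the second and third rules are phrased in terms of the set of states entered infinitely often.

```python
def states_visited_forever(step, start, prefix, cycle):
    """States entered infinitely often when reading prefix + cycle + cycle + ..."""
    state = start
    for symbol in prefix:
        state = step(state, symbol)
    first_pass_from = {}  # state at the start of a pass -> index of that pass
    passes = []           # states entered during each pass over the cycle
    while state not in first_pass_from:
        first_pass_from[state] = len(passes)
        visited = []
        for symbol in cycle:
            state = step(state, symbol)
            visited.append(state)
        passes.append(visited)
    return {s for one_pass in passes[first_pass_from[state]:] for s in one_pass}

def step(state, symbol):
    """On the symbol B go to state B; on any other symbol go to state A."""
    return "B" if symbol == "B" else "A"

ACCEPT = {"B"}

def rule2(forever):
    """From some point on, every state is an accept state."""
    return forever <= ACCEPT

def rule3(forever):
    """Accept states keep turning up, however far along we look."""
    return bool(forever & ACCEPT)

all_bs = states_visited_forever(step, "A", "xy", "B")    # xyBBBB...
mixed = states_visited_forever(step, "A", "", "xB")      # xBxBxB...
no_bs = states_visited_forever(step, "A", "B", "x")      # Bxxxx...

for name, forever in [("all Bs", all_bs), ("mixed", mixed), ("no Bs", no_bs)]:
    print(name, sorted(forever), rule2(forever), rule3(forever))
```

The string with Bs interspersed among other symbols is the one on which the two rules disagree for this automaton: it is rejected under the second rule and accepted under the third.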

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in Introduction to automata theory, languages, and computation (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: a practical guide, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, The vastness of natural languages (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in Automata studies, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

Grail for regular languages

[11 December 2009]

Every now and then — not constantly, but recurrently — I experience a strong need to have a running copy of Grail, a software package first written by Derick Wood and Darrell Raymond and described by its documentation as “a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.”

Among other things, Grail is handy for answering questions about the equivalence or non-equivalence of regular expressions, or about subset/superset relations holding between the languages recognized by them. A few years ago, for example, the W3C XML Schema Working Group found itself in possession of two different descriptions of the lexical space of the XSD duration type. The working group wished, not unreasonably, to check that the two really were equivalent.

The first description provided three regular expressions, and said the lexical space of duration included all the strings which matched all three expressions:

  • -?P([0-9]+Y)?([0-9]+M)?([0-9]+D)?(T([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)? (strings in which the fields of an ISO 8601 duration appear in the correct order, and in which each field appears only if it has at least one digit present)
  • .*[YMDHS].* (strings in which at least one field is present)
  • [^T]+(T[^HMS]+[HMS].*)? (if the character T appears, it must be followed by one of the time-related fields)
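As a sketch of how the first description works in practice, the three expressions can be applied together with Python's re module. Two assumptions: XSD regular expressions must match the whole string, so re.fullmatch is used, and the decimal point in the seconds field is written \. on the assumption that a literal point is meant. The constant and function names are invented for the example.

```python
import re

# The three expressions from the first description of the duration lexical space.
ORDER = re.compile(
    r"-?P([0-9]+Y)?([0-9]+M)?([0-9]+D)?"
    r"(T([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)?"
)
SOME_FIELD = re.compile(r".*[YMDHS].*")
T_NEEDS_TIME_FIELD = re.compile(r"[^T]+(T[^HMS]+[HMS].*)?")

def in_lexical_space(s):
    """True if s matches all three expressions, each against the whole string."""
    return all(p.fullmatch(s) for p in (ORDER, SOME_FIELD, T_NEEDS_TIME_FIELD))

for candidate in ["P1Y2M3DT4H5M6.7S", "-P1D", "PT1M", "P", "P1YT", "P1M2Y"]:
    print(candidate, in_lexical_space(candidate))
```

Each of the last three candidates is rejected by a different expression: "P" has no field at all, "P1YT" has a T with no time field after it, and "P1M2Y" has its fields out of order.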

The second description translated the context-free grammar into regular-expression form (I’ve introduced white space for legibility):


Easy enough to eyeball, for some people, I guess, but the working group wanted a more reliable method.

After a few hours trying vainly to compile Grail for my Linux box, I found an RPM that worked for me, and in ten minutes or so I had used Grail to establish that the two descriptions are equivalent.

Today I realized that another problem I face could best be solved by using Grail, but I no longer have a Linux box (and have not, in any case, found that old RPM). Grail 2.5 is dated March 1996, and the C++ in which it is written does not seem to please GCC 4.0.1. Grail+ 3.0, a successor project in other hands, may have been touched as recently as 2002 or 2004, but most of the dates appear to be in summer or fall 1998. GCC doesn’t like it, either.

So I have thus far been unable to recompile this very helpful tool.

If anyone out there knows of anyone who has either massaged the source of Grail into a form more like what modern C++ compilers will compile, or found out what combination of compile-time flags will persuade GCC to put itself in a more forgiving frame of mind and compile the thing, please get in touch. (And no, -Wno-deprecated does not suffice to do the trick.)

And any C++ proficients looking for interesting and useful projects to undertake could do a lot worse for themselves and for the world than to bring Grail into the twenty-first century.

Dumbing grammars down, cont’d

[30 May 2009; diagrams added 31 May 2009]

Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.

I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)

So I can now give a slightly simpler account of the FSA construction.

  1. Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
  2. For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
  3. For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
  4. Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
  5. The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.
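The five steps can be sketched in code. This is one possible reading of the construction, not a tested tool: a grammar is given as a list of (left-hand side, right-hand side) pairs, a state is a (rule index, position) pair, the label None stands for the empty string, and all the names are mine.

```python
from collections import defaultdict

def build_position_automaton(rules, start_symbol):
    """Build the automaton of steps 1-5; returns (start state, arcs, accepting states)."""
    nonterminals = {lhs for lhs, _ in rules}
    arcs = defaultdict(set)  # (state, label) -> set of target states
    start = "start"          # step 4: a new starting state
    for r, (lhs, rhs) in enumerate(rules):
        for p, symbol in enumerate(rhs):
            here, after = (r, p), (r, p + 1)  # step 1: positions are states
            if symbol not in nonterminals:
                arcs[(here, symbol)].add(after)  # step 2: terminal arc
            else:
                for r2, (lhs2, rhs2) in enumerate(rules):
                    if lhs2 == symbol:
                        arcs[(here, None)].add((r2, 0))           # step 3(a)
                        arcs[((r2, len(rhs2)), None)].add(after)  # step 3(b)
        if lhs == start_symbol:
            arcs[(start, None)].add((r, 0))  # step 4
    accepting = {(r, len(rhs)) for r, (lhs, rhs) in enumerate(rules)
                 if lhs == start_symbol}     # step 5
    return start, arcs, accepting

def accepts(automaton, tokens):
    """Simulate the automaton, following empty-string arcs wherever possible."""
    start, arcs, accepting = automaton

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for target in arcs.get((stack.pop(), None), ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    current = closure({start})
    for token in tokens:
        current = closure({t for s in current for t in arcs.get((s, token), ())})
    return bool(current & accepting)

# A small bracket grammar, used here only to exercise the construction.
rules = [("s", ("a",)), ("s", ("(", "s", ")")), ("s", ("[", "s", "]"))]
automaton = build_position_automaton(rules, "s")
for text in ["a", "([a])", "(a]", "((a)", "a("]:
    print(text, accepts(automaton, text))
```

On this grammar the automaton accepts the properly nested strings and also mismatched ones such as "(a]" and "((a)", while still rejecting "a(".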

I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.

The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that for long enough sentences in a language, the sentence s can be partitioned into subsequences u, v, w, x, and y, such that s is the concatenation uvwxy, and such that for any positive integer n, the string u v^n w x^n y is also in the language. The automaton recognizes the slightly larger set of sentences u v^n w x^m y for positive integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.

To take a very simple example, consider the grammar:

  • s ::= 'a'
  • s ::= '(' s ')'
  • s ::= '[' s ']'

The automaton has ten states, which can be represented in the style of Earley items:

  1. s ::= · 'a'
  2. s ::= 'a' ·
  3. s ::= · '(' s ')'
  4. s ::= '(' · s ')'
  5. s ::= '(' s · ')'
  6. s ::= '(' s ')' ·
  7. s ::= · '[' s ']'
  8. s ::= '[' · s ']'
  9. s ::= '[' s · ']'
  10. s ::= '[' s ']' ·

Plus a new starting state I’ll call 0.


  • 0: λ → 1
  • 0: λ → 3
  • 0: λ → 7
  • 1: “a” → 2
  • 2: λ → 5
  • 2: λ → 9
  • 3: “(” → 4
  • 4: λ → 1
  • 4: λ → 3
  • 4: λ → 7
  • 5: “)” → 6
  • 6: λ → 5
  • 6: λ → 9
  • 7: “[” → 8
  • 8: λ → 1
  • 8: λ → 3
  • 8: λ → 7
  • 9: “]” → 10
  • 10: λ → 5
  • 10: λ → 9

Accepting states are 2, 6, and 10.

Or in pictures:

[Figure: drawing of the finite-state automaton described in the text, with circles for the states and arrows for the transitions.]

Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:

  • 1: “a” → 2
  • 1: “(” → 1
  • 1: “[” → 1
  • 2: “)” → 2
  • 2: “]” → 2

[Figure: drawing of the two-state finite-state automaton described in the text, with circles for the states and arrows for the transitions.]

The regular expression “[[(]*a[])]*” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.
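A brute-force check, offered as a sanity test rather than a proof: the eleven-state automaton is transcribed from the list of transitions above, simulated directly, and compared with the regular expression on every string up to a given length over the alphabet a ( ) [ ]. The brackets inside the character classes are escaped for Python's re module; the helper names are assumptions of the sketch.

```python
import re
from itertools import product

EPS = None  # label for the empty-string (λ) arcs
ARCS = [
    (0, EPS, 1), (0, EPS, 3), (0, EPS, 7),
    (1, "a", 2), (2, EPS, 5), (2, EPS, 9),
    (3, "(", 4), (4, EPS, 1), (4, EPS, 3), (4, EPS, 7),
    (5, ")", 6), (6, EPS, 5), (6, EPS, 9),
    (7, "[", 8), (8, EPS, 1), (8, EPS, 3), (8, EPS, 7),
    (9, "]", 10), (10, EPS, 5), (10, EPS, 9),
]
ACCEPTING = {2, 6, 10}
REGEX = re.compile(r"[\[(]*a[\])]*")

def closure(states):
    """All states reachable from the given ones by empty-string arcs alone."""
    seen, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for src, label, dst in ARCS:
            if src == state and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def nfa_accepts(string):
    current = closure({0})
    for ch in string:
        current = closure({dst for src, label, dst in ARCS
                           if src in current and label == ch})
    return bool(current & ACCEPTING)

def disagreements(max_len, alphabet="a()[]"):
    """Strings on which the automaton and the regular expression differ."""
    return [s for n in range(max_len + 1)
            for s in map("".join, product(alphabet, repeat=n))
            if nfa_accepts(s) != bool(REGEX.fullmatch(s))]

print(disagreements(5))
print(nfa_accepts("(a]"), nfa_accepts("a("))
```

An empty list of disagreements for short strings is consistent with the claim that the expression and the automaton describe the same language, though it establishes nothing about longer strings.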

Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):

  • Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
  • Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
  • The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
  • Prove that the language recognized by the FSA is essentially the language u v^n w x^m y for every uvwxy decomposition of sentences in the original language.
  • Can the largest practicable regular sublanguage be constructed using this FSA?
  • Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?

Simple proof that the URI grammar of RFC 3986 defines a regular language

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.


There were no solutions; from this I infer that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line


makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.

The line


similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line


uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...).

The line


takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.

The final lines


seek a left-hand side LHS in Gc which has an edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS; in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then its language is clearly regular.
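For readers without a Prolog system to hand, the same check can be sketched in Python. This is an analogue of the steps described above, not the code actually used: the edge list is a small fragment of the refers relation (the URI rule quoted earlier, plus a few further rules of RFC 3986 as I read them), and the function names are invented.

```python
def reachable_from(graph, source):
    """All non-terminals reachable from source by one or more edges."""
    seen, stack = set(), list(graph.get(source, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

def recursive_nonterminals(edges):
    """Non-terminals that can reach themselves, i.e. self-loops in the transitive closure."""
    graph = {}
    for lhs, rhs in edges:
        graph.setdefault(lhs, set()).add(rhs)
    return {lhs for lhs in graph if lhs in reachable_from(graph, lhs)}

# Fragment of the refers relation; illustrative, not the full grammar.
uri_fragment = [
    ("URI", "scheme"), ("URI", "hier-part"), ("URI", "query"), ("URI", "fragment"),
    ("hier-part", "authority"), ("hier-part", "path-abempty"),
    ("authority", "userinfo"), ("authority", "host"), ("authority", "port"),
    ("query", "pchar"), ("fragment", "pchar"),
    ("pchar", "unreserved"), ("pchar", "pct-encoded"), ("pchar", "sub-delims"),
]
print(recursive_nonterminals(uri_fragment))

# A hypothetical recursive grammar for contrast: s ::= 'a' | '(' s ')'
print(recursive_nonterminals([("s", "s")]))
```

An empty result for the first call corresponds to the Prolog goal having no solutions; the second call shows what a recursive non-terminal looks like to the check.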


It occurs to me to wonder: how do I know that a non-recursive context-free grammar is necessarily regular? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.


A language is regular, if its syntax can be expressed by a single EBNF expression.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the language is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.
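A tiny illustration of the first point, using a made-up rule rather than anything from the RFC: the BNF rule digits ::= digit | digit digits is recursive, yet its language is just [0-9]+. The sketch below recognizes the rule by direct recursion and compares the result with the regular expression on all short strings over a small test alphabet.

```python
import re
from itertools import product

def digits(s):
    """Recognizer for the recursive rule: digits ::= digit | digit digits"""
    if not s or s[0] not in "0123456789":
        return False
    return len(s) == 1 or digits(s[1:])

NONRECURSIVE = re.compile(r"[0-9]+")  # the same language, with no recursion

def mismatches(max_len, alphabet="09x"):
    """Strings on which the recursive rule and the regular expression differ."""
    return [s for n in range(max_len + 1)
            for s in map("".join, product(alphabet, repeat=n))
            if digits(s) != bool(NONRECURSIVE.fullmatch(s))]

print(mismatches(5))
```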

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answering the question. “Well, I just tried the construction, and it worked.” He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.