n-gram Markov models as regular approximations?

[16 September 2016]

By construction, Markov models can generate, or recognize, regular languages. This follows from the fact that they are essentially weighted finite state automata.

If we take a body of material that conforms to a context-free grammar G, segment it into tokens the way a lexical scanner would, and then construct an n-gram Markov model on the basis of the material, we will have a weighted finite state automaton (FSA) that produces or recognizes a regular approximation of the context-free language L(G).
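
To make the construction concrete, here is a minimal sketch in Python (purely illustrative, not drawn from any particular toolkit). It counts bigram transitions over a token stream and normalizes them into conditional probabilities; the result is just the transition table of a weighted FSA whose states are the preceding tokens.

    from collections import defaultdict

    def bigram_model(tokens):
        """Count bigram transitions and normalize them into probabilities.

        The result can be read as a weighted FSA: each state is the
        preceding token, each arc is labelled with the next token, and
        the arc weight is the conditional probability P(next | preceding).
        """
        counts = defaultdict(lambda: defaultdict(int))
        padded = ["<s>"] + list(tokens) + ["</s>"]   # sentence-boundary markers
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
        model = {}
        for prev, nexts in counts.items():
            total = sum(nexts.values())
            model[prev] = {t: c / total for t, c in nexts.items()}
        return model

    # Tokens as a lexical scanner might emit them:
    tokens = ["if", "(", "x", ")", "{", "y", "=", "x", ";", "}"]
    print(bigram_model(tokens)["x"])   # {')': 0.5, ';': 0.5}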

How will that regular approximation, and any regular grammar that expresses it (with or without the weights), relate to the original context-free grammar G? How will it relate to other regular approximations of L(G), the kind that one gets by manipulating the grammar G directly?
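
To make the second kind of approximation concrete, here is a toy example (my own, not taken from any particular paper). One simple way to approximate a context-free grammar directly is to bound its recursion depth: the bounded language is finite, hence regular, and is a subset of L(G). (Approximations that yield supersets of L(G) are equally possible, e.g. by allowing the brackets to go unbalanced.)

    import re

    # Toy CFG:  S -> "(" S ")" | "x"
    # L(G) = { "("*n + "x" + ")"*n : n >= 0 }, which is not regular.
    # Bounding the recursion depth at 2 gives a regular (indeed finite) subset:
    depth_le_2 = re.compile(r"^(x|\(x\)|\(\(x\)\))$")

    for s in ["x", "(x)", "((x))", "(((x)))", "((x)"]:
        print(s, bool(depth_le_2.match(s)))
    # The first three are accepted; "(((x)))" is in L(G) but beyond the
    # depth bound, and "((x)" is in neither language.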

This is mostly an abstract question, but regular approximations have many uses.

  • Recognizing regular languages is often faster than recognizing context-free languages, and requires fewer resources.
  • Schema languages like XSD and Relax NG can use regular expressions to restrict strings, but not context-free grammars. (Amelia Lewis told me once she’d like to design a schema language which allowed context-free grammars for constraining strings; I think that would be an interesting schema language.)
  • Language-specific editing modes (e.g. c-mode in Emacs) must often treat the language in question as if it were regular (i.e. you don’t really have access to the context-free grammar or a parse tree when deciding how much to indent a line: you have to decide based on whether the preceding line ends with an open brace or not, and so on; a small sketch of that style of decision follows this list). I’ve never found descriptions of how to write language modes in Emacs easy to follow, perhaps because they don’t explain how to stop thinking about a language in terms of its context-free grammar and how to start thinking about it instead as a regular language.

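A minimal sketch of what that regular-flavored decision looks like (illustrative only, and in Python rather than the Emacs Lisp a real mode would use): the indentation of the next line is computed from the previous line alone, never from a parse tree of the whole file.

    def indent_for_next_line(prev_line, prev_indent, width=4):
        """Decide indentation by inspecting only the previous line --
        a 'regular' decision, with no access to a parse tree."""
        stripped = prev_line.rstrip()
        if stripped.endswith("{"):              # a block just opened: indent
            return prev_indent + width
        if stripped.endswith("}"):              # a block just closed: dedent
            return max(prev_indent - width, 0)
        return prev_indent                      # otherwise keep the same level

    print(indent_for_next_line("if (x) {", 0))  # 4
    print(indent_for_next_line("y = x;", 4))    # 4
    print(indent_for_next_line("}", 4))         # 0
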
But probably I’m thinking about this today because I’ve been thinking about part-of-speech tagging a lot recently. The easiest way to build a reasonably good part-of-speech tagger nowadays appears to be to build a hidden Markov model on the basis of some training corpus. (There are other more sophisticated approaches, which fight it out among themselves for the odd tenth of a percentage point in their correctly-tagged rates. They don’t appear to be nearly as easy to understand and build.)
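
For concreteness, here is a bare-bones sketch of such a tagger (my own illustration, not any particular published implementation): tag-transition and word-emission probabilities are estimated by counting over a tagged corpus, and new sentences are tagged by Viterbi decoding. A usable tagger would add smoothing and some treatment of unknown words; the constant floor below merely stands in for that.

    from collections import defaultdict
    import math

    def train(tagged_sentences):
        """Estimate P(tag | previous tag) and P(word | tag) by counting."""
        trans = defaultdict(lambda: defaultdict(int))
        emit = defaultdict(lambda: defaultdict(int))
        for sent in tagged_sentences:
            prev = "<s>"
            for word, tag in sent:
                trans[prev][tag] += 1
                emit[tag][word] += 1
                prev = tag
            trans[prev]["</s>"] += 1
        def normalize(table):
            return {a: {b: c / sum(row.values()) for b, c in row.items()}
                    for a, row in table.items()}
        return normalize(trans), normalize(emit)

    def viterbi(words, trans, emit, floor=1e-8):
        """Most probable tag sequence under the bigram HMM (in log space)."""
        tags = list(emit)
        best = {t: (math.log(trans["<s>"].get(t, floor))
                    + math.log(emit[t].get(words[0], floor)), [t])
                for t in tags}
        for w in words[1:]:
            best = {t: max((best[p][0] + math.log(trans[p].get(t, floor))
                            + math.log(emit[t].get(w, floor)),
                            best[p][1] + [t])
                           for p in tags)
                    for t in tags}
        return max((best[t][0] + math.log(trans[t].get("</s>", floor)),
                    best[t][1]) for t in tags)[1]

    corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
              [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
    trans, emit = train(corpus)
    print(viterbi(["the", "cat", "barks"], trans, emit))  # ['DET', 'NOUN', 'VERB']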

I conjecture that one of the reasons bigram and trigram part-of-speech taggers do as well as they do is that they apply information about what can occur where in a sentence, in a way that has been dumbed down to the point where it can fit into (be expressed by) a finite state automaton. I wonder if a systematic way to go from a context-free grammar to an FSA could help build better / smarter taggers. Could it capture more information about grammatical context? Transmit that information over longer distances? Provide guidance in cases where the available training data would otherwise be too sparse?

One reason trigrams can do better than bigrams is that they provide more information for the decision about how to tag each word: the choice of tag depends not just on the preceding tag but on the preceding two tags. One reason trigrams can do worse than bigrams is that there are a lot more potential trigrams for any set of part-of-speech tags than there are bigrams, so trigrams are apt to suffer from sparse-data problems unless one has “a lot” of training data (for some meaning of “a lot”); 4-grams, 5-grams, etc., appear to be non-starters because of the sparse-data problem. Could a systematic derivation of a Markov model from a CFG help? Could we judiciously tweak the underlying FSA to carry information we know (or suspect) will be useful? That would provide the same advantages as n-grams with larger n. Could generating the FSA from a grammar help provide guidance for distinguishing grammatical-but-infrequent turns of phrase from gibberish? That would help minimize the sparse-data problem for n-grams with larger n.
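
To put rough numbers on the sparse-data worry (illustrative arithmetic only, assuming a 45-tag tagset like the Penn Treebank's and a million-token training corpus): the number of distinct tag n-grams grows exponentially with n, while the corpus supplies only about one n-gram token per word, so the expected count per distinct n-gram collapses quickly.

    tagset_size = 45           # e.g. the Penn Treebank tagset
    corpus_tokens = 1_000_000  # a respectably large training corpus

    for n in (2, 3, 4, 5):
        possible = tagset_size ** n
        # The corpus yields roughly one n-gram token per word, so this is
        # an upper bound on the average count per distinct n-gram:
        print(n, possible, round(corpus_tokens / possible, 3))
    # 2 2025 493.827
    # 3 91125 10.974
    # 4 4100625 0.244
    # 5 184528125 0.005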

It’s the 16th of September — Happy Independence Day, neighbors! Viva Hidalgo!