[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you’re using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it’s not clear that that’s true. Instead of the sequence `"3.141592..."`

, store an algorithm for calculating, on demand, the *n*th digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn’t what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “*finite*” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that in principle automata theorists have historically not restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said, without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: there is no final state.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

- Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
- Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer).

Do these rules for acceptance of strings define sets of automata with different discriminating power?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.

[Enrique tells me I should pause to draw pictures of these automata, but I’m not going to take the time just yet. Apologies to those who find it hard to visualize what I’m talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation,1/7 has an infinitely repeating sequence of digits (`142857`

, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by an automaton with finite states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

- Accept if at every point in the sequence of states there are an infinite number of accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second story about termination, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed with other symbols, will not be accepted by this automaton (nor by any other, as far as I can tell).

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in *Introduction to automata theory, languages, and computation* (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, *Parsing techniques: a practical guide*, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages also doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, *The vastness of natural languages* (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in *Automata studies*, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, *Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk* 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

So the first acceptance rule suffices for recognizing integers.

That depends I think what you mean by recognise here. Unlike the finite case there may be no effective procedure for deciding your criterion.

It depends for a start on the strength of the language that you allow to be used to describe the infinite sequence.

take for example the number 0.00000…. which you want to accept as the integer 0, you can only do that if you know that you will stay in the

“seen 0” state indefinitely, if the infinite string is described as

0 . x1 x2 x3 x4 …

where the character xi is 0 if Fermat’s Last Theorem is true for the ith odd prime, and 1 otherwise.

then knowing that this sequence is an integer requires proving fermats last theorem (which is hard, but done). Clearly though the description of the sequence could be made to depend on unproven or (given a suitably expressive language to describe the sequence) unprovable statements.

To show your automaton accepts a sequence you need to show that given the current state and a generating function for the remainder of the sequence, that the automaton will only reach states with the accept property. I’d guess (without doing any reading on the subject) that is only tractable if you severely restrict the language that you allow to be used for the generating function.

David, thank you. You are quite right that I used the word “recognize” without a definition, and that clarifying what is to be meant by it seems to require that we engage with the notion of effective procedure.

My initial thought on this question is that I’d like to take my cue from Langendoen and Postal. In

Vastness, they argue that restricting the concept ofnatural languageto countably infinite sets of finite sentences over a finite vocabulary is arbitrary and unjustified; they suggest that natural languages are larger than countably infinite (and in fact, in the terminology they use, that they are not sets butmegacollections). But it does not follow, they argue, that natural languages and their sentences cannot be usefully characterized. It’s just a question of using different tools.I think the idea I was struggling towards, in the sentence you quote, might be more carefully enunciated this way: the first acceptance rule suffices to allow us to use finite state automata to

characterize(rather thanrecognize, if we understand recognition as involving an effective procedure) the set of strings denoting integers.This of course only pushes the bump in the carpet over and leads to the question “What does it mean to characterize a set of infinite strings using an FSA?” My tentative answer is: for any automaton, there is a set (possibly the empty set) of paths through that automaton; each such path corresponds to one or more strings. Using the various acceptance rules I mentioned, or others, we can characterize some set of such paths as ‘accept paths’, and say that the automaton so defined characterizes the set of strings which correspond to those accept paths.

In the usual (standard) case, we restrict attention to finite paths and finite strings, the set of accept paths is the set of paths whose final state is an accept state, and we have the pleasant state of affairs that operating the FSA provides a convenient and simple effective procedure for deciding whether a string is a member of the set characterized by the FSA.

In the case of infinite strings, we don’t have that guarantee of termination and so FSAs do not in themselves provide an effective procedure for determining set membership. But in cases where we can describe the string sufficiently well, we can know whether its path through an automaton is an accept path or not. FSAs serve here not as effective procedures for deciding set membership, but as convenient mechanisms for describing the set.

On the other hand, Langendoen and Postal deny that constructive methods like generative grammars (and a fortiori, I suppose, FSAs) can usefully characterize sets of sequences with infinite and transfinite length. I’m not sure I understand all of the logic here: certainly the usual account of sentence generation doesn’t produce infinite sequences, so the usual way of describing the relation between a grammar (or an FSA) and the string doesn’t work. But it does not seem clear to me that no other way of describing the relation between grammars and sets of strings can be supplied; I think my speculations are trying to move, in their own fumbling way, toward some sense of what such an alternative description might look like.

I agree entirely that our ability to show that a given string does or does not correspond to an accepting path in a given automaton depends on what mechanisms we use for defining, or identifying, the string. It might be interesting to characterize the decidability of the question, for various kinds of string characterization or generator. But I suppose that that, in a sense, is what the study of computable numbers is about.