Automata and infinite strings

[15 December 2009]

[This is another one of those ignorance-is-bliss postings. If I had studied automata theory properly, this would (I guess) have been covered in class; that would have deprived me of the fun of thinking about it without knowing the right answer. If you did study automata theory, and you know how infinite strings are handled, and it irritates you to see someone waffling on and on about it instead of just doing some homework and reading the damn literature, you might want to stop reading soon.]

Some time ago, Michael Kay suggested that it was pointless for the XSD datatypes spec to specify that the lexical representations of integers, or strings, or various other simple types, were finite sequences of characters with certain properties. No implementation, he pointed out, can reliably distinguish finite from infinite strings, so it’s a non-testable assertion.

[“True if you’re using conventional I/O and conventional representations of strings, maybe,” said Enrique. “But if you represent the sequence of characters using a description, rather than an array of characters, it’s not clear that that’s true. Instead of the sequence "3.141592...", store an algorithm for calculating, on demand, the nth digit of the decimal expansion of π. Ditto for the square root of 2. And so on!” “You may be right,” I said. “But that wasn’t what I wanted to talk about, so be quiet.”]

The working group declined the proposal to drop the word “finite” on the grounds that if the strings in question are required to be finite, then we know that all the lexical representations of integers (for example) can in principle be recognized by a finite state automaton. Without the restriction to finite sequences, most of what people know about finite state automata isn’t guaranteed to apply.

I found myself wondering this morning about the possible application of automata to infinite and semi-infinite strings. I know that automata theorists have not, historically, restricted their interest to finite automata; it seems plausible to assume they have also considered infinite strings. But I don’t know what they have said without spending time looking it up; instead, I am going to enjoy myself for a little while seeing how much I can figure out for myself.

One obvious question to answer is: if you want to use an automaton to identify infinite sequences, how do you define acceptance of the sequence? For a finite sequence, you ask what state you’re in at the end of the sequence, and whether that state is an “accept state” or not. That won’t work for an infinite sequence: the sequence has no end, so there is no final state to examine.

Perhaps we can consider the sequence of states the automaton enters and define acceptance in terms of that sequence. Possible answers:

  1. Accept if (a) the automaton eventually ends up in a single state which it never again leaves, and (b) that state is an accept state.
  2. Accept if there is some point in the sequence of states such that every state following that point is an accept state.

These would work (in the sense of providing a yes/no answer).
Do these two acceptance rules give automata different discriminating power, that is, do they define different classes of recognizable languages?

It seems obvious that they do, but what exactly are the differences?

Consider, for example, automata for recognizing various classes of numbers written as an infinite sequence of decimal digits. Numbers seem to be on my mind, perhaps because of the tie-in to XSD datatypes.

For such infinite strings of digits (including a decimal point), integers have the property that every digit to the right of (i.e. following) the decimal point is a 0. If you build the obvious automaton, for an integer it will spend all its time in the zero-after-decimal-point state, and for a non-integer it will, eventually, end up caught in an error state.
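
For concreteness, here is a rough sketch of that obvious automaton in Python (my own illustration, and only suggestive: no finite simulation can actually decide acceptance of an infinite string; all it can do is run over a finite prefix and report where the run has got to):

    # The "obvious" automaton: digits before the decimal point keep us in a
    # pre-point state; after the point, only zeroes keep us in the accept
    # state; anything else drops us into an error sink.
    PRE, ZEROS, ERROR = "pre-point", "zero-after-point", "error"
    ACCEPTING = {ZEROS}

    def step(state, ch):
        """Transition function of the integer-recognizing automaton."""
        if state == ERROR:
            return ERROR                        # the error state is a sink
        if state == PRE:
            if ch == ".":
                return ZEROS
            return PRE if ch.isdigit() else ERROR
        return ZEROS if ch == "0" else ERROR    # state == ZEROS

    def run(prefix):
        """Return the sequence of states visited while reading a finite prefix."""
        state, trace = PRE, [PRE]
        for ch in prefix:
            state = step(state, ch)
            trace.append(state)
        return trace

    print(run("42." + "0" * 20)[-1] in ACCEPTING)   # True: stuck in the accept state
    print(run("0." + "3" * 20)[-1] in ACCEPTING)    # False: fell into the error sink

On a prefix of 42.000… the run settles into the zero-after-decimal-point state; on a prefix of 0.333… it falls into the error sink and stays there.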

[Enrique tells me I should pause to draw pictures of these automata, but I’m not going to take the time just yet. Apologies to those who find it hard to visualize what I’m talking about.]

So the first acceptance rule suffices for recognizing integers. It may be relevant that the same automaton can be used to recognize finite strings as representing integers: any prefix of the infinite string representing an integer will also be accepted as representing an integer.

The first rule would also suffice to allow us to build a recognizer for certain fractions, e.g. 1/3: the infinite string denoting 1/3 ends up perpetually in the “we’ve just read a 3” state.

On the other hand, it doesn’t suffice for all rationals: in decimal notation, 1/7 has an infinitely repeating sequence of digits (142857, if you were wondering). To distinguish 1/7 in decimal notation we’d need a cycle of six states in the automaton.

All rational numbers have a decimal expansion that eventually settles into an infinite series of repeated strings of digits (if only an infinitely repeating sequence of zeroes). So if we adopt the second rule for defining acceptance of the string, we can say: for every rational number, there is a finite state automaton that recognizes that rational number. And irrationals, which have no repeating sequences, aren’t recognizable by any automaton with finitely many states. (An automaton with infinitely many states might be able to recognize the decimal expansion of a particular irrational number, say π, but it’s hard to know what to do with that information — maybe it’s a reason to say that languages recognizable with an infinite automaton are not necessarily regular.)
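
Here is a sketch (again mine, purely illustrative) of the construction behind that claim: given the non-repeating prefix and the repeating block of a rational’s decimal expansion, build an automaton whose accept states are exactly the states of the repeating cycle. Under the second acceptance rule, the expansion of that rational, and no other infinite digit string, keeps the automaton in accept states from some point on:

    def make_recognizer(prefix, period):
        """Automaton for one eventually-periodic expansion: states 0..len(prefix)-1
        check the non-repeating prefix, the next len(period) states form a cycle,
        and -1 is a non-accepting error sink.  Returns (step, accepting)."""
        n, m = len(prefix), len(period)

        def step(state, ch):
            if state == -1:
                return -1
            if state < n:                           # still checking the prefix
                return state + 1 if ch == prefix[state] else -1
            i = state - n                           # position inside the cycle
            return n + (i + 1) % m if ch == period[i] else -1

        return step, set(range(n, n + m))           # the cycle states are the accept states

    def ends_in_accepting(step, accepting, digits):
        state = 0
        for ch in digits:
            state = step(state, ch)
        return state in accepting

    step, acc = make_recognizer("0.", "142857")                 # 1/7: a six-state cycle
    print(ends_in_accepting(step, acc, "0." + "142857" * 5))    # True: cycling through accept states
    print(ends_in_accepting(step, acc, "0.1415926535"))         # False: a prefix of pi - 3 leaves the cycle

    step3, acc3 = make_recognizer("0.", "3")                    # 1/3: a one-state cycle
    print(ends_in_accepting(step3, acc3, "0." + "3" * 20))      # True

For 1/3 the cycle has a single state, which is the case the first rule already handles; for 1/7 it has six.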

That sounds like a nice pattern. It would be even nicer if we could devise an automaton to recognize the set of decimal expansions of rational numbers, but I suspect that’s not feasible, since the complement of that set is the irrationals, and being able to recognize the one set by regular means would entail being able to recognize the other, too.

Does it make sense to require that the automaton eventually end up spending all its time in accept states? (Or equivalently, that the sequence of states have a suffix in which every element in the suffix is an accept state.)

What if that is too restrictive a rule? What if we said instead

  3. Accept if, at every point in the sequence of states, there are infinitely many accept states among the states following that point.

That is, allow the string to put the automaton into a non-accepting state, as long as it’s only temporary, and it eventually gets back into an accepting state.

Consider an automaton which has two states, A and B. Every time a B is found in the input, we go to state B; for any other symbol we go to state A. B is an accept state.

If we adopt the second story about termination, a string ending in an unending series of Bs will be accepted and is thus recognizable by an automaton. A string with an infinite number of Bs, interspersed with other symbols, will not be accepted by this automaton (nor by any other, as far as I can tell).
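
To make the contrast concrete, here is a small sketch (only suggestive, since neither rule can really be decided from a finite prefix) that runs this two-state automaton over finite prefixes and checks prefix-restricted stand-ins for the second and third rules:

    def trace(prefix):
        """Any 'B' in the input sends the automaton to state B, anything else to A."""
        return ["B" if ch == "B" else "A" for ch in prefix]

    def rule2_on_prefix(states):
        # eventually-always, restricted to a finite trace: it ends in an accept state
        return bool(states) and states[-1] == "B"

    def rule3_on_prefix(states, window=4):
        # infinitely-often, restricted to a finite trace: B still turns up near the end
        return "B" in states[-window:]

    settles = trace("AAAB" + "B" * 20)    # prefix of a string that ends in unending Bs
    recurs  = trace("BA" * 12)            # prefix of BABABA..., B recurs but never settles
    print(rule2_on_prefix(settles), rule3_on_prefix(settles))   # True True
    print(rule2_on_prefix(recurs),  rule3_on_prefix(recurs))    # False True

The prefix that settles into Bs passes both checks; the prefix in which B merely keeps recurring passes only the third, which is the distinction drawn above.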

OK, that seems to establish (if we accept the conjecture about strings with infinitely many Bs) that the second and third rules define distinct sets of languages. I suppose that one chooses to use the second rule, or the third, or some other I haven’t thought of yet, in part based on whether it feels right to count as regular the languages one can recognize using that rule.

Hmm. OK, time to look at the bookshelves.

I’ve just checked and found that John E. Hopcroft and Jeffrey D. Ullman, in Introduction to automata theory, languages, and computation (Reading: Addison-Wesley, 1979), restrict their attention to finite strings.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: a practical guide, second edition (New York: Springer, 2008), don’t explicitly impose this restriction. But a quick scan of their opening pages doesn’t show any explicit consideration of infinite sequences of symbols, either. I’m guessing they do treat infinite input somewhere, if only because if you can contemplate van Wijngaarden grammars, which have infinite numbers of context-free rules (and remember, Grune didn’t just contemplate van Wijngaarden grammars, he wrote a parser generator for them), infinite strings are just not going to frighten you.

I suppose the idea of thinking seriously about infinitely long sentences in a language is one I first encountered in D. Terence Langendoen and Paul Postal, The vastness of natural languages (Oxford: Blackwell, 1984). To whom (for this, as for many other things) thanks!

I’m pretty sure that there was some treatment of infinite automata and/or infinite input strings in S. C. Kleene, “Representation of events in nerve nets and finite automata”, in Automata studies, ed. C. E. Shannon and J. McCarthy (Princeton: PUP, 1956), and V. M. Glushkov, “The abstract theory of automata”, Russian mathematical surveys: a translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk 16 (1961). They are both kind of tough sledding, but I suppose I really ought to go back and read them carefully with an eye to this topic.

Grail for regular languages

[11 December 2009]

Every now and then — not constantly, but recurrently — I experience a strong need to have a running copy of Grail, a software package first written by Derick Wood and Darrell Raymond and described by its documentation as “a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.”

Among other things, Grail is handy for answering questions about the equivalence or non-equivalence of regular expressions, or about subset/superset relations holding between the languages recognized by them. A few years ago, for example, the W3C XML Schema Working Group found itself in possession of two different descriptions of the lexical space of the XSD duration type. The working group wished, not unreasonably, to check that the two really were equivalent.

The first description provided three regular expressions, and said the lexical space of duration included all the strings which matched all three expressions:

  • -?P([0-9]+Y)?([0-9]+M)?([0-9]+D)?(T([0-9]+H)?([0-9]+M)?([0-9]+(.[0-9]+)?S)?)? (strings in which the fields of an ISO 8601 duration appear in the correct order, and in which each field appears only if it has at least one digit present)
  • .*[YMDHS].* (strings in which at least one field is present)
  • [^T]+(T[^HMS]+[HMS].*)? (if the character T appears, it must be followed by one of the time-related fields)

The second description translated the context-free grammar into regular-expression form (I’ve introduced white space for legibility):

-?P(([0-9]+Y)([0-9]+M)?([0-9]+D)?
(T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+(.[0-9]+)?S)))?
|([0-9]+M)([0-9]+D)?
(T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+(.[0-9]+)?S)))?
|([0-9]+D)?
(T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+(.[0-9]+)?S)))?
|T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+M)?([0-9]+(.[0-9]+)?S)?
|([0-9]+(.[0-9]+)?S)))

Easy enough to eyeball, for some people, I guess, but the working group wanted a more reliable method.

After a few hours trying vainly to compile Grail for my Linux box, I found an RPM that worked for me, and in ten minutes or so I had used Grail to establish that the two descriptions are equivalent.
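
For anyone without a working Grail, a much cruder cross-check can at least be scripted. Here is a sketch of my own in Python: not the Grail procedure, and only a sampling test rather than a proof of equivalence, but any string on which the two descriptions, as transcribed above, disagree is exactly the kind of thing worth examining by hand. Since XSD patterns are implicitly anchored, fullmatch is the right analogue of matching here.

    import itertools, random, re

    # First description: a string is a duration if it matches all three expressions.
    THREE = [re.compile(p) for p in (
        r"-?P([0-9]+Y)?([0-9]+M)?([0-9]+D)?(T([0-9]+H)?([0-9]+M)?([0-9]+(.[0-9]+)?S)?)?",
        r".*[YMDHS].*",
        r"[^T]+(T[^HMS]+[HMS].*)?",
    )]

    # Second description: the single large expression (re.VERBOSE ignores the
    # white space introduced for legibility).
    BIG = re.compile(r"""
    -?P(([0-9]+Y)([0-9]+M)?([0-9]+D)?
    (T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+(.[0-9]+)?S)))?
    |([0-9]+M)([0-9]+D)?
    (T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+(.[0-9]+)?S)))?
    |([0-9]+D)?
    (T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+(.[0-9]+)?S)))?
    |T(([0-9]+H)([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+M)?([0-9]+(.[0-9]+)?S)?
    |([0-9]+(.[0-9]+)?S)))
    """, re.VERBOSE)

    def first(s):
        return all(p.fullmatch(s) for p in THREE)

    def second(s):
        return BIG.fullmatch(s) is not None

    ALPHABET = "P1YMDHST.-"              # one digit stands in for all ten
    daylight = []
    for n in range(1, 6):                # every string up to length 5
        for tup in itertools.product(ALPHABET, repeat=n):
            s = "".join(tup)
            if first(s) != second(s):
                daylight.append(s)
    random.seed(0)
    for _ in range(100_000):             # plus a sample of longer strings
        s = "".join(random.choice(ALPHABET) for _ in range(random.randint(6, 15)))
        if first(s) != second(s):
            daylight.append(s)
    print(daylight[:10] if daylight else "no disagreement found on this sample")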

Today I realized that another problem I face could best be solved by using Grail, but I no longer have a Linux box (and have not, in any case, found that old RPM). Grail 2.5 is dated March 1996, and the C++ in which it is written does not seem to please GCC 4.0.1. Grail+ 3.0, a successor project in other hands, may have been touched as recently as 2002 or 2004, but most of the dates appear to be in summer or fall 1998. GCC doesn’t like it, either.

So I have thus far been unable to recompile this very helpful tool.

If anyone out there knows of anyone who has either massaged the source of Grail into a form more like what modern C++ compilers will compile, or found out what combination of compile-time flags will persuade GCC to put itself in a more forgiving frame of mind and compile the thing, please get in touch. (And no, -Wno-deprecated does not suffice to do the trick.)

And any C++ proficients looking for interesting and useful projects to undertake could do a lot worse for themselves and for the world than to bring Grail into the twenty-first century.

Cross training

[11 December 2009]

One of my most pleasant memories is the discovery, some years ago, that while it is often rather boring to copy a program out of a book or off the Web and modify it to suit the needs at hand, it’s a lot more interesting if at the same time you can translate it from one programming language into another.

I discovered this when I needed a sort routine in Spitbol once; one of the Spitbol implementations we were using had a built-in sort, but the other didn’t. I copied a Shell sort, or possibly a Quicksort, from a Pascal or C textbook, and found that the task of translating the algorithm from one language into a different language with rather different control and data structures was much more rewarding and interesting than it had ever been to copy a Pascal or C program from a book into a file on my disk, compile it, and ‘play’ with it. It was non-trivial enough to give me a small feeling of accomplishment, and easy enough (the algorithm was right in front of me, after all, in executable form) not to cause serious troubles.

I suppose that most textbooks have no choice: they have to show you programs in the language they are teaching, and they can’t really assume the reader knows some other language. (When I started to learn Middle High German, the professor asked the participants in the class who had learned Greek in school, so he could know whether citing Greek parallels would be helpful. He remarked with a sigh that it was years since it had made any sense to ask who in the class had Hebrew.) But I learned a lot more about both Spitbol and the sorting algorithm in question when I did that translation than I ever had before.

This topic came to mind this morning because in my continuing work with Alloy and ACL2 I have been trying to rewrite a simple ACL2 example (a function, a few examples, and a couple simple theorems to prove that the function meets its requirements) into a roughly corresponding Alloy model. I think the exercise illuminates both Alloy and ACL2 (more on that later, after I’ve gotten this example to work and maybe done a few more, in both directions).

Similarly, I have learned a lot about Steve Pepper’s Italian Opera topic map by thinking about what a translation into other forms (SQL, Prolog, …) would look like; I expect to learn more about the topic map and about the technologies involved, when I push those translations further.

It’s funny, though: I am not sure I’ve ever seen anyone say explicitly that translating short programs from one language to another can be an interesting and rewarding experience for those learning the languages involved. Does no one else find it as helpful as I do? (Am I insufficiently lazy as a programmer?) Or is it too obvious to merit mention? Or is there just not a good single term for the practice?

Makes me wonder whether the double-translation method (by which Roger Ascham taught Greek to Elizabeth I) could be applied to programming and markup languages?

Daylight analysis

[14 July 2009, Happy Bastille Day]

In interoperability testing, test cases are particularly interesting when they elicit different behaviors from different processors.

When a grammar is revised, strings are of particular interest when they are grammatical according to one version of the grammar, but not according to the other. Either the change was intended (and the string provides an example) or it was not intended (and the string exhibits a problem). Differences in the structure of the parse trees produced by the different grammars may be interesting, too.

Several working groups I’ve served on have spent time worrying about whether our spec’s rules for handling (especially escaping and unescaping) URIs and IRIs should align with the rules specified by a variety of other specs (HTML 4.01, XSLT 1.0, any of the various RFCs which have at various times been the authoritative source of information, any of the various internet drafts which have later turned into, or failed to turn into, RFCs, etc. etc. ad luxuriam). At any given time, it would have been really really useful to have an answer to the question “Do these two different formulations of the rules ever actually produce different results? Or are they just different ways of saying the same thing? And if they do ever produce different results, are the cases involved already so pathological on other grounds that we don’t actually mind?”

These cases have in common that they exhibit discrepancies among things that (other things being equal) are (or might be) expected to be indistinguishable. That is, they document the daylight (which my Oxford dictionary glosses as “visible distance between one … thing and another”) between things that shouldn’t have daylight between them.

I’m coming to believe that during the development of a spec, daylight analysis — seeking and finding instances of daylight between things, or seeking and failing to find any daylight — may be the most important function of test cases. If not the most important, then surely a very important use.

If this conjecture is true, then it ought to have implications for judging the relative effectiveness of different methods for constructing collections of test cases. Traditional testing can be measured by how many bugs it finds, at what cost, and techniques for generating test cases are valued high or low depending on how likely they seem to be to find new bugs. For spec development, the utility of a test is tied to the likelihood of its finding daylight between the old and the new formulations of some rule.

Hmm. Is that a difference or not? I suppose a bug can be regarded as an instance of daylight between the implementation and the spec it’s supposed to be implementing. So perhaps bug-finding is just a special case of daylight analysis.

Of course, each revision of a rule or a grammar provides a new opportunity for daylight to arise and be found. It seems to follow that you may need new test cases for each revision. Automated methods that allow quick generation of relevant new test cases would be particularly useful.

Daylight analysis is not the only useful goal of test generation during spec development. Tests that illustrate the intended behavior of a rule are useful for clarification. Call these exemplary tests.

And for checking that a working group understands a problem area well, and that the rule for that problem is well formulated, it can be useful to construct tests with randomly selected properties, just to check to see that everyone agrees on how the rule should handle them. Call these sanity checks. Random selection of properties in sanity checks helps ensure that you don’t inadvertently feed your rule only ‘sensible’ examples which happen not to exercise its flaws. I once used an Alloy model to make twelve very simple test cases for the schema composition part of XSD; nine of them turned out to pose questions to which the spec’s answer is not obvious. (The twelve were a bit redundant: eliminating the redundancies, the twelve auto-generated examples boiled down to four non-redundant examples, for one of which the spec provides a clear analysis.) If I had limited myself to sensible examples I would almost certainly have missed most or all of the problematic cases.

Daylight analysis as a goal works well with random generation of test cases, because it helps deal with one of the great problems of random generation: most randomly generated test cases (at least, the ones I have been able to generate using random means) are rather boring. The goal of finding daylight provides a simple filter: a randomly generated test case is interesting if it finds daylight in two things you are comparing; it is uninteresting otherwise. In realistic cases, uninteresting test cases will overwhelmingly outnumber interesting ones, but if you can apply an automated filter (parse with grammar G1, parse with grammar G2, compare the results; if they differ, you have found daylight), then you can keep a few uninteresting test cases (just in case) but throw most of them away and focus on interesting ones.
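
A minimal sketch of that filter (with two toy regular expressions of my own invention standing in for grammar G1 and grammar G2; a real comparison would plug in the real grammars or parsers):

    import random, re

    G1 = re.compile(r"a+b*")      # stand-in for the old revision of the grammar (hypothetical)
    G2 = re.compile(r"a+b+")      # stand-in for the new revision (hypothetical)

    def random_case(alphabet="ab", max_len=8):
        return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

    def finds_daylight(s):
        return bool(G1.fullmatch(s)) != bool(G2.fullmatch(s))

    random.seed(0)
    cases = [random_case() for _ in range(10_000)]
    interesting = sorted({s for s in cases if finds_daylight(s)}, key=len)
    print(len(interesting), interesting[:5])   # e.g. strings of a's alone: in G1 but not in G2

Most of the ten thousand random cases are thrown away; the handful that remain are precisely the ones that document daylight between the two formulations.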

Formal methods and WGs (response to Jacek Kopecky)

[30 June 2009]

Jacek Kopecky has commented on my earlier posting about Brittleness, regression testing, and formal methods. I started to reply in a further comment, but my response has gotten a bit long, so I’m making a separate post out of it.

Examples needed

Jacek writes:

Dear Michael, you are raising a good point, but you’re doing it the same way as the “formal methods proponents”: you don’t show concrete examples.

Dear Jacek, you are quite correct: I don’t have persuasive examples. I have a gut feeling that formal methods would be helpful in spec development, but I can’t point to convincing examples of places it has helped. Having really convincing examples seems to require first persuading working groups to use formal methods, which is (as you suggest) not easy in the absence of good examples. It’s also a hard sell if the spec in question is at all far along in its development, because it appears that the first thing the group has to do is stop everything else to build a formal model of the current draft. No working group is going to want to do that.

I did embark once on an effort to model the XProc spec in Alloy (also in HTML), in an attempt to capture a spec which was still being worked on, and perhaps persuade the WG to apply formal methods in its work going forward. The example seems to illustrate both the up and the down of using formal methods on a spec. On the up side: the work I did raised a number of questions about the properties of pipeline ‘components’ as they were defined in the then-current draft and may have helped persuade some in the WG to support the proposal to eliminate the separate ‘component’ level. But on the down side: the draft was moving faster than I could move: before I was able to finish the model, submit the paper formally to the WG, and propose that the WG use the model as part of our on-going design process, the draft had been revised in such a way as to make much of the model inaccurate. No one in a WG really wants to consider proposals based on an outdated draft, so no one really wanted to look at or comment on the fragmentary model I had produced. I could not keep up, and eventually abandoned the attempt. Perhaps someone with better modeling skills than mine would have been able to find a way to move faster, or would have found a way to make a lighter-weight model which would be easier to update and keep in synch with the spec. But one possible lesson is: if the group doesn’t use formalisms from the very beginning, it may be very hard to adopt them in mid-process.

Starting small

Other lessons are possible. Kaufmann, Manolios, and Moore write, in Computer-aided reasoning: ACL2 case studies ([Austin: n.p.], 2002), p. 13:

The formalizers will often be struggling with several issues at once: understanding the informal descriptions of the product, discovering and formalizing the relevant knowledge that the implementors take for granted, and formalizing the design and its specification. To a large extent this phenomenon is caused by the fact that formal methods [are] only now being injected into industry. Once a significant portion of a group’s past projects has been formalized, along with the then-common knowledge, it will be much easier to keep up. But at this moment in history, keeping up during the earliest phases of a project can be quite stressful.

When working on a new project, we recommend that the formalizers start by formalizing the simplest imaginable model, e.g., the instruction set with one data operation and a branch instruction, or the protocol with a simple handshake. Choosing this initial model requires some experience…. Since iteration is often required to get things to fit together properly, it is crucial that this initial foray into the unknown be done with a small enough model to permit complete understanding and rapid, radical revision.

In my effort to model XProc, I failed to find a suitably small initial model. (As they say, it takes some experience, which is what I was trying to gain.) But then, how persuasive would a really small toy model be to a working group skeptical of formal methods in the first place? I don’t yet know how to square this circle.

I was more successful in finding useful small partial models when I used Alloy in 2007 to model schema composition. In that case I was able to generate a number of extremely useful test cases, but the exercise, sadly, showed merely that the XML Schema WG does not have any consensus on the meaning of the XSD spec’s description of schema composition. Being able to generate those examples during the original design phase might conceivably have led to a better design. At least, I like to think so. A skeptic might say it would only have led the working group to introduce more special cases and ad hoc rules, in order to avoid having to explain how certain cases are supposed to work. (This is, after all, the XML Schema WG we’re talking about here.) Nothing is perfect.

The tedium of formalization

JK continues:

I, for one, avoid formal methods other than programmatic reference implementation for a few reasons: the tediousness of translating the spec into the formalism that would support all those tests; creating all those tests; and finally the difficulty of being relatively sure that what is written in the formalism actually corresponds to the intent of what is written in the text.

I could change my mind if it was shown that enough formalization for XML Schema that would support your regression tests is not all that tedious.

Here JK makes several excellent points, to which I do not currently have good answers.

On tedium: It’s possible to model a spec selectively (as in my model of schema composition) with relatively little tedium. Indeed, the point of light-weight formal methods (as promoted by Daniel Jackson and embodied in Alloy) is to allow the user to construct relatively small and simple models which cover just the aspects of the design one is currently worried about. There is no fixed foundational set of primitives in which models must be described; the model can specify its own primitives.

For example, I was able to model some simple aspects of XSD 1.0 schema composition without having to model all of schema validation, and without modeling all the details of XML well-formedness. If the paper on that model does not seem especially light-weight, I think it’s due to the need to specify how to inspect the behavior of implementations unambiguously.

I have to confess, though, that it’s not always obvious to me how best to construct such partial models. For a long time I avoided trying to model any part of XSD because I thought I was likely to have to model the Infoset and XML in full, first. I’d like to do that, someday, just for the satisfaction, but I don’t expect it to be quick. It was only after a longish while that I saw a way to make a simpler partial model. The more accurately I can identify some part of the design that I’m worrying about, the easier it is to find a partial model for that part of the design.

But in the earlier post, I was thinking about a kind of regression testing which would be intended to identify problems and interactions I was not consciously worried about. For that to work, I expect I would need a model that covered pretty much all of the salient properties of the entire spec. And at the moment, I cannot say I expect the construction of such a model to be entirely free of tedium. If I think that it might nevertheless be a good idea, it’s because checking natural-language prose for consistency is also tedious (which is why working groups sometimes do not do it well, or at all).

Consistency of prose and formalism

JK goes on:

In other words, if you have one model (the spec in English), it will have dark corners that need discovering and clearing up. That will have a cost. If you have two models (the spec in English and the formalism), in my experience both may have dark corners (but one model’s dark corner may be clarified by the other model where it’s not a dark corner, so that may be a net plus), but there’s also the consistency of the two models that comes in question. So, gimme examples, show me that the consistency is not a problem, please, and I’ll be very grateful indeed.

I read somewhere once (in a source I have not managed to find again) that some international standards bodies require, or recommend, that their working groups produce both the English and the French version of their specifications, rather than working monolingually and then having a normative translation made. Why? Because working monolingually left too many dark corners in the text; working on both versions simultaneously led to clearer texts in both languages. The initial process was slower, but the additional effort was paid back by better results with lower maintenance costs.

I speculate that working both in English and in a suitable formalism like Alloy or ACL2 would have a similar effect: initial progress would feel slower, but the results would be better and maintenance would be easier. (This is a lot like having test cases in software development: it seems to slow you down at first, but it speeds things up later.)

Consistency between the two formulations should not, in principle, be any greater problem than consistency between the natural-language formulation of different parts of a complex spec. In fact, it should (note that modal verb!) be less of a problem: natural language provides only so much help with consistency checking, whereas formalisms tend to be a bit better on that score. The only really complex spec I have ever seen with no inconsistencies I could detect was ISO 8879, whose editor was a lawyer skilled in the drafting of contracts.

There are, of course, things a working group can do to make it easier, or harder, to maintain consistency between the prose and the formalism. Putting the formal treatment in a separate document (like the XPath formal semantics) or a separate part of the same document (like the schema for schema documents in XSD) is a good way to make it easier for inconsistencies to remain undetected. Integrating fragments of the formalism with the prose describing those constructs (as is done with the BNF in the XML spec, or as is done in any good literate program) makes it easier to detect and remove inconsistencies. It also produces pressure on the working group to use a legible formalism; the XML form of XSD might be more readable if the working group had forced itself to read it more often instead of hiding it in the appendix to the spec.

Integrating the formalism into the text flow of the spec itself can work only if the members of the working group are willing to learn and use the formalism. That’s why I attach such importance to finding and using simple notations and light-weight methods.

But this long post is just more speculation on my part. You are right that examples are needed. Someday, I hope to oblige.