Brittleness, regression testing, and formal methods

[17 June 2009]

For some years now I have thought that the development of specs at places like W3C would benefit from the application of formal methods. Examples from the past few years illustrate (a) that I’m not alone in this thought, and (b) that it’s difficult to persuade a WG as a whole to adopt formal methods. Some WGs have published documents with formalizations of their spec in Z or in Gentzen calculus or in some other notation, but as far as I can tell most members of those WGs did not read those documents, let alone understand them, participate actively in their drafting and revision, or use the formalism as a way of working on or thinking about the design of the main spec. (And if I have understood what most proponents of formal methods have in mind, it’s really the last idea that some people regard as the main goal.) In most WGs I’ve been in, even the maintenance of a BNF in the spec, or the formal definition of an XML vocabulary, ends up being handled by a small minority of the group.

Part of the problem is the cost/benefit analysis. Most people in WGs don’t have any real grasp of or skill with any tool in the formal-methods toolbox.

[Enrique nudged my elbow here and said “Speak for yourself!” “Oh, I do,” I said. I use Alloy and enjoy it, but my command of Alloy is very weak compared to my command of most of the programming languages I use. And ACL2? Otter? HOL? “Ha,” muttered Enrique. “In your dreams.”]

And many formal tools look rather forbidding at first glance. And second glance.

[“How about, tenth glance?” said Enrique. “You have, what, ten books on Z on your shelf —” [yep, I just counted] “— quick, what does an arrow with a double head, no tail, and a cross mean?” “Er,” I said. “Come on, smart guy! Partial function? Total function? Total injection? Partial injection? Fuel injection? Which?” “Oh, hush.” (For the record, double head, no tail, and cross mean it’s a partial surjection. “Surjection?” asked Enrique. “What’s that?” “I can’t tell you; it’s a secret.” “A secret?” “I mean, look it up yourself.”)]

So a WG perceives a real cost, including a possibly steep learning curve and the real chance of appearing ignorant and foolish in front of colleagues. Alloy does its best to address this problem with (a) a non-forbidding syntax and (b) some chances of very fast payoff, so the learning curve is really not so steep. But it’s still a learning curve.

And the benefit? What benefits will accrue if my WG uses formal methods?

This is one place where I think formal methods proponents could do better. They always show you nifty uses of the tool to prove correctness of a design; it might be more persuasive if they showed you (a) a design plausible enough to make you say “yeah, that looks all right” and then (b) a failure to prove it correct, because of a flaw in the design, followed by (c) a fix and (d) a proof of correctness for the repaired design.

Here, too, Alloy does a good job: Daniel Jackson’s book includes the claim “Transitive closure is not axiomatizable in first-order logic”, with an exercise that involves first trying to axiomatize transitive closure in Alloy [which is first-order] and then using Alloy to find the flaw. “Now execute the command, examine the counterexample, and explain what the bug is. The official definition of UML 1.0 had this problem.” I’d like more examples like that!

This morning I found myself once more wishing, for a very specific reason, that I had a good formal model of XSD 1.1. It might or might not persuade an entire WG, but it certainly illustrates the kind of situation where I think WGs might find formal methods to their advantage.

What happened is that Tony Coates raised an issue against XSD 1.1 relating to all-groups and named model groups. The details are not important for this discussion, but the upshot is simple: (1) I think Tony has identified a real and unintended flaw in XSD, (2) I think there is a relatively straightforward fix for the flaw, and (3) I am deeply frightened by the prospect of changing the spec to include that fix now, at this stage of its development. Why? Because I can see too clearly the possibility that in making what looks like a straightforward change we might break something else which depends on what we are changing but which we overlook.

[“You wouldn’t be speaking from experience here, would you?” sneered Enrique, digging his elbow hard into my ribs. Let’s just say that there are several corrections to XSD 1.0 which themselves introduced errors. And yes, I’m embarrassed by them. But it’s not just XSD: the phenomenon is also familiar from programming. “Yeah, yeah,” said Enrique. “Spread the blame. It’s not so bad, because everybody else does it, too? Haven’t you ever heard of the Categorical Imperative?” “I told you to hush! If you can’t be quiet, you’ll have to wait outside.”]

When you get to the point where you don’t want to fix a problem because you’re afraid you’ll just break something else, then you are well on the way to paralysis.

For software, one answer to incipient paralysis is to have a good suite of regression tests. Make the change, run the regression tests, and if all the tests pass, you can feel more confident that you didn’t break things.

The reason I am wishing, today, that I had a good formal model of XSD is that formal models can serve as a kind of regression test suite for prose specs. You have a model, you’ve proven that the spec has this and that property, and that operation X on example Y produces result Z. Now you’re considering a change to the design. Make the corresponding change to the model, and check: does it still have this and that property? Does X(Y) still evaluate to Z? Mechanized systems can make it much easier to check that specified properties of a design are stable in the face of a given change. In Alloy, I find it helpful to re-check all assertions after a significant change to the model (or at least the assertions I think might be affected); Alloy makes that easy. In ACL2, similarly, you can have a set of theorems with proof scripts (including a theorem that proves that the value of X(Y) is Z) and re-run them automatically after changes.
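The analogy with regression testing can be made concrete. Here is a minimal Python sketch of the workflow, in which the “model” and its “properties” are hypothetical stand-ins of my own (nothing here comes from XSD):

```python
# Hypothetical "design" under constant evolution: a tiny model function.
def union(a, b):
    return a | b

# Properties we have established about the design, kept as re-runnable checks.
def prop_commutative():
    assert union({1}, {2}) == union({2}, {1})

def prop_identity():
    assert union({1, 2}, set()) == {1, 2}

PROPERTIES = [prop_commutative, prop_identity]

def recheck():
    """After any change to the model, re-prove (here: re-check) everything."""
    for prop in PROPERTIES:
        prop()
    return len(PROPERTIES)
```

If a revision to union broke either property, recheck() would fail at once, which is exactly the reassurance a regression suite gives.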

I begin to suspect that a key advantage of formal methods is not just being able to prove that a design has certain properties, but being able to re-prove that the design has those properties, again and again, each time you revise it. So you can make sure that changing this little bit over here didn’t suddenly cause a catastrophic failure in that bit over there.

As Matt Kaufmann, Panagiotis Manolios, and J. Strother Moore put it in the introduction to Computer-aided reasoning: ACL2 case studies ([n.p.: n.p.], 2002), describing a project at Motorola:

Because the design was under constant evolution during the period, the formal models were also under constant evolution and “the” equivalence theorem [that is, a key result in establishing that the microcode worked correctly] was proved many times. This highlights the advantage of developing general proof strategies…. It also highlights the utility of having a good inference engine: minor changes in the theorem being proved do not necessarily disrupt the proof replay.

If formal methods can help us design robust specs, they will be a huge help. But they can be helpful even in working with brittle specs, the kind where changing one thing here can easily break something over there. When we work with such a spec, formal methods can help ensure that we know at once if a change breaks something, so we can avoid making that change in the normative version of the spec. (I’m pretty sure that this was one advantage of the work the CICS group at IBM’s Hursley Laboratory did when they specified parts of CICS in Z. It will have made it a bit easier to contemplate the possibility of changes to the interface if you could prove that the changes would retain the properties you cared most about keeping.)

Of course, if you use formal methods a lot, you can hope that you will learn how to make your designs more robust and cleaner. That’s part of the selling proposition for Alloy, at least. But even for those of us whose designs are sometimes, ah, not as robust as we would like, a formalization of an existing design can be a help in maintenance.

Dumbing grammars down, cont’d

[30 May 2009; diagrams added 31 May 2009]

Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.

I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)

So I can now give a slightly simpler account of the FSA construction.

  1. Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
  2. For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
  3. For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
  4. Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
  5. The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.
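For the curious, the five steps can be rendered in executable form. The sketch below is my own Python rendering (the grammar representation, the state encoding as pairs of rule number and position, and the helper names are all my assumptions, not part of the description above); λ-arcs are labeled None:

```python
# Build the epsilon-NFA whose states are the positions in the grammar's rules.
def build_fsa(rules, start):
    nonterminals = {lhs for lhs, _ in rules}
    arcs = []                                # (state, label, state); None = lambda
    first, final = {}, {}                    # nonterminal -> first/final positions
    # Step 1: states are (rule index, position); a rule of length n has n+1.
    for i, (lhs, rhs) in enumerate(rules):
        first.setdefault(lhs, []).append((i, 0))
        final.setdefault(lhs, []).append((i, len(rhs)))
    for i, (lhs, rhs) in enumerate(rules):
        for p, sym in enumerate(rhs):
            if sym in nonterminals:
                # Step 3: lambda arcs into each rule for sym, and back out.
                for q in first[sym]:
                    arcs.append(((i, p), None, q))
                for q in final[sym]:
                    arcs.append((q, None, (i, p + 1)))
            else:
                # Step 2: terminal arc to the next position in the same rule.
                arcs.append(((i, p), sym, (i, p + 1)))
    # Step 4: fresh start state; step 5: accepting states.
    for q in first[start]:
        arcs.append(("START", None, q))
    return "START", arcs, set(final[start])

def accepts(fsa, s):
    start, arcs, accepting = fsa
    def closure(states):                     # follow lambda arcs to a fixed point
        todo, seen = list(states), set(states)
        while todo:
            q = todo.pop()
            for a, lbl, b in arcs:
                if a == q and lbl is None and b not in seen:
                    seen.add(b); todo.append(b)
        return seen
    current = closure({start})
    for ch in s:
        current = closure({b for a, lbl, b in arcs if a in current and lbl == ch})
    return bool(current & accepting)
```

Applied to the bracket grammar discussed below, build_fsa yields an automaton that accepts every sentence of the language, and also mismatched strings such as “(a]”.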

I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.

The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that any sufficiently long sentence s in the language can be partitioned into subsequences u, v, w, x, and y (with v and x not both empty), such that s is the concatenation uvwxy, and such that for every non-negative integer n, the string uvⁿwxⁿy is also in the language. The automaton recognizes the slightly larger set of sentences uvⁿwxᵐy for non-negative integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.

To take a very simple example, consider the grammar:

  • s ::= ‘a’
  • s ::= ‘(’ s ‘)’
  • s ::= ‘[’ s ‘]’

The automaton has ten states, which can be represented in the style of Earley items:

  1. s ::= · ‘a’
  2. s ::= ‘a’ ·
  3. s ::= · ‘(’ s ‘)’
  4. s ::= ‘(’ · s ‘)’
  5. s ::= ‘(’ s · ‘)’
  6. s ::= ‘(’ s ‘)’ ·
  7. s ::= · ‘[’ s ‘]’
  8. s ::= ‘[’ · s ‘]’
  9. s ::= ‘[’ s · ‘]’
  10. s ::= ‘[’ s ‘]’ ·

Plus a new starting state I’ll call 0.

Arcs:

  • 0: λ → 1
  • 0: λ → 3
  • 0: λ → 7
  • 1: “a” → 2
  • 2: λ → 5
  • 2: λ → 9
  • 3: “(” → 4
  • 4: λ → 1
  • 4: λ → 3
  • 4: λ → 7
  • 5: “)” → 6
  • 6: λ → 5
  • 6: λ → 9
  • 7: “[” → 8
  • 8: λ → 1
  • 8: λ → 3
  • 8: λ → 7
  • 9: “]” → 10
  • 10: λ → 5
  • 10: λ → 9

Accepting states are 2, 6, and 10.
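Transcribed directly into Python, the arc list above can be run as an epsilon-NFA to check the behaviour claimed in the text (the closure helper is my own scaffolding; states are the integers 0–10, and None labels λ-arcs):

```python
# The arc list from the text, transcribed so the automaton can be executed.
ARCS = [
    (0, None, 1), (0, None, 3), (0, None, 7),
    (1, "a", 2),
    (2, None, 5), (2, None, 9),
    (3, "(", 4),
    (4, None, 1), (4, None, 3), (4, None, 7),
    (5, ")", 6),
    (6, None, 5), (6, None, 9),
    (7, "[", 8),
    (8, None, 1), (8, None, 3), (8, None, 7),
    (9, "]", 10),
    (10, None, 5), (10, None, 9),
]
ACCEPTING = {2, 6, 10}

def run(s):
    def closure(states):
        states, changed = set(states), True
        while changed:
            changed = False
            for a, lbl, b in ARCS:
                if lbl is None and a in states and b not in states:
                    states.add(b); changed = True
        return states
    current = closure({0})
    for ch in s:
        current = closure({b for a, lbl, b in ARCS if a in current and lbl == ch})
    return bool(current & ACCEPTING)
```

Running it confirms that “(a)” and “[(a)]” are accepted, that the mismatched “(a]” is accepted too, and that “()” (no ‘a’) is rejected.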

Or in pictures:

Drawing of the finite-state automaton described in the text, with circles for the states and arrows for the transitions.

Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:

  • 1: “a” → 2
  • 1: “(” → 1
  • 1: “[” → 1
  • 2: “)” → 2
  • 2: “]” → 2

Drawing of the two-state finite-state automaton described in the text, with circles for the states and arrows for the transitions.

The regular expression “[[(]*a[])]*” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.
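A quick way to convince yourself of both halves of that claim is to run the expression (here with Python’s re module) against sentences of the original language and against strings with mismatched brackets:

```python
# The compact regular expression from the text, exercised against examples.
import re

SUPERSET = re.compile(r"[[(]*a[])]*")

# Every sentence of the original bracket language matches ...
for s in ["a", "(a)", "[a]", "([a])", "[[(a)]]"]:
    assert SUPERSET.fullmatch(s)

# ... and so do strings whose brackets mismatch in number or kind,
# which is exactly the superset behaviour described above.
assert SUPERSET.fullmatch("(a]")
assert SUPERSET.fullmatch("((a)")

# But the ordering constraint survives: no opening bracket after the 'a'.
assert not SUPERSET.fullmatch("a(")
```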

Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):

  • Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
  • Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
  • The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
  • Prove that the language recognized by the FSA is essentially the language uvⁿwxᵐy for every uvwxy decomposition of sentences in the original language.
  • Can the largest practicable regular sublanguage be constructed using this FSA?
  • Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?

Managing to disagree

[30 March 2009]

For some reason, lately I’ve found an old remark of Allen Renear’s running through my head.

“We can disagree about many things; but can we disagree about everything?

“Or would that be like positing the existence of an airline so small, it has no nonstop flights?”

[Memory tells me that he said this at a meeting of the Society for Textual Scholarship; Google, aided by Robin Cover’s Cover Pages, tells me that it was in April 1995.]

It’s on my mind, perhaps, because with Claus Huitfeldt and Yves Marcoux I’ve been doing some work on a formal model of transcription, and when we have examined how multiple divergent transcriptions of the same exemplar look in our model, it has proven much harder than I would have thought possible to make the transcriptions actually contradict one another. (More on this in another post, perhaps.)

Consistency checking in prose

[26 March 2009]

Twice in the last couple of days I’ve run into documents with consistency problems. So I’ve been thinking about consistency checking in prose as a challenge.

The web site for a large organization has, in the About Us section of the site, a side bar saying so-and-so many employees in so-and-so many countries. And one of the documents within the About Us section talked about the organization’s efforts to be a good corporate citizen and employer at all of its so-and-so many locations. If you are in n locations in m countries, though, shouldn’t n be greater than or equal to m?
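If such claims were recorded as structured data rather than buried in prose, the rule could be checked mechanically. A minimal sketch, with entirely hypothetical document names and figures:

```python
# Hypothetical structured "claims" extracted from two pages of a site.
claims = {
    "about-us-sidebar": {"countries": 25},
    "corporate-citizen-page": {"locations": 19},
}

def inconsistencies(claims):
    """Return a message for each cross-document rule the claims violate."""
    problems = []
    n = claims["corporate-citizen-page"]["locations"]
    m = claims["about-us-sidebar"]["countries"]
    if n < m:  # you cannot be in more countries than locations
        problems.append(f"{n} locations but {m} countries")
    return problems
```

Run on the claims above, this flags the sidebar/page pair; had both pages been updated together, the list would come back empty.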

The other example was documentation for a specialized XML vocabulary which included a summary of gaps in the vocabulary’s coverage and shortcomings in the design. The main gap, said the documentation, was that “the vocabulary offers no solution to the problem of XYZ”. But the vocabulary does offer a solution to that problem: the revision of the vocabulary to deal with problem XYZ is described with some pride two or three sections further up in the document.

One may speculate that in both cases, a perfectly true statement in the document was rendered false by later events, and statements added later to the document, reflecting the later state of affairs, contradict the earlier statements. (There was a gap in the vocabulary, and the documentation mentioned it as a potentially painful one. Eventually it was painful enough to be filled. And the tag-by-tag account of the markup was even revised to document the new constructs. But the description of gaps and shortcomings was not revised. And it’s not hard to believe that an organization may be in m locations at one point, and in a larger number of locations, in n countries, later on.)

In neither of these cases is the contradiction particularly embarrassing or problematic.

[“But I notice you name no names,” said Enrique. “Chicken.” “Hush,” I said. “The names are not relevant.”]

But of course the same problem happens in specs, where inconsistencies may have graver consequences.

[“Ha! I can think of some names under this rubric!” crowed Enrique. “Shut up!” I explained. “I’m trying to describe and understand the problem, not confess my sins.” “Oh, go on! Admitting you have a problem is the first step towards recovery. If you don’t admit that you have the problem, you’ll never — what the ?!” At this point, I managed to get the duct tape over his mouth.]

I think there must be two basic approaches to trying to avoid inconsistencies.

(1) You can try to prevent them arising at all.

(2) You can try to make them easier to detect automatically, so that an automated process can review a set of documents and flag passages that need attention.

Neither of these seems to be easy to do. But for both of them, it’s not hard to think of techniques that can help. And thinking about any kind of information system, whether it’s an XML vocabulary or a database management system or a programming language or a Web site content management system, or a complicated combination of the above, we can usefully ask ourselves:

How could we make it easier to prevent inconsistency in a set of documents?

How could we make it easier to keep a set of documents in synch with each other as they change?

How could we document the information dependencies between documents at a useful level of granularity? (Helpful, perhaps, to say “if document X changes, documents Y and Z, which depend on it, must be checked to see if they need corresponding revisions”, but a lot more helpful if you can say which sections in Y and Z depend on which bits of X.) Could we do it automatically?

It seems plausible that detecting inconsistencies between distant parts of a document would be easier if we could get a list of (some of) the entailments of each bit of a document.

How can we make it easier to document the entailments of a particular passage in a spec?

For making the entailments of a passage explicit (and thus amenable to mechanical consistency checking with other parts of the document set) I think there are several obvious candidates: RDF, Topic Maps, RDFa, the work Sam Hunting has been doing with embedded topic maps (see for example his report at Balisage 2008), colloquial XML structures designed for the specific system in question. Years ago, José Carlos Ramalho and colleagues were talking about semantic validation of document content; they must have had something to say about this too. (In the DTD, I recall, they used processing instructions.) Even standard indexing tools may be relevant.

How do these candidates compare? Are they all essentially equally expressive? Does one or the other make it easier to annotate the document? Is one or the other easier to process?

[“If you don’t admit that you have the problem, you’ll never be able to fix it. And you keep that roll of duct tape away from me, you hear?”]

One reason I love Alloy (Goodman, calculus of individuals)

[13 February 2009]

In the last couple of years I have become rather fond of a tool called Alloy. I’ve been talking to a couple of people about it lately, and this week I had an extremely enlightening experience with Alloy, so it occurred to me it might be useful to give a simple example of its use here. (This has gotten longer than expected, because some context is needed to understand the problem I had this week and how Alloy helped me solve it. Reader, persevere!)

First, some background.

Alloy is intended for use in what its developers call ‘light-weight formal methods’. It can be used as a sort of desk-checker for software designs. Alloy allows you to describe a design in a simple notation for symbolic logic, define predicates which hold either of particular objects in the universe, or of the universe as a whole, make assertions about the design, and so on. So far, it’s a lot like the specification language Z, only it doesn’t use Greek symbols and it’s much less intimidating to programmers who turn off when things look too formal or mathy. But one nice feature of Alloy not shared by Z is that Alloy can be used to find instances of predicates you define, or counter-examples to assertions you make. It doesn’t actually prove your assertions true, but it does an exhaustive check on all possible instances of the model you’ve defined, up to a user-specified size, and if it finds a counter-example, it will show it to you.

So if there is a problem in your design which can be illustrated by an example small enough to fit into the scope you specify, Alloy will find it.

Not all bugs have small test cases that exercise them. But a whole lot of them do. And Alloy’s clever exploitation of current SAT-solver technology means you can find those bugs really easily.

Now, my former W3C colleague Dan Connolly pointed out to me a long, long time ago that one can use specification languages like Alloy not just to check software designs, but to test our understanding of things we read. Dan told me once that when he really wanted to understand something, he tried to model it using a specification language and prove at least a few trivial theorems about it. As an example, he showed me a formalization he had done of part of a paper by Murata Makoto, using Larch (and in particular the Larch Prover LP) for this purpose; nowadays I think he uses ACL2 for this kind of thing. (I like LP and ACL2, too, but I find them hard to use, which means I don’t use them very often. And I’m not sure how many theorems Dan proves in a year using these techniques.)

The other day, Claus Huitfeldt and I were discussing Nelson Goodman’s account of properties in his book The structure of appearance. Some properties, Goodman points out, have the property that if they are true of a thing, they are true of every part of that thing. (The property “is located in Indiana” is an example.) Other properties have other (second-order) properties; Goodman identifies several properties of properties that he finds useful. Claus and I worked through the definitions of dissective, expansive, collective, and nucleative properties, translating them into symbols and trying to think of examples, and we were just starting on the properties of two-place predicates when I thought it might be useful to check some of our conjectures by formalizing things a little more fully.

This involved backing up a bit, because in defining the terms dissective, expansive, etc., Goodman appeals to notions like “is part of”, “is the sum of”, and “is the product of”. These, in turn, are secondary concepts defined in terms of a single primitive, “overlaps”. (This alone would probably have sufficed to hook Claus and me.) So we set out to describe, in Alloy, the notion of overlap in Goodman’s calculus of individuals.

My first cut was simple. An individual is anything a particular system wants to treat as an individual; OO programmers would call it “object” not “individual”, but Goodman antedates OO programming. Overlap is a relation among individuals, and relations among things are specified in Alloy by naming them in the description of the first member of the ordered pairs. So I wrote

sig Individual {
overlaps : set Individual
}

This defines a signature or set of objects, named Individual. Each individual has a field (or property) named overlaps, the value of which is a set of individuals. Another equivalent way to describe it is to say that there is a relation overlaps the members of which are ordered pairs of individuals.

Looking at instances of this model (using the Alloy Analyzer, you add the instruction run {} and ask that it be executed) showed that this formulation had an intuitive drawback: for two individuals a and b, a could overlap b without b overlapping a. That didn’t seem right, so I added a constraint: if b is in the overlaps set of a, then a is in the overlaps set of b. (The “@” sign is a syntactic artifact; don’t worry about it.)

sig Individual {
overlaps: set Individual
}{
all i : Individual | i in overlaps iff this in i.@overlaps
}

The second pair of braces contains a set of constraints (in this case, just one) that hold of every member of the signature. This constraint has the desired effect of guaranteeing that if a.overlaps contains b, then b.overlaps contains a.

To avoid having to say awkward things like b in a.overlaps, we can define a predicate that allows us to say simply overlap[a,b]. (For technical reasons, predicate arguments in Alloy 4 are enclosed in square brackets, not parentheses. You’ll get used to it.)

pred overlap[x, y : Individual] {
x in y.overlaps or y in x.overlaps
}

[Enrique piped up here, to say “The constraint in the declaration of Individual guarantees that if y in x.overlaps then x in y.overlaps; you can delete either half of the or without losing anything except unnecessary complexity.” I was doubtful, but to shut him up, I wrote

pred overlap2[x, y : Individual] {
x in y.overlaps
}
assert defs_equivalent {
all x, y : Individual | overlap[x,y] iff overlap2[x,y]
}
check defs_equivalent for 5

The Analyzer found no counter-examples to the assertion. This isn’t exactly a proof, but it persuaded me that Enrique was right.]

Now, Goodman takes overlap as a primitive notion, but he provides a sort of formal definition by saying that individuals x and y overlap if and only if there is some individual z, such that anything that overlaps z also overlaps x and y. Intuitively, z can be any individual which is part of both x and y. (You can think of x and y as sets, and of z as any individual in their intersection, but only if you don’t mind hearing howls of outrage from the general direction of Goodman’s grave: in defining the calculus of individuals, he was trying to provide an alternative to set theory for those who like himself preferred to avoid having to assume that sets exist.)

We translated this into Alloy thus:

fact overlap_axiom {
all x, y : Individual |
overlap[x,y] iff
(some z : Individual |
(all w : Individual |
overlap[w,z] => (overlap[w,x] and overlap[w,y])))
}

Now, it seems clear from our intuitive understanding of the nature of overlap that in a plausible formalization the overlap predicate has to be both symmetric (overlap[a,b] iff overlap[b,a]) and reflexive (everything overlaps itself).

We added appropriate assertions to the model and checked them. It seemed clear that the symmetry must follow from what we had said so far, and I was not surprised to see that it did: in every universe with ten Individuals or fewer, the symmetry of the overlap relation holds. It seemed equally clear that based on our model so far, overlap would not (yet) be reflexive, and we would need to make it so by adding another fact asserting the reflexivity of overlap as an axiom.

So I admit I was surprised when Alloy found no examples of the model which violated the assertion that overlap is reflexive.

I have spent several odd hours, over the last couple of days, trying to understand this result. I assumed it was a typo on my part, some error in my definition of overlap_axiom. Perhaps the scoping of the operators was wrong? But no, I can never remember operator priorities reliably, so the version I was testing was already fully parenthesized. Perhaps some weirdness of the syntax rules was tripping me up?

Finally, I sat down and wrote out a description of a universe in which all the axioms of the Alloy model held, in which overlap was not reflexive. After two failed attempts, I realized that reflexivity follows necessarily from Goodman’s characterization of overlap (i.e. from overlap_axiom). (Hint: in a universe with only one individual a, bind x, y, and z to a and try to make overlap[a,a] false. You won’t. Q.E.D.)
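Readers without Alloy to hand can replay the check by brute force. The sketch below (my own Python construction, not Alloy output) enumerates every binary relation on a small universe, keeps those satisfying Goodman’s overlap axiom, and confirms that each survivor is reflexive and symmetric:

```python
# Goodman's axiom: overlap[x,y] iff there is some z such that
# everything overlapping z overlaps both x and y.
def satisfies_axiom(R, dom):
    return all(
        ((x, y) in R) == any(
            all((w, z) not in R or ((w, x) in R and (w, y) in R)
                for w in dom)
            for z in dom)
        for x in dom for y in dom)

def models(n):
    """Yield every relation on an n-element universe satisfying the axiom."""
    dom = range(n)
    pairs = [(x, y) for x in dom for y in dom]
    for bits in range(1 << len(pairs)):
        R = {p for i, p in enumerate(pairs) if bits >> i & 1}
        if satisfies_axiom(R, dom):
            yield R

def reflexive(R, n):
    return all((x, x) in R for x in range(n))

def symmetric(R):
    return all((y, x) in R for (x, y) in R)
```

On universes of up to three individuals, every model of the axiom turns out to be reflexive and symmetric, with no added fact needed; note in particular that the empty relation fails the axiom, so it cannot serve as a counterexample to reflexivity.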

Goodman’s axiom guarantees the symmetry of overlap, too. So we can lose the constraint in the definition of the Individual signature.

What I like about this experience is not so much having spent two hours or so looking fruitlessly for a syntax error in my formulation of overlap_axiom; that I could have done without. What I really like is that formalizing Goodman’s concept of overlap in Alloy surprised me by exposing a bug in my understanding. I was convinced that the reflexivity of overlap, as Goodman described it, was a hidden premise smuggled in from our extra-logical understanding of the English word “overlap”. At the cost of about ten minutes’ work with Alloy, I discovered that my conviction was misplaced. It did take me several more hours of thought, over a couple of days, to understand why reflexivity follows from Goodman’s characterization, but that really illustrates a shortcoming in my central nervous system, not one in Alloy.

Any tool that can so effortlessly expose misconceptions about fundamental concepts and thus help clarify our thinking about a design or a theory is a tremendous aid. Why isn’t every Working Group in the W3C and every other standards-development organization working in information technology writing Alloy models to check their understanding of their designs?

Finally, some links: If I have succeeded in making Alloy sound worth looking at, you’ll want to know that Alloy was developed by the Software Design Group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) under the leadership of Daniel Jackson. The Alloy site has tutorials and other ancillary information, as well as the current downloadable version of the Alloy Analyzer.