Consistency checking in prose

[26 March 2009]

Twice in the last couple of days I’ve run into documents with consistency problems. So I’ve been thinking about consistency checking in prose as a challenge.

The web site for a large organization has, in the About Us section of the site, a side bar saying so-and-so many employees in so-and-so many countries. And one of the documents within the About Us section talked about the organization’s efforts to be a good corporate citizen and employer at all of its so-and-so many locations. If you are in n locations in m countries, though, shouldn’t n be greater than or equal to m?

The other example was documentation for a specialized XML vocabulary which included a summary of gaps in the vocabulary’s coverage and shortcomings in the design. The main gap, said the documentation, was that “the vocabulary offers no solution to the problem of XYZ.” But the vocabulary does offer a solution to that problem: the revision of the vocabulary to deal with problem XYZ is described with some pride two or three sections further up in the document.

One may speculate that in both cases, a perfectly true statement in the document was rendered false by later events, and statements added later to the document, reflecting the later state of affairs, contradict the earlier statements. (There was a gap in the vocabulary, and the documentation mentioned it as a potentially painful one. Eventually it was painful enough to be filled. And the tag-by-tag account of the markup was even revised to document the new constructs. But the description of gaps and shortcomings was not revised. And it’s not hard to believe that an organization may be in n locations at one point, and in a larger number of locations, in m countries, later on.)

In neither of these cases is the contradiction particularly embarrassing or problematic.

[“But I notice you name no names,” said Enrique. “Chicken.” “Hush,” I said. “The names are not relevant.”]

But of course the same problem happens in specs, where inconsistencies may have graver consequences.

[“Ha! I can think of some names under this rubric!” crowed Enrique. “Shut up!” I explained. “I’m trying to describe and understand the problem, not confess my sins.” “Oh, go on! Admitting you have a problem is the first step towards recovery. If you don’t admit that you have the problem, you’ll never — what the ?!” At this point, I managed to get the duct tape over his mouth.]

I think there must be two basic approaches to trying to avoid inconsistencies.

(1) You can try to prevent them arising at all.

(2) You can try to make them easier to detect automatically, so that an automated process can review a set of documents and flag passages that need attention.

Neither of these seems to be easy to do. But for both of them, it’s not hard to think of techniques that can help. And thinking about any kind of information system, whether it’s an XML vocabulary or a database management system or a programming language or a Web site content management system, or a complicated combination of the above, we can usefully ask ourselves:

How could we make it easier to prevent inconsistency in a set of documents?

How could we make it easier to keep a set of documents in synch with each other as they change?

How could we document the information dependencies between documents at a useful level of granularity? (Helpful, perhaps, to say “if document X changes, documents Y and Z, which depend on it, must be checked to see if they need corresponding revisions”, but a lot more helpful if you can say which sections in Y and Z depend on which bits of X.) Could we do it automatically?
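To make the finer-grained version of that question concrete, here is a minimal sketch in Python, assuming that someone (or something) has already recorded which passages depend on which. The passage identifiers such as “Y#sec3”, the DEPENDENCIES list, and the function names are all invented for illustration; nothing here describes an existing tool.

```python
from collections import defaultdict

# Hypothetical dependency records: (dependent passage, passage it relies on).
# The identifiers are invented; any stable way of naming passages would do.
DEPENDENCIES = [
    ("Y#sec3", "X#sec1"),
    ("Z#sec2", "X#sec1"),
    ("Z#sec5", "X#sec4"),
    ("W#sec1", "Y#sec3"),
]

def build_reverse_index(dependencies):
    """Map each passage to the set of passages that depend on it directly."""
    index = defaultdict(set)
    for dependent, source in dependencies:
        index[source].add(dependent)
    return index

def passages_to_review(changed, dependencies):
    """Return every passage that depends, directly or indirectly, on `changed`."""
    index = build_reverse_index(dependencies)
    to_visit = [changed]
    found = set()
    while to_visit:
        current = to_visit.pop()
        for dependent in index.get(current, ()):
            if dependent not in found:
                found.add(dependent)
                to_visit.append(dependent)
    return sorted(found)

if __name__ == "__main__":
    # A change to section 1 of X flags the passages that lean on it.
    for passage in passages_to_review("X#sec1", DEPENDENCIES):
        print("check", passage)
```

The lookup is the easy part. The hard part, and the real question, is how the dependency records get created and kept current in the first place.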

It seems plausible that detecting inconsistencies between distant parts of a document would be easier if we could get a list of (some of) the entailments of each bit of a document.

How can we make it easier to document the entailments of a particular passage in a spec?

For making the entailments of a passage explicit (and thus amenable to mechanical consistency checking with other parts of the document set) I think there are several obvious candidates: RDF, Topic Maps, RDFa, the work Sam Hunting has been doing with embedded topic maps (see for example his report at Balisage 2008), and colloquial XML structures designed for the specific system in question. Years ago, José Carlos Ramalho and colleagues were talking about semantic validation of document content; they must have had something to say about this too. (In the DTD, I recall, they used processing instructions.) Even standard indexing tools may be relevant.
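Whatever the notation, the mechanical check itself might look something like the following sketch in Python. It assumes each passage has been annotated with simple (subject, property, value) statements; the passage identifiers, the property names, and the numbers are all hypothetical stand-ins for the two examples above, not data from either document.

```python
from collections import defaultdict

# Hypothetical annotations: (passage id, subject, property, value).
# The counts are made up; the real documents said only "so-and-so many".
ASSERTIONS = [
    ("spec#tags", "vocabulary", "handles_xyz", True),
    ("spec#gaps", "vocabulary", "handles_xyz", False),
    ("about#sidebar", "org", "countries", 40),
    ("about#citizenship", "org", "locations", 25),
]

def find_contradictions(assertions):
    """Flag subject/property pairs given different values by different passages."""
    seen = defaultdict(dict)  # (subject, property) -> {value: [passage ids]}
    for passage, subject, prop, value in assertions:
        seen[(subject, prop)].setdefault(value, []).append(passage)
    return {key: values for key, values in seen.items() if len(values) > 1}

def find_location_shortfalls(assertions):
    """Flag subjects claiming fewer locations than countries."""
    facts = defaultdict(dict)  # subject -> {property: value}
    for _passage, subject, prop, value in assertions:
        if prop in ("locations", "countries"):
            facts[subject][prop] = value
    return sorted(
        subject
        for subject, props in facts.items()
        if "locations" in props
        and "countries" in props
        and props["locations"] < props["countries"]
    )

if __name__ == "__main__":
    for (subject, prop), values in find_contradictions(ASSERTIONS).items():
        print("conflict:", subject, prop, values)
    for subject in find_location_shortfalls(ASSERTIONS):
        print("fewer locations than countries:", subject)
```

The first function catches the XYZ case, where two passages disagree outright. The second catches the locations-and-countries case, which needs a rule relating two different properties; rules of that kind would have to be written down somewhere too.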

How do these candidates compare? Are they all essentially equally expressive? Does one or the other make it easier to annotate the document? Is one or the other easier to process?

[“If you don’t admit that you have the problem, you’ll never be able to fix it. And you keep that roll of duct tape away from me, you hear?”]