Mainframe terminal rooms and the oral tradition

[7 July 2009]

A number of XML experts I know use Emacs for editing XML, employing either James Clark’s nxml mode or Lennart Staflin’s psgml mode. But few people who don’t already know Emacs are eager to learn it.

My evil twin Enrique suggested a reason: “In the old days,” (he means thirty years ago, when he first learned to use computers), “using a computer mostly meant using a mainframe. Which meant, on most university campuses, using a public terminal room. Which meant there were usually other people around who might be able to help figure out how to make the editor do something. Emacs was able to spread widely in that culture because the written documentation was not the only available source of information. (Did Emacs even have written documentation in those days?) Emacs, and a lot of other tools, were propagated by oral tradition.

“Nowadays, however, the oral traditions of the public terminal room are mostly dead. What the user cannot figure out how to do from the user interface and (perhaps) a glance at the documentation might as well not be in the program. Fewer and fewer users will trouble to learn Emacs.

“I predict that when the people who first learned computing in a mainframe terminal room are dead, Emacs will be effectively dead, too. Its natural method of propagation is by looking over someone’s shoulder at what they are doing and asking ‘How did you do that?’ That doesn’t happen when computing almost always happens in private places.

“R.I.P., Emacs,” he intoned mournfully. “And probably TeX and LaTeX, too.”

“Well, hang on,” I said. “Neither Emacs nor TeX is dead yet.”

“Maybe not, but it’s only a matter of time. They’ll end up in the Retro-Computing Museum.” I could have sworn I saw a tear in his eye.

“But, you know, it’s only a matter of time for all of us. And besides, you’re wrong in at least some ways. I did indeed spend the first few years of my computing life haunting university terminal rooms. I got a lot of help from other people, and I passed it on. But I didn’t use TeX or Emacs until years later. The oral traditions of the terminal room, if they ever actually existed, had nothing to do with it. Both Emacs and TeX are perfectly capable of acquiring new users without oral transmission.”

He looked up. “You mean, there’s hope yet?”

“There’s always hope. But no, I’m still not going to help you debug that self-modifying 360 Assembler program you brought over. I’ve got work to do.”

Formal methods and WGs (response to Jacek Kopecky)

[30 June 2009]

Jacek Kopecky has commented on my earlier posting about Brittleness, regression testing, and formal methods. I started to reply in a further comment, but my response has gotten a bit long, so I’m making a separate post out of it.

Examples needed

Jacek writes:

Dear Michael, you are raising a good point, but you’re doing it the same way as the “formal methods proponents”: you don’t show concrete examples.

Dear Jacek, you are quite correct: I don’t have persuasive examples. I have a gut feeling that formal methods would be helpful in spec development, but I can’t point to convincing examples of places it has helped. Having really convincing examples seems to require first persuading working groups to use formal methods, which is (as you suggest) not easy in the absence of good examples. It’s also a hard sell if the spec in question is at all far along in its development, because it appears that the first thing the group has to do is stop everything else to build a formal model of the current draft. No working group is going to want to do that.

I did embark once on an effort to model the XProc spec in Alloy (also in HTML), in an attempt to capture a spec which was still being worked on, and perhaps persuade the WG to apply formal methods in its work going forward. The example seems to illustrate both the up and the down of using formal methods on a spec. On the up side: the work I did raised a number of questions about the properties of pipeline ‘components’ as they were defined in the then-current draft and may have helped persuade some in the WG to support the proposal to eliminate the separate ‘component’ level. But on the down side: the draft was moving faster than I could move: before I was able to finish the model, submit the paper formally to the WG, and propose that the WG use the model as part of our on-going design process, the draft had been revised in such a way as to make much of the model inaccurate. No one in a WG really wants to consider proposals based on an outdated draft, so no one really wanted to look at or comment on the fragmentary model I had produced. I could not keep up, and eventually abandoned the attempt. Perhaps someone with better modeling skills than mine would have been able to find a way to move faster, or would have found a way to make a lighter-weight model which would be easier to update and keep in synch with the spec. But one possible lesson is: if the group doesn’t use formalisms from the very beginning, it may be very hard to adopt them in mid-process.

Starting small

Other lessons are possible. Kaufmann, Manolios, and Moore write, in Computer-aided reasoning: ACL2 case studies ([Austin: n.p.], 2002), p. 13:

The formalizers will often be struggling with several issues at once: understanding the informal descriptions of the product, discovering and formalizing the relevant knowledge that the implementors take for granted, and formalizing the design and its specification. To a large extent this phenomenon is caused by the fact that formal methods [are] only now being injected into industry. Once a significant portion of a group’s past projects has been formalized, along with the then-common knowledge, it will be much easier to keep up. But at this moment in history, keeping up during the earliest phases of a project can be quite stressful.

When working on a new project, we recommend that the formalizers start by formalizing the simplest imaginable model, e.g., the instruction set with one data operation and a branch instruction, or the protocol with a simple handshake. Choosing this initial model requires some experience…. Since iteration is often required to get things to fit together properly, it is crucial that this initial foray into the unknown be done with a small enough model to permit complete understanding and rapid, radical revision.

In my effort to model XProc, I failed to find a suitably small initial model. (As they say, it takes some experience, which is what I was trying to gain.) But then, how persuasive would a really small toy model be to a working group skeptical of formal methods in the first place? I don’t yet know how to square this circle.

I was more successful in finding useful small partial models when I used Alloy in 2007 to model schema composition. In that case I was able to generate a number of extremely useful test cases, but the exercise, sadly, showed merely that the XML Schema WG does not have any consensus on the meaning of the XSD spec’s description of schema composition. Being able to generate those examples during the original design phase might conceivably have led to a better design. At least, I like to think so. A skeptic might say it would only have led the working group to introduce more special cases and ad hoc rules, in order to avoid having to explain how certain cases are supposed to work. (This is, after all, the XML Schema WG we’re talking about here.) Nothing is perfect.

The tedium of formalization

JK continues:

I, for one, avoid formal methods other than programmatic reference implementation for a few reasons: the tediousness of translating the spec into the formalism that would support all those tests; creating all those tests; and finally the difficulty of being relatively sure that what is written in the formalism actually corresponds to the intent of what is written in the text.

I could change my mind if it was shown that enough formalization for XML Schema that would support your regression tests is not all that tedious.

Here JK makes several excellent points, to which I do not currently have good answers.

On tedium: It’s possible to model a spec selectively (as in my model of schema composition) with relatively little tedium. Indeed, the point of light-weight formal methods (as promoted by Daniel Jackson and embodied in Alloy) is to allow the user to construct relatively small and simple models which cover just the aspects of the design one is currently worried about. There is no fixed foundational set of primitives in which models must be described; the model can specify its own primitives.

For example, I was able to model some simple aspects of XSD 1.0 schema composition without having to model all of schema validation, and without modeling all the details of XML well-formedness. If the paper on that model does not seem especially light-weight, I think it’s due to the need to specify how to inspect the behavior of implementations unambiguously.
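
To give a feel for what such a selective model can look like, here is a sketch in Python rather than Alloy (and emphatically not the schema-composition model just mentioned): it invents a toy notion of schema composition and then checks one property of it exhaustively over all very small instances, in the spirit of Alloy’s small-scope analysis. Every name, the composition rule, and the property checked are made up for illustration; nothing here is taken from the XSD spec.

```python
from itertools import chain, combinations, product

# Toy, hypothetical model: a "schema" is just a set of component names,
# and an include relation says which schemas pull in which others.
# Nothing here comes from the XSD spec or from the Alloy model discussed
# above; the point is only the shape of a light-weight, small-scope check.

def visible(schemas, includes, root):
    """Components visible from `root`, following includes transitively."""
    seen, todo, comps = set(), [root], set()
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        comps |= schemas[s]
        todo.extend(t for (f, t) in includes if f == s)
    return comps

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def check_monotone(names=("s0", "s1", "s2"), pool=("a", "b")):
    """Small-scope check: adding an include edge never removes a visible
    component. Exhaustive over all tiny instances, Alloy-style."""
    edges = [(f, t) for f in names for t in names if f != t]
    for contents in product(list(powerset(pool)), repeat=len(names)):
        schemas = dict(zip(names, (set(c) for c in contents)))
        for incl in powerset(edges):
            before = visible(schemas, set(incl), names[0])
            for extra in edges:
                after = visible(schemas, set(incl) | {extra}, names[0])
                if not before <= after:
                    return ("counterexample", schemas, set(incl), extra)
    return "no counterexample in this scope"

if __name__ == "__main__":
    print(check_monotone())
```

The check comes back clean for this toy model; the value of the exercise in real use is that when such a check fails, the tool hands you a concrete counterexample small enough to put in front of a working group.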

I have to confess, though, that it’s not always obvious to me how best to construct such partial models. For a long time I avoided trying to model any part of XSD because I thought I was likely to have to model the Infoset and XML in full, first. I’d like to do that someday, just for the satisfaction, but I don’t expect it to be quick. It was only after a longish while that I saw a way to make a simpler partial model. The more accurately I can identify the part of the design I’m worrying about, the easier it is to find a partial model for that part of the design.

But in the earlier post, I was thinking about a kind of regression testing which would be intended to identify problems and interactions I was not consciously worried about. For that to work, I expect I would need a model that covered pretty much all of the salient properties of the entire spec. And at the moment, I cannot say I expect the construction of such a model to be entirely free of tedium. If I think that it might nevertheless be a good idea, it’s because checking natural-language prose for consistency is also tedious (which is why working groups sometimes do not do it well, or at all).

Consistency of prose and formalism

JK goes on:

In other words, if you have one model (the spec in English), it will have dark corners that need discovering and clearing up. That will have a cost. If you have two models (the spec in English and the formalism), in my experience both may have dark corners (but one model’s dark corner may be clarified by the other model where it’s not a dark corner, so that may be a net plus), but there’s also the consistency of the two models that comes in question. So, gimme examples, show me that the consistency is not a problem, please, and I’ll be very grateful indeed.

I read somewhere once (in a source I have not managed to find again) that some international standards bodies require, or recommend, that their working groups produce both the English and the French version of their specifications, rather than working monolingually and then having a normative translation made. Why? Because working monolingually left too many dark corners in the text; working on both versions simultaneously led to clearer texts in both languages. The initial process was slower, but the additional effort was paid back by better results with lower maintenance costs.

I speculate that working both in English and in a suitable formalism like Alloy or ACL2 would have a similar effect: initial progress would feel slower, but the results would be better and maintenance would be easier. (This is a lot like having test cases in software development: it seems to slow you down at first, but it speeds things up later.)

Consistency between the two formulations should not, in principle, be any greater problem than consistency among the natural-language formulations of different parts of a complex spec. In fact, it should (note that modal verb!) be less of a problem: natural language provides only so much help with consistency checking, whereas formalisms tend to be a bit better on that score. The only really complex spec I have ever seen with no inconsistencies I could detect was ISO 8879, whose editor was a lawyer skilled in the drafting of contracts.

There are, of course, things a working group can do to make it easier, or harder, to maintain consistency between the prose and the formalism. Putting the formal treatment in a separate document (like the XPath formal semantics) or a separate part of the same document (like the schema for schema documents in XSD) is a good way to make it easier for inconsistencies to remain undetected. Integrating fragments of the formalism with the prose describing those constructs (as is done with the BNF in the XML spec, or in any good literate program) makes it easier to detect and remove inconsistencies. It also puts pressure on the working group to use a legible formalism; the XML form of XSD might be more readable if the working group had forced itself to read it more often instead of hiding it in the appendix to the spec.

Integrating the formalism into the text flow of the spec itself can work only if the members of the working group are willing to learn and use the formalism. That’s why I attach such importance to finding and using simple notations and light-weight methods.

But this long post is just more speculation on my part. You are right that examples are needed. Someday, I hope to oblige.

Trip report: Digital Humanities 2009

[29 June 2009; subheads added 30 June 2009]

Last week I was at the University of Maryland attending Digital Humanities 2009, the annual joint conference of the Association for Computers and the Humanities (ACH), the Association for Literary and Linguistic Computing (ALLC), and the Society for Digital Humanities / Société pour l’étude des médias interactifs (SDH/SEMI) — the constituent societies of the Alliance of Digital Humanities Organizations. It had a fine concentration of digital humanists, ranging from stylometricians and adepts of authorship attribution to theorists of video games (this is a branch of cultural and media studies, not to be confused with the game theory of von Neumann and Morgenstern — there may be a promotional opportunity in the slogan “Not your grandfather’s game theory!”, but I’ll let others take it up).

I saw a number of talks; some of the ones that stick in my mind are these.

Rockwell / Sinclair on Knowledge radio

Geoffrey Rockwell (Univ. of Alberta) and Stefán Sinclair (McMaster U.) talked about “Animating the knowledge radio” and showed how one could lower the threshold of text analysis by processing raw streams of data without requiring that the data be indexed first. The user’s wait while the stream is being processed can be made bearable, and perhaps even informative, by animating the processing being performed. In one of their examples, word clouds of the high-frequency words in a text are created, and the cloud changes as new words are read in and the list of most-frequent words changes. The analogy with radio (as a stream you tap into without saving it to disk first) is arresting and may have potential for doing more work than they currently make it do. I wonder, too, whether going forward their analysis would benefit by considering current work on continuous queries (a big topic in database research and practice today) or streaming query processors (more XQuery processors act on streams than act on static indexed data). Bottom line: the visuals were pretty and the discipline of making tools work on streams appears to be beneficial.
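
For concreteness, here is a minimal sketch of the underlying idea (mine, not Rockwell and Sinclair’s code, and making no claims about their implementation): consume a text a line at a time, update word frequencies as the lines arrive, and emit periodic snapshots of the current top words, which is just the sort of intermediate state an animated word cloud can display while the stream is still being read.

```python
import collections
import re
import sys

def top_words_over_stream(lines, k=10, report_every=100):
    """Update word counts incrementally and yield a snapshot of the k most
    frequent words every `report_every` lines; no index is built in advance."""
    counts = collections.Counter()
    for i, line in enumerate(lines, 1):
        counts.update(re.findall(r"[a-z']+", line.lower()))
        if i % report_every == 0:
            yield i, counts.most_common(k)

if __name__ == "__main__":
    # e.g.  python top_words.py < some-large-text.txt
    for lineno, snapshot in top_words_over_stream(sys.stdin, k=5, report_every=50):
        print(lineno, snapshot)
```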

Roued, Cayless, Stokes on paleography and the reading of manuscripts

Henriette Roued (Oxford) spoke on an Interpretation Support System she is building in company with others, to help readers and editors of ancient documents keep track of cruces, conjectures, arguments in favor of this or that reading of a crux, and so on. It was very impressive stuff. In the same session, Hugh Cayless (NYU) sketched out a kind of theoretical framework for the annotation of manuscript images, starting from a conventional scan, processing it into a page full of SVG paths, and attempting from that to build up a web of links connecting transcriptions to the image at the word token and line levels. This led to various hallway and lunchroom conversations about automatic detection of page structure, or mise en page, about which I know mostly that there are people who have studied it in some detail and whose results really ought to be relevant here. The session ended with Peter Stokes (Cambridge) talking about the past and future of computer-aided paleography. Among them, the three speakers seemed to have anticipated a good deal of what Claus Huitfeldt, Yves Marcoux, and I were going to say later in the week, and their pictures were nicer. This could have been depressing. But we decided to take this fact as a confirmation that our topic really is relevant.

One thing troubles me a bit. Both Roued and Cayless seem to take as a given that the regions of a document containing basic tokens provide a tessellation of the page; surely this is an oversimplification. It is perhaps true for most typewritten pages using the Roman alphabet, if they have no manuscript additions, but high ascenders, low descenders, complex scribal abbreviations, and even printers’ ligatures all suggest that it would be wiser to assume that the regions occupied by basic tokens can overlap each other. (Not to mention the practice in times of paper shortage of overwriting the page with lines at right angles to the first set. And of course there are palimpsests.) And in pages with a lot of white space, it doesn’t seem obvious to me that all of the whitespace need be accounted for in the tabulation of basic tokens.

Bradley on the contributions of technical specialists to interdisciplinary projects

John Bradley (King’s College London) closed (my experience of) the first day of the conference by presenting a thought-provoking set of reflections on the contribution of specialists in digital humanities to projects undertaken jointly with humanists who are not particularly focused on the digital (analog humanists?). Of course, in my case he was preaching to the choir, but his arguments that those who contribute to the technical side of such projects should be regarded as partners, not as factotums, ought to be heeded by everyone engaged in interdisciplinary projects. Those who have ears to hear, let them hear.

Pierazzo on diplomatic editions

One of the high points of the conference for me was a talk on Wednesday by Elena Pierazzo (King’s College London), who spoke under the title “The limit of representation” about digital diplomatic editions, with particular reference to experience with a three-year project devoted to Jane Austen’s manuscripts of fiction. She spoke eloquently and insightfully about the difference between transcriptions (even diplomatic transcriptions) and the original, and about the need to choose intelligently when to capture some property of the original in a diplomatic edition and when to gesture instead toward the facsimile or leave the property uncaptured. This is a quiet step past Thomas Tanselle’s view (Studies in Bibliography 31 [1978]) that “the editor’s goal is to reproduce in print as many of the characteristics of the document as he can” — the history of digital editions, short as it is, provides plenty of examples to illustrate the proposition that editorial decisions should be driven by the requirements of the material and of the intended users of the edition, not (as in Tanselle’s view) by technology.

Ruecker and Galey on hermeneutics of design

Stan Ruecker (U. Alberta) and Alan Galey (Toronto) gave a paper on “Design as a hermeneutic process: Thinking through making from book history to critical design” which I enjoyed a great deal, and think I learned a lot from, but which appears to defy paraphrase. After discussing the competing views that design should be the handmaiden of content and that design can and should itself embody an argument, they considered several examples, reading each as the embodiment of an argument, elucidating the work and the argument, and critiquing the embodiment. It gave me a pleasure much like that of sitting in on a master class in design.

Huitfeldt, Marcoux, and Sperberg-McQueen on transcription

In the same session (sic), Claus Huitfeldt (Univ. of Bergen), Yves Marcoux (Univ. de Montréal), and I gave our paper on “What is transcription? part 2”; the slides are on the Web.

Rockwell and Day, Memento mori for projects

The session concluded with a presentation by Geoff Rockwell (again! and still at Alberta) and Shawn Day (Digital Humanities Observatory, RIA Dublin) called “Burying dead projects: Depositing the Globalization Compendium”. They talked about some of the issues involved in depositing digital work with archives and repositories, as illustrated by their experience with a several-year project to develop a collection of resources on globalization (the Globalization Compendium of the title). Deposit is, they said, a requirement for all projects funded by the Canadian Social Sciences and Humanities Research Council (SSHRC), and has been for some time, but the repositories they worked with were still working out the kinks in their processes, and their own initial plans for deposit were also subject to change (deposit of the material was, interestingly, from the beginning planned into the project schedule and budget, but in the course of the project they changed their minds about what “the material” to be deposited should include).

I was glad to hear the other talks in the session, but I never did figure out what the program committee thought these three talks had in common.

Caton on transcription and its collateral losses

On the final day of the conference, Paul Caton (National Univ. of Ireland, Galway) gave a talk on transcription, in which he extended the analysis of transcription which Claus Huitfeldt and I had presented at DH 2007 (later published in Literary & Linguistic Computing) to consider information beyond the sequence of graphemes presented by a transcription and its exemplar.

There are a number of methodological and terminological pitfalls here, which mean caution is advised. For example, we seem to have different ideas about the meaning of the term token, which some people use to denote a concrete physical object (or distinguishable part of an object), but which Paul seems to use to denote a particular glyph or graphetic form. And is the uppercase / lowercase distinction of English to be taken as graphemic? I think the answer is yes (changing the case of a letter does not always produce a minimal pair, but it sometimes does, which I think suffices); Paul explicitly says the answer is no.

Paul identifies, under the cover term modality, some important classes of information which are lost by (most) transcriptions: presentation modality (e.g. font shifts), accidental modality (turned letters, malformed letters, broken type, even incorrect letters and words out of sequence), and temporal modality (the effects of time upon a document).

I think that some of the phenomena he discusses can in fact be treated as extensions of the set of types used to read and transcribe a document, but that raises thorny questions to which I do not have the answer. I think Paul has placed his finger upon a sore spot in the analysis of types and tokens: the usual view of the types instantiated by tokens is that we have a flat unstructured set of them, but as upper- and lower-case H, roman, italic, and bold instances of the word here, and other examples (e.g. long and short s, i/j, v/u) illustrate, the types we use in practice often do not form a simple flat set in which the identity of the type is the only salient information: often types are related in special ways. We can say, for purposes of analysis and discussion, that a set of types which merges upper and lower case, on the one hand, and one which distinguishes them, on the other, are simply two different sets of types. But then, in practice, we operate not with one type system but with several, and the relations among type systems become a topic of interest. In particular, it’s obvious that some sets of types subsume others, and conversely that some are refinements of others. It’s not obvious that subsumption / refinement is the only relation among sets of types that is worth worrying about. I assume that phonology has similar issues, both with identifying phonemes and with choosing the level of detail for phonetic transcriptions, but I know too little of phonology to be able to identify useful morals for application here.
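
To make the refinement relation concrete, here is a toy sketch (the token images and the two classifiers are invented for illustration, and this is my formulation, not Paul’s): treat a type system as a function classifying token images into types, and say that one system refines another if, whenever the finer system gives two tokens the same type, the coarser one does too.

```python
# Hypothetical illustration only: two toy type systems over a handful of
# token images, and a check of the refinement relation between them.

tokens = ["H", "h", "here", "Here", "HERE"]

def case_sensitive(tok):      # finer system: case distinctions preserved
    return tok

def case_folded(tok):         # coarser system: upper and lower case merged
    return tok.lower()

def refines(finer, coarser, tokens):
    """True if whenever `finer` assigns two tokens the same type,
    `coarser` does too (i.e. the coarser system factors through the finer)."""
    return all(
        coarser(a) == coarser(b)
        for a in tokens
        for b in tokens
        if finer(a) == finer(b)
    )

print(refines(case_sensitive, case_folded, tokens))   # True
print(refines(case_folded, case_sensitive, tokens))   # False: folding loses a distinction
```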

What, no markup theory?

Looking back over this trip report, I notice that I haven’t mentioned any talks on markup theory or practice. Partly this reflects the program: a lot of discussions of markup theory seem to have migrated from the Digital Humanities conference to the Balisage conference. But partly it’s illusory: there were plenty of mentions of markup, markup languages, and so on. Syd Bauman and Dot Porter talked about the challenge of improving the cross referencing of the TEI Guidelines, and many talks mentioned their encoding scheme explicitly (usually the TEI). The TEI appears to be in wide use, and some parts of the TEI which have long been neglected appear to be coming into their own: Jan Christoph Meister of Hamburg and his students have built an annotation system (CATMA) based on TEI feature structures, and at least one other poster or paper also applied feature structures to its problem. Several people also mentioned standoff markup (though when one otherwise interesting presenter proposed using character offsets as the way to point into a base text, I left quietly to avoid screaming at him during the question session).

The hallway conversations were also very rewarding this year. Old friends and new ones were present in abundance, and I made some new acquaintances I look forward to renewing at future DH conferences. The twitter stream from the conference was also abundant (archive); not quite as active as an IRC channel during a typical W3C meeting, but quite respectable nonetheless.

All in all, the local organizers at the Maryland Institute for Technology in the Humanities, and the program committee, are to be congratulated. Good job!

Another example of the curb-cut effect

[29 June 2009]

The XSD Datatypes spec has a diagram showing the hierarchical derivation relations among the built-in datatypes. The old version (created by Asir Vedamuthu, to whom thanks, and used in XSD 1.0 and in earlier drafts of XSD 1.1) has simple color-coding to distinguish various classes of datatypes (what are now called the special datatypes, the primitives, and the other built-ins).

For the Candidate Recommendation draft of XSD 1.1, though, we needed to make a new drawing to show the built-in datatypes added in 1.1 (anyAtomicType, dateTimeStamp, dayTimeDuration, yearMonthDuration, precisionDecimal).

The new version created for the Candidate Recommendation draft has a new color scheme, which I made with the help of a very nice tool for color, now to be found at colorschemedesigner.com (I used the previous version, but the functionality I counted on is still there). This tool (and some others) allows you to see an approximation of the effect of your color scheme for a reader with various forms of color vision deficiency (protanopia, deuteranopia, tritanopia, etc.), which means you can try to ensure that the distinctions in your diagrams are also visible to readers with those forms of vision.

I found it remarkable that I ended up with a color scheme more attractive than the old one; a surprising number of people have told me they think the same (without realizing the proximate cause of the change).

SVG, of course, makes it easy to make diagrams for which the color scheme can easily be modified. And XSLT makes it easier to generate this diagram and to modify it systematically in various ways (including color scheme). But it’s the idea of universal design that gets the credit for making the diagram visually more attractive.
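
As a small illustration of how mechanical such a change can be, here is a Python sketch (not the XSLT actually used for the diagram; the palette and file names are invented) that swaps one set of fill colors for another throughout an SVG file.

```python
import xml.etree.ElementTree as ET

# Hypothetical palette: old fill value -> new fill value.
PALETTE = {
    "#ffcc00": "#e8f1d4",
    "#99ccff": "#cfe2f3",
    "#ff9999": "#f4cccc",
}

def recolor(svg_in, svg_out, palette=PALETTE):
    """Rewrite fill attributes according to `palette`, leaving the rest
    of the drawing untouched."""
    ET.register_namespace("", "http://www.w3.org/2000/svg")
    tree = ET.parse(svg_in)
    for el in tree.getroot().iter():
        fill = el.get("fill")
        if fill in palette:
            el.set("fill", palette[fill])
    tree.write(svg_out, xml_declaration=True, encoding="UTF-8")

# Invented file names, for illustration:
# recolor("datatype-hierarchy.svg", "datatype-hierarchy-new.svg")
```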

Universal design: try it sometime. You’ll be glad you did.

Erik Naggum, R.I.P.

[20 June 2009]

It appears from reports on the Net that Erik Naggum, long-time genius loci of comp.text.sgml, has died.

In person, he was (as far as I could tell, on the very few occasions I encountered him in the flesh) a very sweet individual. On the net — well, he taught me what a flame war was. His work on internationalization gave hints of great generosity; his resentment against the Unicode Consortium was almost comic in its ferocity (even to me, never one of that organization’s greatest fans).

Erik Naggum, dead? Is it possible? One person fewer who remembers the old days.

So it goes.