Text Analysis Software Planning Meeting

Princeton, 17-19 May 1996

Trip Report


C. M. Sperberg-McQueen

Table of Contents

1 The Next Generation
2 The CETH Text Analysis Software Planning Meeting
3 Day 1: Overview of Current Tools
  3.1 Pre-SGML Tools
  3.2 SGML Tools
  3.3 User Requirements
4 Day 2: Implementation Issues
5 Is the Glass Half Full or Half Empty?


1 The Next Generation

The humanistic researcher who wants computer assistance with research on texts today faces a rich array of possibilities: concordance programs of the batch and interactive persuasions, lemmatizers, morphological analysers, full text retrieval systems, sophisticated and simple hypertext delivery systems, collation programs, typesetting software. Some of this embarrassment of riches runs in a client/server environment, some runs standalone, some for DOS, some for Windows, the Mac, Unix. Much of this software is well written, much is well documented, much is available at reasonable cost or for free. And sometimes these three categories even overlap.

Why, then, does there seem to be a consensus that we are facing a sort of crisis of confidence in our software? Why do so many people in humanities computing regard software as a Problem We Must Face Up To Soon? Why am I one of them?

Because a generation is passing.

From the first uses of machine-readable text for humanistic research in the late 1940s, to today, I count three generations. The first generation included special-purpose, ad hoc programs written for particular projects to apply to particular texts. The second generation can be counted from early efforts to create reusable libraries of text-processing routines, some of which eventually turned into efforts to create general-purpose, reusable programs for use with many texts. Naturally, these were batch programs. The Oxford Concordance Program (OCP) and some modules of the Tustep system are good examples. In the third generation, general-purpose programs became interactive: Arras, the Tustep shell, Word Cruncher, Tact, and other newer programs are all third-generation programs in this sense.

What these programs do is useful: produce concordances, allow interactive searching, annotate texts with linguistic or other information, and a multitude of other tasks. Why aren't people happier about the current generation of software?

I think there are several reasons:

  1. For many potential users, existing software still seems very hard to learn. Not everyone thinks so, but those who find current software easy are in a decided minority. User interfaces vary so much between packages that learning a new package typically means learning an entire new user interface: there is very little transfer of training.
  2. Current programs don't interoperate well, or at all. CLAWS or other morphological analysers can tag your English text with part-of-speech information, but if you've tagged your text with Cocoa or SGML markup, you'll have to strip it out beforehand, to avoid confusing CLAWS. If you want to keep the tagging, you'll have to fold it back into the text after CLAWS gets through with it, as the British National Corpus did. Other programs, of course, will be confused in turn by the part-of-speech tagging, so you'll have to strip that out, too, before applying other text analysis tools, and then fold it back in again afterwards. (A toy sketch of this strip-and-restore round trip appears just after this list.)
  3. Current programs are often closed systems, which cannot easily be extended to deal with problems or analyses not originally foreseen. You either do what the authors expected you to do, or you are out of luck. The more effort has gone into a program's user interface, the more likely the program is to resist extension (this need not be so, but it is commonly so); conversely, the more open a system is, i.e. the more care its developers have put into allowing the user freedom to undertake new tasks, the worse the reputation of its user interface is likely to be. A comparison of OCP and Tustep is instructive here: OCP has a very carefully designed user interface and is remarkably easy to learn, but it is a closed black box, which cannot be extended except by defeating its designers' determined efforts to keep you from seeing what's happening under the hood. Tustep, on the other hand, is remarkably open and flexible, but has a reputation for complexity which is perhaps not completely unrelated to that openness and flexibility.
  4. Almost all current text analysis tools rely on what now seems a hopelessly inadequate model of text structure. Text, for these programs, is almost invariably nothing but a linear sequence of words or letters. The most sophisticated of them may envisage it as a linear sequence with tags showing the values of different variables at different points in the sequence (as in Cocoa-style tagging), or as an alternating sequence of text and processing instructions (what is sometimes called the one-damn-thing-after-another model of text structure). None of the current generation of text-analysis tools support the significantly richer structural model of SGML, though the experience of the TEI has persuaded many people that that structural model represents a radical breakthrough in our ability to model text successfully in machine-readable form.
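
To make the round trip described in point 2 concrete, here is a deliberately toy sketch in Python; the tag pattern and the stand-in "tagger" are invented for the illustration and do not reflect the interface of CLAWS or of any real program. The point is only how much bookkeeping falls to the user when tools will not pass markup through untouched.

    import re

    # A token is either a complete angle-bracketed tag or a run of non-space characters.
    TOKEN = re.compile(r"<[^>]*>|\S+")

    def strip_markup(text):
        """Remove the markup, remembering each tag and the word position it preceded."""
        words, saved = [], []
        for token in TOKEN.findall(text):
            if token.startswith("<"):
                saved.append((len(words), token))   # this tag stood just before word number len(words)
            else:
                words.append(token)
        return " ".join(words), saved

    def toy_pos_tagger(plain_text):
        """Stand-in for a part-of-speech tagger: appends a dummy code to every word."""
        return " ".join(word + "_XX" for word in plain_text.split())

    def restore_markup(tagged_text, saved):
        """Fold the remembered tags back in at the word positions they came from."""
        words = tagged_text.split()
        for index, tag in reversed(saved):          # work backwards so earlier indices stay valid
            words.insert(index, tag)
        return " ".join(words)

    original = "<A Shakespeare> <T Hamlet> To be or not to be <L 56> that is the question"
    plain, saved = strip_markup(original)
    enriched = toy_pos_tagger(plain)                # the step the markup would otherwise confuse
    print(restore_markup(enriched, saved))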

Different people, of course, assign different weights to these problems. Some are most moved by the need for better user interfaces, others by the need for tools which combine SGML-awareness with the textual-analysis primitives of the current generation of software. But for whatever reasons, many people seem to agree that the textual-research community is in need of a new generation of text-analysis tools.

And that is why now, on a sweltering afternoon, I am sitting in Newark Airport awaiting my flight home after a two-day Text Analysis Software Planning Meeting at Princeton sponsored by the Princeton / Rutgers Center for Electronic Texts in the Humanities (CETH), and trying to decide what to think about the meeting. I enjoyed myself immensely, and am grateful to CETH for having put it on and for having invited me, but I am not quite sure what to think. Was the meeting a rousing success, because there was so much successful exchange of information? Or was it a bit of a disappointment, because we did not leave with a clear consensus on what the next generation of software should look like or how to set about developing it? Perhaps it's expecting too much to hope that a meeting like this will develop a concrete plan of attack. Is the glass half full or half empty?

2 The CETH Text Analysis Software Planning Meeting

The meeting was organized by Susan Hockey at CETH, with assistance from Willard McCarty and Malcolm Brown, along with (less intensively) Allen Renear, Harold Short, and myself. An earlier meeting had gathered a smaller number of participants to discuss the same general topic; one conclusion of that earlier meeting was that further meetings, with broader representation of interested parties, were desirable. Participants came to this second meeting from North America, Europe, and Japan, with interests primarily in language and literature.

A couple of people asked me how this effort related to the Text Software Initiative, announced by Nancy Ide and Jean Veronis at the Georgetown ACH/ALLC '93 conference. There is, as far as I know, no relation, but this meeting took pains to avoid at least one mistake for which the TSI was criticized at Georgetown: it invited many people, from many places, to participate in the planning, so as to ensure that any resulting effort would have broad community support.

The organizers as a group believed, not unanimously but on the whole, that it was dangerous to allow the conversations to veer too close to the topic of implementation, and that it was important to keep the participants focused on user requirements. This had the salutary consequence of preventing a rapid descent into geek-talk, but another result, unfortunately, was to rule out of order in advance many of the most interesting topics which come up in planning a new generation of text analysis software. It was also not possible to keep everyone's mind off implementation issues, broadly defined: they turned up repeatedly in discussions, unconvincingly disguised as user requirements. But "users need good search facilities" and "users need an open, modular architecture" are, despite the similarity of their structure, statements about very different aspects of the system. As a member of the organizing committee, I must accept responsibility for the mistake, but in retrospect it does seem to me to have been a mistake; a fuller, freer discussion of technical questions would I think have allowed a much more convincing treatment of them than was managed in the event.

3 Day 1: Overview of Current Tools

3.1 Pre-SGML Tools

We spent the first day in a useful overview of existing tools, including Lexa (a large collection of DOS-based utility programs for managing and studying language corpora), Monoconc (a Windows-based concordance package with a strong emphasis on simplicity of interface), OCP (an aging batch concordance program with some features still missing from most of its livelier interactive brethren), Tact (a DOS-based interactive concordance package), TextPack (a package of programs for content analysis developed at ZUMA, the Zentrum für Umfragen, Methoden, und Analyse, in Mannheim, originally for mainframes but now for microcomputers), and Tustep (a collection of interoperable programs developed in Tübingen for creating, managing, manipulating, and typesetting electronic texts). The programs in this first batch of tools were all created for researchers, and almost all of them were finished or well under way before SGML came, like a great wave, and changed forever the way many people think about electronic representation of text.

All the talks were interesting in one way or another; perhaps the most important points for thinking about the next generation of text analysis tools were made by Wilhelm Ott in a paper originally given at ALLC/ACH 92 in Oxford and redistributed to the participants. The key features he identified for any such software are, in brief:

  1. modularity: the software should consist of relatively small components, each devoted to a specific task, which can be combined freely to perform work not explicitly foreseen by the developers;
  2. professional quality: the components should perform to a standard adequate for serious scholarly work, well beyond what common off-the-shelf software provides;
  3. integration: the components should among them cover all the major tasks that arise in a project's work with electronic text;
  4. portability: the software should run on a wide range of operating platforms.

Tustep exemplifies these principles in its way. Its individual modules each perform a relatively specific task; all Tustep programs read and write a single basic file structure, which means that they can be combined, like Tinkertoys or Unix filters, in arbitrary ways to perform work not explicitly foreseen by the developers. They perform to a high standard, well beyond what is possible with common off-the-shelf office-automation software. They include tools for all the major tasks a user normally needs, including some that duplicate basic operating-system functions, which helps ensure that a project can change platforms midstream without having to relearn how to copy and rename files, and so on. And Tustep runs on a wide range of operating platforms.
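
None of what follows is Tustep code, but a toy sketch in Python may show the design principle at work: if every filter reads and writes the same simple record structure (here, a list of numbered lines), filters compose freely, like Unix pipes, in combinations no single author anticipated.

    # Each "filter" takes and returns the same record structure:
    # a list of (line_number, text) pairs.
    def drop_empty(records):
        return [(n, text) for n, text in records if text.strip()]

    def lowercase(records):
        return [(n, text.lower()) for n, text in records]

    def number_words(records):
        return [(n, "%2d words | %s" % (len(text.split()), text)) for n, text in records]

    def pipeline(records, *filters):
        for f in filters:            # the uniform interface is what makes free combination possible
            records = f(records)
        return records

    sample = [(1, "To be, or not to be,"), (2, ""), (3, "That is the Question.")]
    for n, text in pipeline(sample, drop_empty, lowercase, number_words):
        print(n, text)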

The news that Tustep is now bilingual (German/English) clearly took some participants by surprise, and I suspect that the number of Tustep sites in North America will rise as a result of this meeting.

Given the focus of the meeting on text analysis, it is perhaps not surprising that most of the discussion ignored the issues raised by the third point (integration). In practice, however, it requires serious attention. If we wish to focus on users and their tasks, not just on software and its functions, then our attention must inexorably be drawn to activities of text preparation and presentation, and cannot be restricted to some narrow notion of text analysis as opposed to other activities. At the same time, insisting on supporting every possible user task within a single suite of related programs can lead to pointless duplication of effort and to a certain sense of being closed off from the rest of the world. Some important functions, e.g. data capture and interactive display, are well supported, at least for SGML data, by existing commercial and non-commercial software. Tools for text enrichment and analysis must cooperate gracefully with these editors and browsers, but we don't need to reimplement them all.

3.2 SGML Tools

The second half of the first morning was devoted to SGML-aware tools: Dynatext, Explorer, Pat, and SARA. The first two are systems for the publication of SGML documents; the third is an extremely powerful search engine originally developed for the New Oxford English Dictionary. The last is an SGML-aware search and retrieval system developed in connection with the British National Corpus (BNC), a 100-million-word corpus of modern written and spoken British English. The main drawback of the first three systems, apart from their commercial price -- often quite reasonable by institutional standards, but out of the range of individuals -- is that the two display systems are not geared to scholarly querying, and that Pat, while geared to extremely elaborate querying, is an engine without a user interface, and experience shows it is not easy to find a good user interface for query capabilities so extensive and powerful. SARA is very promising, but hovers uncomfortably between being a general SGML tool and being tailored exclusively for the BNC. The intention of the developers, I am told, is to release it as a general tool, but when I ask questions about specific SGML queries I'd like to be able to issue, the answer, with depressing regularity, is "No, it can't do that, we didn't need anything like that for the BNC." As a general SGML tool, then, it isn't yet ready for prime time.

3.3 User Requirements

After lunch, we heard talks from representatives of various specialized user communities discussing the requirements of those communities: classics and Biblical studies (with a clear overview by Winfried Bader of the German Bible Society), manuscript studies (with examples from the autograph MSS of C.S. Peirce by Michael Neuman, focusing in particular on the enormous problem of sequencing the confused, disordered mass of his posthumous papers), literary criticism (in the form of a letter to the participants from John Burrows, who was unable to attend), East Asian material (for which Shoichiro Hara provided illustrations of problems facing the National Institute of Japanese Literature, some of them related to the character set problem and some shared with Western manuscript material), and documentary editors (David Chesnutt described the work of preparing scholarly editions -- to my astonished chagrin, however, most participants appear to have decided that `text preparation' and `text analysis' did not have much in common, although in fact I think each has more problems in common with the other than either can call uniquely its own).

We concluded the afternoon (but not the day) in small groups discussing who the potential users of text analysis software are and what requirements they might have. The group I was in came to the unsurprising conclusion that every discipline in the humanities, and most disciplines outside the humanities, might use text analysis tools, as might university administrators (only on occasion, however, and only by accident) and people outside the university. The core functions we identified were support for a rational model of text in the form of support for SGML/TEI encoding, good search functions, and a fully scriptable user interface. Character set support and good style sheets and display also came up, but didn't get as many votes, perhaps because some of us were able to persuade ourselves that character set support was implicitly included in good TEI support, and style sheets are implied logically by a fully configurable user interface. This seems a rather meager result for two hours of intensive discussion, but explicit confirmation of one's hunches by the hunches of other people should probably not be valued too lightly. More serious, perhaps, was that after the end of the session, Wilhelm Ott and I each thought of critical aspects of functionality left completely unmentioned: sorting, for one, and all manner of interim processing, including linguistic or other analysis and tagging. Later in the evening, it occurred to me that I had left unmentioned what I think is the most important area of all for new work: the systematic transformation of one SGML document into another SGML document; this is known in the SGML industry as tree-to-tree transformation. It includes the enrichment of the content with annotation, but more generally also includes a lot of processing which cannot usefully be described as annotation.
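
Since `tree-to-tree transformation' may be an unfamiliar phrase, a deliberately tiny sketch may help; it uses Python and XML rather than a full SGML toolchain, and the element names are invented for the example, not taken from the TEI. The essential point is that the output is itself a structured document, systematically derived from the input document, not merely a report or a concordance.

    import xml.etree.ElementTree as ET

    SOURCE = """<anthology>
      <poem><title>Western Wind</title><line>Westron wynde when wyll thow blow</line></poem>
      <poem><title>Adieu</title><line>Adieu, farewell earths blisse</line>
                                <line>This world uncertaine is</line></poem>
    </anthology>"""

    def to_title_index(anthology):
        """Transform an <anthology> tree into a different tree: an index of titles with line counts."""
        index = ET.Element("titleIndex")
        for poem in anthology.findall("poem"):
            entry = ET.SubElement(index, "entry")
            entry.set("lines", str(len(poem.findall("line"))))
            entry.text = poem.findtext("title")
        return index

    result = to_title_index(ET.fromstring(SOURCE))
    print(ET.tostring(result, encoding="unicode"))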

After dinner, the participants met once more (flagging slightly by this time) to hear reports from the work groups.

4 Day 2: Implementation Issues

The second morning's program had the label Implementation Issues, but in fact it was a heterogeneous series of presentations on miscellaneous topics. Where I come from, implementation implies being a bit more concrete and specific than any of these talks (fifteen minutes each in length) was able to be.

In my own fifteen minutes, on SGML and TEI encoding, I had been asked to address the questions:

The first question had caused me to vacillate unhappily among amusement, resignation, and truculence for several days, since I devoutly hoped that for most participants this was not really an open question, and I realized that if any participants still harbored doubts about SGML and TEI, after the ten years of intensive discussion they had received in the relevant conferences and journals, then a fifteen-minute talk from me was hardly likely to tip the balance. So I asked the participants whether they had any specific concerns about supporting TEI which we could usefully address. The questions they asked seemed mostly informational in nature:

John Bradley expressed concern that writing software to support TEI markup would leave users out in the cold if all they had were texts without markup. Since we had spent most of the previous day hearing about all the text analysis software already available for such texts, I have to admit I do not see the force of this objection, even if support for the TEI were held to imply non-support for other markup or for texts not marked up at all, which it does not. The pressure of time made it impossible to air this issue adequately, and it remained unresolved through the entire meeting.

To be honest, of course, I think worrying excessively about legacy data is a good way to make software development significantly harder and longer and to water down the end product, and the only software I'm interested in developing to work on non-marked-up text is software to assist in tagging it. But I cannot stop other people from working on whatever software they think interesting and useful, and would not want to even if I could.

Otherwise, no one raised any serious objections to SGML support, though it later became clear that Espen Ore also objected to the idea: he would prefer that the next generation of software support an abstract markup into which one might translate TEI, or Cocoa markup, or MECS (the markup used by the Wittgenstein Archives at the University of Bergen), and which would thus be neutral among existing markup schemes. The TEI markup scheme was, however, designed for precisely this neutrality among pre-existing markup schemes, which makes me think such an approach is either redundant (if the TEI is felt to have succeeded adequately in the task) or very tricky (if the TEI's encoding-neutral notation doesn't do the job, how easy will it be to make a better one?).

After me, Robin Cover gave a bird's-eye view of the current state of SGML and related standards and their software; he views with some optimism the prospects for a successful marriage of SGML with object-oriented database management systems. He asked how far one might get, toward the kinds of tools we need, by using standard generic SGML software rather than TEI-specific programs; the answer, he suggested, was "surprisingly far -- farther than any application has ever gone yet -- and yet, not quite far enough." The key problem, he suggested, was the inability of current generic SGML systems to know very much about the semantics of the markup in a document. An SGML ID/IDREF link might in fact represent a hyperlink, which should be represented to a user as a hot button, but it might also represent something rather different (as does, for example, the TEI lang attribute). The absence of any formal semantics from SGML and SGML systems has some far-reaching consequences -- it may be, in fact, the main reason for desiring TEI-aware, not just SGML-aware, text analysis software.
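
An invented example may make the point about semantics concrete. In the XML analogue below (the element and attribute names are loosely modeled on, but not identical to, TEI usage), both target and lang point at other elements by identifier; a generic processor can resolve either pointer, but nothing in the syntax tells it that one should become a navigable hot button while the other merely records a property of the text.

    import xml.etree.ElementTree as ET

    DOC = """<text>
      <language id="grc">Classical Greek</language>
      <note id="n1">On the sources of this passage.</note>
      <p>See <ptr target="n1"/> on the <foreign lang="grc">arche</foreign> of the war.</p>
    </text>"""

    root = ET.fromstring(DOC)
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}

    # A generic processor can resolve both kinds of pointer in exactly the same way ...
    for element in root.iter():
        for attr in ("target", "lang"):
            if element.get(attr) in by_id:
                print(element.tag, attr, "->", by_id[element.get(attr)].tag)

    # ... but only knowledge of the markup's semantics tells it that <ptr target="..."/>
    # deserves a hot button while lang="..." calls for no navigation at all.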

Gary Simons then gave a presentation of Cellar (Computer Environment for Linguistic, Literary, and Anthropological Research), which has been developed at the Summer Institute of Linguistics (SIL) by a team under his direction. To the syntactic specifications of SGML, Cellar adds tools for conceptual modeling which make it significantly easier to build smart applications which `understand' the markup in a text. He showed examples of a Cellar application which understands TEI markup for textual variation and uses it to present the variants in several very different ways. He was characteristically soft-spoken about the program, but anyone who cares about text analysis should be very excited to learn that SIL plans to burn CD-ROMs for the Windows version of Cellar within the next few months. (The Mac version, he told me, has fallen prey -- temporarily, he hopes -- to a vendor's withdrawal of support for the Mac version of a library used by Cellar.) Cover's talk helped strengthen my growing conviction that the TEI might do well to attempt a formal treatment of the semantics of the TEI tag sets; Simons's description of Cellar made me wonder whether Cellar's conceptual-modeling language might be a useful vehicle for such a formal semantic specification.

John Bradley and Geoff Rockwell then gave a quick overview of their ideas about a visual interface for text analysis software, which now goes by the name of Eye-ConTact. Inspired by some visual-programming tools for scientific visualization, Eye-ConTact involves a sort of GUI tool for creating pipelines of filters (which, however, unlike Unix pipes, can have more than one input or output stream); to make this work, one needs (1) process modules to do things to the data, (2) framework modules to manage the user interface and (based on how the user connects pipeline stages) to manage the process modules, (3) data files and result files, and (4) `map' files, which record the combination of processes that produced a given result.
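
None of the following is Eye-ConTact code (I know its design only from the presentation), but a toy framework of the same general shape may make the four ingredients easier to picture: process modules are functions from named inputs to named outputs, a framework module runs them as the user has wired them together, and a `map' of which processes produced which results falls out as a by-product.

    def tokenize(inputs):
        return {"tokens": inputs["text"].split()}

    def frequencies(inputs):
        counts = {}
        for t in inputs["tokens"]:
            counts[t] = counts.get(t, 0) + 1
        return {"freq": counts}

    def run(pipeline, data):
        """pipeline: list of (name, module, input wiring, output wiring); wiring maps slots to data names."""
        map_file = []
        for name, module, in_wiring, out_wiring in pipeline:
            outputs = module({slot: data[key] for slot, key in in_wiring.items()})
            for slot, key in out_wiring.items():
                data[key] = outputs[slot]
            map_file.append((name, list(in_wiring.values()), list(out_wiring.values())))
        return data, map_file

    data, map_file = run(
        [("tok", tokenize, {"text": "hamlet.txt"}, {"tokens": "hamlet.tok"}),
         ("freq", frequencies, {"tokens": "hamlet.tok"}, {"freq": "hamlet.frq"})],
        {"hamlet.txt": "to be or not to be"})
    print(data["hamlet.frq"])
    print(map_file)        # which combination of processes produced each result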

I like the visual interface Bradley and Rockwell describe for controlling the modular pieces of their system, but I think their most important contribution was to make explicit a point which had arisen obliquely in several discussions already: the next-generation software system we are thinking about must not only comprise a number of independent, interoperating modules with consistent interfaces, it must also be open: it must allow single modules to be replaced by other modules possibly developed by different programmers; it must be possible to add new modules to the system, and to access data at any and every module boundary.

I think this is important because it matches the reality of the user requirements, and our funding situation. Research requirements cannot be mapped out exhaustively in advance, because research involves asking questions to which the answers are not yet known -- and from time to time asking questions which themselves were not foreseen when the research began. A research-oriented software system cannot, therefore, be exhaustively complete, except in the most narrow technical sense (if it has a programming language, it can be Turing complete): newly discovered interests may require new processing modules at any time. It is essential, therefore, that it be possible to add new modules to a system -- possible not just for the original developers, but for any and every user willing to take up a keyboard and write the new modules. Extensibility is an absolutely essential requirement for a really satisfactory system. And, in the current funding climate, it's easier to imagine finding funds to make a new module here and a new module there than funds to build a large new system from scratch.
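
In code, this kind of openness need not be elaborate; a minimal sketch (with invented names, standing for no particular system) is simply a registry in which the framework looks modules up by name and to which any user's code can add entries, shadowing or supplementing what was shipped.

    REGISTRY = {}

    def register(name):
        """Decorator: make a function available to the framework under the given name."""
        def add(func):
            REGISTRY[name] = func
            return func
        return add

    @register("word-count")
    def word_count(text):
        return len(text.split())

    # A user, not the original developers, supplies a module nobody foresaw:
    @register("hapax-list")
    def hapax_list(text):
        words = text.lower().split()
        return sorted(w for w in set(words) if words.count(w) == 1)

    for name in ("word-count", "hapax-list"):
        print(name, REGISTRY[name]("to be or not to be that is the question"))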

(It will be objected that writing new modules may require programming skill. Yes, it may. Some will ask, what about the average humanist who does not have programming skills? To which there is no answer but to ask, what about the average humanist whose research leads into an area which requires a good reading knowledge of Sanskrit?)

Tom Horton concluded the session with a talk about domain-specific analysis, a new trend in software engineering which promises, he thinks, substantial benefits for relatively circumscribed domains like scholarly text analysis. By focusing on specific domains, this approach makes it easier to create reusable resources: it focuses not just on reusability of code, but also on reusable requirements statements, user modeling, specifications, and the like. In an ideal world, these are implemented by cleanly defined libraries supporting well specified application-program interfaces, and it becomes easy to implement a wide variety of software by combining these library routines in various ways.

At this point, we broke into groups again, to discuss architectural issues, searching capabilities, and user needs. The architectural group began, plausibly enough, by deciding to decide what it might mean to specify an architecture for the kind of system we had been talking about: what needs to be specified, and at what level of detail? This is, surely, a necessary first step. It would be nice to be able to report that after it, we had taken another one, but after we had reached something resembling agreement, it was time for lunch. Our dedication to the cause fought with our desire to eat; struggled; wavered; lost. We went to lunch.

For what it's worth, before knocking off we agreed that a general document describing various components of the system would be needed (what do we mean by user interface? Where does it start and end? Ditto for MOdule-Management (MOM) modules, which direct the actions of the other modules and child processes. Ditto for the other parts of the large spaghetti-like diagram we drew on the flip chart.) About the inter-module communications, we agreed only that a specification of the architecture should describe how modules might send each other messages (so that one window, controlled by one module, could be updated in reaction to user actions taken in another window controlled by a separate module), and what kinds of primitive search or other operations might be undertaken. This last specification might take the form of a grammar for a query language, which could be interpreted either as the rules governing messages from a client to a query server, or as a concise notation with which a user interface describes, to a MOM module, what the user has just requested the MOM tell the other modules to do.

The query language (John Bradley resisted the term, on the grounds that it biases the mind toward a client/server approach, but we didn't find any other short name for the thing) seemed potentially very important, because a good non-proprietary query language could be used to allow a single client to deal with multiple search servers, each with its own query language. The client, or else a shim between the network and the search server, would translate queries from the common language into the server's own language, and package the results appropriately before returning them. Antonio Zampolli pointed out that such functionality was not a chimera; a European project will be starting up in the next few months to define a common query language for precisely that purpose: to allow single-client interrogation of multiple query systems. (Since the meeting, Robin Cover has drawn my attention to a project in Canada with a similar goal: the Canadian Strategic Software Consortium, with a WWW home page at http://www.cssc.ca/.)
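
To make the idea of the shim concrete, the sketch below translates one tiny query form in an imagined common language, word NEAR/n word, into two invented server-specific notations; the query languages shown are made up for the illustration and are not those of Pat, SARA, or the project Zampolli described.

    import re

    # One query form in the imagined common language:  word1 NEAR/n word2
    COMMON = re.compile(r"(\w+)\s+NEAR/(\d+)\s+(\w+)")

    def to_server_a(query):
        """Imagined server A wants:  near(w1, w2, n)"""
        m = COMMON.fullmatch(query)
        return "near(%s, %s, %s)" % (m.group(1), m.group(3), m.group(2))

    def to_server_b(query):
        """Imagined server B wants:  'w1' WITHIN n WORDS OF 'w2'"""
        m = COMMON.fullmatch(query)
        return "'%s' WITHIN %s WORDS OF '%s'" % (m.group(1), m.group(2), m.group(3))

    # The client speaks only the common language; a shim chooses the dialect.
    SHIMS = {"server-a": to_server_a, "server-b": to_server_b}

    for server, translate in sorted(SHIMS.items()):
        print(server, ":", translate("whale NEAR/5 white"))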

5 Is the Glass Half Full or Half Empty?

It was about this time that I decided I had come to the meeting with a fundamentally misconceived notion of what we should be aiming at, of what would constitute perfect success of such a planning meeting. While carefully keeping an open mind on the issue, I had imagined that what one ought to be hoping for was the emergence of a consensus among the participants that a new generation of text-analysis software is indeed needed, some shared ideas about what that new generation might look like, and an agreed-on plan for organizing systematic technical work to design and promulgate an architecture for interoperating modules, followed by systematic efforts to develop modules to fit into that overall architecture.

There certainly seems to be a consensus that a new generation of tools is needed -- though to be honest I am taking a short leap of faith here, since there was no opportunity to put this proposition to an up-or-down vote or discussion. There is also a very strong consensus among some participants about the overall framework within which the next generation should be developed: an open, modular system designed to work both in client/server and in stand-alone environments, within which institutions could deploy Pat or the other search engine of their choice, working on SGML-encoded text and exploiting SGML for other aspects of the system as well (e.g. for communication between client and server, for style sheets, and for work files and other inter-module communications), with modules developed in parallel at multiple sites and not centrally, based on bottom-up democratic development rather than top-down planning, and governed by the well-known motto "Rough consensus and running code." These principles seem so obvious to some potential collaborators that they go almost without saying. But not everyone agrees. Some participants clearly have very different views about how the software should be developed, or what sort of architecture should govern it. The divergence of views may reflect simple lack of communication or confusion about what is meant by various catch-phrases or shorthand references, but in part I think the differences of opinion are real. Both causes, no doubt, contributed to the difficulty experienced by the work groups in arriving at anything beyond generalities (an architectural specification must specify the interfaces between modules; modularity is good; motherhood and apple pie are admirable). It was regrettable (this is where the glass begins to look empty) that there was no good opportunity in the meeting to air these differences and try to iron them out.

I realized, about lunchtime on the second day, that I no longer felt a systematic top-down definition of architecture was realistic, or even necessarily desirable. If it delays experimentation with new modules, it is emphatically undesirable. What is needed is a commitment to cooperative work among developers in a chaotic environment of experimentation and communication. If we were building a closed, monolithic system, planning and prior agreement about everything would be as desirable as they always are in software engineering. But the one point on which everyone seems agreed is that we need an open, extensible system, to work with texts we have not read yet, on machines that have not been built yet, performing analyses we have not invented yet. This is not a system for which we can plan the details in advance; its architecture, if we insist on calling it that, will be an emergent property of its development, not an a priori specification. We are not building a building; blueprints will get us nowhere. We are trying to cultivate a coral reef; all we can do is try to give the polyps something to attach themselves to, and watch over their growth.

In practice, I think this means that what is needed is regular communication among developers writing software for textual analysis who are willing to make a shared commitment to cooperation, reuse and sharing of code, and interoperability among their programs. The goal should be to grow a coral reef of cooperating programs, not to attempt to decide in advance what scholars will need and how software should meet those needs. Improvisation and social pressure to Do the Right Thing are important, as are the programmer's cardinal virtues of laziness, impatience, and hubris (which can, properly channeled and supported by communication, lead to effective reuse and improvement of modules). Not all developers will be willing or able to do this, though I think enough are to make it worth while. Any funding agency willing to fund a small implementors' round-table meeting once every six to nine months would be performing a massive service for the humanities. Even a concerted effort to schedule panels and exchange experiences at the annual conferences of ACH and ALLC might bring useful results.

This train of thought came to me late enough in the meeting that I was not able to discuss it at any length with other participants; these remarks should not be taken as representing the views of anyone else at the meeting, let alone a consensus of the participants. But if the point of the meeting was to consider how to go about creating the next generation of text analysis tools, then for me at least it was a rousing success, because now I have a clear notion of how I think we should proceed, where before I had only uncertain, conflicting ideas. The path is not the one I expected. But I now feel confident that there is a path.

Is the glass half full or half empty? Well, the disappointing news is that the glass I had in mind when I left for this meeting is neither: it has a hole in the bottom. The good news is, there is another glass, and it's half full now. And if we can get our hands on a pitcher, we can fill it up to overflowing. As I sit here in the hot, airless departure lounge of Newark Airport, I am very thirsty. An overflowing glass would be welcome indeed.

For the promise of that overflowing glass, everyone interested in humanities computing owes a debt to Susan Hockey and to CETH for organizing and supporting this meeting.