Trip Report:
American Society for Information Science
C. M. Sperberg-McQueen
25 October 1993
The annual ASIS meeting runs from last Saturday (23 October, a
pre-conference workshop on "Crossing the Internet Threshold") through
Thursday of this week, in Columbus, Ohio, but I only attended one day
(25 October), in order to get back to work on TEI P2 (so this report
will be brief).
Attempting to be a good soldier, I attended a continental breakfast for
newcomers to the ASIS conference on Monday morning; arriving a little
late, I found ample fruit and croissants, but no coffee. (I am beginning
to think I have somehow incurred the wrath of the world's coffee gods,
and am fated, until just retribution is exacted, to see the coffee run
out just before I reach the head of the line. If anyone knows what I
may have done, please let me know.) The breakfast was marked by an
earnest friendliness on the part of the organizers which reminded me
a little uneasily of Rotary Club meetings, but another attendee, who is
working on the 'packaging and distribution' of environmental and
econometric data for a consortium of (I gathered) quasi-governmental
organizations, did tell me that she had heard of the TEI Header and
thought it might have some relevance to her work. This almost made up
for the coffee. I'm not sure I saw her, however, at the TEI session.
A professor at the University of North Carolina Graduate School in
Library and Information Science, Stephanie Haas, had organized a session
on SGML and the TEI, which took place from 8:30 to 10:00. I gave a
rapid introduction to SGML and its notation; Susan Hockey gave an
introduction to the TEI, its goals, and its organization; and I outlined
the contents of TEI P2 and showed a couple of simple examples.
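For readers who have not seen TEI markup, the flavor of those simple
examples can be suggested by a sketch like the following (reconstructed
from memory and simplified, not a copy of the actual slide); the point
is that each element is delimited by a start-tag and an end-tag, that
attributes on the start-tag qualify the element, and that the tags
record the structure of the text rather than its appearance on the page:

    <div1 type="chapter" n="1">
      <head>Loomings</head>
      <p>Call me Ishmael.  Some years ago -- never mind how long
      precisely -- having little or no money in my purse, and
      nothing particular to interest me on shore ... </p>
    </div1>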
Questions from the audience of between 80 and 120 included:
-
You mentioned the possibility of exchanging TEI Headers between
sites as a means of providing holdings information and a substitute
for a catalog. What relationship do you foresee between such
headers and MARC records in the local catalog?
-
I am concerned that the methods of analysis and annotation seem
so oriented toward things like morphological analysis. In my
experience on Esprit and other projects, I have come to believe that
semantic annotation is the crucial task; can your tags for
annotation handle things like case-grammar analysis?
-
Who is going to do all this tagging?
-
The problem with SGML is that until DSSSL is completed, you cannot
describe the physical appearance of the page. If you lose the
format information, however, you lose the archival resilience
of the material; how do you address that problem?
-
(from someone whose name tag said he was from the International
Atomic Energy Agency) As a database provider whose data are
now in MARC format, who is looking toward the future, I am
interested in considering SGML. But, although you did not mention
it, CCF is also a strong candidate as a release format for our data.
What are the relative strengths of using a specialized bibliographic
format like CCF, compared with a general-purpose format like SGML?
Both SH and I were impressed by the quality of the questions, and by
how many of them were new to us.
After the session and quiet discussion with numerous auditors, SH and I
had coffee with Annalies Hoogcarspel of CETH, Elaine Brennan of the
Brown Women Writers Project, Daniel Pitti of Berkeley (who is running a
Finding Aids project and wanted to know whether CONCUR would solve the
problem of encoding finding aids; we decided CONCUR could handle finding
aids, but would not really be necessary), and three people from the
Getty Art History Information Program (Deborah Wilde, Jennifer Trant --
an external consultant working for AHIP -- and a fellow whose name I
lost).
After clearing up the problem of CONCUR for Daniel Pitti, the discussion
moved to how SGML could be applied to the problems of cataloguing or
describing art works and their related materials. JT, in particular,
had been struck by the possibility that an art work -- e.g. a print, or
a concrete poem -- could be encoded directly as a text in TEI form, in
which case the TEI header would need to contain a description of the
work. She was concerned, however, about the problem of establishing and
documenting relationships between the work itself, preparatory materials
(e.g. a script for a happening), artifactual traces of the work (e.g.
objects used during the happening), textual or other surrogates for the
work (e.g. a description of the happening written during its
performance, or a videotape of a performance piece), and curatorial
descriptions of a work or of any of its associated materials. I
encouraged them to consider the model of HyTime, with the hub document
providing links among an arbitrary collection of other objects and their
surrogates. [They should have no trouble with this, since David Durand
is working on DTD development for the project.] Since curatorial
description, at least, is often a well structured activity, the Getty
people were also interested in defining a canonical form for such
descriptions; since the form varies, however, with the nature of the art
work, they were also interested in techniques like the pizza model or
like the parallel options for bibliographic citations and for dictionary
entries, which allow an encoding scheme to capture the regularity of the
majority of instances of a form, while still accommodating the outliers
and eccentrics. On the whole, the Getty people were clearly much more
open to discussion with others and to learning from other projects than
their reputation had led me to expect; they themselves remarked that the
atmosphere at the institution had changed and that it was now more open
to the outside than before.
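To make the hub-document suggestion concrete: in the spirit of a HyTime
independent link, a hub for a happening might do little more than name
the associated objects and assert a typed link among them, roughly as
below. The element and attribute names (nameloc, nmlist, ilink,
anchrole, linkends) are cited from memory and the entity names are
invented, so the fragment should be read as a sketch rather than as
verified HyTime:

    <!-- the script, the videotape, and the curatorial description are
         stored elsewhere as separate entities; the hub only points at
         them and records the role each plays with respect to the work -->
    <nameloc id="script"><nmlist nametype="entity">hap1.script</nmlist></nameloc>
    <nameloc id="video"><nmlist nametype="entity">hap1.video</nmlist></nameloc>
    <nameloc id="curdesc"><nmlist nametype="entity">hap1.descr</nmlist></nameloc>
    <ilink id="hap1" anchrole="preparatory surrogate description"
           linkends="script video curdesc">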
After lunch, I (we) attended the meeting of the National Information
Standards Organization (NISO), to hear Eric van Herwijnen speak on
"Standards as Tools for Integrating Information and Technology: the
Impact and Implications of SGML." He began with a brief demo of the
DynaText version of his book Practical SGML, and then spoke about
electronic delivery of information as it is currently practiced (at
CERN, for example, *all* publications are now prepared in electronic
form, 95% of them in TeX, and most authors desire to exert full control
over details of page layout), and how it might develop now that storage
is cheap (disk space now costs less than $1 per Mb, if purchased at 1 Gb
quantities) and "network speed is no longer a problem." He described
how the development of preprint distribution bulletin boards at Los
Alamos and SLAC -- which now contain 60% of the new articles on physical
topics, at the rate of 6000 to 7000 articles per year -- have
democratized access to preprints. Internet discovery tools like WAIS,
Gopher, and WWW (or V-6 as he ironically dubbed it) provide transparent
access to documents across the network without requiring the user to
know their actual addresses; this allows documents to be left at their
point of origin instead of being mirrored across the network, which
causes synchronization problems. In the future, he said, we will all
have the entirety of the world's documents on our desk top. This will
lead to a further worsening of the information glut already induced by
publish-or-perish promotion rules and the geometric growth of
publication in modern times (the number of articles published in
humanistic and scientific journals in 1992, he said, topped a million).
Information glut, in turn, leads to the need for intelligent information
retrieval. Within a discipline, however, information retrieval needs
only very minimal intelligence: 90% of searches in the physics
databases are done not using the elaborate subject indices and relevance
measures on which we spend so much work, but on authors' names! Those
working in a field know who else is working there, and know which of
their colleagues produce work worth reading. For this reason, he
predicted that in the long run peer reviewed journals would decline,
since the young Turks who actually do the work of physics do not need
the sanction of their established elders who actually do the work of
reviewing (rather than the work of physics) to recognize and reward good
work and bad. For work within a discipline, that is, the current system
of preprint databases with primitive information retrieval techniques
might well suffice. It is for work between fields and on the borders of
established areas that really intelligent systems are required; by
'really intelligent' systems, he explained, he meant systems which can
answer questions without being asked, which know the environment of the
problem area and can recognize important information in adjacent
fields, which know what the user really needs, which answer the
questions the users *should* have asked, which open a dialog with the
user to get information they need (instead of spacing out for an
indefinite period to look for the answer on the net, without giving the
user the ability to interrupt them and bring them back), which have
access to more information than the user even knows exists, which can
affect the result of a query instead of simply deriving it, which can
see the logical connections between groups of disparate facts and which
can thus have an opinion of their own. The prerequisite for building
such intelligent systems is to be able to describe the semantics of
documents, and to perform rhetorical analysis to enable the system to
determine what is actually useful information. Standards are critical
for this development: standards for the interchange of information
among documents and databases, for the expression of links between
documents and databases (HyTime), and for the encoding of documents
(TEI, ISO 12083). In conclusion, he argued, structured documents are
important, important enough to force us to rethink the way we structure
documents and to argue in favor of teaching SGML to school children.
Information retrieval needs to focus on interdisciplinary research.
And, owing to market forces, whatever happens can be reliably expected
to be both easy and fun.
The final session of the afternoon was organized by Annalies Hoogcarspel
on the topic "Electronic Texts in the Humanities: Markup and Access
Techniques." Since Elli Mylonas had been unable to come to Columbus, AH
had asked me to substitute for her and speak for a few minutes on SGML
markup for hypertext, using the Comenius example familiar from
Georgetown (and elsewhere). Following my remarks, Elaine Brennan gave
an overview of the Women Writers Project: the initial naivete of the
project's plan (the original budget did not even foresee any need for a
programmer), its discovery of SGML, and the problems posed by the
paucity of SGML software directly usable in the project for document
management, for printing, or for delivery to end users. Michael Neuman
of Georgetown described the Peirce Telecommunity Project and the
problems of dealing with the complex posthumous manuscripts of C. S.
Peirce. He defined four levels of tagging for the project: a basic
level, capturing the major structures and the pagination; a literal
level, capturing interlinear and marginal material, insertions, and
deletions; an editorial level including emendations and annotations; and
an interpretive level including analyses and commentary. He showed
sample encodings, including an attempt at an SGML encoding (mercifully,
in type too small for most of the audience to read in detail).
Conceptually, I was pleased to note, there was nothing in the running
transcription of the text that is not handled by the current draft of
chapter PH, though the problem of documenting Peirce's eccentric
compositional techniques remains beyond the scope of P2 (a rough sketch
of the sort of markup involved appears below, following Susan Hockey's
summary points). Susan Hockey
acted as respondent and did an excellent job of pulling the session
together with a list of salient points, which my notes show thus:
-
SGML is clearly the key to making texts accessible in useful ways
-
The texts studied by humanists can be extremely complex, can
take almost any form, and can deal with almost any topic
-
The structure of these texts is extremely various and complex;
overlapping hierarchies are prominent in many examples, as are
texts in multiple versions
-
All markup implies some interpretation; it is essential, therefore,
to be able to document what has been marked and why
-
The TEI provides useful methods of saying who marked what, where,
and why
-
The reuse of data (and hence SGML) is important for the longevity
of our electronic texts
-
Ancillary information (of the sort recorded in the TEI header) is
critical
-
Encoding can be performed at various levels, in an incremental
process: it need not all be done at once.
-
We need software to help scholars do what they need to do with
these texts: the development of this software must similarly be
an incremental, iterative process
-
All of this work is directly relevant to the development of
digital libraries: the provision of images is good, but the
fact is that transcripts, with markup, must also be provided in
order to make texts really useful for scholarly work.
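To return to the Peirce transcriptions mentioned above: the chapter PH
mechanisms in question are the tags for recording additions, deletions,
and similar features of the source. A rough sketch of how an overstruck
deletion and an interlinear insertion might be transcribed (the wording
and attribute values are invented for illustration and simplified; this
is not MN's encoding):

    <p>The reader will <del rend="overstrike">perhaps</del> see
    at once that the division of signs proposed here is
    <add place="supralinear">in every case</add> threefold.</p>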
In the question session, Mike Lesk asked Elaine Brennan about the
relative cost of SGML markup vis-à-vis basic keystroking without SGML.
She replied that no clear distinction could be made, for the WWP, since
markup is inserted at the time of data capture, as well as at
proofreading time and later. Someone asked her whether the WWP had ever
thought of extending their terminus ad quem in order to include more
modern material, like Sylvia Plath, and what copyright issues might be
involved. EB replied that the WWP had a fairly full plate already, with
the four to five thousand texts written by women from 1330 to 1830,
without asking for more. SH noted that copyright issues had been a
thorn in the side of all work with electronic texts for thirty years or
more, and that they needed to be tackled if we were ever to get beyond
the frequent practice of using any edition, no matter how bad, as long
as it is out of copyright, instead of being able to use the best
edition, even if it is in copyright. Michael Neuman was asked how, if
the Peirce Project actually invited the help of the community at large
in solving some of the puzzles of Peirce's work, it could avoid massive
quality control and post-editing problems. MN hedged his answer
manfully, conceding that quality control would be a serious issue and
hoping at the same time that public participation in the markup of
Peirce would be a useful, productive activity, and managing to elude
completely the tricky question of how those two principles could be
reconciled.
This was the last session of the day, and for me the last of the
conference; SGML is somewhat less prominent in the program after the
first day, but I think we should regard its visibility on Day 1 as
a good sign and a development to be encouraged.
C. M. Sperberg-McQueen
Chicago
26 Oct 1993