Trip Report:
American Society for Information Science
C. M. Sperberg-McQueen
25 October 1993
The annual ASIS meeting runs from last Saturday (23 October, a
pre-conference workshop on "Crossing the Internet Threshold") through
Thursday of this week, in Columbus, Ohio, but I only attended one day
(25 October), in order to get back to work on TEI P2 (so this report
will be brief).
Attempting to be a good soldier, I attended a continental breakfast for
newcomers to the ASIS conference on Monday morning; arriving a little
late, I found ample fruit and croissants, but no coffee. (I am beginning
to think I have somehow incurred the wrath of the world's coffee gods,
and am fated, until just retribution is exacted, to see the coffee run
out just before I reach the head of the line. If anyone knows what I
may have done, please let me know.) The breakfast was marked by an
earnest friendliness on the part of the organizers which reminded me
a little uneasily of Rotary Club meetings, but another attendee, who is
working on the 'packaging and distribution' of environmental and
econometric data for a consortium of (I gathered) quasi-governmental
organizations, did tell me that she had heard of the TEI Header and
thought it might have some relevance to her work. This almost made up
for the coffee. I'm not sure I saw her, however, at the TEI session.
A professor at the University of North Carolina Graduate School in
Library and Information Science, Stephanie Haas, had organized a session
on SGML and the TEI, which took place from 8:30 to 10:00. I gave a
rapid introduction to SGML and its notation; Susan Hockey gave an
introduction to the TEI, its goals, and its organization; and I outlined
the contents of TEI P2 and showed a couple of simple examples.
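For readers who have not seen TEI markup, the flavor of those simple
examples can be suggested by a sketch like the following (reconstructed
from memory and simplified, not a copy of the actual slide); the point
is that each element is delimited by a start-tag and an end-tag, that
attributes on the start-tag qualify the element, and that the tags
record the structure of the text rather than its appearance on the page:

    <div1 type="chapter" n="1">
      <head>Loomings</head>
      <p>Call me Ishmael.  Some years ago -- never mind how long
      precisely -- having little or no money in my purse, and
      nothing particular to interest me on shore ... </p>
    </div1>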
Questions from the audience of between 80 and 120 included:
-
You mentioned the possibility of exchanging TEI Headers between
sites as a means of providing holdings information and a substitute
for a catalog. What relationship do you foresee between such
headers and MARC records in the local catalog?
-
I am concerned that the methods of analysis and annotation seem
so oriented toward things like morphological analysis. In my
experience on Esprit and other projects, I have come to believe that
semantic annotation is the crucial task; can your tags for
annotation handle things like case-grammar analysis?
-
Who is going to do all this tagging?
-
The problem with SGML is that until DSSSL is completed, you cannot
describe the physical appearance of the page. If you lose the
format information, however, you lose the archival resilience
of the material; how do you address that problem?
-
(from someone whose name tag said he was from the International
Atomic Energy Agency) As a database provider whose data are
now in MARC format, who is looking toward the future, I am
interested in considering SGML. But, although you did not mention
it, CCF is also a strong candidate as a release format for our data.
What are the relative strengths of using a specialized bibliographic
format like CCF, compared with a general-purpose format like SGML?
Both SH and I were impressed by the quality of the questions, and by
how many of them were new to us.
After the session and quiet discussion with numerous auditors, SH and I
had coffee with Annalies Hoogcarspel of CETH, Elaine Brennan of the
Brown Women Writers Project, Daniel Pitti of Berkeley (who is running a
Finding Aids project and wanted to know whether CONCUR would solve the
problem of encoding finding aids; we decided CONCUR could handle finding
aids, but would not really be necessary), and three people from the
Getty Art History Information Program (Deborah Wilde, Jennifer Trant --
an external consultant working for AHIP -- and a fellow whose name I
lost).
After clearing up the problem of CONCUR for Daniel Pitti, the discussion
moved to how SGML could be applied to the problems of cataloguing or
describing art works and their related materials. JT, in particular,
had been struck by the possibility that an art work -- e.g. a print, or
a concrete poem -- could be encoded directly as a text in TEI form, in
which case the TEI header would need to contain a description of the
work. She was concerned, however, about the problem of establishing and
documenting relationships between the work itself, preparatory materials
(e.g. a script for a happening), artifactual traces of the work (e.g.
objects used during the happening), textual or other surrogates for the
work (e.g. a description of the happening written during its
performance, or a videotape of a performance piece), and curatorial
descriptions of a work or of any of its associated materials. I
encouraged them to consider the model of HyTime, with the hub document
providing links among an arbitrary collection of other objects and their
surrogates. [They should have no trouble with this, since David Durand
is working on DTD development for the project.] Since curatorial
description, at least, is often a well structured activity, the Getty
people were also interested in defining a canonical form for such
descriptions; since the form varies, however, with the nature of the art
work, they were also interested in techniques like the pizza model or
like the parallel options for bibliographic citations and for dictionary
entries, which allow an encoding scheme to capture the regularity of the
majority of instances of a form, while still accommodating the outliers
and eccentrics. On the whole, the Getty people were clearly much more
open to discussion with others and to learning from other projects than
their reputation had led me to expect; they themselves remarked that the
atmosphere at the institution had changed and that it was now more open
to the outside than before.
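To make the hub-document suggestion concrete: in the spirit of a HyTime
independent link, a hub for a happening might do little more than name
the associated objects and assert a typed link among them, roughly as
below. The element and attribute names (nameloc, nmlist, ilink,
anchrole, linkends) are cited from memory and the entity names are
invented, so the fragment should be read as a sketch rather than as
verified HyTime:

    <!-- the script, the videotape, and the curatorial description are
         stored elsewhere as separate entities; the hub only points at
         them and records the role each plays with respect to the work -->
    <nameloc id="script"><nmlist nametype="entity">hap1.script</nmlist></nameloc>
    <nameloc id="video"><nmlist nametype="entity">hap1.video</nmlist></nameloc>
    <nameloc id="curdesc"><nmlist nametype="entity">hap1.descr</nmlist></nameloc>
    <ilink id="hap1" anchrole="preparatory surrogate description"
           linkends="script video curdesc">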
After lunch, I (we) attended the meeting of the National Information
Standards Organization (NISO), to hear Eric van Herwijnen speak on
"Standards as Tools for Integrating Information and Technology: the
Impact and Implications of SGML." He began with a brief demo of the
DynaText version of his book Practical SGML, and then spoke about
electronic delivery of information as it is currently practiced (at
CERN, for example, *all* publications are now prepared in electronic
form, 95% of them in TeX, and most authors desire to exert full control
over details of page layout), and how it might develop now that storage
is cheap (disk space now costs less than $1 per Mb, if purchased at 1 Gb
quantities) and "network speed is no longer a problem." He described
how the development of preprint distribution bulletin boards at Los
Alamos and SLAC -- which now contain 60% of the new articles on physical
topics, at the rate of 6000 to 7000 articles per year -- have
democratized access to preprints. Internet discovery tools like WAIS,
Gopher, and WWW (or V-6 as he ironically dubbed it) provide transparent
access to documents across the network without requiring the user to
know their actual addresses; this allows documents to be left at their
point of origin instead of being mirrored across the network, which
causes synchronization problems. In the future, he said, we will all
have the entirety of the world's documents on our desk top. This will
lead to a further worsening of the information glut already induced by
publish-or-perish promotion rules and the geometric growth of
publication in modern times (the number of articles published in
humanistic and scientific journals in 1992, he said, topped a million).
Information glut, in turn, leads to the need for intelligent information
retrieval. Within a discipline, however, information retrieval needs
only very minimal intelligence: 90% of searches in the physics
databases are done not using the elaborate subject indices and relevance
measures on which we spend so much work, but on authors' names! Those
working in a field know who else is working there, and know which of
their colleagues produce work worth reading. For this reason, he
predicted that in the long run peer reviewed journals would decline,
since the young Turks who actually do the work of physics do not need
the sanction of their established elders who actually do the work of
reviewing (rather than the work of physics) to recognize and reward good
work and bad. For work within a discipline, that is, the current system
of preprint databases with primitive information retrieval techniques
might well suffice. It is for work between fields and on the borders of
established areas that really intelligent systems are required; by
'really intelligent' systems, he explained, he meant systems which can
answer questions without being asked, which know the environment of the
problem area and can recognize important information in adjacent
fields, which know what the user really needs, which answer the
questions the users *should* have asked, which open a dialog with the
user to get information they need (instead of spacing out for an
indefinite period to look for the answer on the net, without giving the
user the ability to interrupt them and bring them back), which have
access to more information than the user even knows exists, which can
affect the result of a query instead of simply deriving it, which can
see the logical connections between groups of disparate facts and which
can thus have an opinion of their own. The prerequisite for building
such intelligent systems is to be able to describe the semantics of
documents, and to perform rhetorical analysis to enable the system to
determine what is actually useful information. Standards are critical
for this development: standards for the interchange of information
among documents and databases, for the expression of links between
documents and databases (HyTime), and for the encoding of documents
(TEI, ISO 12083). In conclusion, he argued, structured documents are
important, important enough to force us to rethink the way we structure
documents and to argue in favor of teaching SGML to school children.
Information retrieval needs to focus on interdisciplinary research.
And, owing to market forces, whatever happens can be reliably expected
to be both easy and fun.
The final session of the afternoon was organized by Annalies Hoogcarspel
on the topic "Electronic Texts in the Humanities: Markup and Access
Techniques." Since Elli Mylonas had been unable to come to Columbus, AH
had asked me to substitute for her and speak for a few minutes on SGML
markup for hypertext, using the Comenius example familiar from
Georgetown (and elsewhere). Following my remarks, Elaine Brennan gave
an overview of the Women Writers Project: the initial naivete of the
project's plan (the original budget did not even foresee any need for a
programmer), its discovery of SGML, and the problems posed by the
paucity of SGML software directly usable in the project for document
management, for printing, or for delivery to end users. Michael Neuman
of Georgetown described the Peirce Telecommunity Project and the
problems of dealing with the complex posthumous manuscripts of C. S.
Peirce. He defined four levels of tagging for the project: a basic
level, capturing the major structures and the pagination; a literal
level, capturing interlinear and marginal material, insertions, and
deletions; an editorial level including emendations and annotations; and
an interpretive level including analyses and commentary. He showed
sample encodings, including an attempt at an SGML encoding (mercifully,
in type too small for most of the audience to read in detail).
Conceptually, I was pleased to note, there was nothing in the running
transcription of the text that is not handled by the current draft of
chapter PH, though the problem of documenting Peirce's eccentric
compositional techniques remains beyond the scope of P2 (a rough sketch
of the sort of markup involved appears below, following Susan Hockey's
summary points). Susan Hockey
acted as respondent and did an excellent job of pulling the session
together with a list of salient points, which my notes show thus:
-
SGML is clearly the key to making texts accessible in useful ways
-
The texts studied by humanists can be extremely complex, can
take almost any form, and can deal with almost any topic
-
The structure of these texts is extremely various and complex;
overlapping hierarchies are prominent in many examples, as are
texts in multiple versions
-
All markup implies some interpretation; it is essential, therefore,
to be able to document what has been marked and why
-
The TEI provides useful methods of saying who marked what, where,
and why
-
The reuse of data (and hence SGML) is important for the longevity
of our electronic texts
-
Ancillary information (of the sort recorded in the TEI header) is
critical
-
Encoding can be performed at various levels, in an incremental
process: it need not all be done at once.
-
We need software to help scholars do what they need to do with
these texts: the development of this software must similarly be
an incremental, iterative process
-
All of this work is directly relevant to the development of
digital libraries: the provision of images is good, but the
fact is that transcripts, with markup, must also be provided in
order to make texts really useful for scholarly work.
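To return to the Peirce transcriptions mentioned above: the chapter PH
mechanisms in question are the tags for recording additions, deletions,
and similar features of the source. A rough sketch of how an overstruck
deletion and an interlinear insertion might be transcribed (the wording
and attribute values are invented for illustration and simplified; this
is not MN's encoding):

    <p>The reader will <del rend="overstrike">perhaps</del> see
    at once that the division of signs proposed here is
    <add place="supralinear">in every case</add> threefold.</p>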
In the question session, Mike Lesk asked Elaine Brennan about the
relative cost of SGML markup vis-à-vis basic keystroking without SGML.
She replied that no clear distinction could be made, for the WWP, since
markup is inserted at the time of data capture, as well as at
proofreading time and later. Someone asked her whether the WWP had ever
thought of extending their terminus ad quem in order to include more
modern material, like Sylvia Plath, and what copyright issues might be
involved. EB replied that the WWP had a fairly full plate already, with
the four to five thousand texts written by women from 1330 to 1830,
without asking for more. SH noted that copyright issues had been a
thorn in the side of all work with electronic texts for thirty years or
more, and that they needed to be tackled if we were ever to get beyond
the frequent practice of using any edition, no matter how bad, as long
as it is out of copyright, instead of being able to use the best
edition, even if it is in copyright. Michael Neuman was asked how, if
the Peirce Project actually invited the help of the community at large
in solving some of the puzzles of Peirce's work, it could avoid massive
quality control and post-editing problems. MN hedged his answer
manfully, conceding that quality control would be a serious issue and
hoping at the same time that public participation in the markup of
Peirce would be a useful, productive activity, and managing to elude
completely the tricky question of how those two principles could be
reconciled.
This was the last session of the day, and for me the last of the
conference; SGML is somewhat less prominent in the program after the
first day, but I think we should regard its visibility on Day 1 as
a good sign and a development to be encouraged.
C. M. Sperberg-McQueen
Chicago
26 Oct 1993