1 Introduction · Einleitung
It is an honor to be invited to address you
this evening on the occasion of the 25th anniversary of the
International TUSTEP User Group, in the year that
Prof. Wilhelm Ott turns 80. My congratulations and best
wishes all round.
Es ist eine Ehre, eingeladen zu werden, Ihnen an diesem
Abend anläßlich des 25. Jahrestages der Internationalen
Tustep User-Gruppe vorzutragen, in dem Jahr, in dem
Prof. Wilhelm Ott seinen 80. Geburtstag feiert. Ich
gratuliere.
I should confess at the outset that I am not a proficient
user of TUSTEP. Improving my TUSTEP skills has perpetually
been on my to-do list, and I regret that so far in my career
other obligations have prevented me from spending as much
time on that task as I would wish. But I have known about
TUSTEP almost since the time I began working seriously in
what we now call the digital humanities, and I have been
deeply impressed with what skilled users can accomplish with
this tool. And the knowledge I have of TUSTEP, incomplete
though it is, has influenced me over the years whenever I
have had the occasion to work on infrastructure that could
benefit digital humanists and other scholars who work with
electronic text.
Ich sollte gleich am Anfang bekennen, dass ich kein
TUSTEP-Experte bin. Meine TUSTEP-Kenntnisse zu verbessern
steht seit Jahren auf der privaten Aufgabenliste, und ich
bedaure, dass bis jetzt andere Obliegenheiten es unmöglich
gemacht haben, dazu zu kommen. Von Tustep habe ich schon
früh in meiner Arbeit in den Digital Humanities gelernt, und
das, was Tustep-Könner mit diesem Werkzeug erreichen können,
hat mir immer tief imponiert. Auch die partielle Kenntnis
von Tustep hat meine Arbeit an Infrastruktur für die Digital
Humanities und andere wissenschaftliche Gebiete
beeinflusst.
Part of what TUSTEP seems to me to illustrate is what can
be achieved if in writing software we make it our first
task to meet the requirements of serious textual
scholarship. Software for scholarly text processing has
an interesting and intimate relation to our understanding
of texts as linguistic, historical, and cultural objects:
to support textual work effectively, software must embody
or reflect a serious understanding of text, and conversely
thinking about how our software works and how it models
textual structures can help us think about texts and
textuality more effectively and can help lead us to a
deeper understanding of text.
Unter anderem zeigt Tustep (wie mir scheint), was man
erreichen kann, wenn man bei der Herstellung von Software
zuallererst versucht, den Anforderungen des
wissenschaftlichen Umgangs mit Text gerecht zu
werden. Software für wissenschaftliche Textverarbeitung
hat eine interessante und intime Beziehung zu unserem
Verständnis von Texten als linguistische, historische und
kulturelle Objekte: Um Textarbeit effektiv zu
unterstützen, muss der Software ein wissenschaftliches
Verständnis von Text zugrunde liegen. Umgekehrt kann das
Nachdenken darüber, wie unsere Software funktioniert und
wie sie textuelle Strukturen modelliert, uns zu einem
tieferen Verständnis von Text verhelfen.
That is my topic for this lecture: Text, the shape of
text, and the shaping of text.
Daraus ergibt sich mein heutiges Thema: der
Text, die Textgestalt, und die Textgestaltung.
In this day of multimedia, of high-resolution full color
scanning, of automatic image recognition, why worry
about text?
Warum heute, in dieser Zeit des Multimedialen, des
Vollfarb-Scannens in höchster Auflösung, der automatischen
Bilderkennung, warum heute uns noch mit Text beschäftigen?
The computer scientist Tim Bray, with whom I had the
pleasure of collaborating a few years ago, once had a
start-up company with the motto
“Knowledge is a text-based application.”
Der Informatiker Tim Bray, mit dem ich vor einigen
Jahren eng zusammenarbeitete, hatte einmal eine
Start-up-Firma mit dem Motto
“Das Wissen ist eine textbasierte Anwendung.”
When Alan Turing set himself the question “at what point
will it be fair to say that computers can think?” his
answer (known now as the Turing Test) involved the
exchange of written messages — that is, texts
— between the computer and a human being. Turing
has been criticized for equating thinking with mere verbal
facility — what about visual thinking and other
non-verbal forms of intelligence? — but it may be
noted that he did not in fact suppose that all thought is
verbal. He suggested only that if a machine exhibits
sufficient verbal facility, then it will be fair to call
what the machine is doing a kind of thinking. (After all,
our libraries are full of works by people we call the
thinkers of the past, and the main evidence we have of
their thinking is the texts they left behind.)
Als sich Alan Turing die Frage stellte "Wann wird es fair
sein zu sagen, dass Computer denken können?", war seine
Antwort das, was uns jetzt als der Turing-Test bekannt
ist. Dieser Test beruht auf dem Austausch von
geschriebenen Nachrichten - also Texten - zwischen dem
Computer und einem Menschen. Turing wurde dafür
kritisiert, das Denken mit bloßer verbaler Fertigkeit
gleichzusetzen - es gibt ja das visuelle Denken und andere
nonverbale Formen der Intelligenz! - aber Turing hat
doch nicht angenommen, dass alle Gedanken verbal sind. Er
schlug nur vor, dass es fair sei, wenn eine Maschine eine
ausreichende verbale Fähigkeit aufweist, das von der
Maschine Geleistete "Denken" zu nennen. (Schließlich sind
unsere Bibliotheken voller Werke von Menschen, die wir die
Denker der Vergangenheit nennen, und der Hauptbeweis, dass
sie gedacht haben, sind die Texte, die sie hinterlassen
haben.)
I do not say that thinking is the same
thing as the production or consumption of texts — thinking
can also take other forms and use other tools. But it does
seem clear that producing texts and consuming them are two
activities very prominent in the way most of us go about
thinking. It may be possible to think without creating a
text, but it is difficult to imagine producing a text
without something that we would have to call thought.
Ich sage nicht, dass das Denken dasselbe ist wie die
Produktion oder der Konsum von Texten - das Denken kann
auch andere Formen annehmen und andere Werkzeuge
verwenden. Aber es scheint klar zu sein, dass das
Hervorbringen von Texten und das Konsumieren von Texten
zwei Tätigkeiten sind, die für die meisten von uns sehr
wichtig sind. Es mag möglich sein, zu denken, ohne einen
Text zu erstellen, aber es ist schwer vorstellbar, einen
Text zu produzieren, ohne dass wir etwas vollbringen, das
wir als Denken bezeichnen müssten.
(Even texts which we hyperbolically call mindless or
thoughtless — whether they are advertising material,
bureaucratic prose, or the writings of our political
adversaries — involve mental processes which as far as we
know exceed the capacity of any non-human animal and of
any existing computational device.)
(Auch Texte, die wir hyperbolisch "hirnlos" oder
"gedankenlos" nennen - ob es sich um Werbematerial,
bürokratische Prosa oder die Schriften unserer politischen
Gegner handelt - erfordern mentale Prozesse, die, soweit wir
wissen, die Fähigkeiten jedes nichtmenschlichen Tieres und
jedes existierenden Computergeräts übersteigen.)
Texts play a central role in the practice
of all scientific and scholarly disciplines (not only
literature and linguistics) and in the study of their
history.
Texte spielen eine zentrale Rolle in der Praxis aller
wissenschaftlichen Disziplinen (nicht nur der Literatur
und Linguistik) und in der Erforschung ihrer Geschichte.
So today I would like to talk a little bit about text,
about the nature of text and how we shape text. It is, I
hope, a good way to honor both TUSTEP, one of the most
sophisticated and complex tools we have for working with
text, and the man to whom we owe its existence.
Deshalb möchte ich heute ein wenig über Text sprechen:
über die Natur von Text und darüber, wie wir Text
gestalten. Ich hoffe, es ist eine geeignete Art, TUSTEP,
eines der raffiniertesten und komplexesten Werkzeuge, die
wir für die Arbeit mit Texten haben, und den Mann, dem wir
seine Existenz verdanken, zu würdigen.
[Need more / better motivation: what problems does a
model of text solve? what is the harm if a model is bad?
]
4 The shape of text in our heads · Die Textgestalt innerhalb des Kopfs
Texts are many things — cultural objects, literary
objects, utilitarian objects — but perhaps above all
they are linguistic objects.
Conventional texts consist of words,
which combine lexical information with morphological
and syntactic information; these words are organized
into sentences which have syntactic, semantic, and
pragmatic information. Sentences convey information
and ideas, directly as part of their meaning,
less directly as part of their entailment,
and very indirectly by their contribution to our
view of the world. Texts quote other texts,
or misquote them, appeal to their authority or refute
them, interpret or misinterpret them. A full
account of the shape of text must include some
mention of these linguistic, aesthetic,
denotational, expressive, conative, phatic,
metalinguistic, and intertextual functions; a fully complete
representation of text as data — if such a thing is
possible — must provide ways to represent such
aspects of text and process them.
... Eine vollständige Beschreibung der Textgestalt
muss diese sprachlichen, ästhetischen, ... Funktionen
irgendwie erwähnen; eine wirklich vollständige
Repräsentation von Text als Datentyp — wenn
so ein Ding überhaupt denkbar ist — muss es
ermöglichen, diese Texteigenschaften darzustellen
und zu verarbeiten.
Slide showing Babylonian 23
= 23, or 23 * 60, or 23 * 60 * 60, or 23 / 60, ...
These aspects of text are only partially represented in
the written form of a text. Print is, after all, a
technology for writing. And writing systems are almost
always incomplete representations of linguistic
utterances: the function of any writing system is to make
it possible to decipher a message, not to provide a
complete representation of all aspects of the utterance.
Accordingly, writing systems are almost always
underspecified vis-à-vis the language(s) they are used to
represent: the writing system disambiguates just as much
as it needs to, to be practical, and usually as little as
it can get away with. That means that written utterances
are frequently (strictly speaking) ambiguous, and we rely
on context to help us understand what was meant. A given
Babylonian numeral, for example, can mean 1, or 60, or 3600,
and the writing system relies on context to make clear
which meaning is to be assigned to a given occurrence
(25 plus 35 is ... not 1 and not 3600, but 60). The token
"M." may mean many things, but when followed by a
nomen gentilicium (e.g. "Tullius")
it can only mean "Marcus".
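The ambiguity of a sexagesimal digit can be made concrete in a few lines. What follows is a minimal sketch, not a model of any real cuneiform corpus: the function names `readings` and `resolve` and the range of powers considered are assumptions made for illustration. It lists the values a written digit could denote and then lets an expected value, supplied by context, select among them.

```python
from fractions import Fraction

def readings(digit, exponents=range(-1, 3)):
    """Return the values a sexagesimal digit could denote, one per power of 60."""
    return [Fraction(digit) * Fraction(60) ** k for k in exponents]

def resolve(digit, expected):
    """Pick the reading that matches a value supplied by context, if any."""
    matches = [value for value in readings(digit) if value == expected]
    return matches[0] if matches else None

# The sign for 23 may stand for 23/60, 23, 23 * 60, or 23 * 60 * 60.
print(readings(23))

# Context (the sum 25 + 35) selects the reading 60 for the sign "1".
print(resolve(1, 25 + 35))
```

The written form carries only the digit; the exponent is exactly the information that the reader, or here the caller, has to bring along.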
Writing systems for humans to use will almost always omit
information when they can, to make it easier to produce
written messages. But a writing system for machines can
be more explicit while remaining practically usable, if we
use software to assist in writing and reading it. What
would be involved if we tried to make our electronic
representations of text represent them more completely
than is customary (or perhaps possible) in print? What
might we then be able to do with our electronic texts?
Slide showing IPA transcription of a text
(screen shot from IAIA or UyLVs project).
Phonologists may be interested in a phonetic rendering of
the text, either a strictly phonetic one,
which transcribes in more or less detail the sounds
actually made by a speaker, or (as in the text shown in
the image) a phonemic one, which
distinguishes the phonemes of the language and indicates
(for those who can read the International Phonetic
Alphabet) at least approximately how they are pronounced.
Phonemic representations of this kind are, if I
understand things correctly, sometimes used to transcribe
material from languages which do not have an established
writing system.
Students of manuscripts might benefit from a distinction
similar to that made by phonologists between phonetic and
phonemic transcription: for some purposes, a graphetic
transcription is useful, which distinguishes the several
forms of lower-case a shown
earlier. For other purposes, a graphemic
transcription is more convenient. It is sometimes
nice if both levels can be accommodated in the same
electronic representation.
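One way to keep both levels in a single representation is to record, for every written character, the abstract grapheme together with the letter form actually used. The sketch below is only an illustration under that assumption: the class `Glyph` and the allograph labels are hypothetical and not taken from any transcription standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Glyph:
    grapheme: str   # the abstract letter, e.g. "a"
    allograph: str  # hypothetical label for the letter form the scribe wrote

def graphemic(glyphs):
    """Render the graphemic transcription: letter forms are ignored."""
    return "".join(g.grapheme for g in glyphs)

def graphetic(glyphs):
    """Render the graphetic transcription: every letter form is kept."""
    return " ".join(f"{g.grapheme}:{g.allograph}" for g in glyphs)

word = [
    Glyph("a", "two-storey"),
    Glyph("b", "plain"),
    Glyph("a", "single-storey"),
]

print(graphemic(word))
print(graphetic(word))
```

Because the graphemic view is derived from the richer record, the two transcriptions cannot drift apart.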
Word form | Morphemes | Lexeme + grammemes
Po | {PO} | po
soobščenijam | {SOOBŠČENIE} + {PL.DAT} | soobščenie[pl,dat]
pressy | {PRESSA} + {SG.GEN} | pressa[sg,gen]
SŠA, | {SŠA} | SŠA
Belyj | {BEL(YJ)} + {MASC.SG.NOM} | Belyj[masc,sg,nom]
Dom | {DOM} + {SG.NOM} | Dom[sg,nom]
We may be interested in the morphological structure of
the text, which can be modeled in several ways. The image
shows the surface form of the word on the left, in the
center the ‘shallow morphological
structure’ as specified in the Meaning/Text Model
of language created by Igor Mel'cuk and Aleksandr
Zholkovsky, and on the right the ‘deep
morphological structure’ in the same model. (The
full morphological representation of a sentence in the
meaning/text model requires also some representation of
prosody, but I omit it here for brevity.)
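The right-hand notation, a lexeme followed by bracketed grammemes, is regular enough to be read by a program. The following sketch assumes only the `lexeme[g1,g2]` shape visible in the slide; the function name `parse_deep_form` and the regular expression are illustrative and make no claim to cover the full formalism.

```python
import re

# A lexeme, optionally followed by a bracketed, comma-separated grammeme list.
DEEP_FORM = re.compile(r"^(?P<lexeme>[^\[\]]+)(?:\[(?P<grammemes>[^\[\]]*)\])?$")

def parse_deep_form(text):
    """Split a form such as 'pressa[sg,gen]' into (lexeme, grammemes)."""
    match = DEEP_FORM.match(text.strip())
    if match is None:
        raise ValueError(f"not a lexeme[grammemes] form: {text!r}")
    raw = match.group("grammemes")
    grammemes = tuple(g.strip() for g in raw.split(",")) if raw else ()
    return match.group("lexeme"), grammemes

for form in ["po", "soobščenie[pl,dat]", "Belyj[masc,sg,nom]"]:
    print(parse_deep_form(form))
```

Uninflected items such as the preposition simply come back with an empty tuple of grammemes.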
Word form | Part of speech
Ursprung | NN
bedeutet | VVFIN
hier | ADV
jenes | PDAT
, | ,
von | APPR
woher | ADJA
und | KON
wodurch | PWAV
eine | ART
Sache | NN
ist | VAFIN
, | ,
was | PWS
sie | PPER
ist | VAFIN
und | KON
wie | PWAV
sie | PPER
ist | VAFIN
. | .
The most common linguistic annotation in existing
language corpora consists of assigning a part of speech to
each word. Here we see the Stuttgart/Tübingen part of
speech tag set applied to a sentence from Heidegger, which
has apparently slightly confused the part-of-speech tagger.
(Specifically, on the word
woher.)
The first such corpora to be produced were the tagged
Brown and LOB corpora of modern American and British
English, in the 1970s and 1980s, but many more have
followed. You may notice that the part of speech tags used
have (as here) only a passing resemblance to the eight parts
of speech identified by the ancient Latin grammarians,
which you may have learned in school.
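In data terms, part-of-speech annotation of this kind amounts to a sequence of (token, tag) pairs. The sketch below copies the pairs from the slide, including the doubtful tag on woher and the punctuation tagged as itself; the names `TAGGED`, `tokens_with`, and `tag_counts` are assumptions made for illustration, not part of any tagger's interface.

```python
from collections import Counter

# (token, tag) pairs as shown on the slide; punctuation is tagged as itself there.
TAGGED = [
    ("Ursprung", "NN"), ("bedeutet", "VVFIN"), ("hier", "ADV"),
    ("jenes", "PDAT"), (",", ","), ("von", "APPR"), ("woher", "ADJA"),
    ("und", "KON"), ("wodurch", "PWAV"), ("eine", "ART"), ("Sache", "NN"),
    ("ist", "VAFIN"), (",", ","), ("was", "PWS"), ("sie", "PPER"),
    ("ist", "VAFIN"), ("und", "KON"), ("wie", "PWAV"), ("sie", "PPER"),
    ("ist", "VAFIN"), (".", "."),
]

def tokens_with(tag, tagged=TAGGED):
    """Return the tokens carrying a given tag, in text order."""
    return [token for token, t in tagged if t == tag]

# How often each tag occurs in the sentence.
tag_counts = Counter(t for _, t in TAGGED)

print(tokens_with("ADJA"))  # the tag the tagger assigned to "woher"
print(tag_counts.most_common(3))
```

Even this flat structure supports useful queries, such as listing every token that received an unexpected tag.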
Surface syntactic representation of Melcuk sentence 7
Another popular form of annotation for language corpora
consists in producing a syntax tree for each sentence in the
corpus, either a phrase-structure tree (as in the Penn
Treebank and some others) or a dependency tree.
(Dependency trees have a popularity in computational
contexts which appears to be out of all proportion to the
popularity of dependency grammars among practicing
linguists.) Shown here is the surface syntactic
representation of a Russian sentence (whose
morphology was glimpsed earlier). Each oval represents a
word in the sentence (or strictly speaking an item in the
deep morphological representation of the sentence). The
surface syntactic structure is modeled here as an unordered
tree: any information expressed by the ordering of words in
the sentence must be represented either in the dependencies
or in the classification (labels) of the dependencies.
(Most dependency treebanks retain the order of the words,
and perhaps as a result can use a much smaller set of
dependency classes.)
The nodes of the theme (topic) are colored light blue;
those of the rheme (comment) are colored pink. Within the
rheme, a subordinate theme and rheme are identified; these
are marked by double ovals in blue or red,
respectively.
I won't attempt to explicate this diagram in detail; as
you can perhaps see, it is a dependency tree rather than a
phrase-structure tree; it differs from dependency trees
created in other approaches primarily in being unordered, in
having (in consequence) a richer set of dependency labels,
and in systematically using not the surface form of the
words but the forms of the deep morphological structure as
nodes. For the purposes of this discussion, it suffices to
notice that the dependencies will require that we record
directional links from each dependent to its governor.
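The requirement can be sketched as a small data structure: one labelled link per dependent, pointing at its governor, with the children of each node held in a set because the tree is unordered. The three-word English example and its labels are hypothetical and are not taken from the diagram; a real treebank would also use node identifiers rather than word strings, since words can repeat.

```python
from collections import defaultdict

# dependent -> (governor, dependency label); the root has no entry.
# Hypothetical toy sentence, for illustration only.
LINKS = {
    "dog": ("barks", "subj"),
    "the": ("dog", "det"),
    "loudly": ("barks", "adv"),
}

def dependents(links):
    """Invert the links: governor -> unordered set of (dependent, label)."""
    children = defaultdict(set)
    for dependent, (governor, label) in links.items():
        children[governor].add((dependent, label))
    return children

def root(links):
    """Return the one node that governs others but has no governor itself."""
    governors = {governor for governor, _ in links.values()}
    (top,) = governors - set(links)
    return top

print(root(LINKS))
print(dependents(LINKS)["barks"])
```

Since no word order is stored, anything that order would express has to be carried by the labels, which is why an unordered tree needs a richer label set.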
Deep syntactic representation of Melcuk sentence 7
Some linguists define two levels of syntax trees: a
surface level which is relatively clearly related to the
surface form of the sentence, and a deep level which is more
distant from the surface form, but which makes the semantic
structure of the sentence clearer. This example, again from
Igor Mel'cuk and illustrating the meaning/text model,
regularizes some surface syntactic and lexical variations.
The deep structure uses a very restricted set of dependency
types, and replaces some nodes (like "very energetic"
modifying "help") with lexical functions
(Magn).
Several arguments of verbs which are implicit in the
surface form of the sentence and omitted from the surface
syntactic structure are made explicit here, so the words
NAROD, referring to the American people, and STRANA,
referring to African countries, each appear three times, and
are in turn connected by links in the deep anaphoric
structure (represented again here by dotted lines).
Semantic representation of Melcuk sentence 7
Some treebanks attempt to make the semantic structure of
the sentence accessible for processing by annotating the
surface syntax tree; others, as I've just shown, produce a
second ‘deep’ syntax tree. Some
linguists will wish for a separate representation of the
semantics of the sentence; the image shows the
semantic structure postulated by Mel'cuk for
the sentence whose syntactic trees have already been shown.
(In the meaning/text model, this semantic representation
will be the same for all sentences
‘synonymous’ with the one shown earlier.)
In this diagram, the semantic structure is modeled as a
directed graph; it would not be impossible to render it as a
set of sentences in logical form, if one wished to reason
formally about the meaning expressed.
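The step from a directed graph to logical-form sentences can be sketched briefly. The edge list below is a hypothetical fragment loosely echoing the vocabulary of the example, not the structure in the diagram, and the encoding of an edge as (predicate, argument number, argument) is an assumption made for illustration.

```python
# Each edge: (predicate node, argument number, argument node).
# Hypothetical fragment, for illustration only.
EDGES = [
    ("help", 1, "people"),
    ("help", 2, "countries"),
    ("energetic", 1, "help"),
]

def logical_forms(edges):
    """Render each predicate with its numbered arguments as 'pred(arg1, arg2)'."""
    arguments = {}
    for predicate, number, argument in edges:
        arguments.setdefault(predicate, {})[number] = argument
    forms = []
    for predicate, numbered in arguments.items():
        ordered = [numbered[n] for n in sorted(numbered)]
        forms.append(predicate + "(" + ", ".join(ordered) + ")")
    return sorted(forms)

for form in logical_forms(EDGES):
    print(form)
```

A predicate may itself appear as an argument of another predicate, which is what makes the structure a graph rather than a flat list of facts.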
I have used the meaning-text model to illustrate the
kinds of information we would need in order to have a
notionally complete representation of the linguistic
structure of text, partly because it is possible to find
fairly clear descriptions of each of its descriptive
layers. Some linguistic approaches may offer simpler,
possibly less complete, descriptions: not all students of
syntax will distinguish two syntactic structures and also a
semantic structure. The ‘semantic’
annotations to the Penn Treebank, for example, offer a
single structural representation intended to describe both
the syntactic structure and the associated semantic roles.
And of course we may be interested, in a given practical
project, only in the morphology, or only in the lexis,
or only in the surface syntax of our text.
If a more or less complete representation of the
linguistic structure of a text is so complex and raises
so many questions, we may decide that a complete
representation of everything is a chimera
not worth chasing.
How much chance do we have of being able to process a
query that demands “show me all the passages in Musil,
if there are any, which are best understood as allusions
to Faust,” even if we restrict ourselves
to the views of a single literary historian? How likely
is it that we could make explicit everything
we think about any given text?
Mathematicians sometimes say that for all the mathematical
discoveries of the last centuries,
we are like swimmers who explore one part of a small cove
in detail and know almost nothing at all about the ocean
just beyond. We have only scratched the surface.
Similarly, with text, I think, it is possible that we have
only scratched the surface. Our current methods of
textual representation constitute a set of hard-won
achievements. But there is a great deal more work to be
done. For that work we need the best tools we can have,
and among those tools, I think it is safe to say, is
TUSTEP.
5 Conclusion · Schluss
Some years ago, the prominent digital humanist Willard
McCarty (editor of the Humanist discussion list) used to say
humorously that software often reflected not only the
thinking but also the personality of its makers. To
drive the point home, he would say "Think of WordPerfect,
for example. Have you ever seen such a Mormon
piece of software? It is clean, well-behaved, polite almost
to a fault — and also unprepared to deal with anything that
does not fit into its world view, and absolutely,
irredeemably convinced that it knows the Right Way to do
things." To the extent that software is a tool, it
reflects the convictions of its makers about what tasks are
worth doing. Not only that: software will also necessarily
reflect the makers' beliefs about how one can (or should, or
must) set about performing those tasks, and (a little less
directly) about the human being or beings who will be using
the software: what they find important, what they find less
important, how they think about their task and how much
responsibility they wish to take for it.
There are not many pieces of software that make me wish
to live up to the software's implied image of
its user. TUSTEP is one of these.
On this analysis, every piece of text processing software
reflects an idea of text, of the things we might want to do
with text, of the ways we might want to shape text by
processing and (for those programs that produce output) the
ways we might want to shape it on the page. Sometimes, the
ideas reflected are simple, facile, even simplistic. Not
many programs offer a concept of text and the shaping of
text as detailed, as deeply felt, as coherently thought
through as that of TUSTEP. As always, though, programs
reflect the thought of those who make them, and it seems
fair to say that with TUSTEP, Wilhelm Ott has shown himself
to be one of our day's great thinkers about text, and the
shape of text, and the shaping of text.
Please join me in honoring Wilhelm Ott and thanking him
for his work.