Trip report
Korpuslinguistik deutsch: synchron, diachron, kontrastiv
University of Würzburg, 20-23 February 2003
The German department at the University of Würzburg, with the
help of some funding from the Deutsche Forschungsgemeinschaft (DFG),
organized a small conference on corpus linguistics this past week.
The largest emphasis was on German-language work, with some
complementary papers on contrastive studies, parallel corpora in
various language pairs (German/English, German/Norwegian/English,
German/Russian, German/Portuguese, etc.), and neighboring
disciplines.
I represented the neighboring discipline of text encoding. The
organizers had asked me to attend and give an update on XML-related
work at W3C.
A secondary purpose of the conference was to celebrate the 60th
birthday of Norbert Richard Wolf, a professor of German linguistics
here since 1977, who by all the evidence has won a great many friends
and admirers among colleagues and students, and who has been
instrumental in setting up many of the international partnerships
which are building the multilingual parallel corpora described above.
(I had met him during my last visit to Würzburg, when he gave
such a wonderful introduction to a talk I delivered that I rather
feared the talk was a bit of a letdown after the thorough and
imaginative introduction.)
I did not take detailed notes during the talks — some of my
fellow attendees seem to have been given a contrary impression by my
manic typing (are you the official notetaker? asked one delegate), but
my laptop usage has a different explanation: during several of the
lectures I was dividing my attention between listening to the
presentation and trying to get some off-the-shelf XML software to
perform useful searches on data from the British National Corpus
sampler disk. I wanted to turn part of my talk into a demo showing
that if you use XML, you can benefit from the existence of a lot more
tools than you would ever have a chance to write for yourself. It
seemed like a promising idea to show how one can build a search
interface to the BNC using Cocoon and XSLT, and doing nothing more
elaborate oneself than writing a few XSLT stylesheets.
After the obligatory opening remarks from the president of the
university, the program began with a paper from a local Anglist on the
topic of English corpora, asking (according to the program) which
corpora are suited for what. Ernst Burgschmidt, the speaker,
presented a noteworthy appearance, dressed in a Jankerl (a kind of
coat distinctive to Bavaria) and crowned with a flowing mane of silver
hair. He began by pulling a filing box full of newspapers out of a
bag, to remind us that working with corpora need not entail working
with computers. The burden of his exposition was, indeed, that if a
paper-based corpus is built around a specific kind of text, it may be
possible to use it to study phenomena which cannot be usefully studied
with larger machine-readable ‘balanced’ corpora,
since a balanced corpus is unlikely to have as large a selection of
the particular text type in question. He illustrated his point using
English adverbs formed from present participles and documenting a
certain leaching away of the specific lexical meaning of the
underlying verbs: when something is surprisingly useful, it's
not really a surprise, and when we write that someone is
alarmingly well-read, we expect the reader to react with
admiration rather than alarm.
George Smith, an American working in the German department in
Potsdam (a new university founded to rectify the fact that the state
of Brandenburg would otherwise have had no universities), described
the TIGER corpus ('Tiger', from 'trees in German') being prepared
there. The project aims to produce a tree bank of some tens of
thousands of syntactically analysed sentences in German. The
annotation appears to be part automatic but mostly human. I was a
little alarmed by an example he showed, in which what I thought was a
single clause with a compound verb was analysed as a pair of
coordinate clauses (one with an understood direct object, and one with
an understood subject), but I infer from conversations with other
participants that while this may be distasteful to some it is not
actually evidence of an error. What else can you expect, said one
informant in effect, if they insist on using dependency grammar?
Two talks the first day described corpora and corpus annotation and
search systems at the Institut der deutschen Sprache in Mannheim:
Rainer Perkuhn described the holdings and technology in general terms
— the general term that sticks in my mind is “1.9 billion
tokens”, not quite two thousand times the size of the Brown and LOB
corpora, nineteen times the size of the British National Corpus. Ah,
but (I hear the BNC creators say) is it tagged in SGML? to which the
answer appears to be yes, of course, why do you ask? This news made
me worry a bit less about whether XML would be seen as relevant to
the topic of the conference. Werner Kallmeyer
and Wilfried Schütte demonstrated special systems for the study
of spoken material. These offer automatic alignment of a transcription
(mostly orthographic, if memory serves) with the sound data, score-like
display (Partiturausgabe) of transcription and sonogram, and
wonderful facilities for playing back specific examples. The searches
for the word aber, and then for aber followed by a pause, were very
impressive.
The example was well chosen, for different occurrences of a pause
after the conjunction turned out to have rather different functions,
so listening to the sound was really important in understanding and
interpreting the examples.
Ilse Mindt of Würzburg talked about learner English and the
use of corpora to study it, in this case two very small parallel
corpora, one formed by extracting material from an existing corpus of
spoken English and one formed by having a small number of native
speakers of German (the students in a Hauptseminar) read set English
passages aloud. Acoustic analysis of various vowels
illustrated quite clearly the difficulty German native speakers have
with part of the English vowel system (short e and what I was taught
in school to call short a) and the features of the corresponding part
of the German vowel system which give rise to the problems. She had
suggestive hypotheses about how to improve the situation, though she
had not yet been able to test them in practice. I was surprised and
pleased to see how well sonographic analysis is able to reconstruct
the vowel triangle my teachers used to draw on the board to explain
(their reconstruction of) Old Norse vocalism to us. What engaged the
audience most, however, was probably her recording of a well-known
German football player giving an interview in English, during which he
obligingly demonstrated all the learner-English phenomena she wanted
to talk about, and more.
Two further talks the first day dealt with various forms of language
change: Norbert Dittmar of Berlin talked about the use of
weil followed by a clause with a verb in normal
(second) position, as opposed to the verb-final position which is
taught in textbooks and which remains dominant, and Irmgard Elter of
Forlì talked about the use of the dative rather than the
genitive after the prepositions trotz,
während, and so on. Both phenomena, it
was remarked, are often interpreted as signs of decay in language
habits, clarity of thought, and Occidental culture; neither
speaker seemed inclined to read them this way, though. (Informally,
Dr. Elter observed that her Italian students are often more exercised
about the phenomenon than native speakers of German are, at least if
the native speakers have linguistic training. As a non-native speaker
myself, I quite see the Italians' point: we had to learn
to use the genitive with those prepositions, and it's just not fair for
the native speakers to change the rules just to make things easier for
themselves!) I was surprised to see how strong the older forms appear
still to be in the data.
The final speaker of the first day was a young lawyer, Johannes
Patzelt of Munich, who gave a brisk summary of current German
copyright law. Afterwards, there was a reception in the Martin von
Wagner museum, in part of the former residence of the bishops of
Würzburg, after which people adjourned for supper at the
Juliusspital, which serves quite good Franconian wine. Nelleke
Oostdijk of Nijmegen and Christian Mair of Freiburg left the reception
about the same time as I, and we went to the Juliusspital a bit early,
which gave us a chance to get our food order in ahead of the rush.
Later we were joined by John Sinclair, whose arrival had been delayed
by family obligations, and Ilse Mindt, who had picked him up at the
airport. At some point during the conversation, Mair or Sinclair said
in passing, well, of course the first thing I want to do when I get
some data tagged in XML is to get rid of the [deleted] tags. To which
the other agreed, but there doesn't seem to be a decent tag stripper
around! My worries about my topic returned: I wondered whether my
auditors would regard XML-based text encoding as a neighboring
discipline, or as a co-belligerent.
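For what it is worth, a tag stripper of the kind wished for at the table is only a few lines with an off-the-shelf XML parser. The following is a minimal sketch in Python, assuming well-formed XML input; the sample sentence and its element and attribute names (s, w, pos) are hypothetical and do not reproduce any particular corpus's markup.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_string):
    """Return the character data of an XML document with all markup removed."""
    root = ET.fromstring(xml_string)
    # itertext() walks the tree in document order and yields only text,
    # including the text that sits between child elements.
    return "".join(root.itertext())


# Hypothetical input; the element and attribute names are assumptions.
sample = '<s><w pos="ADV">alarmingly</w> <w pos="ADJ">well-read</w></s>'
print(strip_tags(sample))
```

The sketch discards attributes along with the tags, so any word-class or other annotation carried in attributes is lost; whether that is a feature or a defect depends on which side of the table one sits.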
The second day was a bit of a mixed bag. Most of the Anglophone
talks were scheduled for Saturday, and there were reports from a large
variety of bilingual corpus projects, as well as from scholars working
on Old High German noun formation, Korean and German adverbials, and
the semantics of case in Sanskrit. I don't remember details of most
of them, but some things stick in my mind.
Christian Mair, one of the Anglists with whom I had dined the
evening before, talked about some new corpora of English made in
Freiburg as analogs to the Brown and LOB corpora, but using material
from the 1990s rather than from the 1960s: Frown is the Freiburg
analog to the Brown corpus, and F-LOB is the Freiburg LOB. By
studying phenomena in both pairs of corpora, he was able to show
historical trends in both British and American English; the usage of
the verb want, the relative frequency of the
parts of speech (are nouns really crowding out verbs?), and the usage
of the verb help all provided nice examples.
Where the thirty-year difference in material in the corpora doesn't
provide a long enough time span, he has found the citation material of
the electronic OED a useful backup.
John Sinclair, of Birmingham and (for the last six years or so) the
Tuscan Word Centre of Florence, gave a talk under the title "From
Corpus to Theory via Meaning", in which he started by looking at
examples of sentences in which some one or some thing is said to be,
or to have been, or not yet to have been, a gleam in someone's eye.
Often, this has an idiomatic meaning referring to the moment of
conception (or, strictly, I suppose, to a period shortly before
conception), but quite often it is merely another way of describing
someone's expression: they have an angry gleam, or an eager gleam, or
some other kind of gleam, in their eye. He observed that the
idiomatic meaning is almost never present when the noun
gleam is modified by an adjective, and
identified a number of other conditions for the idiomatic meaning. His
argument was that it is important to keep in mind the meaning being
conveyed in the utterances studied, and that if you wish to keep
access to the meaning, you have to give up on the notion that language
is a purely formal system; “it is a semi-formal system only.”
Moreover, the word is not the primary unit of meaning; our
dictionaries and thesauruses have used it only for convenience. The
word is at best the minimal indication of the presence of some unit of
meaning; it is the recognizable tip of the iceberg. He ended with a
plea for recognizing the primacy of speech and for demoting
‘written language grammar’ from its current
position as the focus of most linguistic analysis.
There were questions, but no one asked whether the idea of a
unit of meaning was a serious proposal or just
a figure of speech, and if the former how one would go about cutting
things up to identify the units.
Tony McEnery of Lancaster reported on work done with some
Chinese/English parallel corpora, in particular on a corpus-based
approach to the translation of aspect. As sometimes happens, what you
learn when you study the standard grammars and what you learn when you
study language in use are rather different. In particular, English
sentences which one would normally expect to require aspect marking in
their Chinese translations frequently don't receive such marking; it
is apparently possible to leave the aspect implicit, and this strategy
is, one surmises, preferred by the translators. It is apparently not,
however, a translation phenomenon: L1 material in Chinese also leaves
aspect implicit a very high percentage of the time.
Nelleke Oostdijk spoke about the annotation of multilingual
corpora, but began by observing that current work on multilingual
corpora doesn't actually typically involve any annotation in the sense
she means. Most polyglot corpora content themselves with parallel
segmentation of L1/L2 translation pairs, sometimes at the sentence
level and sometimes (mostly problematically) at the word level. In
some cases, word-class information is added to the individual words in
each language. A large opportunity exists to make more useful
parallel corpora by including syntactic annotation, based on a formal
descriptive analysis, which marks immediate constituent structure,
labels the constituents for syntactic function and category, and
remains (as far as possible) theory neutral. She gave good examples.
Such annotation is not necessarily easy to come by, however. For
contrastive linguistics, we need comparable descriptions of the two
languages in question, but to be comparable they have to take a
consistent position on questions like labeling constituent structures.
For reasons worthy of attention from sociologists of science, however,
trends in linguistic analysis frequently result in incomparable
analyses of syntax in the different languages represented in the
multilingual corpus.
Among the posters presented at the conference several stick in the
memory. Barbara Arnold reported on her lexicon of the vocabulary of
the Nachtwachen des Bonaventura. Kirsi
Pakkanen-Kilpiä's report showed that many German verbs described
by grammars as not capable of forming the passive nevertheless show up
with well attested passives in the Mannheim corpora, with the result
that any account of the semantic effect of the passive needs to be
re-thought. Finnish-German, Portuguese-German, and Russian-German
parallel corpora have been built or are being built. Work on dialects
in Silesia, on German-Czech language interference and errors in German
by native speakers of Czech, and more presentations from IDS in
Mannheim, all sounded promising.
At the end of the second day, a session on 'corpus-related encoding
problems' housed my report on XML work at W3C and a plea for data
sharing (and for the use of XML and Unicode) by Prof. Roy Boggs of
Florida Gulf Coast University. Roy Boggs, though now a teacher of
systems analysis and design in a school of business, began his career
as a specialist in Middle High German, and did a concordance of the
Middle High German poet Hartmann von Aue which was one of the
resources which persuaded me, as a graduate student, that computers
could be a useful tool for philology. He did not actually utter the
slogan “information wants to be free,” but he pointed out that in
some fields, anyone who publishes a study based on the analysis of a
data set is expected, or required, to make the underlying data
available for secondary analysis. For my part, I summarized the
inescapable role of markup of one kind or another in any serious work
with textual data, and tried to suggest implicitly that if one is
going to have markup in one's data anyway, it might as well be well
organized, easy to parse, and well thought out. After a brief
discussion of the concept of document grammars and XML Schema, I
talked about searching XML and the central role of XPath, and ended by
arguing that the increasing availability of off-the-shelf XML tools
made it much easier to build specialized systems for manipulating
specialized data of any kind. My search interface to the treebank
data distributed with the BNC sampler was in fact working by then,
with four different styles of display, but there was not time to
demonstrate it. (My slides are now on the Web.)
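To give a flavour of what searching XML with XPath looks like using nothing but off-the-shelf tools, here is a minimal sketch in Python. It is not the Cocoon and XSLT interface described above. It uses the limited XPath subset in Python's standard xml.etree.ElementTree module, and the tiny sample document, with its element and attribute names (text, s, w, pos, n), is a hypothetical stand-in rather than the actual markup of the BNC sampler.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample. The element and attribute names (s, w, pos)
# are illustrative assumptions, not the actual markup of the BNC sampler.
SAMPLE = """<text>
  <s n="1"><w pos="PRON">She</w> <w pos="VERB">reads</w> <w pos="NOUN">books</w></s>
  <s n="2"><w pos="PRON">He</w> <w pos="VERB">sleeps</w></s>
  <s n="3"><w pos="NOUN">Books</w> <w pos="VERB">help</w></s>
</text>"""


def words_tagged(root, pos):
    """Return the text of every <w> element whose pos attribute equals `pos`."""
    return [w.text for w in root.findall(f".//w[@pos='{pos}']")]


def sentences_with(root, pos):
    """Return the n attribute of every <s> containing a <w> tagged `pos`."""
    return [s.get("n") for s in root.iter("s")
            if s.find(f"w[@pos='{pos}']") is not None]


root = ET.fromstring(SAMPLE)
print(words_tagged(root, "VERB"))
print(sentences_with(root, "NOUN"))
```

The two path expressions do all the work: the first selects word elements by an attribute value anywhere in the document, the second tests each sentence for a matching child. Everything else is supplied by the parser, which is the point about not having to write the tools oneself.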
The final lecturer of the conference was Prof. Dr. Wolf-Dieter
Schäfer, an ophthalmologist who spoke, under the rubric
“Perspectives from neighboring sciences”, about the fundamental
importance of vision to mental development, social integration, and,
of course, to reading. The talk had, strictly speaking, nothing
whatever to do with corpus linguistics, but it was in fact one of the
most interesting talks in the conference, and I am grateful to the
organizers for having invited it.
After the final talk, Ludwig Eichinger, the director of the IDS in
Mannheim, made some concluding remarks, taking up points raised in the
talks given during the conference and examining them in a new light,
and then leading over gradually into a laudatio for
Norbert Richard Wolf, of whom many flattering things were said.
Afterwards, there was a celebratory meal for Prof. Dr. Wolf, organized
by his colleagues and students. At one point, the students took us all
outside the philosophical institute which housed the meal,
and gave everyone a helium balloon to which was tied a sparkler. The
smokers among us lit everyone's sparklers, and we released the
balloons into the night sky above Würzburg, sparkling as they
went. Back inside, there were speeches, there was conversation, and
beer and wine flowed like water, only rather more freely, and the
participants provided a fitting conclusion to the
colloquium by talking together.
My thanks to the organizers, especially Prof. Dr. Johannes
Schwitalla and Priv.-Doz. Dr. Werner Wegstein, for inviting me, and to
my fellow participants for the good talks and the stimulating
conversations.