Trip report

Korpuslinguistik deutsch: synchron, diachron, kontrastiv

University of Würzburg, 20-23 February 2003

The German department at the University of Würzburg, with the help of some funding from the Deutsche Forschungsgemeinschaft (DFG), organized a small conference on corpus linguistics this past week. The largest emphasis was on German-language work, with some complementary papers on contrastive studies, parallel corpora in various language pairs (German/English, German/Norwegian/English, German/Russian, German/Portuguese, etc.), and neighboring disciplines. I represented the neighboring discipline of text encoding. The organizers had asked me to attend and give an update on XML-related work at W3C.

A secondary purpose of the conference was to celebrate the 60th birthday of Norbert Richard Wolf, a professor of German linguistics here since 1977, who by all the evidence has won a great many friends and admirers among colleagues and students, and who has been instrumental in setting up many of the international partnerships which are building the multilingual parallel corpora described above. (I had met him during my last visit to Würzburg, when he gave such a wonderful introduction to a talk I delivered that I rather feared the talk was a bit of a letdown after the thorough and imaginative introduction.)

I did not take detailed notes during the talks — some of my fellow attendees seem to have been given a contrary impression by my manic typing (are you the official notetaker? asked one delegate), but my laptop usage has a different explanation: during several of the lectures I was dividing my attention between listening to the presentation and trying to get some off-the-shelf XML software to perform useful searches on data from the British National Corpus sampler disk. I wanted to turn part of my talk into a demo showing that if you use XML, you can benefit from the existence of a lot more tools than you would ever have a chance to write for yourself. It seemed like a promising idea to show how one can build a search interface to the BNC using Cocoon and XSLT, and doing nothing more elaborate oneself than writing a few XSLT stylesheets.

After the obligatory opening remarks from the president of the university, the program began with a paper from a local Anglist on the topic of English corpora, asking (according to the program) which corpora are suited for what. Ernst Burgschmidt, the speaker, presented a noteworthy appearance, dressed in a Jankerl (a kind of coat distinctive to Bavaria) and crowned with a flowing mane of silver hair. He began by pulling a filing box full of newspapers out of a bag, to remind us that working with corpora need not entail working with computers. The burden of his exposition was, indeed, that if a paper-based corpus is built around a specific kind of text, it may be possible to use it to study phenomena which cannot be usefully studied with larger machine-readable ‘balanced’ corpora, since a balanced corpus is unlikely to have as large a selection of the particular text type in question. He illustrated his point using English adverbs formed from present participles and documenting a certain leaching away of the specific lexical meaning of the underlying verbs: when something is surprisingly useful, it's not really a surprise, and when we write that someone is alarmingly well-read, we expect the reader to react with admiration rather than alarm.

George Smith, an American working in the German department in Potsdam (a new university founded to rectify the fact that the state of Brandenburg would otherwise have had no universities), described the TIGER corpus ('Tiger', from 'trees in German') being prepared there. The project aims to produce a tree bank of some tens of thousands of syntactically analysed sentences in German. The annotation appears to be part automatic but mostly human. I was a little alarmed by an example he showed, in which what I thought was a single clause with a compound verb was analysed as a pair of coordinate clauses (one with an understood direct object, and one with an understood subject), but I infer from conversations with other participants that while this may be distasteful to some it is not actually evidence of an error. What else can you expect, said one informant in effect, if they insist on using dependency grammar?

Two talks the first day described corpora and corpus annotation and search systems at the Institut für Deutsche Sprache (IDS) in Mannheim. Rainer Perkuhn described the holdings and technology in general terms (the general term that sticks in my mind is “1.9 billion tokens”, not quite two thousand times the size of the Brown and LOB corpora, nineteen times the size of the British National Corpus). Ah, but (I hear the BNC creators say) is it tagged in SGML? To which the answer appears to be: yes, of course, why do you ask? This news made me worry a bit less about whether XML would be seen as relevant to the topic of the conference. Werner Kallmeyer and Wilfried Schütte demonstrated special systems for the study of spoken material. They showed automatic alignment of a transcription (mostly orthographic, if memory serves) with the sound data, score-like display (Partiturausgabe) of transcription and sonograph, and wonderful facilities for playing back specific examples. The searches for the word aber, and then for aber followed by a pause, were very impressive. The example was well chosen, for different occurrences of a pause after the conjunction turned out to have rather different functions, so listening to the sound was really important in understanding and interpreting the examples.
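For readers who wonder what a search for aber followed by a pause amounts to once the transcription is in XML, here is a minimal sketch of my own. It is not the Mannheim system, and everything in it is an assumption made for illustration: the element names (transcript, u, w, pause), the who and dur attributes, and the three invented utterances.

```python
import xml.etree.ElementTree as ET

# Invented sample transcription; element and attribute names are hypothetical.
SAMPLE = """<transcript>
  <u who="A"><w>aber</w><pause dur="0.8"/><w>ich</w><w>weiß</w><w>nicht</w></u>
  <u who="B"><w>ja</w><w>aber</w><w>das</w><w>geht</w></u>
  <u who="A"><w>gut</w><pause dur="0.3"/><w>aber</w><pause dur="1.2"/><w>später</w></u>
</transcript>"""

def word_before_pause(xml_text, word):
    """Return (speaker, pause duration) for each `word` directly followed by a pause."""
    root = ET.fromstring(xml_text)
    hits = []
    for u in root.iter("u"):
        children = list(u)
        # Compare every child element of the utterance with its right-hand neighbour.
        for current, following in zip(children, children[1:]):
            is_word = current.tag == "w" and (current.text or "").lower() == word
            if is_word and following.tag == "pause":
                hits.append((u.get("who"), float(following.get("dur"))))
    return hits

for speaker, duration in word_before_pause(SAMPLE, "aber"):
    print(speaker, duration)
```

In the invented sample, speaker B's aber is not reported, because it runs straight on into the next word. Deciding what each pause is doing would still require listening to the recording, as the demonstration made clear.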

Ilse Mindt of Würzburg talked about learner English and the use of corpora to study it. In this case she used two very small parallel corpora, one formed by extracting material from an existing corpus of spoken English and one formed by having a small number of native speakers of German (the students in a Hauptseminar) read set English passages aloud. Acoustic analysis of various vowels illustrated quite clearly the difficulty German native speakers have with part of the English vowel system (short e and what I was taught in school to call short a) and the features of the corresponding part of the German vowel system which give rise to the problems. She had suggestive hypotheses about how to improve the situation, though she had not yet been able to test them in practice. I was surprised and pleased to see how well sonographic analysis is able to reconstruct the vowel triangle my teachers used to draw on the board to explain (their reconstruction of) Old Norse vocalism to us. What engaged the audience most, however, was probably her recording of a well-known German football player giving an interview in English, during which he obligingly demonstrated all the learner-English phenomena she wanted to talk about, and more.

Two talks the first day dealt with various forms of language change: Norbert Dittmar of Berlin talked about the use of weil followed by a clause with a verb in normal (second) position, as opposed to the verb-final position which is taught in textbooks and which remains dominant, and Irmgard Elter of Forlì talked about the use of the dative rather than the genitive after the prepositions trotz, während, and so on. Both phenomena, it was remarked, are often interpreted as signs of decay in language habits, clarity of thought, and Occidental culture; neither speaker seemed inclined to read them this way, though. (Informally, Dr. Elter observed that her Italian students are often more exercised about the phenomenon than native speakers of German are, at least if the native speakers have linguistic training. As a non-native speaker myself, I quite see the Italians' point: we had to learn to use the genitive with those prepositions, and it's just not fair for the native speakers to change the rules just to make things easier for themselves!) I was surprised to see how strong the older forms appear still to be in the data.

The final speaker of the first day was a young lawyer, Johannes Patzelt of Munich, who gave a brisk summary of current German copyright law. Afterwards, there was a reception in the Martin von Wagner museum, in part of the former residence of the bishops of Würzburg, after which people adjourned for supper at the Juliusspital, which serves quite good Franconian wine. Nelleke Oostdijk of Nijmegen and Christian Mair of Freiburg left the reception about the same time as I did, and we went to the Juliusspital a bit early, which gave us a chance to get our food order in ahead of the rush. Later we were joined by John Sinclair, whose arrival had been delayed by family obligations, and Ilse Mindt, who had picked him up at the airport. At some point during the conversation, Mair or Sinclair said in passing, well, of course the first thing I want to do when I get some data tagged in XML is to get rid of the [deleted] tags. To which the other agreed, but there doesn't seem to be a decent tag stripper around! My worries about my topic returned: I wondered whether my auditors would regard XML-based text encoding as a neighboring discipline, or as a co-belligerent.
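For what it is worth, a tag stripper of the kind wished for at the table takes only a few lines once an off-the-shelf XML parser does the work. The sketch below is mine and uses only Python's standard library. The sample sentence merely imitates word-level tagging: the element names (s, w, c) and the type attribute are assumptions, not copied from any real corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the markup only imitates word-class tagging.
SAMPLE = (
    '<s n="1"><w type="AT0">The </w><w type="NN1">corpus </w>'
    '<w type="VBZ">is </w><w type="AJ0">tagged</w><c type="PUN">.</c></s>'
)

def strip_tags(xml_text: str) -> str:
    """Return the character data of an XML document with all markup removed."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order and yields only the text.
    return "".join(root.itertext())

print(strip_tags(SAMPLE))  # prints: The corpus is tagged.
```

Because the parser rather than a regular expression finds the tags, attribute values that happen to contain angle brackets and character references such as &amp; are handled correctly.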

The second day was a bit of a mixed bag. Most of the Anglophone talks were scheduled for Saturday, and there were reports from a large variety of bilingual corpus projects, as well as from scholars working on Old High German noun formation, Korean and German adverbials, and the semantics of case in Sanskrit. I don't remember details of most of them, but some things stick in my mind.

Christian Mair, one of the Anglists with whom I had dined the evening before, talked about some new corpora of English made in Freiburg as analogs to the Brown and LOB corpora, but using material from the 1990s rather than from the 1960s: Frown is the Freiburg analog to the Brown corpus, and F-LOB is the Freiburg LOB. By studying phenomena in both pairs of corpora, he was able to show historical trends in both British and American English; the usage of the verb want, the relative frequency of the parts of speech (are nouns really crowding out verbs?), and the usage of the verb help all provided nice examples. Where the thirty-year difference in material in the corpora doesn't provide a long enough time span, he has found the citation material of the electronic OED a useful backup.

John Sinclair, of Birmingham and (for the last six years or so) the Tuscan Word Centre of Florence, gave a talk under the title "From Corpus to Theory via Meaning", in which he started by looking at examples of sentences in which some one or some thing is said to be, or to have been, or not yet to have been, a gleam in someone's eye. Often, this has an idiomatic meaning referring to the moment of conception (or, strictly, I suppose, to a period shortly before conception), but quite often it is merely another way of describing someone's expression: they have an angry gleam, or an eager gleam, or some other kind of gleam, in their eye. He observed that the idiomatic meaning is almost never present when the noun gleam is modified by an adjective, and identified a number of other conditions for the idiomatic meaning. His argument was that it is important to keep in mind the meaning being conveyed in the utterances studied, and that if you wish to keep access to the meaning, you have to give up on the notion that language is a purely formal system; “it is a semi-formal system only.” Moreover, the word is not the primary unit of meaning; our dictionaries and thesauruses have used it only for convenience. The word is at best the minimal indication of the presence of some unit of meaning; it is the recognizable tip of the iceberg. He ended with a plea for recognizing the primacy of speech and for demoting ‘written language grammar’ from its current position as the focus of most linguistic analysis.

There were questions, but no one asked whether the idea of a unit of meaning was a serious proposal or just a figure of speech, and if the former how one would go about cutting things up to identify the units.

Tony McEnery of Lancaster reported on work done with some Chinese/English parallel corpora, in particular on a corpus-based approach to the translation of aspect. As sometimes happens, what you learn when you study the standard grammars and what you learn when you study language in use are rather different. In particular, English sentences which one would normally expect to require aspect marking in their Chinese translations frequently don't receive such marking; it is apparently possible to leave the aspect implicit, and this strategy is, one surmises, preferred by the translators. It is apparently not, however, a translation phenomenon: L1 material in Chinese also leaves aspect implicit a very high percentage of the time.

Nelleke Oostdijk spoke about the annotation of multilingual corpora, but began by observing that current work on multilingual corpora doesn't actually typically involve any annotation in the sense she means. Most polyglot corpora content themselves with parallel segmentation of L1/L2 translation pairs, sometimes at the sentence level and sometimes (mostly problematically) at the word level. In some cases, word-class information is added to the individual words in each language. A large opportunity exists to make more useful parallel corpora by including syntactic annotation, based on a formal descriptive analysis, which marks immediate constituent structure, labels the constituents for syntactic function and category, and remains (as far as possible) theory neutral. She gave good examples. Such annotation is not necessarily easy to come by, however. For contrastive linguistics, we need comparable descriptions of the two languages in question, but to be comparable they have to take a consistent position on questions like labeling constituent structures. For reasons worthy of attention from sociologists of science, however, trends in linguistic analysis frequently result in incomparable analyses of syntax in the different languages represented in the multilingual corpus.

Among the posters presented at the conference several stick in the memory. Barbara Arnold reported on her lexicon of the vocabulary of the Nachtwachen des Bonaventura. Kirsi Pakkanen-Kilpiä's report showed that many German verbs described by grammars as not capable of forming the passive nevertheless show up with well attested passives in the Mannheim corpora, with the result that any account of the semantic effect of the passive needs to be re-thought. Finnish-German, Portuguese-German, and Russian-German parallel corpora have been built or are being built. Work on dialects in Silesia, on German-Czech language interference and errors in German by native speakers of Czech, and more presentations from IDS in Mannheim, all sounded promising.

At the end of the second day, a session on 'corpus-related encoding problems' housed my report on XML work at W3C and a plea for data sharing (and for the use of XML and Unicode) by Prof. Roy Boggs of Florida Gulf Coast University. Roy Boggs, though now a teacher of systems analysis and design in a school of business, began his career as a specialist in Middle High German, and did a concordance of the Middle High German poet Hartmann von Aue which was one of the resources which persuaded me, as a graduate student, that computers could be a useful tool for philology. He did not actually utter the slogan “information wants to be free,” but he pointed out that in some fields, anyone who publishes a study based on the analysis of a data set is expected, or required, to make the underlying data available for secondary analysis. For my part, I summarized the inescapable role of markup of one kind or another in any serious work with textual data, and tried to suggest implicitly that if one is going to have markup in one's data anyway, it might as well be well organized, easy to parse, and well thought out. After a brief discussion of the concept of document grammars and XML Schema, I talked about searching XML and the central role of XPath, and ended by arguing that the increasing availability of off-the-shelf XML tools made it much easier to build specialized systems for manipulating specialized data of any kind. My search interface to the treebank data distributed with the BNC sampler was in fact working by then, with four different styles of display, but there was not time to demonstrate it. (My slides are now on the Web.)
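To give a flavour of what searching XML with XPath looks like, here is a small sketch of my own. It is not the Cocoon and XSLT interface mentioned above. It uses the ElementTree module in Python's standard library, which implements only a subset of XPath, and the sample markup (text, s, and w elements with a pos attribute) is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data; element and attribute names are hypothetical.
SAMPLE = """<text>
  <s n="1"><w pos="PRON">She</w> <w pos="VERB">reads</w> <w pos="NOUN">corpora</w></s>
  <s n="2"><w pos="NOUN">Corpora</w> <w pos="VERB">grow</w></s>
</text>"""

root = ET.fromstring(SAMPLE)

def find_words(root, pos):
    """All word forms whose pos attribute equals `pos`, in document order."""
    # XPath: every <w> element at any depth with the given attribute value.
    return [w.text for w in root.findall(f".//w[@pos='{pos}']")]

def sentences_with(root, pos):
    """The n attribute of every sentence containing a word with the given pos."""
    return [
        s.get("n")
        for s in root.findall(".//s")
        # A relative path is evaluated against each sentence in turn.
        if s.find(f"w[@pos='{pos}']") is not None
    ]

print(find_words(root, "NOUN"))
print(sentences_with(root, "PRON"))
```

The point is the one made in the talk: the query is a short path expression, and the parsing, tree walking, and matching all come from a tool one did not have to write.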

The final lecturer of the conference was Prof. Dr. Wolf-Dieter Schäfer, an ophthalmologist who spoke, under the rubric “Perspectives from neighboring sciences”, about the fundamental importance of vision to mental development, social integration, and, of course, to reading. The talk had, strictly speaking, nothing whatever to do with corpus linguistics, but it was in fact one of the most interesting talks in the conference, and I am grateful to the organizers for having invited it.

After the final talk, Ludwig Eichinger, the director of the IDS in Mannheim, made some concluding remarks, taking up points raised in the talks given during the conference and examining them in a new light, and then leading gradually into a laudatio for Norbert Richard Wolf, of whom many flattering things were said. Afterwards, there was a celebratory meal for Prof. Dr. Wolf, organized by his colleagues and students. At one point, the students took us all outside the philosophical institute which housed the meal, and gave everyone a helium balloon to which was tied a sparkler. The smokers among us lit everyone's sparklers, and we released the balloons into the night sky above Würzburg, sparkling as they went. Back inside, there were speeches, there was conversation, and beer and wine flowed like water, only rather more freely, and the participants provided a fitting conclusion to the colloquium by talking together.

My thanks to the organizers, especially Prof. Dr. Johannes Schwitalla and Priv.-Doz. Dr. Werner Wegstein, for inviting me, and to my fellow participants for the good talks and the stimulating conversations.