Text und Textgestaltung

C. M. Sperberg-McQueen

Draft of a Festvortrag to be given in Potsdam in October 2018, at the ITUG meeting.
This draft includes some rough text in English and some in German. Neither is finished. Any paragraph containing just a “-” or “{}” in the German is intended to be omitted from the oral presentation; it might be supplied in a possible written version.
In the current state of the draft, some topics come up more than once, and some points may be made more than once. This may be a sign of a carefully constructed recurrent theme, which gets set up in one place and alluded to again and again later. It is more likely, however, that the author has merely lost the thread of the discussion and that the paragraphs need to be resequenced (and some paragraphs merged or cut entirely). If you are reading these words before the talk is actually given, it means I've asked you to read the draft, and that you are cordially invited to suggest improvements to the flow.
I have not gathered all the images I want for the slides; in some places where I know I'll want a slide to illustrate a point, I've inserted a dummy slide, typically of a piece of marble.
Right now, I believe that section 2 (the shape of text on the page) is far too long; if sections 3 and 4 were worked out at equal length, the talk would run for two or three hours.
So I expect to try to cut section 2 quite a bit, though there are also some important missing topics I'd like to add. A rough time budget would be
  • 1 Introduction (3 min)
  • 2 The shape of text on the page (15 min)
  • 3 The shape of text in the machine (15 min)
  • 4 The shape of text in our heads (10 min)
  • 5 Conclusion (2 min)
Since it's a lot of fun to look at pages, I might change to
  • 1 Introduction (3 min)
  • 2 The shape of text on the page (20 min)
  • 3 The shape of text in the machine (12 min)
  • 4 The shape of text in our heads (8 min)
  • 5 Conclusion (2 min)
Either way, suggestions on what could usefully be cut would be helpful.

1  Introduction ·  Einleitung

It is an honor to be invited to address you this evening on the occasion of the 25th anniversary of the International Tustep User Group, in the year that Prof. Wilhelm Ott turns 80. My congratulations and best wishes all round.
Es ist eine Ehre, eingeladen zu werden, Ihnen an diesem Abend anläßlich des 25. Jahrestages der Internationalen Tustep User-Gruppe vorzutragen, in dem Jahr, in dem Prof. Wilhelm Ott seinen 80. Geburtstag feiert. Ich gratuliere.
I should confess at the outset that I am not a proficient user of TUSTEP. Improving my TUSTEP skills has perpetually been on my to-do list, and I regret that so far in my career other obligations have prevented me from spending as much time on that task as I would wish. But I have known about TUSTEP almost since the time I began working seriously in what we now call the digital humanities, and I have been deeply impressed with what skilled users can accomplish with this tool. And the knowledge I have of Tustep, incomplete though it is, has influenced me over the years whenever I have had the occasion to work on infrastructure that could benefit digital humanists and other scholars who work with electronic text.
Ich sollte gleich am Anfang bekennen, dass ich kein TUSTEP-Experte bin. Meine TUSTEP-Kenntnisse zu verbessern steht seit Jahren auf der privaten Aufgabenliste, und ich bedaure, dass bis jetzt andere Obliegenheiten es unmöglich gemacht haben, dazu zu kommen. Von Tustep habe ich schon früh in meiner Arbeit in den Digital Humanities erfahren, und das, was Tustep-Könner mit diesem Werkzeug erreichen können, hat mich immer tief beeindruckt. Auch die partielle Kenntnis von Tustep hat meine Arbeit an Infrastruktur für die Digital Humanities und andere wissenschaftliche Gebiete beeinflusst.
Part of what Tustep seems to me to illustrate is what can be achieved if in writing software we make it our first task to meet the requirements of serious textual scholarship. Software for scholarly text processing has an interesting and intimate relation to our understanding of texts as linguistic, historical, and cultural objects: to support textual work effectively, software must embody or reflect a serious understanding of text, and conversely thinking about how our software works and how it models textual structures can help us think about texts and textuality more effectively and can help lead us to a deeper understanding of text.
Unter anderem zeigt Tustep (wie mir scheint), was man erreichen kann, wenn man bei der Herstellung von Software zuallererst versucht, den Anforderungen des wissenschaftlichen Umgangs mit Text gerecht zu werden. Software für wissenschaftliche Textverarbeitung hat eine interessante und intime Beziehung zu unserem Verständnis von Texten als linguistische, historische und kulturelle Objekte: Um Textarbeit effektiv zu unterstützen, muss der Software ein wissenschaftliches Verständnis von Text zugrunde liegen. Umgekehrt kann das Nachdenken darüber, wie unsere Software funktioniert und wie sie textuelle Strukturen modelliert, uns zu einem tieferen Verständnis von Text verhelfen.
That is my topic for this lecture: Text, the shape of text, and the shaping of Text.
Daraus ergibt sich mein heutiges Thema: der Text, die Textgestalt, und die Textgestaltung.
In this day of multimedia, of high-resolution full color scanning, of automatic image recognition, why worry about text?
Warum heute, in dieser Zeit des Multimedialen, des Vollfarb-Scannens in höchster Auflösung, der automatischen Bilderkennung, warum sollten wir uns heute noch mit Text beschäftigen?
The computer scientist Tim Bray, with whom I had the pleasure of collaborating a few years ago, once had a start-up company with the motto “Knowledge is a text-based application.”
Der Informatiker Tim Bray, mit dem ich vor einigen Jahren eng zusammenarbeitete, hatte einmal eine Start-up-Firma mit dem Motto “Das Wissen ist eine textbasierte Anwendung.”
When Alan Turing set himself the question “at what point will it be fair to say that computers can think?” his answer (known now as the Turing Test) involved the exchange of written messages — that is, texts — between the computer and a human being. Turing has been criticized for equating thinking with mere verbal facility — what about visual thinking and other non-verbal forms of intelligence? — but it may be noted that he did not in fact suppose that all thought is verbal. He suggested only that if a machine exhibits sufficient verbal facility, then it will be fair to call what the machine is doing a kind of thinking. (After all, our libraries are full of works by people we call the thinkers of the past, and the main evidence we have of their thinking is the texts they left behind.)
Als sich Alan Turing die Frage stellte "Wann wird es fair sein zu sagen, dass Computer denken können?", war seine Antwort das, was uns jetzt als der Turing-Test bekannt ist. Dieser Test beruht auf dem Austausch von geschriebenen Nachrichten - also Texten - zwischen dem Computer und einem Menschen. Turing wurde dafür kritisiert, das Denken mit bloßer verbaler Fertigkeit gleichzusetzen - es gibt ja das visuelle Denken und andere nichtverbale Formen der Intelligenz! - aber Turing hat doch nicht angenommen, dass alle Gedanken verbal sind. Er schlug nur vor, dass es fair sei, das von der Maschine Geleistete eine Art "Denken" zu nennen, wenn die Maschine eine ausreichende verbale Fähigkeit aufweist. (Schließlich sind unsere Bibliotheken voller Werke von Menschen, die wir die Denker der Vergangenheit nennen, und der Hauptbeweis, dass sie gedacht haben, sind die Texte, die sie hinterlassen haben.)
I do not say that thinking is the same thing as the production or consumption of texts — thinking can also take other forms and use other tools. But it does seem clear that producing texts and consuming them are two activities very prominent in the way most of us go about thinking. It may be possible to think without creating a text, but it is difficult to imagine producing a text without something that we would have to call thought.
Ich sage nicht, dass das Denken dasselbe ist wie die Produktion oder der Konsum von Texten - das Denken kann auch andere Formen annehmen und andere Werkzeuge verwenden. Aber es scheint klar zu sein, dass das Hervorbringen von Texten und das Konsumieren von Texten zwei Tätigkeiten sind, die für die meisten von uns beim Denken sehr wichtig sind. Es mag möglich sein, zu denken, ohne einen Text zu erstellen, aber es ist schwer vorstellbar, einen Text zu produzieren, ohne dass wir etwas vollbringen, das wir als Denken bezeichnen müssten.
(Even texts which we hyperbolically call mindless or thoughtless — whether they are advertising material, bureaucratic prose, or the writings of our political adversaries — involve mental processes which as far as we know exceed the capacity of any non-human animal and of any existing computational device.)
(Auch Texte, die wir hyperbolisch "hirnlos" oder "gedankenlos" nennen - ob es sich um Werbematerial, bürokratische Prosa oder die Schriften unserer politischen Gegner handelt - erfordern mentale Prozesse, die, soweit wir wissen, die Fähigkeiten jedes nichtmenschlichen Tieres und jedes existierenden Computergeräts übersteigen.)
Texts play a central role in the practice of all scientific and scholarly disciplines (not only literature and linguistics) and in the study of their history.
Texte spielen eine zentrale Rolle in der Praxis aller wissenschaftlichen Disziplinen (nicht nur der Literatur und Linguistik) und in der Erforschung ihrer Geschichte.
So today I would like to talk a little bit about text, about the nature of text and how we shape text. It is, I hope, a good way to honor both TUSTEP, one of the most sophisticated and complex tools we have for working with text, and the man to whom we owe its existence.
Deshalb möchte ich heute ein wenig über Text sprechen: über die Natur von Text und darüber, wie wir Text gestalten. Ich hoffe, es ist eine geeignete Art, TUSTEP, eines der raffiniertesten und komplexesten Werkzeuge, die wir für die Arbeit mit Texten haben, und den Mann, dem wir seine Existenz verdanken, zu würdigen.
[Need more / better motivation: what problems does a model of text solve? what is the harm if a model is bad? ]

2  The shape of text on the page ·  Die Textgestalt auf der Seite

If we want to know what text is, what shape it takes, there are worse ways to start than looking at the shapes text takes when transmitted from sender to receiver.
Wenn wir wissen wollen, was Text ist, welche Gestalt er hat, können wir damit einen Anfang machen, indem wir schauen, in welcher Gestalt wir Texte vom Sender zum Empfänger senden.

2.1  The linearity of language ·  Die Linearität der Sprache

If we consider oral texts, we are likely to infer quickly that whatever else may be involved, text is a one-dimensional sequence of some fundamental units or other. We may take the fundamental units to be sentences, or words, or sounds; this affects the details of the analysis, but one feature remains invariant: For every item in the text, whether sentence, word, or phoneme, there is exactly one next item. (There is one exception: if the text is finite, as thus far all human texts have been, then the last item in the text has no next item.) And conversely, for every item (except the first) there is exactly one immediately preceding item.
Wenn wir mündliche Texte in Betracht ziehen, werden wir wahrscheinlich schnell darauf schließen, dass Text, was auch immer er sonst sein mag, eine eindimensionale Abfolge fundamentaler Einheiten sei. Wir können als fundamentale Einheiten Sätze oder Wörter oder Laute ansetzen. Die Details der Analyse werden danach variieren. Eine Eigenschaft aber bleibt invariant: Für jedes Element im Text, ob Satz, Wort oder Phonem, gibt es genau ein nächstes Element. (Eine Ausnahme: wenn der Text endlich ist, wie bisher alle menschlichen Texte, dann hat das letzte Element im Text kein nächstes Element.) Und umgekehrt gibt es für jedes Element (außer dem ersten) genau ein unmittelbar vorhergehendes Element.
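The invariant described above can be put in a few lines of code (a purely illustrative sketch, not TUSTEP; the function names are mine):

```python
# Illustrative sketch: text as a one-dimensional sequence of fundamental
# units. Every unit except the last has exactly one successor, and every
# unit except the first has exactly one predecessor.

def successor(text, i):
    """Index of the next unit, or None for the last unit of a finite text."""
    return i + 1 if i + 1 < len(text) else None

def predecessor(text, i):
    """Index of the preceding unit, or None for the first unit."""
    return i - 1 if i > 0 else None

# The model is the same whether the units are sentences, words, or sounds;
# only the tokenization changes. Here the units are words.
words = "Wenn wir mündliche Texte in Betracht ziehen".split()
```

The same two functions describe the text equally well at the level of sentences or of phonemes; only the way the stream is segmented differs.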
This idea feels natural, almost unavoidable: all natural human languages appear to involve a one-dimensional sequence of sounds, if only because human beings have only one mouth, and the sound signal of human speech unfolds in real time. Sometimes the one-dimensionality is an object of regret, or the object of a critical attack: George Landow's account of hypertext contrasts it with conventional text, which is one-dimensional and which must be read in a single order, from start to finish, without deviation from the path. A similar hostility to one-dimensional text can be found in the work of Ted Nelson. Nelson often appears to blame paper for this one-dimensionality,1 and Landow appears to blame the codex, though in fact the kind of one-dimensionality they talk about is a function of human language rather than a result of its being recorded on paper.2
Diese Idee scheint natürlich, fast unvermeidbar zu sein: Alle Natursprachen scheinen aus eindimensionalen Tonfolgen zu bestehen, schon allein deshalb, weil der Mensch nur einen Mund hat und das Tonsignal der menschlichen Sprache sich in Echtzeit entfaltet. Manche bedauern die Eindimensionalität der Sprache oder machen sie zum Objekt eines kritischen Angriffs: in seiner Arbeit zum Hypertextbegriff kontrastiert George Landow den Hypertext mit konventionellem Text, der eindimensional ist und in einer einzigen Reihenfolge gelesen werden muss, von Anfang bis Ende, ohne Abweichung vom Weg. Eine ähnliche Feindseligkeit gegenüber der Eindimensionalität findet sich in dem Werk Ted Nelsons. Nelson scheint dem Papier, Landow der Kodexform des Buchs die Schuld für die Eindimensionalität zu geben, obwohl die Eindimensionalität, von der sie sprechen, eher eine Funktion der menschlichen Sprache ist als eine Folge ihrer Aufzeichnung auf Papier.
On the contrary, paper offers more opportunities than a single human voice for saying more than one thing ‘at the same time’, and the distinguishing feature of the codex as a book medium is that it provides effectively random access to the content, in contrast to the purely sequential access afforded by the scroll. The shift from magnetic tape to magnetic disk for mass storage of information echoes the earlier shift from scroll to codex, and with many of the same consequences.
Ganz im Gegenteil: das Papier bietet bessere Möglichkeiten als die menschliche Stimme, mehrere Aussagen ‘gleichzeitig’ zu machen, und der Kodex unterscheidet sich als Textträger grundsätzlich dadurch von der Rolle, dass er einen direkten Zugriff auf beliebige Stellen des Inhalts bietet, während die Rolle prinzipiell nur sequentiellen Zugriff bietet. Der Übergang von Magnetbändern zu Platten bei der Speicherung von Grossdaten ähnelt in vielem dem Übergang von der Rolle zum Kodex, und viele der Folgen sind auch überraschend ähnlich.
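The parallel between scroll/codex and tape/disk can be made concrete in a small sketch (the fixed page size, the toy 'book', and the function names are all illustrative assumptions, not a claim about any real storage system):

```python
import io

# Illustrative sketch: sequential access (scroll, magnetic tape) versus
# random access (codex, magnetic disk). A byte stream of fixed-width
# 'pages' stands in for the text carrier; the page size of 10 bytes
# is arbitrary.
PAGE = 10
pages = [f"page {n}".ljust(PAGE) for n in range(1, 101)]
book = io.BytesIO("".join(pages).encode("ascii"))

def read_scroll(stream, page_no):
    """Unroll from the beginning, passing over every earlier page."""
    stream.seek(0)
    for _ in range(page_no - 1):
        stream.read(PAGE)          # pass over one earlier page
    return stream.read(PAGE).decode("ascii").strip()

def read_codex(stream, page_no):
    """Open the book directly at the computed position."""
    stream.seek((page_no - 1) * PAGE)
    return stream.read(PAGE).decode("ascii").strip()
```

Both functions return the same page; the difference lies entirely in the work done to reach it, and that difference is what reference apparatus like page numbers exploits.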

2.2  The non-linearity of the page ·  Die Nichtlinearität der Seite

In this context, it may be informative to look at a written text which reflects this idea fairly directly, as in this 1494 printing of a German translation of Apollonius of Tyre.
In diesem Zusammenhang mag es informativ sein, einen schriftlichen Text zu betrachten, der diese Idee ziemlich direkt widerspiegelt, wie in diesem Druck einer deutschen Übersetzung des Romans von Apollonius von Tyrus aus dem Jahre 1494.
Slide: Apollonius of Tyre 1494
The type on the page is perfectly legible, there is nothing the least bit unconventional about the layout, and yet the page does not look at all like a modern page, even setting aside the use of Fraktur. If text is fundamentally a one-dimensional sequence of characters, why does this page, which treats text as precisely that, not feel more familiar? It may not be immediately obvious just why the page looks foreign to us. The source of our feeling may become more evident if we examine a conventional printed page.
Die Schrift auf der Seite ist gut lesbar, am Layout ist nichts unkonventionell, und dennoch sieht die Seite gar nicht wie eine moderne Seite aus (auch wenn man den Einsatz von Fraktur ausser Acht lässt). Wenn Text im Grunde eine eindimensionale Folge von Zeichen ist, warum fühlt sich diese Seite, die den Text genau in diesem Sinne wiedergibt, nicht vertrauter an? Es ist vielleicht nicht sofort klar, aus welchen Gründen die Seite uns fremd erscheint. Die Quelle unseres Gefühls kann deutlicher werden, wenn wir eine konventionelle gedruckte Seite betrachten.
Slide: standard modern page
Slides needed: closeups of page numbers, running heads, paragraph breaks, and section heads.
Among the things a modern eye misses in the incunabulum (which we can notice because they are present on this page) are
  • page numbers

  • running heads

  • the use of whitespace to mark paragraph breaks

Das moderne Auge vermisst in diesem Wiegendruck unter anderem
  • die Seitenzahlen

  • die lebenden Kolumnen

  • die Gestaltung von Absatzgrenzen durch Zeilenbruch und Einrückung (o.ä.)

2.2.1  Paragraph and section breaks ·  Absatz und Textunterteilung
Slide: paragraphus symbol, modern para break
Modern paragraph breaks make use of whitespace in ways that incunabula often do not: the last line of one paragraph is not filled out, and the new paragraph begins on the next line, after an indentation. Because of the shift to the next line, the modern paragraph break shown here is a two-dimensional phenomenon -- the same goes even more strongly for the section head shown. We can of course shoehorn these into a linear stream of characters, by treating functions like “carriage return”, “line feed”, “vertical tab”, and “horizontal tab” as special kinds of characters which are embedded in the text stream like any other. This is the way Teletype machines handled issues of text formatting. And indeed, all standard coded character sets for computer use reserve space for ‘control characters’ like these.
In modernen Büchern werden Absätze mit Leerraum gestaltet, wie es bei Inkunabeln oft nicht der Fall ist: Die letzte Zeile des einen Absatzes wird nicht ausgefüllt, und der neue Absatz beginnt in der nächsten Zeile nach einer Einrückung. Wegen der Verschiebung in die nächste Zeile ist der hier gezeigte moderne Absatzbruch ein zweidimensionales Phänomen - das gleiche gilt für den abgebildeten Abschnittskopf noch stärker. Wir können diese natürlich in einen linearen Zeichenstrom einfügen, indem wir Funktionen wie "Rücklauf", "Zeilenvorschub", "vertikaler Sprung" und "horizontaler Sprung" als spezielle Arten von Zeichen behandeln, die wie alle anderen in den Textstrom eingebettet sind. Auf diese Weise haben Teletype-Maschinen die Textformatierung gehandhabt. Und in der Tat reservieren alle standardisierten codierten Zeichensätze für den Computergebrauch Platz für "Steuerzeichen" wie diese.
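A tiny sketch of such a Teletype-style stream (the sample sentence is invented; the control codes CR = 13, LF = 10, and HT = 9 are the standard ASCII assignments):

```python
# Illustrative sketch: paragraph layout encoded as ASCII control characters
# embedded in a linear character stream, as a Teletype would receive it.
CR, LF, HT = "\r", "\n", "\t"   # carriage return, line feed, horizontal tab

stream = "End of one paragraph." + CR + LF + HT + "Start of the next."

def control_codes(s):
    """Code points of the non-printing characters hidden in the stream."""
    return [ord(c) for c in s if not c.isprintable()]
```

To the storage model the stream is one undifferentiated sequence of characters; only a rendering device interprets three of its members as two-dimensional movements of the print head rather than as visible glyphs.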
The absence in incunabula of anything resembling a modern paragraph break or section break serves as a useful reminder that our modern typographic expectations are just that — a recent development, not more than a few centuries old. To a large extent, the layout conventions we find natural today require thinking in two dimensions, not only in one.
Dass Wiegendrucke keine modernen Absatzbrüche oder Abschnittsbrüche haben, dient als nützliche Erinnerung daran, dass unsere modernen typografischen Erwartungen genau das sind - eine junge Entwicklung, die nicht mehr als einige Jahrhunderte alt ist. Die Layout-Konventionen, die wir heute als selbstverständlich empfinden, erfordern zu einem großen Teil das Denken in zwei Dimensionen, nicht nur in einer.
We not only render paragraph breaks differently in modern books, we also have, I believe, more of them, just as we have more subdivisions of the text at larger sizes. Literary texts now almost always are divided into chapters and paragraphs; long poems have been divided into cantos and books since Dante and Tasso and Vergil. Homer is also divided into books, but the divisions are a post-Homeric (Alexandrian) innovation, as are the divisions of Beowulf into fittes, and the division of the Nibelungenlied into aventiuren. How does it happen that we now subdivide our texts at fine granularity into chapters and sections and sub-sections? Is it that the regular, unbroken block of text has become so unwelcome that we introduce structural breaks in our texts in order to have an excuse for introducing whitespace on the page?
Wir gestalten nicht nur in modernen Büchern Absatzbrüche anders, wir haben auch, glaube ich, mehr Absatzbrüche, genauso wie wir mehr Unterteilungen des Textes in größeren Einheiten haben. Literarische Texte sind jetzt fast immer in Kapitel und Absätze unterteilt; lange Gedichte sind seit Dante und Tasso und Vergil in Cantos und Bücher aufgeteilt. Homer ist auch in Bücher unterteilt, aber diese Unterteilungen sind eine posthomerische (alexandrinische) Neuerung, ebenso wie die Aufteilung von Beowulf in Fittes und die Aufteilung des Nibelungenliedes in Aventiuren. Wie kommt es, dass wir unsere Texte nun feingranular in Kapitel und Abschnitte und Unterabschnitte unterteilen? Ist der regelmäßige, ungebrochene Textblock so unwillkommen geworden, dass wir strukturelle Brüche in unseren Texten einführen, um eine Entschuldigung für die Einführung von Leerraum auf der Seite zu haben?
2.2.2  Page numbers ·  Die Seitenzahlen
Page numbers illustrate a pervasively important property of printed books: if you and I have copies of the same edition, I can refer you to a particular passage of the work by mentioning its page number.3 This dramatically simplifies the discussion (and thus the study) of texts. Since manuscript copies of a work do not, as a rule, preserve the pagination of the exemplar, such page references are not a prominent feature of scribal cultures, and in order to support references to specific textual locations in a scribal culture it is necessary4 to construct some system for labeling specific passages in a text. The Eusebian canons used in Western Europe to allow references to Biblical passages and the creation of Gospel harmonies and rudimentary concordances are an example of such a system. Only the most important texts will be worth the effort of creating such a system. In a print culture where all pages are numbered, any text will have a convenient reference system.5
Slide: running heads
In books printed nowadays, running heads often identify both the book and the chapter of the book. This is rare (if probably not unheard of) in manuscript books, and it is tempting to speculate that it reflects the relative scarcity or abundance of books. In a monastic community in which one is allowed a single book in one's cell, for reading during silent periods of the day, it may relatively seldom be the case that one wishes to be able to tell what work the book contains, by glancing at a page. If I have just one book in my cell, how hard is it going to be to remember which book it is? It is the reader whose desk is littered with open books who is more obviously in need of such metadata. When one has easy access to many books, there is an obvious convenience in being able to see at a glance which is Aristotle and which is Plutarch and which is Augustine.
The gradual adoption of the title page reflects, I think, the same development: title pages are important for quick identification of books, when there are many lying around. The printers themselves may have been the first beneficiaries of title pages: once printing becomes a business, a printer with several books on hand for sale will find it easier to manage the stock if it's easy to see which book is which. Page numbers also serve a purpose in the printing house, making it a little easier to check that the imposition of the forme is correct and that the quires have been assembled in the right order. The shift from leaf identifiers like Ai, Aii, Aiii, Aiiii to page numbers like 1, 2, 3, 4 may be linked to the dawning recognition that page numbers also have other uses for other readers.
The use of two-dimensional, visually prominent paragraph breaks, the use of running heads to identify chapter or section, and the use of page numbers all tend to make things easier for a reader who is scanning through the book looking for something. They bring less benefit to readers working their way systematically through the book (though they do help give us a sense of our progress through the book).
2.2.3  Are these peripheral phenomena? ·  Wesentlich oder unwesentlich?
Neither page numbers nor running heads fit comfortably into the model of text as a one-dimensional stream of words without further internal structure. We can accommodate them into a mechanical representation of text only if we are somehow able to distinguish them from the other material in the text stream. The method most widely used today is some sort of inline markup (XML, HTML, wiki markup, or other).
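A minimal sketch of such inline markup, loosely in the style of the TEI's forme-work elements (the element and attribute names here are illustrative, not a complete or validated TEI document):

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: inline markup that distinguishes running head and
# page number from the text proper, so that both can live in one stream.
page = (
    '<div>'
    '<fw type="header">Apollonius of Tyre</fw>'
    '<fw type="pageNum">42</fw>'
    '<p>the text stream itself continues here</p>'
    '</div>'
)
root = ET.fromstring(page)

def text_stream(elem):
    """Recover the bare text stream by keeping only the p elements."""
    return " ".join(p.text for p in elem.iter("p"))

def forme_work(elem):
    """List what is marked as belonging to the page, not to the text."""
    return [(fw.get("type"), fw.text) for fw in elem.iter("fw")]
```

The markup lets one and the same linear file answer two different questions: what does the text say, and what does the page carry?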
We could, perhaps, dismiss page numbers and running heads and the modern style of paragraph breaks and section headings as peripheral issues. Yes, of course, when we print a text on multiple pages it can make sense to label the pages. But (we might argue) it is the pages being labeled in this way, not the text: the page numbers and running heads are artefacts of the material representation of the text, and not part of the text itself. (We do not insist that the page numbers remain the same in a new edition of a text.)
There is no logical flaw in this view; it's merely a question about the point at which we draw a boundary between the text and the cultural practices by means of which we create, preserve, disseminate, and consume texts. It can be useful to distinguish clearly between texts and books. But if we are interested in texts, and if we are interested in software to help us produce, manage, or study texts, then we will also be interested in books, and other forms in which texts can be represented, and our software will need to know about books as well as about texts, or else its depiction of texts will always feel slightly awkward and foreign to us, in much the same way as that 1494 printing of Apollonius of Tyre. That is, we can narrow the meaning of the word text, if we like, to exclude things like page numbers, but then we will need other broader terms to describe what we are interested in, if we want to understand texts as cultural objects.

2.3  More complex pages (and texts) ·  Komplexere Texte, komplexere Seiten

2.3.1  Annotation, parallel streams, apparatus ·  Anmerkungen, Paralleltexte, apparatus criticus
The idea that text is a simple sequence of words or characters becomes even less tenable when we examine texts that are important or central cultural objects. In almost any culture, important texts acquire commentaries. Often they acquire multiple commentaries and levels of annotation, as illustrated by this page of a [fifteenth-century Venetian ? check, confirm, or omit] edition of the Bible (in Latin), together with an interlinear gloss, the glossa ordinaria attributed to Walafrid Strabo, and the Postilla of Nicolaus de Lyra, one of the most popular of medieval Bible commentaries.
Glossa ordinaria
If one sticks a pin into this page at a random position, what is the likelihood that the ‘next’ word in the text is the one immediately to the right of the pin-prick, on the same line? The model of text as a one-dimensional sequence has some difficulties with page numbers and running heads, but it implodes completely when confronted with a work like this one.
In this page of the glossa ordinaria, we can distinguish seven text streams which must be synchronized page by page:
  • the base text of the Bible

  • an interlinear gloss, explaining the words or constructs of the text

  • in the inner margins, letters serving as references to corresponding passages in the glossa ordinaria or the commentary of Nicolas de Lyra

  • the glossa ordinaria

  • the commentary of Nicolas de Lyra

  • in the outer margins, references to parallel passages in other books of the Bible (marked by intratextual symbols in the Biblical text)

  • also in the outer margins, bibliographic references to authors and works mentioned in the glossa ordinaria or by Nicolas de Lyra

Slide: Complutensian Bible 1522
In the case of the glossa ordinaria, there is a clear primacy of one text over the others: the Bible text is central (both intellectually and topographically), the others peripheral. But in other cases (here the Complutensian Polyglot Bible of 1522), what we have are multiple versions of what is, in some sense, the same text (here, the book of Deuteronomy).
These examples of complex works suggest that for purposes of study, we may find it useful to think of text not as a one-dimensional stream of characters but as a bundle of one or more such streams, which are synchronized at various points. The trick for laying such material out on the page is to ensure that any given synchronization point falls on the same page for all of the parallel streams. This is a complex task, to put it mildly, which may be one reason there are so few books which work with as many parallel streams as the Complutensian Polyglot.
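The 'bundle of synchronized streams' model just described can be sketched as a simple data structure (the stream names and sample segments below are invented for illustration):

```python
# Illustrative sketch: several parallel text streams, each divided into
# segments, aligned at numbered synchronization points.
streams = {
    "biblia": {1: "In principio creavit Deus ...", 2: "Et vidit Deus ..."},
    "glossa": {1: "id est, in Filio ...",          2: "quia bona sunt ..."},
    "lyra":   {1: "Hic notandum quod ...",         2: "Secundum Hebraeos ..."},
}

def page_for(sync_point):
    """Collect, for one synchronization point, the segment of every
    stream that the page layout must keep together on the same page."""
    return {name: segments[sync_point] for name, segments in streams.items()}
```

In these terms the layout problem is: choose page breaks so that the segments returned by page_for(n) never straddle a page boundary for any of the streams.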
Slide: critical edition
If even the simplest conventional modern printed books are hardly one-dimensional, we should perhaps not be surprised to find that more complex texts are even less one-dimensional, nor that scholarly editions, intended to support serious work with the text, provide good examples of typographic complexity. Conventional printed scholarly editions provide a number of different kinds of information in addition to the base text, and rely on typographic form to distinguish the kinds of information and the interrelations:
  • line numbers

  • prose annotation

  • apparatus criticus (sometimes there are several apparatus, each devoted to a different class of information, sometimes by classes of text witnesses and sometimes by type of textual variation)

I describe this page as complex by comparison with other modern books; it is of course not a particularly complex page by Tustep standards.
2.3.2  Textual variation ·  Textvariation
An extremely rich source of complexity for texts is the inescapable (if often unwelcome) fact that texts vary. They shift, they change, they twist and turn. No text extant in more than one text witness will be reducible to a single sequence of words, because when there are multiple witnesses, the witnesses will not all agree on the exact sequence of words.
[Discussion of ways to think about textual variation.] [Need simple example, preferably in NHG. Failing that, using MHG / Minnesangs Frühling. Wolfram's Tagelied? Unter den linden?]
[Complete parallel texts.]
[Zeilensynoptik - line by line overview of variation.]
[Rhine Delta data structure / variant graph.]
[Parallel segmentation (Thaller).]
2.3.3  Relative cost of stability and change ·  Stabilität und Änderung
[Note: I rather like this comparison, but unless it has more to do with the topic than appears at the moment, it probably needs to be cut for time. If kept, it should be compressed.]
One of the key contributions of printing to textual scholarship was that it contributed decisively to textual stability.6 Printing pushes texts toward fixity by making it possible to disseminate multiple copies of a given work carrying the identical text (or, in the case of stop-press variants, nearly identical texts). If we measure the rate of textual change in textual variants per copy of a work, printing reduces the rate drastically -- not to zero, but to a much lower rate than typical manuscript traditions exhibit. Textual criticism was conceptually possible before printing, but the results were difficult to disseminate because manual copying of a purified text will introduce new textual changes, and eventually the details of the text-critical work will be lost. Textual criticism is somewhat easier in print culture, because its results are more durable.
In an oral culture, a story will often be expressed in different words each time it is told; as some of you will recall, the classicists Milman Parry and A.B. Lord observed this phenomenon among the epic singers of the Balkans and developed a theory of formulaic poetic diction to explain how they were able to improvise metrical narrative at the speed of performance. The gist of the story is typically (relatively) stable, but the words will change from singer to singer and from performance to performance. It is easier to cast the story in our own words than to remember the exact words used when we heard the story ourselves. (Short texts, like lyric poetry, form an obvious exception to this generalization.)
In a scribal culture, it is much less difficult to reproduce the wording of the exemplar than in oral transmission: after all, the exemplar is right there in front of the scribe, and no great feat of memory is required. Actually, fidelity to the words of the exemplar is perhaps slightly easier than changing them, because changing the words requires thinking a bit about how to phrase the sentence. Reproducing a word or sentence in one's own dialect, on the other hand, is perhaps a little easier than preserving the details of a foreign dialect (as well as, perhaps, less useful). So from a mental point of view, preserving the words of the exemplar and changing them are much more evenly balanced than in the oral situation. From the physical point of view, it takes the same amount of physical effort to write out the words of the exemplar without change as it does to change them -- except that omitting a word is less work than copying it. The net effect appears to be consistent with what we find in our libraries: scribal copies are typically close enough to their exemplars that we recognize them effortlessly as carrying the same text, but absolute fidelity to the exemplar is rare — essentially non-existent.
Printing changes this equation dramatically: it is much easier to pull one more sheet through the press without changing the text than it is to stop the press, make a correction to the text, re-tighten the forme, and print the next sheet — and that, in turn, is much simpler and faster than setting the text again from scratch. That is, the relative cost of making multiple copies which are textually identical is much lower than the cost of making multiple copies each different from the others.7 The result is that texts, while still subject to change from edition to edition, tend to have much more fixity in print than in manuscript.
The advent of digital media changes the equation again, to something more like the scribal or even oral state: the physical cost of making a copy is the same for faithful and for modified copies: the economic forces that pushed printed texts to fixity do not apply. It requires less thought to make a copy without introducing new changes, so there is still some pressure toward fixity. But a great deal depends on the context: the first dictionaries to be available in electronic form were valuable resources for many people working in what we now call computational linguistics, and in order to make the dictionary work properly with the system in use, computational linguists routinely reformatted the dictionaries in various ways, and added new information needed for the work they were doing. The result is that the textual history of the electronic dictionaries that were shared from project to project in the 1970s and 1980s looks a lot more like the textual history of a manuscript transmission than that of a purely printed work: for some early electronic dictionaries, essentially no two electronic copies of the work are the same.

3  The shape of text in the machine ·  Die Textgestalt innerhalb der Maschine

If we want to work with text using software, that software needs to have some notion of what text is. Looking at possible and actual representations for text in electronic form can be a useful way of thinking about what text is.

3.1  Characters and character sets ·  Zeichen und Zeichensätze

Picture of punched card
When scholars first began using information-processing equipment (punched cards, and later computers) for work with texts, the most pressing issue for text representation was finding ways to represent the characters that occur in texts but not on the keyboards of punched-card machines.
List of characters on IBM 026, 029
‘Special’ characters, when I began working with computers, were any that required special treatment. Since punched cards were limited to the 26 letters of the uppercase Latin alphabet used in standard English, A through Z, even the distinction between upper-case letters and lower-case letters required special measures for some early projects. Such special measures, of course, required more effort, which cost money or time or both. For these reasons, the Trésor de la langue française, for example, decided against marking the difference between uppercase and lowercase letters: their corpus was intended to support lexicographic work on the dictionary of French being created by the French Academy, and since they expected to re-check every citation against a good printed edition before including it in an entry, they saw no need for the electronic text, intended solely for project-internal use, to carry case distinctions. As some of you will remember and others can imagine, the situation was even more challenging for characters with diacritics, like ä, ö, ü, or for characters used in some Latin scripts but not in English, like the ess-zett of German or the eth and thorn of Icelandic. The development of standardized encodings for Greek, Hebrew, Arabic, and other scripts took even longer.
The development of coded character set standards was slow, and commercial interests resisted the standardization of characters they felt were not important for contemporary commercial use. They objected, for example, to proposals to provide standard representations for the Cyrillic characters whose use was abolished shortly after the Bolshevik Revolution in Russia.
Picture of TLG text (Greek and Beta)
Under those circumstances, scholars interested in electronic transcription of older texts and texts in minority languages found it necessary to find their own ways to represent the characters they needed, via transliteration (as in the Beta Code used in the Thesaurus Linguae Graecae) or by replacing a given ‘special’ character with a longer sequence of characters which would not (one hoped) appear in naturally occurring texts. Tustep's representations for Greek, Hebrew, and rarer Latin characters fall into this class; Tustep developed, early in its history, a larger inventory of predefined characters than any other scholarly text-processing software I'm familiar with. (Other tools left the users to their own devices.)
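The basic idea of a transliteration scheme like Beta Code can be sketched as follows. This toy sketch covers only the 24 base letters, ignoring Beta Code's diacritic markers, capitalization escape, and the final-sigma distinction; it is an illustration of the technique, not an implementation of the TLG's actual conventions.

```python
# Toy subset of a Beta-Code-style transliteration: each ASCII letter
# stands for one Greek letter. Diacritics, capitals, and final sigma
# are deliberately omitted from this sketch.
BETA = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW",
                "αβγδεζηθικλμνξοπρστυφχψω"))

def beta_to_greek(s):
    """Map each Beta-Code base letter to its Greek equivalent;
    pass through anything not in the table (spaces, punctuation)."""
    return "".join(BETA.get(ch, ch) for ch in s.upper())

print(beta_to_greek("LOGOS"))   # λογοσ
```

The point of such schemes is precisely that the encoded form uses only characters available on every keyboard and in every character set of the day.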
Picture of Tustep text (Greek and Tustep)
Picture of Mayan?
Character set standardization is important, and the development of the Universal Character Set (UCS) of ISO 10646 / Unicode is a huge advance for everyone interested in text processing by machine. But if we are interested in ensuring that our tools can be applied to texts from all languages, from all periods, and for all scholarly purposes — and as digital humanists, those goals are natural enough — then it is essential to recognize that no finite enumeration of characters will ever suffice to allow us to achieve those aims. There are writing systems we do not yet fully understand.
Slide showing six forms of "a" from Menota
For certain kinds of paleographic work we must sometimes preserve information about letter shapes which would normally be omitted from our electronic representations (which normally work at or near the graphemic level). The character inventories used in the Medieval Nordic Text Archive (MeNoTA) distinguish not just between insular or curved d and erect (or Continental) d, but four different forms of lowercase e, six different forms of lowercase a, and nine forms of lowercase f.
[Delete?] If our text processing tools are well grounded, we should very seldom need to define new characters for our work, because the characters we need will already be part of the system. (At least, if we are using Tustep.) But tools intended for scholarly use must also allow us to add more characters, when and as we need them. It is for these reasons that XML processors are required to accept characters in the Unicode ‘Private Use Area’, even though this thought scandalizes some people. And it is for these reasons that TUSTEP developed, so early in its history, such a broad coverage of the historical Latin, Greek, and Hebrew alphabets, but also allows user-defined characters when that is essential. [Delete?]
[Verify this claim about TUSTEP.]

3.2  OHCO ·  OHCO

Slide showing O H C O as acronym
As regards the organization of the text above the character level, over the decades a variety of data formats were developed, each reflecting in its way a particular conception of text and of the appropriate operations on text which need to be supported. I will not attempt a historical survey of these, although I believe the hermeneutics of data formats is a rewarding exercise, and the examination of such formats can sometimes help shed light on the nature of text (if only in a negative way), apart from their intrinsic interest.
Instead, I would like to skip forward several decades and consider what is still, I think, the most serious attempt so far to describe concisely a general model of text. I mean the model offered by Steven DeRose, David Durand, Elli Mylonas, and Allen Renear, then all of Brown University, under the title “What is text, really?” (‘Was ist Text, eigentlich?’).8
Slide showing quotation: “we think that our point can be scarcely overstated: text is an ordered hierarchy of content objects.”
As many of you may know, the authors define a model of text as an “ordered hierarchy of content objects” (‘geordnete Hierarchie von Inhaltsobjekten’), widely discussed since as “the OHCO model”. The authors briefly describe a number of alternative models of text, and argue for the superiority of their model, before describing some remaining challenges for the future. The crucial claim associated with the paper (and widely called “the OHCO thesis” or “the OHCO hypothesis”) is: “we think that our point can be scarcely overstated: text is an ordered hierarchy of content objects.” The OHCO model of text is clearly an attempt by the authors to explain the virtues they see in the Standard Generalized Markup Language (SGML) -- and now in the successor to SGML, called the Extensible Markup Language (XML) — and the OHCO model is frequently described as the philosophical underpinning of SGML and XML.
The idea that a text of any complexity can be reduced to a simple hierarchical arrangement of parts strikes many students of text as ludicrous, so the OHCO model, and the SGML and XML specs which are thought to be based on it, have been widely criticized as simplistic, uninformed, even authoritarian. (Some humanists have a reflexive distrust of anything called a hierarchy.) Any book chosen at random will show examples of structural units which do not nest properly and thus form no hierarchy: texts are divided into pages (whose cultural significance we have already noted) and also into paragraphs, which quite often extend across page boundaries. But if the elements don't nest, they don't form a hierarchy.
I would like to defend both the OHCO model and the SGML and XML specifications against this criticism.
Slide showing standard average text structure in an XML document: sections, paragraphs, ...
It is probably true that the OHCO model provides a pretty good match for the approach to text encoding and processing used by most early adopters of SGML and XML. The developers of SGML and XML vocabularies identify different kinds of textual objects, and (in the simple case) represent them as SGML or XML elements, which nest within each other and thus form a containment hierarchy. In a document production environment, pages are the result of processing, not the input to processing, and as the document changes from version to version the pagination can and should change. So early adopters of SGML routinely marked paragraphs, but not pages, as structural units of text.
But the match is not perfect. To believe that SGML and XML are useful in practice, it is not necessary to claim that text by nature is an ordered hierarchy of content objects. Nothing in either the XML or SGML specifications makes any such claim, explicitly or implicitly. To find these technologies useful, it suffices to believe that whatever the underlying Platonic reality of text (assuming it has one, which of course Aristotelians and post-modern constructivists can doubt), text can usefully be modeled as an ordered hierarchy of content objects.
And that texts can usefully be modeled using SGML and XML elements is, perhaps, adequately demonstrated by thirty years of successful SGML applications. Se non è vero, è ben trovato.
Slide showing concurrent structures. (Peer Gynt? Shakespeare? Götz von Berlichingen?)
The SGML spec, moreover, defines some concepts which directly contradict the claim that a text is a single hierarchy of objects: the optional feature CONCUR can be used to allow multiple element hierarchies to be defined in a document. If we wish to record not just the play / act / scene / line hierarchy of Shakespeare's plays but also the page and gathering boundaries of the First Folio (since the correct interpretation of the spelling and text layout in the First Folio varies from page to page), we can do so with two parallel hierarchies.
It's also true that even without CONCUR, it is an oversimplification to believe that SGML (or XML) define structure only using nesting elements. Both specs define a mechanism for assigning identifiers (IDs) to elements and referring to those identifiers from elsewhere. (HTML hypertext anchors are perhaps the most widely known use of this idea.) The fundamental data structure most suitable for reasoning about SGML and XML documents is not the tree, but the directed graph. The element structure forms an often useful substructure of this graph.
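The way ID/IDREF links turn the element tree into a directed graph can be sketched concretely. The tiny vocabulary below (p, note, ref, target) is invented for illustration; it is not TEI or any real schema. If we collect both containment edges and reference edges, one node ends up with two incoming arcs, which no tree allows:

```python
import xml.etree.ElementTree as ET

# A minimal sketch (invented vocabulary, not a real schema): a note is
# contained in one paragraph and referenced from another, so the combined
# structure of containment plus ID/IDREF links is a graph, not a tree.
doc = ET.fromstring(
    '<text>'
    '<p xml:id="p1">First paragraph.<note xml:id="n1">A note.</note></p>'
    '<p xml:id="p2">See note <ref target="n1"/>.</p>'
    '</text>'
)

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

edges = []                      # (source, target) pairs
for el in doc.iter():
    src = el.get(XML_ID, el.tag)
    for child in el:
        edges.append((src, child.get(XML_ID, child.tag)))   # containment
    if el.get('target'):
        edges.append((src, el.get('target')))               # ID/IDREF link

# n1 has two incoming edges: one from its parent p1, one from the ref.
incoming = [s for (s, t) in edges if t == 'n1']
print(sorted(incoming))   # ['p1', 'ref']
```

In the element tree alone, every node except the root has exactly one incoming edge; the reference edges are what make the richer graph structure.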
Slide showing XML structure including cross reference links.
If SGML had been based on the assumption that a text has, by nature, one hierarchical set of content objects, then the existence of the CONCUR feature and of ID/IDREF links would be impossible to explain.
It should also be pointed out that while the authors of the OHCO paper do speak of text as an ordered hierarchy of objects, their concluding remarks include the observation that “many documents have more than one useful structure”. Since they obviously do not regard this observation as undermining their earlier claim that “text is an ordered hierarchy of content objects”, it seems safe to infer that by “an” ordered hierarchy they meant “at least one” ordered hierarchy, not “at most one” or “exactly one” ordered hierarchy.
While SGML and XML appear to be the focus of interest for both the authors of the OHCO paper and for the critics of that paper, the model of text as an ordered hierarchy of objects has a number of other instantiations which should be mentioned here at least in passing.
The software system developed in the 1960s by Douglas Engelbart at the Stanford Research Institute (NLS for "oNLine System", later re-christened "Augment") structured documents using a generic three-level hierarchy: unlike SGML, Engelbart's system did not label nodes in the tree with a type name (like "chapter", "chapter-title", or "paragraph"), and it did not allow nodes to be nested arbitrarily deeply. But it implemented a model of text as an ordered hierarchy of anonymous objects. One of the first widely used PC-based concordance systems, Word Cruncher, also used a three-level organization for texts. This made it easy for Word Cruncher to handle texts like Shakespeare plays, where passages are conventionally identified by act, scene, and line number, or like the Bible, where passages are identified by book, chapter, and verse.
But perhaps the instantiation of this model most familiar to this audience is TUSTEP itself: the organization of a Tustep file into numbered pages and lines is a good example of the model. And the ways in which Tustep can work with pages and lines and their numbers illustrate how a more powerful model of text makes it possible to write more powerful software.

3.3  Variations on OHCO ·  OHCO-Variante

One informative way to learn from the OHCO hypothesis is to consider each part of its description in turn, and ask what happens to the concept of text if we modify that part of the phrase.
3.3.1  Unordered hierarchy? ·  Nichtgeordnete Hierarchie?
Ordered hierarchy of content objects
ordered. What would result if we assumed an unordered hierarchy of content objects? The answer is surprisingly clear: the result is a database management system supporting a hierarchical model. The object-oriented databases of the 1980s and 1990s come to mind. The model is interesting, but it is pretty clearly not a model of text. For at least the last forty years, with the rise of the relational database model, databases have explicitly disregarded the sequence of records in a table and the sequence of columns in a row.
But texts do not disregard sequence in that way: we cannot rearrange the chapters of a novel without changing the novel. We cannot reorder the words of a poem without destroying the poem. “Heide auf Röslein Röslein rot Röslein der Röslein” is not the opening line of a poem by Goethe.
Some novels have multiple reading sequences: the Argentinian modernist Julio Cortázar's Rayuela (‘Hopscotch’ [dt. ‘Himmel-und-Hölle’ oder ‘Hickelkasten’]), for example, illustrates in part the principle that assigning a different order to the chapters (and their events) makes a different story.
3.3.2  Ordered hierarchy of ... ? ·  Geordnete Hierarchie von ... ?
Ordered hierarchy of content objects
content objects. What would result if we assumed that text were not composed of content objects? Would it be different? What would it be composed of?
I find it hard to answer this question, because the authors offer no definition of the term content object, and I don't know with certainty what they meant by this phrase.
Perhaps they mean ‘elements of the logical structure’ of the text, as opposed to the elements of the layout structure (so: paragraphs, not pages, for the typical prose document). This would be consistent with the rhetoric of most proponents of SGML at the time they were writing.
If text were not a hierarchy of content objects it might then be in contrast a hierarchy of page-layout or rendering objects. This, I believe, is the model of text implicit in PDF, or of TeX device-independent files, and might be visible in the internal data structures of software for those formats.
It is explicitly the model of the World Wide Web Consortium's XSL Formatting Objects specification. The XSL FO specification defines a set of XML elements which correspond to regions on a page, which are to be filled in a particular way with formatted text.
In recent decades, there have been scholars who have argued that in order to interpret literary texts we must study them in their original physical presentation; they may find meaning in the page layout and the choice of paper in the first editions of William Butler Yeats, and they argue strenuously that the poems of William Blake cannot be divorced from the script and the drawings in the engravings in which the poems were first presented. Some people identify this school of criticism as being interested in the materiality of text. The model of text as a hierarchy of layout objects — as, in effect, a sequence of pages — will appeal to those interested in materiality in this sense.
As a medievalist, I am also interested in the materiality of text: a good understanding of the material conditions of textual transmission is helpful both in textual criticism and in understanding how literature was produced and consumed. But I am reluctant to identify text as a structure (whether hierarchy, sequence, or something else) of typographic or more generally text-rendering objects. When we have no authorial manuscripts, when instead we have thirty-four manuscripts with different page layouts, which of the thirty-four manuscripts will be the text? Is the Nibelungenlied one literary work, or thirty-four? Materiality is of great interest, for medievalists as for others. But precisely because there is almost never a single, authoritative witness to any ancient or medieval work (they may be single, but seldom authoritative), medievalists and classicists find it useful to distinguish systematically between text (an abstract object) and text carriers (physical objects that instantiate the text). It is helpful to be able to encode pages and other material objects, in addition to but not instead of the more abstract and more purely textual objects like cantos, stanzas, or paragraphs.
Perhaps by “content objects” DeRose et al. mean merely ‘the objects which occur in the content’ (of the text, or of other objects), or ‘the objects which carry or realize the content of the text’. That would, it appears, leave open the identification of just what objects, and what kinds of objects, those are. This would be consistent with the fact that in SGML and XML, it is the user (or the designer of the SGML/XML vocabulary), not the specification or the software, who determines what kinds of objects will be recorded as occurring in the text. If we are interested in the materiality of our exemplar, all we need to do is define its content objects as material ones. This is why XML can be and is used for the description of page layout or other specialized kinds of structures.
If we think of text not as a structure of content objects, and not as a structure of objects at all, could we still have a model of text? In many computing contexts, we face the choice between representing information as data and representing it as a function or procedure. There have certainly been proposals to model text as procedures, but they have not been well received. Procedures are notoriously harder to think about than data, and it is difficult in many cases to determine, from examining two procedures, whether they calculate the same result or not. It can be difficult to define general conditions for regarding two texts as the same text (with some textual variation) or as different (but similar or overlapping) texts. But most textual scholars think the question is meaningful; it would I think be a mistake to adopt a model of text-as-procedure which makes it unanswerable in principle.
3.3.3  Not a hierarchy? ·  Eine Nichthierarchie?
Ordered (?) hierarchy of content objects
hierarchy. What if text were not arranged in a hierarchical structure, not a tree? What if it had some other form of organization? What other forms of organization are plausible candidates?
The linguist Igor Mel'cuk observes, in a discussion of modeling human language, that hierarchical structures — that is, trees — occupy an interesting middle ground between completely unconstrained directed graphs and the most tightly constrained of directed graphs. “A [semantic representation] is an (almost) arbitrary network, a graph with practically no formal restrictions imposed on it. On the contrary a [phonetic representation] is obviously a string of phonetic symbols, i.e. a chain, the simplest of all possible graphs, with the maximum of restrictions imposed on it ... To establish a many-to-many mapping between arbitrary networks and chains, a convenient bridge is needed, that is, a graph formally situated halfway between arbitrary networks, on the one hand, and chains, on the other. This happens to be a tree, a formal entity traditionally used to depict sentence structures ...”
If we wish to think about alternatives to tree structures for our model of text, we can usefully turn our attention first to the family of models which impose tighter restrictions, modeling text as a sequence (a chain, in Mel'cuk's terminology). Note that a sequence is what we get if we constrain a tree with the rule that every node has at most one child: such a tree has exactly one leaf, and every node except that leaf has exactly one child, just as in a sequence every element but the last has exactly one successor.
3.3.3.1  Ordered sequence of content objects? ·  Geordnete Folge von Inhaltsobjekten?
We have already encountered the idea of text — indeed, of all language — as a sequence.
[Delete?] Some linguists argue that treating spoken utterances as a one-dimensional sequence of phonemes is an over-simplification: among other things it omits pauses, changes of pitch, and changes of speed. (In conventional writing, punctuation supplies partial information about these things, but not nearly enough to represent the prosody of an utterance completely. Naturally occurring writing systems are only as unambiguous as they need to be, to allow written messages to be deciphered reliably. They need not, and typically do not, offer anything like a complete record of an utterance.) Sign languages, meanwhile, are not spoken with a single mouth but with two hands, a body, and a face; it is possible to reduce them to a single sequence of written symbols, but the process resembles the multiplexing of several channels of information into a single channel: the existence of a sequential written form for a language does not mean the language is best modeled as a one-dimensional stream of words, morphemes, or phonemes. [Delete?]
A wide variety of sequence models have been defined or implemented; they all model text as “a sequence of ... X”, but they provide different values of X.
Slide to illustrate character sequence (? how?)
Perhaps the simplest model of this class identifies text as a sequence of characters. This model is at the heart of text editors like emacs or vi on Unix systems, and ‘programmer's editors’ of various kinds on other operating systems.
It is also visible in the string datatypes of our programming languages, in standard email (I mean email without HTML markup), and in many Unix utilities (where it competes with the alternative model of text as a series of lines).
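The character-sequence model can be sketched in a few lines. In this model the whole text is a single string, positions are character offsets, every edit is a splice at an offset, and line structure is derivative rather than fundamental (the verse lines below reuse the Boiardo mockup quoted later in this section):

```python
# A minimal sketch of the character-sequence model: one string, integer
# offsets, and a single splice operation from which insertion and
# deletion both follow. Lines are just substrings between newlines.
text = "Signori e cavallier che ve adunati\nper odir cose dilettose e nove,"

def splice(s, start, end, replacement=""):
    """Replace the characters s[start:end]; the only edit operation needed."""
    return s[:start] + replacement + s[end:]

text2 = splice(text, 0, 0, "BOIARDO: ")                # insertion at offset 0
pos = text2.find("nove")                               # search yields an offset,
text3 = splice(text2, pos, pos + len("nove"), "NOVE")  # not a line number

print(text3.splitlines()[1])   # per odir cose dilettose e NOVE,
```

Editors like emacs present exactly this interface: point and mark are offsets into one long character sequence, and the display as lines is computed from the newline characters.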
Slide to illustrate typical punched-card format (mockup of Boiardo; stolen from Bologna 2007)
BO101011SIGNORI E CAVALLIER CHE VE ADUNATI
BO101012PER ODIR COSE DILETTOSE E NOVE,
BO101013STATI ATTENTI E QUI%ETI, ET ASCOLTATI
BO101014LA BELLA ISTORIA CHE 'L MIO CANTO MUOVE;
...
Perhaps the earliest model, historically speaking, represents text as a sequence of punched cards, with as many words on each card as feasible, or for verse texts, with one verse line per card. In the Pfeffer Corpus of modern spoken German, each card showed the use of one word in the context of a sentence; when the sentence was more than 80 characters long, elisions were used to make it fit. [No, I do not have an example; cited from memory.]
After punched cards ceased to be used, the model of text as a sequence of lines remained prominent, if only in the data structures of text editors, which represented text files as arrays of line pointers. It is also visible in a great many Unix command-line tools like grep, which searches for strings of characters in a file and returns all the lines which match the given pattern. (And elsewhere?)
Beginning in the 1960s, programs began to be written to allow documents to be formatted for printing by the computer's printer. Most such programs treated their input as a sequence of lines, which the program divided into two groups: some lines contained the text to be printed, and others contained processing instructions to guide how the text should be printed. In the programs Runoff, roff, nroff, troff, and Script, versions of which were available for essentially every mainframe computer, formatting instructions were identified by a full stop occurring in the first column of the line (on the theory that in natural language no word begins with a full stop; this is not quite true for English, it turns out, but it is easy enough for the user to avoid beginning an input line with such a word).
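The dispatch logic of these batch formatters can be sketched as follows. The command names (.pp, .br) and the input are purely illustrative; the point is only the column-one convention that sorts each input line into one of the two groups:

```python
# A minimal sketch of the runoff/roff input model: each line is either a
# formatting command (full stop in column one) or text to be filled.
# The command names here are illustrative, not any real formatter's set.
source = """\
.pp
Signori e cavallier che ve adunati
per odir cose dilettose e nove,
.br
stati attenti e quieti, et ascoltati
"""

commands, words = [], []
for line in source.splitlines():
    if line.startswith("."):      # column-1 full stop marks a command line
        commands.append(line[1:])
    else:                         # everything else is data to be set
        words.extend(line.split())

print(commands)      # ['pp', 'br']
print(len(words))    # 18
```

Note how completely the model depends on the division of the input into lines: remove the line breaks and the distinction between command and text collapses.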
The most recent of the batch formatting programs, Donald Knuth's TeX, retains the division of the text into data and commands, but makes commands recognizable in other ways, thus eliminating the structural significance of line breaks in the input. Its model is (oversimplifying slightly) that text is a sequence of characters, interspersed with formatting commands.
Slide of two-dimensional text (to illustrate Busa)
An interesting variant on the card or line model was developed early on by Roberto Busa. In it, each card is devoted to a single word, which is recorded (in its surface form) in the initial columns of the card. Later columns of the card, in a kind of tabular format, provide other information about the word: perhaps its lemma or lookup form, perhaps its grammatical features, perhaps its syntactic function or the word number of the word it depends on (adjectives point to their nouns, subjects and objects to their verb, etc.). Because the conventional text can be read vertically down the beginnings of the lines, this format is sometimes called a ‘vertical text’ [citation of Anderson?]; but because the additional information on each card provides a second dimension orthogonal to the first (what Busa called the “internal hypertext” of the word), I think it might better be called two-dimensional text.
Nowadays, the most widespread sequential model of text in computing, the one most human beings use for drafting and revising documents, is the common word processor model. In this model, a text is a sequence of paragraphs (or paragraph-like objects), and paragraphs are sequences of characters. In this model, different paragraphs may be styled differently: they may have different margins, different styles of indentation (flush left, flush right, centered), different leading [Durchschuss], and different default character stylings. Within a paragraph, different characters (or sequences of characters) can be assigned a particular font, font size, or treatment.
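The word-processor model can be made concrete with a small sketch. The class and style names below are invented for illustration; the point is the shape of the data: a flat sequence of styled paragraphs, each a sequence of styled character runs, with nothing larger than a paragraph in the structure.

```python
from dataclasses import dataclass, field

# A minimal sketch of the word-processor model (invented names): a
# document is a flat list of paragraphs; each paragraph has a style and
# a sequence of character runs with their own formatting. Note that a
# 'Heading 1' is just another paragraph style, not a container.
@dataclass
class Run:
    text: str
    bold: bool = False
    italic: bool = False

@dataclass
class Paragraph:
    style: str                          # e.g. 'Heading 1', 'Body Text'
    runs: list = field(default_factory=list)

doc = [
    Paragraph('Heading 1', [Run('Chapter the First')]),
    Paragraph('Body Text', [Run('It was a '), Run('dark', italic=True),
                            Run(' and stormy night.')]),
]

# Styling queries are easy; structural queries are impossible: we can ask
# which runs are italic, but not which paragraphs are 'in' the chapter.
italics = [r.text for p in doc for r in p.runs if r.italic]
print(italics)    # ['dark']
```

The chapter simply does not exist as an object in this model, which is exactly the drawback discussed in the next paragraphs.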
Although the style menu in a typical word processor may contain entries like Chapter title, Section title, Subsection title, and so on, word processors typically have no inbuilt notion that chapters contain sections, or that sections contain subsections; they thus cannot enforce a rule that (for example) a chapter should not directly contain a sub-sub-section. Indeed, the typical word processor knows nothing at all about chapters, sections, and sub-sections, only about chapter titles, section titles, and sub-section titles.
From a scholarly point of view, or for modeling purposes, this is perhaps the biggest drawback to all these flat sequence models: in these models, there is no natural sense in which the first paragraph of the first chapter of a novel is part of or in the first chapter: the chapter is not an element in the sequence which constitutes the text, and thus does not appear in the model. We can of course define the chapter as a sequence of paragraphs, but that amounts to changing the model, and it will be difficult to make the details of the extended model come out right.
3.3.3.2  Several hierarchies of content objects? ·  Mehrere Hierarchien von Inhaltsobjekten?
If we consider less constrained structures, two possibilities come easily to mind.
Slide of two-tree text (to illustrate CONCUR). [replace with German example]
First: instead of a single hierarchy of content objects, we can have several: two, three, or arbitrarily many. The image shows a simple example: a haiku with three lines and two sentences. Blue nodes mark the linguistic structure, pink nodes the verse structure; white nodes are shared and occur in both structures.
As has been mentioned, SGML already defines multiple hierarchies as a possibility, and while the CONCUR feature was seldom implemented and was dropped from XML as an unnecessary complication, it is not conceptually difficult to extend XML to support the same thing. Oliver Schonefeld and Andreas Witt proposed XCONCUR, an extension of this kind, some years ago. More recently, at least one project has reported using a pure XML notation to record multiple tree structures in their documents: the ‘dominant’ structure uses conventional XML elements, and the ‘recessive’ structures use empty milestone elements. For different kinds of work, different dominant structures are appropriate, so the project automatically transforms documents from one XML structure to the other when needed. This allows the conceptual model of CONCUR to be applied in a pure XML context, so that standard XML tools can be used for processing.
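The milestone technique can be sketched on the haiku example. The element names below (haiku, s, lb) are invented for illustration: the sentence structure is the dominant hierarchy, encoded as ordinary nesting elements, while the verse lines are marked only by empty milestone elements, so the document remains a well-formed single tree:

```python
import xml.etree.ElementTree as ET

# A minimal sketch of milestone encoding (invented element names): the
# sentence structure is dominant; verse-line starts are recorded by empty
# <lb/> milestones, so both hierarchies live in one well-formed XML tree.
haiku = ET.fromstring(
    '<haiku>'
    '<s><lb/>An old silent pond...<lb/>A frog jumps into the pond,</s>'
    '<s><lb/>splash! Silence again.</s>'
    '</haiku>'
)

sentences = len(haiku.findall('s'))     # dominant structure: 2 sentences
lines = len(haiku.findall('.//lb'))     # recessive structure: 3 lines
print(sentences, lines)    # 2 3
```

A transformation in the other direction would make line elements dominant and mark sentence boundaries with milestones; the information content of the two encodings is the same.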
3.3.3.3  Directed graph of content objects? ·  Gerichteter Graph von Inhaltsobjekten?
Second, we can allow the structure of text to be not a tree, and not several trees, but a graph or directed graph which is not required to be a tree. Some years ago, Claus Huitfeldt and I proposed a directed-graph structure for text, under the name Goddag (generalized ordered-descendant directed acyclic graph, also Norwegian for "hello"); the model has been useful for thought experiments, but we have had neither the resources nor the desire to build for Goddags the same kind of infrastructure that already exists for XML: parsers, validators, transformation languages and transformation engines, databases and database query languages, and so on.
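A minimal sketch of such a shared-node structure may help; it is illustrative only, not the formal GODDAG definition. A leaf node may have parents in two different hierarchies, so both the verse line and the sentence ‘contain’ the same words:

```python
# A minimal GODDAG-like structure: a directed acyclic graph in which a
# leaf node may have parents in more than one hierarchy. This is an
# illustrative sketch, not the authors' formal definition.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

# Shared leaves: the words of one line that is also one clause.
w1, w2, w3 = Node("a"), Node("frog"), Node("jumps")

line = Node("line", [w1, w2, w3])       # verse hierarchy
sentence = Node("s", [w1, w2, w3])      # linguistic hierarchy
text = Node("text", [line, sentence])   # two views over shared leaves

# The same leaf node (not a copy) is reachable from both hierarchies.
assert w2 in line.children and w2 in sentence.children
```

In a tree, a node has exactly one parent; relaxing that single constraint is what turns the structure into a directed acyclic graph.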
A team at the Huygens Institute in the Netherlands is now undertaking an ambitious project of just this kind to support their own new model of text, under the name "Text as Graph." In TAG, text is modeled not as a single tree but as a hypergraph, from which various trees with different structures can be projected.
[More on Text as Graph would be helpful here.]
3.3.3.4  A practical compromise? ·  Eine pragmatische Lösung?
One substantial drawback of unrestricted graphs is that they are rather difficult to process efficiently. Multiple concurrent trees are easier to process, but are still more complex than handling a single tree. So even if we regard the underlying reality of our texts as unconstrained networks, or the equivalent, we may find trees attractive as tools for working with text: they allow us to expose part of the reality in a simpler form, and they provide a convenient way to mediate between an unrestricted graph structure for text and the linear order of bytes which we need in order to store a document in a file using a file system which models files as flat sequences of bytes.
In this context, I should mention TUSTEP's use of a hierarchical model for the management and manipulation of text, coupled with its ability to transform a document from one hierarchy to a different hierarchy (while retaining traces of the original structure, so that the document can be transformed back in the other direction). This allows a remarkable flexibility in the management of documents: there is always a hierarchical structure of ‘pages’ and ‘lines’, but markup can also be supplied in the document and used to transform the document to a different structure, under user control. For a skilled user, therefore, TUSTEP offers the ability to work with a model of text as a single tree, as a forest of multiple concurrent trees, or as a graph. The power thus given to users is, I think, one reason for TUSTEP's success.

4  The shape of text in our heads ·  Die Textgestalt im Kopf

Texts are many things — cultural objects, literary objects, utilitarian objects — but perhaps above all they are linguistic objects. Conventional texts consist of words, which combine lexical information with morphological and syntactic information; these words are organized into sentences which have syntactic, semantic, and pragmatic information. Sentences convey information and ideas, directly as part of their meaning, less directly as part of their entailment, and very indirectly by their contribution to our view of the world. Texts quote other texts, or misquote them, appeal to their authority or refute them, interpret or misinterpret them. A full account of the shape of text must include some mention of these linguistic, aesthetic, denotational, expressive, conative, phatic, metalinguistic, and intertextual functions; a fully complete representation of text as data — if such a thing is possible — must provide ways to represent such aspects of text and process them.
... Eine vollständige Beschreibung der Textgestalt muss diese sprachliche, ästhetische, ... Funktionen irgendwie erwähnen; eine wirklich vollständige Repräsentation von Text als Datentyp — wenn so ein Ding überhaupt denkbar ist — muss es ermöglichen, diese Texteigenschaften darzustellen und zu verarbeiten.
Slide showing Babylonian 23 = 23, or 23 * 60, or 23 * 60 * 60, or 23 / 60, ...
These aspects of text are only partially represented in the written form of a text. Print is, after all, a technology for writing. And writing systems are almost always incomplete representations of linguistic utterances: the function of any writing system is to make it possible to decipher a message, not to provide a complete representation of all aspects of the utterance. Accordingly, writing systems are almost always underspecified vis-à-vis the language(s) they are used to represent: the writing system disambiguates just as much as it needs to in order to be practical, and usually as little as it can get away with. That means that written utterances are frequently (strictly speaking) ambiguous, and we rely on context to help us understand what was meant. A given Babylonian numeral, for example, can mean 1, or 60, or 3600, and the writing system relies on context to make clear which meaning is to be assigned to a given occurrence (25 plus 35 is ... not 1 and not 3600, but 60). The token "M." may mean many things, but when followed by a nomen gentilicium (e.g. "Tullius") it can only mean "Marcus".
M.
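The scale ambiguity of the Babylonian system, and its resolution by arithmetic context, can be sketched briefly. The function and its range of candidate powers are illustrative assumptions:

```python
# Babylonian numerals are written in base 60 with no marker for the
# absolute scale: a digit d may denote d * 60**k for any integer k.
def readings(digit, powers=range(-1, 3)):
    """Candidate values of a single sexagesimal digit (a small,
    arbitrary range of powers for illustration)."""
    return [digit * 60**k for k in powers]

assert 60 in readings(1)     # the sign for 1 can mean 60 ...
assert 3600 in readings(1)   # ... or 60 * 60

# Context disambiguates: if the sum 25 + 35 is written with the sign
# for 1, only the reading 60 is arithmetically consistent.
assert [v for v in readings(1) if v == 25 + 35] == [60]
```

The writing system leaves the exponent unwritten, and the reader supplies it, exactly as the paragraph above describes.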
Writing systems for humans to use will almost always omit information when they can, to make it easier to produce written messages. But a writing system for machines can be more explicit while remaining practically usable, if we use software to assist in writing and reading it. What would be involved if we tried to make our electronic representations of text represent them more completely than is customary (or perhaps possible) in print? What might we then be able to do with our electronic texts?
Slide showing IPA transcription of a text (screen shot from IAIA or UyLVs project).
Phonologists may be interested in a phonetic rendering of the text, either a strictly phonetic one, which transcribes in more or less detail the sounds actually made by a speaker, or (as in the text shown in the image) a phonemic one, which distinguishes the phonemes of the language and indicates (for those who can read the International Phonetic Alphabet) at least approximately how they are pronounced. Phonemic representations of this kind are, if I understand things correctly, sometimes used to transcribe material from languages which do not have an established writing system.
Students of manuscripts might benefit from a distinction similar to that made by phonologists between phonetic and phonemic transcription: for some purposes, a graphetic transcription is useful, which distinguishes the several forms of lower-case a shown earlier. For other purposes, a graphemic transcription is more convenient. It is sometimes nice if both levels can be accommodated in the same electronic representation.
Wortform       Morpheme                    Lexem+Grammeme
Po             {PO}                        po
soobščenijam   {SOOBŠČENIE} + {PL.DAT}     soobščenie[pl,dat]
pressy         {PRESSA} + {SG.GEN}         pressa[sg,gen]
SŠA,           {SŠA}                       SŠA
Belyj          {BEL(YJ)} + {MASC.SG.NOM}   Belyj[masc,sg,nom]
Dom            {DOM} + {SG.NOM}            Dom[sg,nom]
We may be interested in the morphological structure of the text, which can be modeled in several ways. The image shows the surface form of the word on the left, in the center the ‘shallow morphological structure’ as specified in the Meaning/Text Model of language created by Igor Mel'cuk and Aleksandr Zholkovsky, and on the right the ‘deep morphological structure’ in the same model. (The full morphological representation of a sentence in the meaning/text model requires also some representation of prosody, but I omit it here for brevity.)
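As a sketch of how such a multi-level morphological record might be held as data, one word of the table above could be represented as follows. The field names and structure are my own, not the model's notation:

```python
# One word at three levels: surface form, shallow morphological
# structure (morphemes), deep morphological structure (lexeme plus
# grammemes). The dictionary layout is an illustrative assumption.
record = {
    "surface": "soobščenijam",
    "shallow": ["{SOOBŠČENIE}", "{PL.DAT}"],
    "deep":    ("soobščenie", {"number": "pl", "case": "dat"}),
}

lexeme, grammemes = record["deep"]
assert lexeme == "soobščenie"
assert grammemes["case"] == "dat"
```

The point of the sketch is simply that each level is a distinct object: the levels can be stored, queried, and validated independently of one another.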
Wortform   Wortklasse
Ursprung   NN
bedeutet   VVFIN
hier       ADV
jenes      PDAT
,          $,
von        APPR
woher      ADJA
und        KON
wodurch    PWAV
eine       ART
Sache      NN
ist        VAFIN
,          $,
was        PWS
sie        PPER
ist        VAFIN
und        KON
wie        PWAV
sie        PPER
ist        VAFIN
.          $.
The most common linguistic annotation in existing language corpora consists of assigning a part of speech to each word. Here we see the Stuttgart/Tübingen part of speech tag set applied to a sentence from Heidegger, which has apparently slightly confused the part-of-speech tagger. (Specifically, on the word woher.)
The first such corpora to be produced were the tagged Brown and LOB corpora of modern American and British English, in the 1970s and 1980s, but many more have followed. You may notice that the part of speech tags used have (as here) only a passing resemblance to the eight parts of speech identified by the ancient Latin grammarians and which you may have learned in school.
Surface syntactic representation of Melcuk sentence 7
Another popular form of annotation for language corpora consists in producing a syntax tree for each sentence in the corpus, either as a phrase-structure tree (as in the Penn Treebank and some others) or as a dependency tree. (Dependency trees have a popularity in computational contexts which appears to be out of all proportion to the popularity of dependency grammars among practicing linguists.) Shown here is the surface syntactic representation of a Russian sentence (whose morphology was glimpsed earlier). Each oval represents a word in the sentence (or, strictly speaking, an item in the deep morphological representation of the sentence). The surface syntactic structure is modeled here as an unordered tree: any information expressed by the ordering of words in the sentence must be represented either in the dependencies or in the classification (labels) of the dependencies. (Most dependency treebanks retain the order of the words, and perhaps as a result can use a much smaller set of dependency classes.)
The nodes of the theme (topic) are colored light blue; those of the rheme (comment) are colored pink. Within the rheme, a subordinate theme and rheme are also distinguished, marked by double ovals in blue or red, respectively.
I won't attempt to explicate this diagram in detail; as you can perhaps see, it is a dependency tree rather than a phrase-structure tree; it differs from dependency trees created in other approaches primarily in being unordered, in having (in consequence) a richer set of dependency labels, and in systematically using not the surface form of the words but the forms of the deep morphological structure as nodes. For the purposes of this discussion, it suffices to notice that the dependencies will require that we record directional links from each dependent to its governor.
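The point that dependencies are directional, labeled links from each dependent to its governor can be made concrete with a small sketch. The relation labels and the fragmentary sentence below are illustrative; they are not Mel'cuk's actual inventory of surface-syntactic relations:

```python
# A dependency structure as directional links: each dependent maps to
# its governor plus a label on the link. Labels here are invented.
ROOT = None
deps = {
    "Ursprung": ("bedeutet", "subjectival"),
    "jenes":    ("bedeutet", "completive"),
    "bedeutet": (ROOT, "root"),
}

def governor(word):
    """Follow the directional link from dependent to governor."""
    return deps[word][0]

# In an unordered tree, word order carries no information: everything
# must be recoverable from the links and their labels alone.
assert governor("Ursprung") == "bedeutet"
assert governor("bedeutet") is None
```

Note that the mapping goes from dependent to governor, not the other way round: each word has exactly one governor (or none, for the root), while a governor may have many dependents.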
Deep syntactic representation of Melcuk sentence 7
Some linguists define two levels of syntax trees: a surface level which is relatively clearly related to the surface form of the sentence, and a deep level which is more distant from the surface form, but which makes the semantic structure of the sentence clearer. This example, again from Igor Mel'cuk and illustrating the meaning/text model, regularizes some surface syntactic and lexical variations. The deep structure uses a very restricted set of dependency types, and replaces some nodes (like "very energetic" modifying "help") with lexical functions (Magn).
Several arguments of verbs which are implicit in the surface form of the sentence and omitted from the surface syntactic structure are made explicit here, so the words NAROD, referring to the American people, and STRANA, referring to African countries, each appear three times, and are in turn connected by links in the deep anaphoric structure (represented again here by dotted lines).
Semantic representation of Melcuk sentence 7
Some treebanks attempt to make the semantic structure of the sentence accessible for processing by annotating the surface syntax tree; others, as I've just shown, produce a second ‘deep’ syntax tree. Some linguists will wish for a separate representation of the semantics of the sentence; the image shows the semantic structure postulated by Mel'cuk for the sentence whose syntactic trees have already been shown. (In the meaning/text model, this semantic representation will be the same for all sentences ‘synonymous’ with the one shown earlier.) In this diagram, the semantic structure is modeled as a directed graph; it would not be impossible to render it as a set of sentences in logical form, if one wished to reason formally about the meaning expressed.
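The rendering of a directed semantic graph as logical form, mentioned above as a possibility, can be sketched for a toy graph. The node names and edge labels below are invented, not Mel'cuk's actual semantic representation:

```python
# A semantic structure as a directed graph: labeled edges from a
# predicate node to its argument nodes. Toy content, invented labels.
edges = [
    ("help", "1", "people"),     # argument 1 of 'help' is 'people'
    ("help", "2", "countries"),  # argument 2 of 'help' is 'countries'
    ("intense", "1", "help"),    # the helping itself is intense
]

def to_logical_form(edges):
    """Render the graph as predicate-argument terms, one per
    predicate node, joined by conjunction."""
    preds = sorted({src for src, _, _ in edges})
    terms = []
    for p in preds:
        args = [tgt for src, lbl, tgt in sorted(edges) if src == p]
        terms.append(f"{p}({', '.join(args)})")
    return " & ".join(terms)

assert to_logical_form(edges) == "help(people, countries) & intense(help)"
```

Because predicates can themselves be arguments of other predicates (here, 'help' is an argument of 'intense'), the structure is a graph rather than a tree, which is exactly why a flat conjunction of terms is a natural serialization.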
I have used the meaning/text model to illustrate the kinds of information we would need in order to have a notionally complete representation of the linguistic structure of text, partly because it is possible to find fairly clear descriptions of each of its descriptive layers. Some linguistic approaches may offer simpler, possibly less complete, descriptions: not all students of syntax will distinguish two syntactic structures and also a semantic structure. The ‘semantic’ annotations to the Penn Treebank, for example, offer a single structural representation intended to describe both the syntactic structure and the associated semantic roles. And of course we may be interested, in a given practical project, only in the morphology, or only in the lexis, or only in the surface syntax of our text.
If a more or less complete representation of the linguistic structure of a text is so complex and raises so many questions, we may decide that a complete representation of everything is a chimera not worth chasing. How much chance do we have of being able to process a query that demands “show me all the passages in Musil, if there are any, which are best understood as allusions to Faust,” even if we restrict ourselves to the views of a single literary historian? How likely is it that we could make explicit everything we think about any given text?
Mathematicians sometimes say that for all the mathematical discoveries of the last centuries, we are like swimmers who explore one part of a small cove in detail and know almost nothing at all about the ocean just beyond. We have only scratched the surface.
Similarly, with text, I think, it is possible that we have only scratched the surface. Our current methods of textual representation constitute a set of hard-won achievements. But there is a great deal more work to be done. For that work we need the best tools we can have, and among those tools, I think it is safe to say, is TUSTEP.

5  Conclusion ·  Schluss

Some years ago, the prominent digital humanist Willard McCarty (editor of the Humanist discussion list) used to say humorously that software often reflected not only the thinking but also the personality of its makers. To drive the point home, he would say "Think of Word Perfect, for example. Have you ever seen such a Mormon piece of software? It is clean, well-behaved, polite almost to a fault — and also unprepared to deal with anything that does not fit into its world view, and absolutely, irredeemably convinced that it knows the Right Way to do things." To the extent that software is a tool, it reflects the convictions of its makers about what tasks are worth doing. Not only that: software will also necessarily reflect the makers' beliefs about how one can (or should, or must) set about performing those tasks, and (a little less directly) about the human being or beings who will be using the software: what they find important, what they find less important, how they think about their task and how much responsibility they wish to take for it.
There are not many pieces of software that make me wish to live up to the software's implied image of its user. TUSTEP is one of these.
On this analysis, every piece of text processing software reflects an idea of text, of the things we might want to do with text, of the ways we might want to shape text by processing and (for those programs that produce output) the ways we might want to shape it on the page. Sometimes, the ideas reflected are simple, facile, even simplistic. Not many programs offer a concept of text and the shaping of text as detailed, as deeply felt, as coherently thought through as that of TUSTEP. As always, though, programs reflect the thought of those who make them, and it seems fair to say that with TUSTEP, Wilhelm Ott has shown himself to be one of our day's great thinkers about text, and the shape of text, and the shaping of text.
Please join me in honoring Wilhelm Ott and thanking him for his work.
1 Not quite always: in Literary Machines, he writes at one point “The sequentiality of text is based on the sequentiality of language and the sequentiality of printing and binding. These two simple and everyday facts have led us to thinking that text is intrinsically sequential.” (1/14, “Hypertext”).
2 We could imagine languages which would not be completely linear: if, for example, the vowels and consonants of our speech carried one channel of a signal while pitch, intonation, or volume carried another, independent signal. We might imagine a sign language in which the right and left hands carried independent meanings. But naturally occurring sign languages are, as it happens, linear enough to be rendered in writing, and in tonal languages linguistic analysis simply records the tone as a phonemic property of the vowel which forms the nucleus of the syllable. Intonation may come closer to being independent and non-linear in non-tonal languages like English or German. But the information content of intonation in these languages is relatively poor, measured in bits, and conventional orthography limits itself to distinguishing unremarkable pitch profiles, marked with full stops, from meaning-bearing pitch profiles, marked with question marks or exclamation points at the end of the sentence. Languages like Spanish, in which questions and exclamations are not syntactically distinct, mark the beginnings of questions and exclamations as well as the end, but have thus far apparently not felt any need for a richer inventory of punctuation for pitch profiles.
3 In this as in a number of other points I follow the analysis of Elisabeth Eisenstein.
4 Or at least, I don't know any other way to do it.
5 Of course, the fact that pagination varies from edition to edition means that the more important a text is (and the more different editions and translations of it are published), the less useful page numbers will be. And page numbers are seldom much use in electronic books, which repaginate themselves as the user resizes the font. This is not the place to enter into a full discussion of the interesting technical and social issues associated with reference systems in an electronic age.
6 In this discussion, I follow Eisenstein. She has been criticized on the grounds that printed texts are not in fact completely stable, and manuscript texts are not as unstable as her critics believe Eisenstein thinks. As it happens, though, Eisenstein does not say (or, apparently, believe) that printed texts are completely stable. Instead, she observes correctly that they are decisively more stable than manuscript traditions, and she suggests persuasively that the increased stability made a more critical scholarly attitude toward the text feasible and useful.
7 It's easy to get caught up in questions of detail here: stop-press variants, problems with loose type or bad inking, and the requirement to break up pages of set type so the type can be re-used, all have the effect of ensuring that printed texts are not perfectly static, any more than manuscript texts are. And careful scribes can in principle (and perhaps occasionally in practice?) copy texts without introducing changes. But the fact remains that printed texts approach fixity much more closely than do manuscript texts; I think the reason is that the relative costs of making identical and modified copies are different.
8 DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. “What is text, really?” Journal of Computing in Higher Education 1, no. 2 (1990): 3-26. doi:10.1007/BF02941632.