A directed-graph data structure for text manipulation
C. M. Sperberg-McQueen
Computer Center
University of Illinois at Chicago
Chicago, Illinois, USA
[Abstract of a talk given at the conference
The Dynamic Text,
9th International Conference on Computers and the Humanities (ICCH)
and
16th International Association for Literary and Linguistic Computing
(ALLC) Conference,
at the University of Toronto,
June 1989]
This work came to mind recently when I heard the paper “A Fresh Computational Approach to Textual Variation”
by Desmond Schmidt and Domenico Fiormonte at the conference
Digital Humanities 2006, the first International
Conference of the Alliance of Digital Humanities Organizations (ADHO),
at the Sorbonne in Paris earlier this month. So I have unearthed the
abstract and put it on the Web. Apart from the tagging in TEI Lite,
the addition of this note and the identifying information above, and
the omission of the now outdated address at the bottom of the
abstract, the text is unchanged from pp. 69-70 of the 1989 conference
guide.
C. M. Sperberg-McQueen
Española, New Mexico
21 July 2006
Written texts are conventionally represented in data processing as
linear arrays of characters (or tokens, or records). Our text editors
and concordance programs commonly assume that texts are simple streams
of characters; our programming texts teach programmers to represent them
so. But texts are not, as a class, exclusively linear. And the linear
model, although convenient, provides no help in dealing with texts when
they resist reduction to linearity, as even simple texts often do.
For some non-linear aspects of texts, proper data structures are well
known. Simple textual hierarchy (including, at least, footnotes,
running titles, and marginalia) can be handled even by very conventional
text programs. Structure editors, SGML editors and processors, and
outlining programs may model the hierarchy of a text much more
completely, representing in a tree-like data structure the text's
segmentation into chapters, sections, paragraphs, and sentences (or into
cantos, stanzas, and lines; or into volumes, pages, and typographic
lines). A second type of non-linearity comprises internal and external
cross references and the distinction of passages of various kinds (e.g.
material to be read on the first reading and material intended to be
read only later). This non-linearity is less simple to handle well than
the first, but the solution -- hypertext -- is very well known, even
though not widely implemented.
A third type of textual non-linearity is less well served by existing
data structures: textual variation. Texts do not always exist only in
single forms. Culturally significant texts rarely do: if they are not
preserved in multiple divergent manuscripts, they will have been printed
in multiple divergent editions. But textual variation does not affect
only culturally important texts. Letters, textbooks, software manuals,
-- anything written can and often does exist in multiple forms. But the
characteristics of variant texts -- an inescapable brute fact of our
cultural history -- can be modeled by linear data structures only
indirectly and awkwardly.
Conceptually, just as a simple text can be viewed (with some
simplification) as a single stream of characters, so multiple versions
of the same text can be viewed as so many parallel streams of
characters. In this model, the complex text becomes a two dimensional
array, a linear array of (texts represented as) linear arrays of
characters. This model appeals in its simplicity, but apart from its
theoretical problems (even a single text state need not be truly linear)
it wastes storage whenever several versions have the same reading, and
wastes processing power because to display a summary of the agreements
and variations among the text states, it must collate each text against
the base text from scratch.
I will outline in my paper a more economical solution to this problem.
In this non-linear model, the multiple versions of a text are imagined
not as so many parallel, non-intersecting lines, but as curves that
intersect, run together for a while, and then split apart again, like
the channels in a river delta. Unlike the channels of most river
deltas, the versions of a text often merge again after splitting. The
data structure takes its name from one riverine delta where such reunion
of the channels does occur; I have christened
it the "Rhine Delta"
structure. Unlike the two-dimensional model of complex texts, this
structure stores passages in which all versions agree only once; it is
thus more economical of space. It also records the agreements and
divergences of manuscripts structurally, which makes the task of
preparing a critical apparatus a much simpler computational task.
Formally, the Rhine Delta structure is a directed graph, each node of
which is labeled with one token of the text and with the symbols of the
manuscripts which contain that token. Each arc linking two tokens is
labeled with the symbols of the manuscripts in which the two tokens
follow each other. There is a single starting node and a single ending
node. If one follows all the arcs labeled with the symbol of a specific
manuscript, one visits, in turn, nodes representing each token of
that manuscript, in sequence. Passages where all the manuscripts agree
are marked by nodes and arcs bearing all the manuscript symbols.
Passages where they disagree will have as many paths through the passage
as there are manuscript variants.
It can be shown that from this structure we can, for any variant,
produce all the conventional views of linear text and perform all the
usual operations (deletion, insertion, replacement, travel, search and
replace, block move, etc.). Moreover, we can readily generate the
various conventional views of complex texts: base text with apparatus,
texts in parallel columns, text in parallel horizontal lines. Unlike
other methods of handling textual variation, the Rhine Delta has no
computational bias toward any single base text state; the user pays no
penalty for wishing to view the text in an alternate version, with an
apparatus keyed to that version.
Like a textual apparatus, a specific Rhine Delta structure is not wholly
determined by a specific configuration of textual variants; more than
one segmentation of variants and grouping of manuscripts can reflect the
same textual facts. The Rhine Delta structure can reflect structurally
the editor's preference for arranging, delimiting, separating, and
grouping the textual variations. Or one can generate various
alternative forms for the critical apparatus by mechanical operations on
the Rhine Delta structure.
The Rhine Delta structure does introduce some complications into the
treatment of textual hierarchies and cross references, but there appears
to be no reason that Rhine Delta structures and hierarchical text
structures could not both be integrated into a hypertext system,
providing powerful tools for manipulating text. The form of such a
combined hierarchic, hypertextual Rhine Delta structure will be sketched
briefly in my paper.
The Rhine Delta structure has a number of possible applications in
scholarly and non-scholarly text processing. Textual critics, and those
who do not trust textual critics, will be able to to examine all the
variant forms of a text separately and in parallel, with an apparatus
keyed to whichever version they are reading at the moment. Those
responsible for writing texts that must be prepared in various parallel,
synchronized versions (system-specific software documentation, texts
intermingling introductory overviews with esoteric digressions intended
for later reading, project descriptions to be submitted to multiple
funding agencies with different agendas, ...) will be able to create and
maintain such documents more easily with a text editor that understands
the basic concepts of textual variation and file synchronization.
Collaborators could propose revisions or alternate readings in each
other's texts without destroying the other alternatives, so that the
document editor, in making the final version, can have all the
suggestions laid out cleanly against each other.
The structure here described has almost certainly been invented before,
though not widely publicized. It has parallels with hypertext data
structures (although it appears to be finer grained than most hypertext
systems), as well as with transition networks and deterministic finite
automata. Structures very like it have surely been used sporadically
before. But though it would be foolhardy to claim much originality for
the idea of the Rhine Delta, it seems worthwhile to consider it in some
detail as a data structure and discover its characteristics, advantages,
and drawbacks. Only such consideration can lead to improvements in our
ability to handle the inescapable, recalcitrant, stubborn non-linearity
of the texts we study and write.