This set of slides is intended to make it possible to leaf through various sequences of images which are intended to illustrate three different approaches to choosing which virtual elements in a flattened file to raise when.
Slides full of prose (like this one) are explanations of what the diagrams are showing.
In an presentation, most of the prose would be handled by the speaker, not by prose on the slides.
There are several sequences of diagrams:
There are three different styles of tree shown here, which are summarized at the end.
This approach processes the input left to right in a single pass.
These slides show each node locked in a given position, for visual stability across transitions.
At the outset, we have a flat sequence of nodes.
We first encounter the start-marker for the paragraph.
We then encounter a start-marker for a citation. The paragraph has been started not finished.
We then encounter a start-marker for a quotation.
We then encounter a start-marker for a line group.
We then encounter a start-marker for the first line of verse.
And finally some text data: the text of the first line of verse.
End-marker for the first line.
We then encounter a start-marker for the second line of verse. The first line is now finished (as signaled by the change in color).
The text of the second line of verse.
End-marker for the second line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
End-marker for the line group.
End-marker for the quotation.
Start-marker for the note attributing the quotation.
Text directly within the note.
Start-marker for bibliographic reference.
Text of the bibliographic reference.
End-marker for the bibl.
End-marker for the note.
End-marker for the citation.
End-marker for the paragraph.
Final state.
This approach processes the input in multiple passes.
These slides show each node locked in a given position, for visual stability across transitions.
At the outset, we have a flat sequence of nodes.
On the first pass we raise the outermost element, the p.
On the second pass, we raise the citation element. The p element is finished though its children are still in-process.
On the third pass, we raise the quote element; we don't know how to raise the note at the same time.
On the fourth pass, we recur on the content of the quotation and raise the line group.
On the fifth pass, we raise the first line.
Second line.
Third line.
Fourth line.
Fifth line.
Sixth line.
Now we finally make it to the note.
On the final pass, we raise the bibliographic reference within the note.
Final state.
This approach processes the input in multiple passes.
These slides show each node locked in a given position, for visual stability across transitions.
At the outset, we have a flat sequence of nodes.
On the first pass we raise all character-only elements.
On the second pass, we raise the line group and the note.
On the third pass, we raise the quote element.
On the fourth pass, the citation element.
On the fifth and final pass, we raise the paragraph.
Final state.
This approach processes the input left to right in a single pass.
These slides shrink each invisible node, so the others have more room and can fill the space more naturally.
Maybe it works, maybe it doesn't.
At the outset, we have a flat sequence of nodes.
We first encounter the start-marker for the paragraph.
We then encounter a start-marker for a citation.
We then encounter a start-marker for a quotation.
We then encounter a start-marker for a line group.
We then encounter a start-marker for the first line of verse.
And finally some text data: the text of the first line of verse.
End-marker for the first line.
We then encounter a start-marker for the second line of verse.
The text of the second line of verse.
End-marker for the second line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
Start-marker for another line of verse.
Text of the line.
End-marker for the line.
End-marker for the line group.
End-marker for the quotation.
Start-marker for the note attributing the quotation.
Text directly within the note.
Start-marker for bibliographic reference.
Text of the bibliographic reference.
End-marker for the bibl.
End-marker for the note.
End-marker for the citation.
End-marker for the paragraph.
Final state.
This approach processes the input in multiple passes.
These slides show each node locked in a given position, for visual stability across transitions.
At the outset, we have a flat sequence of nodes.
On the first pass we raise all the outermost element, the p.
On the second pass, we raise the citation element.
On the third pass, we raise the quote element; we don't know how to raise the note at the same time.
On the fourth pass, we recur on the content of the quotation and raise the line group.
On the fifth pass, we raise the first line.
Second line.
Third line.
Fourth line.
Fifth line.
Sixth line.
Now we finally make it to the note.
On the final pass, we raise the bibliographic reference within the note.
Final state.
This approach processes the input in multiple passes.
These slides show each node locked in a given position, for visual stability across transitions.
At the outset, we have a flat sequence of nodes.
On the first pass we raise all character-only elements.
On the second pass, we raise the line group and the note.
On the third pass, we raise the quote element.
On the fourth pass, the citation element.
On the fifth and final pass, we raise the paragraph.
Final state.
The slides that follow are of mostly archival interest: they record earlier drafts.
This approach processes the input in multiple passes.
On each pass, it locates the innermost start- / end-marker pairs (the ones with no other start-/end-marker pairs between them), and raises them.
There will be one call for each level in the tree, ascending from leaves to the root. On each pass, the number of nodes to be dealt with shrinks.
Since the process makes one pass over the input for each level of depth, and examines each node in the document (or, if we are careful, each node in the flattened sequence, skipping content elements in that sequence), the worst-case performance appears to be n × p / 2 (where n = number of nodes and p = number of passes).
In the diagrams that follow:
Note that showing the start- and end-markers in this way requires all nodes to be illegibly small in the initial diagram.
In these diagrams:
Note that in this sequence, the nodes jump around a lot from step to step.
At the outset, we have a flat sequence of nodes.
On the first pass, we identify the start- and end-markers for the head and lines.
The first pass raises the head and line elements.
On the second pass, the lg start/end pair is the only innermost pair.
The second pass raises the line group.
The third pass selects the body start-marker/end-marker pair ...
... and raises the body element.
The fourth and final pass sees the markers for the text element.
Once we raise the text element, we are done.
In the first sequence of drawings, nodes jump around a lot; it's difficult to follow what is going on without careful examination of each diagram and comparison to one's memory of the previous diagram.
This sequence tries to make it easier, by making the layout of the diagram stable; it achieves this by omitting all depiction of start- and end-markers. Instead it shows only the virtual tree and the nodes raised on each pass.
At the outset, we have a flat sequence of nodes.
On the first pass, we raise the elements closest to the leaves: the head and lines.
On the second pass, we raise the lg element.
The third pass raises the body element.
On the final pass, the text element is raised.
This version of the inside-out illustrations uses different tree-drawing conventions: nodes become invisible but remain present in the formatting algorithm, so nodes don't jump around. In order to make the text and element nodes larger, this sequence reduces the marker nodes to points.
At the outset, we have a flat sequence of nodes.
On the first pass, we raise the elements closest to the leaves: the head and lines.
On the second pass, we raise the lg element.
The third pass raises the body element.
On the final pass, the text element is raised.
Final state.
Like the inside-out approach, the outside-in approach makes several passes over the input data, one for each level in the tree.
On each pass, we raise all the outermost elements (the one with the left-most start-marker and each of its virtual siblings). Or at least, we would if we knew how to do so; in practice, we find the left-most start-tag, raise that virtual element, and then make two recursive calls:
The maximum stack depth here is, as in the inside-out method, the depth of the tree. And the number of items in the sequence we are operating on steadily decreases. So we might expect the run time to be slightly better for outside-in than for inside-out.
Sometimes it is, and sometimes it is much worse.
There are two factors that could be responsible (but we have not been able to confirm this hunch):
raise()
function
for each virtual element: one on its contents and one on
what is to the right of it. In any implementation in which
function calls are expensive, this can hurt. At the outset, we have a flat sequence of nodes.
On the first pass, we raise the outermost element, i.e. text.
The second pass raises the body element.
The third pass could in principle raise both head and lg. In practice, it raises the head.
The fourth call recognizes the lg markers and raises the lg element.
Next, the first line.
And the second line.
And the third line.
Final state.
This approach does what a recursive-descent parser for a conventional XML document does:
When it sees the beginning of any construct (element, attribute value specification, namespace declaration, text node, comment, processing instruction), it calls a routine to recognize that construct. That routine can, in turn, call others, recursively.
The stack of function calls mirrors the nesting of constructs; the maximum stack depth is the depth of the XML tree.
The parser makes a single pass over the input; each chunk of input is seen once.
In the diagrams that follow:
In these diagrams:
At the outset, we have a flat sequence of nodes.
First we see the start-marker for the text element.
Then we see the start-marker for the body element.
Next, the start-marker for the head element.
Now the text-node content of the head element.
Now we see the end-marker for the head element.
The head element is now done; next we see the start-marker for the lg element.
Next, the start-marker for the first line.
The text node for the first line.
And the end-marker for the first line.
Start-marker for the second line.
Text node in the second line.
End-marker for the second line.
Start-marker for the third line.
Text of third line.
End-marker for third line.
End-marker for the lg element.
End-marker for the body element.
End-marker for the text element.
Final state.
This sequence should be mostly the same as the preceding, but was prepared using a different method.
At the outset, we have a flat sequence of nodes.
The initial node in the input is the start-marker for text.
The second node is the start-marker for body; the text element is still pink because it has been started but not finished.
Next, we see the start-marker for the head.
And the text of the head element.
Next, we see the end-marker for the head element.
The head element is now completed (and no longer pink), and we see the start-marker for the line group.
The start-marker for the first line.
Text of the first line.
End-marker of the first line.
Second line: start-marker.
Second line: text.
Second line: end-marker.
Third line: start-marker.
Text of third line.
And end-marker for third line. From here on, we just have end-markers.
End-marker for the line group.
End-marker for the body.
And finally the end-marker for the text element.
Final state.
The preceding sequences have used three different forms of tree. Alternatives are possible.
The initial sequence eliminated nodes from the graph when they were done; the result is that nodes jump around and change size as one progresses through the sequence. The other sequences eliminate the jumping around by keeping all nodes in the GraphViz input, but making them invisible when they are dropped. The invisible nodes and arcs continue to affect the layout even though they are not visible.
The visual stability of the second approach seems clearly better for our purposes, so I don't propose to experiment further on this point. The only question is how to draw the tree.
We can show markers explicitly, as a special shape.
But when we show markers explicitly, the font size gets very small.
If we eliminate markers entirely, the font is bigger but the illustration seems unrelated to raising.
If we show markers as undifferentiated points, the font size is better but the markers are distinguished only by the dotted arcs from the virtual nodes. That may be too subtle.
If we add linebreaks to the PCDATA and bring the marker font size down, we can have legible fonts and also distinct markers. They are ovals here, like element nodes, but gray to show they are different.