SWEB: an SGML Tag Set for Literate Programming
C. M. Sperberg-McQueen
Version 0.5
25 September 1993
Revised August 1994
Lightly revised March 1995
Revised and extended January - March 1996
Translated into XML January 2002
4 March 1996
This document appears to be relatively stable now; it has not
changed in substance since 1996, through two (partial) implementations.
So, although it used to say
“This incomplete, unpublished document is distributed privately for
comment by friends and colleagues; it is not now a formal publication
and should not be quoted in published material”, I now withdraw
that warning; cite it if you wish!
Abstract
This document describes an SGML tag set for literate programming.
First, markup is provided for embedding fragments of
programming-language code into SGML documents in arbitrary order, to be
recombined before compilation into the order required by the programming
language's syntax. Next, tags are defined for identifiers, keywords,
code fragments, and literal values occurring as phrase-level elements in
the prose documentation. Finally, tags for indexing and for a general
structure for reference documentation (alphabetical lists of functions
and identifiers, etc.) are defined. For each type of markup, the
document gives examples and describes how the markup should be processed
by conventional literate-programming weave and
tangle processes.
Notes for revision of
swebhtml translator:
- italicize <att> elements
- format <list type="taglist"> or <list
type="desclist"> elements like glossary lists (but we probably need a
better way of handling glossary lists, too ..)
Revisions still needed in this document:
- merge two existing sections on reference material
- make scraps produce proper TEI.extensions.ent and .dtd
files
- incorporate reference material for all tags
Sweb is a simple SGML tag set for simple literate
programming. It is designed to be used as an
added tag set
in conjunction with the
base tag set for prose defined by
the Text Encoding Initiative (TEI). It could be combined equally well
with other standard tag sets, such as the DocBook tag set developed by
the Davenport Group of software vendors, or the AAP tag set developed by
the Association of American Publishers and later revised as ISO 12083.
I will assume the TEI DTD is being used as the base here, since I
understand the TEI extension mechanisms best.[
1]
The Sweb tag set includes just a few extra tags for use in the text
of the documentation itself: they mark scraps of code and references to
identifiers and code fragments in text. In addition, there are two tags
which support the specification of multiple versions of a program in the
same Sweb document, and there is a larger tag set for reference
listings of the sort found in the second half of the TEI Guidelines, or
in the reference manuals for programming-language function libraries.
These parts of Sweb are discussed in turn, after an introduction which
describes the rationale for Sweb itself.
1. Introduction
1.1. The Basics of Literate Programming
In ‘literate programming’, a programmer constructs
documentation for a program together with the program code itself, both
kinds of material being included in a single document; by tying the
program code intimately to the documentation, literate programming tools
make it substantially easier to keep the program and the documentation
consistent with each other. It is also more embarrassing if one fails
to do so.
In the words of Donald Knuth, who formulated the idea and used it in
the construction of his programs TeX and MetaFont:[
2]
Let us change our traditional attitude to the
construction of programs. Instead of imagining that our main task is to
instruct a computer what to do, let us concentrate rather
on explaining to human beings what we want a computer to
do.
Among the benefits attributed to literate programming by its
practitioners and others are: better (and usually fuller) explanation
of the program's organization and algorithm, better indexing and
cross-indexing of identifiers, easier maintenance, and easier
modification of the program by the original author and others.
In Knuth's Web system, and the several similar systems
inspired by it, the programmer prepares a single document, called the
web, which contains both fragments of program code and
prose explanation of the program structure. The code fragments may be
introduced in any order suited to their explanation; they do
not need to obey the restrictions imposed by the
programming language on the sequence of constructs. This freedom of
ordering allows global variables to be introduced together with the code
which manipulates them, and for the main program and its subroutines to
be declared and described in whatever order most clarifies the
exposition, even in programming languages which require a rhetorically
less effective order for compilation.
The web is processed in two different ways: a weave
processor reads the web and produces program documentation (typically in
the form of a file in a formatting language such as TeX or troff); a
tangle processor rearranges the scraps of program code in
the order required by a compiler, expands macros if necessary, and
produces a compilable program in the programming language. In some
systems, e.g. nuweb, a single program both weaves and
tangles the input. Different web systems use different document
languages and different programming languages. Knuth's original system
used TeX for documentation and Pascal for program code; other systems
have been developed for other programming languages and (less
frequently) other document languages. In Knuth's system and many
others, the weave component typesets the program code;
such systems must thus endow weave with knowledge of the
programming language being used.
Some newer systems, notably Norman Ramsey's noweb and
Preston Briggs's nuweb, are programming-language
independent: they have no hard-coded knowledge of the programming
languages used, and thus can be used to write literate programs in any
programming language, or in several languages. Without knowledge of the
programming-language syntax, of course, elaborate typesetting is not
feasible; language-independent web systems typically eschew typesetting
entirely, reproducing scraps in a monospaced font and following verbatim
the layout of the source. Sweb follows the practice of these
language-independent systems: it performs no typesetting of the
source code and requires no particular programming language.
1.2. Literate Programming and SGML
Like literate programming tools, SGML systems are designed to make it
possible to process a single input document in very different ways with
different programs. The notion of creating an SGML-based tool for
literate programming is thus an obvious one; Sweb is an
attempt at such a tool. In discussing Sweb and similar
tools, however, it will be useful to distinguish two areas of concern:
first, the type of processing to be performed on the SGML-encoded webs,
and as a separate issue the markup (i.e. the set of SGML elements)
required to make that processing possible.
The processing to be performed has already been described in general
terms: the user must be able to weave the web into
well-formatted documentation, and to tangle it into
executable programs. In the interests of generality, I will abstract
away from the details of this processing and where possible formulate
our markup language in declarative, rather than procedural terms. That
is, I wish to follow the general principle of SGML, which is that
process-specific information belongs in a style-sheet or other mechanism
for controlling the process, and not in the SGML document.
So, first of all, this document defines an SGML tag set for literate
programming. Documents tagged this way can be translated automatically
into the specialized markup used by virtually any existing
literate-programming tool, or processed directly by SGML-aware
application software. They can also be operated upon directly by
SGML-aware literate-programming tools.
Documents using these tags can be:
- edited using SGML-aware editors
- validated using SGML parsers
- woven directly into SGML documentation
- tangled directly into programs
- translated automatically into arbitrary web systems
As a secondary matter, this document will also describe some aspects
of the processing of documents marked up with these tags.
Implementations of software to perform these processes will be described
in separate documents; many details of processing will vary from
implementation to implementation. At the time of writing, the only
implementation is a partial implementation of tangle functionality
using yacc and lex; weave functionality
is provided by direct display of the Sweb document in an SGML browser
such as Panorama or by translation into HTML for delivery
on the World Wide Web.
2. Encoding Code Scraps
In literate programs, the program text itself is included in the
document, typically in small pieces which I call scraps.
Knuth refers to them as modules, but this term derives from
the rather specialized text structure imposed on webs by his original
system, and may mislead some readers into thinking that a program is
more modular in construction than is actually so. So I prefer the
informal term scrap, which I adopt from the
nuweb system written by Preston Briggs of Rice
University.
2.1. The Tags
Scraps of code in a document are encoded using the <
scrap>
element; they are embedded in other scraps using the standard TEI
cross-reference elements <
ref> and <
ptr>:[
3]
- <scrap> contains a scrap of program code, in any language.
Attributes include:
- name gives a name to the scrap, to be used as a label for the scrap
and in cross-references to it.
- file names the output file to which this scrap is to be written
out; not required if this scrap is embedded by another one elsewhere.
- prev (i.e. continuation of) gives the SGML identifier of a scrap for
which this element is a continuation.
- version indicates which version(s) of a program this scrap belongs
to, when the web defines multiple versions.
By convention, the scrap identified as ‘previous’
should precede its continuations in the document --- but this is not
formally required.
Like any TEI element, the <scrap> element can bear the global
lang attribute, which in this case records the programming
language of the scrap. Since programming languages are not listed in
ISO 639, the values for programming languages must be improvised by the
author.
For example, here is a scrap of code from Knuth and Levy's rewrite of
the Unix utility
wc:[
4]
<scrap id='wrstat'
lang='C'
name='Write statistics for file'>
wc_print(which, char_count,word_count,line_count);
if (file_count)
printf(" %s\n", *argv); /* not stdin */
else
printf("\n"); /* stdin */
</scrap>
Scraps can be referred to from other scraps, using the standard TEI
elements <
ptr> (which uses the SGML identifier of the scrap being
embedded as the value of its
target attribute) or
<
ref> (which can use its
target attribute, or which
can use its content to match the name of the target scrap). Using the
SGML
ID/
IDREF mechanism, Knuth and Levy's rewrite of
wc
would have the following top-level scrap; it has an empty string as its
name value, since they make this top-level scrap anonymous.
<scrap name="">
<ptr target="hdrfiles">
<ptr target="globals">
<ptr target="functns">
<ptr target="main">
</scrap>
The other scraps involved would have the following overall structure:
<scrap id="hdrfiles" name="Header files to include">
#include <stdio.h>
</scrap>
<!-- ... -->
<scrap id="globals">
int status = OK; /* exit status of command, initially OK */
char *prog_name; /* who we are */
</scrap>
<!-- ... -->
<scrap id="functns"> <!-- ... --> </scrap>
<!-- ... -->
<scrap id="main"> <!-- ... --> </scrap>
<!-- ... -->
The formal declaration of <
scrap> is as follows:
〈 1 Code scraps 〉 ≡
<!ENTITY % model.scrap '(#PCDATA | ptr | ref)*' >
<!ELEMENT scrap - - (%model.scrap) >
<!ATTLIST scrap %a.global
name CDATA #IMPLIED
file CDATA #IMPLIED
version IDREFS #IMPLIED
index CDATA 'auto' >
This code is not used elsewhere.
I use the entity
model.scrap as the content
model for <
scrap> element in order to make it easier to
customize if desired. It also makes it easier to see that the
<
scrap> and <
recap> elements are declared
identically.
2.2. Matching Continuations and Embedded Scraps
2.2.1. ID/IDREF Matching
The usual method of matching cross references with their targets in
SGML is to use the SGML
ID/
IDREF mechanism: the target is given an
ID attribute, which is named in an
idref attribute
in the cross reference. That method can be used to indicate the
embedding and continuation of code scraps, thus:
- within a <scrap>, a <ref> or <ptr> element
can give the ID of another scrap as the value of its target
attribute. This indicates the logical embedding of the target scrap
within the scrap containing the reference. This is not the same
processing as will be given to cross references outside scraps, but the
context disambiguates the two kinds of processing desired.
- a <scrap> can give the ID of another scrap as the value of
its prev attribute; it will be treated as a continuation of
that scrap. (Ambitious processors may choose to support linking in
the other direction, by means of the next attribute, which
is the inverse of prev. Users should not count on such
support, though.)
2.2.2. String-Prefix Matching
Most literate programming tools inherit from Knuth's WEB system,
however, the convention that code scraps and references to them can be
matched by the names given to each. That is, a code fragment first
referred to thus:[
5]
@<Program to print the first thousand
prime numbers@>
can later be referred to (or, as here, defined) thus:
@<Program to print...@>=
...
In other words, any name ending in an ellipsis is considered to match
any name beginning the same way. (The prefix given before the ellipsis
must be unique.) This method, which I call the string-prefix
match, has the convenience of allowing the web to be slightly
more readable for users, and eliminating the requirement that the
programmer look up, or keep notes on, the SGML identifiers used for each
scrap. It has the inconvience, of course, of not being validatable by
the SGML parser.
When the ID/IDREF mechanisms mentioned above (the target
attribute on <ref> elements, and the prev attribute
on <scrap> elements) are not used, then a string-prefix match is
performed, matching the name attribute of a scrap with either
the name attribute of a continuation or the parsed-character
data content of a <ref> element.
Using string-prefix matching under
Sweb, the example
from Knuth's prime-number program would read:
<!-- top-level scrap refers to 'Program ...' -->
<scrap name=''>
<ref>Program to print the first thousand prime numbers</ref>
</scrap>
<!-- ... -->
<!-- This scrap defines that program. -->
<scrap name='Program to print...'>
<!-- ... -->
</scrap>
The top level of Knuth and Levy's
wc would be encoded
this way:
<scrap name=''>
<ref>Header files to include</>
<ref>Global variables</>
<ref>Functions</>
<ref>The main program</>
</scrap>
The two methods of reference can be combined, for legibility. In this
case,
weave and
tangle will obey the SGML
IDREF and ignore the
name string.
<scrap name="">
<ref target="hdrfiles">Header files to include</>
<ref target="globals" >Global variables</>
<ref target="functns" >Functions</>
<ref target="main" >The main program</>
</scrap>
2.3. Processing Scrap References
Within a code scrap, references to other code scraps serve as cross
references or hyperlinks to those other code scraps. In a printed copy
of the Sweb document, such cross-references will normally contain
the name of the embedded scrap, with a scrap number or page number; by
convention these appear together in Roman type enclosed in angle
brackets. In an online browser, the cross references should display in
much the same way, but should also be live hypertext links: the user
should be able to click on the name and traverse the link automatically,
so that the embedded scrap appears on the screen for consultation.
The live-link behavior is easy to accomplish if
ID/
IDREF linking is
used, since SGML browsers conventionally interpret
ID/
IDREF links this
way. If string-prefix matching is used, however, very few online
browsers will be able to treat scrap references as live hyperlinks. It
is desirable, therefore, for a
weave process to normalize
string-prefix references by transforming them to normal
ID/
IDREF
references. It will simplify the task for formatters and display
programs — not to mention later readers of the web — if any
abbreviated forms of the name are replaced, at the same
time, with the full name. After such normalization, the first example
given in the preceding section would look like this:
<!-- top-level scrap refers to 'Program ...' -->
<scrap name="">
<ref target="program">Program to print the first thousand prime numbers
</ref>
</scrap>
<!-- ... -->
<!-- This scrap defines that program. -->
<scrap id="program"
name='Program to print the first thousand prime numbers'>
<!-- ... -->
</scrap>
A similar normalization is useful for <ptr> elements. Some
SGML browsers have no difficulty in following the pointer, finding the
name of the scrap pointed to, and displaying that name as part of the
normal rendition of the pointer; such text reflection,
however, is beyond the current grasp of other browsers. Such browsers
will give better results if text is copied (duplicated) in the source
rather than being reflected at display time; in Sweb, this effectively
means replacing <ptr> elements with equivalent <ref>
elements. The target and all other attributes of the
<ref> are the same as those of the <ptr> element; the
content of the <ref> is the name value of the <scrap>
element being pointed to.
A full
Sweb implementation, therefore, performs all of the
following normalizations on its input unless the user indicates by means
of run-time flags that they should be suppressed:
- String-prefix scrap references are replaced by <ref>
elements linked by the SGML ID/IDREF mechanism.
- Scrap references using <ptr> elements are replaced by
equivalent references using <ref> elements.
- Each <ref> element is normalized to
contain the name of the scrap it points to.
Full Sweb implementations should also identify blind references and
unreachable scraps and issue warning messages for them. In cases where
blind references are not an error, the warnings can be suppressed by
removing the target attribute from the <ref> element
in question and processing the file with a flag suppressing
string-prefix matching; this will indicate to the weave processor that
no target should be sought for the <ref> element. In cases where
scraps are known to be unreachable, warnings may be suppressed by giving
the value unreachable to the scrap's rend
attribute, in effect saying “I know this scrap is unreachable. In
this case that is not an error. Don't bother me.” [Is this
necessary? Is there a better way?]
2.4. Output to Files
The file attribute on the <scrap> element is used
to specify that a given scrap is to be written out to the file named.
If a scrap tagged with a file name refers to a nested scrap, both scraps
will be written out to the same file; a scrap referred to by several
others will be written out several times, once for each reference.
Any scrap bearing a value for the file attribute will be
written out to the file indicated, together with any scraps it embeds,
followed by any continuations. This will happen even if the scrap
has already been written out to some other file, or has been
referred to by another scrap in the same file. Embedded scraps and
continuations should therefore not bear the file attribute,
unless it is intended that they appear in the output twice.
If output files are rewritten every time an Sweb document is
reprocessed, the resulting changes to the operating system time-stamp on
the files are likely to confuse make and similar software
development utilities. They rely on the time stamp to indicate the last
time changes were made to a file, and will be confused if the time stamp
changes even when the file contents do not. (Changes in the Sweb
source may have affected only other output files.)
To avoid this problem, full
Sweb implementations should replace
output files only if the new version will be different from the
existing version. The
nuweb system demonstrates a simple method of
accomplishing this:
- each output file is written as a temporary file
- the temporary file is then compared to the existing file,
if any, bearing the name given in the web source
- if the two files are different, the existing file is
deleted and the temporary file is renamed with the proper name
- if the two files are the same, the temporary file is
deleted and the existing file is left untouched
Other algorithms are possible, but this one seems to work quite well and
to have few system dependencies.
2.5. A Simple Example
As an example, consider the sample input file included in the
one-page summary for
nuweb, a literate programming tool written by
Preston Briggs of Rice University.[
6] As a
nuweb file, it looks like this:
Here is the overall structure of an example program:
@d Sample Program
@{@<Declarations@>
@<Main Program@>
@<Subroutine@>
@}
Some declarations are always necessary:
@d Declarations
@{Easy things
Real things
Complex things
@| Things Complex declarations
@}
The main program is quite straightforward:
@d Main Program
@{Call @@subroutine
Exit
@}
as is the subroutine
@d Subroutine
@{@<Declarations@>
Useful Job With Things
@}
Finally, write the entire program out to a file:
@o sample.prg -it
@{@<Sample...@>
@}
The files I've generated are:
@f
the macros defined are:
@m
and the terms I've asked to be indexed are:
@u
An SGML user could define many of the special @-sequences as
SGML short references for tags, but to make the SGML structure of this
example clearer, I will use explicit SGML tags. An
Sweb
document equivalent to the
nuweb document above is:
<p>Here is the overall structure of an example program:
<scrap name='Sample Program'>
<ref>Declarations</>
<ref>Main Program</>
<ref>Subroutine</>
</scrap>
<p>Some declarations are always necessary:
<scrap name='Declarations'>
Easy things
Real things
Complex things
<index level1='Things'>
<index level1='Complex'>
<index level1='declarations'>
@}
The main program is quite straightforward:
<scrap name='Main Program'>
Call @subroutine
Exit
</>
as is the subroutine
<scrap name='Subroutine'>
<ref>Declarations</>
Useful Job With Things
</scrap>
Finally, write the entire program out to a file:
<scrap file='sample.prg' rend='noindent keeptabs'>
<ref>Sample...</>
</scrap>
The files I've generated are:
<divGen type='filenames-index'>
the macros defined are:
<divGen type='scrapnames-index'>
and the terms I've asked to be indexed are:
<divGen type='index'>
3. Phrase-Level Elements
References to identifiers, and discussions of program statements, in
the text require the special material to be specially marked. For this,
the following elements are provided:
- <code> contains a fragment of programming code quoted or referred to in
the prose text.
- <ident> contains an identifier from the program, used in prose text.
- <kw> contains a keyword in some formal language
- <lit> contains a constant literal value in some formal
language
- <comment> contains a comment in some formal language (normally
including the delimiters)
- <delim> contains a delimiter in some formal language
- <eg> contains an example of some sort, e.g. in some formal
language (which should not be reformatted)
Some elements are defined solely as syntactic sugar for the
<
ident> and <
code> elements, for use in describing SGML
tag sets.
- <gi> contains the generic identifier of some SGML element type
- <att> contains an attribute name for some SGML element type
- <val> contains an attribute value (without delimiters)
- <ent> contains an entity name (without delimiters)
- <tag> contains an SGML start-tag or end-tag (without the delimiters)
The <
ident> element allows single identifiers to be marked as
such, so that they can be formatted, and possibly indexed,
appropriately. The <
code> and <
eg> elements allow larger
code fragments (statements, expressions, even code blocks) to be
referred to in-line without having to be included in the web.
Within <
code> elements, the token-level tags <
ident>,
<
kw>, etc. may or may not be used. If they are used, the
rendering engine can display each in a characteristic font or color; if
they are not used, a uniform monospaced font may be used.[
7]
The following examples illustrate their use:[
8]
The first example refers to both identifiers (
p)
and keywords (
do while):
The first prime number being given (= 2), the thousandth being
assumed unknown to the programmer, the most natural order of
filling the elements of array <ident>p</ident> is in order of
increasing subscript value, and if we express just that (with a
simple repetitive <kw>while do</kw> clause) we come to ...
The second example, from the same source, shows code fragments
within the prose discussion (here with tagging of some
but not all internal components):
<p><hi rend="sc">Remark.</hi> Here <code><kw>while</kw>
<ident>p[ord]</ident> ↑ 2 ≤ <ident>j</> <kw>do</></code>
can be replaced by
<code><kw>if</kw> <ident>p[ord]</> ↑ 2 ≤ <ident>j</>
<kw>then</></code>, but to my taste the marginal gain in
efficiency is not worth the intellectual effort to prove its
validity. A programmer should learn to be lazy at the right
moment and to let the principle <q>Safety First</q> prevail!
</p>
The formal declaration of these elements is this:
[
It might plausibly be argued that some of these, at least
tag and att belong in TEI Lite, not
here; I agree, but until TEI Lite incorporates them, the easiest
way to make them accessible in Sweb documents is to include
them here.
]
〈 2 Define phrase-level elements 〉 ≡
<!ENTITY % x.tokentypes '' >
<!ENTITY % m.tokentypes '%x.tokentypes ident | kw | lit |
comment | gi | att | val | ent'>
<!ELEMENT code - O (%paraContent;) >
<!ATTLIST code %a.global; >
<!ELEMENT ident - O (#PCDATA) >
<!ATTLIST ident %a.global; >
<!ELEMENT kw - O (#PCDATA) >
<!ATTLIST kw %a.global; >
<!ELEMENT lit - O (#PCDATA) >
<!ATTLIST lit %a.global; >
<!ELEMENT delim - O (#PCDATA) >
<!ATTLIST delim %a.global; >
<!ELEMENT comment - O (#PCDATA | %m.tokentypes)* >
<!ATTLIST comment %a.global; >
<!ELEMENT delim - O (#PCDATA) >
<!ATTLIST delim %a.global; >
<!ELEMENT gi - O (#PCDATA) >
<!ATTLIST gi %a.global; >
<!ELEMENT att - O (#PCDATA) >
<!ATTLIST att %a.global; >
<!ELEMENT val - O (#PCDATA) >
<!ATTLIST val %a.global; >
<!ELEMENT ent - O (#PCDATA) >
<!ATTLIST ent %a.global;
type (ge | pe) 'ge' >
<!ELEMENT tag - O (#PCDATA) >
<!ATTLIST tag %a.global;
type (stag | etag) 'stag' >
This code is not used elsewhere.
4. Multiple Versions
A single Sweb document may need to be used to produce
multiple versions of a given file, e.g. to support different platforms,
or to create a test version with a new algorithm, while retaining the
old algorithm for the production version. It may also be
desirable for local documentation of modifications to a tag set
to be integrated with the base documentation for that tag set:
some sections of the local documentation should replace
corresponding sections of the base document, while other sections
are purely supplemental.
In general, local modifications, whether they represent
omissions, changes, or additions, can and should be handled as
additional material documenting a new (local) version, and tagged
using the techniques described here for multiple versions.[
9]
Three approaches may be used to generate multiple versions of a
program:
- an external pre-processor (e.g. the C pre-processor) may be used,
and the pre-processor directives define, ifdef,
and endif may be used in the code to control the action
of the pre-processor
- SGML marked sections may be used, which are controlled by
parameter entity references and included or excluded based on the values
declared for those parameter entities
- the exclude and version attributes
may be
used on the <scrap> element, to indicate the equivalence of
different scraps, and to indicate which versions use which scrap.
The first solution has a natural appeal in many situations,
since in environments with such a preprocessor it is probably the most
common method of producing multiple versions from the same code.
If the code has many small conditional sections, however, this approach
can make it hard to focus on any one version of the program; if the
small conditional sections are merged to produce longer parallel
sections, it can become hard to keep different versions
synchronized.
The second approach (marked sections) has a similar appeal, since
version control is one of the intended uses of conditional marked
sections. When more than a few versions of the material exist, however,
the marked-section technique may become clumsy. Above all, it is hard
to know how to express, with marked sections, the relations among
versions, especially dependencies, or to capture the idea that if
version 3 is not mentioned explicitly in a particular scrap, then
version 3 is identical, for that scrap, to version 2. Most SGML users
with requirements for industrial-strength version control use
specialized elements for the job, not marked sections.[
10]
The third approach allows the explicit claim that two scraps
of code are, in some sense, equivalent, which is useful for
documentation purposes. In this approach, a scrap may be tagged
as representing an alternative version of some other scrap.
Since they are alternative realizations of the same specification
they are in some sense equivalent. The equivalence of two scraps
is indicated by the
exclude attribute on one scrap
using the SGML
ID of the other as its value:
- exclude points to elements that correspond to the current element
in some way, and which therefore constitute alternative versions
of it, not to be used at the same time.
A set of scraps connected by such equivalence relations is an
equivalence class.[
11]
4.1. Scrap References in Multi-Version Webs
When a web defines only one version of a web, a reference to another
scrap is interpreted, as described above, as an indication that that
scrap belongs logically at the point of reference. When multiple
versions are present in the same web, the reference is interpreted as a
reference not to one particular scrap, but as a reference to the
equivalence class: any scrap in the equivalence class might potentially
be substituted for the reference. Which scrap is selected depends on
which version is being generated; this in turn is controlled by the user
interface to the tangle processor (e.g. by selection from a menu, or by
command-line option).
In the simple case, one scrap in the equivalence class identifies
itself as belonging to the version being generated (by giving the
version identifier on its version attribute); that is the
scrap used by tangle. If two scraps in the same equivalence class
identify themselves as belonging to the version being generated, it is a
fatal error. If no scrap is identified as belonging to the version in
question, a fallback must be found.
4.2. Version Information and Fallbacks
The fallback information is given by the <
version> element,
which declares one version of the program or package, and identifies
a fallback version to use when necessary. Versions are declared in
a <
versionList> element, which should normally appear in the
front matter or introduction to the document (e.g. in a section that
describes which versions of the program exist).
- <versionList> contains a series of <version> elements.
- <version> declares one version of the program, file, or package.
Attributes include:
- id assigns a unique identifier to the version being declared.
- n gives a name for the version (may be multiple words)
- fallback identifies the version to use, when no scrap is identified
as belonging to the version being declared (normally the
version used as the basis for the version being declared).
The fallback version is typically the one from which a new version was
derived. If no scrap in an equivalence class identifies itself as
belonging to the version being generated, the
tangle processor will
find the scrap belonging to the fallback version. If there is none, the
fallback's fallback will be sought, and so on until there are no further
fallbacks, at which point a scrap with no version information at all
will be used (which will normally be the first, unmarked version); if
none or more than one is found, it is a fatal error.
The formal declaration of the <
version> and
<
versionList> elements is as follows:
〈 3 Declare version element 〉 ≡
<!-- Version declarations -->
<!ELEMENT version - O EMPTY >
<!ATTLIST version
lang IDREF %INHERITED
rend CDATA #IMPLIED
id ID #REQUIRED
n CDATA #IMPLIED
fallback IDREF #IMPLIED >
<!ELEMENT versionList - O (version*) >
<!ATTLIST versionList %a.global; >
This code is not used elsewhere.
4.3. Example: Dijkstra on Prime Numbers
For example, Dijkstra's classic article on “Stepwise
Program Construction” presents a program to fill an array with
the first thousand prime numbers. He presents a scrap of code (scrap
0), which defines the overall program and refers to another scrap
(scrap 1), which refines the problem and refers to a third scrap
(scrap 2), which does a portion of the work and refers the rest to
a fourth scrap (scrap 3). After presenting a first version of
scraps 1 and 2, however, Dijkstra observes that they can be made
substantially more efficient by only looking at odd numbers when
seeking primes greater than 2, and presents alternate versions of
these two scraps (scraps 1b and 2b). Further improvements result
in versions c and d; and version d is the first to define a scrap
3. A final revision results in scraps 1e, 2e, and 3e, which represent
the final form of the program.
The versions of this program can be declared thus:
<versionList>
<version id="A">
<version id="B" fallback="A">
<version id="C" fallback="B">
<version id="D" fallback="C">
<version id="E" fallback="D">
</versionList>
The first few code fragments would look something like this;
note that Dijkstra uses the term
version
in a rather loose sense when labeling his code scraps. The root
scrap has no version information, because it is the same for all
versions: if we use the
ID/
IDREF mechanism to match scraps and
references to them it will look like this:[
12]
<p>We shall now give the coarsest version
of the program, viz.
<scrap id="program" name='version 0'>
begin <ptr target="assign"> end
</scrap>
<p>When this action ...
The first version of scraps 1 and 2 follows:
<p>The first prime number ... we come to
<scrap id="assign"
n='1a'
name='assign to the array p
the prime table as described'
version="A">
begin integer k, j: k := 1; j := 1;
while k <= 1000 do
begin <ptr target="increase">;
p[k] := j; k := k + 1;
end
end
</scrap>
<p>Version 1a is a perfect program when ...
With a simple repetitive <kw>repeat until</kw>
clause (which may act upon a sequence of statements)
we come for <q>increase <ident>j</ident> until
the next prime number</q> to
<scrap id="increase"
n='2a'
name='increase j until the next prime number'
version="A">
begin boolean jprime;
repeat j := j + 1; <ref>give to <ident>jprime</>
the meaning: <ident>j</> is a
prime number</ref>;
until jprime
end
</scrap>
<p>If we substitute version 2a for the appropriate
operation in version 1a our resulting program is
undoubtedly correct. ...
Scrap 2a refers to a scrap (“give to
jprime the
meaning:
j is a prime number”) which is never
actually written, because at this point Dijkstra stops and considers an
obvious optimization: checking only odd numbers for primality. Since
there is no such scrap, there is no SGML
ID to point at, and we render
the reference using the <
ref> element. (This also shows, in
passing, how identifiers can be tagged in scrap references; working out
the details is left to the reader as an exercise.)[
13]
Dijkstra's second cut at scraps 1 and 2 follows:
So we come to
<scrap exclude="assign"
n='1b'
name='assign to the array p the
prime table as described'
version="B">
begin integer k, j: p[1] := 2; k := 2; j := 1;
while k <= 1000 do
begin <ptr target="increase">;
p[k] := j; k := k + 1;
end
end
</scrap>
where the analogous elaboration of the
operation between quotes leads to
<scrap exclude="increase"
n='2b'
name='increase odd j until the next
odd prime number'
version="B">
begin boolean jprime;
repeat j := j + 2; <ptr target="jprime-odd">;
until jprime
end
</scrap>
Dijkstra's example raises an important question. Since his scraps 1a
and 1b have exactly the same function, as do 2a and 2b, they are shown
here as members of the same equivalence class. What should be done,
however, with the various versions of the scrap “give to
jprime [for odd j] the meaning: j is a
prime number”? The variations Dijkstra introduces in its name imply
variations in its function; mathematically, as it turns out, this
implication is false, in this case, but it is easy to imagine cases
where analogous scraps do have different functionality, reflecting
different assumptions in the contexts of their use. The rule of Sweb
is simple: analogous scraps, called from similar places in equivalent
code, which perform mathematically different functions, are
not equivalent, and should not be placed in equivalence
classes. Instead, they should be treated as different scraps entirely.
(It is possible to imagine cases where this will require more verbose
encoding of the scraps, but they are all rather far fetched, so this
should not be burdensome in practice.) Code called from the same
location is normally equivalent and can normally be tagged as a
single equivalence class.
[Discussion ... possibly better example, with branching]
5. Recapitulations
When introducing new code to a reader, an author may desire to
present first a very simple skeleton of the code, then develop some of
the details of the code, with discussion, and finally present the
fleshed-out result. For example, in Dijkstra's article on “Stepwise Program Construction” mentioned above, the final
revision of the program is followed by a listing showing the entire
text, with all substitutions (scrap references) resolved. The final
listing, which is identical in content to scraps 0, 1e, 2e, and 3e, is a
recapitulation of the work done so far. Most literate
programming systems don't support recapitulations at all.[
14]
Recapitulations may be indicated in
Sweb documents by
means of the <
recap> element:
- <recap> indicates that the scrap indicated should be reprinted at
the location of the <recap> element, with backward
references resolved and forward references left
unresolved.[15]
Dijkstra's conclusion, for example, would be encoded thus:
<p>Finally, we perform all substitutions to construct
a single statement.
<recap scrap="program"
version="E"></>
<p>We could have made the inner blocks into compound statements
by moving the declarations for <ident>jprime</> and <ident>n</>
to the outside. We have not done so: clarity does not gain by
it and whether there is a point in doing it is rather dependent
on the implementation.
<p>Thus ends the treatment of the first example.</p>
The formal declaration for <
recap> is this:
〈 4 Declare recapitulation element 〉 ≡
<!-- Recapitulations -->
<!ELEMENT recap - - (%model.scrap) >
<!ATTLIST recap %a.global;
scrap IDREF #REQUIRED
version IDREFS #IMPLIED >
This code is not used elsewhere.
6. Scrap Wrappers: Encoding Scraps Revisited
The tags described so far are fine for tangling the web into
compilable code, and for basic printing and display. The weave
processor, however, can do only limited work in improving the
online or printed display of the web. The <ident> and
similar tags cannot appear in the names of scraps, because the
names are attribute values rather than elements; if one scrap is
continued by several others, weave cannot generate and display
a list of the continuation scraps; it is not possible to
indicate, after each scrap, where it was used.
A better
weave processor is possible, but it requires that there be
an element related to a scrap, in which to give pointers to continuation
scraps, to equivalent scraps, and to scraps which embed the current
scrap. We could give these as notes adjacent to the scrap itself,
without extending the current tag set. But it is clearly preferable to
bind them together with the scrap using normal SGML methods: I mean, by
wrapping them together with the scrap into some larger element. For
now, I call this element <
scrapInfo>.[
16]
The <
scrapInfo> element is declared thus:
〈 5 Declare scrapInfo element 〉 ≡
<!ELEMENT scrapInfo - - (head?, scrap, scrapDefs?,
scrapEquivs?, scrapRefs?,
indexDefs?, indexRefs?) >
<!ATTLIST scrapInfo %a.global >
<!ENTITY % model.scrapPtrs '(#PCDATA | ptr | ref | xptr |
xref)*' >
<!ENTITY % model.scrapDefs '%model.scrapPtrs' >
<!ENTITY % model.scrapEquivs '%model.scrapPtrs' >
<!ENTITY % model.scrapRefs '%model.scrapPtrs' >
<!ELEMENT scrapDefs - - (%model.scrapDefs) >
<!ATTLIST scrapDefs %a.global >
<!ELEMENT scrapEquivs - - (%model.scrapEquivs) >
<!ATTLIST scrapEquivs %a.global >
<!ELEMENT scrapRefs - - (%model.scrapRefs) >
<!ATTLIST scrapRefs %a.global >
<!ENTITY % x.chunk 'scrapInfo |' >
This code is not used elsewhere.
Any <
scrap> element can be wrapped inside a <
scrapInfo>
element unless it is already within one; the scrap processing is exactly
the same, except that:
- When a <scrap> appears within a <scrapInfo>
element, its name is taken from the <head> of the
<scrapInfo>, not from the name attribute of the
<scrap> itself.[17]
In normal operation, the <
scrapDefs>, <
scrapEquivs>,
and <
scrapRefs> elements will be generated by the
weave
processor. Typically, they will contain a sequence of <
ptr> or
<
ref> elements, optionally preceded by fixed text. (In some
cases, it may be preferable to generate the fixed text in the SGML
rendering software. I leave open the possibility of including it in the
Sweb document to allow for (a) manual intervention and (b) really,
really stupid SGML software.) For example, in normalized form, the
scrap already seen from Knuth and Levy's
wc might read:
It's convenient to output the statistics by defining
a new function <ident>wc_print</ident>; then the same
function can be used for the totals. Additionally
we must decide here if we know the name of the file
we have processed or if it was just <ident>stdin</ident>.
<scrapInfo>
<head>Write statistics for file</head>
<scrap id="wrstat"
lang='C'>
wc_print(which, char_count,word_count,line_count);
if (file_count)
printf(" %s\n", *argv); /* not stdin */
else
printf("\n"); /* stdin */
</scrap>
<scrapRefs>This code is used in
<ref target='process-all'>section 8</ref>.</scrapRefs>
</scrapInfo>
The <indexDefs> and <indexRefs> elements are described
elsewhere.
7. Indices
Knuth's original Web system not only generates the compilable form
of the program and typesets the documentation, but also generates
indices of variables, function and procedure names, etc., showing where
they are defined and where they are used. It can do this because it
understands enough Pascal to parse the Pascal code and index it
properly. In later literate programming systems, it has become
conventional to generate cross-reference listings of various types:
identifiers in the program, file names, names of modules or program
scraps, etc.
Language-independent literate-programming systems generally find it
hard to generate such intelligent indices without help, since without
understanding the program syntax it is difficult to distinguish
identifiers in the program, which should be indexed, from comments,
literals, or keywords which normally should not be indexed. Indices are
so useful, though, that language-independent systems have developed
methods of generating them even without hard-coded knowledge of the
programming language, generally by using a mixture of automatic
processing and manual assistance. The nuweb system, for example,
allows the user to specify what identifiers are defined in each scrap,
and indexes all occurrences of those identifiers in all scraps. Or
rather, it indexes occurrences of those strings in all
scraps. A simple string-matching algorithm is used, which does no
tokenization; the lack of tokenization means identifiers like
i are matched and indexed if the letter
i appears anywhere in the scrap. [Add discussion
of indexing in other systems (Web, noweb, Fweb, ...) here.]
Sweb allows several approaches to indexing:
- the user can specify index entries using the <indexDefs>
element, which can appear within each <scrap> element. Sweb
indexes every occurrence, in any scrap, of any identifier named in any
scrap.
- index entries can be generated using language-independent or
language-specific filters which locate identifiers in each scrap and
generate <indexDefs> or <indexRefs> elements.
- the user can suppress all automatic indexing and provide
all the index information manually
Each approach is described more fully in the following sections.
7.1. Normal Indexing
Sweb systems normally use a kind of semi-automatic indexing similar
to that performed by
nuweb. The author of the web can indicate, in
the <
indexDefs> element, which identifiers are defined or
declared in the associated scrap. These identifiers are indexed as
being defined in that scrap, and as being used in any other scrap within
which their names occur.[
18]
The author can also indicate, by means of the <
divGen> element,
where the generated index should be inserted in the output document.
The <
indexDefs> element is used to indicate identifiers
defined in the current scrap, or more generally to indicate any index
strings for which the current scrap should be marked as a primary index
entry. The standard TEI elements <
index> and <
divGen> are
used in the normal way to control indexing.
- <indexDefs> contains a list of identifiers to
be indexed as being defined in the current scrap; the list may include
both simple blank-delimited strings and conventional <index>
entries as defined by the TEI core tag set.
- <index> indicates that a specific
index entry should be made for the current location (scrap, page,
section, ...). Sweb extends the standard TEI <index> element
with the attributes target and resp.
Attributes include:
- index identifies, normally by means
of a single token, which of several indices should contain the index
entry
- level1 gives the form of the main index
entry
- level2 gives the sub-entry, if any
- level3 gives the sub-sub-entry, if
any
- level4 gives the fourth-level sub-entry,
if any
- target indicates the SGML element to
which the <index> element applies; if omitted, the index element
is assumed to apply to the specific location at which it appears.
- resp indicates who is responsible for
the index entry; automatically generated <index> elements will
bear the name of the program which generated them as the value of this
attribute
- <divGen> indicates the
location at which a <div> element of some type is to be
generated by a document processor; in particular, an index or table of
contents. Attributes include:
- type indicates the type of <div> element
to be generated automatically and inserted in place of the
<divGen> element.
Suggested values include:
- table of contents
- main index; if multiple indices
are desired, they may be distinguished by a second token:
index identifiers,
index filenames,
index scrapnames,
index names, etc.
All
weave processors for
Sweb will accept simple blank-delimited
strings as input; it is expected, though not required, that they will
replace the input strings in their output with equivalent <
index>
entries. This will allow human editors to use the simpler form for the
first cut at an index, and then edit the output from
weave if
necessary, to get better results.
In normal processing, a
weave processor will
- create a primary index entry for each identifier in the list,
pointing to the current scrap
- normalize the contents of <indexDefs> by replacing each
identifier in the list with an <index> element in which the
identifier appears as the level1 attribute value[19]
- seek other occurrences of the identifier in other scraps,
and make secondary index references for all the other scraps in
which the identifier is found
- insert, at the location indicated by an appropriate
<divGen> element, the index of identifiers thus created
For example, one of the scraps of Dijkstra's prime-numbers program
might look like this in the first draft of a document, with an index
entry specified for
jprime:
<scrapInfo>
<head>increase <ident>j</ident> until the next prime number</head>
<scrap id="increase"
n='2a'
version="A">
begin boolean jprime;
repeat j := j + 1; <ref>give to <ident>jprime</>
the meaning: <ident>j</> is a
prime number</ref>;
until jprime
end
</scrap>
<indexDefs>jprime</indexDefs>
</scrapInfo>
After
weave processing, the <
indexDefs> element would contain
<
index> elements, not just raw tokens:
<indexDefs>
<index level1='jprime' index='identifiers'>
</indexDefs>
At the end of the document, the identifier index would include an entry
something like this:
<item><ident>jprime</ident>:
defined in
<ref target="increase">increase <ident>j</ident>
until the next prime number</ref>,
used in ...
</item>
In some cases, it may be desirable to use the more structured form of
an <
index> element. If we wanted to use second-level index
entries to specify the type of each identifier, for example, we might
substitute an <
indexDefs> entry like the following:
<indexDefs>
<index
level1='jprime'
level2='(boolean)'
index='identifiers'
rend ='defined'
>
</indexDefs>
In this case, the
weave processor would leave the <
indexDefs>
element alone, and the entry in the identifier index would look
something like this:
<item><ident>jprime</ident>
<list type="simple">
<item>(boolean):
defined in
<ref target="increase">increase <ident>j</ident>
until the next prime number</ref>,
used in ...
</item>
</list>
</item>
The formal definition of <
indexDefs> is as follows:
〈 6 Define indexDefs 〉 ≡
<!ELEMENT indexDefs - O (#PCDATA | index | %m.tokentypes)* >
<!ATTLIST indexDefs %a.global >
This code is not used elsewhere.
7.2. Special Cases: Fully Automatic Indexing
7.2.1. ScrapIndexer
It may be convenient at times to have a fully automatic method of
generating index entries; it will be even more convenient if the index
entries thus generated can be used as they are, or edited manually to
remove any dross. What we want is a mechanism to take an Sweb
input file without index elements, and produce as output a second Sweb
input file, the same as the first except that it does have index
elements, which the user can delete or change if they are not quite
right.
Really good automatic indexing requires language-specific processing,
but even relatively simple language-independent index-entry generators
can be good enough to be useful. It's more important, in such simple
indexers, to ensure that all identifiers are indexed, than to ensure
that all non-identifiers are not indexed: during development of a
program, extra index entries do no serious harm, and it is easier
to delete unwanted index entries than to type in index entries for
identifiers the automatic processor missed.
The
Sweb system defines a free-standing language-independent
processor to generate index entries for scraps, which currently bears
the name
ScrapIndexer. It behaves as follows:
- it assumes a simple generic language specification
which defines, for each language in a document:
- character strings which serve as delimiters or operators
- character strings which begin and end comments which can nest
- character strings which begin and end comments which cannot nest
- character strings which delimit literals
- regular expressions which match legal constants
- keywords in the language
- it tokenizes the contents of every scrap based on the values in
the language specification
- it indexes each token in the result; if the scrap bears a
lang attribute, the language name is given in
parentheses as a level-two index term
The default language specification
- treats the strings
+,
++,
-,
--,
*,
/,
=,
==,
<,
>,
<=,
>=,
%,
(,
),
[,
],
{,
},
&,
&&,
|,
||,
,,
:,
?,
^,
~,
!,
|, and
;
as delimiters or operators
- assumes that # and
// begin comments which run to the end of the
line
- assumes that strings between
/* and
*/, between
(* and
*), and between
<!-- and
-->
are comments (of which the first but not the latter two nest) and should
not be indexed
- assumes that strings between double and single quotes
are literals and should not be indexed
- assumes that tokens beginning with a numeral are numeric
constants and should not be indexed
For example, the following scrap
<scrapInfo>
<head>Command-line flags</head>
<scrap id="clflags" lang="C">
/* get arguments and set things up ... */
while (--argc > 0 && ((*++argv)[0] == '-' || (*argv)[0] == '/'))
while (c = *++argv[0])
switch (c) {
case 't': fTrace = fDebug = fVerbose = 1;
iMsglevel = msgTRACE;
yydebug = 1;
break;
case 'd': fDebug = fVerbose = 1;
iMsglevel = msgDEBUG;
break;
case 'v': fVerbose = 1;
iMsglevel = msgVERBOSE;
break;
default:
fprintf(stderr,"sweb: unknown option %c\n",c);
argc = 0;
break;
}
</scrap>
</scrapInfo>
would look something like this after processing by
ScrapIndexer:
<scrapInfo>
<head>Command-line flags</head>
<scrap id="clflags" lang="C">
...
</scrap>
<indexRefs>
argc argv break c case default fDebug fprintf fTrace fVerbose iMsglevel
msgDEBUG msgTRACE msgVERBOSE stderr switch while yydebug
</indexRefs>
</scrapInfo>
After running
ScrapIndexer, it is easy to edit the results by deleting
the tokens which should not be indexed (e.g. because they are keywords
of the language):
<indexRefs>
argc argv c fDebug fprintf fTrace fVerbose iMsglevel
msgDEBUG msgTRACE msgVERBOSE stderr yydebug
</indexRefs>
Depending on the run-time options chosen, the <
indexRefs>
section produced by
ScrapIndexer might be more like this:
<indexRefs>
<index index='identifiers' level1='argc' level2='(C)'>
<index index='identifiers' level1='argv' level2='(C)'>
<index index='identifiers' level1='break' level2='(C)'>
<index index='identifiers' level1='c' level2='(C)'>
<index index='identifiers' level1='case' level2='(C)'>
...
</indexRefs>
This more elaborate output makes it easier for the user to distinguish,
if desired, among variables, functions, etc., by editing the value given
for the
level2 attribute. Since
weave processing does not
alter these elements if they contain <
index> elements, the user's
manual improvements won't be overwritten by later runs of
Sweb. (If
the scrap changes radically, it may be necessary to run
ScrapIndexer
again, indicating by means of run-time flags which scraps should be left
alone and which reindexed from scratch.)
In some cases, the default language specification is radically
unsuitable; the next sections describe methods, one simple and one more
complex, of dealing with such cases.
7.2.2. Generating Indexers
The simple automatic index generation just described is probably
useful in many cases, but it is inadequate in others. Sometimes, it's
inadequate because its model of programs ignores function and module
boundaries or because the index needs to contain more information about
the constructs being described in the source code (e.g. whether they are
variables, functions, data structures, elements, attributes, or
something else). In such cases, special-purpose language-specific
indexing routines may be needed; for discussion, see below, section
Language-Specific Indexing.
In other cases, the simple-minded flat model of ScrapIndexer is not a
problem, but the default language specification used by ScrapIndexer
does an unacceptably poor job of sifting out the identifiers of a
programming language from the other stuff. In this case, it would be
nice if ScrapIndexer could read a language specification for each
language in the web, and use them in its processing. (Let us give the
name SmartIndexer to this putative enhanced version of ScrapIndexer.)
Unfortunately, I expect the first version of ScrapIndexer to be written
in yacc and lex, and it seems unlikely that I will be able to figure out
how to make yacc and lex do this right.
If we cannot have a SmartIndexer program, it would be nice to have a
program to read a language specification and generate a new version of
ScrapIndexer, with the new language specification hard-coded into the
yacc and lex. This would require the user to have yacc and lex
available, but not to know very much about how to use them. I think I
know how to write such a ScrapIndexer-generator, but it's a low priority
project for me, since if the default language specification of
ScrapIndexer is no good, I can always change it myself, and I don't need
a clever automatic way to do it.
If we can't write a smart indexer, and can't automatically generate
new versions of ScrapIndexer, the next best thing would be to have an
easy way of making a new version of ScrapIndexer by hand. Since making
new versions of software requires nothing more arcane than Sweb's own
version-control tags, and a version of Sweb which supports multiple
versions, this seems relatively straightforward. This section
describes, in general terms, what is involved.
It also defines the DTD for a language-specification document, and
the behavior prescribed for an indexer-generator, so that any readers
with time on their hands can try those hands at an automatic generator
for new versions of ScrapIndexer.
The
ScrapIndexer program, needless to say, is written using
Sweb,
and all the assumptions built into it may be changed by adding new
versions of the relevant scraps to that document. The yacc grammar
embodies the program's knowledge of
Sweb document structure, such as
the fact that
ScrapIndexer only has to worry about scraps. It also
embodies assumptions about the nature of the source code: no structure,
just a sequence of keywords, comments, strings, identifiers, etc. The
lexical scanner -- more specifically, the lex definition section --
makes explicit all the default assumptions about what strings represent
what kinds of tokens. All that's necessary, to make a new version of
the program, is:
- add a new <version> element to the
<versionList> at the beginning of the document
- add a new section to the document, describing the new
version
- in the new section, make a scrap equivalent to the scrap
lexdefs, with its version attribute
set to the ID of the new version.
- use Sweb to generate the proper new version of the source code,
then use yacc and lex and C to compile it.
For example, let us assume we want a simple indexer for Scheme or
Lisp. We would add a new <
version> element to the
<
versionList>:
<versionList>
...
<version id="Scheme1.0">Scheme ver. 1.0</version>
</versionList>
Then, we would add a new section to the document:
<div1 id="lisp"><head>Alternate Version for Lisp</head>
<p>...</p>
within which we would provide an alternate scrap for the lex
definitions:[
20]
<scrap id="lisprules"
exclude="lexrules"
name="Lex rules for Lisp tokenization">
Flatcomment (";".*)
Nestcommstart '\0'
Nestcommend '\0'
Delims [\'()\[\] \t]
Sign [-+]
Integer {Sign}?[0-9]+
Rational [0-9]+"/"[0-9]+
Fixpoint {Sign}?[0-9]+"."[0-9]+
Scientific ({Integer}|{Fixpoint})("e"{Sign}?[0-9]+)
Real {Fixpoint}|{Scientific}
Number {Integer}|{Real}|{Rational}
String "\""[^\n\"]"\""
Namedchar backspace|return|page|linefeed|newline|rubout|space|tab
Character ("#\"[^\n\t ])|("#\"{Namedchar})
Constant {String}|{Number}|{Character}
StandardNames "+"|"-"|"*"|"/"|"car"|"cdr"|"c"[ad][ad]+"r"
</scrap>
Then we would:
- run Sweb over the ScrapIndexer source file, specifying
Scheme1.0 as the version to generate
- compile the result with yacc, lex, and C
- run it as needed over our Sweb documents
Care must be taken, of course, not to overwrite the default version
unintentionally. The resulting indexer should accept a run-time
indication of what languages it is responsible for, and should not run
on scraps identified as being in other languages. Scraps without
values for the
lang attribute may or may not be processed;
this should depend on a run-time flag.
A tool for the automatic generation of indexers in the style of
ScrapIndexer might work as follows:
- The user writes a language specification using the special
DTD described below.
- The indexer-generator reads the language specification and
writes source code for an indexer, in the form of a yacc/lex
program, a set of emacs macros, or what have you.
- The resulting tool is used in the same way as
ScrapIndexer itself.
The language specification provides the following elements for
defining the syntax of a language and modifying the default behavior.
- <delimiter> contains a regular expression for a string
which should be treated as a delimiter or operator
- <flatComment> contains a regular expression matching an
entire comment; comments of this type are assumed not to nest
- <nestingComment> contains two subelements
<start> and <end>, each containing a regular expression
matching the start-delimiter or end-delimiter of a comment; comments
of this type assumed to nest
- <string> contains a regular expression matching a literal
string
- <constant> contains a regular expression matching a
constant
- <keyword> contains a regular expression matching a
keyword or standard function
- <name>, if supplied, contains a regular expression matching
a legal identifier in the language
The regular expressions should follow the conventions of
lex.
If <name> is supplied, only tokens matching the given regular
expression will be indexed. Otherwise, all tokens not matching any of
the other token types will be indexed.
Multiple occurrences of each element type may be given; the token
type concerned will be defined as a logical alternation of all of
them. For example, some of the keywords of Scheme might be defined
thus:
<keyword>"*"</keyword>
<keyword>"+"</keyword>
<keyword>"-"</keyword>
<keyword>"/"</keyword>
<keyword>"1+"</keyword>
<keyword>"1-"</keyword>
<keyword>abort</keyword>
<keyword>abs</keyword>
<keyword>and</keyword>
<keyword>append</keyword>
<keyword>apply</keyword>
<keyword>assoc</keyword>
<keyword>atom?</keyword>
Formally, the
languageSpecification DTD contains the
following declarations.
〈 7 DTD for language specifications 〉 ≡
<!ELEMENT languageSpecification - -
(delimiter | flatComment
| nestingComment | string
| constant | keyword | name)* >
<!ATTLIST languageSpecification
languages NAMES #REQUIRED >
<!ELEMENT delimiter - O (#PCDATA) >
<!ELEMENT flatComment - O (#PCDATA) >
<!ELEMENT nestingComment - - (start, end) >
<!ELEMENT start - O (#PCDATA) >
<!ELEMENT end - O (#PCDATA) >
<!ELEMENT string - O (#PCDATA) >
<!ELEMENT constant - O (#PCDATA) >
<!ELEMENT keyword - O (#PCDATA) >
<!ELEMENT name - O (#PCDATA) >
This code is not used elsewhere.
7.2.3. Language-Specific Indexing
For more sophisticated indexing, an indexer must have better
knowledge of the language than is possible with ScrapIndexer and its
derivatives. For most languages, such knowledge can readily be supplied
in the form of free-standing filters similar to ScrapIndexer which
read Sweb documents as input and produce equivalent Sweb documents
as output, differing only in the addition or alteration of indexing
information.
Such filters must:
- recognize the beginning and ending of each <scrap>
element
- detect, from the lang attribute on each scrap,
whether the scrap should be processed or ignored
- detect, from the index attribute of the scrap,
whether automatic indexing has been suppressed (if the attribute bears
the value auto, it has not been suppressed; if it bears the
value manual, automatic indexing filters should not process
the scrap; other values have no standard interpretation
- record the id attribute of each scrap
processed
- parse scraps in the specific language recognized by the filter --
keeping in mind that arbitrary parts of the code may be replaced by
<ptr> or <ref> elements -- and generate appropriate lists
of identifiers for each such scrap
- if the scrap has no <scrapInfo> element surrounding
it, optionally insert such an element; alternatively, ignore such scraps
(i.e. require a prior run of Sweb itself).
- supply <indexDefs> and/or <indexRefs> elements for
each scrap, containing elements with the appropriate values of
resp (the name of the filter),
level1 (the identifier being indexed), and
level2 (whether it is a variable, function name, etc.,
and the name of the language)
- recognize and suppress <indexDefs>, <indexRefs>, and
<index> elements generated by previous runs of the filter
- pass all other information, both inside and outside of scraps,
through to output without change
Given a proper grammar for the language in question, it should be
possible for language-specific indexers to distinguish functions from
variables, and declarations or definitions from uses of those functions
and variables. A sample language-specific program is described in the
next section.
Language-dependent filters might also be
used to typeset code in specific languages, adding formatting or
structural tags to the code. As defined, <
scrap> elements can
contain only character data, <
ref> elements, and <
ptr>
elements, however, so typesetting filters must either translate the
document into a new DTD, or modify the standard
Sweb
DTD by redefining
model.scrap. One obvious
possibility (with which, however, I have not yet experimented) is to
redeclare scraps, adding to the content model
- all of the phrase-level elements defined as members of the class
tokentypes (to allow individual tokens to be rendered in
an appropriate font), and
- a <block> element to mark blocks of code which need to be
indented vis-a-vis their parent blocks; blocks can also be used to force
line breaks at the beginning and end of the block.
Such a redeclaration might make it possible to support type-setting of
source code in
Sweb without losing the declarative nature of SGML
tagging, and without binding the user to any particular typesetting
system. Further work is needed in this area.
7.2.4. SGML-Specific Indexing
A language-specific processor for SGML will be supplied, which
incorporates a DTD processor which recognizes element and attribute-list
declarations and indexes generic identifiers of elements being
declared, generic identifiers of elements named in the content model,
and attribute names and values. In entity declarations, the SGML
indexer will recognize and index the name of the entity being declared.
When the DTDs being processed follow TEI naming conventions, entity
declarations for model- and attribute-classes will be recognized and the
entity text will be analysed correctly.
For example, when processing the scrap in this document, in which
the <
scrap> element is defined, the SGML indexer would produce
the following output:
<scrapInfo>
<head>Code scraps</head>
<scrap id="dScrap">
<!ENTITY % model.scrap '(#PCDATA | ptr | ref)*' >
<!ELEMENT scrap - - (%model.scrap) >
<!ATTLIST scrap %a.global
name CDATA #IMPLIED
file CDATA #IMPLIED
version IDREFS #IMPLIED
index CDATA 'auto' >
</scrap>
<scrapDefs>
<index level1='model.scrap (SGML entity)'>
<index level1='scrap (SGML element)'>
<index level1='name (attribute) ' level2='on scrap'>
<index level1='file (attribute) ' level2='on scrap'>
<index level1='version (attribute) ' level2='on scrap'>
<index level1='index (attribute) ' level2='on scrap'>
</scrapDefs>
<scrapRefs>
<index level1='ptr (SGML element)' level2='in model.scrap entity'>
<index level1='ref (SGML element)' level2='in model.scrap entity'>
</scrapRefs>
</scrapInfo>
7.3. Special Cases: Fully Manual Indexing
Automatic index processing can be supressed entirely, to allow
the author of the web complete control over the indexing of scraps.
Suppression of automatic index processing also allows manual
tweaking of the index, after an automatic process has generated a
rough draft.
To suppress automatic index generation entirely, it is necessary
only to avoid running any of the automatic index generators.
Automatic index generation can be suppressed for individual scraps, by
setting their index values to manual.
To suppress the normal index processing behavior (as described above
in section
Normal Indexing), it is
necessary to suply an appropriate run-time flag.
7.4. The Back-of-the-Book Index
Normal index processing produces a full index of identifiers at
the location specified by the user by means of a <divGen>
element. Its contents are very simple: a <list> of type
index. Each index entry is an item; each item consists of
the entry form itself, followed by a series of <ref> elements
pointing at the scraps in which the identifier in question is defined or
used. The point of definition will normally be marked with an asterisk,
but this may vary from implementation to implementation.
A fragment of the print index from a literate program might look
something like this:[
21]
<list type="index">
<!-- ... -->
<item><ident>nactive_heaps</ident> *75c, 76a, 76b </item>
<item><ident>NEQ</ident> *24a </item>
<item><ident>newcell</ident> 41a, 52c, 70a, 84c</item>
<item><ident>nheaps</ident> 19b, 21b, 22b, 62c, 69, 73b, 74, 75c, 76a,
76b </item>
<item><ident>NIL</ident> 16b, 20a, *24a, 31a,
<!-- ... -->
</item>
<!-- ... -->
</list>
A version of this index intended for online use might look slightly
different: the page numbers might better be replaced with <
ref>
elements containing the names of the scraps, with the scrap identifiers
in the
target attributes:
<list type="index">
<!-- ... -->
<item><ident>nactive_heaps</ident>
<ref target="gcroutines75c">*Define nactive_heaps()</ref>,
<ref target="gc76a">define gc_status()</ref>,
<ref target="gc76b">Define gc_info()</ref>
</item>
<item><ident>NEQ</ident>
<ref target="type-predicates">*Define Lisp type predicates</ref>
</item>
<item><ident>newcell</ident>
<ref target="newcell">Define newcell</ref>,
<ref target="fopen">Define fopen</ref>,
<ref target="gc-for-newcell">Define gc_for_newcell</ref>,
<ref target="lmfuncs">Declare list-management functions</ref>
</item>
<item><ident>nheaps</ident>
<ref target="heapvars">Heap variables</ref>,
<ref target="process-cla">Define process_cla()</ref>,
<ref target="print-hs-1">Define print_hs_1</ref>,
<ref target="init-storage-1">Define init_storage_1</ref>,
<ref target="allocate-aheap">Define allocate_aheap()</ref>,
<ref target="looks-pointerp">Define looks_pointerp</ref>,
<ref target="gcsweek">Define gc_sweep</ref>,
<ref target="gcroutines75c">Define nactive_heaps()</ref>,
<ref target="gc76a">Define gc_status()</ref>,
<ref target="gc76b">Define gc_info()</ref>
</item>
<item><ident>NIL</ident>
<ref target="globals">Declarations for siod.c</ref>,
<ref target="lisp-obj">Lisp objects</ref>,
<ref target="type-predicates">*Define Lisp type predicates</ref>,
<ref target="leval-args">Define leval_args</ref>,
<!-- ... -->
</item>
<!-- ... -->
</list>
7.5. Using Sweb Indexing Facilities in Practice
The markup and indexing processes described in this document can
be used in a variety of ways. In a ‘normal’ case, if
any exist, the programmer might index all scraps manually, allowing
Sweb to generate an index of identifiers.
In some cases, it might be more convenient to run ScrapIndexer
periodically during development, so as to get a rough index
automatically. Since ScrapIndexer does not distinguish between
definitions and mere uses of variables, the resulting index is not so
useful, but it also requires less work to generate. When the program is
complete or nearly complete, it would be sensible to run ScrapIndexer
(or some other automatic index generator) a final time, and then to edit
the results, moving <index> elements as needed into the
<indexDefs> element from the <indexRefs> element. The
actual index can be generated by the default behavior of Sweb, or
the index processing can be set to fully manual operation.
The general rule is simple: use the automatic index generators to
generate rough indices during development, and use their output as a
rough first draft, to be manually edited and tweaked as the document
approaches completion.
8. Generating Reference Material
In some cases, it is not enough to generate an index of identifiers.
When defining a library, or an SGML tag set -- and even when defining a
relatively complex program -- it is useful to provide reference material
for various constructs of the library, tag set, or program.
For a function library, reference material might list the name and
arguments of each function, with a description of the function's
purpose. For SGML document type definitions, reference material might
list, for each element type, the generic identifer, attributes, and
formal declaration of the element, with a prose description of its
usage. (Several DTDs exist for DTD documentation; perhaps the most
accessible, and certainly the one I know best, is the Tag Set
Documentation DTD defined by the Text Encoding Initiative.)[
22]
Some complications await us if we wish to support such reference
materials; more complications arise if we want to generate them
semi-automatically.
One complication is that the constructs of the formal system, be they
functions or SGML element types, should normally be described in the
prose documentation, too. Multiple descriptions, of course, risk
inconsistency. Sweb provides a method of controlling
redundancy and reducing inconsistency. A glossary list (i.e. a
<list> or <glossary>) in the text with a
type of desclist will be recognized by
Sweb as containing descriptions of important items in the
formal system being defined, and reference material will be generated
for those items, if it doesn't already exist.
The reference material for a given item includes at least the
following SGML elements:[
23]
- contains the reference documentation
itself
- contains the name of the the variable,
element type, function, etc.
- contains a description, copied from the <item> or
<gloss> portion of the description list
With sufficiently intelligent language-specific processors, and manual
editing of the results, or just with patient manual labor, the
reference material may also include:
- contains the full name of the item, if the formal identifier is an
abbreviation
- description of subordinate items (<attributes>
documents the attributes of an element type,
<arguments> the arguments of a function,
<members> the members of a structure or class,
<range> the range of values of a data type)
- contains an example of the item (tagged with <eg>),
together with commentary (tagged as a series of <p>
elements)
- contains remarks on the scope, usage, or other interesting aspects
of the item
- indicates what module or part of the system the item is part
of
- indicates what file(s) the item is defined or declared in
- contains pointers to other related items, whatever the
relationship (parents, children, superclasses, subclasses, etc.)
- contains the formal declaration or definition of the item
- contains an informal paraphrase of the declaration or
definition
- gives a cross reference to a section in the document which
describes or mentions this item
- identifies an equivalent construct in another system
For example, here is a sample description for an imaginary SGML
element named
bl (for ‘
blort’):
<refDoc type="tag">
<ident type="gi">bl</>
<rs >blort</>
<desc >Marks any sequence of granfalloons.</desc>
<subordinates>
<refDoc type="att">
<ident type="gi">id</>
<rs >identifier</>
<desc >Gives a unique ID to this blort.</desc>
<informal >any unique SGML name</>
<formal >id ID #REQUIRED</>
</refDoc>
</subordinates>
<exemplum><p>This is how it works: ...</p>
<eg> ... </eg>
</exemplum>
<remarks><p> ... </remarks>
<module> ... </module>
<files> ... </files>
<ptrList type='classes' targets='clspqr chunk'>chunk, Rahtz-class</>
<ptrList type='children' targets='granf'>granfalloon</>
<ptrList type='parents' targets='&divs; front body back'>back body
div div0 div1 div2 div3 div4 div5 div6 div7 front</>
<informal>Contains a series of granfalloon elements, optionally
preceded by a heading.</>
<formal><![CDATA[
<!ELEMENT blort - O (head?, granfalloon*) >
<!ATTLIST blort
id ID #REQUIRED >
]]>
</formal>
<ptr type="doc" target="chap3">
<equiv scheme="TEI">seg type="blort"</>
</refDoc>
In describing the handling of <refDoc> elements, we can
distinguish two main cases: either no <refDoc> element exists
for a given item, or one does.
If no <refDoc> element exists for a given item, then unless
the user indicates (by means of run-time flags) that this action should
be suppressed, one is generated by the Sweb processor for each item
found in a <label> element within a <list> element of
type desclist. The generated <refDoc>
contains only an <ident> element (taken from the <label>
element of the description list) and a <desc> element (taken from
the corresponding <item> element). If the appropriate run-time
flag is given, the Sweb processor will also generate empty elements
for the rest of the reference documentation. The <refDoc>
element is inserted into the document wherever an appropriate
<divGen> element is encountered in the input.
Language-specific processors may also be written to parse the various
languages used in the web, recognize the important constructs, and copy
them as appropriate into the reference documentation. The
preparation of such language-specific processors is not discussed
further here.
When it generates reference documentation, Sweb creates
redundancies in the text, between descriptions and code scraps in the
prose and the <desc> elements and code scraps in the reference
documentation. Whenever Sweb copies material from one location in the
input to multiple locations in the output, one location in the output is
marked as the original, the other as a copy. When on later runs the
redundant material is encountered in the input, the copy is ignored and
a new copy is made.
Flags on the
Sweb processor can be used to control the actions
desired:
- suppress all reference-list activity (copy everything through
without change)
- emit two copies of relevant material, marking the reference
documentation as the original and marking the prose as the copy
- emit two copies of relevant material, marking the prose as the
copy and marking the reference documentation as the original
- emit only one copy of relevant material, putting it into the
reference documentation and leaving a special pointer element in the
prose to show where the material needs to be mirrored
- put the material in the prose, leaving a special
pointer element in the reference documentation to show where the
material needs to be mirrored
In all cases, the copy element is identified by having a
copyOf or sameAs attribute pointing at the
original. (In the latter case, the copy has a redundant copy of the
material; in the former case, the copy is empty.) The original may be
identified merely by being the target of a sameAs or
copyOf attribute; optionally, it may also receive a
select attribute whose value is its own ID. This is
strictly speaking unnecessary, but helpful in flagging clearly its role
as the master copy, for human editing.
The formal declarations of these elements are these:
〈 8 Declare refDoc 〉 ≡
<!ENTITY % m.subs 'attributes | arguments | members | range' >
<!ELEMENT refDoc - - (ident, rs?, desc, (%m.subs)?,
exemplum*, remarks?, module?,
files?, ptrList*, informal?, formal?,
ptr*, equiv*) >
<!ELEMENT (attributes | arguments | members) - - (refDoc*) >
<!ELEMENT range - - (code, desc)* >
This code is not used elsewhere.
9. Technical Issues
9.1. Processing Webs
Sweb is a single-DTD system, designed to be suitable for handling
by a single processor rather than by separate
weave and
tangle
processors. By
single-DTD system, I mean a system
whose output conforms to the same markup language as the input. This
distinguishes it from virtually all of its predecessors, whether
single-processor or double-processor systems:[
24]
they universally produce, as output, files which cannot be read again as
input.
In practice, however, a double-DTD system has the undesirable
effect that everyone reads the output form of the document and no one
ever sees the input form.[
25] A single-DTD system
eliminates this problem: anyone interested in using the system can
imitate the file format they see in the finished products.
The single-DTD system also makes it much easier to do a first sketch
of a program or SGML DTD, without the reference material.
A single process which both weaves and tangles the input must:
- tangle the output files
- normalize <ptr> elements into <ref> elements for the
sake of browsers which can't reflect text dynamically
- expand <recap> elements inline, ignoring any content they
have in the input
- support multiple versions of the code, recognizing equivalence
classes and tangling a single version of the output (by default, the
last version in the <versionList>; the user may specify another
version at run time)
- normalize string-prefix matching scrap references to <ref>
elements using ID/IDREF linking, normalizing the content of the
<ref> elements at the same time to agree with the full name of
the scrap referred to
- generate indices on demand
- generate reference material
These are listed in what seems to me a natural order for gradual
implementation.
9.2. Invocation / Selection
The tag set declared here functions as a user-supplied extension to
the host TEI DTD; apart from being supplied by the user instead of by
the TEI, it works the same way the TEI ‘additional tag
sets’ work. The formal declarations of the SGML elements will
be contained in one file, while the SGML glue needed to integrate the
Sweb tags with the native TEI tag sets will be included in
another. The system names of these are sweb.dtd and
sweb.ent, respectively.
Normally, literate programming documents will use the TEI base tag
set for prose (verse and drama would also work, but might seem
excessively ‘literate’ and frighten later maintenance
programmers who have to work with the web). The tag set described here
also requires portions of the additional tag set for linking. The
document type declaration for a literate programming document must
therefore select the TEI tag sets for prose and linking, and identify
the files
sweb.dtd and
sweb.ent as
containing extensions we wish to use. In the simple case, when we are
not using any other extensions, the document type declaration can look
like this:
<!DOCTYPE tei.2 SYSTEM 'tei2.dtd' [
<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.linking 'INCLUDE' >
<!ENTITY % TEI.extensions.dtd SYSTEM 'sweb.dtd' >
<!ENTITY % TEI.extensions.ent SYSTEM 'sweb.ent' >
]>
In the more complex case, we have to combine the
Sweb
additional tag set with other extensions, for example, those which
define TEI Lite. In this case, we need only:
- create a new TEI.extensions.ent file,
containing entity declarations for the two extensions' own
entity-declaration files. If the two entities modify any of the same
TEI entity declarations, then new declarations must be put together by
hand for the entities concerned. After this is done, the two extension
entity files can be referenced.
- create a new TEI.extensions.dtd file,
containing entity declarations for the two extensions' own
element-declaration files. If the two entities modify any of the
same
TEI element or attribute list declarations, then new declarations must
be put together by
hand for the elements concerned. After this is done, the two
extension DTD files can be referenced.
[example needed.]
9.3. Specialized Handling of Standard Elements
Some fundamental literate programming facilities can be handled by
standard elements and require no specialized tags. It may be necessary,
however, for these standard elements to be processed in specialized
ways. This section describes the special processing required for
standard elements when used in literate programs.
The <ref> element and <ptr> elements are used in
program scraps, to indicate embedding of other scraps.
Tangle processors should follow the link recursively and
embed the other scrap at the location indicated; weave
processors should print the name of the other scrap, followed by a
module or scrap number, section number, or page number, all enclosed
within angle brackets, or using whatever typographic convention is
adopted for module references.
The <
divGen> element is used to signal that a
text-division (<
div>) element should be generated, containing
an index of a kind specified in the
type attribute. Special
values which should be recognized by
Sweb processors are:
- filenames-index
- index of files generated
- scrap-index
- index of scrap names
- version-index
- index of version names
- index
- index of identifiers (manually indexed)
[Idea: make two-level convention, so <divGen type='index
filenames'> dumps the index created by <index index='filenames'
level1='myfile.dat'> and so on. Then it's simple to allow the user
to specify whether the identifiers belong in the main index, or in an
index of their own. Ditto for everything else: filenames can go in
their own index, or they can be put into the main index, depending on
one's preferences.]
9.4. Integrating the Sweb Tags into the Base Tag Set
In order to integrate the tags defined here into the TEI base, we
need to include them into the TEI element-class system. I do this by
defining some specialized parameter entities:
〈 9 〉 ≡
<!ENTITY % x.tokentypes '' >
<!ENTITY % m.tokentypes '%x.tokentypes ident | kw | lit |
comment | gi | att | val | ent'>
<!ENTITY x.phrase '%m.tokentypes |'>
<!ENTITY x.chunk 'scrap | versionList |'>
This code is not used elsewhere.
[This section needs work.]
9.5. Implementing Language-Specific Processing
Knuth's original WEB system, and many of its offshoots, typeset the
program code in a readable style, formatting code blocks appropriately
(with indentation showing code structure), printing keywords in bold,
literals in a monospaced font, and identifiers in italics, and using
special characters in the standard TeX fonts for important operators
such as assignment, logical-and, and logical-or. This requires
substantial knowledge of the programming-language syntax to be
hard-coded into the program or located in a table somewhere for
consultation by the program.
Hard-coding the program's linguistic knowledge also has the drawback
of limiting a web to the language(s) known to the weave
and tangle processors; in an attempt to evade this
drawback, some tools simply print all program code in a monospaced
typewriter font, reserving normal typesetting for module references and
comments.
In Sweb, program scraps and inline <code> elements are
processed, by default, as verbatim text using a typewriter font.
Language-specific processing can be achieved by writing special routines
to parse program fragments and insert the proper formatting codes; these
special routines should be introduced into the style-sheet controlling
the weaving process. In this way, special indexing, etc., can also be
introduced.
[Specific examples needed here.]
[N.B. the discussion above is silently assuming that Sweb will be
implemented by an SGML-aware filter program driven by a style sheet, or
by a pipeline of such filters, implemented with tf, CoST, OmniMark, or
similar programs. Since Sweb's output is also an SGML document, then
language-specific processors can be run on the Sweb output. They
need only distinguish
scraps and inline code fragments properly, and identify their language
correctly.]
9.6. Tables of Correspondences
As an aid in understanding the Sweb tag set, and the
expected processing, I provide the following partial lists of commmands
in some widely used literate programming tools, together with the
equivalent in Sweb.
For
WEB:
- @*
- major module: use <div1>
- @&#SPACE;
- module: use <div2>, <div3>, etc.,
or <p>.
- @p
- Pascal program: use <scrap>
- @<
- module definition: use <scrap>
- @<
- module reference: use <ref> or <ptr>
- @^
- index entry (in normal font):
use <index> with an appropriate level2 value
- @.
- index entry (in typewriter font):
use <index> with an appropriate level2 value
- @>
- end of argument: this is handled differently in SGML;
mark the end of an SGML element with the appropriate end-tag, the end of
an attribute value with a closing quotation mark.
- |...|
- Pascal-within-text: use <code> or
<ident>
- @!
- mark next identifier as defining occurrence: not
implemented
- @;
- format like Pascal semicolon: not implemented
- @d
- define a macro: not implemented
N.B. in this version of
Sweb, a number of WEB
facilities have no SGML analogues; these may eventually be added, but
this has not yet been done. In addition to those already mentioned,
these constructs without equivalents include:
- facilities for overriding weave's automatic
formatting of program
text
- ways to force tangle to concatenate adjacent tokens, to
manufacture identifiers.
- facilities for octal, decimal, and alphabetic constants
The special markup for overriding the default formatting depends
critically upon the specific algorithms used to typeset the program
code, and has a very unSGML-like feel to it, so it may never be added;
however, it may be essential to really satisfactory results, so it may
be necessary to define it in the form of a set of declarative
hinting elements for use by any formatter.
For
nuweb:
- @o filename flags @{ ... @}
- write the scrap out to the named file, allowing no page breaks
within the scrap:
<scrap file="fn" rend="flags"> ...
</scrap>
- @O file flags @{ ... @}
- ditto, but allow page breaks within the scrap:
<scrap file="fn" rend='loose flags'> ...
</scrap>
- @d scrapname @{ ... @}
- define a scrap with the name given; allow no page breaks:
<scrap name='scrapname'> ...
</scrap>
- @D scrapname @{ ... @}
- define a scrap with the name given; allow page breaks:
<scrap name='scrapname' rend='loose'> ...
</scrap>
- @{ ... @}
- delimit a scrap (this should always be preceded by @d or @o):
<scrap> ... </scrap>
- @< name@>
- invokes the scrap named. <ptr
target="scrapid"> or
<ref>name</ref> or
<ref target="scrapid">name</ref>
- @f
- generate index of filenames created:
<divGen type='filenames-index'>
- @m
- generate index of macro names (scrap names):
<divGen type='scraps-index'>
- @u
- generate index of identifiers or other terms:
<divGen type='terms-index'> or
<divGen type='index'>
10. Alphabetical List of Tags
[To be supplied ... ]
Notes
[
1] The ideas
in this paper derive in part from several years' experience with the ODD
system of tag-set documentation, and from discussions with friends and
colleagues. In particular, I thank Lou Burnard of Oxford University
Computing Services, with whom I designed the ODD system, and Bob Hyman
of the University of Illinois at Chicago Academic Computer Center, who
has taken a collegial interest in these ideas and their potential
implementation.
[
2]
Donald E. Knuth,
“Literate Programming,”
The Computer Journal
27 (1984): 97-111,
rpt. [rev.] in his
Literate Programming,
CSLI Lecture Notes Number 27
([Stanford, California]: Center for the Study of Language and
Information, 1992), pp. 99-136, here p. 99.
[
3]
The tags here are all that are necessary for
Sweb input, but output
may include further tags, described below in section
Scrap Wrappers.
[
4]
Silvio Levy and
Donald E. Knuth,
“An Example of CWEB,”
in Knuth's collection
Literate Programming,
CSLI Lecture Notes Number 27
([Stanford, California]: Center for the Study of Language and
Information, 1992), pp. 341-348.
This is scrap 17, on p. 345.
[
5]
The example is from Knuth's revision of a famous program by Edsger
Dijkstra, in Knuth's “Literate Programming,” pp.
113, 115; see also pp. 103-104 for a formatted copy. Dijkstra's
original prime-number program is discussed further below.
[
6] The one-page summary
was prepared by Adrian F. Clark for
nuweb version 0.6. The subroutine
has an at-sign in its name, presumably to illustrate how literal
at-signs are coded in
nuweb documents.
[
7] Language-specific processors might be used to parse
<
code>, <
eg>, or <
scrap> elements and tag the
component parts of the code appropriately.
[
8] Edsger W. Dijkstra,
“EWD227: Stepwise Program Construction”,
in his
Selected Writings on Computing: A Personal
Perspective (New York, Heidelberg, Berlin: Springer-Verlag,
1982), pp. 1-14, here pp. 3 and 7. This is the program rewritten
by Knuth as an example in his original article on literate programming.
[
9] Where it is desired to produce local documentation
which completely replaces selected sections of the base
documentation, the techniques described here won't quite do the
job; they can, however, be extended in an obvious way,
particularly by use of the
exclude attribute to
indicate which <
div> elements in the local
documentation should replace which <
div> elements in the
base documentation.
[
10] Examples include the Text Encoding Initiative and the
Davenport Group.
[
11] The
corresp attribute, which is also declared globally when the
TEI additional tag set for linking is in use, may be preferable for this
purpose; its name at least seems less confusing. The
exclude
attribute, however, seems to capture the semantics of the situation more
precisely; scraps in an equivalence class are not only equivalent, but
mutually exclusive alternants. If left completely on my own, I'd use an
equiv attribute, but there seems no point in adding a new
attribute which differs from
exclude only in name.
The real problem is that only linguists have really internalized the
notion that when two items are in complmentary distribution, and thus
effectively exclude each other, they are variant forms of
the same thing. For the rest of us, being variant forms and exclusion
seem somehow contradictory. (The linguist in me, of course, observes
triumphantly that this means they are variant forms of the same
phenomenon ...)
[
12]
I evade, for the moment, some questions raised by Dijkstra's
sample program, notably his use of identifiers and keywords
within scrap names, which cannot be represented with the tags
introduced so far; for a solution, see below, section
Scrap Wrappers. I also ignore, for now,
the variation in scrap names in Dijkstra's
text.
[
13]
It does not show, of course, how identifiers can be tagged in scrap
names, which are SGML attributes, not content. For a
discussion of this and related issues, see the section on
Scrap Wrappers.
[
14] I am unaware, in fact, of
any which support
recapitulations.
[
15]
It seems unnecessary to allow the user to specify, through
special
evaluate and
quote attributes, or
rule (with values of
evaluate or
quote) and
exceptions, which scrap
references should be expanded (evaluated) and which should be
left alone (quoted, taking
quoted as an
antonym of
evaluated). If any users of
the tag set actually need this capability, I hope they'll let me
know. For now, it seems likely that <
recap> should almost
always expand only the references to scraps already seen, leaving
forward references unexpanded.
[
16] Other names
once or still under active consideration are <
wrappedScrap>,
<
wrap>, <
scrapWrap>, <
sWrap>, and <
sw>
(which can mean either ‘
scrap wrap’ or
‘
Sweb’). I like <
scrapwrap> or <
sWrap>
myself, but the moans of my friends were too audible. Thanks to Bob
Hyman for suggesting the name <
scrapInfo>, which has the
merit of eliciting no moans at all.
[
17]
For now, I am resisting the temptation to add the rule:
- References to the ID of the parent <scrapInfo>
element are interpreted as references to the
embedded <scrap>.
It does seem natural to want to assign an
ID to the entire scrap
wrapper, but it seems simpler to forbid the use of such alternate
identifiers as aliases.
[
18] In some implementations, as in
nuweb, very simple string matching may be used when searching for
other scraps which refer to the identifier; in some, there may be a less
error-prone approach. Consult the implementation-specific
documentation.
[
19] Individual implementations may also choose to append (or to
supply as a
level2 value) a string consisting of the name of
the programming language and the string
identifier, so that the identifier
blort in a fragment written in Pascal would be
replaced by
<index level1='blort (Pascal identifier)'> or
<index level1='blort' level2='Pascal identifier'>. An
implementation might equally well prefer to leave such refinements to
specialized language-specific indexing tools, on which see below,
section
Special Cases: Fully Automatic
Indexing.
[
20] The syntactic details approximate those
given in R. Kent Dybvig,
The Scheme Programming
Language (Englewood Cliffs: Prentice-Hall, 1987). Some details
have been changed for convenience (e.g. the number 100.000 is classed as
a real, not an integer); other details may have been changed
inadvertently. No effort is made to list all the predefined functions
of Scheme.
[
21] This fragment is transcribed
(manually!) from a partially indexed copy of the Scheme interpreter
siod.
[
22] The TSD is based on the
Odd system for DTD generation
designed by Lou Burnard and myself -- the tags described here are a
generalization of the
Odd and TSD tag sets.
[
23]
The tags used are generalizations of tags in the TEI auxiliary tag set
for Tag Set Documentation, with the following equivalences:
Sweb : TEI TSD
- refDoc : tagDoc, classDoc, entDoc, attDef
- ident : gi, class, entname, attname
- subordinates : attList, valList
- module : part
- ptrList : classes, parents, children
- informal : dataDesc
- formal : elemDecl, attlDecl, attdef/dataType
Identity mappings
- rs
- desc
- exemplum
- eg
- remarks
- files
- ptr type="doc"
- equiv
[
24] Or even
triple-processor systems, like the TEI
Odd system, in which
oddp2x performs
weave functions,
odddtd performs
tangle functions, and
oddref generates reference material.
[
25] In the case of
Odd, the
input format remained, in effect, a secret known only to the editors of
the TEI and those few members of work groups willing and able to learn
it successfully from our examples. Everyone else produced drafts which,
at best, attempted to look like P2X files. Since these had to be
reduced manually to legal
Odd format, I had plenty of time to reflect
on how useful a single-DTD design would be.