XSLT Workshop

Exercises

C. M. Sperberg-McQueen

18 June 2002

1. Sunday a.m. 1: introductions
2. Sunday a.m. 2: simple transformations (HTML)
3. Sunday p.m. 1: modes, table of contents, hypertext links
4. Sunday p.m. 2: XPath
5. Monday a.m. 1: functions, numbering
6. Monday a.m. 2: Near-identity and other XML-to-XML transforms
7. Monday p.m. 1: named templates, variables, parameters
8. Monday p.m. 2: sorting and grouping
9. Other possible exercises / samples
- 9.1. Notes from Sinclair
  - 9.1.1. Word frequency profiles
  - 9.1.2. Word index
  - 9.1.3. KWIC format
  - 9.1.4. Concordance to sentences

A. Hints
B. References
- B.1. Sources of some exercises
- B.2. CSS cheat sheets
C. Solutions

This document lists exercises to be done either as a group or individually in the hands-on sessions. Many of these exercises are taken or adapted from the works cited in the References; thanks to the authors.

Note that many more exercises are given than you will have time for during the workshop; it is hoped that some will be interesting enough to pursue on your own afterwards.

1. Sunday a.m. 1: introductions

who are participants?
who are instructors?
what is XSLT and where does it come from?
XML-to-XML transforms as a crucial problem/task
overview of course syllabus (one hour, working through all the topics to be covered in an introductory way)

No independent exercises for this segment.

2. Sunday a.m. 2: simple transformations (HTML)

Topics covered:

basics of XSLT syntax (it's XML!)
generating output: text, literal elements, element and attribute constructors
exercise: Hello, world
flow of control through apply-templates
exercise: simple HTML stylesheet for some part of TEI
default rules
exercise: elements without a template show up in red
if, choose
modify stylesheet to select style based on attribute value (TEI list/@type, hi/@rend)

Group exercises:

Hello, world: make a stylesheet to handle the file hello-world.xml.
Greetings: make a series of stylesheets to translate the file greeting.xml into HTML:
1. wholly canned output
2. tag each hello element as an h1
3. provide introductory text with count of hello elements, and embed the hello elements themselves as li elements in an ordered list (ol)
4. style different languages in different ways, using the style attribute on the output li elements:
  - for English (en): style="font-family: Palatino Linotype; color: #777;"
  - For Norwegian (no): style="font-family: Comic Sans MS; color: brown;"
  - For French (fr): style="font-family: Script; color: blue; font-size: larger;"
  - For German (de, or more precisely, all values where the first two letters of the lang attribute are “de”): style="font-family: Arial Black; color: purple;"
  - For Swabian German (de-schwaben): style="font-family: Arial; color: red;"
5. Select just one greeting to display (e.g. de-schwaben)
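
For step 4, the order of the tests matters, because de-schwaben also begins with “de”. The following is a minimal sketch, not the prepared solution. It assumes (hypothetically) that greeting.xml has a greetings root element whose hello children carry a plain lang attribute; adjust the names if the file uses xml:lang or a different root.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Assumed root element name: greetings -->
  <xsl:template match="/greetings">
    <html>
      <body>
        <p>This document contains <xsl:value-of select="count(hello)"/> greetings.</p>
        <ol><xsl:apply-templates select="hello"/></ol>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="hello">
    <li>
      <xsl:attribute name="style">
        <xsl:choose>
          <!-- most specific test first: de-schwaben also starts with "de" -->
          <xsl:when test="@lang = 'de-schwaben'">font-family: Arial; color: red;</xsl:when>
          <xsl:when test="starts-with(@lang, 'de')">font-family: Arial Black; color: purple;</xsl:when>
          <xsl:when test="@lang = 'en'">font-family: Palatino Linotype; color: #777;</xsl:when>
          <xsl:when test="@lang = 'no'">font-family: Comic Sans MS; color: brown;</xsl:when>
          <xsl:when test="@lang = 'fr'">font-family: Script; color: blue; font-size: larger;</xsl:when>
          <!-- languages not listed here get an empty style attribute -->
        </xsl:choose>
      </xsl:attribute>
      <xsl:apply-templates/>
    </li>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:choose uses the first xsl:when whose test is true, swapping the first two branches would style Swabian greetings as ordinary German.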

Individual exercises.

Exercises everyone should be able to do fairly easily; pick any one.

The file rossetti.xml includes the text of Christina Rossetti's poem “Remember”, encoded in TEI Lite. Write a style sheet to translate this to HTML. Suggestions:
- For the elements TEI.2, text, front, docTitle, and body, you may just want to process their children, using a template like <xsl:template match="text"><xsl:apply-templates/></xsl:template>
- For now, you may wish to skip over the TEI Header; you can do this with an empty template: <xsl:template match="teiHeader"/>.
- Translate the TEI titlePart element (part of docTitle) to an HTML h1, docAuthor and docDate to (say) h3.
- Translate lg to either div or p.
- If you translate l to an HTML p element, you will get the proper line wrapping, etc., but there will (depending on your browser and settings) probably be too much space between lines; you may get better results appending an HTML br (line break) to each line.
- Other styling at your taste.
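
The suggestions above can be combined into a skeleton along the following lines. This is only a sketch under the assumptions stated in the suggestions (TEI Lite element names in lowercase, apart from TEI.2); the choice of p for lg and br for l is one of the options mentioned, not the only reasonable one.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Document element: build the HTML shell -->
  <xsl:template match="TEI.2">
    <html>
      <head><title><xsl:value-of select="text//titlePart"/></title></head>
      <body><xsl:apply-templates/></body>
    </html>
  </xsl:template>

  <!-- Skip the TEI header for now -->
  <xsl:template match="teiHeader"/>

  <!-- Structural wrappers: just process the children -->
  <xsl:template match="text | front | docTitle | body">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="titlePart">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>

  <xsl:template match="docAuthor | docDate">
    <h3><xsl:apply-templates/></h3>
  </xsl:template>

  <!-- One paragraph per line group, one br per verse line -->
  <xsl:template match="lg">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="l">
    <xsl:apply-templates/><br/>
  </xsl:template>
</xsl:stylesheet>
```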
The file prince.xml contains the text of Oscar Wilde's story “The Happy Prince”; write an XSLT stylesheet to transform this to HTML. The basic plan can be similar to that for the sonnets; in addition, you will encounter P and Q elements, which can be rendered as HTML p elements and as untagged data preceded and followed by double quotation marks, or marked by an ldquo character (“) and an rdquo character (”). If you wish to refer to these entities by name instead of character number, you'll need to declare them in the XSLT stylesheet:
```
<!ENTITY ldquo  "&#x201C;" >
<!ENTITY rdquo  "&#x201D;" >
```
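
One place to put these declarations is the internal DTD subset of the stylesheet itself. A minimal sketch, assuming the uppercase P and Q element names mentioned above and leaving the rest of the stylesheet out:

```xml
<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY ldquo  "&#x201C;" >
  <!ENTITY rdquo  "&#x201D;" >
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="P">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Quoted speech: no HTML tag, just curly quotes around the content -->
  <xsl:template match="Q">&ldquo;<xsl:apply-templates/>&rdquo;</xsl:template>
</xsl:stylesheet>
```

The XML parser expands the entity references before the XSLT processor sees the stylesheet, so the processor itself needs no knowledge of the names ldquo and rdquo.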

Optional exercises: do any of these which catch your fancy.

Alex White's How to spec type [White 1987] provides a rich collection of typographic styles, with instructions on how to specify them when marking copy for a typesetting service. Using the sample document greeked.xml, try to reproduce the examples on p. 46 of [White 1987]. White specifies all examples as 9/10 ITC Avant Garde Gothic Medium Condensed. Examine your machine's fonts to select a font; on a standard machine, Arial Black or Impact may be the best matches. To specify the former, the style attribute should include font-family: Arial Black; and to specify 9-point type on 10-point leading, with an 8-pica width (9/10 x 8), use width: 8pc; font-size: 9pt; line-height: 10pt; or the equivalent. (CSS lengths take no space between number and unit, and the unit for picas is pc.)
- Set this paragraph justified, x 8, with one-em indent. width: 8pc; text-indent: 1em;
- Set this paragraph justified, x 12, with a two-em indent. width: 12pc; text-indent: 2em;
- Set this paragraph flush left / ragged right, x 12, with a six-em indent. width: 12pc; text-indent: 6em;
- (Put all three styles into single output stream using CSS stylesheet.)
The file shelley.xml contains Oscar Wilde's poem “The Grave of Shelley”; write an XSLT stylesheet to translate it into HTML. The tags used are substantially the same as in the Rossetti poem, but for technical reasons the tags in this file are uppercase. Additionally, this document contains a TEI trailer element; this can be formatted in HTML as <div style="text-align: right; margin-right: 15%;">

The file sinclair-defs.xml contains a set of definitions, each tagged according to its structure into first and second parts, each part having the content model shown:

<!ELEMENT first (operator?, co-text?, topic, co-text?)>
<!ELEMENT second (operator?, comment)>

The comment may be divided into chunks marked as chunk elements.

Produce the tabular display shown below.

FIRST PART				SECOND PART
Operator	Co-text (1)	Topic	Co-text (2)	Operator	Comment	Chunks
	a	house		is	a building in which people live	1 2
if	you	defeat	someone		you win a victory over them in a contest such as ...	1 2
	a	pure	substance	is	not mixed with anything else
if	something happens	often			it happens many times or much of the time	1 2

(Adapted from pp. 124ff of [Sinclair 1991].)

3. Sunday p.m. 1: modes, table of contents, hypertext links

Topics covered:

idea of modes
exercise: toc (using toc mode)
generate-id() function
exercise: toc with hyperlinks
exercise: TEI ref, ptr

Group exercises:

Write another stylesheet for greeting.xml: after the list of known greetings, add a horizontal rule (hr element) and a list of language codes used in the document. (Hint below.)
Add a table of contents to a stylesheet for a TEI Lite text structured with div elements.
Add hyperlinks to the table of contents leading to the body of the text.
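
As a starting point for the last two exercises, here is a minimal sketch of a table of contents built in a separate mode, with links made from generate-id(). It assumes a TEI Lite text whose body contains unnumbered div elements, each with a head child; the mode name toc is arbitrary, and the HTML shell is omitted.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="teiHeader"/>

  <!-- Process the divisions twice: once in toc mode, once normally -->
  <xsl:template match="body">
    <ul><xsl:apply-templates select="div" mode="toc"/></ul>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- toc mode: one linked entry per div -->
  <xsl:template match="div" mode="toc">
    <li><a href="#{generate-id()}"><xsl:value-of select="head"/></a></li>
  </xsl:template>

  <!-- default mode: the same generate-id() value becomes the link target -->
  <xsl:template match="div">
    <div id="{generate-id()}"><xsl:apply-templates/></div>
  </xsl:template>

  <xsl:template match="div/head">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>
</xsl:stylesheet>
```

Within a single transformation, generate-id() returns the same string each time it is called for the same node, which is what makes the href and the id agree.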

Individual exercises.

Easy:

Modify your stylesheet for the sonnets to print the lines of the sonnet with a blank line between each line. ([Hockey 1985] ex. 1.1, p. 15.)
Do any exercises from the previous section that you find interesting and have not done.

Medium:

The files gorbals51.xml and gorbals81.xml contain an XML encoding of sample data from the 1851 and 1881 censuses of ‘the Gorbals’, a working-class district of Glasgow, prepared by the education unit of the Strathclyde Regional Archives together with the Glasgow Division of Strathclyde Region's Education Department (the 1851 is taken from Michael Anderson's ESRC-funded 2% sample of the 1851 census; the 1881 sample was constructed to match).

Write an XSLT stylesheet to display the material in an HTML table.
[Hockey 1985] Ex. 2.4, p. 30. Print out the prosopographic dataset in a more readable form.
Several files in the data directory contain fairly substantial literary works:
- peergynt.xml: Henrik Ibsen, Peer Gynt (1875), tr. by William and Charles Archer, transcribed by Project Eris, encoded in TEI Lite by C. M. Sperberg-McQueen.
- frnknstn.xml: Mary Wollstonecraft Shelley, Frankenstein: or, the modern Prometheus (1831), encoded in TEI Lite by Peter-john Byrnes, Jeff Chisa, and Wendell Piez.
- dorian.xml: Oscar Wilde, The Picture of Dorian Gray (1890), encoded in TEI by the CELT project.
- earnest.xml: Oscar Wilde, The Importance of Being Earnest: A trivial comedy for serious people (1895), encoded in TEI by the CELT project
- lying.xml: Oscar Wilde, The decay of lying (1891), encoded in TEI by the CELT project
- salome.xml: Oscar Wilde, Salome: a tragedy in one act (1893), encoded in TEI by the CELT project.
- savile.xml: Oscar Wilde, Lord Arthur Savile's Crime: a study in duty (1891), encoded in TEI by the CELT project.
Develop a stylesheet which transforms one or more of these into an acceptably attractive HTML. Provide a table of contents if, but only if, the document actually has div or divN elements.

Challenging:

Using the auxiliary code tables in gorb51birthplace.txt, gorb51status.txt, and gorb51occod.txt, check to make sure that the cbpcod, status, and occod attributes in the census data contain only valid codes; display any errors in red (color: red;). (Adapted from [Hockey 1985] Ex. 6.2, p. 81.)
In [Halpern-Hamu 1999], Charlie Halpern-Hamu speculates on several unusual kinds of transformations one might wish to perform (and gives examples). Some of those he discusses are worth listing, though probably not feasible as course exercises for students of this course [If you know how to do these already, why are you taking this course?]:
- Given an XML Schema document, generate an instance conforming to it.
- Given an XML document instance, generate a schema which covers it. [Requires active knowledge of XML Schema syntax.]
Others are feasible, though challenging:
- Given an XML Schema document, generate a skeleton stylesheet for it. (Hint below.)
(Adapted from [Sinclair 1991].) Improve on the formats:
- [Sinclair 1991] word index
- [Sinclair 1991] KWIC format (so-called by Sinclair)
- [Sinclair 1991] concordance to sentences

4. Sunday p.m. 2: XPath

data model
axes of selection
short and long syntax
predicates

Individual exercises.

Fairly simple: you should be able to do these without too much work; do any two.

(Adapted from p. 47 of [White 1987].) All examples 9/10 ITC Avant Garde Gothic Medium Condensed (as before, you will need to substitute an appropriate font; Arial Black or Impact may be the closest you will get to Avant Garde Gothic Medium Condensed).
- Set this paragraph justified, x 8 1/2, with 2-pica hanging indent (text-align: justify; width: 10.5pc; margin-left: 2pc; text-indent: -2pc;).
- Set first, third, fifth ... paragraphs justified, x 8, with no indent (text-align: justify; width: 8pc; margin-left: 2.5pc; text-indent: 0pc;).
  
  Set every other paragraph justified x 10 1/2, and align on right edge (text-align: justify; width: 10.5pc; margin-left: 0pc; text-indent: 0pc;).
- Set alternate paragraphs flush left / ragged right x 9 (text-align: left; width: 11pc; margin-left: 0pc; margin-right: 2pc;), flush right / ragged left x 9 (text-align: right; width: 11pc; margin-right: 0pc; margin-left: 2pc;), align flush edges on 11-pica column. (The alternation of margin-left: 2pc; with margin-right: 2pc; forces the flush side to the outer margin of the 11-pica column.)
(From [White 1987] p. 49.) All examples 9/10 x 14 justified ITC Avant Garde Gothic Medium Condensed.
- “No paragraph indents. Set running text and insert 10pt dingbat Q1.” (N.B. Q1 is a star, in this case; ☆, conventionally &star;, or ★, conventionally &starf;, may be used.)
- “No paragraph indents. Set running text and alternate 9/10 Medium Condensed with 9/10 Bold Condensed for each paragraph.” If your machine does not have related font families in normal and bold weights, alternating font-weight: normal; with font-weight: bold; will give a similar effect.
(From [White 1987] p. 50, figure 1). “10/10 x 13 Optima FL alternating with bold FR on central axis per thumbnail.” Odd-numbered paragraphs get margin-left: 0pc; margin-right: 6.5pc; text-align: right; even-numbered paragraphs get margin-left: 6.5pc; margin-right: 0pc; font-weight: bold; text-align: right; all get same font (Lucida may be the closest approximation to Optima on your machine). Note that if instead of absolute widths you specify widths and margins in percentages, the layout adjusts as the user rescales the window.
Modify your stylesheet for the Rossetti poem to concatenate pairs of lines, so that lines 1 and 2 appear on one line of output, as do 3 and 4, 5 and 6, etc. This is sample program 1.2 on pp. 13-14 of [Hockey 1985].

Medium challenging

(Adapted from [Sinclair 1991] pp. 30 and 140.)

A first order word list is “a list of word forms in the order of their first occurrence, noting the frequency of each. Each successive word-form is compared with each previous one. If it is a new word-form it is provided with a counter set at 1; if the word-form has occurred before, it is deleted from the text and 1 is added to the counter at the place of first occurrence of the word-form.” The example in Appendix I Figure 1 (p. 140) begins:

there	2
are	2
many	2
kinds	2
of	10
activity	6
and	8
communication	5
is	11
only	3
one	1
them	1
although	2
often	1

Exercise: generate such a first order word list for the sample text(s).
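
Sinclair's description is procedural (compare, delete, increment a counter), but in XSLT the same list can be stated declaratively: keep each word that has no identical predecessor, and count its occurrences. A minimal sketch, assuming every word of the text is tagged as a w element and that no w is nested inside another:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- first occurrences only: no preceding w has the same string value -->
    <xsl:for-each select="//w[not(. = preceding::w)]">
      <xsl:value-of select="."/>
      <xsl:text>&#9;</xsl:text>
      <!-- frequency: all w elements with the same string value -->
      <xsl:value-of select="count(//w[. = current()])"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The comparison is case-sensitive, so “There” and “there” are counted separately unless the values are folded first (for example with translate()). The sketch also rescans the document for every word, which is acceptable for a sonnet but slow for a long text.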

More challenging; XPath as a query language:

(Adapted from [Hockey 1985] Ex. 2.1, p. 30.) Using the Gorbals census files, print out all the information for people who were born in Ayr or Edinburgh. Then do the same for people known to have been born outside Glasgow, and for people not known to have been born in Glasgow.
(From [Sinclair 1991].) Print out lemmas which have an initial cap and occur at the beginning of a sentence, or immediately following a colon.

Daniel I. Greenstein, A Historian's Guide to Computing (Oxford: Oxford University Press, 1994) suggests some possible data and things computers would be able to help a historian do with it.

On pp. 64ff, [Greenstein 1994] discusses a flat-file database with census data from 1881. “Example data are based on the ‘Gorbals Census Datafile’, prepared by the Strathclyde Regional Archives together with the Glasgow Division of Strathclyde Region's Education Department; see Glasgow's Department of Education, ‘Using Quest: Gorbals Census Datafile’ (Glasgow, 1984).” A transcription of the part of the data shown (8 records) is in gorbals00.xml; the full samples from 1851 and 1881 are in gorbals51.xml and gorbals81.xml. A restructuring of the data intended to reduce redundancy is given in gorbals51.grouped.xml and in gorbals81.grouped.xml; do the exercises with either the ‘flat’ form or the ‘grouped’ form. The fields in the original data are reproduced here as attributes; Greenstein describes them thus:

Field	Datatype	Description
HNo	number	Household number — sequential number assigned by the database creator to each household [appears to be part of the gid (Gorbals ID) column in the version we have -MSM]
PNo	number	Person number — sequential number assigned by the database creator to the people enumerated in each household [appears to be part of the gid (Gorbals ID) column in the version we have -MSM]
Surname	c20
Forename	c16
Address	c20	Street address.
Relation	c12	Relationship to head of household.
Sex	c1	Male or female.
Age	number
Occupation	c20
TBirth	c20	Town of birth.
Rooms	number	Number of rooms in household.
HSize	number	Number of people in household as calculated by the database creator.

The exercises suggested here are inspired by Greenstein's discussion of things one can do with this database.

Display all the records pertaining to people who gave their occupation as “Seamstress”. ([Greenstein 1994] p. 64)
Count the heads of household born in Glasgow. ([Greenstein 1994] p. 64)
Sort by age (descending). ([Greenstein 1994] p. 64)
Display a table with HNo, PNo, Occupation, TBirth, and CBirth columns (only) for male heads of households. ([Greenstein 1994] p. 69)
Display a table with occupation and birthplace information (or: with all information) for male heads of households born in Glasgow. ([Greenstein 1994] p. 69)
Display a table with occupation and birthplace information for sons of all heads of households. ([Greenstein 1994] p. 69)

5. Monday a.m. 1: functions, numbering

numbering
group exercise: section numbering, Dewey-style section numbering
string functions
exercises using string functions
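
For the section-numbering exercise, xsl:number with level="multiple" does most of the work. A minimal sketch, assuming nested unnumbered div elements with head children (for div1, div2, ... the count pattern would have to list each of them):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="div/head">
    <h2>
      <!-- one number per enclosing div, outermost first: 2. / 2.3. / 2.3.1. -->
      <xsl:number level="multiple" count="div" format="1.1. "/>
      <xsl:apply-templates/>
    </h2>
  </xsl:template>
</xsl:stylesheet>
```

With level="single" (the default) the same instruction would number each heading only among its sibling divisions, which gives simple rather than Dewey-style numbering.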

Individual exercises.

Easy selection based on string contents: you should be able to do any of these without much effort by modifying match or select attributes in a stylesheet. (Note that some of these require each word of the text to be tagged with a w element — or at least would be much easier with such data.)

[Hockey 1985] Sample program in section 1.17, p. 12. (Write a program to) find how many times the string 'William' occurs in the given text. (sh01.17)
[Hockey 1985] Sample program 1.1, p. 13. Print all the lines beginning with 'T'.
[Hockey 1985] Ex. 1.2, p. 15. Print out all the lines which begin "Remember".
[Hockey 1985] Ex. 2.3, p. 30. Print out all of the people in the Gorbals census data who were born in Edinburgh or who were cabinet makers.
[Hockey 1985] Ex. 2.2, p. 30. Print all the words of the sonnet which contain an L.
[Hockey 1985] Ex. 3.1, p. 39. Count the number of words in the sonnet.
[Hockey 1985] Ex. 3.2, p. 39. What percentage of people in the dataset are identified as weavers? How does the answer vary if you change the formulation of the test, using contains() or using equality?
[Hockey 1985] Ex. 3.3, p. 39. Count the number of people in the Gorbals datasets who are over 60.
[Hockey 1985] Ex. 3.4, p. 39. Print out all the words in the first ten lines [or: in the octet, in the sestet] of the sonnet which contain an R.
[Hockey 1985] Ex. 4.1, p. 50. Count the number of occurrences of the word YOU in the sonnet; print the result indented 10 spaces.
[Hockey 1985] Ex. 4.2, p. 50. Print all the words in the sonnet which end in S or Y.
[Hockey 1985] Ex. 4.3, p. 50. Print out the names and identification numbers of all the people in the dataset, with the names converted to lowercase except for the first letters. Use the form Christian name, surname.

More challenging (require less obvious uses of built-in functions):

[Hockey 1985] Ex. 1.3, p. 15. Print out the sonnet, changing each occurrence of "remember" to "forget".
[Hockey 1985] Ex. 2.5, p. 30. Write each line of the sonnet backwards (letter by letter).
[Hockey 1985] Ex. 7.1, p. 92. Print out the sonnet, underlining every occurrence of letter T.

Challenging, either because non-obvious use of functions is required, or because the overall control structure of the stylesheet is complicated:

[Hockey 1985] Sample program, p. 53. Count the number of words [tokens] in the sonnet which are one letter long, two letters long, three, etc. up to ten letters long. (Mendenhall statistics.)
[Hockey 1985] Ex. 5.1, p. 67. Make a word count of all the words in the sonnet which begin with R, S, or T. Print them out with their frequencies.
[Hockey 1985] Ex. 5.2, p. 67. Count the number of people from each occupation.
[Hockey 1985] Ex. 5.3, p. 67. Modify the previous example so that instead of printing out the occupation code, the output includes the full name of the occupation (expansion of the code).
[Sinclair 1991] Exercise: Given a list of initial cap forms that should be lowercased, lowercase all occurrences of those lemmata.
[Sinclair 1991] Exercise: Given a list of inflected forms with their preferred lemma, change all occurrences of the inflected forms to give the correct lemma.
[Sinclair 1991] Exercise: Print out a list of lemmas with initial caps which are not tagged as part of a proper noun.
[Sinclair 1991] pp. 32, 141. A word frequency profile is a characterization of a text's word frequencies using a table like those shown in section 9.1.1. Exercise: given suitable intermediate data, produce those tables. [Task for instructors: devise a format for such intermediate data.]
[Sinclair 1991] p. 35. Exercise: select words from a word list or text by length and characters (e.g. three-letter words beginning with 't').

Quite challenging:

Recognize and tag numbers (integers, decimals, or numbers in scientific notation) in running prose (or: within special fields where they are expected).
In the Gorbals census data, split the house attribute so as to put the street name and the house number in different fields; this will make it easier to sort by street name.
Recognize and tag dates occurring in running prose or special elements like date-place lines in letters. Assume that they have a standard format like “18.06.2002” or “18 Jun 2002” or “18th inst.”
In a collection of correspondence where most items have headings of the form “From Thomas Jefferson” or “To Benjamin Franklin” or “From the President to the Secretary of State”, recognize and tag the name of the sender or of the addressee.

6. Monday a.m. 2: Near-identity and other XML-to-XML transforms

copy
exercise: null transform
exercise: supply (random) IDs
exercise: supply IDs constructed from element type and position
exercise: supply numeric tumblers
exercise: supply typed tumblers
exercise: supply canonical references
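
Most of these exercises start from the identity (null) transform and override it for a few nodes. A minimal sketch that copies everything and supplies an ID built from element type and position; the attribute name id and the name.number form of the value are assumptions, not requirements of the exercise:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Elements lacking an id get one, e.g. l.7 for the seventh l element -->
  <xsl:template match="*[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="local-name()"/>
        <xsl:text>.</xsl:text>
        <!-- counts elements of the same name in document order -->
        <xsl:number level="any"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The second template wins over the identity rule because a pattern with a predicate has a higher default priority. The sketch does not check whether a generated value collides with an id already present in the document.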

Exercises:

[Sinclair 1991] Exercise: generate a first cut at lemma by copying the word into the lemma attribute.

7. Monday p.m. 1: named templates, variables, parameters

named templates
variables and templates
using named templates to define recursive functions
exercise: given a Unix-style file ID with full path, extract the directory (without the filename part); extract the filename part
exercise: extract blank- and punctuation-delimited words
exercise: make a list of the characters which actually occur in a document
exercise: tag words of a document
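
The file-ID exercise shows the usual pattern for recursion in XSLT 1.0: a named template that calls itself on a shorter string until nothing is left to strip. A minimal sketch; the template name filename, the parameter name path, and the sample path are all invented for illustration:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample value; override it when invoking the processor -->
  <xsl:param name="path" select="'/usr/local/lib/example.xsl'"/>

  <xsl:template match="/">
    <!-- capture the result of the recursive template in a variable -->
    <xsl:variable name="file">
      <xsl:call-template name="filename">
        <xsl:with-param name="path" select="$path"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:text>directory: </xsl:text>
    <!-- the directory is whatever precedes the filename -->
    <xsl:value-of select="substring($path, 1, string-length($path) - string-length($file))"/>
    <xsl:text>&#10;filename: </xsl:text>
    <xsl:value-of select="$file"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Strip everything up to the first slash, and repeat while a slash remains -->
  <xsl:template name="filename">
    <xsl:param name="path"/>
    <xsl:choose>
      <xsl:when test="contains($path, '/')">
        <xsl:call-template name="filename">
          <xsl:with-param name="path" select="substring-after($path, '/')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$path"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Run against any input document with the default parameter, this reports the directory as /usr/local/lib/ (with the trailing slash) and the filename as example.xsl.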

Exercises:

* tag w elements (in a specific context) (tagwords.xsl)
* generate data for word-frequency calculator shell script (mots15/bin/worddata.xsl)
[Hockey 1985] Ex. 1.4, p. 15. Print out alternate lines; within the lines you print out, delete all the spaces.
[Hockey 1985] Ex. 4.4, p. 50. Find the word which comes first in alphabetical order in each line of the sonnet.
[Hockey 1985] [Find the longest word in the sonnet.]
[Hockey 1985] Ex. 6.1, p. 81. Make a frequency count of all words in the sonnet which begin with a vowel and all those which end with a vowel, using a function [named template] to print the tables.
[Hockey 1985] Ex. 7.2, p. 92. Find all the words in the sonnet which contain a doubled letter.

8. Monday p.m. 2: sorting and grouping

make index from TEI index elements
make word frequency list
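
For the word frequency list, one common XSLT 1.0 idiom is grouping with a key (often called Muenchian grouping). A minimal sketch, assuming the words are tagged as w elements; the key name by-form is arbitrary:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every w element by its string value -->
  <xsl:key name="by-form" match="w" use="."/>

  <xsl:template match="/">
    <!-- keep only the first w of each group -->
    <xsl:for-each select="//w[generate-id() = generate-id(key('by-form', .)[1])]">
      <!-- most frequent first, ties in alphabetical order -->
      <xsl:sort select="count(key('by-form', .))" data-type="number" order="descending"/>
      <xsl:sort select="."/>
      <xsl:value-of select="."/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count(key('by-form', .))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Dropping the first xsl:sort gives an alphabetical list instead.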

Exercises:

(given w tagging)(?), produce a KWIC concordance for viewing in HTML browser (table display ...)
* generate word-frequency list (wordfreqlist.xsl)
* generate full concordance (assuming presence of w and s or l elements) (conc01.xsl, conc02.xsl; N.B. these don't have reference information or toc or anything like that)
[Hockey 1985] [Assuming that the sonnet is tagged in some appropriate way with rhyme information, make a rhyme dictionary. The form of the dictionary may be either:
- one entry per rhyme pattern: the heading is the (standard) spelling of the rhyme part, given as a hyphen followed by the vowel of the final stressed syllable (e.g. -anc, -ar, ... -ære, etc.), or
- an entry for each word form, showing the items it is rhymed with.]
[Hockey 1985] Ex. 7.3, p. 92. Count the number of occurrences of all the words which begin A, E, I, O, U, using a separate table for each vowel.
[Hockey 1985] Ex. 8.1, p. 101. Count the number of people employed in each occupation and print out their names and id numbers under a heading for that occupation.
[Hockey 1985] Ex. 9.1, p. 116. Make a word list of all the words in the sonnet and print out the words in alphabetical order together with their frequencies. Line them up in four columns.
[Hockey 1985] Ex. 9.2, p. 116. Make a catalogue of all the people ordered by place of birth, then by date of birth.
[Sinclair 1991] p. 28, discussion of words and word forms. Introduce examples with lemma="..." and sentence tagging. Exercise: make a lemmatized word list from text tagged with lemmas.
[Sinclair 1991] Exercise: Given a text which distinguishes the two uses of to mentioned by Sinclair (the preposition has lemma="to.1" and the infinitive marker has lemma="to.2"), make a word list which distinguishes the two.
[Sinclair 1991] p. 31. Given a frequency list (in some order), present it alphabetized; present it sorted by frequency.
[Sinclair 1991] Exercise: generate suitable intermediate data from a text.
[Sinclair 1991] Exercise: sort by text order; by right context; by left context (N.B. text reverse may be useful here.)
- [Sinclair 1991] word index
- [Sinclair 1991] KWIC format (so-called by Sinclair)
- [Sinclair 1991] concordance to sentences
[Sinclair 1991] p. 34. Exercise: make a frequency list of words occurring within distance n (measured in words [or otherwise]) of a given word.
Given an XML document instance, generate a skeleton stylesheet for it. This exercise is borrowed from [Halpern-Hamu 1999]. (Hint below.)

9. Other possible exercises / samples

In the lists below, exercises I've done (oops, forgot to time myself) are marked * and the filename is given.

replicate OCP
quality assurance tasks
Enhancement stages for word frequency list, word index, concordance
- * make first-order word list (wordlist.xsl)
- * add frequency to list of distinct words (wordfreqlist.xsl)
- * make concordance (add context information) (conc01)
- * highlight hit in concordance (distinguish left context, hit, right context) (conc02)
- add reference information
- add hyperlinks to full text
- add toc of word forms
tag (part of) a text

9.1. Notes from Sinclair

Notes from John Sinclair, Corpus Concordance Collocation (Oxford: Oxford University Press, 1991). Sinclair does not set exercises, but his discussion sometimes suggests tasks one could accomplish with XSLT.

? [Sinclair 1991] [Sample problem: given some collection of information about a bunch of texts or samples — for example, the headers of the British National Corpus — select the texts meeting one or more criteria and then sort by yet other criteria. E.g. select all belles-lettres, sort by gender of author.]
- [Sinclair 1991] p. 21, apropos Sinclair's endorsement of a ‘clean text policy’ (on the grounds that different analyses will get in the way of each other and that linguists disagree with each other on what various abstractions, e.g. word, mean): give examples showing how multiple analyses can co-exist without interference. Also show stand-off markup. Set exercises using both inline and stand-off markup, to show the difference in ease of processing.
[Sinclair 1991] pp. 32-33. Various forms of concordance, reproduced below:
- [Sinclair 1991] word index
- [Sinclair 1991] KWIC format (so-called by Sinclair)
- [Sinclair 1991] concordance to sentences

9.1.1. Word frequency profiles

A word frequency profile is a characterization of a text's word frequencies using a table like the following:

Word Frequency Profile (1)
Word-form count	Number such	Vocabulary total	% of vocabulary	Word-form total	% of text
1	85	85	75.22	85	44.97
2	15	100	88.50	115	60.85
3	4	104	92.04	127	67.20
4	1	105	92.92	131	69.31
5	3	108	95.58	146	77.25
6	1	109	96.46	152	80.42
8	2	111	98.23	168	88.89
10	1	112	99.12	178	94.18
11	1	113	100.00	189	100.00

Sinclair explains this table thus:

To explain these, let us examine line 3 in [the preceding table]. This line deals with word-forms which occur three times. Column 2 tells us that there are four of them, and column 3 keeps a running total of the number of different word-forms that have been reported. Column 4 relates the number in column 3 with the total of 113, expressed as a percentage. So the word-forms that occur no more than three times constitute 92.04 per cent of the 'vocabulary'—the number of different word-forms.
Column 5 considers the running text, where the total number of word-forms is 189. If there are four word-forms which occur three times each, then 12 must be added to the previous total in column 5. Column 6 relates the number in column 5 to the total of 189 and reports that word-forms that occur no more than three times occupy 67.2 per cent of the text. This figure will drop as the text size increases. In longer texts, the most frequent word-form, the, itself occupies about 2.5 per cent of the text.

The other version of this table works in the other direction:

Word Frequency Profile (2)
Word-form count	Number such	Vocabulary total	% of vocabulary	Word-form total	% of text
11	1	1	0.88	11	5.82
10	1	2	1.77	21	11.11
8	2	4	3.54	37	19.58
6	1	5	4.42	43	22.75
5	3	8	7.08	58	30.69
4	1	9	7.96	62	32.80
3	4	13	11.50	74	39.15
2	15	28	24.78	104	55.03
1	85	113	100.00	189	100.00
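
The column definitions Sinclair gives for the first profile translate fairly directly into XPath. The sketch below assumes a hypothetical intermediate format (no format is prescribed anywhere in these notes): a forms root element containing one form element per distinct word-form, each with a freq attribute giving its number of occurrences, e.g. `<form freq="3">only</form>`.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/forms">
    <xsl:variable name="vocab" select="count(form)"/>       <!-- 113 in the sample -->
    <xsl:variable name="tokens" select="sum(form/@freq)"/>  <!-- 189 in the sample -->
    <!-- one row per distinct frequency, in ascending order -->
    <xsl:for-each select="form[not(@freq = preceding-sibling::form/@freq)]">
      <xsl:sort select="@freq" data-type="number"/>
      <xsl:variable name="n" select="number(@freq)"/>
      <!-- all word-forms occurring no more than n times -->
      <xsl:variable name="upto" select="../form[@freq &lt;= $n]"/>
      <xsl:value-of select="$n"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count(../form[@freq = $n])"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count($upto)"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="format-number(100 * count($upto) div $vocab, '0.00')"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="sum($upto/@freq)"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="format-number(100 * sum($upto/@freq) div $tokens, '0.00')"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For line 3 of the first table this amounts to 100 + 4 = 104 word-forms (104 / 113 = 92.04 per cent) and 115 + 4 × 3 = 127 running words (127 / 189 = 67.20 per cent). Sorting in descending order and changing the comparison in the upto variable to "greater than or equal" should give the second profile.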

9.1.2. Word index

Sinclair's word index begins:

Blood
- AE12 1050
- AE12 1209
- AE12 1216
- AE12 1254
- AE12 1309
- AE12 1376
- OAL 271
- AU 415
- AU 608
- AU 609
- WB 91
- WB 382
- WB 402
- WB 439
Bloom
- SAA 1120
- T27 10
- NS 1
- G4 211
- AE10 450
- ...
...

(He uses a slightly different, tabular format.)

9.1.3. KWIC format

Sinclair's ‘KWIC’ format differs from what many sources call KWIC in that its context is measured in words (tokens) rather than in characters.

of activity and communication	is	only one of them
communication where the activity	is	halted in time if
whole process the activity	is	obvious enough the nervous
nervous activity of authors	is	legendary and the silent
reader in his armchair	is	making continuous fast and
radio listener his brain	is	highly active if he
highly active if he	is	taking anything in there
human communication through language	is	only a small sub-section
small sub-section although it	is	very important attempts to
mankind's only remaining boast	is	that we thought of
of it first it	is	certainly an intricate and
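
Measuring context in words makes this format easy to express with the preceding and following axes. A minimal sketch, assuming a text in which every word is a w element (not nested) and using two invented parameters, keyword and width; the sample above shows four words on each side:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:param name="keyword" select="'is'"/>
  <xsl:param name="width" select="4"/>

  <xsl:template match="/">
    <xsl:for-each select="//w[. = $keyword]">
      <!-- the nearest $width words before the hit, emitted in document order -->
      <xsl:for-each select="preceding::w[position() &lt;= $width]">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()"><xsl:text> </xsl:text></xsl:if>
      </xsl:for-each>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#9;</xsl:text>
      <!-- the nearest $width words after the hit -->
      <xsl:for-each select="following::w[position() &lt;= $width]">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()"><xsl:text> </xsl:text></xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

On a reverse axis such as preceding, position() counts backwards from the context node, so the predicate picks the nearest words; xsl:for-each then processes them in document order. As in Sinclair's sample, the context runs across sentence boundaries.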

9.1.4. Concordance to sentences

Sinclair has an interesting approach to providing whole-sentence context. Instead of printing it run-on with the keyword bolded or otherwise singled out, he uses a table with a special kind of word wrap which I am not sure I can fully reproduce. The sentence concordance for is looks roughly like this:

There are many kinds of activity, and communication	is	only one of them, although often it does not look much like activity.
But in a library we see a stage of communication where the activity	is	halted in time.
If we consider the whole process, the activity	is	obvious enough.
The nervous activity of authors	is	legendary, and the silent reader in his armchair is making continuous, fast, and precise eye movements.
The nervous activity of authors is legendary, and the silent reader in his armchair	is	making continuous, fast, and precise eye movements.
And, like the radio listener, his brain	is	highly active if he is taking anything in.
And, like the radio listener, his brain is highly active if he	is	taking anything in.
There are many kinds of communication, too, and human communication through language	is	only a small sub-section, although it is very important.
There are many kinds of communication, too, and human communication through language is only a small sub-section, although it	is	very important.
Perhaps mankind's only remaining boast	is	that we thought of it first!
It	is	certainly an intricate and distinctive kind of activity.

A. Hints

This appendix provides some hints for those having trouble attacking some of the exercises above.

Write another stylesheet for greeting.xml: after the list of known greetings, add a horizontal rule (hr element) and a list of language codes used in the document. Try a duplicate template which matches the hello element in a separate (language-code) mode; call it from the greetings template.
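
A minimal sketch of this hint, assuming (hypothetically) a greetings root element with hello children that carry a lang attribute; the mode name language-code is the one suggested above:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/greetings">
    <html>
      <body>
        <!-- first pass, default mode: the greetings themselves -->
        <ol><xsl:apply-templates select="hello"/></ol>
        <hr/>
        <!-- second pass over the same elements, in language-code mode -->
        <ul><xsl:apply-templates select="hello" mode="language-code"/></ul>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="hello">
    <li><xsl:apply-templates/></li>
  </xsl:template>

  <xsl:template match="hello" mode="language-code">
    <li><xsl:value-of select="@lang"/></li>
  </xsl:template>
</xsl:stylesheet>
```

If several greetings share a language, this lists the code once per greeting; removing the duplicates is a grouping problem of the kind treated in section 8.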

Given an XML Schema document, generate a skeleton stylesheet for it. Given an XML document instance, generate a skeleton stylesheet for it. From the root, generate an xsl:stylesheet element; for each element-type declaration (or, working from an instance, for each distinct element type), generate an XSL template which matches elements of that type. Extra credit (after study of XPath, and requires some knowledge of XML Schema): for elements declared local to some type, formulate a match value which matches them only in appropriate contexts.
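
One practical obstacle is that a stylesheet which writes xsl: elements as output would normally have them interpreted as instructions. A minimal sketch of the schema case using xsl:namespace-alias; the out prefix and its namespace URI are invented placeholders, and named element declarations are handled without regard to target namespace or local scope:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:out="urn:example:generated-xslt"
    exclude-result-prefixes="xsd">

  <!-- Elements written with the out: prefix appear in the result as xsl: -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- From the root, generate the stylesheet element -->
  <xsl:template match="/">
    <out:stylesheet version="1.0">
      <xsl:apply-templates select="//xsd:element[@name]"/>
    </out:stylesheet>
  </xsl:template>

  <!-- For each named element declaration, generate a do-nothing template -->
  <xsl:template match="xsd:element[@name]">
    <out:template match="{@name}">
      <out:apply-templates/>
    </out:template>
  </xsl:template>
</xsl:stylesheet>
```

Two local declarations with the same name produce two identical templates in this sketch; distinguishing them by context is the extra-credit part of the hint.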

B. References

B.1. Sources of some exercises

Greenstein, Daniel I. 1994. A Historian's Guide to Computing. Oxford: Oxford University Press.

Halpern-Hamu, Charlie. 1999. “Stupid XSL tricks”. Markup Technologies '99 conference, Philadelphia, December 1999. [Alexandria, Virginia]: Graphic Communications Association.

Hockey, Susan. 1985. Snobol programming for the humanities. Oxford: Clarendon Press.

Sinclair, John. 1991. Corpus Concordance Collocation. Oxford: Oxford University Press.

White, Alex. 1987. How to spec type. New York: Watson-Guptill.

B.2. CSS cheat sheets

Brian Wilson, “CSS Property Index”, on Index DOT css http://www.blooberry.com/indexdot/css/ (with browser notes on IE, Mosaic, Netscape, Opera)

ProjectCool Developer Zone, “CSS Style Properties”. Summary of properties, with notes on implementation in NS 4.0, IE 3, IE 4; last updated 1998 (CSS 1). http://www.devx.com/projectcool/developer/reference/css_style.html

John Pozadzides and Liam Quin, “CSS1 Properties”. http://www.htmlhelp.com/reference/css/all-properties.html

Miloslav Nic and Jiri Jirat, “CSS 2 reference with examples” http://zvon.org/xxl/CSS2Reference/Output/index.html

Matthew Haughey, “The little shop of CSS horrors: A look at the dark side of CSS rendering”. http://haughey.com/csshorrors/ (as addendum, not main cheat sheet!)

C. Solutions

Prepared solutions are provided for some exercises. Don't look at a solution until you have made a serious effort to do the exercise yourself.

Greetings file: greeting-01.xsl, greeting-02.xsl, ... greeting-06.xsl
White, p. 46: white46.xsl. [I haven't got justification to work.]
Hockey 1.1.

there	2
are	2
many	2
kinds	2
of	10
activity	6
and	8
communication	5
is	11
only	3
one	1
them	1
although	2
often	1