<!DOCTYPE TEI.2 PUBLIC '-//C. M. Sperberg-McQueen//DTD
          TEI Lite 1.0 plus SWeb (XML)//EN'
          'http://www.w3.org/People/cmsmcq/lib/swebxml.dtd' [
<!ENTITY mdash  "&#x2014;" ><!--=em dash-->
<!ENTITY lsquo  "&#x2018;" ><!--=single quotation mark, left-->
<!ENTITY rsquo  "&#x2019;" ><!--=single quotation mark, right-->
<!ENTITY ldquo  "&#x201C;" ><!--=double quotation mark, left-->
<!ENTITY rdquo  "&#x201D;" ><!--=double quotation mark, right-->

]>
<?xml-stylesheet type="text/xsl" href="dialog.xsl"?> 
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Are C1 characters legal in XHTML 1.0?</title>
<author>C. M. Sperberg-McQueen</author>
</titleStmt>
<publicationStmt>
<pubPlace>Cambridge, Mass.</pubPlace>
<pubPlace>Sophia-Antipolis</pubPlace>
<pubPlace>Tokyo</pubPlace>
<publisher>World Wide Web Consortium</publisher>
<date>2007</date>
</publicationStmt>
<sourceDesc>
<p>Transcribed from an email to Olivier Thereaux.</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>

<front>
<titlePage>
<docTitle>
<titlePart>De litteris regentibus C1 quaestiones septem</titlePart>
<titlePart>or</titlePart>
<titlePart>Are C1 characters legal in XHTML 1.0?</titlePart>
</docTitle>
<docDate>23 March 2007</docDate>
<docAuthor>C. M. Sperberg-McQueen</docAuthor>
<docAuthor>with a little help from my friends</docAuthor>
</titlePage>
</front>

<body>
<p>A friend writes:<note place="foot"><bibl>Olivier Thereaux,
<title level="a">help with "sgml declaration for xml"?</title>
Email to MSM and w3t-archive, 
23 March 2007.</bibl></note>
<q type="block"><p>
There is a reported bug in the validator, that SGML character number
128-159 are not allowed for xml-based markup languages.
<xref>http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164</xref>
</p>
<p>
We have a test case at:
<xref>http://validator.w3.org/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2&amp;charset=%28detect+automatically%29&amp;doctype=Inline</xref>
which indeed complains about these.
</p>
<p>
Our parser is opensp, and our opensp uses
<xref>http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.soc</xref> as a
catalog in xml mode, and thus
<xref>http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl</xref> as an
sgml declaration for xml.
</p>
<p>
In the bugzilla item I mentioned above, Terje Bless, who generally
knows much more about SGML than I do, thinks it may just be that our
sgml declaration for xml should be updated to include this character
range.  <xref>http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164#c5</xref>
</p>
<p>
As I am rather confused by the issue, I'd appreciate any guidance,
diagnosis, or pointer, you could provide.
</p></q>
</p>
<p>This is the second time this week I've indulged my
inner language-lawyer in response to some query.  I reproduce
my reply to this question here, as an Awful Warning to those
who might otherwise be tempted to ask me questions about XML.
What I wrote was (more or less):</p>

<p>OK, I'll try.</p>

<!--* <p>
(I'm cc'ing Steven Pemberton and Dave Raggett because this problem
turns out to involve not just XML but also XHTML and HTML 4.01.  I'm
cc'ing Jon Bosak because he appears to have written the first SGML
declaration for XML and may remember why it is the way it is.
</p>
<p>
Gentlemen, please consider yourself requested to glance through this
discussion; if, at your leisure, you would condescend to correct my
errors, misstatements, misapprehensions, and damn'd lies, I would be
very grateful to you.  You may well be tempted to let me stew in the
juice of my own mistakes, but have a heart: remember that if the
errors in this discussion aren't corrected, Olivier will remain
uninstructed and the W3C's HTML validator may produce erroneous
results.  So for their sakes, if not for mine, please do check through
this excessively long screed!)
</p> *-->
<p>
I'm going to ask a number of short, pointed questions, provide long,
digressive answers (sorry about that), and then say what I think it
all means for your problem.
</p>
<p>
Note that the character range x80-x9F is known, for historical
reasons, as "the C1 range" or "the C1 characters".  I can explain if
you wish.  But be careful what you wish for.
</p>
<div>
<head>
Are characters x7F through x9F legal in XML?
</head>
<p>
In XML 1.0, the grammar production for Char includes them, so yes.
<eg>
    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
                [#x10000-#x10FFFF] 
                /* any Unicode character, excluding the surrogate
                   blocks, FFFE, and FFFF. */
</eg>
</p>
<p>
The formulation of the Char production has changed from time to time,
but 7F and the C1 range have always been included.
</p>
<p>
In XML 1.1, the grammar production for Char continues to allow them,
but the 'document' production takes care to exclude them (in their
literal form) from the document.
<eg>
    [1] document ::= ( prolog element Misc* ) 
                     - ( Char* RestrictedChar Char* )
</eg>
</p>
<p>
In 1.1, DEL and the C1 characters (apart from x85, NEL, which 1.1
treats as a line end) may be referred to using numeric character
references (&amp;#x80; etc.) but not used as literal characters.
</p>
<p>
So: XML 1.0 does not forbid the use of these characters.  XML 1.1
forbids their appearance as literals but not as numeric character
references.
</p>
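<p>
For those who prefer code to grammar productions, the two rules can be
restated as a pair of predicates.  What follows is only a sketch in
Python: the function and constant names are invented for the
illustration, the XML 1.0 ranges are transcribed from production [2]
above, and the XML 1.1 Char and RestrictedChar ranges are my
transcription of the XML 1.1 Recommendation, not something quoted in
this note.  As transcribed, RestrictedChar skips x85 (NEL).
<eg><![CDATA[
```python
# Illustrative only: names are hypothetical, ranges transcribed from the specs.

# XML 1.0 production [2] Char, as inclusive (first, last) code point ranges.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# XML 1.1 Char and RestrictedChar (assumed transcription of the 1.1 spec).
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """True if code point cp falls in any inclusive (first, last) range."""
    return any(first <= cp <= last for first, last in ranges)


def literal_allowed(cp, version="1.0"):
    """May cp appear as a literal character in a document?"""
    if version == "1.0":
        return in_ranges(cp, XML10_CHAR)
    return in_ranges(cp, XML11_CHAR) and not in_ranges(cp, XML11_RESTRICTED)


def reference_allowed(cp, version="1.0"):
    """May cp be written as a numeric character reference?"""
    # In both versions a character reference must match Char.
    return in_ranges(cp, XML10_CHAR if version == "1.0" else XML11_CHAR)


# Columns: code point, literal in 1.0, literal in 1.1, reference in 1.1.
for cp in (0x7F, 0x80, 0x9F):
    print(hex(cp), literal_allowed(cp, "1.0"), literal_allowed(cp, "1.1"),
          reference_allowed(cp, "1.1"))
```
]]></eg>
</p>
<p>
For x7F, x80, and x9F the three answers are yes, no, yes: legal as
literals in 1.0, not legal as literals in 1.1, legal as numeric
character references in 1.1.
</p>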
</div>
<div>
<head>
Are these characters legal in Unicode?
</head>
<p>
It might be argued (I think Chris Lilley has done so) that since the
C1 characters aren't really Unicode, they aren't legal in a document
whose document character set is supposed to be Unicode.
</p>
<p>
The Unicode 2.0 spec open on my desk, however, says 
<q type="block">
<p>
    Like the C0 control codes, the Unicode Standard makes no specific
    use of these C1 control codes, but provides for the passage of
    their numeric code values intact, neither adding to nor
    subtracting from their semantics.  The semantics of the C1
    controls are generally determined by the application with which
    they are used.  However, in the absence of specific application
    uses, they may be interpreted according to the semantics specified
    in ISO 6429.
</p>
<p>
    (p. 6-5, section "Latin-1 Supplement: U+0080 - U+00FF")
</p></q></p>
<p>
I take that to mean that for all intents and purposes they are legal
Unicode characters.  That Unicode does not assign meanings to them
does not constitute an argument that they are excluded from Unicode:
there are lots of gaps in Unicode.  U+0FB0, for example, is also not
defined as meaning a specific character (at least in Unicode 2.0; I'm
too lazy to check the current version), but it's clearly got to be
accepted in a Unicode data stream.
</p>
<p>
So: Unicode includes these characters.
</p>
</div>
<div>
<head>
Does the W3C validator accept them?
</head>
<p>
No.  The test case you linked to illustrates that very nicely.
</p>
</div>
<div>
<head>
Does the SGML declaration used by opensp really exclude them?  How?
</head>
<p>
Yes, the declaration at 
<xref>http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl?rev=1.3&amp;content-type=text/x-cvsweb-markup</xref>, which you say our installation of
opensp is using, does exclude 7F and the C1 range.
</p>
<p>
The document character set is defined by a CHARSET declaration
<eg>
  CHARSET
         BASESET
             "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with implementation
              level 3//ESC 2/5 2/15 4/6"
         DESCSET
                 0        9  UNUSED
                 9        2       9
                11        2  UNUSED
                13        1      13
                14       18  UNUSED
                32       95      32
               127        1  UNUSED
               128       32  UNUSED
               160    55136     160
             55296     2048  UNUSED -- surrogates --
             57344     8190   57344
             65534        2  UNUSED -- FFFE and FFFF --
             65536  1048576   65536 -- 16 planes outside BMP --
</eg>
</p>
<p>
The "document character set" as defined by SGML is rather unlike the
"document character set" concept of HTML 4, which brilliantly co-opted
the SGML term and gave it a new and better meaning.  (At least, that's
the way I understand the history of events.)
</p>
<p>
As defined by SGML, the document character set is the actual coded
character set (aka character encoding) the parser can expect to
encounter, conceived as a mapping from integers to characters.  The
bit combinations come in, and the parser knows what characters they
represent by reference to the character set declaration.
</p>
<p>
In HTML, by contrast, the "document character set" is the repertoire
of abstract objects called "characters" which may occur in an HTML
document and which are mapped 1:1 with a set of integers.  The integer
mappings are relevant for numeric character references, but for
nothing else.  In particular, the HTML spec explicitly clarifies that
the document character set has nothing in particular to do with the
encoding in which data may arrive, except that the abstract characters
encoded by the encoding had better be present in the document
character set.  (The HTML and later XML view is concisely summarized
by Gavin Nicol at
<xref>http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0287.html</xref>.)
</p>
<p>
XML essentially adopted the ideas of HTML 4 on this point: the ISO
10646/Unicode character set is conceived as a large and abstract
pairing of integers and characters, one step divorced from the messy
business of actual encoding.
</p>
<p>
So from the point of view of an SGML processor, the character set
description above documents which bit patterns will and won't occur in
the input stream.  (The ones that won't occur are important, because
the SGML spec assumes a processor may want to use them for its own
internal purposes.)  From the HTML and XML point of view, the
description documents which characters will and won't occur.
</p>
<p>
How does it work?  The BASESET says that we'll describe the document
character set by reference to the coded character set whose public
identifier is "ISO Registration Number 177//CHARSET ISO/IEC
10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6".
The reader is assumed to be in a position to understand what
references to that character spec mean.  
</p>
<p>
The DESCSET bit contains a sequence of triples which assign meanings
to integers, using a kind of run-length documentation.
<eg>
                 0        9  UNUSED
</eg>
means the numbers from 0 to 8 (starting position 0, length of sequence
9) are not used.  They are NONSGML characters -- they may (in SGML but
not in XML 1.0) be referred to by means of numeric character
references, but they will NOT appear as literals.
<eg>
                 9        2       9
</eg>
means that characters 9 and 10 (sequence of length 2, starting at 9)
have the meanings of characters 9 and 10 in the base character set,
which in this case are HT and LF.
</p>
<p>
And so on.  So the lines
<eg>
               127        1  UNUSED
               128       32  UNUSED
</eg>
mean that the character whose number is 127 (conventionally DEL) is
not used, and neither are the 32 characters from 128 through 159 
(x80 - x9F).
</p>
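<p>
The run-length reading of the triples can be checked mechanically.
The following Python sketch is mine, not anything opensp does: the
function names are invented, and the parser handles only the simple
number / number / number-or-UNUSED layout quoted above, not the full
SGML declaration syntax.
<eg><![CDATA[
```python
import re

# The DESCSET triples quoted above, from the xml.dcl under discussion.
DESCSET = """
     0        9  UNUSED
     9        2       9
    11        2  UNUSED
    13        1      13
    14       18  UNUSED
    32       95      32
   127        1  UNUSED
   128       32  UNUSED
   160    55136     160
 55296     2048  UNUSED -- surrogates --
 57344     8190   57344
 65534        2  UNUSED -- FFFE and FFFF --
 65536  1048576   65536 -- 16 planes outside BMP --
"""


def parse_descset(text):
    """Return (start, length, base) triples; base is None for UNUSED."""
    triples = []
    for line in text.splitlines():
        line = re.sub(r"--.*?--", "", line).strip()  # drop SGML comments
        if not line:
            continue
        start, length, base = line.split()
        triples.append((int(start), int(length),
                        None if base == "UNUSED" else int(base)))
    return triples


def is_unused(cp, triples):
    """True if the declaration marks code point cp as UNUSED."""
    for start, length, base in triples:
        if start <= cp < start + length:
            return base is None
    raise ValueError("code point %d is not described" % cp)


triples = parse_descset(DESCSET)
# DEL (x7F) plus the 32 C1 controls (x80-x9F) make 33 code points.
unused_c1 = [cp for cp in range(0x7F, 0xA0) if is_unused(cp, triples)]
print(len(unused_c1), "of 33 code points in x7F-x9F are UNUSED")
```
]]></eg>
</p>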
<p>
A character set declaration similar to this one, but which allows
DEL and the C1 range, would have a DESCSET section like this:
<eg>
         DESCSET
                 0        9  UNUSED
                 9        2       9 -- HT, LF --
                11        2  UNUSED
                13        1      13 -- CR --
                14       18  UNUSED
                32       95      32 -- space through tilde --
               127        1     127 -- DEL, legal in XML 1.0 --
               128       32     128 -- C1 controls, legal in XML 1.0 --
               160    55136     160
             55296     2048  UNUSED -- surrogates --
             57344     8190   57344
             65534        2  UNUSED -- FFFE and FFFF --
             65536  1048576   65536 -- 16 planes outside BMP --
</eg>
</p>
<p>
You could of course replace the three lines for 127-55295 with
the single line 
<eg>
               127    55169     127
</eg>
but that would only confuse people, I think.
</p>
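<p>
Since arithmetic on run lengths is easy to get wrong, here is a small
Python sketch that checks the corrected DESCSET: that its runs cover
0 through x10FFFF without gaps or overlaps, that the used runs add up
to exactly the ranges of production [2] Char, and that the three-line
merge comes out to 55169.  The helper names are invented for the
purpose; the triples are copied from the corrected declaration above.
<eg><![CDATA[
```python
# Corrected DESCSET as (start, length, base); None stands for UNUSED.
CORRECTED = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13), (14, 18, None),
    (32, 95, 32), (127, 1, 127), (128, 32, 128), (160, 55136, 160),
    (55296, 2048, None), (57344, 8190, 57344), (65534, 2, None),
    (65536, 1048576, 65536),
]

# XML 1.0 production [2] Char as inclusive (first, last) ranges.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def is_contiguous(triples):
    """True if each run starts exactly where the previous one ended."""
    expected = 0
    for start, length, _ in triples:
        if start != expected:
            return False
        expected = start + length
    return expected == 0x110000  # one past the last Unicode code point


def used_ranges(triples):
    """Merge adjacent used runs into inclusive (first, last) ranges."""
    merged = []
    for start, length, base in triples:
        if base is None:
            continue
        last = start + length - 1
        if merged and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], last)
        else:
            merged.append((start, last))
    return merged


covers_everything = is_contiguous(CORRECTED)
matches_char = used_ranges(CORRECTED) == XML10_CHAR
# The runs for 127, 128-159 and 160-55295 collapse into one run of 55169.
merge_ok = (1 + 32 + 55136 == 55169) and (127 + 55169 == 55296)
print(covers_everything, matches_char, merge_ok)
```
]]></eg>
</p>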
<p>
So: the SGML declaration used by many people as representing the rules
of XML 1.0 disagrees with the XML 1.0 spec on the characters x7F
through x9F.
</p>
<p>
It may be worth noting that the rule in the XML 1.1 spec which some
people find odd, that says that the characters in the range x7F-x9F
may be referred to using numeric character references but must not
appear as literals, is precisely the rule implied by the SGML
declaration: by marking the characters UNUSED, the SGML declaration
says they don't appear as literals, but not that they can't be
referred to numerically.
</p>
</div>
<div>
<head>
When did this discrepancy enter into the SGML declaration?
</head>
<p>
As far as I can tell, it's always been there.  
</p>
<p>
I have looked at all the published drafts of XML 1.0 to see if some
early draft excluded the C1 characters; no, as mentioned above they
all include 7F and the C1 controls.
</p>
<p>
I have consulted Dave Peterson, who worked intensively with me in the
winter of 1996-97 to categorize all of the divergences between SGML
and the first draft of XML, and who on the basis of that work prepared
the first draft of what became the Web SGML Annex, to ask him if he
remembered the responsible ISO WG deciding that they needed to exclude
the C1 controls.  He has no memory of such a decision, and neither he
nor I can think of a reason the SGML WG would have felt it necessary.
The SGML spec goes to extreme lengths to try to make it possible to
describe arbitrarily weird encodings and use them to encode SGML
documents.  (In fact even the huge complexity of the character set
mechanisms in SGML falls short of the ingenuity of some designers of
character encodings, so SGML can't describe some existing encodings
well -- but even so, those encodings can be used to encode SGML
documents.)
</p>
<p>
It appears likely that the SGML declaration in the Web SGML Annex was
copied from the SGML declaration formulated by James Clark during the
development of XML and published as part of the SGML/XML note
(<xref>http://www.w3.org/TR/NOTE-sgml-xml-971215.html</xref>).  That SGML
declaration excludes these characters; I do not understand why.  Nor
do I understand why the discrepancy between the definition of Char in
the XML spec and the SGML declaration was not noticed by the XML
Working Group and eliminated.  Possibly I have simply forgotten some
discussion of the topic.
</p>
<p>
Further excavation reveals that an SGML declaration was included in
the first published working draft of XML -- in the printed form only,
however, not the version at 
<xref>http://www.w3.org/TR/WD-xml-961114</xref>.
</p>
<p>
Like every other SGML declaration for XML I have found today, that
one excludes 7F and the C1 controls.
</p>
<p>
Those whose pain threshold for character set discussions has not
already been exceeded will find more discussion in the long thread at
<xref>http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0162.html</xref>
which seems to indicate that the first SGML declaration for XML was
actually drafted not by James Clark but by Jon Bosak.  If it's the one
in <xref>http://www.w3.org/TR/WD-xml-lang-970331</xref> then the error really HAS
always been there.
</p>
</div>
<div>
<head>
Are these characters legal in HTML 4?
</head>
<p>
Initially, one might be unsure.
</p>
<p>
The prose suggests that they are legal.  HTML 4.01 says
that its document character set is Unicode, and nowhere
in the section on HTML document representation in HTML 4.01 
(<xref>http://www.w3.org/TR/1999/REC-html401-19991224/charset.html</xref>)
have I
found anything that implies that the C1 characters are excluded.
</p>

<p>
On the other hand, the HTML 4 spec has an SGML declaration that
indicates quite clearly that the characters are not legal.  
(<xref>http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html</xref>)</p>
<p>
The relevant part of the SGML declaration reads:
<eg>
    CHARSET
          ...
         DESCSET 0       9       UNUSED
                 ...
                 127     1       UNUSED
                 128     32      UNUSED
                 ...
</eg>
</p>
<p>
Is the SGML declaration normative?  It would seem to be: it's in a
numbered section, not an appendix, and it's not labeled non-normative
or informative.  And the section on conformance describes HTML as a
conforming SGML application.
</p>
<p>
So: I conclude that the SGML declaration is normative and that 7F and
the C1 controls are not legal in HTML 4.
</p>
</div>
<div>
<head>
Are these characters legal in XHTML 1.0?
</head>
<p>It might appear not.</p>
<p>
XHTML describes itself as a reformulation in XML of HTML 4.01, so I
believe that the character-set restriction of HTML 4 is inherited by
XHTML 1.0.  It's no longer enforced by the lower-level markup system,
so in XHTML it would appear to be an "application convention", i.e.  a
rule that goes beyond those imposed by XML.  The comparison of XHTML
1.0 with HTML 4.01
(<xref>http://www.w3.org/TR/2002/REC-xhtml1-20020801/#diffs</xref>) seems to
suggest that the differences are all subtractions from the set of
legal documents: XHTML forbids some things allowed by HTML 4. If there
are any points where it says XHTML allows things disallowed by HTML 4,
I didn't see them.
</p>
<p>
May we conclude that XHTML 1.0, like HTML 4, excludes x7F and the C1
controls?</p>
<p>In the first draft of this treatise I did so conclude.  But
<!--* Steven Pemberton, the chair of the HTML WG, has
    * put forward a different analysis. *-->
a different analysis is possible.</p>
<p>
XHTML 1.0 was intended as an XML 1.0 application, and all XML 
applications have the same rule for character sets. The WG 
regarded itself as just adopting whatever it was that the XML 
spec said; they didn't believe they had an option. </p>
<p>On that analysis, the rule for XHTML 1.0 is whatever the 
rule for XML 1.0 is, which does not forbid these characters. 
XHTML 1.0 assumes a &lsquo;generic&rsquo; XML parser.</p>
<p>The absence of the difference from the list of differences
is to be understood not as a statement that there is no difference,
but as an omission from the list, either because it was
regarded as uninteresting or because the WG didn't notice this
particular difference.  The overarching goal in developing XHTML 1.0
was, to quote the chair of the HTML WG, &ldquo;to be a generic
XML as we could&rdquo;.</p>

<p>
So: we conclude that XHTML 1.0, unlike HTML 4, does not exclude 
x7F and the C1 controls by means of its SGML declaration. If
they are legal in XML in general, they are legal in XHTML 1.0.</p>
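<p>
One way to see what a generic XML 1.0 parser makes of these characters
is to try one.  The sketch below uses Python's xml.etree.ElementTree
(built on expat) purely as an example of such a parser; it says
nothing about opensp or the validator, and the one-element document is
a made-up stand-in, not the Wikipedia test page.
<eg><![CDATA[
```python
import xml.etree.ElementTree as ET

# A tiny stand-in document (hypothetical; not the Wikipedia test page).
C1_SAMPLE = "\x7f\x80\x9f"  # DEL plus the first and last C1 controls

# The characters as literals in the document text.
literal_doc = "<p>" + C1_SAMPLE + "</p>"
# The same characters written as numeric character references.
reference_doc = "<p>&#x7F;&#x80;&#x9F;</p>"

literal_text = ET.fromstring(literal_doc).text
reference_text = ET.fromstring(reference_doc).text
print(literal_text == C1_SAMPLE, reference_text == C1_SAMPLE)

# For contrast: a C0 control outside Char should be rejected.
try:
    ET.fromstring("<p>&#x1;</p>")
    c0_rejected = False
except ET.ParseError:
    c0_rejected = True
print(c0_rejected)
```
]]></eg>
</p>
<p>
On a parser that follows production [2], both comparisons come out
true and the reference to x1 is rejected as not well-formed.
</p>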
<p>N.B. this does not mean that it's a good idea to 
<emph>use</emph> 7F and the C1 controls in XHTML 1.0 documents.
See the article
<title>FAQ: HTML, XHTML, XML and Control Codes</title>
at 
<xref>http://www.w3.org/International/questions/qa-controls</xref>
for a concise statement of why they should in practice be avoided.</p>
</div>
<div>
<head>
Summary  
</head>
<p>
1) There's definitely a bug in the SGML declarations in wide
circulation for XML 1.0.  Either that, or I am being defeated once
more by ISO 8879's character set mechanisms.
</p>
<p>
2) HTML 4 excludes 7F and the C1 range.
</p>
<p>
3) XHTML 1.0 does <emph>not</emph> exclude 7F and the C1 range.
</p>
<p>
4) The validator seems to be correct in rejecting the characters
in question in HTML 4 documents.</p>
<p>5) The validator appears to be incorrect in rejecting the test
document at <xref>http://test.wikipedia.org/wiki/User:R._Koot/C1-2</xref>,
which self-identifies as XHTML 1.0 Transitional.</p>
<p>

I hope this helps.
</p>
</div>
</body>
</text>
</TEI.2>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/Library/SGML/Public/Emacs/sweb.ced"
sgml-omittag:t
sgml-shorttag:t
End:
-->

