Are C1 characters legal in XHTML 1.0?

De litteris regentibus C1 quaestiones septem, or: Are C1 characters legal in XHTML 1.0?

23 March 2007

C. M. Sperberg-McQueen, with a little help from my friends

A friend writes: (Olivier Thereaux, help with "sgml declaration for xml"? Email to MSM and w3t-archive, 23 March 2007.)

There is a reported bug in the validator, that SGML character number 128-159 are not allowed for xml-based markup languages. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164

We have a test case at: http://validator.w3.org/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2&charset=%28detect+automatically%29&doctype=Inline which indeed complains about these.

Our parser is opensp, and our opensp uses http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.soc as a catalog in xml mode, and thus http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl as an sgml declaration for xml.

In the bugzilla item I mentioned above, Terje Bless, who generally knows much more about SGML than I do, thinks it may just be that our sgml declaration for xml should be updated to include this character range. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164#c5

As I am rather confused by the issue, I'd appreciate any guidance, diagnosis, or pointer, you could provide.

This is the second time this week I've indulged my inner language-lawyer in response to some query. I reproduce my reply to this question here, as an Awful Warning to those who might otherwise be tempted to ask me questions about XML. What I wrote was (more or less):

OK, I'll try.

I'm going to ask a number of short, pointed questions, provide long, digressive answers (sorry about that), and then say what I think it all means for your problem.

Note that the character range x80-x9F is known, for historical reasons, as "the C1 range" or "the C1 characters". I can explain if you wish. But be careful what you wish for.

Are characters x7F through x9F legal in XML?

In XML 1.0, the grammar production for Char includes them, so yes.

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The formulation of the Char production has changed from time to time, but 7F and the C1 range have always been included.

In XML 1.1, the grammar production for Char continues to allow them, but the 'document' production takes care to exclude them (in their literal form) from the document.

    [1] document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

In 1.1, the C1 characters may be referred to using numeric character references (&#x80; etc.) but not used as literal characters.

So: XML 1.0 does not forbid the use of these characters. XML 1.1 forbids their appearance as literals but not as numeric character references.
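For readers who prefer code to grammar productions, the two rules can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and the ranges are transcribed from production [2] of XML 1.0 and from the Char and RestrictedChar productions of XML 1.1. One wrinkle it makes visible is that XML 1.1 leaves x85 (NEL) out of RestrictedChar, so that one member of the C1 range may still appear as a literal in 1.1.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches XML 1.0 production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """True if cp matches XML 1.1 RestrictedChar (usable only via a reference)."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def literal_allowed(cp: int, version: str) -> bool:
    """May cp appear as a literal character in a document of this XML version?"""
    if version == "1.0":
        return is_xml10_char(cp)
    # XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    is_char11 = (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
    return is_char11 and not is_xml11_restricted(cp)


# DEL and a typical C1 control: legal literals in 1.0, not in 1.1.
for cp in (0x7F, 0x80, 0x9F):
    print(hex(cp), literal_allowed(cp, "1.0"), literal_allowed(cp, "1.1"))
```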

Are these characters legal in Unicode?

It might be argued (I think Chris Lilley has done so) that since the C1 characters aren't really Unicode, they aren't legal in a document whose document character set is supposed to be Unicode.

The Unicode 2.0 spec open on my desk, however, says

Like the C0 control codes, the Unicode Standard makes no specific use of these C1 control codes, but provides for the passage of their numeric code values intact, neither adding to nor subtracting from their semantics. The semantics of the C1 controls are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the semantics specified in ISO 6429.

(p. 6-5, section "Latin-1 Supplement: U+0080 - U+00FF")

I take that to mean that for all intents and purposes they are legal Unicode characters. That Unicode does not assign meanings to them does not constitute an argument that they are excluded from Unicode: there are lots of gaps in Unicode. U+0FB0, for example, is also not defined as meaning a specific character (at least in Unicode 2.0; I'm too lazy to check the current version), but it's clearly got to be accepted in a Unicode data stream.

So: Unicode includes these characters.

Does the W3C validator accept them?

No. The test case you linked to illustrates that very nicely.

Does the SGML declaration used by opensp really exclude them? How?

Yes, the declaration at http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl?rev=1.3&content-type=text/x-cvsweb-markup, which you say our installation of opensp is using, does exclude 7F and the C1 range.

The document character set is defined by a CHARSET declaration:

    CHARSET
      BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
      DESCSET
        0       9        UNUSED
        9       2        9
        11      2        UNUSED
        13      1        13
        14      18       UNUSED
        32      95       32
        127     1        UNUSED
        128     32       UNUSED
        160     55136    160
        55296   2048     UNUSED   -- surrogates --
        57344   8190     57344
        65534   2        UNUSED   -- FFFE and FFFF --
        65536   1048576  65536    -- 16 planes outside BMP --

The "document character set" as defined by SGML is rather unlike the "document character set" concept of HTML 4, which brilliantly co-opted the SGML term and gave it a new and better meaning. (At least, that's the way I understand the history of events.)

As defined by SGML, the document character set is the actual coded character set (aka character encoding) the parser can expect to encounter, conceived as a mapping from integers to characters. The bit combinations come in, and the parser knows what characters they represent by reference to the character set declaration.

In HTML, by contrast, the "document character set" is the repertoire of abstract objects called "characters" which may occur in an HTML document and which are mapped 1:1 with a set of integers. The integer mappings are relevant for numeric character references, but for nothing else. In particular, the HTML spec explicitly clarifies that the document character set has nothing in particular to do with the encoding in which data may arrive, except that the abstract characters encoded by the encoding had better be present in the document character set. (The HTML and later XML view is concisely summarized by Gavin Nicol at http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0287.html.)

XML essentially adopted the ideas of HTML 4 on this point: the ISO 10646/Unicode character set is conceived as a large and abstract pairing of integers and characters, one step divorced from the messy business of actual encoding.

So from the point of view of an SGML processor, the character set description above documents which bit patterns will and won't occur in the input stream. (The ones that won't occur are important, because the SGML spec assumes a processor may want to use them for its own internal purposes.) From the HTML and XML point of view, the description documents which characters will and won't occur.

How does it work? The BASESET says that we'll describe the document character set by reference to the coded character set whose public identifier is "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6". The reader is assumed to be in a position to understand what references to that character spec mean.

The DESCSET bit contains a sequence of triples which assign meanings to integers, using a kind of run-length documentation. "0 9 UNUSED" means the numbers from 0 to 8 (starting position 0, length of sequence 9) are not used. They are NONSGML characters -- they may (in SGML but not in XML 1.0) be referred to by means of numeric character references, but they will NOT appear as literals. "9 2 9" means that characters 9 and 10 (sequence of length 2, starting at 9) have the meanings of characters 9 and 10 in the base character set, which in this case are HT and LF.

And so on. So the lines

    127     1        UNUSED
    128     32       UNUSED

mean that the character whose number is 127 (conventionally DEL) is not used, and neither are the 32 characters from 128 through 159 (x80 - x9F).
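The run-length reading of the triples can be made concrete with a small sketch. The Python below is only an illustration of the arithmetic, not anything opensp does internally; the names DESCSET, expand and is_used are hypothetical, and the triples are transcribed by hand from the declaration quoted above, with None standing in for UNUSED.

```python
# (start, count, base) triples from the DESCSET; None stands for UNUSED.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),        # surrogates
    (57344, 8190, 57344),
    (65534, 2, None),           # FFFE and FFFF
    (65536, 1048576, 65536),    # 16 planes outside BMP
]


def expand(descset):
    """Turn run-length triples into inclusive (first, last, used) ranges."""
    return [
        (start, start + count - 1, base is not None)
        for start, count, base in descset
    ]


def is_used(cp, descset=DESCSET):
    """True if the declaration lets cp appear as a literal character."""
    for first, last, used in expand(descset):
        if first <= cp <= last:
            return used
    raise ValueError(f"code point {cp} is not described by the DESCSET")


for first, last, used in expand(DESCSET):
    print(f"{first:>7X}..{last:<7X} {'used' if used else 'UNUSED'}")
```

Each triple covers the code points from start through start + count - 1, so "128 32 UNUSED" works out to 128 through 159, as stated.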

A character set declaration similar to this one, but which allows DEL and the C1 range, would have a DESCSET section like this:

    DESCSET
      0       9        UNUSED
      9       2        9        -- HT, LF --
      11      2        UNUSED
      13      1        13       -- CR --
      14      18       UNUSED
      32      95       32       -- space through tilde --
      127     1        127      -- DEL, legal in XML 1.0 --
      128     32       128      -- C1 controls, legal in XML 1.0 --
      160     55136    160
      55296   2048     UNUSED   -- surrogates --
      57344   8190     57344
      65534   2        UNUSED   -- FFFE and FFFF --
      65536   1048576  65536    -- 16 planes outside BMP --

You could of course replace the three lines for 127-55295 with the single line "127 55169 127" (55295 - 127 + 1 = 55169), but that would only confuse people, I think.

So: the SGML declaration used by many people as representing the rules of XML 1.0 disagrees with the XML 1.0 spec on the characters x7F through x9F.
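The disagreement can be checked mechanically. The sketch below is an illustration under my own assumptions (hypothetical names, triples copied by hand from the declaration quoted earlier, None for UNUSED): it marks every code point the DESCSET allows and compares the result with the XML 1.0 Char production, then repeats the comparison for a variant in which 127 and 128-159 map to themselves.

```python
# (start, count, base) triples from xml.dcl; None stands for UNUSED.
DESCSET = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13), (14, 18, None),
    (32, 95, 32), (127, 1, None), (128, 32, None), (160, 55136, 160),
    (55296, 2048, None), (57344, 8190, 57344), (65534, 2, None),
    (65536, 1048576, 65536),
]

# Variant that lets DEL and the C1 range stand for themselves.
FIXED = [(s, n, s if s in (127, 128) else b) for s, n, b in DESCSET]


def is_xml10_char(cp):
    """XML 1.0 production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def disagreements(descset):
    """Code points on which the DESCSET and the Char production differ."""
    used = bytearray(0x110000)          # one flag per code point
    for start, count, base in descset:
        if base is not None:
            used[start:start + count] = b"\x01" * count
    return [cp for cp in range(0x110000) if is_xml10_char(cp) != bool(used[cp])]


diff = disagreements(DESCSET)
print(hex(diff[0]), hex(diff[-1]), len(diff))   # 0x7f 0x9f 33
print(disagreements(FIXED))                     # []
```

The only code points on which the circulated declaration and the Char production differ are the 33 from x7F through x9F; with the two triples changed, the lists agree everywhere.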

It may be worth noting that the rule in the XML 1.1 spec which some people find odd, that says that the characters in the range x7F-x9F may be referred to using numeric character references but must not appear as literals, is precisely the rule implied by the SGML declaration: by marking the characters UNUSED, the SGML declaration says they don't appear as literals, but not that they can't be referred to numerically.

When did this discrepancy enter into the SGML declaration?

As far as I can tell, it's always been there.

I have looked at all the published drafts of XML 1.0 to see if some early draft excluded the C1 characters; no, as mentioned above they all include 7F and the C1 controls.

I have consulted Dave Peterson, who worked intensively with me in the winter of 1996-97 to categorize all of the divergences between SGML and the first draft of XML, and who on the basis of that work prepared the first draft of what became the Web SGML Annex, to ask him if he remembered the responsible ISO WG deciding that they needed to exclude the C1 controls. He has no memory of such a decision, and neither he nor I can think of a reason the SGML WG would have felt it necessary. The SGML spec goes to extreme lengths to try to make it possible to describe arbitrarily weird encodings and use them to encode SGML documents. (In fact even the huge complexity of the character set mechanisms in SGML falls short of the ingenuity of some designers of character encodings, so SGML can't describe some existing encodings well -- but even so, those encodings can be used to encode SGML documents.)

It appears likely that the SGML declaration in the Web SGML Annex was copied from the SGML declaration formulated by James Clark during the development of XML and published as part of the SGML/XML note (http://www.w3.org/TR/NOTE-sgml-xml-971215.html). That SGML declaration excludes these characters; I do not understand why. Nor do I understand why the discrepancy between the definition of Char in the XML spec and the SGML declaration was not noticed by the XML Working Group and eliminated. Possibly I have simply forgotten some discussion of the topic.

Further excavation reveals that an SGML declaration was included in the first published working draft of XML -- in the printed form only, however, not the version at http://www.w3.org/TR/WD-xml-961114

Like every other SGML declaration for XML I have found today, that one excludes 7F and the C1 controls.

Those whose pain threshold for character set discussions has not already been exceeded will find more discussion in the long thread at http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0162.html which seems to indicate that the first SGML declaration for XML was actually drafted not by James Clark but by Jon Bosak. If it's the one in http://www.w3.org/TR/WD-xml-lang-970331 then the error really HAS always been there.

Are these characters legal in HTML 4?

Initially, one might be unsure.

The prose suggests that they are legal. HTML 4.01 says that its document character set is Unicode, and nowhere in the section on HTML document representation in HTML 4.01 (http://www.w3.org/TR/1999/REC-html401-19991224/charset.html) have I found anything that implies that the C1 characters are excluded.

On the other hand, the HTML 4 spec has an SGML declaration that indicates quite clearly that the characters are not legal. (http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html)

The relevant part of the SGML declaration reads:

    CHARSET
      ...
      DESCSET
        0       9        UNUSED
        ...
        127     1        UNUSED
        128     32       UNUSED
        ...

Is the SGML declaration normative? It would seem to be: it's in a numbered section, not an appendix, and it's not labeled non-normative or informative. And the section on conformance describes HTML as a conforming SGML application.

So: I conclude that the SGML declaration is normative and that 7F and the C1 controls are not legal in HTML 4.

Are these characters legal in XHTML 1.0?

It might appear not.

XHTML describes itself as a reformulation in XML of HTML 4.01, so I believe that the character-set restriction of HTML 4 is inherited by XHTML 1.0. It's no longer enforced by the lower-level markup system, so in XHTML it would appear to be an "application convention", i.e. a rule that goes beyond those imposed by XML. The comparison of XHTML 1.0 with HTML 4.01 (http://www.w3.org/TR/2002/REC-xhtml1-20020801/#diffs) seems to suggest that the differences are all subtractions from the set of legal documents: XHTML forbids some things allowed by HTML 4. If there are any points where it says XHTML allows things disallowed by HTML 4, I didn't see them.

May we conclude that XHTML 1.0, like HTML 4, excludes x7F and the C1 controls?

In the first draft of this treatise I did so conclude. But a different analysis is possible.

XHTML 1.0 was intended as an XML 1.0 application, and all XML applications have the same rule for character sets. The WG regarded itself as just adopting whatever it was that the XML spec said; they didn't believe they had an option.

On that analysis, the rule for XHTML 1.0 is whatever the rule for XML 1.0 is, which does not forbid these characters. XHTML 1.0 assumes a ‘generic’ XML parser.

The absence of the difference from the list of differences is to be understood not as a statement that there is no difference, but as an omission from the list, either because it was regarded as uninteresting or because the WG didn't notice this particular difference. The overarching goal in developing XHTML 1.0 was, to quote the chair of the HTML WG, “to be a generic XML as we could”.

So: we conclude that XHTML 1.0, unlike HTML 4, does not exclude x7F and the C1 controls by means of its SGML declaration. If they are legal in XML in general, they are legal in XHTML 1.0.

N.B. this does not mean that it's a good idea to use 7F and the C1 controls in XHTML 1.0 documents. See the article FAQ: HTML, XHTML, XML and Control Codes at http://www.w3.org/International/questions/qa-controls for a concise statement of why they should in practice be avoided.
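Anyone who wants to follow that advice needs a way to find the offending characters. A minimal sketch, assuming the document has already been decoded to a Python string; the name find_c1 is hypothetical and not part of the validator or any library. It splits on "\n" rather than using str.splitlines(), because splitlines() treats x85 (NEL) as a line break and would hide it.

```python
import re

# DEL (x7F) and the C1 range (x80-x9F) as literal characters.
C1_OR_DEL = re.compile(r"[\x7f-\x9f]")


def find_c1(text):
    """Return (line, column, code point) for each literal DEL or C1 character."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in C1_OR_DEL.finditer(line):
            hits.append((lineno, match.start() + 1, ord(match.group())))
    return hits


sample = "<p>fine</p>\n<p>bad\x80char</p>"
for lineno, column, cp in find_c1(sample):
    print(f"line {lineno}, column {column}: U+{cp:04X}")
```

Numeric character references are left alone, since the text of a reference contains no C1 character.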

Summary

1) There's definitely a bug in the SGML declarations in wide circulation for XML 1.0. Either that, or I am being defeated once more by ISO 8879's character set mechanisms.

2) HTML 4 excludes 7F and the C1 range.

3) XHTML 1.0 does not exclude 7F and the C1 range.

4) The validator seems to be correct in rejecting the characters in question in HTML 4.0 documents.

5) The validator appears to be incorrect in rejecting the test document at http://test.wikipedia.org/wiki/User:R._Koot/C1-2, which self-identifies as XHTML 1.0 Transitional.

I hope this helps.