Notes on schema annotation

C. M. Sperberg-McQueen

12 February 2002

rev. 12 April 2002

note added 30 January 2003

$Id: schema-annotation.html,v 1.3 2003/01/30 14:19:49 cmsmcq Exp $



These notes attempt to establish some common ground for discussions of schema annotation and what W3C or others should do about it. They are publicly readable.
The immediate occasion of these notes is a set of discussions about what, if anything, the W3C XML Schema Working Group should be doing about schema annotation. Several people have indicated to me that they would like to see some work on the problem, but I am not clear on what, exactly, they would like the XML Schema WG to do. This is my attempt to explain why I find the matter unclear, and to make it easier for them to tell me what they have in mind. This version of this document has benefited from comments from Tim Berners-Lee, to whom thanks.

1. Observations and assumptions

Let me start by trying to make explicit some assumptions I am making which may or may not be fully shared by others.
  1. An application of XML or SGML defines what some people call a markup language, and what other people would prefer to call a markup vocabulary or namespace.[1] Since some people prefer to reserve the term markup language for meta-languages like XML and SGML, the following discussion will use the term vocabulary — without, however, intending to obscure the fact that the XML-based applications in question do have rules that go beyond the provision of names and may be captured in whole or in part by syntactic formalisms.
  2. The constructs of an application of XML or SGML (that is, a vocabulary) include elements (or element types), attributes, notations, processing-instruction targets, and entities.
  3. In some cases, the constructs of a vocabulary may also include simple or complex datatypes and substitution groups (as in XML Schema 1.0), non-terminals (as in Relax and Relax NG), classes (as in the ODD system used to generate the Text Encoding Initiative DTDs), or other abstractions.
  4. Where it is necessary to refer to the occurrences of markup constructs in actual documents, I will often refer to information items, without wishing to imply that my discussion is limited to constructs which have defined names in the XML Information Set specification.
  5. Many people (vocabulary designers, schema and DTD authors, application developers, people trying to make it easier to work with documents in markup languages designed by others, and no doubt others, too) wish to say, for specific constructs in a vocabulary, what they mean.[2]
  6. Some of those who wish to say what markup constructs mean wish to do so using some machine-processable notation; others would be happy with better tools for human-understandable documentation. I am here concerned mostly with the former, though I think good rules for machine-processable specification of meaning will often also help make meaning clear to humans.
  7. Different people have very different ideas of what it would mean, for the constructs of a vocabulary, to say what they mean. For purposes of discussion, I identify four:
    Some people mean by this that they wish to be able to specify how data structures internal to some application software are serialized as XML, or how XML is de-serialized into data structures; questions like “When does an element become an object of class Foo, and when does it become an object of class Foobar?”, asked with reference to some set of object classes defined in some programming language, are central to their concerns. Call this the concrete data structure mapping problem.
    Others wish to specify how to map XML document instances into columns, rows, and tables in some SQL database management system; sometimes they wish to specify a mapping into new rows of existing tables, and sometimes what is needed is a mapping which would specify which new tables to create. Call this the abstract data structure mapping problem. It differs from the concrete data structure mapping problem as the abstraction of a SQL table differs from the various programming-language constructs which might be used to implement the abstraction.
    Still others wish to specify a mapping into first-order predicate calculus as a way of defining the correct interpretation of markup. Call this the FOPC mapping problem.
    Some wish to map arbitrary XML into RDF. Call this the RDF mapping problem.
  8. I believe that these four mapping problems cover the ground, in that all the people I know who want to talk about the meaning of markup fall into one or another of these four groups. But I have no proof that the classification is necessarily exhaustive, and I don't believe anything described in these notes crucially depends on the classification being exhaustive.
  9. Some people believe that all of these desires (mapping into arbitrary data structures, mapping into code-independent constructs like tables and rows, mapping into RDF triples, mapping into some version of first-order predicate calculus) are at root ‘the same thing’. Mostly, they seem to mean by this that if a formalism is provided for what they wish to do, they believe that everyone else's requirements will be met. They do not, in general, seem to mean that if anyone else's requirements are met, they will be able to do what they wish to do.
  10. Some people believe that the four mapping problems described above do not necessarily have much in common.
  11. The four mapping problems identified above do have in common that they involve mapping from XML notation into some other model. Let us call this other model the target model.
  12. If the target model has a syntax in which it can be serialized, I call that syntax the target formalism.
  13. If the target model has a corresponding target formalism, then all four of the mapping problems can be conceived of as involving the translation of information from one syntax (XML) into some other syntax. A mapping problem may be conceived of as a syntax-to-syntax translation even if, in practice, the result desired is not a string of characters denoting some abstraction, but some other representation of that abstraction, such as an in-memory data structure.
  14. Many applications of XML in use today emphasize convenience for authors or software developers over simplicity of the mapping to any target model; some applications do not specify any model different from the basic XML data model of nodes in a tree, with arbitrary links expressed by ID/IDREF links or by application-level information. TEI, HTML, and DocBook are examples of such applications. Following Noah Mendelsohn, I refer to the XML used by applications of this sort as colloquial XML.
  15. In contrast, some applications of XML in use today obey strict rules for mapping XML constructs into constructs in some target model non-isomorphic to XML. RDF is an example: every XML construct in an RDF data stream maps into an RDF triple, a part of a triple, or a set of triples, using relatively straightforward rules. Similarly, every XML construct in TEI feature system markup maps into a feature structure, a feature, a value of a feature, or a set of feature structures, following simple and unvarying rules. XML in Layman normal form, or in any of the normal forms distinguished by Henry Thompson [Thompson 2001], has a simple mapping into labeled graph structures.[3] By analogy with the term colloquial XML, we may suggest the term artificial XML for the XML generated by these applications. (A short example contrasting the two styles follows this list.)
  16. None of the four mapping problems identified above appear very difficult in the case of artificial XML. So most of the interest in mechanisms for the mapping problem focuses on the problem of mapping colloquial XML into some non-XML model.
  17. Even in the case of artificial XML, however, the fact remains that XML is not identical to the target formalism. The mapping from XML into the target model or formalism will thus require rules, which may usefully be made explicit.
  18. Any solution to the mapping problem should be applicable both to colloquial and to artificial XML.
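To make the contrast between colloquial and artificial XML concrete, consider the same small piece of bibliographic information in both styles. The element names, content, and identifier below are invented for illustration; dc:creator and dc:title are Dublin Core properties. A colloquial encoding might read:

    <bibl id="doe99">
      <author>Jane Doe</author>
      <title>On Markup</title>
    </bibl>

An artificial-XML rendering of the same information in RDF might read:

    <rdf:Description rdf:about="#doe99">
      <dc:creator>Jane Doe</dc:creator>
      <dc:title>On Markup</dc:title>
    </rdf:Description>

In the colloquial form, the rules relating author and title to the containing bibl element live in the vocabulary's documentation; in the RDF form, each element maps into (part of) a triple by a single unvarying rule.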

2. What's the problem?

There may be several ways to understand or formulate the problem we should be solving. Mine is as follows.
If the assumptions outlined above hold true, then the problem of saying what some markup means may at some level be conceived of as the problem of showing how to translate from XML into the target formalism. For an organization like W3C, the associated challenge is providing a mechanism by means of which one can describe how constructs in a given vocabulary are to be translated into some target formalism.

3. Design space

W3C could meet the challenge, or solve the problem, in any of several ways, some of them represented by existing designs, and some suggested by experience or by recent research papers. The following sections describe various approaches to the problem; they are not all ‘solutions’ in the same sense, and some of them (though not all) are incompatible with each other.

3.1. Schema appinfo

One approach would be to say that XML Schema should provide some ability to include, in an XML Schema document, some declarations describing the mapping from XML which conforms to the schema into some target model or formalism. The Cambridge Communiqué [Swick/Thompson 1999], in fact, says exactly that (emphasis added):
An XML Schema schema document will be able to hold declarations for validating instance documents. It should also be able to hold declarations for mapping from instance document XML infosets to application-oriented data structures.
XML Schema provides the appinfo element on most schema constructs, in order to meet this requirement.
By some possible interpretations, therefore, the problem is already solved.
Note that the appinfo element provides a place where declarations relevant to the mapping problems may be placed. It does not provide or define a vocabulary in which to express, define, or specify mappings. It thus remains agnostic on the question of mapping-language design.
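For concreteness, here is a sketch of what such an annotation might look like. The xs:annotation and xs:appinfo elements are XML Schema 1.0; the map vocabulary and its namespace are invented for illustration:

    <xs:element name="price" type="xs:decimal">
      <xs:annotation>
        <xs:appinfo>
          <!-- hypothetical mapping vocabulary, not part of XML Schema -->
          <map:column xmlns:map="http://example.org/rdb-map"
                      table="lineItem" name="price"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>

A schema processor validates as usual; a mapping processor that understands the map vocabulary can harvest the appinfo contents; all other software simply ignores them.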

3.2. Schema adjunct framework

The Schema Adjunct Framework, developed by TIBCO Extensibility and others and publicly described in [Vorthmann/Buck 2000] and [Vorthmann/Robie 2001], is a generalization of the XML Schema 1.0 appinfo element. The framework defines a class of documents (schema adjuncts) within which arbitrary markup constructs (identified by expressions using a subset of XPath 1.0) may be associated with arbitrary processing code or mapping specifications, with the proviso that the processing code or mapping specification must be in XML.
There are a number of design choices which might be criticized in SAF; languages which gave different answers to the questions underlying those design choices would occupy the same area of the solution space: all are languages for associating arbitrary code with arbitrary markup constructs.
SAF differs from the appinfo element of XML Schema 1.0 both theoretically and practically. The appinfo element allows arbitrary information to be associated with the declaration of a markup construct, and thus indirectly with all the instances of that construct in document instances; because SAF context expressions use XPath, they are associated directly with all the parts of document instances which match the XPath, and only indirectly with the abstract construct associated with a schema declaration. Because XPath expressions can pick out sets of information items which differ from the sets associated with specific declarations, it is theoretically possible to use SAF to express associations which cannot be expressed directly using appinfo. It is not obvious whether this should be regarded as a strength of SAF or as a design flaw; many obvious illustrations of this fact appear to be pathological.
On the practical level, SAF differs from appinfo in being separate from the schema itself. The authors motivate this by arguing that those responsible for specifying and maintaining mappings of the sort supported by SAF are often not the same people responsible for specifying and maintaining schemas or other formal definitions of vocabularies,[4] and that the maintenance cycle, access patterns, and so on are likely to be very different.
Like the appinfo mechanism, SAF does not itself define a mapping language but remains mapping-language agnostic.
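The general shape of a schema adjunct is roughly as follows; the element and attribute names here are illustrative approximations rather than quotations from the SAF specification, and the sql vocabulary is invented:

    <schema-adjunct target="http://example.org/po.xsd">
      <element which="purchaseOrder/shipTo">
        <sql:table xmlns:sql="http://example.org/sql-map">addresses</sql:table>
      </element>
      <element which="purchaseOrder/shipTo/@country">
        <sql:column xmlns:sql="http://example.org/sql-map">country</sql:column>
      </element>
    </schema-adjunct>

Even in a sketch the essential points are visible: the adjunct lives outside the schema, identifies its targets by XPath-based context expressions, and carries an arbitrary XML payload for each target.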

3.3. Skeleton sentences

In the course of research undertaken with Allen Renear (UIUC) and Claus Huitfeldt (Bergen), I have come to believe that the best way to say what markup means is to specify the set of inferences licensed by the use of markup. Our paper [Sperberg-McQueen/Huitfeldt/Renear 2001a] outlines an approach which involves us in the FOPC mapping problem (our approximation to FOPC is, in that paper, Prolog, but we are agnostic on the target formalism, and some commentators believe, as I do, that our approach could be applied to the other mapping problems as well).
The key difference between our approach and the others already described is that we make some explicit claims about the internal structure of any successful mapping language.
Specifically, we argue
that the inferences licensed by colloquial XML typically involve implicit reference to information items in the document instance;
that the inferences may be described at an abstract level as a set of sentence schemata in some language (e.g. Prolog or English) which have blanks in them — these we call skeleton sentences or sentence skeletons and I will refer to the number of blanks in a sentence as its arity;
that (some of) the inferences licensed by colloquial XML markup may be listed by filling in the blanks in the skeleton sentences associated with the markup construct in question;[5]
that in a correct description of most forms of colloquial XML, not all constructs will have skeleton sentences of the same arity — different element types, for example, may be associated with skeleton sentences with one, two, or more blanks — and also that the blanks are not always to be filled in in the same way;[6]
that the blanks in skeleton sentences are typically to be filled in from information items which are in some specified location, often a location relative to the location being interpreted (e.g. "the parent element" or "the nearest ancestor with a value for xml:lang") rather than an absolute location;
that some language is necessary in which one can specify how to fill in the different blanks in the skeleton sentences; since expressions in this language must refer to locations relative to some reference point by some kind of virtual pointing, we call them deictic expressions, borrowing a traditional grammatical term for pointing expressions; the act of pointing we call deixis. (A hypothetical encoding of one such skeleton sentence is sketched just below.)
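By way of illustration, a skeleton sentence for a title element might be encoded in XML along the following lines. The notation is entirely hypothetical, invented here only to show the moving parts (a sentence with blanks, an arity, and one deictic expression per blank); the deictic expressions are XPath:

    <skeleton construct="title" arity="2">
      <sentence>The title of the work identified as _1 is _2.</sentence>
      <blank n="1" deixis="ancestor::bibl[1]/@id"/>
      <blank n="2" deixis="string(.)"/>
    </skeleton>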
I would have said our claims were obvious and non-controversial, except that the sample mapping specifications given as examples in the SAF paper (for example) do not seem to possess what we think are the required characteristics.
In a later talk [Sperberg-McQueen/Huitfeldt/Renear 2001b], we describe a system for enumerating inferences licensed by XML markup; this system uses XPath expressions for deixis, and XSLT to generate actual sentences from skeletons.
The implementation we described is a toy intended primarily to illustrate the ideas, but it could (I believe) be refined to make it useful in practice. We do not have useful sample data at present because my co-authors and I have found that the task of specifying exactly what sentence skeletons should be associated with the element types of the TEI DTD is harder than it looked. We thus have a system which, given a set of sentence skeletons to associate with markup constructs, can readily generate a set of inferences licensed by that markup (or, in other words, translate document instances into FOPC, RDF, or other desired target formalisms), but we have at the moment no suitable set of sentence skeletons for any existing XML vocabulary.
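A minimal sketch of such a system, assuming the hypothetical skeleton just shown and an invented bibl/title vocabulary, might be a single XSLT 1.0 stylesheet which performs the deixis with XPath and emits a Prolog-style fact; this illustrates the idea and is not the implementation described in the talk:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- Skeleton: title(_1, _2).  Blank 1 is filled by deixis to the
           nearest ancestor bibl's id; blank 2 by the element content. -->
      <xsl:template match="title">
        <xsl:text>title('</xsl:text>
        <xsl:value-of select="ancestor::bibl[1]/@id"/>
        <xsl:text>', '</xsl:text>
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>').&#10;</xsl:text>
      </xsl:template>
      <!-- Suppress the default copying of text nodes. -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>

Run over the bibl example given earlier, this would yield the fact title('doe99', 'On Markup').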

3.4. XSLT transformations

To the extent that the mapping problems can be conceived of as requiring the translation of XML data into other models or syntaxes, an obvious candidate for specifying solutions to the problem is XSLT. As noted in the previous section, I have a toy implementation of the sentence-skeleton idea which uses XSLT for just this purpose.
The authors of the Schema Adjunct Framework argue that XSLT cannot do what SAF does, but this is clearly false. The XSLT extension framework allows arbitrary user-defined functions to be used, and these may have arbitrary meaning, including any meaning claimed to be exclusively the province of SAF. Work on full-text indexing at Sun (reported at XTech 2000) illustrates the point.
If we accept that all four mapping problems can be solved by specifying suitable mappings, and that the problem for W3C may be solved by providing a suitable language in which to specify the mappings, then we are forced to conclude that by defining XSLT the W3C has at some level already solved the schema-annotation problem.
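For the RDF mapping problem, for instance, a stylesheet along the following lines suffices to turn colloquial XML into RDF/XML; it assumes the same invented bibl vocabulary used above and Dublin Core properties in the output:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <xsl:template match="/">
        <rdf:RDF>
          <xsl:apply-templates select="//bibl"/>
        </rdf:RDF>
      </xsl:template>
      <!-- One description per bibl; each child maps to one property. -->
      <xsl:template match="bibl">
        <rdf:Description rdf:about="#{@id}">
          <dc:creator><xsl:value-of select="author"/></dc:creator>
          <dc:title><xsl:value-of select="title"/></dc:title>
        </rdf:Description>
      </xsl:template>
    </xsl:stylesheet>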

3.5. Specialized languages

Specifying a mapping by means of an XSLT stylesheet, however, is not regarded by everyone as an entirely satisfactory solution.
XSLT is Turing-complete. While it is certainly possible to specify simple, perspicuous mappings using XSLT, it is also necessarily possible to specify mappings which are difficult to understand and thus unsuitable for consumption by humans or by machines which wish not to perform the mapping but to reason about it.
It may seem desirable, therefore, to solve particular mapping problems by specifying a specialized language for that mapping problem. Such a language can be
small
declarative
weaker than any Turing-complete language
structured to exploit the regularities in a particular mapping problem (e.g. that a mapping to a relational database will need to specify database names, table names, and column names, and no other names, or that a mapping to RDF will invariably involve specifying subjects, verbs, and objects, but need not support specification of n-ary relations for n other than 2)
designed to use suggestive application-relevant names
It is possible to design such specialized languages so as to make the specification of a particular mapping simple, convenient, and clear to anyone interested in the application in question.
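For the relational case, for example, a complete mapping specification in such a specialized language might amount to no more than the following; the vocabulary is invented here for illustration:

    <rdb-mapping database="orders">
      <map element="purchaseOrder" table="po"/>
      <map element="purchaseOrder/@orderDate" table="po" column="order_date"/>
      <map element="purchaseOrder/shipTo/@country" table="po" column="ship_country"/>
    </rdb-mapping>

Nothing here is Turing-complete; a human reader, or a tool which wishes to reason about the mapping, can take in the whole specification at a glance.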
Such specialized languages, however, rely on tight coupling to particular forms of the mapping problem. No single such specialized language will be equally suitable for all mapping problems. I infer that the many different people who have different mapping problems in mind will want many different specialized languages.
The same inference led the designers of XML Schema 1.0 and SAF to allow expressions in arbitrary languages within their host appinfo and SAF constructs, and to omit any attempt to define a particular language for specifying mappings.
That omission, in turn, makes appinfo and SAF (and any language in the same part of the design space) vacuous at some crucial level: such languages can provide a location where software and humans can seek specifications of a particular mapping, but they cannot themselves provide or express the solutions.

3.6. Design questions / axes

It is easy to see that the design space has at least the following dimensions:
define a concrete language for specifying mappings? or escape to the meta-level and allow many such languages?
specify mappings using a general-purpose language, or using a language specialized for a particular mapping problem or class of mapping problems?
place the mapping specifications inside the schema, or outside? In a single location, or in arbitrarily many locations?
If W3C should define a single concrete language for specifying mappings, then it is easy to see what deliverable the XML Schema WG or some other WG might take on: define a language in which schema authors or others can specify a mapping from markup constructs into RDF.
If, however, we believe that a language specialized for RDF will not be equally suitable for other mapping problems, and that a language not specialized for RDF will not be convenient for RDF (or for any other specific mapping problem), then we can conclude that W3C should not define a particular concrete language — at least, not a general-purpose one. On this analysis, if W3C defines a language for specifying mappings from arbitrary XML into RDF, it should be specialized for RDF, and should not be pushed as a solution to the other mapping problems. If users believe that RDF provides a useful way to describe arbitrary concrete or abstract data structures and FOPC, they will use the RDF mapping accordingly. If they don't believe it, telling them to use the RDF mapping will not persuade them.
If we do not wish to define any language for specifying mappings, then perhaps the right deliverable is one or more Notes showing how to define and use specialized languages for the purpose. The semantics of such a specialized language could usefully be described in terms of an XSLT stylesheet which performs the appropriate translation. (That is, I specify the meaning of my specialized language L for FOPC sentence skeletons describing vocabulary V by showing how to transform sentences in L into an XSLT stylesheet which in turn translates sentences in V into my target FOPC formalism. Eric specifies the meaning of his language M for RDF sentence skeletons describing V by showing how to translate sentences in M into an XSLT stylesheet which, given documents / sentences in V, produces RDF which captures their meaning.)
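A sketch may make the proposal concrete. Suppose the specialized language L consists of map declarations pairing an element name with a Prolog predicate; both the language and its mapping vocabulary are invented here. Its semantics can then be given by a meta-stylesheet which compiles L documents into ordinary XSLT, using the standard xsl:namespace-alias mechanism to write XSLT that writes XSLT:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:axsl="http://example.org/xslt-alias">
      <!-- axsl:* elements below are emitted as xsl:* elements. -->
      <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>
      <xsl:template match="/mapping">
        <axsl:stylesheet version="1.0">
          <axsl:output method="text"/>
          <xsl:apply-templates select="map"/>
          <!-- The generated stylesheet suppresses unmapped text. -->
          <axsl:template match="text()"/>
        </axsl:stylesheet>
      </xsl:template>
      <!-- Each map declaration compiles into one generated template
           which turns every matching element into a Prolog-style fact. -->
      <xsl:template match="map">
        <axsl:template match="{@element}">
          <axsl:text><xsl:value-of select="@predicate"/>('</axsl:text>
          <axsl:value-of select="normalize-space(.)"/>
          <axsl:text>').&#10;</axsl:text>
        </axsl:template>
      </xsl:template>
    </xsl:stylesheet>

Given the one-line mapping document <mapping><map element="title" predicate="title"/></mapping>, the compiled stylesheet turns every title element it encounters into a fact of the form title('...').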

4. Notes on the Cambridge Communiqué

Since reference is frequently made to the Cambridge Communiqué [Swick/Thompson 1999] as a reference point for discussions of schema annotation, it may be worthwhile to comment briefly on some of the points made in its section 3, Observations and Recommendations.
2. An XML Schema schema document will be able to hold declarations for validating instance documents. It should also be able to hold declarations for mapping from instance document XML infosets to application-oriented data structures.
As noted above, the XML Schema 1.0 appinfo element can occur as part of the annotation of most schema constructs. This element, like the documentation element which is its sibling, can contain any well-formed XML. It is thus able to hold declarations for mapping from instance document XML infosets to application-oriented data structures, in any notation which may be desired by the schema author.
3. For evolvability and interoperability, the XML Schema specification should provide an extension mechanism allowing for the augmentation of XML Schema schemas with additional material. At a minimum, XML Schema should permit elements from other namespaces to be included in schema documents. This extension mechanism should also permit individual extensions to be marked 'mandatory', meaning that a document instance cannot be deemed 'schema valid' if the processing required by a marked extension cannot be performed.
Because the documentation and appinfo branches of the annotation element can take any well-formed XML as content, they allow for the augmentation of schema documents with additional material.[7]
In addition, almost all schema constructs allow attributes from any namespace other than that of XML Schema; for some applications, some designers find that using attributes to augment the schema document is less obtrusive than using elements.
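For example (the map:column attribute and its namespace are invented for illustration; the permission to attach foreign-namespace attributes to schema constructs is real):

    <xs:element name="price" type="xs:decimal"
                xmlns:map="http://example.org/rdb-map"
                map:column="price"/>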
XML Schema 1.0 does not provide a mechanism for marking extensions as mandatory; this got dropped somewhere along the way, for reasons I no longer recall.
4. The extension mechanism should be appropriate for use to incorporate declarations ("mapping declarations") to aid the construction of application-oriented data structures (e.g. ones implementing the RDF model) as part of the schema-validation and XML infoset construction process. This facility should not be exclusive to RDF, but should also be useable to guide the construction of data structures conforming to other data models, e.g. UML.
I believe the appinfo element meets the goal described here.
5. Such mapping declarations should ideally also be useable by other schema processors to map in the other direction, i.e. from application-oriented data structures to XML infosets.
I agree that this would be particularly convenient, but I do not believe it is soluble in the general case. It is not difficult when there is something like an isomorphism between the XML representation of some information and its representation in a data structure, but I do not believe such isomorphisms are common.
7. XML Schema does not need to be the sole provider of support for layering application data structures on XML. XSLT, with a proposed extension mechanism, could be used for specifying mappings from XML document instances to application data structures - including RDF graphs. The reversibility of mappings specified with XSLT or similar transformation languages is an issue.
Yes.

A. References

This list is not quite complete.

[Swick/Thompson 1999] Swick, Ralph R., and Henry S. Thompson, ed. The Cambridge Communiqué. W3C NOTE 7 October 1999. http://www.w3.org/TR/schema-arch

[Sperberg-McQueen/Huitfeldt/Renear 2001a] Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. “Meaning and interpretation of markup.” Markup Languages: Theory & Practice 2.3 (2001): 215-234. http://www.w3.org/People/cmsmcq/2000/mim.html

[Sperberg-McQueen/Huitfeldt/Renear 2001b] Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. “Practical extraction of meaning from markup.” Paper given at ACH/ALLC 2001, New York, June 2001. (Slides at http://www.w3.org/People/cmsmcq/2001/achallc2001/achallc2001.slides.html)

[Thompson 2001] Thompson, Henry S. “Normal Form Conventions for XML Representations of Structured Data”. Talk at XML 2001, Orlando, December 2001. http://www.ltg.ed.ac.uk/~ht/normalForms.html

[Vorthmann/Buck 2000] Vorthmann, Scott, and Lee Buck. “Schema adjunct framework”. Draft Specification 24 February 2000. [Chapel Hill]: Extensibility. http://www.extensibility.com/saf/spec/

[Vorthmann/Robie 2001] Vorthmann, Scott, and Jonathan Robie. “Beyond schemas: Schema adjuncts and the outside world”. Markup Languages: Theory & Practice 2.3 (2001): 281-294.


Notes

[1] I prefer the term language because it allows for grammatical rules in a way that the terms namespace and vocabulary do not. The term vocabulary seems most plausible for constructs like RDF vocabularies, which change the set of available terms for a language without changing the syntax of the language.
[2] Some will shy away from this formulation of the goal but will nonetheless acknowledge a wish to translate vocabulary constructs into some other form, normally a form which seems to them more familiar, more useful, or more tractable. For simplicity of formulation, I will do these people the mild injustice of lumping them together with those who are in search of meaning. As will become clear below, I think the practical problem is the same in the two cases.
[3] In practice, the motive behind such designs is frequently to abstract away from syntactic variation in different colloquial XML vocabularies with the same application area. As the syntactic differences become inaccessible, so also the syntactic constraints built into many colloquial XML applications become impossible to express in syntactic terms. Perhaps as a result, many XML applications in this class have only weak syntax-based validation or none at all.
[4] This makes clear that SAF is intended primarily for support of the concrete or abstract data structure or RDF mapping problems, not the FOPC mapping problem, which some people (e.g. [Sperberg-McQueen/Huitfeldt/Renear 2001a]) believe to be the primary responsibility of the schema designer.
[5] The qualification “some of” is related to the technical point that if a markup construct licenses the inference P, it also licenses an infinite set of inferences not(not(P)), not(not(not(not(P)))), etc. The inferences can therefore never be listed exhaustively. I suspect they are recursively enumerable, and will in practice usually have a small finite basis, but my co-authors are not yet ready to agree with me on that point. It is possible that systems which do not provide the thesis not(not(p)) → p may escape having infinite sets of inferences, but this is not certain. And even finite sets of inferences may be large enough that it is more convenient to supply a basis for them than to list them all. [[Note, 2003-01-30: Any system which allows us to infer, from p, either that p and p or that p or q (or even that p or p), or both, necessarily has an infinite set of inferences from any non-empty set of axioms: from p we have p and p, p and p and p, p or p, p or p or p, etc.]]
[6] The variation in arity and correct replacement text among element and attribute types is, I believe, one of the big differences between colloquial and artificial XML; artificial XML is designed to provide as much consistency here as possible, in order to make purely mechanical translation of the XML into the target model not just possible, but trivially easy.
[7] The WG discussed allowing arbitrary XML to be interspersed with the rest of the declarations instead of rooted specifically in annotation, but decided against it.