These notes attempt to establish some common ground for discussions
of schema annotation and what W3C or others should do about it. They
are publicly readable.
The immediate occasion of these notes is a series of discussions about what,
if anything, the W3C XML Schema Working Group should be doing about
schema annotation. Several people have indicated to me that they
would like to see some work on the problem, but I am not clear on
what, exactly, it is that they are looking to see the XML Schema WG
do. This is my attempt to explain why I think it is unclear, and to
make it easier for them to tell me what they have in mind. This
version of this document has benefited from comments from Tim
Berners-Lee, to whom thanks.
3. Design space
W3C could meet the challenge (or solve the problem) in any of several
ways, some of them represented by existing designs, and some suggested
by experience or by recent research papers. The following sections
describe various approaches to the problem. They are not all
‘solutions’ in the same sense; some of them, though not all, are
incompatible with each other.
3.1. Schema appinfo
One approach would be to say that XML Schema should provide some
ability to include, in an XML Schema document, some declarations
describing the mapping from XML which conforms to the schema into some
target model or formalism. The Cambridge Communiqué
[Swick/Thompson 1999], in fact, says exactly
that (emphasis added):
An XML Schema schema document
will be able to hold declarations for validating instance documents.
It should also be able to hold declarations for mapping from
instance document XML infosets to application-oriented data
structures.
XML Schema provides the appinfo element on every
schema construct, in order to meet this requirement.
By some possible interpretations, therefore, the problem is already
solved.
Note that the appinfo element provides a place
where declarations relevant to the mapping problems may be placed.
It does not provide or define a vocabulary in which
to express, define, or specify mappings. It thus remains
agnostic on the question of mapping-language design.
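As a minimal sketch of what "a place but not a vocabulary" means in practice, the following Python fragment reads a schema document and collects whatever the schema author put inside appinfo. The mapping namespace `http://example.org/ns/mapping`, the `map-to` element, and its `property` attribute are invented for illustration; XML Schema defines none of them, and any other well-formed XML could stand in their place.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MAP = "http://example.org/ns/mapping"  # hypothetical mapping vocabulary

SCHEMA = f"""\
<xs:schema xmlns:xs="{XS}" xmlns:m="{MAP}">
  <xs:element name="author" type="xs:string">
    <xs:annotation>
      <xs:documentation>Name of the author of the work.</xs:documentation>
      <xs:appinfo>
        <m:map-to property="http://example.org/terms/creator"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
</xs:schema>
"""

def appinfo_by_element(schema_text):
    """Return {declared element name: [child elements of its appinfo]}."""
    root = ET.fromstring(schema_text)
    found = {}
    for decl in root.iter(f"{{{XS}}}element"):
        # appinfo is only a container; its children are opaque to XML Schema.
        found[decl.get("name")] = [
            child
            for info in decl.findall(f"{{{XS}}}annotation/{{{XS}}}appinfo")
            for child in info
        ]
    return found

mappings = appinfo_by_element(SCHEMA)
```

The code can locate the declarations, but interpreting `map-to` is left entirely to whoever defined that vocabulary, which is the sense in which appinfo is agnostic.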
3.2. Schema adjunct framework
The Schema Adjunct Framework (SAF), developed by TIBCO Extensibility and
others and publicly described in
[Vorthmann/Buck 2000] and
[Vorthmann/Robie 2001], is a generalization
of the XML Schema 1.0 appinfo element. The framework
defines a class of documents (schema adjuncts) within which arbitrary
markup constructs (identified by expressions using a subset of XPath 1.0)
may be associated with arbitrary processing code or mapping
specifications, with the proviso that the processing code or mapping
specification must be in XML.
There are a number of design choices which might be criticized in
SAF; languages which gave different answers to the questions
underlying those design choices would occupy the same area of the
solution space: all are languages for associating arbitrary code with
arbitrary markup constructs.
SAF differs from the appinfo element of XML Schema 1.0
both theoretically and practically. The appinfo element allows arbitrary
information to be associated with the declaration of a markup construct,
and thus indirectly with all the instances of that construct in document
instances. SAF context expressions, by contrast, use XPath, so the
information they carry is associated directly
with all the parts of document instances which match the XPath, and
only indirectly with the abstract construct associated with a schema
declaration. Because XPath expressions can pick out sets of information
items which differ from the sets associated with specific declarations,
it is theoretically possible to use SAF to express associations which
cannot be expressed directly using appinfo. It is not
obvious whether this should be regarded as a strength of SAF or as a
design flaw; many obvious illustrations of this fact appear to be
pathological.
On the practical level, SAF differs from appinfo in
being separate from the schema itself. The authors motivate this by
arguing that those responsible for specifying and maintaining mappings
of the sort supported by SAF are often not the same people responsible
for specifying and maintaining schemas or other formal definitions of
vocabularies,[4] and
that the maintenance cycle, access patterns, and so on are likely to
be very different.
Like the appinfo mechanism, SAF does not itself
define a mapping language but remains mapping-language agnostic.
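The difference between declaration-based and path-based association can be sketched in a few lines of Python. The `adjunct` and `rule` elements and their `context` and `note` attributes below are hypothetical and are not the actual SAF syntax; the context expressions are limited to the XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical adjunct document, kept separate from any schema.
# This is NOT the real SAF vocabulary, only an illustration of the idea.
ADJUNCT = """\
<adjunct>
  <rule context="./chapter/title" note="chapter-heading"/>
  <rule context=".//title" note="any-title"/>
</adjunct>
"""

INSTANCE = """\
<book>
  <title>Schema annotation</title>
  <chapter><title>Design space</title></chapter>
  <chapter><title>References</title></chapter>
</book>
"""

def apply_adjunct(adjunct_text, instance_text):
    """Pair each rule's note with the text of every node its context selects."""
    doc = ET.fromstring(instance_text)
    result = []
    for rule in ET.fromstring(adjunct_text).findall("rule"):
        for node in doc.findall(rule.get("context")):
            result.append((rule.get("note"), node.text))
    return result

associations = apply_adjunct(ADJUNCT, INSTANCE)
```

If a schema had a single declaration for `title`, information attached to that declaration would reach all three titles. The first rule here reaches only the two inside `chapter`, a set that does not correspond to that declaration.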
3.3. Skeleton sentences
In the course of research undertaken with Allen Renear (UIUC) and
Claus Huitfeldt (Bergen), I have come to believe that the best way to
say what markup means is to specify the set of inferences
licensed by the use of markup. Our paper
[Sperberg-McQueen/Huitfeldt/Renear 2001a] outlines
an approach which involves us in the FOPC mapping problem (our
approximation to FOPC is, in that paper, Prolog, but we are agnostic
on the target formalism, and some commentators believe as I do that
our approach could be applied to the other mapping problems as well).
The key difference between our approach and the others already
described is that we make some explicit claims about the internal
structure of any successful mapping language.
Specifically, we argue
that the inferences licensed by colloquial XML typically involve
implicit reference to information items in the document instance;
that the inferences may be described at an abstract level as a
set of sentence schemata in some language (e.g. Prolog or English) which have
blanks in them; these we call skeleton
sentences or sentence skeletons, and I will refer
to the number of blanks in a sentence as its arity;
that (some of) the inferences licensed by colloquial XML markup
may be listed by filling in the blanks in the skeleton sentences
associated with the markup construct in question;[5]
that in a correct description of most forms of colloquial XML,
not all constructs will have skeleton sentences of the same arity
(different element types, for example, may be associated with skeleton
sentences with one, two, or more blanks), and also that the
blanks are not always to be filled in in the same way;[6]
that the blanks in skeleton sentences are typically to be filled
in from information items which are in some specified location, often
a location relative to the location being interpreted (e.g. "the
parent element" or "the nearest ancestor with a value for
xml:lang") rather than an absolute location;
that some language is necessary in which one can specify how to
fill in the different blanks in the skeleton sentences; since
expressions in this language must refer to locations relative to some
reference point by some kind of virtual pointing, we call them
deictic expressions, borrowing a traditional grammatical
term for pointing expressions; the act of pointing we call
deixis.
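The claims above can be sketched in Python, under several assumptions. The element names, the Prolog-like predicates, and the helper functions are all invented for illustration. The deictic expressions are written as plain Python functions rather than XPath, so this is not the system described in our papers.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """\
<text xml:lang="en">
  <div id="d1">
    <p>Markup licenses inferences.</p>
    <foreign xml:lang="de">Zeitgeist</foreign>
  </div>
</text>
"""

def build_parent_map(root):
    # ElementTree keeps no parent pointers, so record them once up front.
    return {child: parent for parent in root.iter() for child in parent}

# Deictic expressions: each points from the node being interpreted
# to the place where one blank is to be filled in.
def own_text(node, parents):
    return (node.text or "").strip()

def own_id(node, parents):
    return node.get("id")

def parent_id(node, parents):
    return parents[node].get("id")

def nearest_lang(node, parents):
    # The node itself or its nearest ancestor carrying xml:lang.
    while node is not None:
        if XML_LANG in node.attrib:
            return node.get(XML_LANG)
        node = parents.get(node)
    return None

# Skeleton sentences of arity 1, 3, and 2, each with its own deixis.
SKELETONS = {
    "div": ("division({0}).", [own_id]),
    "p": ("paragraph('{0}', in({1}), lang({2})).",
          [own_text, parent_id, nearest_lang]),
    "foreign": ("word('{0}', lang({1})).", [own_text, nearest_lang]),
}

def sentences(doc_text):
    """Fill in the skeleton for every element that has one."""
    root = ET.fromstring(doc_text)
    parents = build_parent_map(root)
    out = []
    for node in root.iter():
        if node.tag in SKELETONS:
            skeleton, blanks = SKELETONS[node.tag]
            out.append(skeleton.format(*(fill(node, parents) for fill in blanks)))
    return out

inferences = sentences(DOC)
```

Note that the `p` and `foreign` skeletons both use `nearest_lang`, but the first finds its value on an ancestor and the second on the element itself; the location is relative, not absolute.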
I would have said our
claims were obvious and non-controversial, except that the sample
mapping specifications given as examples in the SAF paper (for
example) do not seem to possess what we think are the required
characteristics.
In a later talk [Sperberg-McQueen/Huitfeldt/Renear 2001b], we
describe a system for enumerating inferences licensed by XML markup;
this system uses XPath expressions for deixis, and XSLT to generate
actual sentences from skeletons.
The implementation we described is a toy intended primarily to
illustrate the ideas, but it could (I believe) be refined to make it
useful in practice. We do not have useful sample data at present
because my co-authors and I have found that the task of specifying
exactly what sentence skeletons should be associated with the element
types of the TEI DTD is harder than it looked. We thus have a system
which, given a set of sentence skeletons to associate with markup
constructs, can readily generate a set of inferences licensed by that
markup (or, in other words, translate document instances into FOPC,
RDF, or other desired target formalisms), but we have at the moment no
suitable set of sentence skeletons for any existing XML vocabulary.
3.4. XSLT transformations
To the extent that the mapping problems can be conceived of
as requiring the translation of XML data into other models or syntaxes,
an obvious candidate for specifying solutions to the problem is
XSLT. As noted in the previous section, I have a toy implementation
of the sentence-skeleton idea which uses XSLT for just this purpose.
The authors of the Schema Adjunct Framework argue that XSLT cannot
do what SAF does, but this is clearly false.
The XSLT extension framework allows arbitrary
user-defined functions to be used, and these may have arbitrary
meaning, including any meaning claimed to be exclusively the province
of SAF. Work on full-text indexing at Sun (reported at XTech 2000)
illustrates the point.
If we accept that all four mapping problems can be solved by
specifying suitable mappings, and that the problem for W3C may be
solved by providing a suitable language in which to specify the
mappings, then we are forced to conclude that by defining XSLT the W3C
has at some level already solved the schema-annotation problem.
3.5. Specialized languages
Specifying a mapping by means of an XSLT stylesheet, however, is not
regarded by everyone as an entirely satisfactory solution.
XSLT is Turing-complete. While it is certainly possible to
specify simple, perspicuous mappings using XSLT, it is also necessarily
possible to specify mappings which are difficult to understand
and thus unsuitable for consumption by humans or by machines which
wish not to perform the mapping but to reason about it.
It may seem desirable, therefore, to solve particular mapping
problems by specifying a specialized language for that mapping
problem. Such a language can be
small;
declarative;
weaker than any Turing-complete language;
structured to exploit the regularities in a particular
mapping problem (e.g. that a mapping to a relational database
will need to specify database names, table names, and column names,
and no other names, or that a mapping to RDF will invariably involve
specifying subjects, verbs, and objects, but need not support
specification of n-ary relations for n other
than 2);
designed to use suggestive application-relevant names.
It is possible to design such specialized languages so as to make
the specification of a particular mapping simple, convenient, and
clear to anyone interested in the application in question.
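As an illustration of how small such a language can be, here is a sketch of a mapping to relational rows in which the mapping itself is pure data: a table name, a row element, and column names. The `MAPPING` layout, the `@` convention for attributes, and the sample vocabulary are assumptions made for this example, not an existing language.

```python
import xml.etree.ElementTree as ET

# Hypothetical specialized mapping: names only, no control flow.
MAPPING = {
    "table": "books",
    "row": "book",                 # each such element yields one row
    "columns": {
        "isbn": "@isbn",           # '@name' reads an attribute
        "title": "title",          # a bare name reads a child element
        "year": "year",
    },
}

DATA = """\
<catalog>
  <book isbn="0-000-00000-0"><title>First</title><year>2001</year></book>
  <book isbn="0-000-00000-1"><title>Second</title></book>
</catalog>
"""

def rows(mapping, xml_text):
    """Interpret the mapping against a document; return (table, rows)."""
    root = ET.fromstring(xml_text)
    result = []
    for item in root.iter(mapping["row"]):
        row = {}
        for column, source in mapping["columns"].items():
            if source.startswith("@"):
                row[column] = item.get(source[1:])
            else:
                row[column] = item.findtext(source)  # None if absent
        result.append(row)
    return mapping["table"], result

table, records = rows(MAPPING, DATA)
```

Because the mapping cannot compute anything, a reader or a program can inspect it and know exactly which names the target database will need. The price is that the same mapping format would be of little use for RDF or FOPC.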
Such specialized languages, however, rely on tight coupling to
particular forms of the mapping problem. No single such specialized
language will be equally suitable for all mapping problems. I
infer that the many different people who have different mapping
problems in mind will want many different specialized languages.
The same inference led the designers of XML Schema 1.0 and SAF
to allow expressions in arbitrary languages within their host
appinfo and SAF constructs, and to omit any attempt to
define a particular language for specifying mappings.
That omission, in turn, makes appinfo and
SAF (and any language in the same part of the design space)
vacuous at some crucial level: such languages can provide a location
where software and humans can seek specifications of
a particular mapping, but they cannot themselves provide
or express the solutions.
3.6. Design questions / axes
It is easy to see that the design space has at least the
following dimensions:
define a concrete language for specifying mappings? or
escape to the meta-level and allow many such languages?
specify mappings using a general-purpose language, or
using a language specialized for a particular mapping problem or class
of mapping problems?
place the mapping specifications inside the schema, or
outside? In a single location, or in arbitrarily many locations?
If W3C should define a single concrete language for specifying
mappings, then it is easy to see what deliverable the XML Schema WG or
some other WG might take on: define a language in which schema authors or
others can specify a mapping from markup constructs into RDF.
If, however, we believe that a language specialized for RDF will
not be equally suitable for other mapping problems, and that a
language not specialized for RDF will not be convenient
for RDF (or for any other specific mapping problem), then we can
conclude that W3C should not define a particular concrete language
— at least, not a general-purpose one. On this analysis, if W3C
defines a language for specifying mappings from arbitrary XML into
RDF, it should be specialized for RDF, and should not be pushed as a
solution to the other mapping problems. If users believe that RDF
provides a useful way to describe arbitrary concrete or abstract data
structures and FOPC, they will use the RDF mapping accordingly. If
they don't believe it, telling them to use the RDF mapping will not
persuade them.
If we do not wish to define any language for specifying mappings,
then perhaps the right deliverable is one or more Notes showing how to
define and use specialized languages for the purpose. The semantics
of such a specialized language could usefully be described in terms of
an XSLT stylesheet which performs the appropriate translation. (That
is, I specify the meaning of my specialized language L for FOPC
sentence skeletons describing vocabulary V by showing how to transform
sentences in L into an XSLT stylesheet which in turn translates
sentences in V into my target FOPC formalism. Eric specifies the
meaning of his language M for RDF sentence skeletons describing V by
showing how to translate sentences in M into an XSLT stylesheet which,
given documents / sentences in V, produces RDF which captures their
meaning.)
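A sketch of the first half of that idea follows: a toy specialized language is given its meaning by compiling it into an XSLT 1.0 stylesheet. The one-line-per-skeleton syntax of `L_SPEC`, the predicate names, and the sample vocabulary are all hypothetical. The Python standard library contains no XSLT processor, so the sketch only generates the stylesheet; running it would require a separate XSLT 1.0 processor.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL)

# Hypothetical language L, one skeleton per line:
#   match-pattern  predicate  xpath-for-blank-1  xpath-for-blank-2 ...
L_SPEC = """\
book/title   title_of   .   ../@id
book/author  author_of  .   ../@id
"""

def xsl(parent, name, **attrs):
    """Append an element in the XSLT namespace to parent."""
    return ET.SubElement(parent, f"{{{XSL}}}{name}", attrs)

def compile_to_xslt(spec):
    """Translate sentences in L into an XSLT stylesheet (as a string)."""
    sheet = ET.Element(f"{{{XSL}}}stylesheet", {"version": "1.0"})
    xsl(sheet, "output", method="text")
    # Override the built-in rule so stray text nodes produce no output.
    xsl(sheet, "template", match="text()")
    for line in spec.splitlines():
        if not line.strip():
            continue
        match, predicate, *blanks = line.split()
        template = xsl(sheet, "template", match=match)
        xsl(template, "text").text = f"{predicate}('"
        for i, select in enumerate(blanks):
            if i:
                xsl(template, "text").text = "', '"
            # Each blank is filled by evaluating an XPath expression
            # relative to the matched node.
            xsl(template, "value-of", select=select)
        xsl(template, "text").text = "').\n"
    return ET.tostring(sheet, encoding="unicode")

stylesheet = compile_to_xslt(L_SPEC)
```

The specialized language stays small and inspectable, while the generated stylesheet pins down exactly what each of its sentences means.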
4. Notes on the Cambridge Communiqué
Since the Cambridge Communiqué [Swick/Thompson 1999]
is frequently cited
as a reference point for discussions of schema annotation, it may be
worth while to comment briefly on some of the points made in its
section 3, Observations and Recommendations.
2. An XML Schema schema document will be able to hold
declarations for validating instance documents. It should also be able
to hold declarations for mapping from instance document XML infosets
to application-oriented data structures.
As noted above, the XML Schema 1.0 appinfo element
can occur as part of the annotation of most schema constructs. This
element, like the documentation element which is its
sibling, can contain any well-formed XML. It is thus able to hold
declarations for mapping from instance document XML infosets to
application-oriented data structures, in any notation which may be
desired by the schema author.
3. For evolvability and interoperability, the XML
Schema specification should provide an extension mechanism allowing
for the augmentation of XML Schema schemas with additional
material. At a minimum, XML Schema should permit elements from other
namespaces to be included in schema documents. This extension
mechanism should also permit individual extensions to be marked
'mandatory', meaning that a document instance cannot be deemed 'schema
valid' if the processing required by a marked extension cannot be
performed.
Because the documentation and appinfo branches of the
annotation element can take any
well-formed XML as content, they allow for the augmentation of schema
documents with additional material.[7]
In addition, almost all schema constructs allow attributes
from any namespace other than that of XML Schema; for some applications,
some designers find that using attributes to augment the schema document
is less obtrusive than using elements.
XML Schema 1.0 does not provide a mechanism for marking extensions
as mandatory; this got dropped somewhere along the way, for reasons I
no longer recall.
4. The extension mechanism should be appropriate for
use to incorporate declarations ("mapping declarations") to aid the
construction of application-oriented data structures (e.g. ones
implementing the RDF model) as part of the schema-validation and XML
infoset construction process. This facility should not be exclusive to
RDF, but should also be useable to guide the construction of data
structures conforming to other data models, e.g. UML.
I believe the appinfo element meets the goal
described here.
5. Such mapping declarations should ideally also be
useable by other schema processors to map in the other direction,
i.e. from application-oriented data structures to XML infosets.
I agree that this is a particularly convenient case, but I do not
believe it's soluble in the general case. It is not difficult when
there is something like an isomorphism between the XML representation
of some information and its representation in a data structure, but
I do not believe such isomorphisms are common.
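A small sketch may make the contrast concrete. The two helper functions and the sample documents are invented for illustration: a flat record corresponds one-to-one to a dictionary and survives the round trip, while mixed content has no counterpart in that data structure and is lost.

```python
import xml.etree.ElementTree as ET

def to_record(xml_text):
    """Map an element to a dict of child name -> child text."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

def to_xml(name, record):
    """Map a dict back to an element with one child per key."""
    root = ET.Element(name)
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")

FLAT = "<book><title>Notes</title><year>2002</year></book>"
MIXED = "<p>See <title>Notes</title> for details.</p>"

flat_round_trip = to_xml("book", to_record(FLAT))    # same as FLAT
mixed_round_trip = to_xml("p", to_record(MIXED))     # surrounding text is gone
```

The mapping for the flat case is reversible only because nothing in the XML falls outside what the dictionary can hold.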
7. XML Schema does not need to be the sole provider of
support for layering application data structures on XML. XSLT, with a
proposed extension mechanism, could be used for specifying mappings
from XML document instances to application data structures - including
RDF graphs. The reversibility of mappings specified with XSLT or
similar transformation languages is an issue.
Yes.
A. References
This list is not quite complete.
[Swick/Thompson 1999]
Swick, Ralph R., and Henry S. Thompson, ed.
The Cambridge Communiqué.
W3C NOTE 7 October 1999.
http://www.w3.org/TR/schema-arch
[Sperberg-McQueen/Huitfeldt/Renear 2001a]
Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear.
“Meaning and interpretation of markup.”
Markup Languages: Theory & Practice
2.3 (2001): 215-234.
http://www.w3.org/People/cmsmcq/2000/mim.html
[Sperberg-McQueen/Huitfeldt/Renear 2001b]
Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. “Practical extraction of meaning from markup.” Paper
given at ACH/ALLC 2001, New York, June 2001. (Slides at
http://www.w3.org/People/cmsmcq/2001/achallc2001/achallc2001.slides.html)
[Thompson 2001]
Thompson, Henry S.
“Normal Form Conventions for XML Representations of Structured Data.”
Talk at XML 2001, Orlando, December 2001.
http://www.ltg.ed.ac.uk/~ht/normalForms.html
[Vorthmann/Buck 2000]
Vorthmann, Scott, and Lee Buck.
“Schema adjunct framework.”
Draft Specification 24 February 2000.
[Chapel Hill]: Extensibility.
http://www.extensibility.com/saf/spec/
[Vorthmann/Robie 2001]
Vorthmann, Scott, and Jonathan Robie.
“Beyond schemas: Schema adjuncts and the outside world.”
Markup Languages: Theory & Practice
2.3 (2001): 281-294.