Notes on logic grammars and XML Schema

A working paper prepared for the W3C XML Schema Working Group

C. M. Sperberg-McQueen

1 December 2003

N.B. not complete: work in progress

1. Introduction
- 1.1. Background
- 1.2. Direct interpretation (animation) of specifications
- 1.3. Using logic grammars in schema processing
2. Simple DTD-style document checking
- 2.1. Definite-clause grammars
  - 2.1.1. Grammar E1: A trivial grammar for a fragment of English
  - 2.1.2. Calling the parser
  - 2.1.3. Grammar E2: A simple attribute grammar
- 2.2. Representing XML in Prolog
- 2.3. A simple schema
- 2.4. Layer 1: Translation into definite-clause grammar
  - 2.4.1. Top-level rules for element types
  - 2.4.2. Rules for attributes of complex types
  - 2.4.3. Rules for content of complex types
  - 2.4.4. Rules for checking values of simple types
- 2.5. Layer 2a: checking attribute types
- 2.6. Layer 2b: checking for duplicate, defaulted, and required attributes
- 2.7. Evaluation
3. Providing PSVI properties
- 3.1. Introduction to DCTG notation
- 3.2. Another example: the English grammar fragment E1
- 3.3. English grammar with attributes (E2) in DCTG notation
- 3.4. Layer 3: Translation into DCTG notation
  - 3.4.1. Top-level rules for element types
  - 3.4.2. Rules for attributes
  - 3.4.3. Rules for content of complex types
  - 3.4.4. Rules for checking values of simple types
  - 3.4.5. Overview and Summary
  - 3.4.6. Evaluation
4. Validity, validation-attempted, and error handling (layer 4)
- 4.1. Goals
  - 4.1.1. Additional PSVI properties
  - 4.1.2. Handling invalid and partially valid input
  - 4.1.3. Reifying the schema
  - 4.1.4. Naming and argument conventions
  - 4.1.5. Author's note to reviewers
  - 4.1.6. Summary of goals
- 4.2. Prolog representation of schema components
  - 4.2.1. Element declarations
  - 4.2.2. Complex type definitions and attribute declarations
  - 4.2.3. Simple type definitions
  - 4.2.4. Other
- 4.3. Validation of simple types
  - 4.3.1. Conventions for validating elements and lexical forms
  - 4.3.2. Error codes for simple types
  - 4.3.3. Validating a lexical form against a simple type
  - 4.3.4. Checking the lexical form of a simple type
  - 4.3.5. Checking other facets on a simple type
  - 4.3.6. Checking length facets
  - 4.3.7. Checking enumerations
  - 4.3.8. Checking minima and maxima
  - 4.3.9. Checking total digits and fraction digits
  - 4.3.10. Error diagnosis for xsd:string
  - 4.3.11. Error diagnosis for integer and decimal types
  - 4.3.12. Error diagnosis for other simple types
- 4.4. Lax validation of sequences
- 4.5. Validation of complex types
- 4.6. Validation of elements
  - 4.6.1. Error codes for elements with simple content
- 4.7. Validation of attributes
- 4.8. Overview of layer-4 grammar
5. Notes on other features
- 5.1. Handling xsi:type
- 5.2. Mixed content
- 5.3. Supporting wildcards with skip and lax processing
- 5.4. Supporting substitution groups
- 5.5. Exploiting common code patterns (going to meta-level)
6. Further work

A. Works cited and further reading
- A.1. Works cited
- A.2. Further reading
B. A note on the test cases
C. SWI Prolog handling of characters
D. Error codes for elements and attributes
- D.1. Schema-validity assessment
- D.2. Elements: local validity
- D.3. Validity with respect to a complex type
- D.4. Attributes
- D.5. Simple types
  - D.5.1. Lexical forms
  - D.5.2. Values
- D.6. Miscellaneous, common constructs
E. Further reading
F. Graveyard (todo list and dead prose)
- F.1. To do
- F.2. Toward a useful layering
- F.3. Notes on the complex-type structure

This document describes some possible applications of logic grammars to schema processing as described in the XML Schema specification. The term logic grammar is used to denote grammars written in logic-programming systems; this paper works with definite-clause grammars (DCGs), which are a built-in part of most Prolog systems, and definite-clause translation grammars (DCTGs), which employ a similar formalism which more closely resembles attribute grammars as described by [Knuth 1968] and later writers and which make it a bit easier to handle complex specifications. Both DCGs and DCTGs can be regarded as syntactic sugar for straight Prolog; before execution, both notations are translated into Prolog clauses in the usual notation.

In its current state, this paper is unfinished: it is more or less complete through the description of the layer-3 grammar, but layer 4 and the topics which follow it are still fragmentary. Some highlighted notes about what needs to be done to complete the paper are included. The author hopes the work is sketched out in sufficient detail to allow the reviewers to form a solid opinion on it.

1. Introduction

1.1. Background

It is often observed that the needs of precision and completeness in technical specifications are sometimes at odds with the goal of making the specifications easy to read and understand. It is sometimes suggested (e.g. by [Wadler 2000]) that specifications would be less ambiguous, more compact, and easier to understand if they were written not in natural-language prose but in a formal notation, for example in Prolog or in Z or in the Gentzen calculus or in some formalism developed for the purpose. Some propose not to eliminate the prose but to augment it with a formal expression of the same rules. This is a special case of a practice sometimes seriously recommended for developing national or international standards: draft the specification in two languages at once (e.g. English and French) as a way of exposing ambiguity, vagueness, and misunderstandings early and forcing the working group to achieve greater clarity and precision in their work.

If the second language used is a formal language, rather than a second natural language, this practice can have the same advantages, as well as the the added benefit that the resulting specification can be read and understood not only by human readers but also by machines. If the notation used for the specification can be directly interpreted by a machine, then the specification can be tested much more readily for completeness, consistency, and accurate reflection of the developers' design decisions; when a specification can be directly interpreted by machine, the specification itself becomes its own reference implementation.

The work described below may be regarded, in part, as a test of these propositions. I show how the definitions and declarations of a schema expressed in XML Schema can be represented using a formal notation developed for writing attribute grammars in Prolog; in follow-on work I expect to provide Prolog equivalents for the validation rules, info-set contributions, and constraints on schemas given in rather formal English prose by [W3C 2001b] and [W3C 2001c]. If the arguments for formalizing technical specifications are correct, the Prolog representations ought (if they are properly done) to be less ambiguous, more compact, and easier to understand that the formal English of the specification. The lack of ambiguity is guaranteed, as long as the Prolog is syntactically correct. The compactness can easily be measured; for the schema used here as an example, the Prolog is in fact not much more compact than the XML representation of the schema. Whether the Prolog representation is clearer or not is something which individual readers must decide for themselves.[1]

1.2. Direct interpretation (animation) of specifications

The idea of direct interpretation of formal specifications is not new; many textbooks on formal methods or on specific formal methods (Z, the Vienna Development Method VDM, etc.) discuss it at least as an idea. [Stepney 1993] provides a simple Web-accessible example of the idea: she illustrates the creation of a high-integrity compiler by

specifying the semantics (both dynamic and static) of a procedural programming language formally, in Z
specifying the semantics of a target machine language (or: assembly language) formally, also in Z
specifying the translation from source language to machine language formally
specifying the syntax of the programming language in a definite-clause translation grammar
translating the static semantics, dynamic semantics, and the source-to-machine language translation as grammatical attributes[2] in the DCTG. A Prolog system asked to supply the dynamic semantics of a program fragment serves as an interpreter for the source language; a Prolog system asked to supply the translation semantics acts as a compiler for the source language.

Animation of a specification is probably easiest if the spec is in a formal language. As Stepney's book illustrates, it is still possible otherwise, but it involves both the animation of a formal specification and an interpretation of the spec (in the form of a translation into a more formal language); the latter requires careful checking.

This paper attempts to make plausible the claim that a similar approach can be used with the XML Schema specification, in order to provide a runnable XML Schema processor with a very close tie to the wording of the XML Schema specification. Separate papers will report on an attempt to make good on the claim by building an XML Schema processor using this approach; this paper will focus on the rationale and basic ideas, omitting many details.

1.3. Using logic grammars in schema processing

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question “is this tree a member of the set?” For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.

In the remainder of this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.

The first question that arises is:

How do we use logic grammars to model XML Schema processing?

To this, there are three answers:

Model the schema as a logic grammar; use it to parse document instances and generate the post-schema-validation infoset (PSVI).
Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).

Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. These problems are attacked in a series of small examples in which a DCTG is developed which represents a simple schema:

In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
How do we check XML attributes?
How do we model type assignment, black-box validation, white-box validation?
How do we model the post-schema-validation information set?

Some problems arise if we use a logic grammar to model the rules for constructing schema components, and parse schema documents with that logic grammar as a way of modeling the process of constructing schema components.

How do we use parse-tree attributes to model information-item properties?
How do we use the output of this process to drive schema-validity assessement of document instances? Can we formulate the logic grammars used for schema-validity assessment as properties / attributes of the schema components?

In the next part of this paper (section 2) I introduce the definite-clause grammar notation, with some simple examples. A simplified representation of the purchase-order schema po1.xsd from the XML Schema tutorial [W3C 2001a] shows how a DCG can be used to support DTD-style validation in a straightforward way In section 3, I describe definite-clause translation grammars (DCTGs), which provide a slightly more convenient notation for complex attribute grammars than do DCGs. The patterns illustrated by the purchase-order schema are elaborated to handle various non-DTD constructs: post-schema-validation properties, error handling, substitution groups, xsi:type, and wildcards. This simple example shows how logic grammars can be applied as representations of schemas, but does not illustrate all aspects of schema validation.

In follow-on work reported in a separate paper, I plan to work systematically through the validation rules of [W3C 2001b] and [W3C 2001c], providing a more systematic representation of schema-validity assessment in logic-grammar terms. In a third paper, I will develop a grammar which reads XML Schema documents and enforces constraints on their XML representation and on the components derived from them, together with routines for compiling schema components into the logic-grammar representation sketched here.

2. Simple DTD-style document checking

We start with a simple case: considering only a subset of the kind of schema information captured in a document-type definition (DTD) in XML 1.0 DTD syntax, we write a logic grammar to represent a schema or DTD and use it to validate a document. The same technique, of course, can be used to capture and use part of the information in a schema expressed in XML Schema 1.0.

2.1. Definite-clause grammars

Before starting on the schema processor, however, it may be helpful to review the basics of definite-clause grammars and their notation. I'll present a simple grammar in two or three versions, to illustrate the use of DCG notation.

Readers already familiar with DCGs can skip this section without loss and continue with 2.2. Readers who don't find this introduction detailed enough may consult one of the Prolog books listed among the references.

2.1.1. Grammar E1: A trivial grammar for a fragment of English

Here is a simple DCG for a trivially small fragment of English, which I'll call E1. It is taken from example 8.1 in [König/Seiffert 1989], an introduction to Prolog for linguists:

%%% Grammar E1, a trivial fragment of English
s --> np, vp.  % A sentence (s) is a noun phrase (np) plus a verb phrase
np --> det, n. % A noun phrase is a determiner plus a noun
np --> n.      % ... or just a noun.
vp --> v, np.  % A verb phrase is a verb and its direct object, an np
vp --> v.      % ... or just the verb (for intransitives).
n --> [mary].  % 'mary', 'john', 'woman, 'man', 'apple' are nouns.
n --> [john].
n --> [woman].
n --> [man].
n --> [apple].
det --> [the]. % 'the' is a determiner
v --> [loves]. % 'loves', 'eats', and 'sings' are verbs.
v --> [eats].
v --> [sings].

This grammar generates sentences like “John loves Mary” and “The man sings” and “The woman eats the apple”, as well as the somewhat less plausible “The apple sings the Mary.”

When a Prolog system consults a file containing this grammar, the rules of the grammar are automatically translated into canonical Prolog clauses using difference lists:

?- listing([s,np,vp,n,det,v]).

s(A, B) :-
        np(A, C),
        vp(C, B).

np(A, B) :-
        det(A, C),
        n(C, B).
np(A, B) :-
        n(A, B).

vp(A, B) :-
        v(A, C),
        np(C, B).
vp(A, B) :-
        v(A, B).

n([mary|A], A).
n([john|A], A).
n([woman|A], A).
n([man|A], A).
n([apple|A], A).

det([the|A], A).

v([loves|A], A).
v([eats|A], A).
v([sings|A], A).

Yes
?-

Difference lists are pairs of lists such that the second list is a suffix of the first list (i.e. the same from some point or other to the end), for example, the pair [a, b, c, d] and [c, d]. Prolog-based parsers commonly use difference lists to represent the input; they allow the Prolog clauses which recognize structures in the input list to consume as much of the input list (the first list in the pair) as they wish, and pass the rest of the input to the next clause. If a DCG predicate succeeds with its final two arguments unified to the difference-list pair [a, b, c, d] and [c, d], what it means is that the grammar rule recognized the list [a, b].

In the clause for s, we can see how one part of the parser picks up where another leaves off. The input list is A, and the np predicate recognizes a noun phrase at the beginning of A. The portion of the list which follows the noun phrase is assigned to C. The predicate vp then tries to recognize a verb phrase at the beginning of list C. What is left after the verb phrase is assigned, in turn, to list B, which is both ‘the part of the input list after the verb phrase’ (in the predicate vp) and (because the verb phrase can be the last item in a sentence) also ‘the part of the input after the s’ (in the predicate s).

For example, given the sentence [the, woman, eats, the, apple], the first rule for np will be called with its first argument bound to the entire input string; after it recognizes the initial noun phrase, the second argument will be bound to the part of the sentence which follows the noun phrase, namely [eats,the,apple]. The predicate vp will then recognize the verb phrase [eats,the,apple], leaving the empty list [] as the part of the input after the sentence.[3]

2.1.2. Calling the parser

The following fragment of a Prolog session log shows how the grammar can be used to test the grammaticality of sentences; it also shows that the grammar could use some refinement. To test whether a given sequence of tokens is an s, we call the Prolog predicate s with two difference-list arguments. If we specify the empty list as the second argument, we in effect require that the entire input sequence be a legal sentence.

?-  s([john,loves,mary],[]).

Yes
?- s([the,woman,eats,the,apple],[]).

Yes
?- s([the,man,sings],[]).

Yes
?- s([apple,eats,woman,the],[]).

No
?- s([apple,sings,the,mary],[]).

Yes
?-

We can use a variable as a second argument, instead of the empty list; in this case, Prolog will tell us each prefix of the input list which can be parsed as a sentence according to our grammar:

?- s([the,woman,eats,the,apple],Remainder).

Remainder = [] ;

Remainder = [the, apple] ;

No
?-

In one parse, the entire input is parsed as a sentence, and the remainder is the empty list. An alternative parse, however, is just to take [the,woman,eats] as a sentence, leaving [the,apple] as the remainder.

The XML Schema processor we sketch below will normally call the parser with an empty list as the final parameter, to require that the entire input sequence be consumed. If we wish to see how much of the content of an element belongs to the base type, and how much to an extension of the base type, we can call the parser for the base type, with a variable as the final parameter, and the base type will consume as much of the input as it can. The rest belongs to the extension.

2.1.3. Grammar E2: A simple attribute grammar

We can turn this simple context-free grammar E1 into an attribute grammar by adding parameters to the productions; I will call the attribute grammar E2. In E2, one parameter is used to generate a structural description of the sentence. In this notation, the sentence “John loves Mary” is assigned the structure s(np(pn(john)), vp(v(loves), np(pn(mary)))) — or, with pretty-printing:

s(
  np(
     pn(john)
    ), 
  vp(
     v(loves), 
     np(
        pn(mary)
       )
    )
 )

On some productions, a second parameter is used to enforce agreement in number between subject and verb and between determiner and noun.

Thus extended with parameters, and a few more words in the lexicon, the grammar looks like this:

%%% E2: a trivial attribute grammar for a fragment of
%%% English, with a synthesized attribute for structural description
%%% and enforcement of number agreement.

s(s(S,P)) --> np(S,Num), vp(P,Num).
np(np(D,N),Num) --> det(D,Num), n(N,Num).
np(np(N),pl) --> n(N,pl).
np(np(N),sg) --> pn(N,sg).
vp(vp(V,O),Num) --> v(V,Num), np(O,_).
vp(vp(V),Num) --> v(V,Num).
n(n(L),Num) --> [L], { lex(L,n,Num) }.
pn(pn(L),Num) --> [L], { lex(L,pn,Num) }.
v(v(L),Num) --> [L], { lex(L,v,Num) }.
det(det(L),Num) --> [L], { lex(L,det,Num) }.

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).

In this version of the grammar, it is worth calling the reader's attention to the Prolog clauses enclosed in braces on the right-hand side of some rules, e.g. in

n(n(L),Num) --> [L], { lex(L,n,Num)}.

These Prolog clauses express constraints on the grammatical construction which are not themselves grammatical constituents. They are sometimes called ‘guards’ (e.g. by [Brown/Blair 1990]); they correspond directly to the ‘semantic conditions’ which are a constituent part of attribute grammars as described by [Alblas 1991]. With the guard, this rule can be paraphrased as saying “A noun (an n), has the structure n(L) and the grammatical number Num, if (a) L (for ‘lexical form’) is a single item in the input list [L], and (b) the relation lex contains the triple (L,n,Num).”

Such guards will be present in many of the rules we use to represent XML Schema in logic grammars.

By calling the grammar rules, we can instantiate a variable with the structural description of the sentence.

?- s(Struc,[john,loves,mary],[]).

Struc = s(np(pn(john)), vp(v(loves), np(pn(mary)))) 

Yes
?- s(S,[the,woman,eats,the,apple],[]).

S = s(np(det(the), n(woman)), vp(v(eats), np(det(the), n(apple)))) 

Yes
?- s(S,[the,man,sings],[]).

S = s(np(det(the), n(man)), vp(v(sings))) 

Yes
?- s(S,[apple,sings,the,mary],[]).

No
?- s(S,[the,apple,sings,mary],[]).

S = s(np(det(the), n(apple)), vp(v(sings), np(pn(mary)))) 

Yes
?-

Note that in a ‘conventional’ use of DCGs, the user or application will call the grammar just once, at the top level, and the parser itself will generate recursive calls for the nested grammatical constructs (e.g. np and vp in our example). Note, however, that we can if we wish make calls to any level of the grammar from arbitrary Prolog code. Note also that we can include arbitrary Prolog code in the right-hand side of rules by wrapping it in braces. By doing so, we can make arbitrarily complex calculations part of the grammar rules; this is one reason that DCGs are not limited to context-free languages.

2.2. Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].

The parser returns the XML document to the user as a list of items,[4] each item one of:

an atom (for character data)
a structure of the form element(GI,AttributeList,ContentList), where
- GI is an atom representing the element's generic identifier
- AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
- ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
a structure of the form entity(Integer) representing an otherwise unrepresentable character
a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
a structure of the form pi(Text) representing a processing instruction

For example, the sample purchase order po1.xml used as an example in [W3C 2001a] looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):

[element('http://www.example.com/PO1':purchaseOrder, 
  [xmlns:apo='http://www.example.com/PO1', 
   orderDate='1999-10-20'], 
  [element(shipTo, 
    [country='US'], 
    [element(name, [], ['Alice Smith']), 
     element(street, [], ['123 Maple Street']), 
     element(city, [], ['Mill Valley']), 
     element(state, [], ['CA']), 
     element(zip, [], ['90952'])
    ]), 
   element(billTo, 
    [country='US'], 
    [element(name, [], ['Robert Smith']), 
     element(street, [], ['8 Oak Avenue']), 
     element(city, [], ['Old Town']), 
     element(state, [], ['PA']), 
     element(zip, [], ['95819'])
    ]), 
   element('http://www.example.com/PO1':comment, 
    [], 
    ['Hurry, my lawn is going wild!']
   ), 
   element(items, [], 
    [element(item, 
      [partNum='872-AA'], 
      [element(productName, [], ['Lawnmower']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['148.95']), 
       element('http://www.example.com/PO1':comment, 
        [], 
        ['Confirm this is electric']
       )
      ]), 
     element(item, 
      [partNum='926-AA'], 
      [element(productName, [], ['Baby Monitor']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['39.98']), 
       element(shipDate, [], ['1999-05-21'])
      ])
    ])
  ])
]

When elements and attributes are namespace-qualified, as the purchaseOrder and comment elements are in this example, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name (as shown here) or the namespace prefix used.

Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

2.3. A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars. For simplicity of exposition, the translation will be divided into several layers:

translation into DCG form
checking for duplicate and missing attributes, validating attribute types
translation into DCTG form and supplying PSVI properties
supplying properties related to validity, handling invalid input

Some features of XML Schema (handling xsi:type, supporting wildcards, supporting substitution groups) won't be shown here; the purchase-order schema doesn't have any wildcards or substitution groups, and it doesn't have much scope for the use of xsi:type in document instances. But it will be possible, after developing the example, to explain how these features can be handled in logic grammars.

2.4. Layer 1: Translation into definite-clause grammar

Each element type in the language is translated into three sets of DCG productions:

a top-level rule to match the tag and get the contents
a set of rules for checking the element's attributes
a set of rules defining the element's content; this is the part that looks most like a conventional DCG

2.4.1. Top-level rules for element types

For each element type in the schema, we create one rule in the grammar, which will serve to match the start-tag, get the attributes and contents of the element, and call routines to check the attributes and content against the complex type. Here there is exactly one such rule for each element type, but in later examples we will have several, in order to handle substitution groups. A simple version of these top-level rules would look like this:[5]

< 1 Rules for elements with complex types [File podcg3.pl] > ≡

/* Definite-clause grammar for the XML Schema represented in po.xsd.
 * 
 * This DCG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
purchaseOrder --> [element('http://www.example.com/PO1':purchaseOrder,
    Attributes,Content)],
  {
    purchaseOrderType_atts(Attributes,[]),
    purchaseOrderType_cont(Content,[])
  }.
shipTo --> [element(shipTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
billTo --> [element(billTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
items --> [element(items,Attributes,Content)],
  {
    t_items_atts(Attributes,[]),
    t_items_cont(Content,[])
  }.
item --> [element(item,Attributes,Content)],
  {
    t_item_atts(Attributes,[]),
    t_item_cont(Content,[])
  }.

Because we get the XML document already in tree form, the top-level rules don't have much work to do. In particular, they do not have to identify the substructure of the element or find its end; that has already been done. Instead, we get the attributes and content already separated out. We launch a separate call on each to parse them according to the appropriate rule for their datatype.[6]

The comment element and others associated with simple types pose an interesting question. We could define them following the pattern above, like this:

comment --> [element('http://www.example.com/PO1':comment,
    Attributes,Content)],
  {
    xsd_string_atts(Attributes,[]),
    xsd_string_cont(Content,[])
  }.

But we know that the xsd:string type does not have any attributes declared, so Attributes must be the empty list. We can therefore define these elements this way instead:

< 2 Rules for elements with simple types [File podcg3.pl] > ≡

comment --> [element('http://www.example.com/PO1':comment,
    [],Content)],
  {
    xsd_string_cont(Content,[])
  }.

poname --> [element(name,[],Content)], 
           { xsd_string_cont(Content,[]) }.
street --> [element(street,[],Content)], 
           { xsd_string_cont(Content,[]) }.
city   --> [element(city,[],Content)], 
           { xsd_string_cont(Content,[]) }.
state  --> [element(state,[],Content)], 
           { xsd_string_cont(Content,[]) }.
zip    --> [element(zip,[],Content)], 
           { xsd_decimal_cont(Content,[]) }.

productName --> [element(productName,[],Content)], 
                { xsd_string_cont(Content,[]) }.
quantity    --> [element(quantity,[],Content)], 
                { xsd_integer_cont(Content,[]) }.
'USPrice'   --> [element('USPrice',[],Content)], 
                { xsd_decimal_cont(Content,[]) }.
shipDate    --> [element(shipDate,[],Content)], 
                { xsd_date_cont(Content,[]) }.

2.4.2. Rules for attributes of complex types

For each complex type in the schema, we generate two rules, one we will use to check the attributes, and one to check the contents. For each simple type, we generate one rule for checking the value.

The attributes for a given element can be enumerated; in this first example, we check to make sure each attribute was declared.

< 3 Rules for attributes defined for complex types (v1) > ≡

purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date].

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> no_such_attribute.
/* note that t_items_att is undefined, since there
 * are no attributes for the Items type 
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU].

This code is not used elsewhere.

For now, we do not worry about whether required attributes have been specified, or about checking the type of the attribute, or about supplying default values; we will do that later.

2.4.3. Rules for content of complex types

The most conventional-looking part of our grammar is the representation of the content models. A purchase order (being of type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. We handle optional elements by writing a pair of recursive productions:

< 4 [File podcg3.pl] > ≡

purchaseOrderType_cont --> shipTo, billTo, opt_comment, items.
opt_comment --> [].
opt_comment --> comment, opt_comment.

usAddress_cont --> poname, street, city, state, zip.

t_items_cont --> star_item.
star_item    --> [].
star_item    --> item, star_item.

t_item_cont --> productName, quantity, 'USPrice', 
                opt_comment, opt_shipDate.

opt_shipDate --> [].
opt_shipDate --> shipDate.

It's worth noting that since the content models of XML all define regular languages, the right hand sides of the DCG rules for handling the content of elements are all rather simple. The only complexity is introduced by nested choices, occurrence indicators, and the like. In a way, this is the reflection of a fundamental fact about validating XML: the XML structure is given rather than needing to be discovered, and the rules for validation are correspondingly simpler.

A practical consequence of the fact that XML works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_block, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it represent the XML structure properly.[7]

There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we step around this problem by instructing the upstream XML parser to strip whitespace from the beginning and ending of CDATA nodes; this does the job for this particular schema.[8] Later, we'll need to find a systematic way to hop over inter-element characters, and to restrict them to white space where appropriate.

2.4.4. Rules for checking values of simple types

We need some rules for checking simple types. The rules for built-in types of XML Schema, at least, can be written once for all and built in to a library. N.B. In order to get something working quickly, we cut some corners here.

In each case we write two rules, one to check the content of an element by checking to see whether the putative contents are a legal value, and the second to check the value itself. Within these latter rules, it is necessary to remember that the thing we are checking is a Prolog atom created from the lexical form for the putative value. In the case of strings, anything that's a legal lexical form (represented, following the conventions of the SWI-Prolog parser, as an atom) counts as a legal string.

< 5 Checking simple types [File podcg3.pl] > ≡

/* In our representation of XML, character data is represented
 * as atoms.  See note on handling of non-ASCII characters */
xsd_string_cont([H|T],T) :- xsd_string_value(H).
xsd_string_value(LF) :- atom(LF).

We check decimal numbers by converting the lexical form from an atom into a sequence of characters and then seeing whether Prolog can create a number out of that sequence of characters. This is actually a bit too broad, at least with the Prolog I am using, since it accepts numbers in scientific notation. But it will do for an illustration.

< 6 Checking numbers [File podcg3.pl] > ≡

/* We'll accept anything as a decimal number if we can convert
 * the atom to a list of character codes which can in turn be
 * converted to a number. 
 */
xsd_decimal_cont([H|T],T) :- xsd_decimal_value(H).
xsd_decimal_value(LF) :- atom_codes(LF,L), number_codes(N,L).
xsd_integer_cont([H|T],T) :- xsd_integer_value(H).
xsd_integer_value(LF) :- 
  atom_codes(LF,L), 
  number_codes(N,L),
  integer(N).

For dates, we really punt for now.

< 7 Checking dates [File podcg3.pl] > ≡

xsd_date_cont([H|T],T) :- xsd_date_value(H). 
xsd_date_value(LF) :- atom(LF). 
/* well, we really need to check this ... */

With this set of rules, we have a primitive but working program for checking whether a given XML document does or does not conform to the purchase order schema. There are a number of constraints it does not check, however; in the following sections of this document we will refine the code to make it a more persuasive representation of the schema.

2.5. Layer 2a: checking attribute types

One obvious improvement would be to check to make sure that the values of attributes are legitimate representations of the simple types with which the attributes are declared. Here is a revision of the relevant rules:

< 8 Rules for attributes defined for complex types (v2) [File podcg3.pl] > ≡

purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date],
  { xsd_date_value(Date) }.

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> [this_will_never_match].
/* Since it has no equal sign or value, the rule for
 * t_items_att will never match anything.
 * Alternatively we could leave t_items_att undefined, since there
 * are no attributes for the Items type.
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU],
  { po_sku_value(SKU) }.

Hand-coding a rule to check SKU values is not too hard, and illustrates how we can check the lexical forms of simple types by writing mini-grammars for them. The SKU type is declared this way:

 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>

The pattern can be translated readily into the following grammar operating on the character sequence of the lexical form:

< 9 Checking SKU values [File podcg3.pl] > ≡

po_sku_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(Charseq,[]).

sku_value --> sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part --> digit, digit, digit.
sku_alpha_part --> cap_a_z, cap_a_z.

The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.

< 10 Character classification rules [File podcg3.pl] > ≡

digit --> [Char], { char_type(Char,digit) }.
hyphen --> ['-'].
cap_a_z --> [Char], { char_type(Char,upper) }.

2.6. Layer 2b: checking for duplicate, defaulted, and required attributes

Another obvious improvement would be to check to ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, so this is probably not essential) and that required attributes do have values specified. We have already seen how to check a fixed value: we simply specify the value as part of the grammatical rule for the attribute.

In our sample schema, the only required attribute on any type is the partNum attribute on the item element; it is also the only attribute defined for items. In a case like this one, we could simply write the relevant part of the grammar this way:

t_item_atts --> [partNum=SKU], { po_sku_value(SKU) }.

Or we could even put the constraint into the top-level rule for the item element:

item --> [element(item,[partNum=SKU],Content)],
  {
    po_sku_value(SKU),
    t_item_cont(Content,[])
  }.

In the general case, however, it is going to be simpler to put checking for required, forbidden, and defaulted attributes into a separate rule, which we will here call t_item_att_check:

< 11 Checking for required, forbidden attributes > ≡

t_item_att_check(LAVS) :-
  atts_present(LAVS,[partNum]).

This code is not used elsewhere.

In the general case, we may have to check for required, forbidden, and defaulted attributes, and to merge defaulted attributes into the list of attribute-value specifications. So we can extend the general form of the foo_att_check predicate to:

< 12 Checking for required, forbidden attributes > ≡

t_item_att_check(LAVS,AugmentedLAVS) :-
  atts_present(LAVS,[partNum]),
  atts_absent(LAVS,[]),
  atts_defaulted(LAVS,[],AugmentedLAVS).

This code is not used elsewhere.

We assume here the existence of three special-purpose items for checking the presence or absence of items in lists of attribute-value specifications. A list of attribute-value specifications contains all the attributes in a list of required attributes, if the list of required attributes is empty.

< 13 Utilities for checking required attributes > ≡

atts_present(LAVS,[]).

Continued in <Utilities for checking required attributes 14>, <Utilities for checking required attributes 15>, <Utilities for checking forbidden attributes 16>, <Utilities for checking forbidden attributes 17>, <Utilities for checking defaulted attributes 18>
This code is not used elsewhere.

A list of attribute-value specifications contains all the attributes in a list of required attributes, if it contains the first one, and also all the others.

< 14 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡

atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).

And finally, an attribute value specification is present for a given attribute name if it is either at the head of the list, or in the tail of the list.

< 15 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡

att_present([Attname=Attval|Tail],Attname).
att_present([AVS|Tail],Attname) :-
  att_present(Tail,Attname).

Some readers will be looking, right about now, for a rule of the form

att_present([],Attname) :- ...

We don't have a rule for the empty list, though, because if we have run through the list of attribute-value specifications without finding one for the attribute name we are seeking, we want the predicate to fail. If necessary to prevent later maintenance programmers, or ourselves, from supplying the ‘missing’ induction basis by mistake, we could write:

att_present([],Attname) :- !, fail.

Now for forbidden attributes. (These would be originally optional attributes whose maxOccurs has been set to 0 in a refinement step.) If there are no attribute names in the list of forbidden attributes, then we are done and all is well.

< 16 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡

atts_absent(LAVS,[]).

If the list of forbidden attribute names is not empty, we check first the head, then (recursively) the tail.

< 17 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡

atts_absent(LAVS,[HFA|ForbiddenTail]) :-
  not(att_present(LAVS,HFA)),
  atts_absent(LAVS,ForbiddenTail).

Finally, for attributes with defaults we specify a list of attribute-value specifications; for each of these, we check to see whether a value is already specified for an attribute of that name. If it is, we skip the attribute in question; if not, we add it to the list of attribute-value specifications. Either way, we call the predicate recursively to take care of the rest of the defaulted attributes.

< 18 Utilities for checking defaulted attributes [continues 13 Utilities for checking required attributes] > ≡


atts_defaulted(LAVS,[],LAVS).
atts_defaulted(LAVS,[AN=AV|Tail],AugmentedLAVS) :-
  att_present(LAVS,AN),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).
atts_defaulted(LAVS,[AN=AV|Tail],[AN=AV|AugmentedLAVS]) :-
  not(att_present(LAVS,AN)),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).

2.7. Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:

Each element is checked against the rules for its datatype.
Elements declared with a complex type are checked with regard to their attributes and their content.
- Attributes are checked to see that they were declared and that their value is of the correct type.
- Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
- The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
- Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.

Running a lightly modified version of this grammar over a set of test cases based on po1.xml, the grammar accepted a number of valid variations of the po1.xml example, with

additional attributes (namespace declarations, xsi attributes)
omission or inclusion of optional elements
omission or inclusion of optional attributes

Modification of the grammar was necessary because the grammar as shown does not handle namespace declarations or attributes in the xsi namespace; since all the test files do have namespace declarations, the following rules had to be added to the grammar (I haven't added them above for fear it would clog up the exposition):

purchaseOrderType_att --> magic_att.
usAddress_att --> magic_att.
t_items_att  --> magic_att.
t_item_att  --> magic_att.

magic_att --> [xmlns=NS].
magic_att --> [xmlns:Pre=NS].
magic_att --> ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value].

The grammar detected a number of errors:

misspelled generic identifiers and attribute names
omission of required elements
excess occurrences of elements
elements out of sequence
extraneous undeclared elements present
value other than the prescribed value for a fixed attribute

The following errors were not detected:

multiple po:comment elements (the schema allows at most one at the purchaseOrder level and at most one in each item)
```
    <apo:comment>Hurry, my lawn is going wild!</apo:comment>
    <apo:comment>I don't know how much longer I can hold out.</apo:comment>
```
This is an error in the grammar above, which has been retained to illustrate the risks of hand-translation. Instead of
```
opt_comment --> [].
opt_comment --> comment, opt_comment.
```
(which allows arbitrarily many comments), the rule for the optional comment element should read
```
opt_comment --> [].
opt_comment --> comment.
```
two orderDate attributes
```
<apo:purchaseOrder xmlns:apo="http://www.example.com/PO1"
                   orderDate="1999-10-19"
                   orderDate="1999-10-20"> 
```
(this can be blamed on the upstream processor, which should probably have rejected the document without producing an infoset, since single occurrence of attribute names is a well-formedness constraint)

quantity exceeding the maximum legal value

        <item partNum="926-AA">
            <productName>Baby Monitor</productName>
            <quantity>100</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>

omission of the required partNum attribute

        <item>
            <productName>Baby Monitor</productName>
            <quantity>1</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>

The following errors were detected, or at least the document was not accepted as valid, but validating the document caused a Prolog error because the number_chars predicate used to validate numbers is apparently not robust against non-numeric arguments.

invalid zip code
```
<zip>OX2 6NN</zip>
```
wrong datatype for quantity
```
<quantity>one</quantity>
```
bad value for USPrice
```
<USPrice>USD 39.98</USPrice>
```

See A note on the test cases below for further details on the test documents.

The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here.

There are several gaps in coverage, which we will make good in what follows:

handling wildcards
handling character data intermingled with mixed content models, handling whitespace intermingled with non-mixed content models
handling substitution groups
handling xsi:type specifications in the document instance
diagnosing errors (i.e. saying something more useful than “no” or “ERROR: number_chars/2: Syntax error: Illegal number” when an error is encountered) and recovering from them
providing relevant PSVI properties

It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question “is this document/element/lexical form ok?” As shown by the examples used to introduce DCG notation, we could do that by adding parameters to the productions in the grammar; it will be more convenient, however, to translate our grammar from DCG form into a related form, that of definite-clause translation grammars. To that, the next part of this paper is dedicated.

3. Providing PSVI properties

To model schema-validity assessment properly, we need to provide more output: specifically, we need to provide information about the input document together with some additional properties (the schema infoset contributions). It's possible to do that in DCG notation, but it rapidly becomes cumbersome. We'll use DCTG notation instead; it was devised to handle grammatical attributes more conveniently than DCG, and to separate the semantics more effectively from the syntax [Abramson 1984].

On this foundation we will later be able to add more:

adding type name information: type name property
black-box (skip) validation
white-box (lax) validation, validation attempted property
error handling (falling back to lax), validity property

Note that this section is not intended to be a complete translation of XML Schema into DCTG, but a sample small enough to follow and large enough to make a persuasive case that all of XML Schema can be translated.

3.1. Introduction to DCTG notation

DCTG notation was invented to make it easier to calculate values for grammatical attributes and to refer to the values of grammatical attributes on other rules.

Here is a simple example. Consider the following grammar for binary strings:

bit ::= '0'.
bit ::= '1'.
bitstring ::= '' /* nothing */.
bitstring ::= bit, bitstring.
number ::= bitstring, fraction.
fraction ::= '.', bitstring.
fraction ::= ''.

We might wish to calculate the length and the (unsigned base-two) value of the bitstring as attributes. Using a yacc-like notation that might look like this. Notice that scale is a top-down attribute and value and fractionalvalue are bottom-up attributes.

bit ::= '0' { $0.bitvalue = 0; }.
bit ::= '1' { $0.bitvalue = power(2,$0.scale); }.
bitstring ::= '' { 
  $0.value = 0; 
  $0.length = 0;
  /* scale doesn't matter here */
}.
bitstring ::= bit, bitstring {
  $0.length = $2.length + 1;
  $1.scale = $0.scale;
  $2.scale = $0.scale - 1;
  $0.value = $1.value + $2.value;
}.
number ::= bitstring, fraction {
  $1.scale = $1.length - 1;
  $0.value = $1.value + $2.fractionalvalue;
}.
fraction ::= '.', bitstring {
  $2.scale = -1;
  $0.fractionalvalue = $2.value;
}.
fraction ::= '' {
  $0.fractionalvalue = 0;
}.

In DCTG notation, this grammar looks like this:[9]

bit ::= [0]
  <:> bitval(0,_).

bit ::= [1]
  <:> bitval(V,Scale) ::- V is **(2,Scale).

bitstring ::= []
  <:> length(0)
  &&  value(0,_).

bitstring ::= bit^^B, bitstring^^B1
  <:> length(Length) ::-
        B1 ^^ length(Length1),
        Length is Length1 + 1
  &&  value(Value,ScaleB) ::-
        B ^^ bitval(VB,ScaleB),
        S1 is ScaleB - 1,
        B1 ^^ value(V1,S1),
        Value is VB + V1.

number ::= bitstring ^^ B, fraction ^^ F
  <:> value(V) ::-
        B ^^ length(Length),
        S is Length-1,
        B ^^ value(VB,S),
        F ^^ fractional_value(VF),
        V is VB + VF.

fraction ::= ['.'], bitstring ^^ B
  <:> fractional_value(V) ::-
        S is -1,
        B ^^ value(V,S).

fraction ::= []
  <:> fractional_value(0).

As may be seen, each rule in DCTG consists of a left hand side, a right-hand side, and optionally a list of attributes. The right-hand side is separated from the left-hand side by the operator ::= and from the following list of attributes (if any) by the operator <:>. The expression on the right hand side is a sequence of non-terminal symbols, each optionally suffixed with the operator ^^ and a variable name (e.g. bitstring^^B). Each attribute is identified by a Prolog structure, e.g. value(V), whose functor is the name of the attribute and whose arguments are its values. The attribute structure may be followed by the operator ::- (designed to look a lot like the standard Prolog operator :- or ‘neck’) and a series of goals which will be satisfied when the attribute value is to be instantiated. In practice, these goals help to calculate the attribute values. Values of attributes attached to the items on the right-hand side may be referred to by the variable name associated with the item and the name of the attribute, joined by the operator ^^, as for example in B ^^ bitval(VB,ScaleB).

A partial EBNF grammar for DCTG notation[10] is:

grammar ::= rule*
rule    ::= lhs '::=' rhs ('<:>' att-spec ('&&' att-spec)*)?
lhs     ::= term
rhs     ::= term (',' term)*
attspec ::= compound-term ('::-' goal (',' goal)*)?
compound-term ::= ATOM '(' term (',' term)* ')'

3.2. Another example: the English grammar fragment E1

Here is another simple example: the trivial fragment of English grammar given above, translated directly into DCTG notation, looks like the following. The only difference from the DCG form is that the separator between the left- and right-hand side of each production is “::=” instead of “-->”.

%%% E1 (trivial context-free grammar for a fragment of English)
%%% in DCTG notation.
s ::= np, vp.
np ::= det, n.
np ::= n.
vp ::= v, np.
vp ::= v.
n ::= [mary].
n ::= [john].
n ::= [woman].
n ::= [man].
n ::= [apple].
det ::= [the].
v ::= [loves].
v ::= [eats].
v ::= [sings].

The translation into standard Prolog is similar to that used for DCGs, but instead of adding two arguments to each non-terminal, the DCTG translation adds three. The first additional argument is a Prolog structure with the functor node, described further below. The second and third are (like the two arguments added in DCG translations) difference lists.

The node structure has three arguments:

the non-terminal of the grammar rule (here s, np, vp, etc.)
a list of the node structures associated with the items on the right-hand side of the grammar rule
a list of grammatical attributes (in this grammar, this list will be empty)

The translation of our trivial grammar into standard Prolog is thus:

?- dctg_reconsult('ks81dctg.pl').

Yes
?- listing([s,np,vp,n,det,v]).

:- dynamic s/3.

s(node(s, [A, B], []), C, D) :-
        np(A, C, E),
        vp(B, E, D).

:- dynamic np/3.

np(node(np, [A, B], []), C, D) :-
        det(A, C, E),
        n(B, E, D).
np(node(np, [A], []), B, C) :-
        n(A, B, C).

:- dynamic vp/3.

vp(node(vp, [A, B], []), C, D) :-
        v(A, C, E),
        np(B, E, D).
vp(node(vp, [A], []), B, C) :-
        v(A, B, C).

:- dynamic n/3.

n(node(n, [[mary]], []), A, B) :-
        c(A, mary, B).
n(node(n, [[john]], []), A, B) :-
        c(A, john, B).
n(node(n, [[woman]], []), A, B) :-
        c(A, woman, B).
n(node(n, [[man]], []), A, B) :-
        c(A, man, B).
n(node(n, [[apple]], []), A, B) :-
        c(A, apple, B).

:- dynamic det/3.

det(node(det, [[the]], []), A, B) :-
        c(A, the, B).

:- dynamic v/3.

v(node(v, [[loves]], []), A, B) :-
        c(A, loves, B).
v(node(v, [[eats]], []), A, B) :-
        c(A, eats, B).
v(node(v, [[sings]], []), A, B) :-
        c(A, sings, B).

Yes
?-

The predicate dctg_reconsult(File) is used to translate a DCTG grammar into Prolog clauses and load them; it is provided by [Abramson/Dahl/Paine 1990] and is available from a variety of sources on the net.[11]

A short terminal session should make the nature of the results a bit clearer:[12]

?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(n, [[john]], [])], 
           []), 
      node(vp, 
           [node(v, [[loves]], []), 
            node(np, 
                 [node(n, [[mary]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(n, [[john]], [])], []), 
node(vp, [node(v, [[loves]], []), node(np, [node(n, 
[...], [])], [])], [])], []) 

Yes
?- s(S,[the,woman,eats,the,apple],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[woman]], [])], 
           []), 
      node(vp, 
           [node(v, [[eats]], []), 
            node(np, 
                 [node(det, [[the]], []), 
                  node(n, [[apple]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), node(n, 
[[woman]], [])], []), node(vp, [node(v, [[eats]], []), 
node(np, [node(det, [...], []), node(..., ..., ...)], 
[])], [])], []) 

Yes
?- s(S,[the,man,sings],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[man]], [])], 
           []), 
      node(vp, 
           [node(v, [[sings]], [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), 
node(n, [[man]], [])], []), node(vp, [node(v, 
[[sings]], [])], [])], []) 

Yes
?-

The node structure constructed in the first added argument resembles and serves much the same purpose as the structure attribute used in E2, the attribute-grammar version of the English grammar fragment.

3.3. English grammar with attributes (E2) in DCTG notation

The fragment of English grammar E2, which was presented earlier to illustrate the use of DCGs for attribute grammars, may also be used to illustrate DCTGs.

Let's walk through the grammar. A sentence is an NP followed by a VP; we will call the NP S (‘subject’), and the VP we will call P (‘predicate’).

< 19 English grammar fragment with attributes [File ks92.dctg] > ≡

s ::= np^^S, vp^^P,
  { S^^number(Num), P^^number(Num) }
  <:> {Attributes for non-terminal s 20}

Continued in <NP rules 21>, <VP rules 26>, <Pre-terminal rules 29>, <Lexicon 30>

The goals enclosed in braces (S^^number(Num), P^^number(Num)) together express the constraint that the NP and VP must agree in number: the number attribute of the NP S and the number attribute of the VP P must unify with each other.

The non-terminal s has only one grammatical attribute; let us call it structure. When s is made up (as here) of an NP and a VP, we represent its structure by a Prolog term with s as its functor and the structure of the NP and VP as its two arguments:

< 20 Attributes for non-terminal s > ≡

  structure(s(Sstr,Pstr)) ::- 
    S^^structure(Sstr), 
    P^^structure(Pstr).

This code is used in < English grammar fragment with attributes 19 >

A noun phrase (NP) is made up of a determiner and a noun; they must agree in number. This covers phrases like “the apple”, “the apples”, “one apple”, “some apples”. The agreement rule excludes the phrases “one apples” and “some apple”.[13]

< 21 NP rules [continues 19 English grammar fragment with attributes] > ≡

np ::= det^^D, n^^N,
  { D^^number(Num), N^^number(Num) }
  <:> {NP structure attribute (for DET+N) 22}

  &&  {NP number attribute (for DET+N) 23}

Continued in <NP rules, cont'd 24>, <NP rules, cont'd 25>

The non-terminal np has two attributes, named structure and number.

The structure of the NP is a Prolog term with the name of the non-terminal (np) as its functor and the constituents as the arguments. This is the pattern of the structure attribute on all non-terminals, and I won't comment on it again.

< 22 NP structure attribute (for DET+N) > ≡

  structure(np(Dstr,Nstr)) ::-
    D^^structure(Dstr),
    N^^structure(Nstr)

This code is used in < NP rules 21 >

The number attribute of the NP illustrates an important idiom: the guard in the syntactic part of the rule has already checked the number attributes of the determiner and the noun, to make sure they unify with each other; the variable Num has the value sg or pl already, and we don't need to do any more computation. We just say that whatever Num is, that's the grammatical number of the NP.

< 23 NP number attribute (for DET+N) > ≡

  number(Num).

This code is used in < NP rules 21 >

Noun phrases can also take the form of a plural noun by itself, as in “Men love apples”.

< 24 NP rules, cont'd [continues 21 NP rules] > ≡

np ::= n^^N, { N^^number(pl) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(pl).

The final form of NP recognized by this grammar is a singular proper noun by itself, as in “John loves Mary”.

< 25 NP rules, cont'd [continues 21 NP rules] > ≡

np ::= pn^^N, { N^^number(sg) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(sg).

A verb phrase (VP) can include a direct object in the form of a noun phrase:

< 26 VP rules [continues 19 English grammar fragment with attributes] > ≡

vp ::= v^^V, np^^O
  <:> structure(vp(Vs,Os)) ::-
        V^^structure(Vs),
        O^^structure(Os)
  &&  {Number attribute for VP 28}

Continued in <VP rules, cont'd 27>

or they can be just a verb with no direct object.[14]

< 27 VP rules, cont'd [continues 26 VP rules] > ≡

vp ::= v^^V
  <:> structure(vp(Vs)) ::-
        V^^structure(Vs)
  &&  {Number attribute for VP 28}

Although both the verb and the direct object have a number attribute, only that of the verb counts in determining the value of the number attribute for the VP as a whole.

< 28 Number attribute for VP > ≡

  number(Num) ::- V^^number(Num).

This code is used in < VP rules 26 > < VP rules, cont'd 27 >

Nouns, proper nouns, verbs, and determiners (the ‘pre-terminal’ categories of the grammar) all have rules of the same structure: a token in the string counts as one of these if the lexicon says it's one, and the number attribute has whatever value the lexicon gives.

< 29 Pre-terminal rules [continues 19 English grammar fragment with attributes] > ≡

n ::= [L], { lex(L,n,Num) }
  <:> structure(n(L))
  &&  number(Num).

pn ::= [L], { lex(L,pn,Num) }
  <:> structure(pn(L))
  &&  number(Num).

v ::= [L], { lex(L,v,Num) }
  <:> structure(v(L))
  &&  number(Num).

det ::= [L], { lex(L,det,Num) }
  <:> structure(det(L))
  &&  number(Num).

Finally, the lexicon. To keep things simple, the lexicon here is just a set of facts, with literal values.[15] The entry for the word the is the exception: it does not have a literal value, but the anonymous variable _ to indicate that the can be either sg or pl.

< 30 Lexicon [continues 19 English grammar fragment with attributes] > ≡

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).

A session log illustrates the structure built by the DCTG rules:

?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(pn, 
                 [[john]], 
                 [structure(pn(john)), 
                 (number(sg)::-lex(john, pn, sg))])], 
           [ (structure(np(_G292))::-
                node(pn, 
                     [[john]], 
                     [structure(pn(john)), 
                      (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
            number(sg)]), 
      node(vp, 
           [node(v, 
                 [[loves]], 
                 [structure(v(loves)), 
                 (number(sg)::-lex(loves, v, sg))]), 
            node(np, 
                 [node(pn, 
                       [[mary]], 
                       [structure(pn(mary)), 
                       (number(sg)::-lex(mary, pn, sg))])], 
                 [(structure(np(_G424))::-
                     node(pn, [[mary]], 
                          [structure(pn(mary)), 
                          (number(sg)::-lex(mary, pn, sg))])
                     ^^structure(_G424)), 
                  number(sg)])], 
           [ (structure(vp(_G351, _G352))::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^structure(_G351), 
                node(np, 
                     [node(pn, [[mary]], 
                           [structure(pn(mary)), 
                           (number(sg)::-lex(mary, pn, sg))])], 
                     [(structure(np(_G424))::-
                         node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                      number(sg)])^^structure(_G352)), 
             (number(_G373)::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^number(_G373))])], 
     [(structure(s(_G261, _G262))::-
         node(np, 
              [node(pn, [[john]], 
                    [structure(pn(john)), 
                    (number(sg)::-lex(john, pn, sg))])], 
              [(structure(np(_G292))::-
                  node(pn, [[john]], 
                       [structure(pn(john)), 
                       (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
               number(sg)])^^structure(_G261), 
         node(vp, [node(v, [[loves]], 
                        [structure(v(loves)), 
                        (number(sg)::-lex(loves, v, sg))]), 
                   node(np, 
                        [node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])], 
                        [(structure(np(_G424))::-
                            node(pn, [[mary]], 
                                 [structure(pn(mary)), 
                                 (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                         number(sg)])], 
                  [(structure(vp(_G351, _G352))::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^structure(_G351), 
                      node(np, [node(pn, [[mary]], 
                                     [structure(pn(mary)), 
                                     (number(sg)::-lex(mary, pn, sg))])], 
                               [(structure(np(_G424))::-
                                   node(pn, [[mary]], 
                                        [structure(pn(mary)), 
                                        (number(sg)::-lex(mary, pn, sg))])
                                   ^^structure(_G424)), 
                                number(sg)])^^structure(_G352)), 
                   (number(_G373)::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^number(_G373))])
         ^^structure(_G262))])

In examining the structure just shown, the reader will note a great deal of apparent repetition; this results from the high incidence of structure sharing, which is not shown explicitly: in the grammar itself, the variables which have child nodes as values are used quite freely and appear both in the list of children and in the rules for calculating synthetic attributes.

Note also that the values of the attributes are not always pre-calculated: instead, the structure has the rules necessary to perform the evaluation on demand. In the example, the structure attribute for the pre-terminal categories is already completely grounded: it has no variables, but only literal values, while the structure attributes for the higher-level parts of the linguistic structure still have unbound variables.

3.4. Layer 3: Translation into DCTG notation

As a first step toward providing grammatical attributes with PSVI information, we will translate the existing purchase-order schema into DCTG notation, adding grammatical attributes corresponding to some basic information-set properties which are required to be in the input infoset:

for Attribute Information Items:
- [local name]
- [namespace name]
- [normalized value]
for Element Information Items:
- [local name]
- [namespace name]
- [children]
- [attributes]
- [in-scope namespaces] or [namespace attributes]
for Namespace Information Items:
- [prefix]
- [namespace name]

In principle, we ought perhaps also to provide properties for character information items:

for Character Information Items:
- [character code]

but it seems extraordinarily inconvenient to use grammatical attributes for this purpose. The information involved is obviously present, and can be isolated by using the standard Prolog predicate atom_codes(Atom,String) (or atom_chars(Atom,CharList) and char_code(Atom,ASCII)).

Additionally, we will add some more interesting properties of the PSVI:

type definition name, namespace, anonymous, and type
schema specified (schema or infoset)
validation attempted (always full)
validity (always valid, because when the document is not valid, we fail)

Like the DCG version above, the DCTG version of the schema has several distinct kinds of rules:

element rules
attribute-list rules (for checking the attributes of a complex type)
content-model rules (for checking the content of a complex type)
simple-type checking rules

The following sections give these in DCTG notation.

3.4.1. Top-level rules for element types

An element rule will serve to match the start-tag and get the attributes and contents of each element; from it, we will call routines to check the attributes and content against the complex type. These differ from the DCG rules in two ways: when we call them, we must specify three arguments, not two, and we provide explicit grammatical attributes for infoset properties. The basic pattern is simple: for any element in namespace n with local name gi and complex type ct, we will construct an appropriate non-terminal symbol nt, and the element rule will look like this:

nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[])
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(C) 
  && localName(gi) 
  && namespacename(n)
  && type_definition_anonymous(Boolean)
  && type_definition_namespace(URI)
  && type_definition_name(NCName)
  && type_definition_type(complex)
  && validation_attempted(full)
  && validity(valid)
.

Later, we will add further grammatical attributes, and use values other than full and valid for invalid elements.

Note that the rule for XML attributes is not a simple call to the parser, but a call to a wrapper predicate. Since the SWI parser returns namespace attributes in the same list as other attributes, while the infoset spec requires that they be listed in different properties, the ct_attributes predicate will need to filter the attribute information items into two different lists, one to become the value of the attributes infoset property, and one to become the value of namespaceAttributes.

We also should become a little more systematic about naming conventions. If we continue to use generic identifiers (element type names) directly as names of Prolog predicates, we risk name collisions between elements and predicates defined as part of the parser, or built in to Prolog. To eliminate this risk, we will prefix names taken over from the schema with e_ (for elements), t_ (for types), etc., and we will avoid those prefixes otherwise. If a schema has any names beginning with e_ or t_, this rule may become slightly confusing. But there won't be collisions between schema-based names and other names in the parser.

The purchase order schema po.xsd defines the following fifteen element types: the list gives the simple names which will be used to refer to them in the grammar below, as well as their schema-component designator as defined in Holstege/Vedamuthu 2002. Since their local names are all unique, the grammar below simply uses e_ plus their local names to refer to them. In other schemas, it will be necessary to mangle the names, or generate arbitrary identifiers, in order to distinguish different element types which have the same local names.

e_purchaseOrder = /element(purchaseOrder)
e_comment = /element(comment)
e_shipTo = /complexType(po:PurchaseOrderType)/sequence()/element(shipTo)
e_billTo = /complexType(po:PurchaseOrderType)/sequence()/element(billTo)
e_items = /complexType(po:PurchaseOrderType)/sequence()/element(items)
e_name = /complexType(po:USAddress)/sequence()/element(name)
e_street = /complexType(po:USAddress)/sequence()/element(street)
e_city = /complexType(po:USAddress)/sequence()/element(city)
e_state = /complexType(po:USAddress)/sequence()/element(state)
e_zip = /complexType(po:USAddress)/sequence()/element(zip)
e_item = /complexType(po:Items)/sequence()/element(item)
e_productName = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(productName)
e_quantity = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)
e_USPrice = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)
e_shipDate = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)

The simple purchase-order schema defines four complex types; one is anonymous; we'll use the local name of its host element after the t_ prefix:

t_PurchaseOrderType = /complexType(po:PurchaseOrderType)
t_USAddress = /complexType(po:USAddress)
t_Items = /complexType(po:Items)
t_item = /complexType(po:Items)/sequence()/element(item)/complexType()

The elements with complex types get these rules:

< 31 Rules for elements with complex types > ≡

/* e_purchaseOrder: grammatical rule for purchaseOrder element.
   e_purchaseOrder(ParsedNode,L1,L2): holds if the difference
      between L1 and L2 (difference lists) is a purchase order
      element in SWI Prolog notation. 
   And so on for the other element types.
*/
e_purchaseOrder ::= [element('http://www.example.com/PO1':purchaseOrder,
  Attributes,Content)],
  {
    t_PurchaseOrderType_atts(A,NA,Attributes),
    t_PurchaseOrderType_cont(C,Content,[])
  } 
  <:> localname(purchaseOrder)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('PurchaseOrderType')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}

  .
e_shipTo ::= [element(shipTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(shipTo)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('USAddress')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_billTo ::= [element(billTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(billTo)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('USAddress')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_items ::= [element(items,Attributes,Content)],
  {
    t_Items_atts(A,NA,Attributes),
    t_Items_cont(C,Content,[])
  } 
  <:> localname(items)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('Items')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .
e_item ::= [element(item,Attributes,Content)],
  {
    t_item_atts(A,NA,Attributes),
    t_item_cont(C,Content,[])
  } 
  <:> localname(item)
  && type_definition_anonymous('true')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('t_item')
  && type_definition_type(complex)
  {Common infoset properties for elements in po namespace 32}


  .

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

Note that the type_definition_name property for the item element provides the generated name we use for the type. That this name is not assigned by the schema is clarified by type_definition_anonymous('true').

Since the attributes, children, and namespacename properties have identical definitions for all element types in the purchase-order namespace, we can factor them out into a single code fragment:

< 32 Common infoset properties for elements in po namespace > ≡

  &&  attributes(A)
  &&  namespaceattributes(NA)
  &&  children(C)
  &&  namespacename('http://www.example.com/PO1')
  && validationattempted(full)
  && validity(valid)

This code is used in < Rules for elements with complex types 31 > < Rules for elements with simple types 33 >

The rules for elements with simple types are slightly simpler than those for elements with complex types, but follow the same basic pattern.

In the DCG version of this schema given above, we wrote the comment element and others associated with simple types using a hard-coded requirement for an empty list of attributes. In fact, that is too simple, since such elements may in fact have xsi:type, xsi:nil, xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes. So we write these element rules with the same basic structure as was used for complex types, except that we use a standard predicate for checking that no attributes outside the xsi namespace were used.

The schema po.xsd defines two simple types: SKU and the anonymous simple type used for quantities:

t_quantity = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
t_SKU = /simpleType(SKU)

In addition, several built-in simple types are used:

t_string = xsd:string
t_integer = xsd:integer
t_decimal = xsd:decimal
t_date = xsd:date

In a full implementation, we'll need to do some more serious name mangling to ensure uniqueness of relatively short, handy names for all types. For now, we just choose the names manually.

The rules for simple types are:

< 33 Rules for elements with simple types > ≡

e_comment ::= [element('http://www.example.com/PO1':comment,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(comment) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_name ::= [element(name,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(name) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_street ::= [element(street,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(street) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_city ::= [element(city,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(city) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_state ::= [element(state,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(state) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_zip ::= [element(zip,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname(zip) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for decimals 36}

  .

e_productName ::= [element(productName,
    Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(productName) 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for strings 35}

  .

e_quantity ::= [element(quantity,
    Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    t_quantity_cont(C,Content,[])
  }
  <:> localname(quantity) 
  {Common infoset properties for elements in po namespace 32}

  && type_definition_anonymous('true')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('t_quantity')
  && type_definition_type(simple)
  .

e_USPrice ::= [element('USPrice',Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname('USPrice') 
  {Common infoset properties for elements in po namespace 32}

  {PSVI properties for decimals 36}

  .

e_shipDate ::= [element(shipDate,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_date_cont(C,Content,[])
  }
  <:> localname(shipDate) 
  {Common infoset properties for elements in po namespace 32}

  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('date')
  && type_definition_type(simple)
  .

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

Just as we factor out the common infoset properties, we can also factor out the checking against frequently used built-in simple types, notably string:

< 34 Guard to check attributes and content of strings > ≡

  {
    sT_atts(A,Attributes,[]),
    xsd_string_cont(C,Content,[])
  }

This code is used in < Rules for elements with simple types 33 >

Similarly, the type identifications for string and decimal are used more than once:

< 35 PSVI properties for strings > ≡

  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('string')
  && type_definition_type(simple)

This code is used in < Rules for elements with simple types 33 >

< 36 PSVI properties for decimals > ≡

  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('decimal')
  && type_definition_type(simple)

This code is used in < Rules for elements with simple types 33 >

3.4.2. Rules for attributes

For each complex type, we need to do several things in order to validate all the attributes on occurrence of that type and provide appropriate nodes and infoset properties:

The input structure has namespace attributes and other attributes in the same list, while we need them in separate lists so we can assign them to two different infoset properties. So we need to partition the list of attributes. We can perform the partition either before all other processing, or after; doing it afterwards leads to more compact code, so we choose that.
For each non-namespace attribute found, we need to validate it: if it is declared, we need to check it against its declared type. If the attribute is declared with a fixed type, we should check that the value given matches the prescribed value. If it is not declared, we should raise an error, but we'll save that for a later layer.
We need to ensure that attributes required by the complex type are present and that attributes forbidden by the complex type are not present. For any attributes declared with default values, we need to supply an attribute information item with the default value, if the document didn't supply a value. Rather than trying to interleave this with other tasks, we will perform a separate check on attribute occurrences.

And we want to provide basic infoset properties for the XML attributes, in the form of grammatical attributes in the attribute-grammar sense.

3.4.2.1. Basic pattern

For each complex or simple type dt, the basic pattern of the attribute-checking rule will be:

dt_atts(Lpa,Lpna,Lavs) :-
  lavs_dt(LpaAll,Lavs,[]),    /* parse against grammar of attributes */
  partition(LpaAll,LpaPresent,Lpna),         /* partition the result */
  attocc_dt(LpaPresent,Lpa).      /* check min, max occurrence rules */

The logical variables have the following meanings:

Lpa: List of parsed attributes (i.e. of node() structures of the kind returned by any DCTG rule) for this complex type, including defaulted attributes
Lpna: List of parsed namespace attributes
Lavs: The list of attribute-value specifications provided by the input structure returned by the SWI Prolog parser.
LpaAll: Combined list of parsed-attribute node() structures for all attributes, both namespace attributes and others
LpaPresent: List of parsed-attribute nodes for attributes explicitly assigned values in the document instance (without defaulted attributes)

For each type, a grammar defining the legal attributes will be constructed; if type dt has attributes an1 and an2, of types st1 and st2 respectively, then the core context-free grammar will have a form like this:

lavs_dt ::= [].
lavs_dt ::= avs_dt, lavs_dt.       /* declared attributes */
lavs_dt ::= avs_nsd, lavs_dt.   /* namespace declarations */
lavs_dt ::= avs_xsi, lavs_dt.           /* XSI attributes */

avs_dt ::= [an1=Av], { st1_value(Av) }.
avs_dt ::= [an2=Av], { st2_value(Av) }.

Simple types will, of course, have no declared attributes, and the rules for declared attributes and occurrence-checking (together with the rules for individual attributes) will be omitted. Wildcard support can also be added here when needed.

3.4.2.2. Namespace attributes and XSI attributes

One set of rules for namespace attributes and XSI attributes will suffice:

< 37 Grammar rules for namespace and XSI attributes > ≡

/* avs_nsd: grammatical rule for namespace-attribute specifications */
avs_nsd ::= [xmlns=DefaultNS]
  <:> localname(xmlns)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(DefaultNS).
avs_nsd ::= [xmlns:Prefix=NSName]
  <:> localname(Prefix)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(NSName).

/* avs_nsd: grammatical rule for XSI attribute specifications */
avs_xsi ::= ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value]
  <:> localname(Localname)
  && namespacename('http://www.w3.org/2001/XMLSchema-instance')
  && normalizedvalue(Value).

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

Note that default namespace declarations do have a namespace property, despite not having a prefixed name; this is in accord with Section 2.2 of the Infoset spec, which says “By definition, all namespace attributes (including those named xmlns, whose [prefix] property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/.”

3.4.2.3. Occurrence checking

Each complex type will also have a rule for occurrence-checking, which will take something like the following form (assuming that Lreq, Ldft, and Lnot are lists of required, defaulted, and forbidden attributes:

attocc_dt(LpaPres,LpaAll) :-
  atts_present(LpaPres,Lreq),
  atts_absent(LpaPres,Lnot),
  atts_defaulted(LpaPres,Ldft,LpaAll).

Since the form of the attribute lists has changed (we are now dealing with lists of node structures), we need new forms of atts_present, etc. for this:

< 38 Utilities for checking attribute occurrences > ≡

/* atts_present(Lpa,Lreq):  true if a parsed attribute node
   is present in Lpa for each attribute name in Lreq */
atts_present(LAVS,[]).
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).

/* An attribute name matches if namespace and local part match */
/* att_present(Lpa,Attname):  true if a parsed attribute node
   is present in Lpa which has name Attname */
att_present([Pa|Lpa],NS:Attname) :- 
  Pa^^localname(Attname), 
  Pa^^namespacename(NS).
att_present([Pa|Lpa],Attname) :-
  att_present(Lpa,Attname).
/* no base step: if we reach att_present([],Attname) we want to fail. */

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

The rule for checking forbidden attributes is very similar:

< 39 Utility for checking absent attributes > ≡

/* atts_absent(Lpa,Ltabu): true if no attribute named in Ltabu
   is present in Lpa */
atts_absent(LAVS,[]).
atts_absent(LAVS,[H|T]) :-
  not(att_present(LAVS,H)),
  atts_absent(LAVS,T).

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

The rule for providing defaults must go through all of the attributes with defaults; this happens in the atts_defaulted predicate in the usual way of recursion on the list.

< 40 Utility for providing defaulted attributes > ≡

/* atts_defaulted(L1,L2,L3): true if L3 has all the attributes in L1,
   plus all of the attributes in L2 which are not also in L1 */
atts_defaulted(Lpa,[],Lpa).
atts_defaulted(Lpa,[Padft|Ldft],LpaAll) :-
  atts_defaulted(Lpa,Ldft,Lpa2),
  att_merge(Lpa2,Padft,LpaAll).

Continued in <Utility for providing defaulted attributes 41>
This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

For each of these attributes individually, the default value must be added to the list if a value is not already there; this involves recursion on the list of attributes already present. We expect only ever to call this predicate when the first and third arguments (the defaulted attribute and the list into which it is to be merged) are instantiated, but experience shows that we run into problems when Prolog backtracks into this predicate (e.g. after it finds an error further along in the XML document and is retrying everything it has done before). When backtracking, Prolog does call this predicate with uninstantiated arguments and then falls into an infinite loop trying to find the namespacename attribute of an uninstantiated variable. To prevent this loop, we check to ensure that the first two arguments are instantiated, using the standard Prolog predicate nonvar. Strictly speaking, this test has nothing whatever to do with the declarative meaning of the predicate, and it would be preferable to do without it, but it is essential for practical purposes.

< 41 Utility for providing defaulted attributes [continues 40 Utility for providing defaulted attributes] > ≡

/* att_merge(L1,Pa,L2): if Pa is present in L1, then L3 = L1,
   otherwise L3 = L1 + Pa. */
att_merge([],Padft,[Padft]).
att_merge([Pa|Lpa],Padft,[Pa|Lpa]) :-
  nonvar(Pa), nonvar(Lpa), nonvar(Padft),
  Pa^^namespacename(NS),
  Padft^^namespacename(NS),
  Pa^^localname(Lnm),
  Padft^^localname(Lnm).
att_merge([Pa|Lpa],Padft,Lpa2) :-
  nonvar(Pa), nonvar(Lpa), nonvar(Padft),
  not( {Pa^^namespacename(NS),
    Padft^^namespacename(NS),
    Pa^^localname(Lnm),
    Padft^^localname(Lnm) } ),
  att_merge(Lpa,Padft,Lpa2).

The explicit not() in the third rule is similarly intended to prevent the third rule from firing inappropriately during backtracking.[16]

3.4.2.4. Rules for the Purchase-order type

The PurchaseOrderType defines only one attribute, orderDate, of type xsd:date. In addition, we need to accept xsi attributes. No attributes here are required, forbidden, or defaulted, so we don't need any calls to atts_present, atts_absent, or atts_defaulted. Following the patterns described above, this gives us the following definitions for the relevant predicates:

< 42 Attribute handling for purchaseOrderType > ≡

/* t_TYPENAME_atts(Lpa,Lpna,Lavs): true if Lavs contains an 
   input-form list of attribute specifications which 
   is legal for complex type TYPENAME, and which 
   corresponds to the list of parsed attributes Lpa plus
   the list of parsed namespace attributes Lpna. */

t_PurchaseOrderType_atts(Lpa,Lpna,Lavs) :-
  lavs_t_PurchaseOrderType(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_PurchaseOrderType(LpaPres,Lpa).

lavs_t_PurchaseOrderType ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_PurchaseOrderType ::= avs_t_PurchaseOrderType^^Pa, 
                             lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_PurchaseOrderType ::= avs_nsd^^Pa, lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_PurchaseOrderType ::= avs_xsi^^Pa, lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

avs_t_PurchaseOrderType ::= [orderDate=Value],
  { ws_normalize(collapse,Value,SNValue),
    xsd_date_value(SNValue) }
  {Properties for orderDate attribute 50}
.

/* Literally copying the pattern would give us this:

attocc_t_PurchaseOrderType(LpaPres,LpaAll) :-
  atts_present(LpaPres,[]),
  atts_absent(LpaPres,[]),
  atts_defaulted(LpaPres,[],LpaAll).

but that's pointless.  Instead, we'll do the equivalent: */
attocc_t_PurchaseOrderType(L,L).

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

3.4.2.5. White-space normalization of simple types

The rule for the orderDate attribute specifies that whitespace handling (with the keyword collapse) should be done before the attribute value is validated. We haven't done whitespace-normalization yet, so we should stop to define it. We specify a predicate ws_normalize(+kw,+Atom,-Atom), which takes three arguments: a keyword to say what kind of normalization to perform, an atom representing the character string to be normalized, and an atom representing the same string after normalization. (The arguments marked + are expected to be used as input, i.e. the arguments will be instantiated at the time the relation is called; the argument marked - will normally be uninstantiated when the predicate is called and will be bound to an appropriate value. Readers used to other programming languages may think of it, without too much distortion, as a VAR parameter called by reference and used to return the result of a computation.)

There are three values for the keyword, described in the XML Schema 1.0 specification as follows:

preserve No normalization is done, the value is not changed (this is the behavior required by [XML 1.0 (Second Edition)] for element content)

This one is easy to implement: just make the third argument (the output argument) identical to the second.

< 43 Utility for whitespace normalization > ≡

/* ws_normalize(Keyword,Input,Output): true if Output is
   an atom identical to the whitespace-normalized form of Input,
   with the whitespace mode indicated by Keyword. */
ws_normalize(preserve,Atom,Atom).

Continued in <Utility for whitespace normalization 44>, <Utility for whitespace normalization 46>
This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

The second method of normalization is used in XML 1.0 for CDATA attributes:

replace All occurrences of #x9 (tab), #xA (line feed) and #xD (carriage return) are replaced with #x20 (space)

< 44 Utility for whitespace normalization [continues 43 Utility for whitespace normalization] > ≡

ws_normalize(replace,In,Out) :-
  atom_codes(In,Lcin),
  ws_blanks(Lcin,Lcout),
  atom_codes(Out,Lcout).

This one requires an auxiliary predicate to replace all whitespace characters in an atom with blanks; ws_blanks walks through a list, changing each tab, linefeed, or carriage return (characters 9, 10, or 13) to blanks (character 32), and leaving all other characters alone.[17]

< 45 Utility to change whitespace characters to blanks [continues 46 Utility for whitespace normalization] > ≡

/* ws_blanks(A,B): where A has any whitespace, B has a blank */
ws_blanks([],[]).
ws_blanks([9|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([10|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([13|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([H|T1],[H|T2]) :- not(member(H,[9,10,13])), ws_blanks(T1,T2).

The third method of normalization is used in XML 1.0 for non-CDATA attributes:

collapse After the processing implied by replace, contiguous sequences of #x20's are collapsed to a single #x20, and leading and trailing #x20's are removed.

< 46 Utility for whitespace normalization [continues 43 Utility for whitespace normalization] > ≡

ws_normalize(collapse,In,Out) :-
  ws_normalize(replace,In,Temp),
  atom_codes(Temp,Lctemp),
  ws_collapse(Lctemp,Lcout),
  atom_codes(Out,Lcout).

Continued in <Utility to change whitespace characters to blanks 45>, <Utility for collapsing whitespace 47>

This method, too, requires an auxiliary predicate, ws_collapse:

< 47 Utility for collapsing whitespace [continues 46 Utility for whitespace normalization] > ≡

/* ws_collapse(A,B): B is like A, with all strings of blanks collapsed
 * to single blanks, and leading and trailing blanks stripped. 
 * ws_collapse/2 strips leading blanks, then calls ws_collapse/3 */
ws_collapse([],[]).
ws_collapse([32|T1],T2) :- ws_collapse(T1,T2).
ws_collapse([H|T1],[H|T2]) :- not(H=32), ws_collapse(internal,T1,T2).

/* ws_collapse/3 walks past non-blanks, and when it hits a string 
 * of blanks, it drops all but the last one before a non-blank. */
ws_collapse(internal,[],[]).
ws_collapse(internal,[32],[]).
ws_collapse(internal,[H|T1],[H|T2]) :- 
  not(H=32), 
  ws_collapse(internal,T1,T2).
ws_collapse(internal,[32,32|T1],T2) :- ws_collapse(internal,[32|T1],T2).
ws_collapse(internal,[32,H|T1],[32,H|T2]) :- 
  not(H=32), 
  ws_collapse(internal,T1,T2).

3.4.2.6. Rules for purchase-order types, continued

We need to provide grammatical attributes for each of these items.

The non-terminal lavs_t_PurchaseOrderType carries one grammatical attribute, whose value is the list of parsed-attribute nodes which was matched. In the case of the empty list, the attribute is simple:

< 48 Grammatical attributes for empty attribute list > ≡

<:> attributes([])

This code is used in < Attribute handling for purchaseOrderType 42 > < Attribute handling for USAddress 51 > < Attribute handling for Items type 53 > < Attribute handling for t_item 54 > < Attribute handling for simple types 56 >

In the recursion steps, we need to flatten the list; otherwise we end up with a lopsided binary tree, rather than a simple list:

< 49 Grammatical attributes for attribute-list recursion > ≡

<:> attributes([Pa|L]) ::- Lpa^^attributes(L)

The orderDate attribute has the properties usual for attribute information items:

< 50 Properties for orderDate attribute > ≡

  <:> localname('orderDate')
  && namespacename('')
  && normalizedvalue(SNValue)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('date')
  && type_definition_type(simple)
  && schemaspecified(schema)
  && validationattempted(full)
  && validity(valid)

This code is used in < Attribute handling for purchaseOrderType 42 >

3.4.2.7. Rules for other types

The USAddress type defines one attribute (country), of type NMTOKEN; it has a fixed value (US).

< 51 Attribute handling for USAddress > ≡

t_USAddress_atts(Lpa,Lpna,Lavs) :-
  lavs_t_USAddress(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_USAddress(LpaPres,Lpa).

lavs_t_USAddress ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_USAddress ::= avs_t_USAddress^^Pa, 
                     lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_USAddress ::= avs_nsd^^Pa, lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_USAddress ::= avs_xsi^^Pa, lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

avs_t_USAddress ::= [country='US']
  <:> localname('country')
  && namespacename('')
  && normalizedvalue('US')
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.w3.org/2001/XMLSchema')
  && type_definition_name('NMTOKEN')
  && type_definition_type(simple)
  && schemaspecified(schema)
  && validationattempted(full)
  && validity(valid)
.

Continued in <Attribute occurrence checking for USAddress 52>
This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

Since the country attribute has a fixed value, we need to supply a complete parsed-attribute node for use in case the document instance doesn't supply one. We do this as part of the definition of attocc_t_USAddress.

< 52 Attribute occurrence checking for USAddress [continues 51 Attribute handling for USAddress] > ≡

attocc_t_USAddress(LpaPresent,LpaAll) :-
  CountryAtt = node(
    attribute(country),
    [],
    [ (namespacename('')),
      (localname('country')),
      (normalizedvalue('US')),
      (type_definition_anonymous('false')),
      (type_definition_namespace('http://www.w3.org/2001/XMLSchema')),
      (type_definition_name('NMTOKEN')),
      (type_definition_type(simple)),
      (schemaspecified(schema)),
      (validationattempted(full)),
      (validity(valid)),
    ]),
  atts_defaulted(LpaPres,[CountryAtt],LpaAll).

The complex type t_Items defines no attributes, so its grammar for attributes only has rules for namespace declarations and attributes in the XSI namespace. Since there are no attributes, there are no required, defaulted, or forbidden attributes, so we don't need the usual call to attocc_Type.

< 53 Attribute handling for Items type > ≡

t_Items_atts(Lpa,Lpna,Lavs) :-
  lavs_t_Items(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna).

lavs_t_Items ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_Items ::= avs_nsd^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_Items ::= avs_xsi^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

A similar simplification can be used for simple types.

The complex type t_item defines the partNum attribute:

< 54 Attribute handling for t_item > ≡

t_item_atts(Lpa,Lpna,Lavs) :-
  lavs_t_item(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_item(LpaPres,Lpa).

lavs_t_item ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_item ::= avs_t_item^^Pa, 
                lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_item ::= avs_nsd^^Pa, lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_item ::= avs_xsi^^Pa, lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

Continued in <PartNum attribute 55>
This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

The grammatical attributes for the partNum attribute illustrate PSVI properties for user-defined types.

< 55 PartNum attribute [continues 54 Attribute handling for t_item] > ≡

avs_t_item ::= [partNum=Value],
  { t_SKU_value(Value) }
  <:> localname('partNum')
  && namespacename('')
  && normalizedvalue(Value)
  && type_definition_anonymous('false')
  && type_definition_namespace('http://www.example.com/PO1')
  && type_definition_name('SKU')
  && type_definition_type(simple)
  && schemaspecified(schema)
  && validationattempted(full)
  && validity(valid)
.

/* one required attribute: partNum */
attocc_t_item(LpaPres,LpaAll) :-
  atts_present(LpaPres,['':partNum]),
  atts_absent(LpaPres,[]),
  atts_defaulted(LpaPres,[],LpaAll).

A single set of rules will suffice for all simple types (string, decimal, integer, date), because by definition simple types have no attributes; any attributes which occur in the instance must be namespace declarations or XSI attributes.

< 56 Attribute handling for simple types > ≡

sT_atts(Lpa,Lpna,Lavs) :-
  lavs_sT(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna).

lavs_sT ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_sT ::= avs_nsd^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_sT ::= avs_xsi^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

3.4.2.8. Partitioning the list

The rule for partitioning the list of parsed attribute nodes must extract the actual list from the node passed as the first argument, and then the partition is easy:

< 57 partition predicate > ≡

partition(LpaAll,LpaPresent,Lpna) :-
  LpaAll^^attributes(L),
  partition2(L,LpaPresent,Lpna).
partition2([],[],[]).
partition2([Pa|Lpa],LpaPres,[Pa|Lpna]) :-
  Pa^^localname(xmlns), 
  partition2(Lpa,LpaPres,Lpna).
partition2([Pa|Lpa],LpaPres,[Pa|Lpna]) :-
  Pa^^namespacename('http://www.w3.org/2000/xmlns/'), 
  partition2(Lpa,LpaPres,Lpna).
partition2([Pa|Lpa],[Pa|LpaPres],Lpna) :-
  not(Pa^^localname(xmlns)),
  not(Pa^^namespacename('http://www.w3.org/2000/xmlns/')),
  partition2(Lpa,LpaPres,Lpna).

This code is used in < Generic utilities for DCTG-encoded schemas 85 > < Generic utilities for DCTG-encoded schemas 158 >

3.4.3. Rules for content of complex types

As in the DCG grammar given earlier, the most conventional-looking part of our DCTG grammar is the representation of the content models. The base context-free grammar in DCTG notation is substantially the same as the DCG version, except that we add names to the various items on the right-hand side, for use in flattening the lists of children (repeating and optional items otherwise would cause nesting of nodes).

< 58 Rules for purchase-order content models > ≡

t_PurchaseOrderType_cont ::= 
  e_shipTo^^S, e_billTo^^B, opt_e_comment^^C, 
  e_items^^I
{Children attribute of t_PurchaseOrder 62}

.
opt_e_comment ::= []
{Empty list of children for opt_e_comment nonterminal 60}

.
opt_e_comment ::= e_comment^^Comm
{Children for opt_e_comment nonterminal 61}

.

t_USAddress_cont ::= 
  e_name^^N, e_street^^S, e_city^^C, e_state^^ST, e_zip^^Z
{Children attribute of t_USAddress 59}

.

t_Items_cont ::= star_e_item^^L
{Children attribute of t_Items_cont 66}

.
star_e_item    ::= []
{Empty list of children for star_e_item nonterminal 67}

.
star_e_item    ::= e_item^^I, star_e_item^^L
{Children for star_e_item nonterminal 68}

.

t_item_cont ::= e_productName^^PN, e_quantity^^Q, e_USPrice^^USP, 
                opt_e_comment^^C, opt_e_shipDate^^S
{Children attribute of t_item 63}

.

opt_e_shipDate ::= []
{Empty list of children for opt_e_shipdate nonterminal 64}

.
opt_e_shipDate ::= e_shipDate^^S
{Children for opt_e_shipdate nonterminal 65}

.

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

The only grammatical attribute we need to calculate for these non-terminals right now is children, which will be used to supply the children property of the parent element. In order to supply a flat list, rather than an arbitrarily deep one-sided binary tree, we can't simply take the node returned by each rule.

Perhaps the simplest to calculate is the children attribute of the t_USAddress_cont non-terminal: it's just a list of the children. Since no child is optional, there is no variation.

< 59 Children attribute of t_USAddress > ≡

  <:> children([N,S,C,ST,Z])

This code is used in < Rules for purchase-order content models 58 >

More complex, because the comment element is optional, is the children attribute of the t_PurchaseOrderType non-terminal. When a comment is present, we want it listed among the children; when it is not present, however, we don't want any dummy node. The opt_e_comment non-terminal, that is, should have a children attribute which is either the empty list

< 60 Empty list of children for opt_e_comment nonterminal > ≡

  <:> children([])

This code is used in < Rules for purchase-order content models 58 >

or a list containing the comment node.

< 61 Children for opt_e_comment nonterminal > ≡

  <:> children([Comm])

This code is used in < Rules for purchase-order content models 58 >

The standard list-concatenation methods can now be used to yield either [S,B,C,I] or [S,B,I]. The simplest is probably to use flatten, which generates, for a list possibly containing lists as elements, a flat list with no nested lists, by replacing each list with its elements.

< 62 Children attribute of t_PurchaseOrder > ≡

  <:> children(Lpe) ::- 
    C^^children(CC), 
    flatten([S,B,CC,I],Lpe)

This code is used in < Rules for purchase-order content models 58 >

A similar method is used for the item element, which also has optional children.

< 63 Children attribute of t_item > ≡

  <:> children(Lpe) ::- 
    C^^children(CC), 
    S^^children(SC), 
    flatten([PN,Q,USP,CC,SC],Lpe)

This code is used in < Rules for purchase-order content models 58 >

This requires that the opt_e_shipDate non-terminal produce (like opt_e_comment) its own children property:

< 64 Empty list of children for opt_e_shipdate nonterminal > ≡

  <:> children([])

This code is used in < Rules for purchase-order content models 58 >

< 65 Children for opt_e_shipdate nonterminal > ≡

  <:> children([S])

This code is used in < Rules for purchase-order content models 58 >

The items element is just a simple list; its children property can be done using the same methods we used to generate a flat list of attributes, above.

< 66 Children attribute of t_Items_cont > ≡

  <:> children(List) ::- L^^children(List)

This code is used in < Rules for purchase-order content models 58 >

< 67 Empty list of children for star_e_item nonterminal > ≡

  <:> children([])

This code is used in < Rules for purchase-order content models 58 >

< 68 Children for star_e_item nonterminal > ≡

  <:> children([I|T]) ::- L^^children(T)

This code is used in < Rules for purchase-order content models 58 >

3.4.4. Rules for checking values of simple types

In this layer of our development, we don't need any new rules for checking simple types; we'll just reuse the ones shown above for the DCG version of this schema. We do have to rewrite them in DCTG notation.

3.4.4.1. Built-in simple types

There are two kinds of rules to provide (this may be an unnecessary distinction, but it's what the rest of the program is expecting): xsd_Typename_cont (called from element rules) and xsd_Typename_value (called from elsewhere).

< 69 xsd_Typename_cont rules > ≡

xsd_string_cont ::= [Atom], { xsd_string_value(Atom) }.
xsd_decimal_cont ::= [Atom], { xsd_decimal_value(Atom) }.
xsd_integer_cont ::= [Atom], { xsd_integer_value(Atom) }.
xsd_date_cont ::= [Atom], { xsd_date_value(Atom) }.

This code is used in < Generic utilities for DCTG-encoded schemas 85 >

In this version of our schema, we will do a little better job of checking the validity of the lexical form. Strings remain trivial:

< 70 xsd_Typename_value rules > ≡

/* In our representation of XML, character data is represented
 * as atoms.  Handling of non-ASCII characters is OK if they are
 * in UTF8, but the SWI parser currently has trouble with 
 * some named entity references to non-ASCII characters */
xsd_string_value(LF) :- atom(LF).

Continued in <Checking decimal and integer values 71>, <Checking date values 72>, <Lexical form for year 73>, <Lexical form for month 76>, <Lexical form for day of month 77>, <Checking date values 78>, <Checking date values 79>
This code is used in < Generic utilities for DCTG-encoded schemas 85 >

Decimals match the pattern [+-]? [0-9]+ ('.' [0-9]*), integers match [+-]? [0-9]+ — we'll use DCTG notation to check the lexical form against these patterns:

< 71 Checking decimal and integer values [continues 70 xsd_Typename_value rules] > ≡

xsd_decimal_value(LF) :- 
  atom_chars(LF,L),
  lexform_decimal(_,L,[]).
xsd_integer_value(LF) :- 
  atom_chars(LF,L),
  lexform_integer(_,L,[]).

lexform_decimal ::= lexform_integer, fractionalpart.
lexform_integer ::= opt_sign, digits.
fractionalpart ::= [].
fractionalpart ::= decimalpoint.
fractionalpart ::= decimalpoint, opt_digits.
opt_sign ::= [].
opt_sign ::= ['+'].
opt_sign ::= ['-'].
decimalpoint ::= ['.'].
opt_digits ::= [].
opt_digits ::= digits.
/* We supply a 'lexval' property on digits, for use in date checking */
digits ::= digit^^D
  <:> lexval([Dv]) ::- D^^lexval(Dv).
digits ::= digit^^D1, digits^^Dd
  <:> lexval([D1val|Ddval]) ::- D1^^lexval(D1val), Dd^^lexval(Ddval).
digit ::= [Ch], { char_type(Ch,digit) }
  <:> lexval(Ch).

Date values can be checked fully using an appropriate grammar; the grammatical attributes of the DCTG notation make it easy to express the leap-year constraints. A lexical form for date is OK if the date is ok; the predicate dateok takes the integer values of year, month, and day as arguments.

< 72 Checking date values [continues 70 xsd_Typename_value rules] > ≡

xsd_date_value(LF) :- 
  atom_chars(LF,LC),
  lexform_date(_,LC,[]).
lexform_date ::= year^^Y, hyphen, month^^M, hyphen, day^^D,
  { Y^^val(Yv), M^^val(Mv), D^^val(Dv),
    dateok(Yv,Mv,Dv) }.

Years may take an optional leading minus sign; their value (the val property) is composed by reading their lexical form as a number (using the standard number_chars predicate).

< 73 Lexical form for year [continues 70 xsd_Typename_value rules] > ≡

/* Years must have at least four digits */
yearnum ::= digit^^D1, digit^^D2, digit^^D3, digits^^Dd
  <:> val(Num) ::- D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    D3^^lexval(Dv3),
    Dd^^lexval(Dv4),
    flatten([Dv1,Dv2,Dv3,Dv4],LF),
    number_chars(Num,LF).
year ::= yearnum^^Y
  <:> val(Num) ::- Y^^val(Num).
year ::= ['-'], yearnum^^Y
  <:> val(Num) ::- Y^^val(N), Num is 0 - N.
hyphen ::= ['-'].

We can, in principle, constrain month values in purely grammatical terms:

< 74 Purely grammatical rule for month > ≡

month ::= ['0'], ['1'].
month ::= ['0'], ['2'].
...
month ::= ['0'], ['9'].
month ::= ['1'], ['0'].
month ::= ['1'], ['1'].
month ::= ['1'], ['2'].

This code is not used elsewhere.

It's a little more compact if we use the number_chars predicate and test the number arithmetically.

< 75 Semi-grammatical rule for month > ≡

month ::= ['0'], digit^^D
  { D^^lexval(Dv), number_chars(V,Dv), V > 0 }
  <:> val(V).
month ::= ['1'], digit^^D
  { D^^lexval(Dv), number_chars(V,Dv), V < 3 }
  <:> val(Val) ::- Val is 10 + V.

This code is not used elsewhere.

And it's easiest to follow, probably, if the context-free part of the grammar allows any two-digit number and we have a guard do the range check, arithmetically:

< 76 Lexical form for month [continues 70 xsd_Typename_value rules] > ≡

month ::= digit^^D1, digit^^D2,
  { D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    number_chars(Num,[Dv1,Dv2]),
    Num > 0,
    Num < 13 }
  <:> val(Num).

We'll do the same for day of the month:

< 77 Lexical form for day of month [continues 70 xsd_Typename_value rules] > ≡

day ::= digit^^D1, digit^^D2,
  { D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    number_chars(Num,[Dv1,Dv2]),
    Num > 0,
    Num < 32 }
  <:> val(Num).

No one is ever happy, of course, unless a date field is also checked for correct handling of leap years. The rules below are one way to do this, and not necessarily the most elegant, but relatively easy to understand: a date is OK if the day of the month is between 1 and 28, inclusive (I am relying on the range checks already performed in the grammar rules), or if the day is 29 or 30 and the month is not 2, or if the day is 31 and the month is one of those which has 31 days. Or, finally, it's OK if the year is divisible by four and its divisibility by 100 and 400 is OK. The latter is just complex enough to be worth putting into a separate predicate.

< 78 Checking date values [continues 70 xsd_Typename_value rules] > ≡

dateok(_Y,_M,D) :- D < 29.
dateok(_Y,M,29) :- M =\= 2.
dateok(_Y,M,30) :- M =\= 2.
dateok(_Y,M,31) :- member(M,[1,3,5,7,8,10,12]).
dateok(Y,2,29) :- (Y >= 0 -> Yx = Y ; Yx is Y + 1), /* adjust for BC */
  0 is Yx mod 4,
  Lc is Yx mod 100,
  L4c is Yx mod 400,
  leapyearcheck(Lc,L4c).

A year is a leap year if (it is divisible by 4, but we already have that) it is not divisible by 100, or else if it is divisible by 400.

< 79 Checking date values [continues 70 xsd_Typename_value rules] > ≡

leapyearcheck(C,_Q) :- C =\= 0. /* it's not a century year, so leapyear */
leapyearcheck(0,0).            /* it's a quad-century year, so leapyear */

3.4.4.2. Checking simple types SKU and quantity

We also need to have rules for checking the two simple types declared in the purchase-order schema: SKUs and quantities.

t_quantity = /complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
t_SKU = /simpleType(SKU)

The content rules are simple and follow the same pattern as those of the builtin types.

< 80 Simple-type content rules for purchase-order types > ≡

t_SKU_cont ::= [Atom], { t_SKU_value(Atom) }.
t_quantity_cont ::= [Atom], { t_quantity_value(Atom) }.

This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

The SKU value checking has been seen before; all we do here is translate it from DCG into DCTG notation.

< 81 Value-checking rules for SKU > ≡

t_SKU_value(LF) :- 
  atom_chars(LF,Charseq),
  lexform_sku(_Structure,Charseq,[]).

lexform_sku ::= sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part ::= digit, digit, digit.
sku_alpha_part ::= cap_a_z, cap_a_z.
cap_a_z ::= [Char], { char_type(Char,upper) }.

Continued in <Value-checking rules for quantities 82>
This code is used in < Predicates for purchase-order material 84 > < Predicates for purchase-order material 154 >

The quantity value checking can rely in part on the rule for integers:

< 82 Value-checking rules for quantities [continues 81 Value-checking rules for SKU] > ≡

t_quantity_value(LF) :- 
  atom_chars(LF,Lchars),
  lexform_integer(_,Lchars,[]),
  number_chars(Num,Lchars),
  Num < 100.

3.4.5. Overview and Summary

Now we need to put all of this together.

The exposition above has attempted to develop the layer-3 grammar gradually, with a mix of top-down and bottom-up development.

For reference purposes, it may be useful to summarize some of the patterns and conventions used in layer 3; that is the purpose of the following sections. We begin with the top level of program podctg.pl; the program fragments given below show which fragments given above are included in the program and in what order. We then outline the basic correspondences between the schema and the logic grammar, enumerate the basic naming conventions used, and list some utilities which are schema-independent.

3.4.5.1. Top level of program podctg.pl

The program podctg.pl comprises two kinds of material: predicates specific to the purchase-order schema and generic predicates which would be part of the DCTG representation of any schema.

< 83 DCTG version of the purchase order schema, layer 3 [File podctg.pl] > ≡

/* podctg.pl: a definite-clause translation grammar representation
 * of the sample purchase-order schema from the XML Schema tutorial.
 *
 * This DCTG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
{Predicates for purchase-order material 84}

{Generic utilities for DCTG-encoded schemas 85}

The schema-specific matter includes the element- and type-specific rules:

< 84 Predicates for purchase-order material > ≡

{Rules for elements with complex types 31}

{Rules for elements with simple types 33}

{Attribute handling for purchaseOrderType 42}

{Attribute handling for USAddress 51}

{Attribute handling for Items type 53}

{Attribute handling for t_item 54}

{Attribute handling for simple types 56}

{Rules for purchase-order content models 58}

{Simple-type content rules for purchase-order types 80}

{Value-checking rules for SKU 81}

This code is used in < DCTG version of the purchase order schema, layer 3 83 > < DCTG version of the purchase order schema, layer 4 153 >

The generic material includes rules for namespace declarations and XSI attributes, the generic routines atts_present etc. for checking attribute-occurrence constraints, and the utility predicates for whitespace normalization and partitioning the attribute list into namespace attributes and others. In addition, it contains the few type-checking predicates we have written for checking built-in simple types.

< 85 Generic utilities for DCTG-encoded schemas > ≡

{Grammar rules for namespace and XSI attributes 37}

{Utilities for checking attribute occurrences 38}

{Utility for checking absent attributes 39}

{Utility for providing defaulted attributes 40}

{Utility for whitespace normalization 43}

{partition predicate 57}

{xsd_Typename_cont rules 69}

{xsd_Typename_value rules 70}

This code is used in < DCTG version of the purchase order schema, layer 3 83 > < DCTG version of the purchase order schema, layer 4 153 >

3.4.5.2. Basic patterns

The basic rules for constructing the layer-3 logic grammar are:

The content model of each complex type T is translated into a grammar which is used to validate the contents of each element declared as having type T. The grammar corresponding to a given content model defines a regular language; the only recursion in the grammar is to handle repetition. Elements found in the content are treated, for purposes of this regular grammar, as if they were lexical items; the recursive validation of their content is accomplished by separate calls to the parser.

The grammar for the content model of a complex type TN has start-symbol t_TN_cont.
For example:
```
t_USAddress_cont ::= 
  e_name^^N, e_street^^S, e_city^^C, e_state^^ST, e_zip^^Z
  <:> children([N,S,C,ST,Z]).
```
In some cases, the grammar will have several rules, to cover repetitions and choices.
The attributes associated with each complex type T are checked by an attribute-checking grammar which also checks for namespace declarations and attributes in the XSI namespace.

The predicate used to perform attribute checking for a given complex type TN is called TN_atts; the grammar for the attributes of TN has the start-symbol lavs_TN.
The generic pattern for a type named TN is:
```
TN_atts(Lpa,Lpna,Lavs) :-
  lavs_TN(LpaAll,Lavs,[]),    /* parse against grammar of attributes */
  partition(LpaAll,LpaPresent,Lpna),         /* partition the result */
  attocc_TN(LpaPresent,Lpa).      /* check min, max occurrence rules */
```
The auxiliary predicate attocc_TN is used to check for required and forbidden attributes.
Individual elements in the document instance are matched by trivial grammar rules which validate their attributes and contents by recursively invoking the parser to check the attributes and the content of the element.

The pre-terminal symbol which matches a single element with generic identifier GI is named e_GI.
The pattern for such rules is:
```
e_NCName ::= [element(NS:GI,Attributes,Content)],
  {
    t_TN_atts(A,NA,Attributes),
    t_TN_cont(C,Content,[])
  }
  <:> attributes(A)               
  && namespaceAttributes(NA)
  /* etc. */
.
```
When the element has a simple type, the pattern is identical except that it uses a standard sT_atts predicate instead of t_TN_atts.
Simple types are checked by (a) a grammar defining their legal lexical forms, and (b) ad hoc rules checking other facets.
A lexical form which occurs as the content of an element associated with simple type TN is matched by a trivial grammar rule for the non-terminal t_TN_cont. That rule has semantic actions which fire to check the lexical form and value. The pattern is:
```
t_TN_cont ::= [Atom], { t_TN_value(Atom) }.
```
The predicate which checks whether a lexical form is legal for a given simple type TN is named t_TN_value (or xsd_TN_value, for built-in simple types); it calls the parser to check the lexical form and performs any value-related tests itself. The pattern is
```
t_TN_value(LF) :- 
  atom_chars(LF,Charseq),
  lexform_TN(_Structure,Charseq,[])
  /* additional value checks */
  .
```
The grammar for the lexical form of a simple type TN has start-symbol lexform_TN.
For example:
```
lexform_sku ::= sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part ::= digit, digit, digit.
sku_alpha_part ::= cap_a_z, cap_a_z.
cap_a_z ::= [Char], { char_type(Char,upper) }.
```

3.4.5.3. Naming conventions

Identifiers for elements, types, etc.

e_GI: refers to an element; in simple cases (as here) the element with GI as its generic identifier. In other cases GI will be chosen arbitrarily to avoid name clashes. Also used as a grammar non-terminal.
t_TN: refers to a type; in simple cases, TN is the local name of the type.
xsd_TN: denotes a built-in simple type.

Predicates (other than those corresponding to non-terminals in the grammar):

t_TN_atts(-Lpa,-Lpna,+Lavs): predicate true iff Lavs is a list of attribute-value pairs, in the form used by SWI Prolog, which is legal on elements of complex type TN, and which corresponds to the union of Lpa (a list of parsed attribute nodes) and Lpna (a list of parsed namespace-attribute nodes)
TN_value(LF): true iff lexical form LF is a legal lexical form and denotes a legal value for simple type TN.

Non-terminal symbols in the grammar:

e_GI: parses one occurrence of the element type in question, producing parsed element node.
t_TN_cont: parses the content of an element of type t_TN; the rule is a translation of the content model and produces a list of parsed nodes.
lavs_TN: a list of attributes legal for type TN.
avs_TN: a single attribute legal for type TN.
avs_nsd: a single namespace attribute.
avs_xsi: a single attribute in the XSI namespace.
lexform_TN: lexical form for type TN.

3.4.5.4. Generic tools

Several utilities are included which we should not lose track of:

partition(+LpaAll, -LpaPresent, -Lpna): true iff LpaAll is the union of LpaPresent (normal parsed attribute nodes) and Lpna (parsed namespace attribute nodes), with the latter two disjoint.
atts_present(+Lpa, +Lreq): true iff every name in Lreq is the name of a parsed attribute node in Lpa.
atts_absent(+Lpa, +Lno): true iff no name in Lno is the name of a parsed attribute node in Lpa.
atts_defaulted(+Lpa, +Ldfts, -LpaAll): true iff LpaAll contains every parsed attributed node in Lpa, plus additionally every node in Ldfts which has a name which doesn't match anything in Lpa.
ws_normalize(+Kw, +In, -Out): true iff applying to In the whitespace normalization associated with Kw yields Out; In and Out are atoms. There are three cases: Kw must be one of preserve, replace, or collapse.

3.4.6. Evaluation

Initial tests of the layer-3 grammar over the test suite uncovered a number of problems: Prolog syntax errors, typos in variable names, and so on. After correcting the errors, the layer-3 grammar accepted all the valid test cases, but continued to die with infinite loops on some. The infinite-loop problem was corrected by adding the nonvar tests to the att_merge predicate; as noted above, adding it introduces an undesirable procedural note to an account of the purchase-order schema we are trying to keep as declarative as possible, but it does not particularly distort the declarative meaning, it only clutters up the predicate.

The grammar as shown above (after the corrections just described) accepts all the valid test cases, and rejects all but one of the error test cases. That test case is E35, which supplies a duplicate attribute; in theory, the upstream XML parser should be catching that well-formedness error, and so I have chosen not to modify the grammar to catch it.

There is one more serious problem remaining in the grammar shown above. The upward propagation of grammatical attributes is sometimes a problem here: the value of the children property of the elements here should be a list of child nodes, but in fact it's a node structure named t_Typename_cont. Instead of

nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[])
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(C) 
  && localname(gi) 
  && namespacename(n).

the rule pattern should read

nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[]),
    C^^children(Ch)
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(Ch) 
  && localname(gi) 
  && namespacename(n).

or else the attribute-value assignment should read differently: not

&& children(C)

but

&& children(Ch) ::- C^^children(Ch)

The children property of elements with simple types also has an unexpected value.

The flaws in the assignment of values to grammatical attributes are shown clearly when a pretty-printing predicate is used or written.

These problems will be corrected in the grammar for layer 4, given in the next section.

4. Validity, validation-attempted, and error handling (layer 4)

N.B. this section is incomplete.

4.1. Goals

The primary goal for layer 4 is to add more useful grammatical attributes to the parsed nodes. In particular, unlike the layer 3 parser, which simply fails when confronted with input which does not conform to the grammar, layer 4 should be able to handle invalid input and return a parsed node structure with values other than valid for the [validity] property, and with appropriate error codes as the value of the [schema error code] property.

The following sections provide a high-level description of these goals and how we'll achieve them.

4.1.1. Additional PSVI properties

In Layer 4, several more PSVI properties will be provided by the grammar. Some of these are simple: properties to identify the type against which a given element or attribute was validated, or to say if it was nil. Some are more challenging: providing error codes (and returning a fully parsed result showing what parts are valid and what parts are not valid).

There are some PSVI properties still not represented; all are either intrinsically optional or else represent constructs not present in the purchase-order schema.

Specifics are given below in section Validation of elements, section Validation of attributes, and section Overview of layer-4 grammar.

4.1.2. Handling invalid and partially valid input

To handle invalid input, we need to alter either the shape of the grammar or the way we invoke it.

We could change the grammar by including fall-back productions which will accept any well-formed XML as input, no matter what the schema says. This would be a bit like writing a ‘fuzzy grammar’, which assigns continuous values between 0 and 1 to each parse of a document (in practice, we could presumably get by with the eleven values 0.0, 0.1, 0.2, ... 0.9, 1.0, which is a long way from a continuous scale but also a long long way from a conventional parser). For example, we could specify that a valid instance of a complex type gets the value 1.0, an instance which has only the correct children, all valid but in the wrong order scores 0.5, and one which includes other elements scores 0.1. If some of the children are invalid, we can depress the score further.

Such a grammar is interesting, but perhaps better reserved for another project: the necessary grammatical transformations would render the grammar less and less similar to what the schema author actually specifies, and in fact it appears that a simple fallback to lax processing is more in keeping with the rules of [W3C 2001b] than a fuzzy grammar.[18]

Alternatively we can change the way we invoke the grammar, by wrapping the call to the grammar in a new predicate, which provides a sort of fall-back, along lines like these:

sva_purchaseOrder_children(Children,Parsed,valid) :-
  t_PurchaseOrderType_cont(Parsed,Content,[]).
sva_purchaseOrder_children(Children,Parsed,
                             invalid(cvc-complex-type.2.4)) :-
  not(t_PurchaseOrderType_cont(C,Content,[])),
  sva_lax(Parsed,Children,[]).

where the predicate sva_lax(-Result,+Input,+-Remainder) performs lax validation on Input, returning a Result node.

The sva_purchaseOrder_children (‘schema-validity assessment of purchase-order children’) succeeds both when t_PurchaseOrderType_cont succeeds and also when it fails (as long as sva_lax succeeds, which it should always do). The final argument unifies with either valid or with an error code, so we can tell what is happening, and we can assign the correct value to the [validity] property of the parent element. This amounts to writing two grammars, one which accepts valid documents only and one which accepts anything and performs lax validation on it. The lax grammar does not vary from schema to schema and can therefore be part of the schema processor infrastructure.

Changing the way we use the grammar, in turn, will be made easier if we partially reify some of the schema components we care most about: lax validation requires that we look things up by QName, and that in turn is easy if we have simple predicates or data structures which make that possible.

4.1.3. Reifying the schema

Several things will be easier if we reify our schema components a little bit. Lax validation requires lookup by name, which is easier if there is something to look up. Debugging is easier when we can dump out our information about the schema conveniently. And reifying the schema components lays the foundation for moving from the extremely repetitive predicates of level 3 to less repetitive predicates which capture regularities in the grammar and in the rules for schema-validity assessment. In later work (not shown here) we will turn the patterns for predicates not into many very similar predicates and grammar rules different only in parts of their names, but instead turn them into single predicates which take names of elements or types as parameters.

4.1.4. Naming and argument conventions

NOT FINISHED, INCONSISTENT. Rather than try to clean this up completely, I have cleaned it partially and left question marks next to rules I wasn't sure of. Delete the question mark when you're sure.

To make it easier to follow the code, the same naming conventions for predicates, grammar rules, and arguments will be followed as above, with a few additions and modifications.

Identifiers for elements, types, etc.

e_GI: refers to an element
t_TN: refers to a type
xsd_TN: refers to a built-in simple type

Predicates (other than non-terminals):

sva_Args: predicate which performs schema-validity assessment; the Args part of the name distinguishes what kind of assessment is happening. The last argument is invariably either a list of errors or the keyword ok.
sva_eii(Element_info_item, Parsed_node, Lerrors) for schema-validity assessment of an element information item resulting in a parsed node and a list of errors
[?] sva_eii_edecl(Element_info_item, Element_declaration, Parsed_node, Lerrors) for schema-validity assessment of an element information item against an element declaration, resulting in a parsed node and a list of errors
[?] sva_eii_type(Element_info_item, Type_definition, Parsed_node, Lerrors) for schema-validity assessment of an element information item against a type definition, resulting in a parsed node and a list of errors
? sva_lax(TN, Content, L_Parsed_node, Lerrors) for lax assessment of a list of content information items resulting in a list of parsed nodes and a list of errors [why did I write this with a TN argument?]
sva_st_lf(Simple_type_id,Lexical_form,Result,Lerrors) for assessment of the validity of a lexical form with respect to a simple type, resulting in a keyword summary of the result and a list of errors[19]
? sva_atts_t_TN(+Lavs,-Lpa,-Lpna,-Lerr): predicate true iff Lavs is a list of attribute-value pairs, in the form used by SWI Prolog, which when validated against complex type TN yields Lpa (a list of parsed attribute nodes), Lpna (a list of parsed namespace-attribute nodes), and Lerr (a list of error diagnostics, or ok).
? OR: sva_ct_atts(+TN, +Lavs,-Lpa,-Lpna,-Lerr): predicate true iff Lavs is a list of attribute-value pairs, in the form used by SWI Prolog, which when validated against complex type TN yields Lpa (a list of parsed attribute nodes), Lpna (a list of parsed namespace-attribute nodes), and Lerr (a list of error diagnostics, or ok).
? sva_content_t_TN(+Content,-Result,-Lerr): predicate true iff Content is a list of items in the form used by SWI Prolog, which when validated against complex type TN yields Result (a list of parsed element and PCDATA nodes) and Lerr (a list of error diagnostics, or ok).
? OR: sva_content_ct(+TN, +Content,-Result,-Lerr): predicate true iff Content is a list of items in the form used by SWI Prolog, which when validated against complex type TN yields Result (a list of parsed element and PCDATA nodes) and Lerr (a list of error diagnostics, or ok).
? doubtful: TN_value(LF): true iff lexical form LF is a legal lexical form and denotes a legal value for simple type TN.
? better?: sva_st_lf(TN,LF,Lerr): true iff when lexical form LF is validated as a lexical form for simple type TN, it produces errors Lerr.

Non-terminal symbols in the grammar:

? e_GI: parses one occurrence of the element type in question, producing parsed element node.
? cont_t_TN: parses the content of an element of type t_TN; the rule is a translation of the content model and produces a list of parsed nodes.
>? lavs_TN: a list of attributes legal for type TN.
? avs_TN: a single attribute legal for type TN.
? avs_nsd: a single namespace attribute.
? avs_xsi: a single attribute in the XSI namespace.
? TN: lexical form for type TN. (alt: lexform_TN)
? gr_XXX: other non-terminal in grammar
? gr_rep: other non-terminal in grammar: a repetition of X (almost certainly this should be replaced by a repetition parameter).
? gr_XXX: other non-terminal in grammar

4.1.5. Author's note to reviewers

As noted above, layer 4 of the grammar is not yet complete in this version of this paper. The to-do list for layer 4 includes:

Add tests involving xsi:type, xsi:nil, other (legal and illegal) xsi attributes to the test collection.
Make simple-type checking always succeed, and return either a success signal or an error.
Make attribute checking always succeed, and return either a success signal or an error.
write sva_lax_seq(LInputnodes,LParsednodes) (promised in Validation of simple types and Lax validation of sequences). (Check it, fix it to handle xsi:type as well.)
complete the list of content-model checking errors
partition the list of content-model checking errors differently? Examine the grammar of layer 3, and the list of errors, to identify
- errors which are already caught explicitly (and where)
- errors which cause failure of validation (and thus need to be caught, so validation can succeed)
- errors which cannot arise in the purchase-order schema (? on second thought, I'm skeptical that there can be any; if any of these errors is not thrown, I suspect validation of a document with that error would fail)
add content-model checking errors to lax validation rule
make content-model checking fall back to lax checking, if strict checking fails, and return either a success signal or an error
provide a schemaErrorCode property on each parsed attribute and element
check xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes; it is an error if they occur after the first use of the namespace they describe
make element processing handle xsi:type specificaitons
add test checking for lax validation of element information items

4.1.6. Summary of goals

Putting these together, we get the following list of goals for the layer-4 grammar:

Add more PSVI properties to the parsed nodes.
Succeed on invalid data, and produce parsed nodes with appropriate values in the [validity] property and appropriate error diagnostics in the [schema error code] properties.
Use the error codes suggested by [W3C 2001b] to diagnose the errors caught by the grammar. (But we will not attempt to change the grammar in order to test for, and raise, every identifiable error, lest it involve a radical restructuring of the grammar. That is a task for a separate paper, specifically [Sperberg-McQueen 2003a].)
Allow the instance to contain xsi:type attributes, and handle them appropriately. In the purchase order schema as written, there is very little scope for the use of this attribute, but if further components were added there could be somewhat more scope.
Formulate the translation of the purchase-order schema in such a way as to make it easier to combine it with another schema. For the moment, this means simply adding some indirection at some points rather than assuming that we already know exactly which type will be associated with each element in an instance.
Partially reify the components of the schema, to make them easier to look up and act upon.
Reduce the redundancy of the grammar. Previous layers of the schema have had several predicates for each element declaration and each type definition, which have been virtually identical except that the unique identifier of the element or type has been inserted at various places into the basic skeleton. Rewrite these rules to make them capture the common processes of validation better, and use the names of elements and types as parameters.

4.2. Prolog representation of schema components

To simplify lookup and to make it easier to inspect the Prolog representation of a schema, we represent (some of) the components of a schema as a sequence of Prolog facts. The predicate of each fact identifies the component type, and each predicate takes the same arguments:

a unique identifier for the component,[20]
its schema-component designator as defined in [Holstege/Vedamuthu 2003] (this is not currently used anywhere, but it's handy for navigation)
its target namespace
its local name
whether it is global or local
a list of properties, represented using the same list technique as is used to represent grammatical attributes in the DCTG, so the ^^ operator can be used to extract the property values. In this example, the properties are in fact all literals, but we could use the ::- pseudo-neck operator to specify a rule, if we needed to.

In addition to the properties defined by the spec, a few additional properties (id, scd) are also specified. The ID and SCD thus occur twice in each fact: as fixed-position arguments for ease in lookup, and in the property list (so the property list can be passed around as a variable when necessary, without it being necessary to re-consult the fact from which it was derived).

The first argument is used for direct references (where a C program might use a pointer, or a Java program an object reference); the third, fourth, and fifth arguments are used for QName-based lookup, when that is required.

4.2.1. Element declarations

Element declarations look like this:

< 86 Element declaration: purchaseOrder > ≡

elemdecl(e_purchaseOrder, '/element(purchaseOrder)',
  purchaseOrder, 'http://www.example.com/PO1', 
  global, [
    (id(e_purchaseOrder)),
    (scd('/element(po:purchaseOrder)')),
    (name(purchaseOrder)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_PurchaseOrderType)),
    (scope(keyword(global))),
    (value_constraint(keyword(absent))),
    (nillable(false)),
    (identity_constraint_definitions([])),
    (substitution_group_affiliation(keyword(absent))),
    (disallowed_substitutions([])),
    (substitution_group_exclusions([])),
    (abstract(false)),
    (annotation(keyword(absent)))
]).

This code is used in < Element declarations 155 >

The grammatical attributes, apart from id and scd, correspond directly in obvious ways to the properties specified in the XML Schema specification. The following should be noted concerning the Prolog representation of values:

An NCName is represented by an atom.
A namespace name is represented by an atom (which happens, when viewed as a string, to be a URI).
An absent value is represented by the structure keyword(absent).
A type definition is represented by the atom used as its unique identifier in this Prolog representation.
The special values global, default, fixed are represented as structures: keyword(global), etc.
Boolean values are represented by the atoms true and false.
Sets are represented as lists; the empty set is represented by [].
An element declaration is represented by the atom used as its unique identifier in this Prolog representation.
An identity constraint definition is represented by the atom used as its unique identifier in this Prolog representation. There are none in the purchase order schema.
The keywords extension, restriction, substitution are represented by Prolog atoms spelled the same way.
Annotations are represented by copying the SWI Prolog representation. There are none in the purchase order schema.

The translation from the purchase order schema is obvious enough that there seems to be no need for much commentary. Some properties are the same throughout this schema:

< 87 Common properties > ≡

    (target_namespace('http://www.example.com/PO1')),
    (value_constraint(keyword(absent))),
    (nillable(false)),
    (identity_constraint_definitions([])),
    (substitution_group_affiliation(keyword(absent))),
    (disallowed_substitutions([])),
    (substitution_group_exclusions([])),
    (abstract(false)),
    (annotation(keyword(absent)))

< 88 Element declaration: comment > ≡

elemdecl(e_comment, '/element(comment)',
  comment, 'http://www.example.com/PO1',
  global, [
    (id(e_comment)),
    (scd('/element(po:comment)')),
    (name(comment)),
    (type_definition(t_string)),
    (scope(keyword(global))),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 89 Element declaration: shipTo > ≡

elemdecl(e_shipTo,
  '/complexType(po:PurchaseOrderType)/sequence()/element(shipTo)',
  shipTo, 'http://www.example.com/PO1',
  local, [
    (id(e_shipTo)),
    (scd('/complexType(po:PurchaseOrderType)/sequence()/element(shipTo)')),
    (name(shipTo)),
    (type_definition(t_USAddress)),
    (scope(t_PurchaseOrderType)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 90 Element declaration: billTo > ≡

elemdecl(e_billTo,
  '/complexType(po:PurchaseOrderType)/sequence()/element(billTo)',
  billTo, 'http://www.example.com/PO1',
  local, [
    (id(e_billTo)),
    (scd('/complexType(po:PurchaseOrderType)/sequence()/element(billTo)')),
    (name(billTo)),
    (type_definition(t_USAddress)),
    (scope(t_PurchaseOrderType)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 91 Element declaration: items > ≡

elemdecl(e_items,
  '/complexType(po:PurchaseOrderType)/sequence()/element(items)',
  e_items, 'http://www.example.com/PO1',
  local, [
    (id(e_items)),
    (scd('/complexType(po:PurchaseOrderType)/sequence()/element(items)')),
    (name(items)),
    (type_definition(t_Items)),
    (scope(t_PurchaseOrderType)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 92 Element declaration: name > ≡

elemdecl(e_name,
  '/complexType(po:USAddress)/sequence()/element(name)',
  name, 'http://www.example.com/PO1',
  local, [
    (id(e_name)),
    (scd('/complexType(po:USAddress)/sequence()/element(name)')),
    (name(name)),
    (type_definition(t_string)),
    (scope(t_USAddress)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 93 Element declaration: street > ≡

elemdecl(e_street,
  '/complexType(po:USAddress)/sequence()/element(street)',
  street, 'http://www.example.com/PO1',
  local, [
    (id(e_street)),
    (scd('/complexType(po:USAddress)/sequence()/element(street)')),
    (name(street)),
    (type_definition(t_string)),
    (scope(t_USAddress)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 94 Element declaration: city > ≡

elemdecl(e_city,
  '/complexType(po:USAddress)/sequence()/element(city)',
  city, 'http://www.example.com/PO1',
  local, [
    (id(e_city)), 
    (scd('/complexType(po:USAddress)/sequence()/element(city)')),
    (name(city)),
    (type_definition(t_string)),
    (scope(t_USAddress)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 95 Element declaration: state > ≡

elemdecl(e_state,
  '/complexType(po:USAddress)/sequence()/element(state)',
  state, 'http://www.example.com/PO1',
  local, [
    (id(e_state)),
    (scd('/complexType(po:USAddress)/sequence()/element(state)')),
    (name(state)),
    (type_definition(t_string)),
    (scope(t_USAddress)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 96 Element declaration: zip > ≡

elemdecl(e_zip,
  '/complexType(po:USAddress)/sequence()/element(zip)',
  zip, 'http://www.example.com/PO1',
  local, [
    (id(e_zip)),
    (scd('/complexType(po:USAddress)/sequence()/element(zip)')),
    (name(zip)),
    (type_definition(t_decimal)),
    (scope(t_USAddress)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 97 Element declaration: item > ≡

elemdecl(e_item,
  '/complexType(po:Items)/sequence()/element(item)',
  item, 'http://www.example.com/PO1',
  local, [
    (id(e_item)),
    (scd('/complexType(po:Items)/sequence()/element(item)')),
    (name(item)),
    (type_definition(t_item)),
    (scope(t_Items)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 98 Element declaration: productName > ≡

elemdecl(e_productName,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(productName)',
  productName, 'http://www.example.com/PO1',
  local, [
    (id(e_productName)),
    (scd('/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(productName)')),
    (name(productName)),
    (type_definition(t_string)),
    (scope(t_item)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 99 Element declaration: quantity > ≡

elemdecl(e_quantity,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)',
  quantity, 'http://www.example.com/PO1',
  local, [
    (id(e_quantity)),
    (scd('/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)')),
    (name(quantity)),
    (type_definition(t_quantity)),
    (scope(t_item)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 100 Element declaration: USPrice > ≡

elemdecl(e_USPrice,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)',
  'USPrice', 'http://www.example.com/PO1',
  local, [
    (id(e_USPrice)),
    (scd('/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)')),
    (name('USPrice')),
    (type_definition(t_decimal)),
    (scope(t_item)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

< 101 Element declaration: shipDate > ≡

elemdecl(e_shipDate,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)',
  shipDate, 'http://www.example.com/PO1',
  local, [
    (id(e_shipDate)),
    (scd('/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)')),
    (name(shipDate)),
    (type_definition(t_date)),
    (scope(t_item)),
    {Common properties 87}

  ]).

This code is used in < Element declarations 155 >

4.2.2. Complex type definitions and attribute declarations

Complex type definitions have a structure like this:

< 102 Complex type: t_PurchaseOrderType > ≡

complextype(t_PurchaseOrderType,
  '/complexType(po:PurchaseOrderType)',
  'PurchaseOrderType', 'http://www.example.com/PO1',
  global, [
    (id(t_PurchaseOrderType)),
    (scd('/complexType(po:PurchaseOrderType)')),
    (anonymous(false)),
    (name('PurchaseOrderType')),
    (target_namespace('http://www.example.com/PO1')),
    (base_type_definition(t_urtype)),
    (derivation_method(restriction)),
    (final([])),
    (abstract(false)),
    (attribute_uses([
              au(required(false),
                 attdecl(a_purchaseOrder_orderDate),
                 value_constraint(keyword(absent)))
             ])),
    (attribute_wildcard(keyword(absent))),
    (content_type(c(t_PurchaseOrderType_cont,element-only))),
    (prohibited_substitutions([])),
    (annotations(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

In addition to the Prolog conventions discussed with the element declarations, the following need mention here:

The anonymous property is added, to provide a convenient way to determine whether the value of the name property was supplied in the schema or by the translation into Prolog. This is redundant: it is true iff we have scope(global).
An attribute declaration is represented by the atom used as its unique identifier in this Prolog representation. (In practice, for local attributes this is a_ + the type's identifier minus its t_ prefix + the attribute's local name.)
An attribute use component is represented by a structure of the form au(RQ,AD,VC), where
- RQ is the structure required(true) or required(false),
- AC is the structure attdecl(AD) where AD is the atom used for the attribute declaration, and
- VC is a structure with functor value_constraint and a single argument, which is either keyword(absent) or required(Value) or fixed(Value)
There seems no need to make these top-level components.
In the {content type} property, the keyword empty is represented keyword(empty); a content model is represented by a structure of the form c(T,K), where T is the name of a non-terminal in the definite clause translation grammar (typically — always? — the name of the type), and K is one of the atoms element-only and mixed.

Attribute declarations are made into separate components:

< 103 Purchase order attributes > ≡

attdecl(a_PurchaseOrderType_orderDate,
  '/complexType(po:PurchaseOrderType)/attribute(orderDate)',
  orderDate, 'http://www.example.com/PO1',
  local, [
    (name(orderDate)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_date)),
    (scope(t_PurchaseOrderType)),
    (value_constraint(keyword(absent))),
    (annotation(keyword(absent)))
  ]).

This code is used in < Layer 4 attribute declarations 157 >

There are four other complex types we need to have representations for:

< 104 Complex type: t_USAddress > ≡

complextype(t_USAddress,'/complexType(po:USAddress)',
  'USAddress', 'http://www.example.com/PO1',
  global, [
    (id(t_USAddress)),
    (scd('/complexType(po:USAddress)')),
    (anonymous(false)),
    (name('USAddress')),
    (target_namespace('http://www.example.com/PO1')),
    (base_type_definition(t_urtype)),
    (derivation_method(restriction)),
    (final([])),
    (abstract(false)),
    (attribute_uses([
              au(required(false),
                 attdecl(a_USAddress_country),
                 value_constraint(keyword(absent)))
             ])),
    (attribute_wildcard(keyword(absent))),
    (content_type(c(t_USAddress_cont,element-only))),
    (prohibited_substitutions([])),
    (annotations(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

< 105 Address attributes > ≡

attdecl(a_USAddress_country,
  '/complexType(po:USAddress)/attribute(country)',
  country,  'http://www.example.com/PO1',
  local, [
    (name(country)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_NMTOKEN)),
    (scope(t_USAddress)),
    (value_constraint(fixed('US'))),
    (annotation(keyword(absent)))
  ]).

This code is used in < Layer 4 attribute declarations 157 >

< 106 Complex type: t_Items > ≡

complextype(t_Items,'/complexType(po:Items)',
  'Items', 'http://www.example.com/PO1',
  global, [
    (id(t_Items)),
    (scd('/complexType(po:Items)')),
    (name('Items')),
    (target_namespace('http://www.example.com/PO1')),
    (base_type_definition(t_urtype)),
    (derivation_method(restriction)),
    (final([])),
    (abstract(false)),
    (attribute_uses([])),
    (attribute_wildcard(keyword(absent))),
    (content_type(c(t_Items_cont,element-only))),
    (prohibited_substitutions([])),
    (annotations(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

< 107 Complex type: t_item > ≡

complextype(t_item,
  '/complexType(po:Items)/sequence()/element(item)/complexType() ',
  t_item, 'http://www.example.com/PO1', [
    (id(t_item)),
    (scd('/complexType(po:PurchaseOrderType)')),
    (anonymous(true)),
    (name('PurchaseOrderType')),
    (target_namespace('http://www.example.com/PO1')),
    (base_type_definition(t_urtype)),
    (derivation_method(restriction)),
    (final([])),
    (abstract(false)),
    (attribute_uses([
              au(required(true),
                 attdecl(a_item_partNum),
                 value_constraint(keyword(absent)))
             ])),
    (attribute_wildcard(keyword(absent))),
    (content_type(c(t_item_cont,element-only))),
    (prohibited_substitutions([])),
    (annotations(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

< 108 Address attributes > ≡

attdecl(a_item_partNum,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/attribute(partNum)',
  partNum, 'http://www.example.com/PO1',
  local, [
    (name(partNum)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_sku)),
    (scope(t_item)),
    (value_constraint(keyword(absent))),
    (annotation(keyword(absent)))
  ]).

This code is used in < Layer 4 attribute declarations 157 >

4.2.3. Simple type definitions

Simple type definitions have the following structure:

simpletype(TypeID,
  SCD,
  TypeName, TargetNS, LocalGlobal [
    (id(TypeID)),
    (scd(SCD)),
    (anonymous(true|false)),
    (name(TypeName)),
    (target_namespace(TargetNS)),
    (variety( atomic|list|union )),
    (primitive_type_definition( TypeID_P )),
    (item_type_definition( TypeID_I )),
    (member_type_definitions([ TypeID_M, TypeID_M2, ... ])),
    (facets(ListOfFacets)),
    (fundamental_facets(ListOfFundamentals)),
    (base_type_definition(t_simpleurtype)),
    (final([restriction, list, union])),
    (annotation(keyword(absent)))
]).

Both the fundamental and the constraining facets properties can be accessed by binding a variable to the value of the facets or fundamental_facets property of the simple type and then using the ^^ notation (e.g. F^^minInclusive(Mininc)).

Concretely, the two facets properties are represented in Prolog as lists of structures. For constraining facets, the structures expected are these:

  length(value(V), fixed(Bool), annotation()),
  minLength(value(V), fixed(Bool), annotation()),
  maxLength(value(V), fixed(Bool), annotation()),
  pattern(value(V), annotation(), dctg_rule()),
  enumeration(value([Vs]), annotation()),
  whiteSpace(
    value(preserve|replace|collapse), 
    fixed(Bool), 
    annotation()),
  maxInclusive(value(V), fixed(Bool), annotation()),
  maxExclusive(value(V), fixed(Bool), annotation()),
  minInclusive(value(V), fixed(Bool), annotation()),
  minExclusive(value(V), fixed(Bool), annotation()),
  totalDigits(value(V), fixed(Bool), annotation()),
  fractionDigits(value(V), fixed(Bool), annotation())

The value is in each case represented by an appropriate Prolog atom; the fixed property takes a Boolean value, and the annotation uses the customary Prolog representation of XML. For patterns, the layer 4 implementation records the name of the DCTG rule which captures the pattern.

The fundamental facets are represented as simple structures:

  ordered(false|partial|total),
  bounded(true|false),
  cardinality(finite|countable),
  numeric(true|false)

Armed with these conventions, we can write data structures for the simple types of the purchase-order schema. In addition to those defined and used directly (t_quantity, t_SKU, t_string, t_integer, t_decimal, and t_date), we will also define their base types, recursively back to their base primitive type and the simple Urtype. The derivations are perhaps clearest if we start from the top, with the simple Urtype:

< 109 Simple type: simple urtype > ≡

simpletype(t_simpleurtype,
  'simpleType(anySimpleType)',
  anySimpleType, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_simpleurtype)),
    (scd('simpleType(xsd:anySimpleType)')),
    (anonymous(true)),
    (name(anySimpleType)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (base_type_definition(t_urtype)),
    (facets([])),
    (final([])),
    (variety(keyword(absent))),
    (primitive_type_definition(keyword(absent))),
    (item_type_definition(keyword(absent))),
    (member_type_definitions(keyword(absent))),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

which immediately requires us to define the Urtype:

< 110 The urtype > ≡

complextype(t_urtype,
  '/complexType(xsd:anyType)',
  'anyType', 'http://www.w3.org/2001/XMLSchema',
  global, [
    (id(t_urtype)),
    (scd('/complexType(xsd:anyType)')),
    (anonymous(false)),
    (name('anyType')),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (base_type_definition(keyword(absent))),
    /* perhaps the base type definition should be t_urtype */
    (derivation_method(keyword(absent))),
    (final([])),
    (abstract(false)),
    (attribute_uses([
              au(required(false),
                 attdecl(a_urtype_),
                 value_constraint(keyword(absent)))
             ])),
    (attribute_wildcard(keyword(absent))),
    (content_type(c(t_urtype_cont,mixed))),
    (prohibited_substitutions([])),
    (annotations(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

The urtype, in turn, obliges us to create a representation for attribute wildcards if we wish to be complete. We could omit this without loss since the purchase order schema doesn't have any attribute wildcards; we are including it solely for completeness and documentation.

< 111 Aattribute wildcard for Urtype > ≡

attdecl(a_urtype_,
  '/complexType(xsd:anyType)/anyAttribute()',
  'keyword(any)', 'http://www.w3.org/2001/XMLSchema',
  local, [
    (name(partNum)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_sku)),
    (scope(t_item)),
    (value_constraint(keyword(absent))),
    (annotation(keyword(absent)))
  ]).

This code is used in < Layer 4 attribute declarations 157 >

The string type is used directly in the schema, as are two types derived from it (NMTOKEN and SKU).

< 112 Simple type: string > ≡

simpletype(t_string,
  '/simpleType(xsd:string)',
  string, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_string)),
    (scd('/simpleType(xsd:string)')),
    (anonymous(false)),
    (name(string)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition( t_string )),
    (item_type_definition( keyword(absent) )),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(preserve), 
                 fixed(false), 
                 annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(false),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_simpleurtype)),
    (final([])),
    (annotation(keyword(absent)))
    /* This is not true: like all builtins, string does have
     * annotation in the schema for schemas.  But I don't have
     * any particular use for it, so I'm omitting it for now. */
]).

This code is used in < Layer 4 type definitions 156 >

To get to NMTOKEN, we need to pass through normalized strings

< 113 Simple type: normalizedString > ≡

simpletype(t_normalizedString,
  '/simpleType(xsd:normalizedString)',
  normalizedString, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_normalizedString)),
    (scd('/simpleType(xsd:normalizedString)')),
    (anonymous(false)),
    (name(normalizedString)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition( t_string )),
    (item_type_definition( keyword(absent) )),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(replace), 
                 fixed(false), 
                 annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(false),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_string)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

and tokens:

< 114 Simple type: token > ≡

simpletype(t_token,
  '/simpleType(xsd:token)',
  token, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_token)),
    (scd('/simpleType(xsd:token)')),
    (anonymous(false)),
    (name(token)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition( t_string )),
    (item_type_definition( keyword(absent) )),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(false), 
                 annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(false),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_normalizedString)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

And now we reach NMTOKEN, which is used for the country attribute.

< 115 Simple type: NMTOKEN > ≡

simpletype(t_NMTOKEN,
  '/simpleType(xsd:NMTOKEN)',
  'NMTOKEN', 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_NMTOKEN)),
    (scd('/simpleType(xsd:NMTOKEN)')),
    (anonymous(false)),
    (name('NMTOKEN')),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition( t_string )),
    (item_type_definition( keyword(absent) )),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(false), 
                 annotation(keyword(absent))),

      pattern(value('\c+'), annotation(keyword(absent)), 
              dctg_rule(t_string))
      /* actually, the pattern does have annotation, but I have
       * no use for it at the moment */
    ])),
    (fundamental_facets([
      ordered(false),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_token)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

Another type derived from string is the type used for part numbers:

< 116 Simple type: SKU > ≡

simpletype(t_SKU,
  'simpleType(po:SKU)',
  'SKU', 'http://www.example.com/PO1', global, [
    (id(t_SKU)),
    (scd('simpleType(po:SKU)')),
    (anonymous(false)),
    (name('SKU')),
    (target_namespace('http://www.example.com/PO1')),
    (variety(atomic)),
    (primitive_type_definition(t_string)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(preserve), 
                 fixed(false), 
                 annotation(keyword(absent)))
      pattern(value('\d{3}-[A-Z]{2}'), annotation(keyword(absent)),
              dctg_rule(t_SKU))
    ])),
    (fundamental_facets([
      ordered(false),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_string)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

The decimal type is used for zip codes (a fairly clear case of type abuse):

< 117 Simple type: decimal > ≡

simpletype(t_decimal,
  'simpleType(xsd:decimal)',
  decimal, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_decimal)),
    (scd('simpleType(xsd:decimal)')),
    (anonymous(false)),
    (name(decimal)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition(t_decimal)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(true), 
                 annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(total),
      bounded(false),
      cardinality(countable),
      numeric(true)
    ])),
    (base_type_definition(t_simpleurtype)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

The type used for quantities is derived from decimal via integer, nonNegativeInteger, and positiveInteger:

< 118 Simple type: integer > ≡

simpletype(t_integer,
  'simpleType(xsd:integer)',
  integer, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_integer)),
    (scd('simpleType(xsd:integer)')),
    (anonymous(false)),
    (name(integer)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition(t_decimal)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(true), 
                 annotation(keyword(absent))),
      fractionDigits(value(0), 
                     fixed(true), 
                     annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(total),
      bounded(false),
      cardinality(countable),
      numeric(true)
    ])),
    (base_type_definition(t_decimal)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

< 119 Simple type: non-negative integer > ≡

simpletype(t_nonNegativeInteger,
  'simpleType(xsd:nonNegativeInteger)',
  nonNegativeInteger, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_nonNegativeInteger)),
    (scd('simpleType(xsd:nonNegativeInteger)')),
    (anonymous(false)),
    (name(nonNegativeInteger)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition(t_decimal)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(true), 
                 annotation(keyword(absent))),
      fractionDigits(value(0), 
                     fixed(true), 
                     annotation(keyword(absent)))
      minInclusive(value(0), 
                   fixed(false), 
                   annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(total),
      bounded(false),
      cardinality(countable),
      numeric(true)
    ])),
    (base_type_definition(t_integer)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

< 120 Simple type: positive integer > ≡

simpletype(t_positiveInteger,
  'simpleType(xsd:positiveInteger)',
  positiveInteger, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_positiveInteger)),
    (scd('simpleType(xsd:positiveInteger)')),
    (anonymous(false)),
    (name(positiveInteger)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition(t_decimal)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(true), 
                 annotation(keyword(absent))),
      fractionDigits(value(0), 
                     fixed(true), 
                     annotation(keyword(absent)))
      minInclusive(value(1), 
                   fixed(false), 
                   annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(total),
      bounded(false),
      cardinality(countable),
      numeric(true)
    ])),
    (base_type_definition(t_nonNegativeInteger)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

The purchase order type defines a special restriction of positiveInteger for quanties:

< 121 Simple type for quantities > ≡

simpletype(t_quantity,
  '/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()',
  t_quantity, 'http://www.example.com/PO1', local, [
    (id(t_quantity)),
    (scd('/complexType(po:Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()')),
    (anonymous(true)),
    (name(t_quantity)),
    (target_namespace('http://www.example.com/PO1')),
    (variety(atomic)),
    (primitive_type_definition(t_decimal)),
      (item_type_definition(keyword(absent))),
      (member_type_definitions(keyword(absent))),
    (facets([

     /* from decimal */
      whiteSpace( value(collapse), 
                  fixed(true), 
                  annotation(keyword(absent))),

      /* from integer */
      fractionDigits(value(0), fixed(true), annotation(keyword(absent))),

      /* from nonnegativeInteger, later overridden */
      /* 
      minInclusive(value(0), fixed(false), annotation(keyword(absent))),
      */

      /* from positiveInteger */
      minInclusive(value(1), fixed(false), annotation(keyword(absent))),

      /* from final derivation step for quantity */
      maxExclusive(value(100), fixed(false), annotation(keyword(absent))),
    ])),
    (fundamental_facets([
      ordered(total),
      bounded(true),
      cardinality(finite),
      numeric(true)
    ])),
    (base_type_definition(t_positiveInteger)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is not used elsewhere.

Finally, the order and ship dates have (as one might expect) the type date:

< 122 Simple type: date > ≡

simpletype(t_date, 'simpleType(xsd:date)',
  date, 'http://www.w3.org/2001/XMLSchema', global, [
    (id(t_date)),
    (scd('simpleType(xsd:date)')),
    (anonymous(true)),
    (name(date)),
    (target_namespace('http://www.w3.org/2001/XMLSchema')),
    (variety(atomic)),
    (primitive_type_definition(t_date)),
    (item_type_definition(keyword(absent))),
    (member_type_definitions([])),
    (facets([
      whiteSpace(value(collapse), 
                 fixed(true), 
                 annotation(keyword(absent)))
    ])),
    (fundamental_facets([
      ordered(partial),
      bounded(false),
      cardinality(countable),
      numeric(false)
    ])),
    (base_type_definition(t_simpleurtype)),
    (final([])),
    (annotation(keyword(absent)))
]).

This code is used in < Layer 4 type definitions 156 >

4.2.4. Other

A general-purpose logic-grammar representation of XML Schema will need to have representations for various other component types as well. We omit these here, since they are not needed for the purchase order schema.

4.3. Validation of simple types

Perhaps the simplest place to start with error diagnosis is the simple types. In layer 3, the predicates xsd_decimal_cont, xsd_string_cont and so on succeed only if the lexical form to be checked denotes a legal value; otherwise, they simply fail. What we want are predicates which will succeed no matter what they are given as a lexical form and will return some sort of error diagnostic if appropriate.

In the sections below, we'll describe the conventions for constructing predicates and reporting errors and then write the new predicates for validating simple types.

4.3.1. Conventions for validating elements and lexical forms

Simple types can be associated either with the values of attributes or with the content of elements. In layer 3, there were two sets of predicates, one with names of the form TN_cont used for checking the content of elements and one with names of the form TN_value which succeeded if and only if a given lexical form was legal for the type.

In layer 4, we will have a similar distinction: to check the content of elements, we'll use predicates of the form sva_content_st_TN(Gi,Lin,Lpn,Lerr), and to check lexical forms we'll use sva_st_lf_TN(Lin,LF,Lerr), where

Gi is the generic identifier of the element being checked (this is solely for use in error messages)
Lin is the input list of atoms (for PCDATA), entity() structures (for characters not in the Prolog character set), and element() structures provided by the upstream parser
Lpn is a list of parsed nodes (recall that in level 3 we ended up with a node structure for an intermediate grammatical construct being returned here; we change the call in order to fix that)
Lerr is either a list of error codes or the keyword ok; if the list is empty, the element's content is valid with respect to the type involved.
LF is the lexical form to be checked; for now we'll assume this is a single atom, but we could change the calling convention later.

At its core, the sva_content_st_TN predicate calls the sva_st_lf_TN predicate. If the latter returns without errors, so does the former; otherwise, the errors found in the lexical form are returned, together with an error message about constraint cvc-type.3.1.3 (the rule that says an element with simple type T must have content which conforms to T).

sva_content_st_TN(Gi,Lin,Lpn,[]) :-
  sva_st_lf_TN(TN,Lin,[]).
sva_content_st_TN(Gi,Lin,Lpn,[ElemError,LfError|Errors]) :-
  sva_st_lf_TN(TN,Lin,[LfError|Errors]),
  ElemError = error(cvc-type.3.1.3).

Or we can phrase the rule generically, with a single predicate for all types. If we use the pattern Test -> Thenclause ; Elseclause to avoid calling sva_st_lf twice, we get:

sva_content_st(Tn,Gi,[Atom],Lpn,Lerr) :-
  sva_st_lf(Tn,Atom,Le),
  ( Le = []
      -> Lerr = []
      ;  append([error(cvc-type.3.1.3,'Bad content')],Le,Lerr)
  ).

This is an improvement on the corresponding predicates of layer 3, but it's not finished: (1) this predicate should also handle empty elements, (2) it should check for elements among the children, and (3) the error reporting should have something more than a keyword. The full definition of sva_content_st is found below (Validation of elements).

4.3.2. Error codes for simple types

[W3C 2001b] and [W3C 2001c] suggest some specific error codes for invalid simple types:

cvc-simple-type string not locally valid w.r.t. the given simple type. [Should also have one or more error codes from cvc-datatype-valid in datatypes spec.]
cvc-datatype-valid.1 lexical form not datatype-valid w.r.t. a given simple type: does not match a literal in the lexical space.
cvc-datatype-valid.2 lexical form not datatype-valid w.r.t. a given simple type: does not denote a member of the value space.

In other words, we can distinguish failure to provide a legal lexical form from failure of that lexical form to denote a legal value.[21] In order to simplify the maintenance of diagnostic messages, we can put these codes into Prolog predicates, so that we can retrieve the description conveniently without writing it into each rule that raises the error. Using the standard predicate concat_atom(+List,-Atom), we can integrate the name of the simple type into the error descriptions:

< 123 Error codes for simple types > ≡

/* xsd_errorcode(+Sym,+String,+Typename,-Desc):
   Desc is the string describing the error whose error code
   is Sym, when it occurs with respect to the string String 
   being validated against type Typename.
*/
xsd_errorcode(cvc-simple-type,String,Typename,Desc) :-
  concat_atom(['The string "',String,
      '" is not locally valid w.r.t. the given simple type ',
      Typename,'.'],
    Desc).
xsd_errorcode(cvc-datatype-valid.1,S,T,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it matches no literal in the lexical space.'],
    Desc).
xsd_errorcode(cvc-datatype-valid.2,S,T,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it denotes a value not in the value space.'],
    Desc).

Continued in <Error code for pattern failures 130>, <Error code for pattern failures 137>, <Error code for pattern failures 139>, <Error code for enumeration failures 141>, <Error codes for elements with simple types 152>
This code is used in < Generic utilities for DCTG-encoded schemas 158 >

4.3.3. Validating a lexical form against a simple type

Having reified the simple types above (Prolog representation of schema components), we can now write a single predicate to validate lexical forms for any simple type, which does several tasks:

Retrieve the basic information about the simple type; let's call this the type info.
Perform the appropriate white space processing: extract the whitespace keyword from the type info and call ws_normalize.
Check the lexical form against the grammar for its primitive simple type.
Collect the set of patterns defined for this type, and check the lexical form against each of them in turn.
Work through the remaining facets and check them all.

< 124 Schema-validity assessment: lexical form > ≡

/* sva_st_lf(Tn,PreLF,Lerr): true iff validating pre-lexical form 
   PreLF against simple type Tn produces list of errors Lerr.
 */
sva_st_lf(Tid,LF,Lerr) :-
  {Get type information about simple type Tid 125}

  {Normalize white space to create lexical form 126}

  {Check lexical form against grammar for primitive type 127}

  {Check lexical form against patterns for type Tid 128}

  {Check value against all facets of type Tid 131}

This code is not used elsewhere.

First, we fetch the type information. We can assume safely that the type information is available; if it's not, the error should have been raised before calling this predicate, and if the predicate is called anyway, we should simply fail.

< 125 Get type information about simple type Tid > ≡

  simpletype(Tid,Tscd,Tname,Ttargetns,Tglobal,Tprops),

This code is used in < Schema-validity assessment: lexical form 124 >

Next, we perform the prescribed whitespace normalization. This should not fail if called properly, so we don't have any fallback behavior. To find the prescribed normalization, we first extract the facet list of the type, and then extract the whiteSpace facet from the facet list.

< 126 Normalize white space to create lexical form > ≡

  Tprops^^facets(Tfacets),
  Tfacets^^whiteSpace(value(Tws),fixed(_),annotation(_)),
  ws_normalize(Tws,PreLF,LF),

This code is used in < Schema-validity assessment: lexical form 124 >

The utility ws_normalize was defined as part of layer 3 and is unchanged.

4.3.4. Checking the lexical form of a simple type

Now we have the lexical form proper, and we can check it against the grammar for the appropriate primitive type. The name of the primitive type is in the primitive_type_definition property, and we call it as a grammar rule. First, though, we turn the atom we got from the document into a list of characters. If the parse fails, we want to return with cvc-datatype-valid.1, so the call to the grammar is wrapped in an if-then-else construct.

< 127 Check lexical form against grammar for primitive type > ≡

  Tprops^^primitive_type_definition(Prim),
  atom_chars(LF,Lc),
  ( call(Prim,ParsedNode,Lc,[]) 
    -> LParseErrors = []
    ; ( xsd_errorcode(cvc-simple-type,LF,Prim,Desc-st),
        xsd_errorcode(cvc-datatype-valid.1,LF,Prim,Desc-lf),
        LParseErrors = [error(cvc-simple-type,Desc-st),
          error(cvc-datatype-valid.1,Desc-lf)]
      )
  ),

This code is used in < Schema-validity assessment: lexical form 124 >

Now we walk through all the patterns in the list of facets, and check the lexical form against each one. N.B. patterns need not recapitulate the constraints expressed in earlier patterns, so all must be checked. If any of these fail, return with cvc-datatype-valid.1. To simplify the progression through the list of patterns, we put the pattern-matching code in a separate predicate, which does the work and collects the error messages, if any.

< 128 Check lexical form against patterns for type Tid > ≡

  match_patterns(Tid,LF,Tfacets,LPatternErrors),

This code is used in < Schema-validity assessment: lexical form 124 >

The predicate match_patterns must walk through the list of facets and handle all the pattern facets.[22]

< 129 Pattern checking predicate > ≡

/* match_patterns(Tid,LF,Tfacets,LPatternErrors): true iff
   matching lexical form LF against all the pattern facets in
   the list Tfacets yields LPatternErrors as the resulting error list.
*/

/* If there are no facets, there are no errors. */
match_patterns(Tid,LF,[],[]).

/* If the next facet is not a pattern, skip it and recur. */
match_patterns(Tid,LF,[F|Lfacets],Lerr) :-
  not(F = pattern(_Value,_Annotation,_DCTG)),
  match_patterns(Tid,LF,Lfacets,Lerr).

/* If the next facet is a pattern, test it and recur. */
match_patterns(Tid,LF,
  [pattern(value(PVal),annotation(PAnn),dctg_rule(PRule))|Lfacets],
  Lerr) :-
      atom_chars(LF,Lc),
      ( call(PRule,_ParsedNode,Lc,[])
        -> Lerr = Lerr2
        ; (
            atom_codes(PreLF,LF),
            xsd_errorcode(cvc-datatype-valid.1,PreLF,Tid,Desc-dv),
            xsd_errorcode(cvc-pattern-valid.1,PreLF,Tid,PVal,Desc-pv),
            Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                    error(cvc-pattern-valid.1,Desc-pv),
                   | Lerr2]
          )
      ),
      match_patterns(Tid,LF,Lfacets,Lerr2).

This code is not used elsewhere.

It does seem like overkill to raise the cvc-datatype-valid error in addition to the cvc-pattern-valid error, but as I understand it, failure to conform to the pattern violates two validation rules at the same time, so I raise both errors.

< 130 Error code for pattern failures [continues 123 Error codes for simple types] > ≡

xsd_errorcode(cvc-pattern-valid.1,S,T,P,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it does not match the required pattern "',
      P, '".'],
    Desc).

4.3.5. Checking other facets on a simple type

Finally, we walk through all the other facets and check each. If any fail, return with cvc-datatype-valid.2. Here, too, we simply assign the core of the work to a separate predicate.

< 131 Check value against all facets of type Tid > ≡

  check_facets(Tid,LF,Tfacets,LFacetErrors).

This code is used in < Schema-validity assessment: lexical form 124 >

Like the pattern checking code, the facet-checking code walks through the list of facets accumulating diagnostics, until it reaches the end of the facet list. I won't write all the facet checking code here, only the code I need for the purchase order schema.

We begin with the obvious recursion base (empty list of facets) and with the filtering rules that ignore whitespace and pattern facets.

< 132 Facet checking predicate > ≡

/* check_facets(Tid,LF,Tfacets,Lerr): true iff checking the value
   associated with lexical form LF against all the non-pattern facets in
   the list Tfacets yields Lerr as the resulting error list.
*/

/* If there are no facets, there are no errors. */
check_facets(Tid,LF,[],[]).

/* If the next facet is a pattern or whitespace facet, 
   then skip it and recur. */
check_facets(Tid,LF,[pattern(_,_,_)|Lfacets],Lerr) :-
  check_facets(Tid,LF,Lfacets,Lerr).
check_facets(Tid,LF,[whiteSpace(_,_,_)|Lfacets],Lerr) :-
  check_facets(Tid,LF,Lfacets,Lerr).

Continued in <Checking length facet 133>, <Checking min and max length facets 138>, <Checking enumeraton facets 140>, <Checking minima and maxima 144>, <Checking total- and fractionDigits facets 145>
This code is not used elsewhere.

4.3.6. Checking length facets

The basic pattern for checking a facet is set by the rule for the length facet. We call a predicate which determines the length of the value, compare it to the facet value, and add either nothing at all or an appropriate error code to the list of errors. Since the error code to be used depends on the specific type in question, the xsd_errorcode call for cvc-length-valid takes an additional argument, which returns the correct error code to be reported to the user (cvc-length-valid plus a clause number).

< 133 Checking length facet [continues 132 Facet checking predicate] > ≡

/* Check lengths */ 

check_facets(Tid,LF, 
  [length(value(Val),fixed(_Fixed),annotation(_Ann)) 
  | Lfacets], Lerr) :-
    type_val_length(Tid,LF,Length),
    ( Length is Val 
      -> Lerr = Lerr2
      ; (
          xsd_errorcode(cvc-datatype-valid.1,LF,Tid,Desc-dv),
          xsd_errorcode(cvc-length-valid,LF,Tid,Val,Length,EC,Desc-lv),
          Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                   error(EC,Desc-lv)
                 | Lerr2]
        )
    )
    check_facets(Tid,LF,Lfacets,Lerr2).

The type_val_length predicate simply measures the length of the value. The correct way to measure length in atomic types varies with the primitive datatype, so type_val_length first looks up the variety and primitive type and then calls an auxiliary predicate if it's atomic. If it's a list, we just count blank-delimited tokens.

< 134 type_val_length predicate > ≡

type_val_length(Tid,LF,Length) :-
  simpletype(Tid,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
  Tprops ^^ variety(atomic),
  Tprops ^^ primitive_type_definition(PTid),
  prim_val_length(PTid,LF,Length).

type_val_length(Tid,LF,Length) :-
  simpletype(Tid,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
  Tprops ^^ variety(list),
  tokens_count(L2,Length).

prim_val_length(t_string,LF,Length) :-
  atom_chars(LF,Lc),
  length(Lc,Length).
prim_val_length(t_anyURI,LF,Length) :-
  atom_chars(LF,Lc),
  length(Lc,Length).
prim_val_length(t_hexBinary,LF,Length) :-
  atom_chars(LF,Lc), 
  ws_delete(Lc,Lc2),
  length(Lc2,LenLc2),
  Length is LenLc2 / 2.
prim_val_length(t_base64Binary,LF,Length) :-
  atom_codes(LF,Lc), 
  wscodes_delete(Lc,Lc2),
  length(Lc2,LenLc2),
  LenCeil is ((LenLc2 / 4) * 3),
  ( append(_L,[61,61],Lc2) /* LF ends with == */
    -> Length is LenCeil - 2
     ; (append(_L,[61],Lc2) /* LF ends with = */
    -> Length is LenCeil - 1
     ; Length is LenCeil)).
prim_val_length(t_QName,LF,Length).
prim_val_length(t_NOTATION,LF,Length).

This code is not used elsewhere.

We need a simple routine to strip white space out of an atom, so we can strip white space before calculating the length of encoded binary data.

< 135 White space deletion > ≡

/* wscodes_delete(L1,L2): true iff L2 = L1, with all whitespace
   codes (numeric codes!) deleted.
 */
wscodes_delete([],[]).
wscodes_delete([32|Lin],Lout) :- wscodes_delete(Lin,Lout).
wscodes_delete([09|Lin],Lout) :- wscodes_delete(Lin,Lout).
wscodes_delete([10|Lin],Lout) :- wscodes_delete(Lin,Lout).
wscodes_delete([13|Lin],Lout) :- wscodes_delete(Lin,Lout).
wscodes_delete([C|Lin],Lout) :- 
  not(member(C,[9,10,13,32])),
  wscodes_delete([C|Lin],Lout).

This code is not used elsewhere.

We also need a simple routine to count white-space-delimited tokens, so we can handle list types.

< 136 White space deletion > ≡

tokens_count(LF,N) :-
  ws_normalize(collapse,LF,L2),
  atom_codes(L2,Lc),
  count_codes(Lc,32,Blanks),
  N is Blanks + 1.

count_codes(L,C,N) :-
  count_codes(L,C,0,N).
count_codes([],_C,N,N).
count_codes([C|L],C,Acc,Total) :-
  N is Acc + 1,
  count_codes(L,C,N,Total).
count_codes([X|L],C,Acc,Total) :-
  not(X = C),
  count_codes(L,C,Acc,Total).

This code is not used elsewhere.

The error code for length checks must check the type to know which error code is actually desired.

< 137 Error code for pattern failures [continues 123 Error codes for simple types] > ≡

xsd_errorcode(cvc-length-valid,S,T,Lf,La,EC,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': its length is ',La,' instead of the required ',
      Lf, '.'],
    Desc),
    simpletype(T,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
    Tprops ^^ variety(V),
    Tprops ^^ primitive_type_definition(P),
    ( V = list
      -> EC = cvc-length-valid.2
       ; member(P,[t_string,t_anyURI])
      -> EC = cvc-length-valid.1.1
       ; member(P,[t_hex_binary,t_base64Binary])
      -> EC = cvc-length-valid.1.2
       ; member(P,[t_QName,t_NOTATION])
      -> EC = 'cvc-length-valid.1.3 (wow! this should not happen)'
       ; EC = 'cvc-length-valid (this cannot happen)'
    )

The minLength and maxLength facets work analogously.

< 138 Checking min and max length facets [continues 132 Facet checking predicate] > ≡

check_facets(Tid,LF, 
  [minLength(value(Val),fixed(_Fixed),annotation(_Ann))
  | Lfacets], Lerr) :-
    type_val_length(Tid,LF,Length),
    ( Length >= Val 
      -> Lerr = Lerr2
      ; (
          xsd_errorcode(cvc-datatype-valid.1,LF,Tid,Desc-dv),
          xsd_errorcode(cvc-minLength-valid,LF,Tid,Val,Length,EC,Desc-lv),
          Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                   error(EC,Desc-lv)
                 | Lerr2]
        )
    )
    check_facets(Tid,LF,Lfacets,Lerr2).

check_facets(Tid,LF, 
  [maxLength(value(Val),fixed(_Fixed),annotation(_Ann))
  | Lfacets], Lerr) :-
    type_val_length(Tid,LF,Length),
    ( Length => Val 
      -> Lerr = Lerr2
      ; (
          xsd_errorcode(cvc-datatype-valid.1,LF,Tid,Desc-dv),
          xsd_errorcode(cvc-maxLength-valid,LF,Tid,Val,Length,EC,Desc-lv),
          Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                   error(EC,Desc-lv)
                 | Lerr2]
        )
    )
    check_facets(Tid,LF,Lfacets,Lerr2).

So do their error codes.

< 139 Error code for pattern failures [continues 123 Error codes for simple types] > ≡

xsd_errorcode(cvc-minLength-valid,S,T,Lf,La,EC,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': its length is ',La,' instead of being at least ',
      Lf, '.'],
    Desc),
    simpletype(T,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
    Tprops ^^ variety(V),
    Tprops ^^ primitive_type_definition(P),
    ( V = list
      -> EC = cvc-minLength-valid.2
       ; member(P,[t_string,t_anyURI])
      -> EC = cvc-minLength-valid.1.1
       ; member(P,[t_hex_binary,t_base64Binary])
      -> EC = cvc-minLength-valid.1.2
       ; member(P,[t_QName,t_NOTATION])
      -> EC = 'cvc-minLength-valid.1.3 (wow! this should not happen)'
       ; EC = 'cvc-minLength-valid (this cannot happen)'
    )

xsd_errorcode(cvc-maxLength-valid,S,T,Lf,La,EC,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': its length is ',La,' instead of being at most ',
      Lf, '.'],
    Desc),
    simpletype(T,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
    Tprops ^^ variety(V),
    Tprops ^^ primitive_type_definition(P),
    ( V = list
      -> EC = cvc-maxLength-valid.2
       ; member(P,[t_string,t_anyURI])
      -> EC = cvc-maxLength-valid.1.1
       ; member(P,[t_hex_binary,t_base64Binary])
      -> EC = cvc-maxLength-valid.1.2
       ; member(P,[t_QName,t_NOTATION])
      -> EC = 'cvc-maxLength-valid.1.3 (wow! this should not happen)'
       ; EC = 'cvc-maxLength-valid (this cannot happen)'
    )

4.3.7. Checking enumerations

To check values properly against an enumeration, it is handy to have an internal representation of the value, distinct from a particular lexical form. Then the comparison is simple: we map from the lexical form to a value, and then compare that value directly against the values we keep around at the component level. A canonical form will do as well, though. When the Prolog representation of the schema is created, any enumeration facets will be represented in their canonical form. To check a lexical form, we first translate it into canonical form, then do an identity check against each item in the enumeration.

< 140 Checking enumeraton facets [continues 132 Facet checking predicate] > ≡

/* Check enumerations */ 
check_facets(Tid, LF, 
  [enumeration(value([Vs]),annotation(Ann))
  | Lfacets], Lerr) :-
    lf_st_canonicalform(LF,Tid,CF),
    ( member(CF,Vs)
      -> Lerr = Lerr2
      ; (
          xsd_errorcode(cvc-datatype-valid.1,LF,Tid,Desc-dv),
          xsd_errorcode(cvc-enumeration-valid,LF,Tid,Desc-ev),
          Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                   error(EC,Desc-ev)
                 | Lerr2]
        )
    )
    check_facets(Tid,LF,Lfacets,Lerr2).

We need an error code for enumeration failures:

< 141 Error code for enumeration failures [continues 123 Error codes for simple types] > ≡

xsd_errorcode(cvc-enumeration-valid,S,T,Vs,Desc) :-
  concat_atom(Vs, '", "', Va),
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it does not match any of the enumerated values ("',
      Va, '").'],
    Desc).

The canonical form for some primitive types is easy.

< 142 Canonical forms > ≡

/* Calculate canonical form. 
   lf_st_canonicalform(LF,Tid,CF): true iff lexical form LF
   of simple type Tid maps to canonical form CF.
 */
lf_st_canonicalform(LF,Tid,LF) :-
  simpletype(T,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
  Tprops ^^ primitive_type_definition(P),
  member(P,[t_string,t_anyURI,t_gMonthDay,t_gDay,t_gMonth,
    keyword(absent)]).
  /* if primitive type is 'absent', then it's the simple urtype,
     and there is nothing to be done */

Continued in <Canonical forms 143>
This code is not used elsewhere.

For others, it's fairly straightforward:

< 143 Canonical forms [continues 142 Canonical forms] > ≡

lf_st_canonicalform(LF,Tid,CF) :-
  simpletype(T,_Tscd,_Tname,_Ttargetns,_Tglobal,Tprops),
  Tprops ^^ primitive_type_definition(t_integer),
  atom_chars(LF,Lc),
  lf_integer_cf(Lc,Cc),
  atom_chars(CF,Cc).

/* We assume the lexical form is legal, and we don't
 * need to check for signs in the wrong place, etc.
 */
/* Retain leading minus sign */
lf_integer_cf(['-'|Lexcodes],['-'|Cancodes]) :-
  lf_integer_cf(Lexcodes,Cancodes).

/* Strip leading + or blank */
lf_integer_cf(['+'|Lexcodes],[Cancodes]) :-
  lf_integer_cf(Lexcodes,Cancodes).
lf_integer_cf(['0'|Lexcodes],[Cancodes]) :-
  lf_integer_cf(Lexcodes,Cancodes).

/* base cases: if you are down to [], you've
   stripped off all the zeroes.  If you find
   a non-zero, you are done. */
lf_integer_cf([D|Lexcodes],[D|Lexcodes]) :-
  char_type(D,digit), not(D = '0').
lf_integer_cf([],['0']).

4.3.8. Checking minima and maxima

< 144 Checking minima and maxima [continues 132 Facet checking predicate] > ≡

/* Check maxima and minima */ 
check_facets(Tid,LF, [maxInclusive(value(Val),fixed(Fixed),annotation(Ann))
  | Lfacets], Lerr) :-
check_facets(Tid,LF, [maxExclusive(value(Val),fixed(Fixed),annotation(Ann))
  | Lfacets], Lerr) :-
check_facets(Tid,LF, [minInclusive(value(Val),fixed(Fixed),annotation(Ann))
  | Lfacets], Lerr) :-
check_facets(Tid,LF, [minExclusive(value(Val),fixed(Fixed),annotation(Ann))
  | Lfacets], Lerr) :-

4.3.9. Checking total digits and fraction digits

< 145 Checking total- and fractionDigits facets [continues 132 Facet checking predicate] > ≡

/* Check totalDigits and fractionDigits */ 

check_facets(Tid,LF,
  [pattern(value(PVal),annotation(PAnn),dctg_rule(PRule))|Lfacets],
  Lerr) :-
      ( call(PRule,ParsedNode,LF,[])
        -> Lerr = Lerr2
        ; (
            atom_codes(PreLF,LF),
            xsd_errorcode(cvc-datatype-valid.1,PreLF,Tid,Desc-dv),
            xsd_errorcode(cvc-pattern-valid.1,PreLF,Tid,PVal,Desc-pv),
            Lerr = [error(cvc-datatype-valid.1,Desc-dv),
                    error(cvc-pattern-valid.1,Desc-pv),
                   | Lerr2]
          )
      ),
      check_facets(Tid,LF,Lfacets,Lerr2).

4.3.10. Error diagnosis for xsd:string

For type t_string, any lexical form is legal as long as it is made up of characters legal in an XML document, and any such lexical form corresponds to a legal value. The upstream XML processor should therefore already have performed all the checking that is necessary. If I were feeling both paranoid and energetic, I'd probably write a predicate to transform the atom to a list of character codes, and then check them against the list of legal characters, but for now I'll leave that to the imagination of the reader. So the value-checking predicate for string is fairly simple. If the test atom(LF) fails, we issue a generic cvc-simple-type error, but that should never occur.

< 146 Value checking rules for simple types > ≡

/* xsd_string_value(+TN,+LF,-Errors):
   Validating the lexical form LF against the simple type 
   xs:string (with type name TN) results in the errors listed 
   in Errors. (Each error is a structure with functor 'error' 
   and two arguments: an error code and a description. Each
   argument is an atom.)  If Errors is the empty list,
   then the lexical form is a legal string.
*/
xsd_string_value('xsd:string',LF,[]) :- atom(LF).
xsd_string_value('xsd:string',LF,[Err]) :- 
  not(atom(LF)),
  xsd_errorcode(cvc-simple-type,LF,'xsd:string',DescTemp),
  concat_atom(
    [DescTemp,' N.B. This error was thought to be impossible.'],
    Desc),
  Err = error(cvc-simple-type,Desc).

This code is used in < Generic utilities for DCTG-encoded schemas 158 >

The content-checking rule for string is a bit more complicated, even though it doesn't do anything type-specific. This predicate must:

check the content for child elements; if there are any, validate them in lax mode
check the content against the simple type t_string; if any errors

< 147 Content-checking rules for simple types > ≡

/* xsd_string_cont(+Gi,+LIn,-LPn,-LErr):
   Validating an element named Gi, with children LIn, 
   against the simple type xs:string, results in the 
   errors listed in LErr.  If Errors is empty, the element 
   is a legal instance of xs:string.  If the list of
   input children LIn contains elements, the result of
   validating them in lax mode is LPn.
*/
xsd_string_cont(Gi,LIn,LPn,LErr) :-
  memberchk(element(GiChild,_Atts,_Content),LIn),
  xsd_errorcode(cvc-type.3.1.2,Gi,'xsd:string',GiChild,Desc),
  LErr = [error(cvc-type.3.1.2,Desc)],
  sva_lax_seq(LIn,LPn).
  /* This assumes that even though having any element children 
   * makes the element we are checking illegal, we should strictly
   * speaking validate all of the element children in lax mode 
   * anyway (cvc-assess-elt clause 2). */

xsd_string_cont(_Gi,LIn,LPn,LErr) :- 
  not(memberchk(element(_GiChild,_Atts,_Content),LIn)),
  gr_xsd_string(LErr,Pn,LIn,[]),
  Pn^^children(LPn).

gr_xsd_string(LErr) ::= [Atom], 
  { xsd_string_value('xsd:string',Atom,LErr) }
  <:> children([Atom]).

This code is used in < Generic utilities for DCTG-encoded schemas 158 >

The predicate sva_lax_seq will be defined below in section Lax validation of sequences.

4.3.11. Error diagnosis for integer and decimal types

For integer and decimal, we follow a similar pattern. The code is slightly more interesting, since the lexical spaces are more interesting.

To check the lexical form ...:

< 148 Checking lexical form for integer and decimal > ≡

xsd_integer_value('xsd:integer',LF,[]) :-
  atom(LF),
  atom_chars(LF,L),
  gr_xsd_integer(LErr) 
  ...
xsd_integer_value('xsd:integer',LF,[]) :-
  ...

This code is not used elsewhere.

N.B. the rule for gr_xsd_string above appears to be screwed up. I think gr_simpletype rules should be used for the lexical form grammars, and should be called by type_value rules, instead of calling type_value rules.

4.3.12. Error diagnosis for other simple types

Similar error trapping code will be supplied, in the final version of this paper, for

t_integer
t_decimal
t_date
t_quantity
t_SKU

4.4. Lax validation of sequences

This section is incomplete and some details may need revision to be made consistent with other parts of this paper written more recently.

When checking simple types, we may encounter child elements; if so, the rules for schema-validity assessment of the parent element (listed in Validation Rule: Schema-Validity Assessment (Element)) seem to require that we validate them as best we can.

If we have a declaration for the element (i.e. if we have an element declaration for the given (namespace, localname) pair), then we validate the element against the declaration.
Otherwise, if the element has an xsi:type attribute, and we have a type definition for the QName given, then we validate the element against that type definition.
Otherwise, we mark the element unvalidated. Note that this applies even if there is a local element declaration with the appropriate namespace and local name.

< 149 Lax validation of sequences > ≡

sva_lax_seq(LIn,LPn) :-
  gr_lax_sequence(Node,LIn,[]),
  Node^^children(LPn).

gr_lax_sequence ::= []
  <:> children([]).
gr_lax_sequence ::= gr_lax_item^^Item, gr_lax_sequence^^Seq
  <:> children([CI|CS]) ::- 
    Item^^child(CI),
    Seq^^children(CS).

gr_lax_item ::= [PCDATA], { atom(PCDATA) },
  <:> child([PCDATA]).
gr_lax_item ::= [element(Gi,Atts,Cont)], 
  { schema_lookup(element,Gi,Rule),
    call(Rule(Node,[element(Gi,Atts,Cont)],[]))
  }
  <:> child(Node).

This code is not used elsewhere.

The run-time lookup of element rules requires a schema_lookup predicate, which we can define thus:

< 150 QName resolution for elements > ≡

/* Define the mapping from generic identifiers to grammar rules. */
schema_lookup(element,'...':purchaseOrder,e_purchaseOrder).
schema_lookup(element,
              'http://www.example.com/PO1':purchaseOrder,
              e_purchaseOrder).
schema_lookup(element,shipTo,e_shipTo).
schema_lookup(element,billTo,e_billTo).
schema_lookup(element,items,e_items).
schema_lookup(element,item,e_item).
schema_lookup(element,'http://www.example.com/PO1':comment,e_comment).
schema_lookup(element,name,e_name).
schema_lookup(element,street,e_street).
schema_lookup(element,city,e_city).
schema_lookup(element,state,e_state).
schema_lookup(element,zip,e_zip).
schema_lookup(element,productName,e_productName).
schema_lookup(element,quantity,e_quantity).
schema_lookup(element,'USPrice',e_USPrice).
schema_lookup(element,shipDate,e_shipDate).

This code is not used elsewhere.

For the purchase-order schema, or any fixed schema, we could instead hard-wire the definition of gr_lax_item as expanding to the known elements, and have a generic rule for unknown elements, along the following lines:

< 151 Alternate approach to lax lookup > ≡

gr_lax_item ::= e_purchaseOrder^^E
  <:> child(E).
gr_lax_item ::= e_shipTo^^E
  <:> child(E).
/* ... */
gr_lax_item ::= e_shipDate^^E
  <:> child(E).
gr_lax_item ::= unknown_element^^E
  <:> child(E).

unknown_element ::= [element(Ns:Ln,Attributes,Content)], 
  { 
    check_unknown_atts(A,NA,Attributes),
    sva_lax_seq(Content,LPn)
  } 
  <:> localname(Ln)
  && namespacename(Ns)
  && attributes(A)
  && namespaceattributes(NA)
  && children(LPn)
  && schemaerrorcode(error(cvc-assess-elt.1.1.1,Desc)) ::-
    xsd_errorcode(cvc-assess-elt.1.1.1,Desc)
  && validationattempted(ValidationCode) ::-
    find_validationcode(lax,LPn)
  /* ... */
  .
unknown_element ::= [Gi,Attributes,Content)],
  {
    check_unknown_atts(A,NA,Attributes),
    sva_lax_seq(Content,LPn)
  } 
  <:> localname(Gi)
  && namespacename(Ns)
  && attributes(A)
  && namespaceattributes(NA)
  && children(LPn)
  && schemaerrorcode(error(cvc-assess-elt.1.1.1,Desc)) ::-
    xsd_errorcode(cvc-assess-elt.1.1.1,Desc)
  && validationattempted(ValidationCode) ::-
    find_validationcode(lax,LPn)
  /* ... */
  .

This code is not used elsewhere.

This approach is simpler, perhaps, but the lookup mechanism given earlier seems likely to be more compact in the long run.

4.5. Validation of complex types

This section is incomplete and details need revision in the light of recent thought.

Some error conditions are generic. When we start from the validation rule Schema Validity Assessment (Element), we encounter a variety of validation rules, which can in principle raise any of the errors listed in [W3C 2001b].

A simple way to handle errors is to define a fall-back production which accepts pretty much anything, and forces the appropriate grammatical attribute to take the error value. If multiple fallbacks can be defined, each relaxing a different constraint, then we can easily identify which constraint was violated. This may require that we insert a cut into the normal rules, which seems unfortunate.

Another approach: each constraint turns into a must-succeed predicate, and we gather the errors together. So we get something like
rule(Arg,Arg2,Errors) :- check_constraint_1(Arg,Arg2,E1), check_constraint_2(Arg,Arg2,E2), check_constraint_3(Arg,Arg2,E3), ... check_constraint_n(Arg,Arg2,En), flatten([E1,E2,E3, ..., En],Errors). check_constraint_1(Arg,Arg2,[]) :- constraint_1(Arg,Arg2). check_constraint_1(Arg,Arg2,[error(err-c1,desc)]) :- not(constraint_1(Arg,Arg2)).

4.6. Validation of elements

...

For elements, we already have

[local name]
[namespace name]
[attributes]
[namespace attributes]
[children]
[type definition name] (Element Validated by Type)
[type definition namespace] (Element Validated by Type) — for this schema, this will always be either the purchase-order namespace or nothing (‘absent’)
[type definition anonymous] (Element Validated by Type) — true or false
[type definition type] (Element Validated by Type) — simple or complex
[validation attempted] (Assessment Outcome (Element)) — full, none, or partial
[validity] (Assessment Outcome (Element)) — valid, invalid, or notKnown

and will add

[nil] (Element Declaration) — true if the type is nillable and the element is nil in fact, false otherwise (the spec introduces this as an alternative to [element declaration]; if [element declaration] is provided, this information is redundant, although perhaps inconvenient to obtain otherwise)
[schema default] (Element Validated by Type) — the canonical lexical representation of a default or fixed value (provided whether the value is defaulted or not)
[schema error code] (Validation Failure (Element)) — either absent or a list of error codes
[schema information] (Schema Information) — attached to the element at which schema-validity assessment began (for us, always the purchaseOrder element); value is a set of namespace schema information items, which are triplets of (namespace, components, documents); lightweight processors are expected to leave the lists of components empty, but I propose to put in at least a list of schema-component designators
[schema normalized value] (Element Validated by Type) — if the element is not nilled, and did not default to a default value,[23] then this is the normalized value as validated, otherwise it's absent
[schema specified] (Element Default Value) — schema or infoset
[validation context] (Assessment Outcome (Element)) — the element at which validation began

The schema information property has some sub-properties:

namespace schema information information item properties
- [schema components] (Schema Information)
- [schema documents] (Schema Information)
- [schema namespace] (Schema Information)
schema document information item properties
- [document] (Schema Information)
- [document location] (Schema Information)

There are some PSVI properties for elements which we do not include:

[element declaration] (Element Declaration) — since this schema does not provide PSVI-style access to the schema components themselves, we have nothing to provide here
[ID/IDREF table] (ID/IDREF Table) — a set of (id, binding) pairs showing all the IDs used or referred to in a document and the elements to which they are bound (an empty binding reveals an IDREF to a non-existent ID); the purchase order schema has no IDs, and exposing this property is in any case optional for any processor
[identity-constraint table] (Identity-constraint Table) — since there are no identity constraints in the purchase-order schema used here as an example, this would in any case be empty
[type definition] (Element Validated by Type) — we are not exposing the schema components yet
[member type definition] (Element Validated by Type) — we are not exposing the schema components yet
[member type definition anonymous] (Element Validated by Type) — the purchase-order schema has no simple union types
[member type definition name] (Element Validated by Type) — the purchase-order schema has no simple union types
[member type definition namespace] (Element Validated by Type) — the purchase-order schema has no simple union types
[notation] (Validated with Notation) — we are not exposing the schema components yet
[notation public] (Validated with Notation) — the purchase-order schema does not have any declared notations
[notation system] (Validated with Notation) — the purchase-order schema does not have any declared notations

4.6.1. Error codes for elements with simple content

Rules with names of the form sva_content_TN will be responsible for checking the content of elements against a given simple type; some error codes are relevant not to the lexical form itself but to the element in which it appears:

< 152 Error codes for elements with simple types [continues 123 Error codes for simple types] > ≡

/* xsd_errorcode(+Sym,+Gi,+Typename,+Attname,-Desc):
   Desc is the string describing the error whose error code
   is Sym, when it occurs with respect to an element with
   name Gi, being validated against type Typename, on account
   of a bad attribute named Attname. 
*/
xsd_errorcode(cvc-type.3.1.1,Gi,Tn,Anbad,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its type definition (', Tn,
      ') is simple, but it has attributes other than ',
      'xsi:type, xsi:nil, xsi:schemaLocation, and '
      'xsi:noNamespaceSchemaLocation (specifically '
      Anbad, ').'],
    Desc).

/* xsd_errorcode(+Sym,+Gip,+Typename,+Gic,-Desc):
   Desc is the string describing the error whose error code
   is Sym, when it occurs with respect to an element with
   name Gip, being validated against type Typename, on account
   of a bad child element named Attname. 
*/
xsd_errorcode(cvc-type.3.1.2,Gi,Tn,GiChild,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its type definition (',Tn,') is simple, but it has '
      'element children (',GiChild,').'],
    Desc).

/* xsd_errorcode(+Sym,+Gi,+Typename,-Desc):
   Desc is the string describing the error whose error code
   is Sym, when it occurs with respect to an element with
   name Gi, being validated against type Typename.
*/
xsd_errorcode(cvc-type.3.1.3,Gi,Tn,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its content is not a legal lexical form for its '
      'simple type ',Tn,'.'],
    Desc).

4.7. Validation of attributes

...

For attributes, we already have

[local name]
[namespace name]
[normalized value]
[schema specified] (Assessment Outcome (Attribute))
[type definition anonymous] (Attribute Validated by Type)
[type definition name] (Attribute Validated by Type) — as supplied by schema; if none, may be absent or the processor may supply one
[type definition namespace] (Attribute Validated by Type) — the target namespace of the type
[type definition type] (Attribute Validated by Type) — for attributes, invariably simple
[validation attempted] (Assessment Outcome (Attribute))
[validity] (Assessment Outcome (Attribute))

and will add

[schema default] (Attribute Validated by Type) — canonical form of the default value, if any
[schema error code] (Validation Failure (Attribute)) — a list of error diagnostics, or else absent
[schema normalized value] (Attribute Validated by Type) — the normalized value as validated
[validation context] (Assessment Outcome (Attribute))

PSVI properties not included for attributes:

[attribute declaration] (Attribute Declaration) — we are not exposing the schema components yet
[member type definition] (Attribute Validated by Type) — the purchase-order schema has no union types
[member type definition anonymous] (Attribute Validated by Type) — the purchase-order schema has no union types
[member type definition name] (Attribute Validated by Type) — the purchase-order schema has no union types
[member type definition namespace] (Attribute Validated by Type) — the purchase-order schema has no union types
[type definition] (Attribute Validated by Type) — we are not exposing the schema components yet

4.8. Overview of layer-4 grammar

The layer-4 grammar has a structure very similar to that of the layer-3 grammar on which it is based.

N.B. at the moment parts or all of this layer-4 grammar are still identical to the corresponding parts of the layer-3 grammar; they will be replaced one by one as this section is completed.

The program po4.pl comprises both schema-specific and generic material.

< 153 DCTG version of the purchase order schema, layer 4 [File po4.pl] > ≡

/* po4.pl: a definite-clause translation grammar representation
 * of the sample purchase-order schema from the XML Schema tutorial.
 *
 * This DCTG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
{Predicates for purchase-order material 84}

{Generic utilities for DCTG-encoded schemas 85}

The schema-specific matter includes the element- and type-specific rules:

< 154 Predicates for purchase-order material > ≡

{Rules for elements with complex types 31}

{Rules for elements with simple types 33}

{Attribute handling for purchaseOrderType 42}

{Attribute handling for USAddress 51}

{Attribute handling for Items type 53}

{Attribute handling for t_item 54}

{Attribute handling for simple types 56}

{Rules for purchase-order content models 58}

{Simple-type content rules for purchase-order types 80}

{Value-checking rules for SKU 81}

{Element declarations 155}

{Layer 4 attribute declarations 157}

{Layer 4 type definitions 156}

This code is not used elsewhere.

The schema-specific matter also includes the various schema components:

< 155 Element declarations > ≡

{Element declaration: purchaseOrder 86}

{Element declaration: comment 88}

{Element declaration: shipTo 89}

{Element declaration: billTo 90}

{Element declaration: items 91}

{Element declaration: name 92}

{Element declaration: street 93}

{Element declaration: city 94}

{Element declaration: state 95}

{Element declaration: zip 96}

{Element declaration: item 97}

{Element declaration: productName 98}

{Element declaration: quantity 99}

{Element declaration: USPrice 100}

{Element declaration: shipDate 101}

This code is used in < Predicates for purchase-order material 154 >

and type definitions:

< 156 Layer 4 type definitions > ≡

{Complex type: t_PurchaseOrderType 102}

{Complex type: t_USAddress 104}

{Complex type: t_Items 106}

{Complex type: t_item 107}

{The urtype 110}


{Simple type: simple urtype 109}

{Simple type: string 112}

{Simple type: normalizedString 113}

{Simple type: token 114}

{Simple type: NMTOKEN 115}

{Simple type: decimal 117}

{Simple type: integer 118}

{Simple type: non-negative integer 119}

{Simple type: positive integer 120}

{ !! BAD REFERENCE! No scrap called L4_st_quanityType!}

{Simple type: date 122}

{Simple type: SKU 116}

This code is used in < Predicates for purchase-order material 154 >

and attribute declarations:

< 157 Layer 4 attribute declarations > ≡

{Purchase order attributes 103}

{Address attributes 105}

{Address attributes 108}

{Aattribute wildcard for Urtype 111}

This code is used in < Predicates for purchase-order material 154 >

The generic material includes rules for special attributes (namespace declarations and XSI attributes) and utility routines.

< 158 Generic utilities for DCTG-encoded schemas > ≡

{Grammar rules for namespace and XSI attributes 37}

{Utilities for checking attribute occurrences 38}

{Utility for checking absent attributes 39}

{Utility for providing defaulted attributes 40}

{Utility for whitespace normalization 43}

{partition predicate 57}

{Content-checking rules for simple types 147}

{Value checking rules for simple types 146}

{Error codes for simple types 123}

This code is not used elsewhere.

PSVI properties attached to the validation root which we do not include:

ID/IDREF binding information item properties — the purchase-order schema has no IDs or IDREFs

[binding] (ID/IDREF Table)
[id] (ID/IDREF Table)

Identity-constraint Binding information item properties — the purchase-order schema has no identity constraints

[definition] (Identity-constraint Table)
[node table] (Identity-constraint Table)

5. Notes on other features

This section is incomplete and may need revision in the light of recent thought.

With the completion of layer 4, we have a fairly complete representation of the purchase-order schema po1.xsd in DCTG form. Because that schema is intentionally rather simple, however, there has been no occasion to show how to translate various constructs into the DCTG formalism. The following sections set out some ideas for how some of the more prominent features of XML Schema missing from the example can be represented.

5.1. Handling xsi:type

There are two approaches to handling xsi:type. We can have the base type parse its part, so we get layers and clear delineation between base and extension, or else we can parse the entire thing. I'll describe both, and implement plan B because the separation between base and extension is not exhibited in the PSVI as defined.

5.2. Mixed content

Support mixed content. The purchase-order schema does not have any mixed content, and the SWI parser has a mode in which it suppresses text nodes consisting only of whitespace characters, so we have been able to ignore the problem of mixed content and the presence of character information items for non-significant white space in element-only content. In the long run, however, we have to face up to this topic. Three approaches offer themselves:

Filter the contents list, removing character data to produce a list of child elements with no intervening character-data atoms, and then validate that list in the way shown above. This is conceptually simple and would be easy to implement.
Unfortunately, the children attribute of the parent element needs to have the original character-data atoms present in their original places, so what we would really have to do would be first to partition the original children list into
1. a list of child elements with no intervening character-data atoms
2. a list of the character data atoms, with place-holders for the elements (actually, the original children list would work fine for this role)
and then after validating the elements to merge the two lists.

This is not hard to understand or implement, but it feels too kludgy to be attractive.[24]

Modify the rules representing any content model, adding an opt_pcdata or opt_ws token after each element, and also adding a special rule to take care of leading PCDATA or inter-element whitespace. For example, the rules for the content model of t_PurchaseOrderType would change from

t_PurchaseOrderType_cont ::= 
  e_shipTo, e_billTo, opt_e_comment, 
  e_items.
opt_e_comment ::= [].
opt_e_comment ::= e_comment.

t_PurchaseOrderType_cont ::= opt_ws, t_PurchaseOrderType_cont2.
t_PurchaseOrderType_cont2 ::= 
  e_shipTo, opt_ws, e_billTo, opt_ws, opt_e_comment, 
  e_items, opt_ws.
opt_e_comment ::= [].
opt_e_comment ::= e_comment, opt_ws.

Re-organize the parser so that instead of expressing the grammar in native Prolog rules like the Type_cont rules shown above, we represent them using some Prolog data structure (a combination of lists and structured terms is an obvious choice), and write a parser as a simple grammar interpreter which reads those data structures and does the parsing.[25]

This approach is tempting, because it allows us to eliminate much of the repetitive structure present in the grammar rules given above: every element has substantially similar rules differing only in having the name of the element and/or type embedded in particular places in the names of the relevant predicates; passing the names as parameters to a single copy of a generic predicate would be less repetitive, although also (it must be admitted) harder to follow at first glance.

5.3. Supporting wildcards with skip and lax processing

Show how to handle wildcards with skip, lax, and strict processing. The wildcards will have grammatical productions of their own; these grammars will follow rules different from those of normal non-wildcard particles.

5.4. Supporting substitution groups

Substitution groups can be handed by defining multiple element rules with the same left-hand side. If a is in the substitution group of b, then we have both

b --> [element(b,Attributes,Content)],
  {
    b_atts(Attributes,[]),
    b_cont(Content,[])
  }.
a --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.

and also

b --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.

5.5. Exploiting common code patterns (going to meta-level)

The grammars above are repetitious: for every element with a complex type, the grammars have several predicates differing only in the names of datatypes and element types built into the names of the predicates.

We can make the grammar more compact by replacing the repetitive predicates with single meta-level predicates to which the names of the relevant datatypes and element types are passed as parameters. If we do so, we can no longer take direct advantage of the DCG or DCTG notation and must instead formulate a sort of grammar interpreter to take their place; the code may also become slightly harder to follow. On the other hand, the code can also follow the rules of [W3C 2001b] and [W3C 2001c] more directly and obviously.

6. Further work

The processor outlined above seems promising enough to make further work seem useful. Three items of further work seem worth calling out specifically. First, we need to show that a DCTG-based representation of schemas like that described here conforms to the schema component constraints of the spec, and that a parser running the DCTG grammar performs validation according to the validation rules of the spec. Second, it would be useful to represent the schema for schemas as shown, so that we can validate XML-encoded schema documents. Third, it is a natural further step to combine the two levels of processing to make a schema processor which reads schema documents and from them builds DCTGs (or equivalent data structures) with which it validates document instances.

According to section 2.4 of [W3C 2001b], to be minimally conforming a translation of an XML Schema into DCTG notation must

implement the schema component constraints (see Appendix C.4 and subsection 6 of each component) I take this to mean that the schemas used by the processor must obey them; the processor may or may not independently enforce them. Proof obligation: prove that the style of DCTG outlined above does satisfy these constraints.
implement the validation rules (see Appendix C.1 and subsection 4 of each component). I take this to mean the processor must work correctly from its component representation to the values of the relevant PSVI instance properties. Proof obligation: prove that the style of DCTG outlined above does correctly implement the validation rules.
implement (i.e. make) the schema information set contributions (see Appendix C.2 and subsection 5 of each component). Proof obligation: make the parser decorate the input with appropriate grammatical attributes representing the relevant properties.

This work will be done in separate papers.

A. Works cited and further reading

Works cited

Abramson, Harvey. 1984. “Definite Clause Translation Grammars”. Proceedings of the 1984 International Symposium on Logic Programming, Atlantic City, New Jersey, February 6-9, 1984, pp. 233-240. (IEEE-CS 1984, ISBN 0-8186-0522-7)

Abramson, Harvey, and Veronica Dahl. 1989. Logic Grammars. Symbolic Computation AI Series. Springer-Verlag, 1989.

Abramson, Harvey, and Veronica Dahl, rev. Jocelyn Paine. 1990. DCTG: Prolog definite clause translation grammar translator. (Prolog code for translating from DCTG notation to standard Prolog. Note says syntax extended slightly by Jocelyn Paine to accept && between specifications of grammatical attributes, to minimize need for parentheses. Available from numerous AI/NLP software repositotries, including http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/prolog/code/syntax/dctg/0.html, http://www.ims.uni-stuttgart.de/ftp/pub/languages/prolog/libraries/imperial_college/dctg.tar.gz, and http://www.ifs.org.uk/~popx/prolog/dctg/.)

Alblas, Henk. 1991. “Introduction to attribute grammars”. Attribute grammars, applications and systems: International Summer School SAGA, Prague, Czechoslovakia, June 4-13, 1991, Proceedings, pp. 1-15. Berlin: Springer, 1991. Lecture Notes in Computer Science, 545.

Bratko, Ivan. 1990. Prolog programming for artificial intelligence. Second edition. Wokingham: Addison-Wesley. xxi, 597 pp.

Brown, Allen L., Jr., and Howard A. Blair. 1990. “A logic grammar foundation for document representation and layout”. In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation and Typography, ed. Richard Furuta. Cambridge: Cambridge University Press, 1990, pp. 47-64.

Brown, Allen, Matthew Fuchs, Jonathan Robie, and Philip Wadler. 1991. “XML Schema: Formal Description”. W3C Working Draft, 25 September 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xmlschema-formal/

Brown, Allen L., Jr., Toshiro Wakayama, and Howard A. Blair. 1992. “A reconstruction of context-dependent document processing in SGML”. In EP92: Proceedings of Electronic Publishing, 1992, ed. C. Vanoirbeek and G. Coray. Cambridge: Cambridge University Press, 1992. Pages 1-25.

Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz).

[Brüggemann-Klein provides a formal definition of 1-unambiguity, which corresponds to the notion of unambiguity in ISO 8879 and determinism in XML 1.0. Her definition of 1-unambiguity can be used to check XML Schema's Unique Particle Attribution constraint by changing every minOccurs and maxOccurs value greater than 1 to 1, if the two are equal, and otherwise changing minOccurs to 1 maxOccurs greater than 1 to unbounded.]

Clocksin, W. F., and C. S. Mellish. Programming in Prolog. Second edition. Berlin: Springer, 1984.

Gal, Annie, Guy Lapalme, Patrick Saint-Dizier, and Harold Somers. 1991. Prolog for natural language processing. Chichester: Wiley, 1991. xiii, 306 pp.

Gazdar, Gerald, and Chris Mellish. 1989. Natural language processing in PROLOG: An introduction to computational linguistics. Wokingham: Addison-Wesley, 1989. xv, 504 pp.

Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html

Holstege, Mary, and Asir S. Vedamuthu, ed. 2003. XML Schema: Component Designators. W3C Working Draft 09 January 2003. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xmlschema-ref/

Knuth, D. E. 1968. “Semantics of context-free languages”. Mathematical Systems Theory 2: 127-145.

König, Esther, and Roland Seiffert. Grundkurs PROLOG für Linguisten. Tübingen: Francke, 1989. [= Uni-Taschenbücher 1525]

Pereira, Fernando C. N., and Stuart M. Shieber, Prolog and natural-language analysis. CSLI lecture notes 10. Stanford: Center for the study of language and information, 1987.

Sperberg-McQueen, C. M. 2003a. “A logic grammar representation for XML Schema”. Working paper prepared for the W3C XML Schema Working Group. [Incomplete; incomplete outline is at lgrxs.html. Attempt at a systematic demonstration that DCTGs (can be made to) correctly implement the validation rules, provide the schema-infoset contributions, and obey the constraints on schemas defined in the specification.]

Sperberg-McQueen, C. M. 2003b. “An XML Schema validator in logic-grammar form”. Working paper prepared for the W3C XML Schema Working Group. [Incomplete; incomplete outline is at xsvlgf.html. Will provide a DCTG representation of the schema for schemas, with actions to perform the XML-to-component transformation and code to check the constraints on schemas.]

Stepney, Susan. High-integrity compilation. Prentice-Hall. Available from http://www-users.cs.york.ac.uk/~susan/bib/ss/hic/index.htm. Chapter 3 (Using Prolog) provides a terse introduction to DCTG notation and use.

Sterling, Leon, and Ehud Shapiro. 1994. The Art of Prolog: Advanced Programming Techniques. Cambridge, Mass.: MIT Press.

W3C (World Wide Web Consortium). 2001a. “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

W3C (World Wide Web Consortium). 2001b. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

W3C (World Wide Web Consortium). 2001c. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

Wadler, Philip. “A formal semantics of patterns in XSLT and XPath.” Markup Languages: Theory & Practice 2.2 (2000): 183-202.

Wielemaker, Jan. “SWI-Prolog SGML/XML parser: Version 1.0.14, March 2001”. http://www.swi-prolog.org/packages/sgml2pl.html

B. A note on the test cases

The test documents mentioned in the text were created manually to test the implementation of a schema corresponding to the schema document po1.xsd of the XML Schema tutorial [W3C 2001a].

A simple XSLT stylesheet read the schema and produced about 120 suggestions for tests, both valid and invalid documents. Because the stylesheet is primitive and simple-minded, some of the suggestions are not useful; in the end, eleven valid variations of the po1.xml example were produced:

po1v10a.xsd, add additional attributes (namespace declarations, xsi attributes) to purchaseOrder
po1v25.xml, omit optional comment element
po1v33.xml, omit optional orderDate attribute from purchaseOrder
po1v38.xml, add harmless attributes to USAddress elements
po1v62d.xml, valid but unreasonable zip code (3.141592); the declaration is too weak to detect this problem
po1v65.xml, omit country attribute on USAddress
po1v79.xml, empty items element
po1v80.xml, longer list of items (10)
po1v100a.xml, various alternative values for quantity element (1, 43, 55, 99)
po1v100b.xml, various alternative values for quantity element (1, 43, 55, 99)
po1v121.xml, alternative values for partNum

Also, 58 test cases with errors were produced:

po1e4.xml, misspell purchaseOrder
po1e13.xml, omit main seq of PurchaseOrderType
po1e14.xml, duplicate main sequence of PurchaseOrderType
po1e15a.xml, reorder children of purchaseOrder
po1e15b.xml, reorder children of purchaseOrder
po1e16.xml, extraneous child of purchaseOrder
po1e18.xml, omit required shipTo element
po1e19.xml, duplicate shipTo
po1e20.xml, misspell shipTo
po1e27.xml, supply two comment elements (invalid)
po1e28.xml, misspell 'comment'
po1e30.xml, omit items element
po1e31.xml, supply two items elements (invalid)
po1e32.xml, misspell 'items'
po1e35.xml, supply two orderDate attributes
po1e36.xml, misspell 'orderDate' on purchaseOrder
po1e41.xml, omit USAddress contents
po1e42.xml, supply double sequence of address children
po1e43.xml, reorder children of USAddress
po1e44.xml, add extraneous child to USAddress
po1e46.xml, omit name from USAddress
po1e47.xml, provide name twice in USAddress
po1e48.xml, misspell 'name' in USAddress
po1e50.xml, omit street in USAddress
po1e51.xml, double occurrence of street in USAddress
po1e52.xml, misspell 'street'
po1e55.xml, double occurrence of 'city'
po1e56.xml, misspell 'city'
po1e62.xml, omit zip
po1e62b.xml, omit contents of zip element
po1e62c.xml, invalid zip code
po1e63.xml, double occurrence of zip code
po1e64.xml, misspell 'zip' in USAddress
po1e68.xml, misspell attribute name 'country' on USAddress
po1e70.xml, provide value other than 'US' for fixed att 'country'
po1e70b.xml, provide value other than 'US' for fixed att 'country'
po1e78.xml, add extraneous child to sequence in type t_Items
po1e81.xml, misspell 'item'
po1e86.xml, omit contents of item element
po1e87.xml, repeat sequence of children in item
po1e88.xml, reorder children of item
po1e89.xml, add extraneous child to item
po1e91.xml, omit productName
po1e92.xml, reduplicated productName
po1e101a.xml, exceed the maximum legal value for quantity
po1e101b.xml, violate min, lexical form rules for quantity
po1e101c.xml, wrong datatype for quantity
po1e101d.xml, bad value for quantity
po1e105bisa.xml, bad value for USPrice
po1e105bisb.xml, bad value for USPrice
po1e106.xml, misspell USPrice
po1e109.xml, two comments in item
po1e113.xml, two shipDates (invalid)
po1e114.xml, misspell 'shipDate'
po1e116.xml, omit required partNum attribute
po1e122a.xml, bad value for SKU (lowercase letters, char out of range)
po1e122b.xml, bad value for SKU (excess chars)
po1e122c.xml, bad value for SKU, missing hyphen

To run the tests conveniently, the following auxiliary Prolog file can be used. At the moment, it is tied to the manually-modified version of podcg3.pl, but that will change as work on this paper progresses.

< 159 Utility routines for testing Prolog implementations of po1.xsd [File potests.pl] > ≡

/* potest: simple one-item test routine for SWI Prolog 
 * 
 * This DCG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
grammar_version(current,dctg,'d:/home/cmsmcq/2003/schema/dctg/po4.pl').
grammar_version(old,dctg,'d:/home/cmsmcq/2003/schema/dctg/podctg.pl').
grammar_version(dcg,dcg,'d:/home/cmsmcq/2003/schema/dctg/podcg3.pl').

potest(Grammar,dcg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest(Grammar,dctg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult('d:/usr/lib/prolog/msmdctg.pl'),
  dctg_reconsult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

potest2(XMLDocument,dcg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest2(XMLDocument,dctg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

one(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml').

two(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn).

/* Each of the following should be valid */
good(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v25.xml',Syn),    
  potest2('d:/usr/lib/xmlschema/po/tests/po1v33.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v38.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v62d.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v65.xml',Syn),   
  potest2('d:/usr/lib/xmlschema/po/tests/po1v79.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v80.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100a.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100b.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v121.xml',Syn).

/* Each of the following should raise an error */
bad(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e04.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e13.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e14.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15c.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e18.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e19.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e20.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e27.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e30.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e31.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e32.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e35.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e36.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e41.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e42.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e43.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e44.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e46.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e47.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e48.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e50.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e51.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e52.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e55.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e56.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e63.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e64.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e68.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e78.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e81.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e86.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e87.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e88.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e89.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e91.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e92.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101d.xml',Syn),
  /* 
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisa.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisb.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e106.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e109.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e113.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e114.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e116.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122c.xml',Syn).
/* 62c, 101c, 105bisa are commented out because they blow up 
 * number_chars/2.  Need a more robust approach to checking
 * numbers, apparently. */

ugly(V) :- good(V), bad(V).

The tests can be run from the SWI Prolog command line in bash thus:[26]

cd ~/2003/schema/dctg
$ /cygdrive/d/usr/src/pl/bin/plcon.exe -f potests.pl -g 'one(current)' -t halt

$ /cygdrive/d/usr/src/pl/bin/plcon.exe \
>  -f d:/home/cmsmcq/2002/Prolog/potest.pl \
>  -g 'good(current)' -t halt

The command line might be simpler if Prolog has a good search path, but I haven't gotten around to that yet.

The -g option specifies which set of tests to run: one, two (these check whether the tests work at all), good (the valid documents; these should each produce “yes”), bad (the invalid documents), or ugly (good followed by bad).

[For what it's worth, on 6 August 2003, the 71 or so tests of ugly for podctg.pl executed in 0.36 seconds of clock time on my IBM Thinkpad with a 550 MHz Celeron processor.)

C. SWI Prolog handling of characters

In the interests of simplicity, the code presented above in layers 1-4 blissfully ignores non-ASCII characters. A more reliable technique will need to deal properly with them as they are represented in the output of the SWI parser.

The version of SWI Prolog used here (5.0.10) uses atoms to represent characters which are representable in Prolog, and uses entity(Name) and entity(Num) structures otherwise. There may be some rough spots in the Unicode support. Here is a quick summary of what I have tested and the results.

named general entity reference in document for single character (defined by using numeric character reference):
- ntilde (defined as ñ): represented directly (code 241)
- lsquo and rsquo (defined as ‘ and ’): represented as entity(lsquo) and entity(rsquo)
numeric character reference in document:
- 241 (ntilde): represented directly (code 241)
- U+2018, U+2019 (lsquo and rsquo): represented as entity(8216) and entity(8217)
larger general entities with non-ASCII characters in the replacement text:
- using named entities: top-level entity is expanded; within the expansion, ntilde and the sinqle quotes treated as described above (named entities)
- using numeric character references: top-level entity is expanded; within the expansion, ntilde and the sinqle quotes treated as described above (numeric references)
UTF-8 (entire document translated to UTF-8, no entity or character references involved):
- ntilde shows up as character 241
- lsquo and rsquo show up as characters 24 and 25 (n.b. 24 = 8216 mod 256)

Upshot: use numeric character references for now.

D. Error codes for elements and attributes

Schema-validity assessment

Failures of schema-validity assessment (cvc-assess-elt). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-assess-elt.

cvc-assess-elt.1.1.1 schema-validity not assessed: no element declaration known for this element
cvc-assess-elt.1.1.2 schema-validity not assessed: element was not validated w.r.t. its element declaration (Element Locally Valid was not run).
cvc-assess-elt.1.1.3 schema-validity not assessed: the necessary type definition was absent [Should normally be accompanied by error code cvc.type.1.]
cvc-assess-elt.1.2.1 schema-validity not assessed: the type definition needed to validate this element was either unknown or absent
cvc-assess-elt.1.2.1.2.2 schema-validity not assessed: an xsi:type attribute was present, but its value was not a valid QName
cvc-assess-elt.1.2.1.2.3 schema-validity not assessed: an xsi:type attribute was present, but its value did not resolve to any known type definition
cvc-assess-elt.1.2.1.2.4 schema-validity not assessed: an xsi:type attribute was present, but the type it denotes is not validly derived from the type stipulated by the processor at run time
cvc-assess-elt.1.2.2 schema-validity not assessed: element was not validated w.r.t. either the processor-stipulated type or the local (xsi:type-specified) type definition
cvc-assess-elt.2 schema-validity not assessed: either there were children which were not assessed, or attributes which were not assessed. [Should be accompanied by one or more errors from cvc-assess-elt for the children, or cvc-assess-attr for the attributes.]

Elements: local validity

Local validity w.r.t. an element declaration (cvc-elt). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-elt.

cvc-elt.1 element is not locally valid: no element declaration found
cvc-elt.2 element is not locally valid: element is declared abstract
cvc-elt.3.1 element is not locally valid, element is declared non-nillable, but there is an xsi:nil attribute present (it doesn't matter what value that attribute had, it's legal only on nillable elements)
cvc-elt.3.2.1 element is not locally valid, it's nillable and has xsi:nil="true", but the element is not empty.
cvc-elt.3.2.2 element is not locally valid, it's nillable and has xsi:nil="true", but it has a fixed {value constraint}.
cvc-elt.4.1 element is not locally valid, it has xsi:type but the value is not a valid QName. [Should also have a cvc-simple-type error.]
cvc-elt.4.2 element is not locally valid, it has xsi:type but the value does not resolve to a known type. [Should also have a cvc-resolve-instance error.]
cvc-elt.4.3 element is not locally valid, it has xsi:type but the type named is not validly derived from the declared type. [Should also have a cos-ct-derived-ok error or a cos-st-derived-ok error.]
cvc-elt.5.1.1 element is not locally valid; it is empty and non-nil and has a {value constraint}, but the default value is not a legal value for the local type. [Should also have a cos-valid-default error.]
cvc-elt.5.1.2 element is not locally valid; it is empty and non-nil and has a {value constraint}, but the element (after supplying the default value) is not locally valid according to the actual type definition. [Should also have a cvc-type error.]
cvc-elt.5.2.1 element is not locally valid; failed Element Locally Valid (Type) (cvc-type). [Should also have a cvc-type error.]
cvc-elt.5.2.2.1 element is not locally valid; it has both a fixed value and element children.
cvc-elt.5.2.2.2.1 element is not locally valid; the initial value of its mixed content does not match the canonical lexical representation of the fixed value in the actual declaration.
cvc-elt.5.2.2.2.2 element is not locally valid; the actual value it contains does not match the canonical lexical representation of the fixed value in the actual declaration.
cvc-elt.6 element is not locally valid; failed an identity-constraint. [Should also have a cvc-identity-constraint error.]
cvc-elt.7 element is not locally valid; it's the validation root, but failed the ID/IDREF check. [Should also have a cvc-id error.]

Local validity w.r.t. a type definition (cvc-type). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-type.

cvc-type.1 element is not locally valid: no type definition found.
cvc-type.2 element is not locally valid: its type definition has abstract="true".
cvc-type.3.1.1 element is not locally valid: its type definition is simple, but it has attributes other than xsi:type, xsi:nil, xsi:schemaLocation, and xsi:noNamespaceSchemaLocation.
cvc-type.3.1.2 element is not locally valid: its type definition is simple, but it has element children.
cvc-type.3.1.2 element is not locally valid: its content is not a legal lexical form for its simple type. [Should also have a cvc-simple-type error code.]
cvc-type.3.2 element is not locally valid: it is not valid with respect to its complex type. [Should also have a cvc-complex-type error code.]

Validity with respect to a complex type

Local validity w.r.t. a complex type. For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-complex-type.

cvc-complex-type.1 element not locally valid: complex type does not have abstract="false".
cvc-complex-type.2.1 element not locally valid: complex type is declared empty (has {content type} of empty) but the element is not empty. [Applies only if non-nil.]
cvc-complex-type.2.2 element not locally valid: complex type is derived from a simple type (has a simple type as its {content type}) but the element has element children. [Applies only if non-nil.]
cvc-complex-type.2.2 element not locally valid: complex type is derived from a simple type (has a simple type as its {content type}) but the element is not valid w.r.t. that simple type. [Should also have cvc-simple-type error code.] [Applies only if non-nil.]
cvc-complex-type.2.3 element not locally valid: complex type has {content type} element-only, but the element has non-whitespace characters between child elements. [Applies only if non-nil.]
cvc-complex-type.2.4 element not locally valid: the sequence of element children does not match the content model. [Should also have one or more cvc-particle errors.] [Applies only if non-nil.]
cvc-complex-type.3.1 element not locally valid: attribute not valid w.r.t. declared attribute use. [Should also have one or more cvc-au errors.]
cvc-complex-type.3.2.1 element not locally valid: attribute not declared, matches no wildcard.
cvc-complex-type.3.2.2 element not locally valid: attribute not valid w.r.t. wildcard. [Should also have one or more cvc-wildcard errors.]
cvc-complex-type.4 element not locally valid: required attribute missing.
cvc-complex-type.5.1 element not locally valid: more than one wild ID.
cvc-complex-type.5.2 element not locally valid: there is both a wild ID and a tame ID.

Content model validation (local validity of element sequence with respect to a particle). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-particle

cvc-particle.1.1 element-sequence not locally valid w.r.t. particle: wildcard match count less than {min occurs}.
cvc-particle.1.2 element-sequence not locally valid w.r.t. particle: wildcard match count greater than numeric {max occurs}.
cvc-particle.1.3 element-sequence not locally valid w.r.t. particle: element not valid w.r.t. wildcard. [Should also have some cvc-wildcard error codes.]
cvc-particle.2.1 element-sequence not locally valid w.r.t. particle: element match count less than {min occurs}.
cvc-particle.2.2 element-sequence not locally valid w.r.t. particle: element match count greater than {max occurs}.
cvc-particle.2.3 element-sequence not locally valid w.r.t. particle: element not matched.
cvc-particle.3.1 element-sequence not locally valid w.r.t. particle: model group must occur at least {min occurs} times.
cvc-particle.3.2 element-sequence not locally valid w.r.t. particle: model group must occur at most {max occurs} times.
cvc-particle.3.3 element-sequence not locally valid w.r.t. particle: subsequence not valid against model group. [Should also have some cvc-model-group errors.]

Element sequence validity (cvc-model-group). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-model-group.

cvc-model-group element sequence is not valid with respect to model group
cvc-model-group.1 element sequence is not valid with respect to a sequence group
cvc-model-group.2 element sequence is not valid with respect to a choice group: none of the particles in the choice validates this sequence
cvc-model-group.3 element sequence is not valid with respect to an all group

Substitutability (cos-equiv-derived-ok-rec). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cos-equiv-derived-ok-rec.

cos-equiv-derived-ok-rec.2.1 element E1 not substitutable for element E2: substitution is blocked (the blocking constraint contains substitution) [which blocking constraint?]
cos-equiv-derived-ok-rec.2.2 element E1 not substitutable for element E2: E2 is not in the {substitution group affiliation} chain of E1
cos-equiv-derived-ok-rec.2.3 element E1 not substitutable for element E2: the type of E1 is derived from the type of E2 by a set of derivation methods which overlaps with the blocking constraint, or with the {prohibited substitutions} of E2's type, or with the {prohibited substitutions} of any other type in the derivation chain. [27]

Attributes

Validity of attribute (cvc-assess-attribute). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-assess-attr.

cvc-assess-attribute the attribute's schema-validity was not asessed.
cvc-assess-attribute.1 the attribute's schema-validity was not asessed: no attribute declaration was known and present
cvc-assess-attribute.1.1 the attribute's schema-validity was not asessed: no attribute declaration was known and present (context-determined declaration not found)
cvc-assess-attribute.1.2.a the attribute's schema-validity was not asessed: no attribute declaration was known and present (context-determined declaration was not skip, but the attribute's QName did not resolve)
cvc-assess-attribute.1.2.b the attribute's schema-validity was not asessed: no attribute declaration was known and present (context-determined declaration was skip)
cvc-assess-attribute.2 the attribute's schema-validity was not asessed: validation rule cvc-attribute not performed [why might this happen?]
cvc-assess-attribute.3 the attribute's schema-validity was not asessed: error cvc-attribute.1 or cvc-attribute.2 occurred

Local validity of attribute (cvc-attribute). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-attribute.

cvc-attribute attribute not locally valid w.r.t. attribute declaration
cvc-attribute.1 attribute not locally valid w.r.t. attribute declaration: attribute declaration is absent
cvc-attribute.2 attribute not locally valid w.r.t. attribute declaration: type definition is absent
cvc-attribute.3 attribute not locally valid w.r.t. attribute declaration: normalized value not locally valid with respect to (simple) type definition [should have error from cvc-simple-type, too]
cvc-attribute.4 attribute not locally valid w.r.t. attribute declaration: actual value does not match fixed value

Attribute locally valid (use) (cvc-au). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-au.

cvc-au attribute information item not valid w.r.t. attribute use: normalized value does not match canonical form of fixed value in attribute use

Simple types

Validity of a string with respect to a simple type (cvc-simple-type), as imposed by Structures. For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-simple-type.

cvc-simple-type.1 string not locally valid w.r.t. the given simple type. [Should also have one or more error codes from cvc-datatype-valid in datatypes spec.]
cvc-simple-type.2.1 string not locally valid w.r.t. simple type derived from ENTITY: not a declared entity name.
cvc-simple-type.2.2 string not locally valid w.r.t. simple type derived from ENTITIES: substring not a declared entity name.

Validity of a string with respect to a simple type (cvc-datatype-valid), as imposed by Datatypes. See http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-datatype-valid.

cvc-datatype-valid.1 lexical form not datatype-valid w.r.t. a given simple type: does not match a literal in the lexical space.
cvc-datatype-valid.1.1 lexical form not datatype-valid w.r.t. a given simple type: not pattern-valid.
cvc-datatype-valid.1.2 lexical form not datatype-valid w.r.t. a given simple type: not in base type's lexical space.
cvc-datatype-valid.1.2.1 lexical form not datatype-valid w.r.t. a given simple type: not in atomic base type's lexical space.
cvc-datatype-valid.1.2.2 lexical form not datatype-valid w.r.t. a given simple type: there is a whitespace-delimited token which is not in the base type's lexical space.
cvc-datatype-valid.1.2.3 lexical form not datatype-valid w.r.t. a given simple type: not a member of the lexical space of any member of the union.
cvc-datatype-valid.2 lexical form not datatype-valid w.r.t. a given simple type: does not denote a member of the value space. [Should usually be accompanied by an error indicating which facet was violated.]

Lexical forms

Pattern facet (cvc-pattern-valid). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-pattern-valid.

cvc-pattern-valid.1 Literal is not among the character sequences denoted by the regular expression of the pattern.

Values

Facet validity (cvc-facet-valid). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-facet-valid.

cvc-facet-valid.1 value not valid with respect to a particular constraining facet. [This one seems pointless; it will always be accompanied by other errors pointing out the specific facet with a problem.]

Length (cvc-length-valid and so on). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-length-valid, http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-minLength-valid, http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-maxLength-valid.

cvc-length-valid value is too short or too long
cvc-length-valid.1.1 (for strings and URIs): value has too many or too few characters
cvc-length-valid.1.2 (for hexBinary and base64Binary): value has too many or too few octets
cvc-length-valid.1.3 (for QName and NOTATION): value has wrong length (this error can never be raised: all values have legal length by definition)
cvc-length-valid.2 (for lists): value has too many or too few items
cvc-minLength-valid value is too short. (Sub-rules follow same pattern as for cvc-length-valid.)
cvc-maxLength-valid value is too long. (Sub-rules follow same pattern as for cvc-length-valid.)

Enumeration facet (cvc-enumeration-valid). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-enumeration-valid.

cvc-enumeration-valid the value is not one of those enumerated

Minima and maxima: (cvc-maxExclusive-valid etc.) For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-maxExclusive-valid, http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-maxInclusive-valid, http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-minExclusive-valid, http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-minInclusive-valid.

cvc-maxExclusive-valid value is not (numerically or chronologically) less than the specified maximum
cvc-maxInclusive-valid value is not (numerically or chronologically) less than or equal to the specified maximum
cvc-minExclusive-valid value is not (numerically or chronologically) greater than the specified maximum
cvc-minInclusive-valid value is not (numerically or chronologically) greater than or equal to the specified maximum

Fraction digits and total digits (cvc-fractionDigits-valid and cvc-totalDigits-valid). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-fractionDigits-valid and http://www.w3.org/XML/Group/2002/09/xmlschema-2/datatypes-with-errata.html#cvc-totalDigits-valid.

cvc-fractionDigits-valid value requires too many fractional digits (i.e. it is not expressible as i × 10^-n where i and n are integers such that 0 ≤ n ≤ fractionDigits).
cvc-totalDigits-valid value requires too many digits (i.e. it is not expressible as i × 10^-n where i and n are integers such that |i| < 10^totalDigits and 0 ≤ n ≤ totalDigits).

Miscellaneous, common constructs

Success of QName resolution (cvc-resolve-instance). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-resolve-instance.

cvc-resolve-instance a QName pair (namespace name or absent, local name) failed to resolve to a schema component
cvc-resolve-instance.1 a QName pair failed to resolve to a simple or complex type definition using {type definitions}
cvc-resolve-instance.2 a QName pair failed to resolve to an attribute declaration using {attribute declarations}
cvc-resolve-instance.3 a QName pair failed to resolve to an element declaration using {element declarations}
cvc-resolve-instance.4 a QName pair failed to resolve to a named attribute group using {attribute group definitions}
cvc-resolve-instance.5 a QName pair failed to resolve to a named model group using {model group definitions}
cvc-resolve-instance.6 a QName pair failed to resolve to a notation using {notation declarations}

Identity constraints (cvc-identity-constraint). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-identity-constraint.

cvc-identity-constraint element not locally valid w.r.t. an identity constraint
cvc-identity-constraint.1 element not locally valid w.r.t. an identity constraint: selector on constraint does not evaluate to a node set
cvc-identity-constraint.2 element not locally valid w.r.t. an identity constraint: selector on constraint evaluates to a node set including a node outside the element's subtree
cvc-identity-constraint.3.a element not locally valid w.r.t. an identity constraint: for some target node, some field evaluates to more than one node
cvc-identity-constraint.3.b element not locally valid w.r.t. an identity constraint: for some target node, some field evaluates to a node without a simple type
cvc-identity-constraint.4.1 element not locally valid w.r.t. an identity constraint: constraint is unique but two members of qualified node set have pairwise equal key sequences
cvc-identity-constraint.4.2.1 element not locally valid w.r.t. an identity constraint: constraint is key, but the target node set and the qualified node set are not equal
cvc-identity-constraint.4.2.2 element not locally valid w.r.t. an identity constraint: constraint is key, but two qualified nodes have equal key sequences
cvc-identity-constraint.4.2.3 element not locally valid w.r.t. an identity constraint: constraint is key, but some key field on a qualified node was validated with {nillable} = true
cvc-identity-constraint.4.3 element not locally valid w.r.t. an identity constraint: constraint is keyref, but there is some keyref member M for which no key in the table matches the key sequence of M.

n (cvc-id). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-id.

cvc-id.1 validation root is not valid: there is an ID/IDREF binding which is the empty set (i.e. there is a reference to an undefined ID)
cvc-id.1 validation root is not valid: there is an ID/IDREF binding which has multiple members (i.e. some ID occurs more than once)

Local validity with respect to a wildcard particle (cvc-wildcard). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-wildcard.

cvc-wildcard information item is not allowed by namespace constraint of the wildcard it was validated against [should be accompanied by cvc-wildcard-namespace errors].

Wildcard and namespace (cvc-wildcard-namespace). For details, see http://www.w3.org/XML/Group/2002/09/xmlschema-1/structures-with-errata.html#cvc-wildcard-namespace.

cvc-wildcard-namespace.2.2 namespace name not valid with respect to a wildcard constraint: constraint is an exclusion, and the value is identical to the excluded namespace.
cvc-wildcard-namespace.2.3 namespace name not valid with respect to a wildcard constraint: constraint is an exclusion, but the value is absent.
cvc-wildcard-namespace.3 namespace name not valid with respect to a wildcard constraint: The constraint is a set, but the namespace name is not among the members.

E. Further reading

Definite-clause grammars (DCGs) are introduced in almost any good Prolog textbook: e.g. [Clocksin/Mellish 1984], [Sterling/Shapiro 1994], or [Bratko 1990]. They are discussed at somewhat greater length in treatments of Prolog for natural-language processing, including [König/Seiffert 1989], [Gazdar/Mellish 1989], and [Gal et al. 1991]. Most extended discussions show how to use additional arguments to record syntactic structure or handle the semantics of the material.

Definite-clause translation grammars were introduced as a way of making it easier to handle semantics; they provide explicit names for attributes (in the sense of attribute grammars [Knuth 1968]).

F. Graveyard (todo list and dead prose)

To do

Check to see whether whitespace normalization is required on the value of namespace attributes and xsi:*.
Distribute this todo list properly among dctgnotes.xml, lgrxs.xml, and xsvlgf.xml, de-dup.
layer 4
make an XSLT stylesheet to do the DCTG generation from a single schema document (?)

Think about schema composition. N.B. the SWI http_open library can be used to open streams for Web-accessible material, like this:

['d:/usr/lib/prolog/http_open']. % load the library
http_open('http://www.w3.org/2001/XMLSchema.xsd',XSDStream), 
 load_xml_file(stream(XSDStream),XSD), 
 close(XSDStream).

Write a basic DCTG for schema documents (translation of the schema for schemas). (Write XSLT stylesheet and/or Prolog code to do this job.)
Elaborate the schema-document DCTG to check all required constraints on XML representation of schemas.
Elaborate the schema-document DCTG to provide simple representations of schema components; prove that the mapping follows the rules. Checking unique particle attribution rule can be done by transcribing Anne Brüggemann-Klein's algorithm in [Brüggemann-Klein 1993]. For extra credit, write a diagnostic routine which fires when a non-deterministic model is provided, and proposes a deterministic alternative. Warn the user if this loses any appinfo information.
Add routines to go from the representations of schema components into DCTG notation described in section 3.
Show how (add routines to?) perform schema composition: import, include, redefine.
Go back and fulfil the proof obligations relating to the DCTG representation of schemas.
[after completion] experiment with SWI Prolog facilities to try to make the program itself more useful as a program, not just an animated spec:
- compilation
- quick-load format (may make Unicode table more concise)
- profiling tools
- XPCE front end
- as CGI script

Done:

get a layer 1 grammar working (done 2002-09-13, Denver International Airport)
add layer 2 (done 2002-09-13)
translate to DCTG as layer 3 (completed 2003-01-05, though not yet tested)
generate test cases to test each type: correct instances, boundary cases, incorrect instances (done 2002-12-26)
¿ make explicit grammatical attributes for the input infoset properties, as well as those added in the PSVI? (Need to learn a bit more about how SWI Prolog presents this information.)
- Attribute Information Items:
  - [local name]
  - [namespace name]
  - [normalized value]
- Character Information Items:
  - [character code]
- Element Information Items:
  - [local name]
  - [namespace name]
  - [children]
  - [attributes]
  - [in-scope namespaces] or [namespace attributes]
- Namespace Information Items:
  - [prefix]
  - [namespace name]
(done as part of level 3)

Toward a useful layering

The layering proposed above is a start; I'd like to confirm it by walking through a validation episode or two. I'll start with the purchase order from the tutorial. We visit the following validation rules; under each, I note some features I think should probably be omitted from layer 1, and introduced as elaborations later, in some sequence to be determined.

Sec. 5.2, rule 3
- ability of user or application to stipulate a type definition at startup
- ability of user or application to stipulate an element declaration at startup
- ability of user or application to stipulate a starting point other than the root of the document
sva_element_334
- use of xsi:type (and with it clause 1.2)
qname_resolution_instance_3f4
elementlocallyvalid_element_334
- absent declarations
- abstract element declarations
- xsi:nil
- default values, fixed values (?)
element_locally_valid_type_334
- absent type definitions
- abstract types
string valid (and from there to simple-type checking)
element_locally_valid_complextype
- attribute wildcards
- wild IDs
element sequence locally valid (particle)
- wildcards
- ...
element sequence valid
attribute locally valid (use) 354
schema-validity assessement (attribute) 324
assessment outcome (attribute) 325

Notes on the complex-type structure

There are a couple aspects of this structure for complex types which I'm not sure about.

First, this sketch writes the name of a DCTG grammar rule where [W3C 2001b] says the property value has a particle. In the long run, we may want to represent content models as a structure here.[28]

Second, I'm not sure about the value of the attribute_uses property. Perhaps attributes should be globalized in some form like attdecl(ID,SCD,NS,Local,ElemID,Properties), so we'd have

attdecl(a_orderDate,
  '/complexType(po:PurchaseOrderType)/attribute(orderDate)',
  orderDate,keyword(absent), local, [
    (id(a_orderDate)),
    (scd()),
    (name(orderDate)),
    (target_namespace('http://www.example.com/PO1')),
    (type_definition(t_date)),
    (scope(t_PurchaseOrderType)),
    (value_constraint(keyword(absent))),
    (annotation(keyword(absent)))
]).

in which case the {attribute uses} property might read

  (attribute_uses([
              au(required(false),
                 attdecl(a_orderDate),
                 value_constraint(keyword(absent)))
             ])),

or (if we want to ensure that we can use the ^^ operator with attribute uses):

  (attribute_uses([
              au([(required(false)),
                 (attdecl(a_orderDate)),
                 (value_constraint(keyword(absent)))])
              ])),

An alternative design would use a single predicate component with an additional component-type argument: component(complexType,t_PurchaseOrderType,...).