Notes on logic grammars and XML Schema

A working paper prepared for the W3C XML Schema Working Group

C. M. Sperberg-McQueen

12 January 2003

N.B. not complete: work in progress



This document describes some possible applications of logic grammars to schema processing as described in the XML Schema specification. The term logic grammar is used to denote grammars written in logic-programming systems; this paper works with definite-clause grammars (DCGs), which are a built-in part of most Prolog systems, and definite-clause translation grammars (DCTGs), which employ a similar formalism which more closely resembles attribute grammars as described by [Knuth 1968] and later writers and which make it a bit easier to handle complex specifications. Both DCGs and DCTGs can be regarded as syntactic sugar for straight Prolog; before execution, both notations are translated into Prolog clauses in the usual notation.
In its current state, this paper is unfinished: it is more or less complete through the description of the layer-3 grammar, but layer 4 and the topics which follow it are still fragmentary. Some highlighted notes about what needs to be done to complete the paper are included.

1. Introduction

1.1. Note to the reader

This paper is an elaboration of an idea suggested to the author by Allen Brown of Microsoft; it is intended to help the author (and possibly others) understand how the idea might work. The idea itself is simply stated: one way to achieve direct interpretation of the formal inference rules of XML Schema: Formal Description would be to translate them mechanically into a definite-clause translation grammar.
The idea of direct interpretation of formal specifications is not new; many textbooks on formal methods or on specific formal methods (Z, the Vienna Development Method VDM, etc.) discuss it at least as an idea. [Stepney 1993] provides a simple Web-accessible example of the idea: she illustrates the creation of a high-integrity compiler by
  • specifying the semantics (both dynamic and static) of a procedural programming language formally, in Z
  • specifying the semantics of a target machine language (or: assembly language) formally, also in Z
  • specifying the translation from source language to machine language formally
  • specifying the syntax of the programming language in a definite-clause translation grammar
  • translating the static semantics, dynamic semantics, and the source-to-machine language translation as grammatical attributes[1] in the DCTG. A Prolog system asked to supply the dynamic semantics of a program fragment serves as an interpreter for the source language; a Prolog system asked to supply the translation semantics acts as a compiler for the source language.
Animation of a specification is probably easiest if the spec is in a formal language. As Stepney's book illustrates, it is still possible otherwise, but it involves both the animation of a formal specification and an interpretation of the spec (in the form of a translation into a more formal language); the latter requires careful checking.
This paper attempts to make plausible the claim that a similar approach can be used with the XML Schema specification, in order to provide a runnable XML Schema processor with a very close tie to the wording of the XML Schema specification. Separate papers will report on an attempt to make good on the claim by building an XML Schema processor using this approach; this paper will focus on the rationale and basic ideas, omitting many details.

1.2. Using logic grammars in schema processing

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question “is this tree a member of the set?” For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.
In the remainder of this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.
The first question that arises is:
  • How do we use logic grammars to model XML Schema processing?
To this, there are three answers:
  • Model the schema as a logic grammar; use it to parse document instances and generate the PSVI.
  • Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
  • Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).
Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. These problems are attacked in a series of small examples in which a DCTG is developed which represents a simple schema:
  • In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
  • How do we check XML attributes?
  • How do we model type assignment, black-box validation, white-box validation?
  • How do we model the post-schema-validation information set?
Some problems arise if we use a logic grammar to model the rules for constructing schema components, and parse schema documents with that logic grammar as a way of modeling the process of constructing schema components.
  • How do we use parse-tree attributes to model information-item properties?
  • How do we use the output of this process to drive schema-validity assessement of document instances? Can we formulate the logic grammars used for schema-validity assessment as properties / attributes of the schema components?
In the next part of this paper (section 2) I introduce the definite-clause grammar notation, with some simple examples. A simplified representation of the purchase-order schema po1.xsd from the XML Schema tutorial [W3C 2001a] shows how a DCG can be used to support DTD-style validation in a straightforward way In section 3, I describe definite-clause translation grammars (DCTGs), which provide a slightly more convenient notation for complex attribute grammars than do DCGs. The patterns illustrated by the purchase-order schema are elaborated to handle various non-DTD constructs: post-schema-validation properties, error handling, substitution groups, xsi:type, and wildcards. This simple example shows how logic grammars can be applied as representations of schemas, but does not illustrate all aspects of schema validation.
In a separate paper ([Sperberg-McQueen 2002a], I [plan to] work systematically through the validation rules of [W3C 2001b] [and [W3C 2001c]?], providing a more systematic representation of schema-validity assessment in logic-grammar terms.
In a third paper ([Sperberg-McQueen 2002b]), I [will] develop a grammar which reads XML Schema documents and enforces constraints on their XML representation and on the components derived from them, together with routines for compiling schema components into the logic-grammar representation sketched here and developed more systematically in [Sperberg-McQueen 2002a].

2. Simple DTD-style document checking

We start with a simple case: considering only (a subset of) the kind of schema information captured in a document-type definition (DTD) in XML 1.0 DTD syntax, we write a logic grammar to represent a schema or DTD and use it to validate a document. The same technique, of course, can be used to capture and use part of the information in a schema expressed in XML Schema 1.0.

2.1. Definite-clause grammars

Two or three versions of a simple example may be useful in making some characteristics of DCG notation clear.
Readers already familiar with DCGs can skip this section without loss.

2.1.1. Grammar E1: A trivial grammar for a fragment of English

Here is a simple DCG for a trivially small fragment of English, which I'll call E1. It is taken from example 8.1 in [König/Seiffert 1989], an introduction to Prolog for linguists:
%%% Grammar E1, a trivial fragment of English
s --> np, vp.  % A sentence (s) is a noun phrase (np) plus a verb phrase
np --> det, n. % A noun phrase is a determiner plus a noun
np --> n.      % ... or just a noun.
vp --> v, np.  % A verb phrase is a verb and its direct object, an np
vp --> v.      % ... or just the verb (for intransitives).
n --> [mary].  % 'mary', 'john', 'woman, 'man', 'apple' are nouns.
n --> [john].
n --> [woman].
n --> [man].
n --> [apple].
det --> [the]. % 'the' is a determiner
v --> [loves]. % 'loves', 'eats', and 'sings' are verbs.
v --> [eats].
v --> [sings].
This grammar generates sentences like “John loves Mary” and “The man sings” and “The woman eats the apple”, as well as the somewhat less plausible “The apple sings the Mary.”
When a Prolog system consults a file containing this grammar, the rules of the grammar are automatically translated into canonical Prolog clauses using difference lists:
?- listing([s,np,vp,n,det,v]).

s(A, B) :-
        np(A, C),
        vp(C, B).

np(A, B) :-
        det(A, C),
        n(C, B).
np(A, B) :-
        n(A, B).

vp(A, B) :-
        v(A, C),
        np(C, B).
vp(A, B) :-
        v(A, B).

n([mary|A], A).
n([john|A], A).
n([woman|A], A).
n([man|A], A).
n([apple|A], A).

det([the|A], A).

v([loves|A], A).
v([eats|A], A).
v([sings|A], A).

Yes
?- 
Difference lists are pairs of lists such that the second list is a suffix of the first list (i.e. the same from some point or other to the end), for example, the pair [a, b, c, d] and [c, d]. Prolog-based parsers commonly use difference lists to represent the input; they allow the Prolog clauses which recognize structures in the input list to consume as much of the input list (the first list in the pair) as they wish, and pass the rest of the input to the next clause. If a DCG predicate succeeds with its final two arguments unified to the difference-list pair [a, b, c, d] and [c, d], what it means is that the grammar rule recognized the list [a, b].
In the clause for s, we can see how one part of the parser picks up where another leaves off. The input list is A, and the np predicate recognizes a noun phrase at the beginning of A. The portion of the list which follows the noun phrase is assigned to C. The predicate vp then tries to recognize a verb phrase at the beginning of list C. What is left after the verb phrase is assigned, in turn, to list B, which is both ‘the part of the input list after the verb phrase’ (in the predicate vp) and (because the verb phrase can be the last item in a sentence) also ‘the part of the input after the s’ (in the predicate s).
For example, given the sentence [the, woman, eats, the, apple], the first rule for np will be called with its first argument bound to the entire input string; after it recognizes the initial noun phrase, the second argument will be bound to the part of the sentence which follows the noun phrase, namely [eats,the,apple]. The predicate vp will then recognize the verb phrase [eats,the,apple], leaving the empty list [] as the part of the input after the sentence.[2]

2.1.2. Calling the parser

The following fragment of a Prolog session log shows how the grammar can be used to test the grammaticality of sentences; it also shows that the grammar could use some refinement. To test whether a given sequence of tokens is an s, we call the Prolog predicate s with two difference-list arguments. To require that the entire input sequence be a legal sentence, we specify the empty list as the second argument.
?-  s([john,loves,mary],[]).

Yes
?- s([the,woman,eats,the,apple],[]).

Yes
?- s([the,man,sings],[]).

Yes
?- s([apple,eats,woman,the],[]).

No
?- s([apple,sings,the,mary],[]).

Yes
?- 
We can use a variable as a second argument, instead of the empty list; in this case, Prolog will tell us each prefix of the input list which can be parsed as a sentence according to our grammar:
?- s([the,woman,eats,the,apple],Remainder).

Remainder = [] ;

Remainder = [the, apple] ;

No
?- 
In one parse, the entire input is parsed as a sentence, and the remainder is the empty list. An alternative parse, however, is just to take [the,woman,eats] as a sentence, leaving [the,apple] as the remainder.
The XML Schema processor we sketch below will normally call the parser with an empty list as the final parameter, to require that the entire input sequence be consumed. If we wish to see how much of the content of an element belongs to the base type, and how much to an extension of the base type, we can call the parser for the base type, with a variable as the final parameter, and the base type will consume as much of the input as it can. The rest belongs to the extension.

2.1.3. Grammar E2: A simple attribute grammar

We can turn this simple context-free grammar E1 into an attribute grammar by adding parameters to the productions; I will call the attribute grammar E2. In E2, one parameter is used to generate a structural description of the sentence. In this notation, the sentence “John loves Mary” is assigned the structure s(np(pn(john)), vp(v(loves), np(pn(mary)))) — or, with pretty-printing:
s(
  np(
     pn(john)
    ), 
  vp(
     v(loves), 
     np(
        pn(mary)
       )
    )
 )
On some productions, a second parameter is used to enforce agreement in number between subject and verb and between determiner and noun.
Thus extended with parameters, and a few more words in the lexicon, the grammar looks like this:
%%% E2: a trivial attribute grammar for a fragment of
%%% English, with a synthesized attribute for structural description
%%% and enforcement of number agreement.

s(s(S,P)) --> np(S,Num), vp(P,Num).
np(np(D,N),Num) --> det(D,Num), n(N,Num).
np(np(N),pl) --> n(N,pl).
np(np(N),sg) --> pn(N,sg).
vp(vp(V,O),Num) --> v(V,Num), np(O,_).
vp(vp(V),Num) --> v(V,Num).
n(n(L),Num) --> [L], { lex(L,n,Num) }.
pn(pn(L),Num) --> [L], { lex(L,pn,Num) }.
v(v(L),Num) --> [L], { lex(L,v,Num) }.
det(det(L),Num) --> [L], { lex(L,det,Num) }.

lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).
In this version of the example, it is worth calling the reader's attention to the Prolog clauses enclosed in braces on the right-hand side of some rules, e.g. in
n(n(L),Num) --> [L], { lex(L,n,Num)}.
These Prolog clauses express constraints on the grammatical construction which are not themselves grammatical constituents. They are sometimes called ‘guards’ (e.g. by [Brown/Blair 1990]); they correspond directly to the ‘semantic conditions’ which are a constituent part of attribute grammars as described by [Alblas 1991]. With the guard, this rule can be paraphrased as saying “A noun (an n), has the structure n(L) and the grammatical number Num, if (a) L (for ‘lexical form’) is a single item in the input list [L], and (b) the relation lex contains the triple (L,n,Num).”
Such guards will be present in many of the rules we use to represent XML Schema in logic grammars.
By calling the grammar rules, we can instantiate a variable with the structural description of the sentence.
?- s(Struc,[john,loves,mary],[]).

Struc = s(np(pn(john)), vp(v(loves), np(pn(mary)))) 

Yes
?- s(S,[the,woman,eats,the,apple],[]).

S = s(np(det(the), n(woman)), vp(v(eats), np(det(the), n(apple)))) 

Yes
?- s(S,[the,man,sings],[]).

S = s(np(det(the), n(man)), vp(v(sings))) 

Yes
?- s(S,[apple,sings,the,mary],[]).

No
?- s(S,[the,apple,sings,mary],[]).

S = s(np(det(the), n(apple)), vp(v(sings), np(pn(mary)))) 

Yes
?- 
Note that in a ‘conventional’ use of DCGs, the user or application will call the grammar just once, at the top level, and the parser itself will generate recursive calls for the nested grammatical constructs (e.g. np and vp in our example). Note, however, that we can if we wish make calls to any level of the grammar from arbitrary Prolog code. Note also that we can include arbitrary Prolog code in the right-hand side of rules by wrapping it in braces. By doing so, we can make arbitrarily complex calculations part of the grammar rules; this is one reason that DCGs are not limited to context-free languages.

2.2. Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].
The parser returns the XML document to the user as a list of items,[3] each item one of:
  • an atom (for character data)
  • a structure of the form element(GI,AttributeList,ContentList), where
    • GI is an atom representing the element's generic identifier
    • AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
    • ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
  • a structure of the form entity(Integer) representing an otherwise unrepresentable character
  • a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
  • a structure of the form pi(Text) representing a processing instruction
For example, the sample purchase order po1.xml used as an example in XML Schema Part 0 looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):
[element('http://www.example.com/PO1':purchaseOrder, 
  [xmlns:apo='http://www.example.com/PO1', 
   orderDate='1999-10-20'], 
  [element(shipTo, 
    [country='US'], 
    [element(name, [], ['Alice Smith']), 
     element(street, [], ['123 Maple Street']), 
     element(city, [], ['Mill Valley']), 
     element(state, [], ['CA']), 
     element(zip, [], ['90952'])
    ]), 
   element(billTo, 
    [country='US'], 
    [element(name, [], ['Robert Smith']), 
     element(street, [], ['8 Oak Avenue']), 
     element(city, [], ['Old Town']), 
     element(state, [], ['PA']), 
     element(zip, [], ['95819'])
    ]), 
   element('http://www.example.com/PO1':comment, 
    [], 
    ['Hurry, my lawn is going wild!']
   ), 
   element(items, [], 
    [element(item, 
      [partNum='872-AA'], 
      [element(productName, [], ['Lawnmower']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['148.95']), 
       element('http://www.example.com/PO1':comment, 
        [], 
        ['Confirm this is electric']
       )
      ]), 
     element(item, 
      [partNum='926-AA'], 
      [element(productName, [], ['Baby Monitor']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['39.98']), 
       element(shipDate, [], ['1999-05-21'])
      ])
    ])
  ])
]
When elements and attributes are namespace-qualified, as the purchaseOrder and comment elements are in this example, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name (as shown here) or the namespace prefix used.
Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

2.3. A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars. For simplicity of exposition, the translation will be divided into several layers:
  1. translation into DCG form
  2. checking for duplicate and missing attributes, validating attribute types
  3. translation into DCTG form
  4. supplying PSVI properties
  5. supplying properties related to validity, handling invalid input
  6. handling xsi:type
  7. supporting wildcards
  8. supporting substitution groups

2.4. Layer 1: Translation into definite-clause grammar

Each element type in the language is translated into three sets of DCG productions:
  • a top-level rule to match the tag and get the contents
  • a set of rules for checking the element's attributes
  • a set of rules defining the element's content; this is the part that looks most like a conventional DCG

2.4.1. Top-level rules for element types

For each element type in the schema, we create one rule in the grammar, which will serve to match the start-tag, get the attributes and contents of the element, and call routines to check the attributes and content against the complex type. Here there is exactly one such rule for each element type, but in later examples we will have several, in order to handle substitution groups. A simple version of these top-level rules would look like this:[4]
< 1 Rules for elements with complex types [File podcg3.pl] > ≡
/* Definite-clause grammar for the XML Schema represented in po.xsd.
 * 
 * This DCG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
purchaseOrder --> [element('http://www.example.com/PO1':purchaseOrder,
    Attributes,Content)],
  {
    purchaseOrderType_atts(Attributes,[]),
    purchaseOrderType_cont(Content,[])
  }.
shipTo --> [element(shipTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
billTo --> [element(billTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
items --> [element(items,Attributes,Content)],
  {
    t_items_atts(Attributes,[]),
    t_items_cont(Content,[])
  }.
item --> [element(item,Attributes,Content)],
  {
    t_item_atts(Attributes,[]),
    t_item_cont(Content,[])
  }.



Because we get the XML document already in tree form, the top-level rules don't have much work to do. In particular, they do not have to identify the substructure of the element or find its end; that has already been done. Instead, we get the attributes and content already separated out. We launch a separate call on each to parse them according to the appropriate rule for their datatype.[5]
The comment element and others associated with simple types pose an interesting question. We could define them following the pattern above, like this:
comment --> [element('http://www.example.com/PO1':comment,
    Attributes,Content)],
  {
    xsd_string_atts(Attributes,[]),
    xsd_string_cont(Content,[])
  }.
But we know that the xsd:string type does not have any attributes declared, so Attributes must be the empty list. We can therefore define these elements this way instead:
< 2 Rules for elements with simple types [File podcg3.pl] > ≡
comment --> [element('http://www.example.com/PO1':comment,
    [],Content)],
  {
    xsd_string_cont(Content,[])
  }.
poname --> [element(name,[],Content)], { xsd_string_cont(Content,[]) }.
street --> [element(street,[],Content)], { xsd_string_cont(Content,[]) }.
city   --> [element(city,[],Content)], { xsd_string_cont(Content,[]) }.
state  --> [element(state,[],Content)], { xsd_string_cont(Content,[]) }.
zip    --> [element(zip,[],Content)], { xsd_decimal_cont(Content,[]) }.

productName --> [element(productName,[],Content)], 
                { xsd_string_cont(Content,[]) }.
quantity    --> [element(quantity,[],Content)], 
                { xsd_integer_cont(Content,[]) }.
'USPrice'   --> [element('USPrice',[],Content)], 
                { xsd_decimal_cont(Content,[]) }.
shipDate    --> [element(shipDate,[],Content)], 
                { xsd_date_cont(Content,[]) }.




2.4.2. Rules for attributes of complex types

For each complex type in the schema, we generate two rules, one we will use to check the attributes, and one to check the contents. For each simple type, we generate one rule for checking the value.
The attributes for a given element can be enumerated; in this first example, we check to make sure each attribute was declared.
< 3 Rules for attributes defined for complex types (v1) > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date].

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> no_such_attribute.
/* note that t_items_att is undefined, since there
 * are no attributes for the Items type 
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU]. 

This code is not used elsewhere.

For now, we do not worry about whether required attributes have been specified, or about checking the type of the attribute, or about supplying default values; we will do that later.

2.4.3. Rules for content of complex types

The most conventional-looking part of our grammar is the representation of the content models. A purchase order (being of type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. We handle optional elements by writing a pair of recursive productions:
< 4 [File podcg3.pl] > ≡
purchaseOrderType_cont --> shipTo, billTo, opt_comment, items.
opt_comment --> [].
opt_comment --> comment, opt_comment.

usAddress_cont --> poname, street, city, state, zip.

t_items_cont --> star_item.
star_item    --> [].
star_item    --> item, star_item.

t_item_cont --> productName, quantity, 'USPrice', 
                opt_comment, opt_shipDate.

opt_shipDate --> [].
opt_shipDate --> shipDate.



It's worth noting that since the content models of XML all define regular languages, the right hand sides of the DCG rules for handling the content of elements are all rather simple. The only complexity is introduced by nested choices, occurrence indicators, and the like. In a way, this is the reflection of a fundamental fact about validating XML: the XML structure is given rather than needing to be discovered, and the rules for validation are correspondingly simpler.
A practical consequence of the fact that XML works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_block, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it fit the XML structure.[6]
There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we step around this problem by instructing the upstream XML parser to strip whitespace from the beginning and ending of CDATA nodes; this does the job for this particular schema.[7] Later, we'll need to find a systematic way to hop over inter-element characters, and to restrict them to white space where appropriate.

2.4.4. Rules for checking values of simple types

We need some rules for checking simple types. The rules for built-in types of XML Schema, at least, can be written once for all and built in to a library. N.B. In order to get something working quickly, we cut some corners here.
In each case we write two rules, one to check the content of an element by checking to see whether the putative contents are a legal value, and the second to check the value itself. Within these latter rules, it is necessary to remember that the thing we are checking is a Prolog atom created from the lexical form for the putative value. In the case of strings, anything that's a legal lexical form (represented, following the conventions of the SWI-Prolog parser, as an atom) counts as a legal string.
< 5 Checking simple types [File podcg3.pl] > ≡
/* In our representation of XML, character data is represented
 * as atoms.  Need to check on handling of non-ASCII characters */
xsd_string_cont([H|T],T) :- xsd_string_value(H).
xsd_string_value(LF) :- atom(LF).



We check decimal numbers by converting the lexical form from an atom into a sequence of characters and then seeing whether Prolog can create a number out of that sequence of characters. This is actually a bit too broad, at least with the Prolog I am using, since it accepts numbers in scientific notation. But it will do for an illustration.
< 6 Checking numbers [File podcg3.pl] > ≡
/* We'll accept anything as a decimal number if we can convert
 * the atom to a list of character codes which can in turn be
 * converted to a number. 
 */
xsd_decimal_cont([H|T],T) :- xsd_decimal_value(H).
xsd_decimal_value(LF) :- atom_codes(LF,L), number_codes(N,L).
xsd_integer_cont([H|T],T) :- xsd_integer_value(H).
xsd_integer_value(LF) :- 
  atom_codes(LF,L), 
  number_codes(N,L),
  integer(N).




For dates, we really punt for now.
< 7 Checking dates [File podcg3.pl] > ≡
xsd_date_cont([H|T],T) :- xsd_date_value(H). 
xsd_date_value(LF) :- atom(LF). 
/* well, we really need to check this ... */



With this set of rules, we have a primitive but working program for checking whether a given XML document does or does not conform to the purchase order schema. There are a number of constraints it does not check, however; in the following sections of this document we will refine the code to make it a more persuasive representation of the schema.

2.5. Layer 2a: checking attribute types

One obvious improvement would be to check to make sure that the values of attributes are legitimate representations of the simple types with which the attributes are declared. Here is a revision of the relevant rules:
< 8 Rules for attributes defined for complex types (v2) [File podcg3.pl] > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date],
  { xsd_date_value(Date) }.

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> [this_will_never_match].
/* Since it has no equal sign or value, the rule for
 * t_items_att will never match anything.
 * Alternatively we could leave t_items_att undefined, since there
 * are no attributes for the Items type.
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU],
  { po_sku_value(SKU) }.



Hand-coding a rule to check SKU values is not too hard, and illustrates how we can check the lexical forms of simple types by writing mini-grammars for them. The SKU type is declared this way:
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>
The pattern can be translated readily into the following grammar operating on the character sequence of the lexical form:
< 9 Checking SKU values [File podcg3.pl] > ≡
po_sku_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(Charseq,[]).

sku_value --> sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part --> digit, digit, digit.
sku_alpha_part --> cap_a_z, cap_a_z.




The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.
< 10 Character classification rules [File podcg3.pl] > ≡
digit --> [Char], { char_type(Char,digit) }.
hyphen --> ['-'].
cap_a_z --> [Char], { char_type(Char,upper) }.



2.6. Layer 2b: checking for duplicate, defaulted, and required attributes

Another obvious improvement would be to check to ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, so this is probably not essential) and that required attributes do have values specified. We have already seen how to check a fixed value: we simply specify the value as part of the grammatical rule for the attribute.
In our sample schema, the only required attribute on any type is the partNum attribute on the item element; it is also the only attribute defined for items. In a case like this one, we could simply write the relevant part of the grammar this way:
t_item_atts --> [partNum=SKU], { po_sku_value(SKU) }.
Or we could even put the constraint into the top-level rule for the item element:
item --> [element(item,[partNum=SKU],Content)],
  {
    po_sku_value(SKU),
    t_item_cont(Content,[])
  }.
In the general case, however, it is going to be simpler to put checking for required, forbidden, and defaulted attributes into a separate rule, which we will here call t_item_att_check:
< 11 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS) :-
  atts_present(LAVS,[partNum]).

This code is not used elsewhere.

In the general case, we may have to check for required, forbidden, and defaulted attributes, and to merge defaulted attributes into the list of attribute-value specifications. So we can extend the general form of the foo_att_check predicate to:
< 12 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS,AugmentedLAVS) :-
  atts_present(LAVS,[partNum]),
  atts_absent(LAVS,[]),
  atts_defaulted(LAVS,[],AugmentedLAVS).

This code is not used elsewhere.

We assume here the existence of three special-purpose items for checking the presence or absence of items in lists of attribute-value specifications. A list of attribute-value specifications contains all the attributes in a list of required attributes, if the list of required attributes is empty.
< 13 Utilities for checking required attributes > ≡
atts_present(LAVS,[]).
Continued in < Utilities for checking required attributes 14 > , < Utilities for checking required attributes 15 > , < Utilities for checking forbidden attributes 16 > , < Utilities for checking forbidden attributes 17 > , < Utilities for checking defaulted attributes 18 >
This code is not used elsewhere.

A list of attribute-value specifications contains all the attributes in a list of required attributes, if it contains the first one, and also all the others.
< 14 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).



And finally, an attribute value specification is present for a given attribute name if it is either at the head of the list, or in the tail of the list.
< 15 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
att_present([Attname=Attval|Tail],Attname).
att_present([AVS|Tail],Attname) :-
  att_present(Tail,Attname).



Some readers will be looking, right about now, for a rule of the form
att_present([],Attname) :- ...
We don't have a rule for the empty list, though, because if we have run through the list of attribute-value specifications without finding one for the attribute name we are seeking, we want the predicate to fail. If necessary to prevent later maintenance programmers, or ourselves, from supplying the ‘missing’ induction basis by mistake, we could write:
att_present([],Attname) :- !, fail.
Now for forbidden attributes. (These would be originally optional attributes whose maxOccurs has been set to 0 in a refinement step.) If there are no attribute names in the list of forbidden attributes, then we are done and all is well.
< 16 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[]).



If the list of forbidden attribute names is not empty, we check first the head, then (recursively) the tail.
< 17 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[HFA|ForbiddenTail]) :-
  not(att_present(LAVS,HFA)),
  atts_absent(LAVS,ForbiddenTail).



Finally, for attributes with defaults we specify a list of attribute-value specifications; for each of these, we check to see whether a value is already specified for an attribute of that name. If it is, we skip the attribute in question; if not, we add it to the list of attribute-value specifications. Either way, we call the predicate recursively to take care of the rest of the defaulted attributes.
< 18 Utilities for checking defaulted attributes [continues 13 Utilities for checking required attributes] > ≡

atts_defaulted(LAVS,[],LAVS).
atts_defaulted(LAVS,[AN=AV|Tail],AugmentedLAVS) :-
  att_present(LAVS,AN),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).
atts_defaulted(LAVS,[AN=AV|Tail],[AN=AV|AugmentedLAVS]) :-
  not(att_present(LAVS,AN)),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).



2.7. Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:
  • Each element is checked against the rules for its datatype.
  • Elements declared with a complex type are checked with regard to their attributes and their content.
    • Attributes are checked to see that they were declared and that their value is of the correct type.
    • Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
    • The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
    • Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.
Running a lightly modified version of this grammar over a set of test cases based on po1.xml, the grammar accepted a number of valid variations of the po1.xml example, with
  • additional attributes (namespace declarations, xsi attributes)
  • omission or inclusion of optional elements
  • omission or inclusion of optional attributes
Modification of the grammar was necessary because the grammar as shown does not handle namespace declarations or attributes in the xsi namespace; since all the test files do have namespace declarations, the following rules had to be added to the grammar (I haven't added them above for fear it would clog up the exposition):
purchaseOrderType_att --> magic_att.
usAddress_att --> magic_att.
t_items_att  --> magic_att.
t_item_att  --> magic_att.

magic_att --> [xmlns=NS].
magic_att --> [xmlns:Pre=NS].
magic_att --> ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value].
The grammar detected a number of errors:
  • misspelled generic identifiers and attribute names
  • omission of required elements
  • excess occurrences of elements
  • elements out of sequence
  • extraneous undeclared elements present
  • value other than the prescribed value for a fixed attribute
The following errors were not detected:
  • multiple po:comment elements (the schema allows at most one at the purchaseOrder level and at most one in each item)
        <apo:comment>Hurry, my lawn is going wild!</apo:comment>
        <apo:comment>I don't know how much longer I can hold out.</apo:comment>
    
    This is an error in the grammar above, which has been retained to illustrate the risks of hand-translation. Instead of
    opt_comment --> [].
    opt_comment --> comment, opt_comment.
    
    (which allows arbitrarily many comments), the rule for the optional comment element should read
    opt_comment --> [].
    opt_comment --> comment.
    
  • two orderDate attributes
    <apo:purchaseOrder xmlns:apo="http://www.example.com/PO1"
                       orderDate="1999-10-19"
                       orderDate="1999-10-20"> 
    
  • quantity exceeding the maximum legal value
            <item partNum="926-AA">
                <productName>Baby Monitor</productName>
                <quantity>100</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
  • omission of the required partNum attribute
            <item>
                <productName>Baby Monitor</productName>
                <quantity>1</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
The following errors were detected, or at least the document was not accepted as valid, but validating the document caused a Prolog error because the number_chars predicate used to validate numbers is apparently not robust against non-numeric arguments.
  • invalid zip code
    <zip>OX2 6NN</zip>
  • wrong datatype for quantity
    <quantity>one</quantity>
  • bad value for USPrice
    <USPrice>USD 39.98</USPrice>
See A note on the test cases below for further details on the test documents.
The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here.
There are several gaps in coverage, which we will make good in what follows:
  • handling wildcards
  • handling character data intermingled with mixed content models, handling whitespace intermingled with non-mixed content models
  • handling substitution groups
  • handling xsi:type specifications in the document instance
  • diagnosing errors (i.e. saying something more useful than “no” or “ERROR: number_chars/2: Syntax error: Illegal number” when an error is encountered) and recovering from them
  • providing relevant PSVI properties
It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question “is this document/element/lexical form ok?” As shown by the examples used to introduce DCG notation, we could do that by adding parameters to the productions in the grammar; it will be more convenient, however, to translate our grammar from DCG form into a related form, that of definite-clause translation grammars. To that, the next part of this paper is dedicated.

3. Providing PSVI properties

To model schema-validity assessment properly, we need to provide more output: specifically, we need to provide information about the input document together with some additional properties (the schema infoset contributions). It's possible to do that in DCG notation, but it rapidly becomes cumbersome. We'll use DCTG notation instead; it was devised to handle grammatical attributes more conveniently than DCG, and to separate the semantics more effectively from the syntax [Abramson 1984].
On this foundation we will later be able to add more:
  • adding type name information: type name property
  • black-box (skip) validation
  • white-box (lax) validation, validation attempted property
  • error handling (falling back to lax), validity property
Note that this section is not intended to be a complete translation of XML Schema into DCTG, but a sample small enough to follow and large enough to make a persuasive case that all of XML Schema can be translated.

3.1. Introduction to DCTG notation

DCTG notation was invented to make it easier to calculate values for grammatical attributes and to refer to the values of grammatical attributes on other rules.
Here is a simple example. Consider the following grammar for binary strings:
bit ::= '0'.
bit ::= '1'.
bitstring ::= '' /* nothing */.
bitstring ::= bit, bitstring.
number ::= bitstring, fraction.
fraction ::= '.', bitstring.
fraction ::= ''.
We might wish to calculate the length and the (unsigned base-two) value of the bitstring as attributes. Using a yacc-like notation that might look like this. Notice that scale is a top-down attribute and value and fractionalvalue are bottom-up attributes.
bit ::= '0' { $0.bitvalue = 0; }.
bit ::= '1' { $0.bitvalue = power(2,$0.scale); }.
bitstring ::= '' { 
  $0.value = 0; 
  $0.length = 0;
  /* scale doesn't matter here */
}.
bitstring ::= bit, bitstring {
  $0.length = $2.length + 1;
  $1.scale = $0.scale;
  $2.scale = $0.scale - 1;
  $0.value = $1.value + $2.value;
}.
number ::= bitstring, fraction {
  $1.scale = $1.length - 1;
  $0.value = $1.value + $2.fractionalvalue;
}.
fraction ::= '.', bitstring {
  $2.scale = -1;
  $0.fractionalvalue = $2.value;
}.
fraction ::= '' {
  $0.fractionalvalue = 0;
}.
In DCTG notation, this grammar looks like this:[8]
bit ::= [0]
  <:> bitval(0,_).

bit ::= [1]
  <:> bitval(V,Scale) ::- V is **(2,Scale).

bitstring ::= []
  <:> length(0)
  &&  value(0,_).

bitstring ::= bit^^B, bitstring^^B1
  <:> length(Length) ::-
        B1 ^^ length(Length1),
        Length is Length1 + 1
  &&  value(Value,ScaleB) ::-
        B ^^ bitval(VB,ScaleB),
        S1 is ScaleB - 1,
        B1 ^^ value(V1,S1),
        Value is VB + V1.

number ::= bitstring ^^ B, fraction ^^ F
  <:> value(V) ::-
        B ^^ length(Length),
        S is Length-1,
        B ^^ value(VB,S),
        F ^^ fractional_value(VF),
        V is VB + VF.

fraction ::= ['.'], bitstring ^^ B
  <:> fractional_value(V) ::-
        S is -1,
        B ^^ value(V,S).

fraction ::= []
  <:> fractional_value(0).
As may be seen, each rule in DCTG consists of a left hand side, a right-hand side, and optionally a list of attributes. The right-hand side is separated from the left-hand side by the operator ::= and from the following list of attributes (if any) by the operator <:>. The expression on the right hand side is a sequence of non-terminal symbols, each optionally suffixed with the operator ^^ and a variable name (e.g. bitstring^^B). Each attribute is identified by a Prolog structure, e.g. value(V), whose functor is the name of the attribute and whose arguments are its values. The attribute structure may be followed by the operator ::- (designed to look a lot like the standard Prolog ‘neck’ operator :-) and a series of goals which will be satisfied when the attribute value is to be instantiated. In practice, these goals help to calculate the attribute values. Values of attributes attached to the items on the right-hand side may be referred to by the variable name associated with the item and the name of the attribute, joined by the operator ^^, as for example in B ^^ bitval(VB,ScaleB).
An EBNF grammar for DCTG notation[9] is:
grammar ::= rule*
rule    ::= lhs '::=' rhs ('<:>' att-spec ('&&' att-spec)*)?
lhs     ::= term
rhs     ::= term (',' term)*
attspec ::= compound-term ('::-' goal (',' goal)*)?
compound-term ::= ATOM '(' term (',' term)* ')'

3.2. Another example: the English grammar fragment E1

Here is another simple example: the trivial fragment of English grammar given above, translated directly into DCTG notation, looks like the following. The only difference from the DCG form is that the separator between the left- and right-hand side of each production is “::=” instead of “-->”.
%%% E1 (trivial context-free grammar for a fragment of English)
%%% in DCTG notation.
s ::= np, vp.
np ::= det, n.
np ::= n.
vp ::= v, np.
vp ::= v.
n ::= [mary].
n ::= [john].
n ::= [woman].
n ::= [man].
n ::= [apple].
det ::= [the].
v ::= [loves].
v ::= [eats].
v ::= [sings].
The translation into standard Prolog is similar to that used for DCGs, but instead of adding two arguments to each non-terminal, the DCTG translation adds three. The first additional argument is a Prolog structure with the functor node, described further below. The second and third are (like the two arguments added in DCG translations) difference lists.
The node structure has three arguments:
  1. the non-terminal of the grammar rule (here s, np, vp, etc.)
  2. a list of the node structures associated with the items on the right-hand side of the grammar rule
  3. a list of grammatical attributes (in this grammar, this list will be empty)
The translation of our trivial grammar into standard Prolog is thus:
?- dctg_reconsult('ks81dctg.pl').

Yes
?- listing([s,np,vp,n,det,v]).

:- dynamic s/3.

s(node(s, [A, B], []), C, D) :-
        np(A, C, E),
        vp(B, E, D).

:- dynamic np/3.

np(node(np, [A, B], []), C, D) :-
        det(A, C, E),
        n(B, E, D).
np(node(np, [A], []), B, C) :-
        n(A, B, C).

:- dynamic vp/3.

vp(node(vp, [A, B], []), C, D) :-
        v(A, C, E),
        np(B, E, D).
vp(node(vp, [A], []), B, C) :-
        v(A, B, C).

:- dynamic n/3.

n(node(n, [[mary]], []), A, B) :-
        c(A, mary, B).
n(node(n, [[john]], []), A, B) :-
        c(A, john, B).
n(node(n, [[woman]], []), A, B) :-
        c(A, woman, B).
n(node(n, [[man]], []), A, B) :-
        c(A, man, B).
n(node(n, [[apple]], []), A, B) :-
        c(A, apple, B).

:- dynamic det/3.

det(node(det, [[the]], []), A, B) :-
        c(A, the, B).

:- dynamic v/3.

v(node(v, [[loves]], []), A, B) :-
        c(A, loves, B).
v(node(v, [[eats]], []), A, B) :-
        c(A, eats, B).
v(node(v, [[sings]], []), A, B) :-
        c(A, sings, B).

Yes
?- 
The predicate dctg_reconsult(File) is used to translate a DCTG grammar into Prolog clauses and load them; it is provided by [Abramson/Dahl/Paine 1990] and is available from a variety of sources on the net.[10]
A short terminal session should make the nature of the results a bit clearer:[11]
?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(n, [[john]], [])], 
           []), 
      node(vp, 
           [node(v, [[loves]], []), 
            node(np, 
                 [node(n, [[mary]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(n, [[john]], [])], []), 
node(vp, [node(v, [[loves]], []), node(np, [node(n, 
[...], [])], [])], [])], []) 

Yes
?- s(S,[the,woman,eats,the,apple],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[woman]], [])], 
           []), 
      node(vp, 
           [node(v, [[eats]], []), 
            node(np, 
                 [node(det, [[the]], []), 
                  node(n, [[apple]], [])], 
                 [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), node(n, 
[[woman]], [])], []), node(vp, [node(v, [[eats]], []), 
node(np, [node(det, [...], []), node(..., ..., ...)], 
[])], [])], []) 

Yes
?- s(S,[the,man,sings],[]), write(S).
node(s, 
     [node(np, 
           [node(det, [[the]], []), 
            node(n, [[man]], [])], 
           []), 
      node(vp, 
           [node(v, [[sings]], [])], 
           [])], 
     [])

S = node(s, [node(np, [node(det, [[the]], []), 
node(n, [[man]], [])], []), node(vp, [node(v, 
[[sings]], [])], [])], []) 

Yes
?- 
The node structure constructed in the first added argument resembles and serves much the same purpose as the structure attribute used in E2, the attribute-grammar version of the English grammar fragment.

3.3. English grammar with attributes (E2) in DCTG notation

The fragment of English grammar E2, which was presented earlier to illustrate the use of DCGs for attribute grammars, may also be used to illustrate DCTGs.
Let's walk through the grammar. A sentence is an NP followed by a VP; we will call the NP S (‘subject’), and the VP we will call P (‘predicate’).
< 19 English grammar fragment with attributes [File ks92.dctg] > ≡
s ::= np^^S, vp^^P,
  { S^^number(Num), P^^number(Num) }
  <:> {Attributes for non-terminal s 20}

Continued in < NP rules 21 > , < VP rules 26 > , < Pre-terminal rules 29 > , < Lexicon 30 >


The goals enclosed in braces (S^^number(Num), P^^number(Num)) together express the constraint that the NP and VP must agree in number: the number attribute of the NP S and the number attribute of the VP P must unify with each other.
The non-terminal s has only one grammatical attribute; let us call it structure. When s is made up (as here) of an NP and a VP, we represent its structure by a Prolog term with s as its functor and the structure of the NP and VP as its two arguments:
< 20 Attributes for non-terminal s > ≡
  structure(s(Sstr,Pstr)) ::- 
    S^^structure(Sstr), 
    P^^structure(Pstr).

This code is used in < English grammar fragment with attributes 19 >

A noun phrase (NP) is made up of a determiner and a noun; they must agree in number. This covers phrases like “the apple”, “the apples”, “one apple”, “some apples”. The agreement rule excludes the phrases “one apples” and “some apple”.[12]
< 21 NP rules [continues 19 English grammar fragment with attributes] > ≡
np ::= det^^D, n^^N,
  { D^^number(Num), N^^number(Num) }
  <:> {NP structure attribute (for DET+N) 22}

  &&  {NP number attribute (for DET+N) 23}

Continued in < NP rules, cont'd 24 > , < NP rules, cont'd 25 >


The non-terminal np has two attributes, named structure and number.
The structure of the NP is a Prolog term with the name of the non-terminal (np) as its functor and the constituents as the arguments. (This is the pattern of the structure attribute on all non-terminals, and I won't comment on it again.)
< 22 NP structure attribute (for DET+N) > ≡
  structure(np(Dstr,Nstr)) ::-
    D^^structure(Dstr),
    N^^structure(Nstr)

This code is used in < NP rules 21 >

The number attribute of the NP illustrates an important idiom: the guard in the syntactic part of the rule has already checked the number attributes of the determiner and the noun, to make sure they unify with each other; the variable Num has the value sg or pl already, and we don't need to do any more computation. We just say that whatever Num is, that's the grammatical number of the NP.
< 23 NP number attribute (for DET+N) > ≡
  number(Num).

This code is used in < NP rules 21 >

Noun phrases can also take the form of a plural noun by itself, as in “Men love apples”.
< 24 NP rules, cont'd [continues 21 NP rules] > ≡
np ::= n^^N, { N^^number(pl) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(pl).



The final form of NP recognized by this grammar is a singular proper noun by itself, as in “John loves Mary”.
< 25 NP rules, cont'd [continues 21 NP rules] > ≡
np ::= pn^^N, { N^^number(sg) }
  <:> structure(np(Nstr)) ::-
        N^^structure(Nstr)
  &&  number(sg).



A verb phrase (VP) can include a direct object in the form of a noun phrase:
< 26 VP rules [continues 19 English grammar fragment with attributes] > ≡
vp ::= v^^V, np^^O
  <:> structure(vp(Vs,Os)) ::-
        V^^structure(Vs),
        O^^structure(Os)
  &&  {Number attribute for VP 28}

Continued in < VP rules, cont'd 27 >


or they can be just a verb with no direct object.[13]
< 27 VP rules, cont'd [continues 26 VP rules] > ≡
vp ::= v^^V
  <:> structure(vp(Vs)) ::-
        V^^structure(Vs)
  &&  {Number attribute for VP 28}




Although both the verb and the direct object have a number attribute, only that of the verb counts in determining the value of the number attribute for the VP as a whole.
< 28 Number attribute for VP > ≡
  number(Num) ::- V^^number(Num).

This code is used in < VP rules 26 > < VP rules, cont'd 27 >

Nouns, proper nouns, verbs, and determiners (the ‘pre-terminal’ categories of the grammar) all have rules of the same structure: a token in the string counts as one of these if the lexicon says it's one, and the number attribute has whatever value the lexicon gives.
< 29 Pre-terminal rules [continues 19 English grammar fragment with attributes] > ≡
n ::= [L], { lex(L,n,Num) }
  <:> structure(n(L))
  &&  number(Num).

pn ::= [L], { lex(L,pn,Num) }
  <:> structure(pn(L))
  &&  number(Num).

v ::= [L], { lex(L,v,Num) }
  <:> structure(v(L))
  &&  number(Num).

det ::= [L], { lex(L,det,Num) }
  <:> structure(det(L))
  &&  number(Num).



Finally, the lexicon. To keep things simple, the lexicon here is just a set of facts, with literal values.[14] The entry for the word the is the exception: it does not have a literal value, but the anonymous variable _ to indicate that the can be either sg or pl.
< 30 Lexicon [continues 19 English grammar fragment with attributes] > ≡
lex(mary,pn,sg).
lex(john,pn,sg).
lex(woman,n,sg).
lex(women,n,pl).
lex(man,n,sg).
lex(men,n,pl).
lex(apple,n,sg).
lex(apples,n,pl).
lex(the,det,_).
lex(some,det,pl).
lex(one,det,sg).
lex(loves,v,sg).
lex(love,v,pl).
lex(eats,v,sg).
lex(eat,v,pl).
lex(sings,v,sg).
lex(sing,v,pl).



A session log illustrates the structure built by the DCTG rules:
?- s(S,[john,loves,mary],[]), write(S).
node(s, 
     [node(np, 
           [node(pn, 
                 [[john]], 
                 [structure(pn(john)), 
                 (number(sg)::-lex(john, pn, sg))])], 
           [ (structure(np(_G292))::-
                node(pn, 
                     [[john]], 
                     [structure(pn(john)), 
                      (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
            number(sg)]), 
      node(vp, 
           [node(v, 
                 [[loves]], 
                 [structure(v(loves)), 
                 (number(sg)::-lex(loves, v, sg))]), 
            node(np, 
                 [node(pn, 
                       [[mary]], 
                       [structure(pn(mary)), 
                       (number(sg)::-lex(mary, pn, sg))])], 
                 [(structure(np(_G424))::-
                     node(pn, [[mary]], 
                          [structure(pn(mary)), 
                          (number(sg)::-lex(mary, pn, sg))])
                     ^^structure(_G424)), 
                  number(sg)])], 
           [ (structure(vp(_G351, _G352))::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^structure(_G351), 
                node(np, 
                     [node(pn, [[mary]], 
                           [structure(pn(mary)), 
                           (number(sg)::-lex(mary, pn, sg))])], 
                     [(structure(np(_G424))::-
                         node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                      number(sg)])^^structure(_G352)), 
             (number(_G373)::-
                node(v, [[loves]], 
                     [structure(v(loves)), 
                     (number(sg)::-lex(loves, v, sg))])^^number(_G373))])], 
     [(structure(s(_G261, _G262))::-
         node(np, 
              [node(pn, [[john]], 
                    [structure(pn(john)), 
                    (number(sg)::-lex(john, pn, sg))])], 
              [(structure(np(_G292))::-
                  node(pn, [[john]], 
                       [structure(pn(john)), 
                       (number(sg)::-lex(john, pn, sg))])^^structure(_G292)), 
               number(sg)])^^structure(_G261), 
         node(vp, [node(v, [[loves]], 
                        [structure(v(loves)), 
                        (number(sg)::-lex(loves, v, sg))]), 
                   node(np, 
                        [node(pn, [[mary]], 
                              [structure(pn(mary)), 
                              (number(sg)::-lex(mary, pn, sg))])], 
                        [(structure(np(_G424))::-
                            node(pn, [[mary]], 
                                 [structure(pn(mary)), 
                                 (number(sg)::-lex(mary, pn, sg))])
                            ^^structure(_G424)), 
                         number(sg)])], 
                  [(structure(vp(_G351, _G352))::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^structure(_G351), 
                      node(np, [node(pn, [[mary]], 
                                     [structure(pn(mary)), 
                                     (number(sg)::-lex(mary, pn, sg))])], 
                               [(structure(np(_G424))::-
                                   node(pn, [[mary]], 
                                        [structure(pn(mary)), 
                                        (number(sg)::-lex(mary, pn, sg))])
                                   ^^structure(_G424)), 
                                number(sg)])^^structure(_G352)), 
                   (number(_G373)::-
                      node(v, [[loves]], 
                           [structure(v(loves)), 
                           (number(sg)::-lex(loves, v, sg))])
                      ^^number(_G373))])
         ^^structure(_G262))])
In examining the structure just shown, the reader will note a great deal of apparent repetition; this results from the high incidence of structure sharing, which is not shown explicitly: in the grammar itself, the variables which have child nodes as values are used quite freely and appear both in the list of children and in the rules for calculating synthetic attributes.
Note also that the values of the attributes are not always pre-calculated: instead, the structure has the rules necessary to perform the evaluation on demand. In the example, the structure attribute for the pre-terminal categories is already completely grounded: it has no variables, but only literal values, while the structure attributes for the higher-level parts of the linguistic structure still have unbound variables.

3.4. Layer 3: Translation into DCTG notation

As a first step toward providing grammatical attributes with PSVI information, we will translate the existing purchase-order schema into DCTG notation, adding grammatical attributes corresponding to some basic information-set properties which are required to be in the input infoset:
  • for Attribute Information Items:
    • [local name]
    • [namespace name]
    • [normalized value]
  • for Element Information Items:
    • [local name]
    • [namespace name]
    • [children]
    • [attributes]
    • [in-scope namespaces] or [namespace attributes]
  • for Namespace Information Items:
    • [prefix]
    • [namespace name]
In principle, we ought perhaps also to provide properties for character information items:
  • for Character Information Items:
    • [character code]
but it seems extraordinarily inconvenient to use grammatical attributes for this purpose. The information involved is obviously present, and can be isolated by using the standard Prolog predicate atom_codes(Atom,String) (or atom_chars(Atom,CharList) and char_code(Atom,ASCII)).
Like the DCG version above, the DCTG version of the schema has several distinct kinds of rules:
  • element rules
  • attribute-list rules (for checking the attributes of a complex type)
  • content-model rules (for checking the content of a complex type)
  • simple-type checking rules
The following sections give these in DCTG notation.

3.4.1. Top-level rules for element types

An element rule will serve to match the start-tag and get the attributes and contents of each element; from it, we will call routines to check the attributes and content against the complex type. These differ from the DCG rules in two ways: when we call them, we must specify three arguments, not two, and we provide explicit grammatical attributes for infoset properties. The basic pattern is simple: for any element in namespace n with local name gi and complex type ct, we will construct an appropriate non-terminal symbol nt, and the element rule will look like this:
nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[])
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(C) 
  && localName(gi) 
  && namespacename(n).
Later, we will add further grammatical attributes.
Note that the rule for XML attributes is not a simple call to the parser, but a call to a wrapper predicate. Since the SWI parser returns namespace attributes in the same list as other attributes, while the infoset spec requires that they be listed in different properties, the ct_attributes predicate will need to filter the attribute information items into two different lists, one to become the value of the attributes infoset property, and one to become the value of namespaceAttributes.
We also should become a little more systematic about naming conventions. If we continue to use generic identifiers (element type names) directly as names of Prolog predicates, we risk name collisions between elements and predicates defined as part of the parser, or built in to Prolog. To eliminate this risk, we will prefix names taken over from the schema with e_ (for elements), t_ (for types), etc., and we will avoid those prefixes otherwise. If a schema has any names beginning with e_ or t_, this rule may become slightly confusing. But there won't be collisions between schema-based names and other names in the parser.
The purchase order schema po.xsd defines the following fifteen element types: the list gives the simple names which will be used to refer to them in the grammar below, as well as their schema-component designator as defined in [Holstege/Vedamuthu 2002]. Since their local names are all unique, the grammar below simply uses e_ plus their local names to refer to them. In other schemas, it will be necessary to mangle the names, or generate arbitrary identifiers, in order to distinguish different element types which have the same local names.
  • e_purchaseOrder = /element(purchaseOrder)
  • e_comment = /element(comment)
  • e_shipTo = /complexType(PurchaseOrderType)/sequence()/element(shipTo)
  • e_billTo = /complexType(PurchaseOrderType)/sequence()/element(billTo)
  • e_items = /complexType(PurchaseOrderType)/sequence()/element(items)
  • e_name = /complexType(USAddress)/sequence()/element(name)
  • e_street = /complexType(USAddress)/sequence()/element(street)
  • e_city = /complexType(USAddress)/sequence()/element(city)
  • e_state = /complexType(USAddress)/sequence()/element(state)
  • e_zip = /complexType(USAddress)/sequence()/element(zip)
  • e_item = /complexType(Items)/sequence()/element(item)
  • e_productName = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(productName)
  • e_quantity = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)
  • e_USPrice = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)
  • e_shipDate = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)
The simple purchase-order schema defines four complex types; one is anonymous; we'll use the local name of its host element after the t_ prefix:
  • t_PurchaseOrderType = /complexType(PurchaseOrderType)
  • t_USAddress = /complexType(USAddress)
  • t_Items = /complexType(Items)
  • t_item = /complexType(Items)/sequence()/element(item)/complexType()
The elements with complex types get these rules:
< 31 Rules for elements with complex types > ≡
e_purchaseOrder ::= [element('http://www.example.com/PO1':purchaseOrder,
  Attributes,Content)],
  {
    t_PurchaseOrderType_atts(A,NA,Attributes),
    t_PurchaseOrderType_cont(C,Content,[])
  } 
  <:> localname(purchaseOrder)
  {Common infoset properties for elements in po namespace 32}

  .
e_shipTo ::= [element(shipTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(shipTo)
  {Common infoset properties for elements in po namespace 32}


  .
e_billTo ::= [element(billTo,Attributes,Content)],
  {
    t_USAddress_atts(A,NA,Attributes),
    t_USAddress_cont(C,Content,[])
  } 
  <:> localname(billTo)
  {Common infoset properties for elements in po namespace 32}


  .
e_items ::= [element(items,Attributes,Content)],
  {
    t_Items_atts(A,NA,Attributes),
    t_Items_cont(C,Content,[])
  } 
  <:> localname(items)
  {Common infoset properties for elements in po namespace 32}


  .
e_item ::= [element(item,Attributes,Content)],
  {
    t_item_atts(A,NA,Attributes),
    t_item_cont(C,Content,[])
  } 
  <:> localname(item)
  {Common infoset properties for elements in po namespace 32}


  .

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

Since the attributes, children, and namespacename properties have identical definitions for all element types in the purchase-order namespace, we can factor them out into a single code fragment:
< 32 Common infoset properties for elements in po namespace > ≡
  &&  attributes(A)
  &&  namespaceattributes(NA)
  &&  children(C)
  &&  namespacename('http://www.example.com/PO1')

This code is used in < Rules for elements with complex types 31 > < Rules for elements with simple types 33 >

The rules for elements with simple types are slightly simpler than those for elements with complex types, but follow the same basic pattern.
In the DCG version of this schema given above, we wrote the comment element and others associated with simple types using a hard-coded requirement for an empty list of attributes. In fact, that is too simple, since such elements may in fact have xsi:type, xsi:nil, xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes. So we write these element rules with the same basic structure as was used for complex types, except that we use a standard predicate for checking that no attributes outside the xsi namespace were used.
The schema po.xsd defines two simple types: SKU and the anonymous simple type used for quantities:
  • t_quantityType = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
  • t_SKU = /simpleType(SKU)
In addition, several built-in simple types are used:
  • t_string = xsd:string
  • t_integer = xsd:integer
  • t_decimal = xsd:decimal
  • t_date = xsd:date
In a full implementation, we'll need to do some more serious name mangling to ensure uniqueness of relatively short, handy names for all types. For now, we just choose the names manually.
The rules for simple types are:
< 33 Rules for elements with simple types > ≡
e_comment ::= [element('http://www.example.com/PO1':comment,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(comment) 
  {Common infoset properties for elements in po namespace 32}

  .

e_name ::= [element(name,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(name) 
  {Common infoset properties for elements in po namespace 32}

  .

e_street ::= [element(street,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(street) 
  {Common infoset properties for elements in po namespace 32}

  .

e_city ::= [element(city,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(city) 
  {Common infoset properties for elements in po namespace 32}

  .

e_state ::= [element(state,Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(state) 
  {Common infoset properties for elements in po namespace 32}

  .

e_zip ::= [element(zip,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname(zip) 
  {Common infoset properties for elements in po namespace 32}

  .

e_productName ::= [element(productName,
    Attributes,Content)],
  {Guard to check attributes and content of strings 34}

  <:> localname(productName) 
  {Common infoset properties for elements in po namespace 32}

  .

e_quantity ::= [element(quantity,
    Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    t_quantity_cont(C,Content,[])
  }
  <:> localname(quantity) 
  {Common infoset properties for elements in po namespace 32}

  .

e_USPrice ::= [element('USPrice',Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_decimal_cont(C,Content,[])
  }
  <:> localname('USPrice') 
  {Common infoset properties for elements in po namespace 32}

  .

e_shipDate ::= [element(shipDate,Attributes,Content)],
  {
    sT_atts(A,Attributes,[]),
    xsd_date_cont(C,Content,[])
  }
  <:> localname(shipDate) 
  {Common infoset properties for elements in po namespace 32}

  .

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

Just as we factor out the common infoset properties, we can also factor out the checking against frequently used built-in simple types, notably string:
< 34 Guard to check attributes and content of strings > ≡
  {
    sT_atts(A,Attributes,[]),
    xsd_string_cont(C,Content,[])
  }

This code is used in < Rules for elements with simple types 33 >

3.4.2. Rules for attributes

For each complex type, we need to do several things in order to validate all the attributes on occurrence of that type and provide appropriate nodes and infoset properties:
  • The input structure has namespace attributes and other attributes in the same list, while we need them in separate lists so we can assign them to two different infoset properties. So we need to partition the list of attributes. We can perform the partition either before all other processing, or after; doing it afterwards leads to more compact code, so we choose that.
  • For each non-namespace attribute found, we need to validate it: if it is declared, we need to check it against its declared type. If the attribute is declared with a fixed type, we should check that the value given matches the prescribed value. If it is not declared, we should raise an error, but we'll save that for a later layer.
  • We need to ensure that attributes required by the complex type are present and that attributes forbidden by the complex type are not present. For any attributes declared with default values, we need to supply an attribute information item with the default value, if the document didn't supply a value. Rather than trying to interleave this with other tasks, we will perform a separate check on attribute occurrences.
And we want to provide basic infoset properties for the XML attributes, in the form of grammatical attributes in the attribute-grammar sense.
3.4.2.1. Basic pattern
For each complex or simple type dt, the basic pattern of the attribute-checking rule will be:
dt_atts(Lpa,Lpna,Lavs) :-
  lavs_dt(LpaAll,Lavs,[]),    /* parse against grammar of attributes */
  partition(LpaAll,LpaPresent,Lpna),         /* partition the result */
  attocc_dt(LpaPresent,Lpa).      /* check min, max occurrence rules */
The logical variables have the following meanings:
Lpa
List of parsed attributes (i.e. of node() structures of the kind returned by any DCTG rule) for this complex type, including defaulted attributes
Lpna
List of parsed namespace attributes
Lavs
The list of attribute-value specifications provided by the input structure returned by the SWI Prolog parser.
Lpaall
Combined list of parsed-attribute node() structures for all attributes, both namespace attributes and others
LpaPresent
List of parsed-attribute nodes for attributes explicitly assigned values in the document instance (without defaulted attributes)
For each type, a grammar defining the legal attributes will be constructed; if type dt has attributes an1 and an2, of types st1 and st2 respectively, then the core context-free grammar will have a form like this:
lavs_dt ::= [].
lavs_dt ::= avs_dt, lavs_dt.       /* declared attributes */
lavs_dt ::= avs_nsd, lavs_dt.   /* namespace declarations */
lavs_dt ::= avs_xsi, lavs_dt.           /* XSI attributes */

avs_dt ::= [an1=Av], { st1_value(Av) }.
avs_dt ::= [an2=Av], { st2_value(Av) }.
Simple types will, of course, have no declared attributes, and the rules for declared attributes and occurrence-checking (together with the rules for individual attributes) will be omitted. Wildcard support can also be added here when needed.
3.4.2.2. Namespace attributes and XSI attributes
One set of rules for namespace attributes and XSI attributes will suffice:
< 35 Grammar rules for namespace and XSI attributes > ≡
avs_nsd ::= [xmlns=DefaultNS]
  <:> localname(xmlns)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(DefaultNS).
avs_nsd ::= [xmlns:Prefix=NSName]
  <:> localname(Prefix)
  && namespacename('http://www.w3.org/2000/xmlns/')
  && normalizedvalue(NSName).
avs_xsi ::= ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value]
  <:> localname(Localname)
  && namespacename('http://www.w3.org/2001/XMLSchema-instance')
  && normalizedvalue(Value).

This code is used in < Generic utilities for DCTG-encoded schemas 84 > < Generic utilities for DCTG-encoded schemas 94 >

Note that default namespace declarations do have a namespace property, despite not having a prefixed name; this is in accord with Section 2.2 of the Infoset spec, which says “By definition, all namespace attributes (including those named xmlns, whose [prefix] property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/.”
3.4.2.3. Occurrence checking
Each complex type will also have a rule for occurrence-checking, which will take something like the following form (assuming that Lreq, Ldft, and Lnot are lists of required, defaulted, and forbidden elements:
attocc_dt(LpaPres,LpaAll) :-
  atts_present(LpaPres,Lreq),
  atts_absent(LpaPres,Lnot),
  atts_defaulted(LpaPres,Ldft,LpaAll).
Since the form of the attribute lists has changed (we are now dealing with lists of node structures), we need new forms of atts_present, etc. for this:
< 36 Utilities for checking attribute occurrences > ≡
atts_present(LAVS,[]).
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).

/* An attribute name matches if namespace and local part match */
att_present([Pa|Lpa],NS:Attname) :- 
  Pa^^localname(Attname), 
  Pa^^namespacename(NS).
att_present([Pa|Lpa],Attname) :-
  att_present(Lpa,Attname).
/* no base step: if we reach att_present([],Attname) we want to fail. */

This code is used in < Generic utilities for DCTG-encoded schemas 84 > < Generic utilities for DCTG-encoded schemas 94 >

The rule for checking forbidden attributes is very similar:
< 37 Utility for checking absent attributes > ≡
atts_absent(LAVS,[]).
atts_absent(LAVS,[H|T]) :-
  not(att_present(LAVS,H)),
  atts_absent(LAVS,T).

This code is used in < Generic utilities for DCTG-encoded schemas 84 > < Generic utilities for DCTG-encoded schemas 94 >

The rule for providing defaults must go through all of the attributes with defaults; this happens in the atts_defaulted predicate in the usual way of recursion on the list.
< 38 Utility for providing defaulted attributes > ≡
atts_defaulted(Lpa,[],Lpa).
atts_defaulted(Lpa,[Padft|Ldft],LpaAll) :-
  atts_defaulted(Lpa,Ldft,Lpa2),
  att_merge(Lpa2,Padft,LpaAll).
Continued in < Utility for providing defaulted attributes 39 >
This code is used in < Generic utilities for DCTG-encoded schemas 84 > < Generic utilities for DCTG-encoded schemas 94 >

For each of these attributes individually, the default value must be added to the list if a value is not already there; this involves recursion on the list of attributes already present. We expect only ever to call this predicate when the first and arguments (the defaulted attribute and the list into which it is to be merged) are instantiated, but experience shows that we run into problems when Prolog backtracks into this predicate (e.g. after it finds an error further along in the XML document and is retrying everything it has done before). When backtracking, Prolog does call this with uninstantiated arguments and then falls into an infinite loop trying to find the namespacename attribute of an uninstantiated variable. To prevent this loop, we check to ensure that the first two arguments are instantiated, using the standard Prolog predicate nonvar. Strictly speaking, this test has nothing whatever to do with the declarative meaning of the predicate, and it would be preferable to do without it, but it is essential for practical purposes.
< 39 Utility for providing defaulted attributes [continues 38 Utility for providing defaulted attributes] > ≡
att_merge([],Padft,[Padft]).
att_merge([Pa|Lpa],Padft,[Pa|Lpa]) :-
  nonvar(Pa), nonvar(Lpa), nonvar(Padft),
  Pa^^namespacename(NS),
  Padft^^namespacename(NS),
  Pa^^localname(Lnm),
  Padft^^localname(Lnm).
att_merge([Pa|Lpa],Padft,Lpa2) :-
  nonvar(Pa), nonvar(Lpa), nonvar(Padft),
  not( {Pa^^namespacename(NS),
    Padft^^namespacename(NS),
    Pa^^localname(Lnm),
    Padft^^localname(Lnm) } ),
  att_merge(Lpa,Padft,Lpa2).



The explicit not() in the third rule is similarly intended to prevent the third rule from firing inappropriately during backtracking.[15]
3.4.2.4. Rules for the Purchase-order type
The PurchaseOrderType defines only one attribute, orderDate, of type xsd:date. In addition, we need to accept xsi attributes. No attributes here are required, forbidden, or defaulted, so we don't need any calls to atts_present, atts_absent, or atts_defaulted. Following the patterns described above, this gives us the following definitions for the relevant predicates:
< 42 Attribute handling for purchaseOrderType > ≡
t_PurchaseOrderType_atts(Lpa,Lpna,Lavs) :-
  lavs_t_PurchaseOrderType(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_PurchaseOrderType(LpaPres,Lpa).

lavs_t_PurchaseOrderType ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_PurchaseOrderType ::= avs_t_PurchaseOrderType^^Pa, 
                             lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_PurchaseOrderType ::= avs_nsd^^Pa, lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_PurchaseOrderType ::= avs_xsi^^Pa, lavs_t_PurchaseOrderType^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

avs_t_PurchaseOrderType ::= [orderDate=Value],
  { ws_normalize(collapse,Value,SNValue),
    xsd_date_value(SNValue) }
  {Properties for orderDate attribute 50}
.

/* Literally copying the pattern would give us this:

attocc_t_PurchaseOrderType(LpaPres,LpaAll) :-
  atts_present(LpaPres,[]),
  atts_absent(LpaPres,[]),
  atts_defaulted(LpaPres,[],LpaAll).

but that's pointless.  Instead, we'll do the equivalent: */
attocc_t_PurchaseOrderType(L,L).

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

3.4.2.5. White-space normalization of simple types
The rule for the orderDate attribute specifies that whitespace handling (with the keyword collapse) should be done before the attribute value is validated. We haven't done whitespace-normalization yet, so we should stop to define it. The predicate ws_normalize(+kw,+Atom,-Atom) takes three arguments: a keyword to say what kind of normalization to perform, an atom representing the character string to be normalized, and an atom representing the same string after normalization. (The arguments marked + are expected to be used as input, i.e. the arguments will be instantiated at the time the relation is called; the argument marked - will normally be uninstantiated when the predicate is called and will be bound to an appropriate value. Readers used to other programming languages may think of it, without too much distortion, as a VAR parameter called by reference and used to return the result of a computation.)
There are three values for the keyword, described in the XML Schema 1.0 specification as follows:
  • preserve No normalization is done, the value is not changed (this is the behavior required by [XML 1.0 (Second Edition)] for element content)
This one is easy to implement: just make the third argument (the output argument) identical to the second. The second method of normalization is used in XML 1.0 for CDATA attributes:
  • replace All occurrences of #x9 (tab), #xA (line feed) and #xD (carriage return) are replaced with #x20 (space)
< 44 Utility for whitespace normalization [continues 43 Utility for whitespace normalization] > ≡
ws_normalize(replace,In,Out) :-
  atom_codes(In,Lcin),
  ws_blanks(Lcin,Lcout),
  atom_codes(Out,Lcout).



This one requires an auxiliary predicate to replace all whitespace characters in an atom with blanks; ws_blanks walks through a list, changing each tab, linefeed, or carriage return (characters 9, 10, or 13) to blanks (character 32), and leaving all other characters alone.[16]
< 45 Utility to change whitespace characters to blanks [continues 46 Utility for whitespace normalization] > ≡
/* ws_blanks(A,B): where A has any whitespace, B has a blank */
ws_blanks([],[]).
ws_blanks([9|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([10|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([13|T1],[32|T2]) :- ws_blanks(T1,T2).
ws_blanks([H|T1],[H|T2]) :- not(member(H,[9,10,13])), ws_blanks(T1,T2).



The third method of normalization is used in XML 1.0 for non-CDATA attributes:
  • collapse After the processing implied by replace, contiguous sequences of #x20's are collapsed to a single #x20, and leading and trailing #x20's are removed.
< 46 Utility for whitespace normalization [continues 43 Utility for whitespace normalization] > ≡
ws_normalize(collapse,In,Out) :-
  ws_normalize(replace,In,Temp),
  atom_codes(Temp,Lctemp),
  ws_collapse(Lctemp,Lcout),
  atom_codes(Out,Lcout).
Continued in < Utility to change whitespace characters to blanks 45 > , < Utility for collapsing whitespace 47 >


This method, too, requires an auxiliary predicate, ws_collapse:
< 47 Utility for collapsing whitespace [continues 46 Utility for whitespace normalization] > ≡
/* ws_collapse(A,B): B is like A, with all strings of blanks collapsed
 * to single blanks, and leading and trailing blanks stripped. 
 * ws_collapse/2 strips leading blanks, then calls ws_collapse/3 */
ws_collapse([],[]).
ws_collapse([32|T1],T2) :- ws_collapse(T1,T2).
ws_collapse([H|T1],[H|T2]) :- not(H=32), ws_collapse(internal,T1,T2).

/* ws_collapse/3 walks past non-blanks, and when it hits a string 
 * of blanks, it drops all but the last one before a non-blank. */
ws_collapse(internal,[],[]).
ws_collapse(internal,[32],[]).
ws_collapse(internal,[H|T1],[H|T2]) :- 
  not(H=32), 
  ws_collapse(internal,T1,T2).
ws_collapse(internal,[32,32|T1],T2) :- ws_collapse(internal,[32|T1],T2).
ws_collapse(internal,[32,H|T1],[32,H|T2]) :- 
  not(H=32), 
  ws_collapse(internal,T1,T2).



3.4.2.6. Rules for purchase-order types, continued
We need to provide grammatical attributes for each of these items.
The non-terminal lavs_t_PurchaseOrderType carries one grammatical attribute, whose value is the list of parsed-attribute nodes which was matched. In the case of the empty list, the attribute is simple: In the recursion steps, we need to flatten the list; otherwise we end up with a lopsided binary tree, rather than a simple list:
The orderDate attribute has the properties usual for attribute information items:
< 50 Properties for orderDate attribute > ≡
  <:> localname('orderDate')
  && namespacename('')
  && normalizedvalue(SNValue)

This code is used in < Attribute handling for purchaseOrderType 42 >

3.4.2.7. Rules for other types
The USAddress type defines one attribute (country), of type NMTOKEN; it has a fixed value (US).
< 51 Attribute handling for USAddress > ≡
t_USAddress_atts(Lpa,Lpna,Lavs) :-
  lavs_t_USAddress(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_USAddress(LpaPres,Lpa).

lavs_t_USAddress ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_USAddress ::= avs_t_USAddress^^Pa, 
                     lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_USAddress ::= avs_nsd^^Pa, lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_USAddress ::= avs_xsi^^Pa, lavs_t_USAddress^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

avs_t_USAddress ::= [country='US']
  <:> localname('country')
  && namespacename('')
  && normalizedvalue('US')
.
Continued in < Attribute occurrence checking for USAddress 52 >
This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

Since the country attribute has a fixed value, we need to supply a complete parsed-attribute node for use in case the document instance doesn't supply one. We do this as part of the definition of attocc_t_USAddress.
< 52 Attribute occurrence checking for USAddress [continues 51 Attribute handling for USAddress] > ≡
attocc_t_USAddress(LpaPresent,LpaAll) :-
  CountryAtt = node(
    attribute(country),
    [],
    [ (namespacename('')),
      (localname('country')),
      (normalizedvalue('US'))
    ]),
  atts_defaulted(LpaPres,[CountryAtt],LpaAll).



The complex type t_Items defines no attributes, so its grammar for attributes only has rules for namespace declarations and attributes in the XSI namespace. Since there are no attributes, there are no required, defaulted, or forbidden attributes, so we don't need the usual call to attocc_Type.
< 53 Attribute handling for Items type > ≡
t_Items_atts(Lpa,Lpna,Lavs) :-
  lavs_t_Items(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna).

lavs_t_Items ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_Items ::= avs_nsd^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_Items ::= avs_xsi^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

A similar simplification can be used for simple types.
The complex type t_item defines the partNum attribute:
< 54 Attribute handling for t_item > ≡
t_item_atts(Lpa,Lpna,Lavs) :-
  lavs_t_item(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna),
  attocc_t_item(LpaPres,Lpa).

lavs_t_item ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_t_item ::= avs_t_item^^Pa, 
                lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_item ::= avs_nsd^^Pa, lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_t_item ::= avs_xsi^^Pa, lavs_t_item^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

avs_t_item ::= [partNum=Value],
  { t_SKU_value(Value) }
  <:> localname('partNum')
  && namespacename('')
  && normalizedvalue(Value)
.

/* one required attribute: partNum */
attocc_t_item(LpaPres,LpaAll) :-
  atts_present(LpaPres,['':partNum]),
  atts_absent(LpaPres,[]),
  atts_defaulted(LpaPres,[],LpaAll).

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

A single set of rules will suffice for all simple types (string, decimal, integer, date), because by definition simple types have no attributes; any attributes which occur in the instance must be namespace declarations or XSI attributes.
< 55 Attribute handling for simple types > ≡
sT_atts(Lpa,Lpna,Lavs) :-
  lavs_sT(LpaAll,Lavs,[]),
  partition(LpaAll,LpaPres,Lpna).

lavs_sT ::= []
  {Grammatical attributes for empty attribute list 48}
.
lavs_sT ::= avs_nsd^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.
lavs_sT ::= avs_xsi^^Pa, lavs_t_Items^^Lpa
  {Grammatical attributes for attribute-list recursion 49}
.

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

3.4.2.8. Partitioning the list
The rule for partitioning the list of parsed attribute nodes must extract the actual list from the node passed as the first argument, and then the partition is easy:
< 56 partition predicate > ≡
partition(LpaAll,LpaPresent,Lpna) :-
  LpaAll^^attributes(L),
  partition2(L,LpaPresent,Lpna).
partition2([],[],[]).
partition2([Pa|Lpa],LpaPres,[Pa|Lpna]) :-
  Pa^^localname(xmlns), 
  partition2(Lpa,LpaPres,Lpna).
partition2([Pa|Lpa],LpaPres,[Pa|Lpna]) :-
  Pa^^namespacename('http://www.w3.org/2000/xmlns/'), 
  partition2(Lpa,LpaPres,Lpna).
partition2([Pa|Lpa],[Pa|LpaPres],Lpna) :-
  not(Pa^^localname(xmlns)),
  not(Pa^^namespacename('http://www.w3.org/2000/xmlns/')),
  partition2(Lpa,LpaPres,Lpna).

This code is used in < Generic utilities for DCTG-encoded schemas 84 > < Generic utilities for DCTG-encoded schemas 94 >

3.4.3. Rules for content of complex types

As in the DCG grammar given earlier, the most conventional-looking part of our DCTG grammar is the representation of the content models. The base context-free grammar in DCTG notation is substantially the same as the DCG version, except that we add names to the various items on the right-hand side, for use in flattening the lists of children (repeating and optional items otherwise would cause nesting of nodes).
< 57 Rules for purchase-order content models > ≡
t_PurchaseOrderType_cont ::= 
  e_shipTo^^S, e_billTo^^B, opt_e_comment^^C, 
  e_items^^I
{Children attribute of t_PurchaseOrder 61}

.
opt_e_comment ::= []
{Empty list of children for opt_e_comment nonterminal 59}

.
opt_e_comment ::= e_comment^^Comm
{Children for opt_e_comment nonterminal 60}

.

t_USAddress_cont ::= 
  e_name^^N, e_street^^S, e_city^^C, e_state^^ST, e_zip^^Z
{Children attribute of t_USAddress 58}

.

t_Items_cont ::= star_e_item^^L
{Children attribute of t_Items_cont 65}

.
star_e_item    ::= []
{Empty list of children for star_e_item nonterminal 66}

.
star_e_item    ::= e_item^^I, star_e_item^^L
{Children for star_e_item nonterminal 67}

.

t_item_cont ::= e_productName^^PN, e_quantity^^Q, e_USPrice^^USP, 
                opt_e_comment^^C, opt_e_shipDate^^S
{Children attribute of t_item 62}

.

opt_e_shipDate ::= []
{Empty list of children for opt_e_shipdate nonterminal 63}

.
opt_e_shipDate ::= e_shipDate^^S
{Children for opt_e_shipdate nonterminal 64}

.

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

The only grammatical attribute we need to calculate for these non-terminals right now is children, which will be used to supply the children property of the parent element. In order to supply a flat list, rather than an arbitrarily deep one-sided binary tree, we can't simply take the node returned by each rule.
Perhaps the simplest to calculate is the children attribute of the t_USAddress_cont non-terminal: it's just a list of the children. Since no child is optional, there is no variation.
< 58 Children attribute of t_USAddress > ≡
  <:> children([N,S,C,ST,Z])

This code is used in < Rules for purchase-order content models 57 >

More complex, because the comment element is optional, is the children attribute of the t_PurchaseOrderType non-terminal. When a comment is present, we want it listed among the children; when it is not present, however, we don't want any dummy node. The opt_e_comment non-terminal, that is, should have a children attribute which is either the empty list or a list containing the comment node. The standard list-concatenation methods can now be used to yield either [S,B,C,I] or [S,B,I]. The simplest is probably to use flatten, which generates, for a list possibly containing lists as elements, a flat list with no nested lists, by replacing each list with its elements.
< 61 Children attribute of t_PurchaseOrder > ≡
  <:> children(Lpe) ::- 
    C^^children(CC), 
    flatten([S,B,CC,I],Lpe)

This code is used in < Rules for purchase-order content models 57 >

A similar method is used for the item element, which also has optional children.
< 62 Children attribute of t_item > ≡
  <:> children(Lpe) ::- 
    C^^children(CC), 
    S^^children(SC), 
    flatten([PN,Q,USP,CC,SC],Lpe)

This code is used in < Rules for purchase-order content models 57 >

This requires that the opt_e_shipDate non-terminal produce (like opt_e_comment) its own children property:
The items element is just a simple list; its children property can be done using the same methods we used to generate a flat list of attributes, above.
< 65 Children attribute of t_Items_cont > ≡
  <:> children(List) ::- L^^children(List)

This code is used in < Rules for purchase-order content models 57 >

< 67 Children for star_e_item nonterminal > ≡
  <:> children([I|T]) ::- L^^children(T)

This code is used in < Rules for purchase-order content models 57 >

3.4.4. Rules for checking values of simple types

In this layer of our development, we don't need any new rules for checking simple types; we'll just reuse the ones shown above for the DCG version of this schema. We do have to rewrite them in DCTG notation.
3.4.4.1. Built-in simple types
There are two kinds of rules to provide (this may be an unnecessary distinction, but it's what the rest of the program is expecting): xsd_Typename_cont (called from element rules) and xsd_Typename_value (called from elsewhere).
< 68 xsd_Typename_cont rules > ≡
xsd_string_cont ::= [Atom], { xsd_string_value(Atom) }.
xsd_decimal_cont ::= [Atom], { xsd_decimal_value(Atom) }.
xsd_integer_cont ::= [Atom], { xsd_integer_value(Atom) }.
xsd_date_cont ::= [Atom], { xsd_date_value(Atom) }.

This code is used in < Generic utilities for DCTG-encoded schemas 84 >

In this version of our schema, we will do a little better job of checking the validity of the lexical form. Strings remain trivial:
< 69 xsd_Typename_value rules > ≡
/* In our representation of XML, character data is represented
 * as atoms.  Handling of non-ASCII characters is OK if they are
 * in UTF8, but the SWI parser currently has trouble with 
 * numeric character references */
xsd_string_value(LF) :- atom(LF).
Continued in < Checking decimal and integer values 70 > , < Checking date values 71 > , < Lexical form for year 72 > , < Lexical form for month 75 > , < Lexical form for day of month 76 > , < Checking date values 77 > , < Checking date values 78 >
This code is used in < Generic utilities for DCTG-encoded schemas 84 >

Decimals match the pattern [+-]? [0-9]+ ('.' [0-9]*), integers match [+-]? [0-9]+ — we'll use DCTG notation to check the lexical form against these patterns:
< 70 Checking decimal and integer values [continues 69 xsd_Typename_value rules] > ≡
xsd_decimal_value(LF) :- 
  atom_chars(LF,L),
  lexform_decimal(_,L,[]).
xsd_integer_value(LF) :- 
  atom_chars(LF,L),
  integer(_,L,[]).

lexform_decimal ::= integer, fractionalpart.
integer ::= opt_sign, digits.
fractionalpart ::= [].
fractionalpart ::= decimalpoint.
fractionalpart ::= decimalpoint, opt_digits.
opt_sign ::= [].
opt_sign ::= ['+'].
opt_sign ::= ['-'].
decimalpoint ::= ['.'].
opt_digits ::= [].
opt_digits ::= digits.
/* We supply a 'lexval' property on digits, for use in date checking */
digits ::= digit^^D
  <:> lexval([Dv]) ::- D^^lexval(Dv).
digits ::= digit^^D1, digits^^Dd
  <:> lexval([D1val|Ddval]) ::- D1^^lexval(D1val), Dd^^lexval(Ddval).
digit ::= [Ch], { char_type(Ch,digit) }
  <:> lexval(Ch).



Date values can be checked fully using an appropriate grammar; the grammatical attributes of the DCTG notation make it easy to express the leap-year constraints. A lexical form for date is OK if the date is ok; the predicate dateok takes the integer values of year, month, and day as arguments.
< 71 Checking date values [continues 69 xsd_Typename_value rules] > ≡
xsd_date_value(LF) :- 
  atom_chars(LF,LC),
  lexform_date(_,LC,[]).
lexform_date ::= year^^Y, hyphen, month^^M, hyphen, day^^D,
  { Y^^val(Yv), M^^val(Mv), D^^val(Dv),
    dateok(Yv,Mv,Dv) }.



Years may take an optional leading minus sign; their value (the val property) is composed by reading their lexical form as a number (using the standard number_chars predicate).
< 72 Lexical form for year [continues 69 xsd_Typename_value rules] > ≡
/* Years must have at least four digits */
yearnum ::= digit^^D1, digit^^D2, digit^^D3, digits^^Dd
  <:> val(Num) ::- D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    D3^^lexval(Dv3),
    Dd^^lexval(Dv4),
    flatten([Dv1,Dv2,Dv3,Dv4],LF),
    number_chars(Num,LF).
year ::= yearnum^^Y
  <:> val(Num) ::- Y^^val(Num).
year ::= ['-'], yearnum^^Y
  <:> val(Num) ::- Y^^val(N), Num is 0 - N.
hyphen ::= ['-'].



We can, in principle, constrain month values in purely grammatical terms:
< 73 Purely grammatical rule for month > ≡
month ::= ['0'], ['1'].
month ::= ['0'], ['2'].
...
month ::= ['0'], ['9'].
month ::= ['1'], ['0'].
month ::= ['1'], ['1'].
month ::= ['1'], ['2'].

This code is not used elsewhere.

It's a little more compact if we use the number_chars predicate and test the number arithmetically.
< 74 Semi-grammatical rule for month > ≡
month ::= ['0'], digit^^D
  { D^^lexval(Dv), number_chars(V,Dv), V > 0 }
  <:> val(V).
month ::= ['1'], digit^^D
  { D^^lexval(Dv), number_chars(V,Dv), V < 3 }
  <:> val(Val) ::- Val is 10 + V.

This code is not used elsewhere.

And it's easiest to follow, probably, if the context-free part of the grammar allows any two-digit number and we have a guard do the range check, arithmetically:
< 75 Lexical form for month [continues 69 xsd_Typename_value rules] > ≡
month ::= digit^^D1, digit^^D2,
  { D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    number_chars(Num,[Dv1,Dv2]),
    Num > 0,
    Num < 13 }
  <:> val(Num).



We'll do the same for day of the month:
< 76 Lexical form for day of month [continues 69 xsd_Typename_value rules] > ≡
day ::= digit^^D1, digit^^D2,
  { D1^^lexval(Dv1),
    D2^^lexval(Dv2),
    number_chars(Num,[Dv1,Dv2]),
    Num > 0,
    Num < 32 }
  <:> val(Num).



No one is ever happy, of course, unless a date field is also checked for correct handling of leap years. The rules below are one way to do this, and not necessarily the most elegant, but relatively easy to understand: a date is OK if the day of the month is between 1 and 28, inclusive (I am relying on the range checks already performed in the grammar rules), or if the day is 29 or 30 and the month is not 2, or if the day is 31 and the month is one of those which has 31 days. Or, finally, it's OK if the year is divisible by four and its divisibility by 100 and 400 is OK. The latter is just complex enough to be worth putting into a separate predicate.
< 77 Checking date values [continues 69 xsd_Typename_value rules] > ≡
dateok(_Y,_M,D) :- D < 29.
dateok(_Y,M,29) :- M =\= 2.
dateok(_Y,M,30) :- M =\= 2.
dateok(_Y,M,31) :- member(M,[1,3,5,7,8,10,12]).
dateok(Y,2,29) :- (Y >= 0 -> Yx = Y ; Yx is Y + 1), /* adjust for BC */
  0 is Yx mod 4,
  Lc is Yx mod 100,
  L4c is Yx mod 400,
  leapyearcheck(Lc,L4c).



A year is a leap year if (it is divisible by 4, but we already have that) it is not divisible by 100, or else if it is divisible by 400.
< 78 Checking date values [continues 69 xsd_Typename_value rules] > ≡
leapyearcheck(C,_Q) :- C =\= 0. /* it's not a century year, so leapyear */
leapyearcheck(0,0).            /* it's a quad-century year, so leapyear */



We also need to have rules for checking the two simple types declared in the purchase-order schema: SKUs and quantities.
  • t_quantity = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
  • t_SKU = /simpleType(SKU)
The content rules are simple and follow the same pattern as those of the builtin types.
< 79 Simple-type content rules for purchase-order types > ≡
t_SKU_cont ::= [Atom], { t_SKU_value(Atom) }.
t_quantity_cont ::= [Atom], { t_quantity_value(Atom) }.

This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

The SKU value checking has been seen before; all we do here is translate it from DCG into DCTG notation.
< 80 Value-checking rules for SKU > ≡
t_SKU_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(_Structure,Charseq,[]).

sku_value ::= sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part ::= digit, digit, digit.
sku_alpha_part ::= cap_a_z, cap_a_z.
cap_a_z ::= [Char], { char_type(Char,upper) }.
Continued in < Value-checking rules for quantities 81 >
This code is used in < Predicates for purchase-order material 83 > < Predicates for purchase-order material 93 >

The quantity value checking can rely in part on the rule for integers:
< 81 Value-checking rules for quantities [continues 80 Value-checking rules for SKU] > ≡
t_quantity_value(LF) :- 
  atom_chars(LF,Lchars),
  integer(_,Lchars,[]),
  number_chars(Num,Lchars),
  Num < 100.



3.4.5. Overview

Now we need to put all of this together.
The program podctg.pl comprises two kinds of material: predicates specific to the purchase-order schema and generic predicates which would be part of the DCTG representation of any schema.
< 82 DCTG version of the purchase order schema, layer 3 [File podctg.pl] > ≡
/* podctg.pl: a definite-clause translation grammar representation
 * of the sample purchase-order schema from the XML Schema tutorial.
 *
 * This DCTG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
{Predicates for purchase-order material 83}

{Generic utilities for DCTG-encoded schemas 84}




The generic material includes rules for namespace declarations and XSI attributes, the generic routines atts_present etc. for checking attribute-occurrence constraints, and the utility predicates for whitespace normalization and partitioning the attribute list into namespace attributes and others. In addition, it contains the few type-checking predicates we have written for checking built-in simple types.

3.4.6. Evaluation

Summary of test results to go here, with some illustrations of the node structures resulting from parsing instances with this grammar.
Initial tests of the level-3 grammar over the test suite uncovered a number of problems: Prolog syntax errors, typos in variable names, and so on. After correcting the errors, the level-3 grammar accepted all the valid test cases, but continued to die with infinite loops on some. The infinite-loop problem was corrected by adding the nonvar tests to the att_merge predicate; as noted above, adding it introduces an undesirable procedural note to an account of the purchase-order schema we are trying to keep as declarative as possible, but it does not particularly distort the declarative meaning, it only clutters up the predicate.
The grammar as shown above (after the corrections just described) accepts all the valid test cases, and rejects all but one of the error test cases. That test case is E35, which supplies a duplicate attribute; in theory, the upstream XML parser should be catching that well-formedness error, and so I have chosen not to modify the grammar to catch it.
There is one more serious problem remaining in the grammar shown above. The propagation of attributes upward is sometimes a problem at present: the value of the children property of the elements here should be a list of child nodes, but in fact it's a node structure named t_Typename_cont. Instead of
nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[])
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(C) 
  && localname(gi) 
  && namespacename(n).
the rule pattern should read
nt ::= [element(n:gi,Attributes,Content)],
  {
    ct_atts(A,NA,Attributes),
    ct_cont(C,Content,[]),
    C^^children(Ch)
  }
  <:> attributes(A) 
  && namespaceAttributes(NA)
  && children(Ch) 
  && localname(gi) 
  && namespacename(n).
or else the attribute-value assignment should read differently: not
&& children(C)
but
&& children(Ch) ::- C^^children(Ch)
The children property of elements with simple types also has an unexpected value.
The flaws in the assignment of values to grammatical attributes are shown clearly when a pretty-printing predicate is used or written.
These problems will be corrected in the grammar for Layer 4, given in the next section.

3.5. Layer 4: validity, validation-attempted, and error handling

N.B. this section is incomplete.

3.5.1. Goals

3.5.1.1. Additional PSVI properties
In this section, I will add several more PSVI properties to the grammar. Some of these are simple: properties to identify the type against which a given element or attribute was validated, or to say if it was nil. Some are more challenging: providing error codes (and returning a fully parsed result showing what parts are valid and what parts are not valid).
Specifically, in this section I will rewrite the grammar adding the following grammatical attributes:
  • For elements, we already have
    • [local name]
    • [namespace name]
    • [attributes]
    • [namespace attributes]
    • [children]
    and will add
    • [type definition name] (Element Validated by Type)
    • [type definition namespace] (Element Validated by Type) — for this schema, this will always be either the purchase-order namespace or nothing (‘absent’)
    • [type definition anonymous] (Element Validated by Type) — true or false
    • [type definition type] (Element Validated by Type) — simple or complex
    • [nil] (Element Declaration) — true if the type is nillable and the element is nil in fact, false otherwise (the spec introduces this as an alternative to [element declaration]; if [element declaration] is provided, this information is redundant, although perhaps inconvenient to obtain otherwise)
    • [schema default] (Element Validated by Type) — the canonical lexical representation of a default or fixed value (provided whether the value is defaulted or not)
    • [schema error code] (Validation Failure (Element)) — either absent or a list of error codes
    • [schema information] (Schema Information) — attached to the element at which schema-validity assessment began (for us, always the purchaseOrder element); value is a set of namespace schema information items, which are triplets of (namespace, components, documents); lightweight processors are expected to leave the lists of components empty, but I propose to put in at least a list of schema-component designators
    • [schema normalized value] (Element Validated by Type) — if the element is not nilled, and did not default to a default value,[17] then this is the normalized value as validated, otherwise it's absent
    • [schema specified] (Element Default Value) — schema or infoset
    • [validation attempted] (Assessment Outcome (Element)) — full, none, or partial
    • [validation context] (Assessment Outcome (Element)) — the element at which validation began
    • [validity] (Assessment Outcome (Element)) — valid, invalid, or notKnown
    The schema information property has some sub-properties:
    • namespace schema information information item properties
      • [schema components] (Schema Information)
      • [schema documents] (Schema Information)
      • [schema namespace] (Schema Information)
    • schema document information item properties
      • [document] (Schema Information)
      • [document location] (Schema Information)
  • For attributes, we already have
    • [local name]
    • [namespace name]
    • [normalized value]
    and will add
    • [schema default] (Attribute Validated by Type) — canonical form of the default value, if any
    • [schema error code] (Validation Failure (Attribute)) — a list of error diagnostics, or else absent
    • [schema normalized value] (Attribute Validated by Type) — the normalized value as validated
    • [schema specified] (Assessment Outcome (Attribute))
    • [type definition anonymous] (Attribute Validated by Type)
    • [type definition name] (Attribute Validated by Type) — as supplied by schema; if none, may be absent or the processor may supply one
    • [type definition namespace] (Attribute Validated by Type) — the target namespace of the type
    • [type definition type] (Attribute Validated by Type) — for attributes, invariably simple
    • [validation attempted] (Assessment Outcome (Attribute))
    • [validation context] (Assessment Outcome (Attribute))
    • [validity] (Assessment Outcome (Attribute))
3.5.1.2. PSVI properties not supplied
There are some PSVI properties still not represented:
  • for elements:
    • [element declaration] (Element Declaration) — since this schema does not provide PSVI-style access to the schema components themselves, we have nothing to provide here
    • [ID/IDREF table] (ID/IDREF Table) — a set of (id, binding) pairs showing all the IDs used or referred to in a document and the elements to which they are bound (an empty binding reveals an IDREF to a non-existent ID); the purchase order schema has no IDs, and exposing this property is in any case optional for any processor
    • [identity-constraint table] (Identity-constraint Table) — since there are no identity constraints in the purchase-order schema used here as an example, this would in any case be empty
    • [type definition] (Element Validated by Type) — we are not exposing the schema components yet
    • [member type definition] (Element Validated by Type) — we are not exposing the schema components yet
    • [member type definition anonymous] (Element Validated by Type) — the purchase-order schema has no simple union types
    • [member type definition name] (Element Validated by Type) — the purchase-order schema has no simple union types
    • [member type definition namespace] (Element Validated by Type) — the purchase-order schema has no simple union types
    • [notation] (Validated with Notation) — we are not exposing the schema components yet
    • [notation public] (Validated with Notation) — the purchase-order schema does not have any declared notations
    • [notation system] (Validated with Notation) — the purchase-order schema does not have any declared notations
  • for attributes:
    • [attribute declaration] (Attribute Declaration)
    • [member type definition] (Attribute Validated by Type)
    • [member type definition anonymous] (Attribute Validated by Type)
    • [member type definition name] (Attribute Validated by Type)
    • [member type definition namespace] (Attribute Validated by Type)
    • [type definition] (Attribute Validated by Type)
  • ID/IDREF binding information item properties
    • [binding] (ID/IDREF Table)
    • [id] (ID/IDREF Table)
  • Identity-constraint Binding information item properties
    • [definition] (Identity-constraint Table)
    • [node table] (Identity-constraint Table)
3.5.1.3. To do
To do:
  • provide a schemaErrorCode property on each parsed attribute and element
  • check xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes; it is an error if they occur after the first use of the namespace they describe

3.5.2. Error diagnosis in simple types

Perhaps the simplest place to start with error diagnosis is the simple types. Currently, the predicates xsd_decimal_cont, xsd_string_cont and so on succeed only if the lexical form to be checked denotes a legal value; otherwise, they simply fail. What we want to do is to allow them to succeed no matter what they are given as a lexical form, and to return some sort of error diagnostic if appropriate.
[W3C 2001b] and [W3C 2001c] suggest some specific error codes for invalid simple types:
  • cvc-simple-type string not locally valid w.r.t. the given simple type. [Should also have one or more error codes from cvc-datatype-valid in datatypes spec.]
  • cvc-datatype-valid.1 lexical form not datatype-valid w.r.t. a given simple type: does not match a literal in the lexical space.
  • cvc-datatype-valid.2 lexical form not datatype-valid w.r.t. a given simple type: does not denote a member of the value space.
In other words, we can distinguish failure to provide a legal lexical form from failure of that lexical form to denote a legal value. In order to simplify the maintenance of diagnostic messages, we can put these codes into Prolog predicates, so that we can retrieve the description conveniently without writing it into each rule that raises the error. Using the standard predicate concat_atom(+List,-Atom), we can integrate the name of the simple type into the error descriptions:
< 85 Error codes for simple types > ≡
xsd_errorcode(cvc-simple-type,String,Typename,Desc) :-
  concat_atom(['The string "',String,
      '" is not locally valid w.r.t. the given simple type ',
      Typename,'.'],
    Desc).
xsd_errorcode(cvc-datatype-valid.1,S,T,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it matches no literal in the lexical space.'],
    Desc).
xsd_errorcode(cvc-datatype-valid.2,S,T,Desc) :-
  concat_atom(['The lexical form "',S,
      '" is not datatype-valid w.r.t. the given simple type ',
      T,': it denotes a value not in the value space.'],
    Desc).
Continued in < Error codes for elements with simple types 86 >
This code is used in < Generic utilities for DCTG-encoded schemas 94 >

The Type_cont rules are responsible for checking the content of elements against a given simple type; some error codes are relevant not to the lexical form itself but to the element in which it appears:
< 86 Error codes for elements with simple types [continues 85 Error codes for simple types] > ≡
xsd_errorcode(cvc-type.3.1.1,Gi,Tn,Anbad,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its type definition (', Tn,
      ') is simple, but it has attributes other than ',
      'xsi:type, xsi:nil, xsi:schemaLocation, and '
      'xsi:noNamespaceSchemaLocation (specifically '
      Anbad, ').'],
    Desc).
xsd_errorcode(cvc-type.3.1.2,Gi,Tn,GiChild,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its type definition (',Tn,') is simple, but it has '
      'element children (',GiChild,').'],
    Desc).
xsd_errorcode(cvc-type.3.1.2,Gi,Tn,Desc) :-
  concat_atom(['The ',Gi,' element is not locally valid: ',
      'its content is not a legal lexical form for its '
      'simple type ',Tn,'.'],
    Desc).



As before, each simple type will have both a Type_cont and a Type_value predicate. Instead of the call pattern Type_cont(ResultNode,InputChildrenList,[]), the content-checking predicate will have the pattern Type_cont(Gi,LIn,LPn,LErrors), where
  • Gi is the generic identifier of the element being checked (this is solely for use in error messages)
  • LIn is the input list of atoms (for PCDATA) and element() structures provided by the upstream parser
  • LPn is a list of parsed nodes (recall that in level 3 we ended up with a node structure for an intermediate grammatical construct being returned here; we change the call in order to fix that)
  • LErrors is a list of error codes; if the list is empty, the element's content is valid with respect to the type involved.
The call Type_value(Atom) will analogously be replaced by Type_value(Tn,Atom,LErrors), which will return an empty list of errors for valid data, and a non-empty list for invalid data.
For type t_string, any lexical form is legal as long as it is made up of characters legal in an XML document, and any such lexical form corresponds to a legal value. The upstream XML processor should therefore already have performed all the checking that is necessary. If I were feeling both paranoid and energetic, I'd probably write a predicate to transform the atom to a list of character codes, and then check them against the list of legal characters, but for now I'll leave that to the imagination of the reader. So the value-checking predicate for string is fairly simple. If the test atom(LF) fails, we issue a generic cvc-simple-type error, but that should never occur.
< 87 Value checking rules for simple types > ≡
xsd_string_value('xsd:string',LF,[]) :- atom(LF).
xsd_string_value('xsd:string',LF,[Err]) :- 
  not(atom(LF)),
  xsd_errorcode(cvc-simple-type,LF,'xsd:string',DescTemp),
  concat_atom(
    [DescTemp,' N.B. This error was thought to be impossible.'],
    Desc),
  Err = error(cvc-simple-type,Desc).

This code is used in < Generic utilities for DCTG-encoded schemas 94 >

The content-checking rule for string is a bit more complicated, even though it doesn't do anything type-specific. This predicate must:
  • check the content for child elements; if there are any, validate them in lax mode
  • check the content against the simple type t_string; if any errors
< 88 Content-checking rules for simple types > ≡
xsd_string_cont(Gi,LIn,LPn,LErr) :-
  member(element(GiChild,_Atts,_Content),LIn), !,
  xsd_errorcode(cvc-type.3.1.2,Gi,'xsd:string',GiChild,Desc),
  LErr = [error(cvc-type.3.1.2,Desc)],
  lax_validate_seq(LIn,LPn).
  /* This assumes that even though having any element children 
   * makes the element we are checking illegal, we should strictly
   * speaking validate all of the element children in lax mode 
   * anyway (cvc-assess-elt clause 2). */

xsd_string_cont(_Gi,LIn,LPn,LErr) :- 
  gr_xsd_string(LErr,Pn,LIn,[]),
  Pn^^children(LPn).

gr_xsd_string(LErr) ::= [Atom], 
  { xsd_string_value('xsd:string',Atom,LErr) }
  <:> children([Atom]).

This code is used in < Generic utilities for DCTG-encoded schemas 94 >

The predicate lax_validate_seq will be defined below.
t_integer ...
t_decimal ...
t_date ...
t_quantity ...
t_SKU ...

3.5.3. Lax validation of sequences

When checking simple types, we may encounter child elements; if so, the rules for schema-validity assessment of the parent element (listed in Validation Rule: Schema-Validity Assessment (Element)) seem to require that we validate them as best we can.
If we have a declaration for the element (i.e. if we have an element declaration for the given (namespace, localname) pair), then we validate the element against the declaration.
Otherwise, if the element has an xsi:type attribute, and we have a type definition for the QName given, then we validate the element against that type definition.
Otherwise, we mark the element unvalidated.
< 89 Lax validation of sequences > ≡
lax_validate_seq(LIn,LPn) :-
  gr_lax_sequence(Node,LIn,[]),
  Node^^children(LPn).

gr_lax_sequence ::= []
  <:> children([]).
gr_lax_sequence ::= gr_lax_item^^Item, gr_lax_sequence^^Seq
  <:> children([CI|CS]) ::- 
    Item^^child(CI),
    Seq^^children(CS).

gr_lax_item ::= [PCDATA], { atom(PCDATA) },
  <:> child([PCDATA]).
gr_lax_item ::= [element(Gi,Atts,Cont)], 
  { schema_lookup(element,Gi,Rule),
    call(Rule(Node,[element(Gi,Atts,Cont)],[]))
  }
  <:> child(Node).

This code is not used elsewhere.

The run-time lookup of element rules requires a schema_lookup predicate, which we can define thus:
< 90 QName resolution for elements > ≡
/* Define the mapping from generic identifiers to grammar rules. */
schema_lookup(element,'...':purchaseOrder,e_purchaseOrder).
e_purchaseOrder ::= [element('http://www.example.com/PO1':purchaseOrder,
e_shipTo ::= [element(shipTo,Attributes,Content)],
e_billTo ::= [element(billTo,Attributes,Content)],
e_items ::= [element(items,Attributes,Content)],
e_item ::= [element(item,Attributes,Content)],
e_comment ::= [element('http://www.example.com/PO1':comment,Attributes,Content)],
e_name ::= [element(name,Attributes,Content)],
e_street ::= [element(street,Attributes,Content)],
e_city ::= [element(city,Attributes,Content)],
e_state ::= [element(state,Attributes,Content)],
e_zip ::= [element(zip,Attributes,Content)],
e_productName ::= [element(productName,
e_quantity ::= [element(quantity,
e_USPrice ::= [element('USPrice',Attributes,Content)],
e_shipDate ::= [element(shipDate,Attributes,Content)],


This code is not used elsewhere.

For the purchase-order schema, or any fixed schema, we could instead hard-wire the definition of gr_lax_item as expanding to the known attributes, and have a generic rule for unknown elements, along the following lines:
< 91 Alternate approach to lax lookup > ≡
gr_lax_item ::= e_purchaseOrder^^E
  <:> child(E).
gr_lax_item ::= e_shipTo^^E
  <:> child(E).
/* ... */
gr_lax_item ::= e_shipDate^^E
  <:> child(E).
gr_lax_item ::= unknown_element^^E
  <:> child(E).

unknown_element ::= [element(Ns:Ln,Attributes,Content)], 
  { 
    check_unknown_atts(A,NA,Attributes),
    lax_validate_seq(Content,LPn)
  } 
  <:> localname(Ln)
  && namespacename(Ns)
  && attributes(A)
  && namespaceattributes(NA)
  && children(LPn)
  && schemaerrorcode(error(cvc-assess-elt.1.1.1,Desc)) ::-
    xsd_errorcode(cvc-assess-elt.1.1.1,Desc)
  && validationattempted(ValidationCode) ::-
    find_validationcode(lax,LPn)
  /* ... */
  .
unknown_element ::= [Gi,Attributes,Content)],
  {
    check_unknown_atts(A,NA,Attributes),
    lax_validate_seq(Content,LPn)
  } 
  <:> localname(Gi)
  && namespacename(Ns)
  && attributes(A)
  && namespaceattributes(NA)
  && children(LPn)
  && schemaerrorcode(error(cvc-assess-elt.1.1.1,Desc)) ::-
    xsd_errorcode(cvc-assess-elt.1.1.1,Desc)
  && validationattempted(ValidationCode) ::-
    find_validationcode(lax,LPn)
  /* ... */
  .

This code is not used elsewhere.

This approach is simpler, perhaps, but the lookup mechanism given earlier seems likely to be more compact in the long run.

3.5.4. Error diagnosis in complex types

Some error conditions are generic.
Failures of schema-validity assessment:
  • cvc-assess-elt.1.1.1 schema-validity not assessed: no element declaration known for this element
  • cvc-assess-elt.1.1.2 schema-validity not assessed: element was not validated w.r.t. its element declaration (Element Locally Valid was not run).
  • cvc-assess-elt.1.1.3 schema-validity not assessed: the necessary type definition was absent [Should normally be accompanied by error code cvc.type.1.]
  • cvc-assess-elt.1.2.1 schema-validity not assessed: the type definition needed to validate this element was either unknown or absent
  • cvc-assess-elt.1.2.1.2.2 schema-validity not assessed: an xsi:type attribute was present, but its value was not a valid QName
  • cvc-assess-elt.1.2.1.2.3 schema-validity not assessed: an xsi:type attribute was present, but its value did not resolve to any known type definition
  • cvc-assess-elt.1.2.1.2.4 schema-validity not assessed: an xsi:type attribute was present, but the type it denotes is not validly derived from the type stipulated by the processor at run time
  • cvc-assess-elt.1.2.2 schema-validity not assessed: element was not validated w.r.t. either the processor-stipulated type or the local (xsi:type-specified) type definition
  • cvc-assess-elt.2 schema-validity not assessed: either there were children which were not assessed, or attributes which were not assessed. [Should be accompanied by one or more errors from cvc-assess-elt for the children, or cvc-assess-attr for the attributes.]
Local validity w.r.t. an element declaration:
cvc-elt.1 element is not locally valid: no element declaration found
cvc-elt.2 element is not locally valid: element is declared abstract
cvc-elt.3.1 element is not locally valid, element is declared non-nillable, but there is an xsi:nil attribute present (it doesn't matter what value that attribute had, it's legal only on nillable elements)
cvc-elt.3.2.1 element is not locally valid, it's nillable and has xsi:nil="true", but the element is not empty.
cvc-elt.3.2.2 element is not locally valid, it's nillable and has xsi:nil="true", but it has a fixed {value constraint}.
cvc-elt.4.1 element is not locally valid, it has xsi:type but the value is not a valid QName. [Should also have a cvc-simple-type error.]
cvc-elt.4.2 element is not locally valid, it has xsi:type but the value does not resolve to a known type. [Should also have a cvc-resolve-instance error.]
cvc-elt.4.3 element is not locally valid, it has xsi:type but the type named is not validly derived from the declared type. [Should also have a cos-ct-derived-ok error or a cos-st-derived-ok error.]
cvc-elt.5.1.1 element is not locally valid; it is empty and non-nil and has a {value constraint}, but the default value is not a legal value for the local type. [Should also have a cos-valid-default error.]
cvc-elt.5.1.2 element is not locally valid; it is empty and non-nil and has a {value constraint}, but the element (after supplying the default value) is not locally valid according to the actual type definition. [Should also have a cvc-type error.]
cvc-elt.5.2.1 element is not locally valid; failed Element Locally Valid (Type) (cvc-type). [Should also have a cvc-type error.]
cvc-elt.5.2.2.1 element is not locally valid; it has both a fixed value and element children.
cvc-elt.5.2.2.2.1 element is not locally valid; the initial value of its mixed content does not match the canonical lexical representation of the fixed value in the actual declaration.
cvc-elt.5.2.2.2.2 element is not locally valid; the actual value it contains does not match the canonical lexical representation of the fixed value in the actual declaration.
cvc-elt.6 element is not locally valid; failed an identity-constraint. [Should also have a cvc-identity-constraint error.]
cvc-elt.7 element is not locally valid; it's the validation root, but failed the ID/IDREF check. [Should also have a cvc-id error.]
Local validity w.r.t. a type definition:
cvc-type.1 element is not locally valid: no type definition found.
cvc-type.2 element is not locally valid: its type definition has abstract="true".
cvc-type.3.1.1 element is not locally valid: its type definition is simple, but it has attributes other than xsi:type, xsi:nil, xsi:schemaLocation, and xsi:noNamespaceSchemaLocation.
cvc-type.3.1.2 element is not locally valid: its type definition is simple, but it has element children.
cvc-type.3.1.2 element is not locally valid: its content is not a legal lexical form for its simple type. [Should also have a cvc-simple-type error code.]
cvc-type.3.2 element is not locally valid: it is not valid with respect to its complex type. [Should also have a cvc-complex-type error code.]
Local validity w.r.t. a complex type:
cvc-complex-type.1 element not locally valid: complex type does not have abstract="false".
cvc-complex-type.2.1 element not locally valid: complex type is declared empty (has {content type} of empty) but the element is not empty. [Applies only if non-nil.]
cvc-complex-type.2.2 element not locally valid: complex type is derived from a simple type (has a simple type as its {content type}) but the element has element children. [Applies only if non-nil.]
cvc-complex-type.2.2 element not locally valid: complex type is derived from a simple type (has a simple type as its {content type}) but the element is not valid w.r.t. that simple type. [Should also have cvc-simple-type error code.] [Applies only if non-nil.]
cvc-complex-type.2.3 element not locally valid: complex type has {content type} element-only, but the element has non-whitespace characters between child elements. [Applies only if non-nil.]
cvc-complex-type.2.4 element not locally valid: the sequence of element children does not match the content model. [Should also have one or more cvc-particle errors.] [Applies only if non-nil.]
cvc-complex-type.3.1 element not locally valid: attribute not valid w.r.t. declared attribute use. [Should also have one or more cvc-au errors.]
cvc-complex-type.3.2.1 element not locally valid: attribute not declared, matches no wildcard.
cvc-complex-type.3.2.2 element not locally valid: attribute not valid w.r.t. wildcard. [Should also have one or more cvc-wildcard errors.]
cvc-complex-type.4 element not locally valid: required attribute missing.
cvc-complex-type.5.1 element not locally valid: more than one wild ID.
cvc-complex-type.5.2 element not locally valid: there is both a wild ID and a tame ID.
Content model validation:
cvc-particle.1.1 element-sequence not locally valid w.r.t. particle: wildcard match count less than {min occurs}.
cvc-particle.1.2 element-sequence not locally valid w.r.t. particle: wildcard match count greater than numeric {max occurs}.
cvc-particle.1.3 element-sequence not locally valid w.r.t. particle: element not valid w.r.t. wildcard. [Should also have some cvc-wildcard error codes.]
cvc-particle.2.1 element-sequence not locally valid w.r.t. particle: element match count less than {min occurs}.
cvc-particle.2.2 element-sequence not locally valid w.r.t. particle: element match count greater than {max occurs}.
cvc-particle.2.3 element-sequence not locally valid w.r.t. particle: element not matched.
cvc-particle.3.1 element-sequence not locally valid w.r.t. particle: model group must occur at least {min occurs} times.
cvc-particle.3.2 element-sequence not locally valid w.r.t. particle: model group must occur at most {max occurs} times.
cvc-particle.3.2 element-sequence not locally valid w.r.t. particle: subsequence not valid against model group. [Should also have some cvc-model-group errors.]
... to be continued with whatever is cross-referenced from cvc-particle
A simple way to handle errors is to define a fall-back production which accepts pretty much anything, and forces the appropriate grammatical attribute to take the error value. If multiple fallbacks can be defined, each relaxing a different constraint, then we can easily identify which constraint was violated. This may require that we insert a cut into the normal rules, which seems unfortunate.

3.5.5. Overview of level-4 grammar

The level-4 grammar has a structure very similar to that of the level-3 grammar on which it is based.
N.B. at the moment parts or all of this level-4 grammar are still identical to the corresponding parts of the level-3 grammar; they will be replaced one by one as this section is completed.
The program po4.pl comprises both schema-specific and generic material.
< 92 DCTG version of the purchase order schema, layer 4 [File po4.pl] > ≡
/* po4.pl: a definite-clause translation grammar representation
 * of the sample purchase-order schema from the XML Schema tutorial.
 *
 * This DCTG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
{Predicates for purchase-order material 83}

{Generic utilities for DCTG-encoded schemas 84}




3.6. Notes on other features

With the completion of layer 4, we have a fairly complete representation of the purchase-order schema po1.xsd in DCTG form. Because that schema is intentionally rather simple, however, there has been no occasion to show how to translate various constructs into the DCTG formalism. The following sections set out some ideas for how some of the more prominent features of XML Schema missing from the example can be represented.

3.6.1. Handling xsi:type

There are two approaches to handling xsi:type. We can have the base type parse its part, so we get layers and clear delineation between base and extension, or else we can parse the entire thing. I'll describe both, and implement plan B because the separation between base and extension is not exhibited in the PSVI as defined.

3.6.2. Mixed content

Support mixed content. The purchase-order schema does not have any mixed content, and the SWI parser has a mode in which it suppresses text nodes consisting only of whitespace characters, so we have been able to ignore the problem of mixed content and the presence of character information items for non-significant white space in element-only content. In the long run, however, we have to face up to this topic. Three approaches offer themselves:
  • Filter the contents list, removing character data to produce a list of child elements with no intervening character-data atoms, and then validate that list in the way shown above. This is conceptually simple and would be easy to implement.
    Unfortunately, the children attribute of the parent element needs to have the original character-data atoms present in their original places, so what we would really have to do would be first to partition the original children list into
    1. a list of child elements with no intervening character-data atoms
    2. a list of the character data atoms, with place-holders for the elements (actually, the original children list would work fine for this role)
    and then after validating the elements to merge the two lists.
    This is not hard to understand or implement, but it feels too kludgy to be attractive.[18]
  • Modify the rules representing any content model, adding an opt_pcdata or opt_ws token after each element, and also adding a special rule to take care of leading PCDATA or inter-element whitespace. For example, the rules for the content model of t_PurchaseOrderType would change from
    t_PurchaseOrderType_cont ::= 
      e_shipTo, e_billTo, opt_e_comment, 
      e_items.
    opt_e_comment ::= [].
    opt_e_comment ::= e_comment.
    
    to
    t_PurchaseOrderType_cont ::= opt_ws, t_PurchaseOrderType_cont2.
    t_PurchaseOrderType_cont2 ::= 
      e_shipTo, opt_ws, e_billTo, opt_ws, opt_e_comment, 
      e_items, opt_ws.
    opt_e_comment ::= [].
    opt_e_comment ::= e_comment, opt_ws.
    
  • Re-organize the parser so that instead of expressing the grammar in native Prolog rules like the Type_cont rules shown above, we represent them using some Prolog data structure (a combination of lists and structured terms is an obvious choice), and write a parser as a simple grammar interpreter which reads those data structures and does the parsing.[19]
    This approach is tempting, because it allows us to eliminate much of the repetitive structure present in the grammar rules given above: every element has substantially similar rules differing only in having the name of the element and/or type embedded in particular places in the names of the relevant predicates; passing the names as parameters to a single copy of a generic predicate would be less repetitive, although also (it must be admitted) harder to follow at first glance.

3.6.3. Supporting wildcards with skip and lax processing

Show how to handle wildcards with skip, lax, and strict processing. The wildcards will have grammatical productions of their own; these grammars will follow rules different from those of normal non-wildcard particles.

3.6.4. Supporting substitution groups

Substitution groups can be handed by defining multiple element rules with the same left-hand side. If a is in the substitution group of b, then we have both
b --> [element(b,Attributes,Content)],
  {
    b_atts(Attributes,[]),
    b_cont(Content,[])
  }.
a --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.
and also
b --> [element(a,Attributes,Content)],
  {
    a_atts(Attributes,[]),
    a_cont(Content,[])
  }.

3.6.5. Exploiting common code patterns (going to meta-level)

The grammars above are repetitious: for every element with a complex type, the grammars have several predicates differing only in the names of datatypes and element types built into the names of the predicates.
We can make the grammar more compact by replacing the repetitive predicates with single meta-level predicates to which the names of the relevant datatypes and element types are passed as parameters. If we do so, we can no longer take direct advantage of the DCG or DCTG notation and must instead formulate a sort of grammar interpreter to take their place; the code may also become slightly harder to follow. On the other hand, the code can also follow the rules of and more directly and obviously.
Not clear whether it is worthwhile moving to the meta-level here as a level-5 grammar, or it would be better to wait for the next paper. My current leaning is to think that the meta-level interpreter will be easier to build and to understand if we do it in two stages, beginning with one that will handle the purchase-order schema and omits support for some features not exercised in it.

4. Further work

The processor outlined above seems promising enough to make further work seem useful. There are three things worth doing. First, we need to show that a DCTG-based representation of schemas like that described here conforms to the schema component constraints of the spec, and that a parser running the DCTG grammar performs validation according to the validation rules of the spec. Second, it would be useful to represent the schema for schemas as shown, so that we can validate XML-encoded schema documents. Third, it is a natural further step to combine the two levels of processing to make a schema processor which reads schema documents and from them builds DCTGs (or equivalent data structures) with which it validates document instances.
According to section 2.4 of [W3C 2001b], to be minimally conforming a translation of an XML Schema into DCTG notation must
  • implement the schema component constraints (see Appendix C.4 and subsection 6 of each component) I take this to mean that the schemas used by the processor must obey them; the processor may or may not independently enforce them. Proof obligation: prove that the style of DCTG outlined above does satisfy these constraints.
  • implement the validation rules (see Appendix C.1 and subsection 4 of each component). I take this to mean the processor must work correctly from its component representation to the values of the relevant PSVI instance properties. Proof obligation: prove that the style of DCTG outlined above does correctly implement the validation rules.
  • implement (i.e. make) the schema information set contributions (see Appendix C.2 and subsection 5 of each component). Proof obligation: make the parser decorate the input with appropriate grammatical attributes representing the relevant properties.

A. Works cited and further reading

Works cited

Abramson, Harvey. 1984. “Definite Clause Translation Grammars”. Proceedings of the 1984 International Symposium on Logic Programming, Atlantic City, New Jersey, February 6-9, 1984, pp. 233-240. (IEEE-CS 1984, ISBN 0-8186-0522-7)

Abramson, Harvey, and Veronica Dahl. 1989. Logic Grammars. Symbolic Computation AI Series. Springer-Verlag, 1989.

Abramson, Harvey, and Veronica Dahl, rev. Jocelyn Paine. 1990. DCTG: Prolog definite clause translation grammar translator. (Prolog code for translating from DCTG notation to standard Prolog. Note says syntax extended slightly by Jocelyn Paine to accept && between specifications of grammatical attributes, to minimize need for parentheses. Available from numerous AI/NLP software repositotries, including http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/prolog/code/syntax/dctg/0.html, http://www.ims.uni-stuttgart.de/ftp/pub/languages/prolog/libraries/imperial_college/dctg.tar.gz, and http://www.ifs.org.uk/~popx/prolog/dctg/.)

Alblas, Henk. 1991. “Introduction to attribute grammars”. Attribute grammars, applications and systems: International Summer School SAGA, Prague, Czechoslovakia, June 4-13, 1991, Proceedings, pp. 1-15. Berlin: Springer, 1991. Lecture Notes in Computer Science, 545.

Bratko, Ivan. 1990. Prolog programming for artificial intelligence. Second edition. Wokingham: Addison-Wesley. xxi, 597 pp.

Brown, Allen L., Jr., and Howard A. Blair. 1990. “A logic grammar foundation for document representation and layout”. In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation and Typography, ed. Richard Furuta. Cambridge: Cambridge University Press, 1990, pp. 47-64.

Brown, Allen L., Jr., Toshiro Wakayama, and Howard A. Blair. 1992. “A reconstruction of context-dependent document processing in SGML”. In EP92: Proceedings of Electronic Publishing, 1992, ed. C. Vanoirbeek and G. Coray. Cambridge: Cambridge University Press, 1992. Pages 1-25.

Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz).

[Brüggemann-Klein provides a formal definition of 1-unambiguity, which corresponds to the notion of unambiguity in ISO 8879 and determinism in XML 1.0. Her definition of 1-unambiguity can be used to check XML Schema's Unique Particle Attribution constraint by changing every minOccurs and maxOccurs value greater than 1 to 1, if the two are equal, and otherwise changing minOccurs to 1 maxOccurs greater than 1 to unbounded.]

Clocksin, W. F., and C. S. Mellish. Programming in Prolog. Second edition. Berlin: Springer, 1984.

Gal, Annie, Guy Lapalme, Patrick Saint-Dizier, and Harold Somers. 1991. Prolog for natural language processing. Chichester: Wiley, 1991. xiii, 306 pp.

Gazdar, Gerald, and Chris Mellish. 1989. Natural language processing in PROLOG: An introduction to computational linguistics. Wokingham: Addison-Wesley, 1989. xv, 504 pp.

Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html

Holstege, Mary, and Asir S. Vedamuthu, ed. 2002. XML Schema: Component Designators. W3C Working Draft 19 December 2002. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/2002/12/xml-schema-component-designators.html

Knuth, D. E. 1968. “Semantics of context-free languages”. Mathematical Systems Theory 2: 127-145.

König, Esther, and Roland Seiffert. Grundkurs PROLOG für Linguisten. Tübingen: Francke, 1989. [= Uni-Taschenbücher 1525]

Pereira, Fernando C. N., and Stuart M. Shieber, Prolog and natural-language analysis. CSLI lecture notes 10. Stanford: Center for the study of language and information, 1987.

Sperberg-McQueen, C. M. 2002a. “A logic grammar representation for XML Schema”. Working paper prepared for the W3C XML Schema Working Group. [Incomplete]

Sperberg-McQueen, C. M. 2002b. “An XML Schema validator in logic-grammar form”. Working paper prepared for the W3C XML Schema Working Group. [Incomplete]

Stepney, Susan. High-integrity compilation. Prentice-Hall. Available from http://www-users.cs.york.ac.uk/~susan/bib/ss/hic/index.htm. Chapter 3 (Using Prolog) provides a terse introduction to DCTG notation and use.

W3C (World Wide Web Consortium). 2001a. “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

W3C (World Wide Web Consortium). 2001b. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

W3C (World Wide Web Consortium). 2001c. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

Wielemaker, Jan. “SWI-Prolog SGML/XML parser: Version 1.0.14, March 2001”. http://www.swi-prolog.org/packages/sgml2pl.html

Further reading

For more information on the representation of SGML and XML documents as Prolog structures, see the SWI add-ons documentation [Wielemaker 2001]. Other representations are possible; I have used this one because it's convenient and because representing sibling sets in a list makes it easier to use DCGs and DCTGs.
Definite-clause grammars (DCGs) are introduced in almost any good Prolog textbook: e.g. [Clocksin/Mellish 1984], [Bratko 1990]. They are discussed at somewhat greater length in treatments of Prolog for natural-language processing, including [König/Seiffert 1989], [Gazdar/Mellish 1989], and [Gal et al. 1991]. Most extended discussions show how to use additional arguments to record syntactic structure or handle the semantics of the material.
Definite-clause translation grammars were introduced as a way of making it easier to handle semantics; they provide explicit names for attributes (in the sense of attribute grammars [Knuth 1968]).

B. To do

Done:
  • get a layer 1 grammar working (done 2002-09-13, Denver International Airport)
  • add layer 2 (done 2002-09-13)
  • translate to DCTG as layer 3 (completed 2003-01-05, though not yet tested)
  • generate test cases to test each type: correct instances, boundary cases, incorrect instances (done 2002-12-26)
  • ¿ make explicit grammatical attributes for the input infoset properties, as well as those added in the PSVI? (Need to learn a bit more about how SWI Prolog presents this information.)
    • Attribute Information Items:
      • [local name]
      • [namespace name]
      • [normalized value]
    • Character Information Items:
      • [character code]
    • Element Information Items:
      • [local name]
      • [namespace name]
      • [children]
      • [attributes]
      • [in-scope namespaces] or [namespace attributes]
    • Namespace Information Items:
      • [prefix]
      • [namespace name]
    (done as part of level 3)

C. Toward a useful layering

The layering proposed above is a start; I'd like to confirm it by walking through a validation episode or two. I'll start with the purchase order from the tutorial. We visit the following validation rules; under each, I note some features I think should probably be omitted from layer 1, and introduced as elaborations later, in some sequence to be determined.
  • Sec. 5.2, rule 3
    • ability of user or application to stipulate a type definition at startup
    • ability of user or application to stipulate an element declaration at startup
    • ability of user or application to stipulate a starting point other than the root of the document
  • sva_element_334
    • use of xsi:type (and with it clause 1.2)
  • qname_resolution_instance_3f4
  • elementlocallyvalid_element_334
    • absent declarations
    • abstract element declarations
    • xsi:nil
    • default values, fixed values (?)
  • element_locally_valid_type_334
    • absent type definitions
    • abstract types
  • string valid (and from there to simple-type checking)
  • element_locally_valid_complextype
    • attribute wildcards
    • wild IDs
  • element sequence locally valid (particle)
    • wildcards
    • ...
  • element sequence valid
  • attribute locally valid (use) 354
  • schema-validity assessement (attribute) 324
  • assessment outcome (attribute) 325

D. A note on the test cases

The test documents mentioned in the text were created manually to test the implementation of a schema corresponding to the schema document po1.xsd of the XML Schema tutorial [W3C 2001a].
A simple XSLT stylesheet read the schema and produced about 120 suggestions for tests, both valid and invalid documents. Because the stylesheet is primitive and simple-minded, some of the suggestions are not useful; in the end, eleven valid variations of the po1.xml example were produced:
  1. po1v10a.xsd, add additional attributes (namespace declarations, xsi attributes) to purchaseOrder
  2. po1v25.xml, omit optional comment element
  3. po1v33.xml, omit optional orderDate attribute from purchaseOrder
  4. po1v38.xml, add harmless attributes to USAddress elements
  5. po1v62d.xml, valid but unreasonable zip code (3.141592); the declaration is too weak to detect this problem
  6. po1v65.xml, omit country attribute on USAddress
  7. po1v79.xml, empty items element
  8. po1v80.xml, longer list of items (10)
  9. po1v100a.xml, various alternative values for quantity element (1, 43, 55, 99)
  10. po1v100b.xml, various alternative values for quantity element (1, 43, 55, 99)
  11. po1v121.xml, alternative values for partNum
Also, 58 test cases with errors were produced:
  1. po1e4.xml, misspell purchaseOrder
  2. po1e13.xml, omit main seq of PurchaseOrderType
  3. po1e14.xml, duplicate main sequence of PurchaseOrderType
  4. po1e15a.xml, reorder children of purchaseOrder
  5. po1e15b.xml, reorder children of purchaseOrder
  6. po1e16.xml, extraneous child of purchaseOrder
  7. po1e18.xml, omit required shipTo element
  8. po1e19.xml, duplicate shipTo
  9. po1e20.xml, misspell shipTo
  10. po1e27.xml, supply two comment elements (invalid)
  11. po1e28.xml, misspell 'comment'
  12. po1e30.xml, omit items element
  13. po1e31.xml, supply two items elements (invalid)
  14. po1e32.xml, misspell 'items'
  15. po1e35.xml, supply two orderDate attributes
  16. po1e36.xml, misspell 'orderDate' on purchaseOrder
  17. po1e41.xml, omit USAddress contents
  18. po1e42.xml, supply double sequence of address children
  19. po1e43.xml, reorder children of USAddress
  20. po1e44.xml, add extraneous child to USAddress
  21. po1e46.xml, omit name from USAddress
  22. po1e47.xml, provide name twice in USAddress
  23. po1e48.xml, misspell 'name' in USAddress
  24. po1e50.xml, omit street in USAddress
  25. po1e51.xml, double occurrence of street in USAddress
  26. po1e52.xml, misspell 'street'
  27. po1e55.xml, double occurrence of 'city'
  28. po1e56.xml, misspell 'city'
  29. po1e62.xml, omit zip
  30. po1e62b.xml, omit contents of zip element
  31. po1e62c.xml, invalid zip code
  32. po1e63.xml, double occurrence of zip code
  33. po1e64.xml, misspell 'zip' in USAddress
  34. po1e68.xml, misspell attribute name 'country' on USAddress
  35. po1e70.xml, provide value other than 'US' for fixed att 'country'
  36. po1e70b.xml, provide value other than 'US' for fixed att 'country'
  37. po1e78.xml, add extraneous child to sequence in type t_Items
  38. po1e81.xml, misspell 'item'
  39. po1e86.xml, omit contents of item element
  40. po1e87.xml, repeat sequence of children in item
  41. po1e88.xml, reorder children of item
  42. po1e89.xml, add extraneous child to item
  43. po1e91.xml, omit productName
  44. po1e92.xml, reduplicated productName
  45. po1e101a.xml, exceed the maximum legal value for quantity
  46. po1e101b.xml, violate min, lexical form rules for quantity
  47. po1e101c.xml, wrong datatype for quantity
  48. po1e101d.xml, bad value for quantity
  49. po1e105bisa.xml, bad value for USPrice
  50. po1e105bisb.xml, bad value for USPrice
  51. po1e106.xml, misspell USPrice
  52. po1e109.xml, two comments in item
  53. po1e113.xml, two shipDates (invalid)
  54. po1e114.xml, misspell 'shipDate'
  55. po1e116.xml, omit required partNum attribute
  56. po1e122a.xml, bad value for SKU (lowercase letters, char out of range)
  57. po1e122b.xml, bad value for SKU (excess chars)
  58. po1e122c.xml, bad value for SKU, missing hyphen
To run the tests conveniently, the following auxiliary Prolog file can be used. At the moment, it is tied to the manually-modified version of podcg3.pl, but that will change as work on this paper progresses.
< 95 Utility routines for testing Prolog implementations of po1.xsd [File potests.pl] > ≡
/* potest: simple one-item test routine for SWI Prolog 
 * 
 * This DCG was generated by a literate programming system; if
 * maintenance is necessary, make changes to the source (dctgnotes.xml),
 * not to this output file.
 */
grammar_version(current,dctg,'d:/home/cmsmcq/2003/schema/dctg/po4.pl').
grammar_version(old,dctg,'d:/home/cmsmcq/2003/schema/dctg/podctg.pl').
grammar_version(dcg,dcg,'d:/home/cmsmcq/2003/schema/dctg/podcg3.pl').

potest(Grammar,dcg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest(Grammar,dctg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult('d:/usr/lib/prolog/msmdctg.pl'),
  dctg_reconsult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

potest2(XMLDocument,dcg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest2(XMLDocument,dctg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

one(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml').

two(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn).

/* Each of the following should be valid */
good(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v25.xml',Syn),    
  potest2('d:/usr/lib/xmlschema/po/tests/po1v33.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v38.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v62d.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v65.xml',Syn),   
  potest2('d:/usr/lib/xmlschema/po/tests/po1v79.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v80.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100a.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100b.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v121.xml',Syn).

/* Each of the following should raise an error */
bad(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e04.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e13.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e14.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15c.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e18.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e19.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e20.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e27.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e30.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e31.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e32.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e35.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e36.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e41.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e42.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e43.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e44.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e46.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e47.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e48.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e50.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e51.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e52.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e55.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e56.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e63.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e64.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e68.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e78.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e81.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e86.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e87.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e88.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e89.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e91.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e92.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101d.xml',Syn),
  /* 
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisa.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisb.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e106.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e109.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e113.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e114.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e116.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122c.xml',Syn).
/* 62c, 101c, 105bisa are commented out because they blow up 
 * number_chars/2.  Need a more robust approach to checking
 * numbers, apparently. */

ugly(V) :- good(V), bad(v).



The tests can be run from the SWI Prolog command line in bash thus:[20]
cd ~/2003/schema/dctg
$ /cygdrive/d/usr/src/pl/bin/plcon.exe -f potests.pl -g 'one(current)' -t halt
or
$ /cygdrive/d/usr/src/pl/bin/plcon.exe \
>  -f d:/home/cmsmcq/2002/Prolog/potest.pl \
>  -g good -t halt
The command line might be simpler if Prolog has a good search path, but I haven't gotten around to that yet.
The -g option specifies which set of tests to run: one, two (these check whether the tests work at all), good (the valid documents; these should each produce “yes”), bad (the invalid documents), or ugly (good followed by bad).

Notes

[1] Here I am obliged to use the term attribute in the sense defined by Knuth 1968. In order to distinguish the attributes of elements in an XML document from attributes in this sense, I will where necessary refer to the former as XML attributes and to the latter either as grammatical attributes or (adopting the terminology of XML information sets) as properties.
[2] If for some reason the parser backtracks, the vp predicate will also recognize [eats] as a verb phrase, leaving [the,apple] as the unparsed residue; see below.
[3] For small documents, the parser returns the entire document in the list structure described; for large documents, alternative calls are provided to avoid having to have the entire document in memory at a time. I will here use only the simpler all-at-once form of the parser.
[4] Various complications will lead to something slightly different in a production version of this idea, but I start with the simplest idea that I have been able to make work.
[5] The DCG representation of a schema thus looks rather like a set of separate grammars for regular languages, linked together solely by the nested calls. That makes the DCG we are writing unusual as a DCG, but it will work very nicely for us later when we begin to model skip and lax processing.
[6] See [Grune/Jacobs 1990] sec. 2.3.2.3 (pp. 35-37) for discussion of this point.
[7] To be specific, when loading the document instance, we specify load_structure('d:/usr/lib/xmlschema/po/po.xml', P01, [dialect(xmlns),space(remove)]) instead of just load_structure('d:/usr/lib/xmlschema/po/po.xml', P01, [dialect(xmlns)]).
[8] This grammar is supplied as a test case with the standard code for DCTG translation found in many AI repositories [Abramson/Dahl/Paine 1990] transcribed (with slight modifications) by Jocelyn Paine from code supplied by Abramson and Dahl [Abramson/Dahl 1989]. It may possibly mirror an example in [Knuth 1968], but I haven't actually ever seen that article and am not sure. (I have reformatted the code slightly to match the style used below.) Tracing the execution of this code while parsing a short bit string is instructive.
[9] This EBNF ignores some Prolog constructs like curly braces; it is therefore not exact or complete.
[10] The version I found on the net assumes that operator precedence values are in the range 0 to 255, as described in [Clocksin/Mellish 1984]; to make this program work with the many Prolog systems which use the range 0 to 1200, one must adjust the operator precedence values appropriately. A modified copy is available from the author.
[11] The attentive reader will note that SWI Prolog abbreviates complex structures when printing them at the shell; I have added a call to write in order to have the entire structure printed out. I have also manually added white space to the output of write, to make the structures easier to examine.
[12] In a more adequate grammar of English, of course, some would be recognized as compatible with singular nouns, too: “Every man loves some woman.”
[13] A more complete grammar would attribute a property to the verb to say whether it is transitive or intransitive; verbs marked intransitive cannot take a direct object, though transitive verbs can normally be used without one (“John eats the apple” but also “John eats”). This refinement is omitted here; the grammar is already long enough.
[14] In more serious grammars, strenuous efforts are made to avoid having to have separate entries for the singular and the plural of every noun. One reason for introducing a lex predicate is to insulate the definition of the pre-terminals from the lexicon.
[15] It is probably more usual in production Prolog systems to use the cut for this purpose; I am trying to avoid its use so as to avoid having to explain its meaning to readers who don't already know Prolog. Using cut, the rule would read thus:
< 40 Alternate rule for defaulted attributes, with cut > ≡
att_merge([],Padft,[Padft]).
att_merge([Pa|Lpa],Padft,[Pa|Lpa]) :-
  Pa^^namespacename(NS),
  Padft^^namespacename(NS),
  Pa^^localname(Lnm),
  Padft^^localname(Lnm), !.
att_merge([Pa|Lpa],Padft,Lpa2) :-
  att_merge(Lpa,Padft,Lpa2).

This code is not used elsewhere.

The if-then-else construct may also be used with the same effect:
< 41 Alternate rule for defaulted attributes, with if > ≡
att_merge([],Padft,[Padft]).
att_merge([Pa|Lpa],Padft,Lpa2) :-
  ( { Pa^^namespacename(NS),
      Padft^^namespacename(NS),
      Pa^^localname(Lnm),
      Padft^^localname(Lnm) } -> 
    Lpa2 = [Pa|Lpa]
    ; 
    att_merge(Lpa,Padft,Lpa2)
  ).

This code is not used elsewhere.

[16] Readers who are not Prolog programmers should know that atom_codes(X,Y) is a relation which holds between an atom X and a list of integers Y which correspond, in order, to the characters used to spell the atom X.
[17] The spec appears inconsistent on this point; the definition of [schema specified] says that if the element defaulted, then this property is filled in, rather than being absent. I may have my positives and negatives mixed up.
[18] The problem is that it's impossible to imagine a memory-constrained processor being able to work in any analogous way; the concern may be misplaced for purposes of this Prolog representation of schema, since we need to keep the elements around in memory indefinitely no matter what.
[19] [Pereira/Shieber 1987] provides some examples of the kind of embedded interpreter described here, including a left-corner parser which does not suffer from infinite loops on left-recursive grammar rules.
[20] The $ and > characters are provided by bash, the \ immediately followed by a newline is just a way to continue a single command onto a new line.