Searching for patterns in XML siblings

[9 April 2013]

I’ve been experimenting with searches for elements in XML documents whose sequences of children match certain patterns. I’ve got some interesting results, but before I can describe them in a way that will make sense for the reader, I’ll have to provide some background information.

For example, consider a TEI-encoded language corpus where each word is tagged as a w element and carries a pos attribute. At the bottom levels of the XML tree, the documents in the corpus might look like this (this extract is from COLT, the Bergen Corpus of London Teenage English, as distributed by ICAME, the International Computer Archive of Modern and Medieval English; an earlier version of COLT was also included in the British National Corpus):

<u id="345" who="14-7">
<s n="407">
<w pos="PPIS1">I</w>
<w pos="VBDZ">was</w>
<w pos="XX">n't</w>
<w pos="JJ">sure</w>
<w pos="DDQ">what</w>
<w pos="TO">to</w>
<w pos="VVI">revise</w>
<w pos="RR">though</w>
</s>
</u>
<u id="346" who="14-1">
<s n="408">
<w pos="PPIS1">I</w>
<w pos="VV0">know</w>
<w pos="YCOM">,</w>
<w pos="VBZ">is</w>
<w pos="PPH1">it</w>
<w pos="AT">the</w>
<w pos="JJ">whole</w>
<w pos="JJ">bloody</w>
<w pos="NN1">book</w>
<w pos="CC">or</w>
<w pos="RR">just</w>
<w pos="AT">the</w>
<w pos="NN2">bits</w>
<w pos="PPHS1">she</w>
<w pos="VVD">tested</w>
<w pos="PPIO2">us</w>
<w pos="RP">on</w>
<w pos="YSTP">.</w>
</s>
</u>

These two u (utterance) elements record part of a conversation between speaker 14-7, a 13-year-old female named Kate of unknown socio-economic background, and speaker 14-1, a 13-year-old female named Sarah in socio-economic group 2 (that’s the middle group; 1 is high, 3 is low).

Suppose our interest is piqued by the phrase “the whole bloody book” and we decide we to look at other passages where we find a definite article, followed by two (or more) adjectives, followed by a noun.

Using the part-of-speech tags used here, supplied by CLAWS, the part-of-speech tagger developed at Lancaster’s UCREL (University Centre for Computer Corpus Research on Language), this amounts at a first approximation to searching for a w element with pos = AT, followed by two or more w elements with pos = JJ, followed by a w element with pos = NN1. If we want other possible determiners (“a”, “an”, “every”, “some”, etc.) and not just “the” and “no”, and other kinds of adjective, and other forms of noun, the query eventually looks like this:

let $determiner := ('AT', 'AT1', 'DD',
'DD1', 'DD2',
'DDQ', 'DDQGE', 'DDQV'),
$adjective := ('JJ', 'JJR', 'JJT', 'JK',
'MC', 'MCMC', 'MC1', 'MD'),
$noun := ('MC1', 'MC2', 'ND1',
'NN', 'NN1', 'NN2',
'NNJ', 'NNJ2',
'NNL1', 'NNL2',
'NNT1', 'NNT2',
'NNU', 'NNU1', 'NNU2',
'NP', 'NP1', 'NP2',
'NPD1', 'NPD2',
'NPM1', 'NPM2' )

let $hits :=
collection('COLT')
//w[@pos=$determiner]
[following-sibling::w[1][@pos = $adjective]
[following-sibling::w[1][@pos = $adjective]
[following-sibling::w[1][@pos = $noun]
]]]
for $h in $hits return
<hit doc="{base-uri($h)}">{
$h,
<orth>{
normalize-space(string($h/..))
}</orth>,
$h/..
}</hit>

Such searches pose several problems, for which I’ve been mulling over solutions for a while now.

  • One problem is finding a good way to express the concept of “two or more adjectives”. (The attentive reader will have noticed that the XQuery given searches for determiners followed by exactly two adjectives and a noun, not two or more adjectives.)

    To this, the obvious solution is regular expressions over w elements. The obvious problem standing in the way of this obvious solution is that XPath, XQuery, and XSLT don’t actually have support in their syntax or in their function library for regular expressions over sequences of elements, only regular expressions over sequences of characters.

  • A second problem is finding a syntax for expressing the query which ordinary working linguists will find less daunting or more convenient than XQuery.

    Why ordinary working linguists should find XQuery daunting, I don’t know, but I’m told they will. But even if one doesn’t find XQuery daunting, one may find the syntax required for sibling searches a bit cumbersome. The absence of a named axis meaning “immediate following sibling” is particularly inconvenient, because it means one must perpetually remember to add “[1]” to steps; experience shows that forgetting that predicate in even one place can lead to bewildering results. Fortunately (or not), the world in general (and even just the world of corpus linguistics) contains a large number of query languages that can be adopted or used for inspiration.

    Once such a syntax is invented or identified, of course, one will have the problem of building an evaluator for expressions in the new language, for example by transforming expressions in the new syntax into XQuery expressions which the XQuery engine or an XSLT processor evaluates, or by writing an interpreter for the new language in XQuery or XSLT.

  • A third problem is finding a good way to make the queries faster.

    I’ve been experimenting with building user-level indices to help with this. By user-level indices I mean user-constructed XML documents which serve the same purpose as dbms-managed indices: they contain a subset of the information in the primary (or ‘real’) documents, typically in a different structure, and they can make certain queries faster. They are not to be confused with the indices that most database management systems can build on their own, with or without user action. Preliminary results are encouraging.

More on these individual problems in other posts.