Mechanization of logic and mechanization of numerical calculation

[9 January 2015]

In their book Mechanization of reasoning in a historical perspective (Amsterdam: Rodopi, 1995), Witold Marciszewski and Roman Murawski write (actually, this chapter is by WM):

Albert Einstein could not do without arithmetical data-processing in his computations, but had no need to resort to rules of formalized deduction in his reasonings …

Point taken. (It should be noted that WM and RM define data-processing broadly, to include things like the mechanical rules for multiplication and long division taught in primary schools.) We use mechanical rules for arithmetic calculation all the time, but very seldom for inference. It would be interesting to have a theory as to why this might be so. Perhaps humans are better at inference than at arithmetic? (Certain kinds of inference will, I guess, have been reinforced and selected for by evolution.) Or perhaps humans are so bad at inference that people seldom or never engage in chains of inferences that would require mechanical aid? As if we never did arithmetic with numbers greater than ten?

Of course, sometimes people make mistaken inferences, which mechanized reasoning aids could in theory prevent.

And compare Leibniz’s suggestion that people would do better in some endeavors if they did resort more often to rules of formalized deduction (in “De logica nova condenda”, in Die Grundlagen des logischen Kalküls, ed. and tr. Franz Schupp with Stephanie Weber [Hamburg: Meiner, 2000], p. 2 [my translation, following Schupp and Weber]):

Est vero in nostra potestate, ut in colligendo non erremus, si scilicet quoad argumentandi formam rigide observemus regulas Logicas, quoad materiam vero nullas assumamus propositiones, quarum vel veritas, vel major ex datis probabilitas, non sit jam antea rigorose demonstrata. Quam methodum secuti sunt Mathematici, admirando cum successu.

Est etiam in potestate nostra, ut controversias finiamus, si scilicet argumenta quae afferuntur in formam accurate redigamus, non syllogismos tantum formando atque examinando, sed et prosyllogismos, et prosyllogismorum prosyllogismos, donec vel absolvatur probatio, vel constet quid adhuc investigandum probandumve argumentanti supersit, ne scilicet inani circulo priora repetat, et Diogenis dolium volvat.

But it is in our power not to err in logical inference, namely if we rigidly observe the rules of logic, with respect to the form of argument, and if with respect to the subject matter we assume nothing whose truth, or at least its likelihood, has not been shown on the basis of available data. Which is the method that the mathematicians have followed with admirable success.

It is also in our power to put an end to controversies, namely if we put the arguments brought forward accurately into a form, in which we not only formulate and examine the syllogisms of the argument, but the prosyllogisms, and the prosyllogisms of the prosyllogisms, until finally the proof is completed, or else it is established what parts of the argument must be further investigated in order to avoid falling into an empty circle repeating what has already been said, and rolling the tub of Diogenes.

Is there a contradiction here, or can these two views be reconciled?

More digital than thou

[16 December 2013]

An odd thing has started happening in reviews for the Digital Humanities conference: reviewers are objecting to papers if the reviewer thinks they have relevance beyond the field of DH, apparently on the grounds that the topic is then insufficiently digital. It doesn’t matter how relevant the topic is to work in DH, or how deeply embedded the topic is in a core DH topic like text encoding — if some reviewers don’t see a computer in the proposal, they want to exclude it from the conference.

[“You’re making this up,” said my evil twin Enrique. “That’s absurd; I don’t believe it.” Absurd, yes, but made up? no. The reviewing period for DH 2014 just ended. One paper proposal I reviewed addressed the conceptual analysis of … I’ll call the topic X, to preserve some fig leaf of anonymity here; X in turn is a core practice of many DH projects and an essential part of any account of the meaning of tag sets like that of the Text Encoding Initiative. I thought the paper constituted a useful contribution to an ongoing discussion within the framework of DH. Another reviewer found the paper interesting and thoughtful but found nothing specifically digital in the proposal [after all, X can arise in a pen and paper world, too, computers are not essential], graded it 0 for relevance to DH theory and practice, and voted to reject it. It should be submitted, this reviewer said, to a conference in [another field where X is a concern] and not in a DH conference. I showed this to Enrique. For once, he was speechless. Thank you, Reviewer 2!]

At some level, the question is what counts as work in the field of digital humanities. Is it only work in which computers figure in an essential role? Or is digital humanities concerned with the application of computers to the humanities and all the problems that arise in that effort? I think the latter. Some reviewers appear to think the former. If solving an essential problem turns out to involve considerations that are not peculiar to work with computing machines, however, what do we do? I believe that the problem remains relevant to DH because it’s rooted in DH and because those interested in doing good work in DH need the answer. The other reviewer seems to take the view that once the question becomes more general, and applicable to work that doesn’t use computers, the question is no longer peculiarly digital, and thus not sufficiently digital; they would like to exclude it from the DH conference on the grounds that it addresses a question of interest not only to DH but also to other fields.

[“Wait a second,” said Enrique. “You’re saying there’s a pattern here. Isn’t it just this one reviewer?” “No,” I said. “This line of thought also came up in at least one other review (possibly in a more benign form), and it has also been seen in past years’ reviews.” “Is that why you’re so touchy about this?” laughed Enrique. “Did they reject one of your papers on this account?” “Oh, hush.”]

The notion that there must be something about a topic that is peculiar to the digital humanities, as opposed to the broader humanities disciplines, makes sense perhaps if one believes that the digital humanities are intrinsically different from the pre-electronic humanities disciplines. On that view, any DH topic is necessarily distinct from any non-DH humanities topic, and once a topic is relevant to the broader humanistic fields (e.g. “what is the nature of X?”), it is ipso facto no longer a DH topic.

This is like arguing that papers about the concept of literary genre don’t belong at a conference about English literary history, because there is nothing peculiarly English about the notion of genre and any general discussion will (if it’s worth anything at all) also be relevant to non-English literatures. Or like trying to exclude work on the theory of computation from computer science conferences because it applies to all computation, not only to computation carried out by electronic stored-program binary digital computers.

[“I notice that leaders in the field of computer science occasionally feel obliged to remark that the name computer science is a misnomer,” said Enrique, “because computers are in no sense an essential element in the field.” “Perhaps they have the same problem, in their way,” I said. “My sympathy,” said Enrique.]

Another problem I see with the view my co-reviewer seems to hold is that some people believe that DH is not intrinsically different from the pre-electronic humanities disciplines. I didn’t get involved with computers in order to stop doing traditional philology; I got involved to do it better. Computers allow a much broader basis for our literary and linguistic argumentation, and they demand a higher degree of precision than philologists typically achieved in past decades or centuries, but I believe that digital philology is recognizably the same discipline as pre-digital philology. If the DH conference were to start refusing papers on the grounds that they are describing work that might be relevant to scholars working with pen and paper, then it would be presupposing an answer to the important question of how computers change and don’t change the world, instead of encouraging discussion of that question. (And it would be excluding people like me from the conference, or trying to.)

[“Trying in vain, I bet,” said Enrique. “Well, yeah. I’m here, and I’m not leaving.”]

These exclusionary impulses are ironic, in their way, because one of the motive forces behind the formation of scholarly organizations like the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (now the European Association for Digital Humanities) and behind their journals and conferences was that early practitioners in the field were often made to feel unwelcome in the conferences and journals of their home disciplines. The late Paul Fortier was eloquent on the subject of the difficulties he had trying to publish his computer-assisted work on Céline in conventional journals for Romance languages and literatures; he often spoke of Computers and the Humanities as having made it possible for him and others like him to have an academic career.

It will be a sad thing if the recent growth in degree programs devoted to digital humanities turns out to result in the field of DH setting up signs at all boundaries reading “Outsiders not wanted.”

Flattening structure without losing all rules

[25 November 2013, typos corrected 26 November]

The other day during the question-and-answer session after my talk at Markupforum 2013, Marko Hedler speculated that some authors (and perhaps some others) might prefer to think of text as having a rather flat structure, not a hierarchical nesting structure: a sequence of paragraphs, strewn about with the occasional heading, rather than a collection of hierarchically nesting sections, each with a heading of the appropriate level.

One possible argument for this would be how well the flat model agrees with the text model implicit in word processors. Whether word processors use a flat model because their developers have correctly intuited what authors left to their own devices think, or whether authors who want a flat structure think as they do because their minds have been poisoned by exposure to word processors, we need not now decide.

In a hallway conversation, Prof. Hedler wondered if it might not be possible to invent some sort of validation technique to define such a flat sequence. To this, the immediate answer is yes (we can validate such a sequence) and no (no new invention is needed): as the various schemas for HTML demonstrate, it’s easy enough to define a text body as a sequence of paragraphs and headings:

<!ELEMENT body (p | ul | ol | blockquote
| h1 | h2 | h3 ...)*>

What, we wondered then, if we want to enforce the usual rules about heading levels? We need to check the sequence of headings and ensure that any heading is at most one level deeper than the preceding heading: a level-one heading may be followed by another level-one heading or by a level-two heading, but not by a heading of level three or greater, while (for example) a fourth-level heading can be immediately followed by a fifth-level heading, but not a sixth-level heading. And so forth.
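Stated procedurally, the rule is simple enough. Here is a hypothetical sketch in Python (an illustration, not a schema language), checking a flat sequence of heading levels:

```python
# Hypothetical illustration: check that in a flat sequence of heading
# levels (h1 -> 1, h2 -> 2, ...), each heading is at most one level
# deeper than the heading before it. Jumps back up are always allowed.
def heading_levels_ok(levels):
    prev = 0  # before the first heading we stand at "level zero",
              # so the sequence must open with a level-one heading
    for level in levels:
        if level > prev + 1:
            return False
        prev = level
    return True
```

So `heading_levels_ok([1, 2, 2, 3, 1])` holds, while `heading_levels_ok([1, 3])` does not.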

The discussion having begun with the conjecture that conventional vocabularies (like TEI or JATS) use nesting sections because nesting sections are the kinds of things we can specify using context-free grammars, we were inclined to believe that it might be hard to perform this level check with a conventional grammar-based schema language. In another talk at the conference, Gerrit Imsieke sketched a solution in Schematron (from memory, something like this rule for an h1: following-sibling::* [matches(local-name(), '^hd')] [1] /self::*[self::h1 or self::h2] — that can’t be right as it stands, but you get the idea).

It turns out we were wrong. It’s not only not impossible to check heading levels in a flat structure using a grammar-based schema language, it’s quite straightforward. Here is a solution in DTD syntax:

<!ENTITY p-level 'p | ul | ol | blockquote' >
<!ENTITY pseq '(%p-level;)+' >
<!ENTITY h6seq 'h6, %pseq;' >
<!ENTITY h5seq 'h5, (%pseq;)?, (%h6seq;)*'>
<!ENTITY h4seq 'h4, (%pseq;)?, (%h5seq;)*'>
<!ENTITY h3seq 'h3, (%pseq;)?, (%h4seq;)*'>
<!ENTITY h2seq 'h2, (%pseq;)?, (%h3seq;)*'>
<!ENTITY h1seq 'h1, (%pseq;)?, (%h2seq;)*'>

<!ELEMENT body ((%pseq;)?, (%h1seq;)*)>

Note: as defined, these fail (for simplicity’s sake) to guarantee that a section at any level must contain either paragraph-level elements or subsections (lower-level headings) or both; a purely mechanical reformulation can make that guarantee:

<!ENTITY h1seq 'h1,
((%pseq;, (%h2seq;)*)
| (%h2seq;)+)' >
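As an aside, the same bottom-up composition carries over outside DTD syntax: map each element name to a single character, and the parameter entities above translate directly into an ordinary character-level regular expression. A hypothetical Python sketch (the character mapping and the names are mine):

```python
import re

# One character per element; the patterns then compose bottom-up
# exactly as the parameter entities in the DTD do.
CHAR = {'p': 'p', 'ul': 'u', 'ol': 'o', 'blockquote': 'b',
        'h1': '1', 'h2': '2', 'h3': '3',
        'h4': '4', 'h5': '5', 'h6': '6'}

pseq  = '[puob]+'                  # (%p-level;)+
h6seq = f'6{pseq}'                 # h6, %pseq;
h5seq = f'5({pseq})?({h6seq})*'    # h5, (%pseq;)?, (%h6seq;)*
h4seq = f'4({pseq})?({h5seq})*'
h3seq = f'3({pseq})?({h4seq})*'
h2seq = f'2({pseq})?({h3seq})*'
h1seq = f'1({pseq})?({h2seq})*'
body  = re.compile(f'({pseq})?({h1seq})*')

def body_ok(tags):
    """Validate a flat sequence of element names against the grammar."""
    return body.fullmatch(''.join(CHAR[t] for t in tags)) is not None
```

The general point is that a flat sequence of siblings is just a string of element names, and a regular grammar over it is an ordinary regular expression.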

Searching for patterns in XML siblings

[9 April 2013]

I’ve been experimenting with searches for elements in XML documents whose sequences of children match certain patterns. I’ve got some interesting results, but before I can describe them in a way that will make sense for the reader, I’ll have to provide some background information.

For example, consider a TEI-encoded language corpus where each word is tagged as a w element and carries a pos attribute. At the bottom levels of the XML tree, the documents in the corpus might look like this (this extract is from COLT, the Bergen Corpus of London Teenage English, as distributed by ICAME, the International Computer Archive of Modern and Medieval English; an earlier version of COLT was also included in the British National Corpus):

<u id="345" who="14-7">
<s n="407">
<w pos="PPIS1">I</w>
<w pos="VBDZ">was</w>
<w pos="XX">n't</w>
<w pos="JJ">sure</w>
<w pos="DDQ">what</w>
<w pos="TO">to</w>
<w pos="VVI">revise</w>
<w pos="RR">though</w>
</s>
</u>
<u id="346" who="14-1">
<s n="408">
<w pos="PPIS1">I</w>
<w pos="VV0">know</w>
<w pos="YCOM">,</w>
<w pos="VBZ">is</w>
<w pos="PPH1">it</w>
<w pos="AT">the</w>
<w pos="JJ">whole</w>
<w pos="JJ">bloody</w>
<w pos="NN1">book</w>
<w pos="CC">or</w>
<w pos="RR">just</w>
<w pos="AT">the</w>
<w pos="NN2">bits</w>
<w pos="PPHS1">she</w>
<w pos="VVD">tested</w>
<w pos="PPIO2">us</w>
<w pos="RP">on</w>
<w pos="YSTP">.</w>
</s>
</u>

These two u (utterance) elements record part of a conversation between speaker 14-7, a 13-year-old female named Kate of unknown socio-economic background, and speaker 14-1, a 13-year-old female named Sarah in socio-economic group 2 (that’s the middle group; 1 is high, 3 is low).

Suppose our interest is piqued by the phrase “the whole bloody book” and we decide to look at other passages where we find a definite article, followed by two (or more) adjectives, followed by a noun.

Using the part-of-speech tags supplied by CLAWS, the part-of-speech tagger developed at Lancaster’s UCREL (University Centre for Computer Corpus Research on Language), this amounts, at a first approximation, to searching for a w element with pos = AT, followed by two or more w elements with pos = JJ, followed by a w element with pos = NN1. If we want other possible determiners (“a”, “an”, “every”, “some”, etc.) and not just “the” and “no”, and other kinds of adjective, and other forms of noun, the query eventually looks like this:

let $determiner := ('AT', 'AT1', 'DD',
'DD1', 'DD2',
'DDQ', 'DDQGE', 'DDQV'),
$adjective := ('JJ', 'JJR', 'JJT', 'JK',
'MC', 'MCMC', 'MC1', 'MD'),
$noun := ('MC1', 'MC2', 'ND1',
'NN', 'NN1', 'NN2',
'NNJ', 'NNJ2',
'NNL1', 'NNL2',
'NNT1', 'NNT2',
'NNU', 'NNU1', 'NNU2',
'NP', 'NP1', 'NP2',
'NPD1', 'NPD2',
'NPM1', 'NPM2' )

let $hits :=
    //w[@pos = $determiner]
      [following-sibling::w[1][@pos = $adjective]
      [following-sibling::w[1][@pos = $adjective]
      [following-sibling::w[1][@pos = $noun]]]]
for $h in $hits return
    <hit doc="{base-uri($h)}">{ $h }</hit>

Such searches pose several problems, for which I’ve been mulling over solutions for a while now.

  • One problem is finding a good way to express the concept of “two or more adjectives”. (The attentive reader will have noticed that the XQuery given searches for determiners followed by exactly two adjectives and a noun, not two or more adjectives.)

    To this, the obvious solution is regular expressions over w elements. The obvious problem standing in the way of this obvious solution is that XPath, XQuery, and XSLT don’t actually have support in their syntax or in their function library for regular expressions over sequences of elements, only regular expressions over sequences of characters.

  • A second problem is finding a syntax for expressing the query which ordinary working linguists will find less daunting or more convenient than XQuery.

    Why ordinary working linguists should find XQuery daunting, I don’t know, but I’m told they will. But even if one doesn’t find XQuery daunting, one may find the syntax required for sibling searches a bit cumbersome. The absence of a named axis meaning “immediate following sibling” is particularly inconvenient, because it means one must perpetually remember to add “[1]” to steps; experience shows that forgetting that predicate in even one place can lead to bewildering results. Fortunately (or not), the world in general (and even just the world of corpus linguistics) contains a large number of query languages that can be adopted or used for inspiration.

    Once such a syntax is invented or identified, of course, one will have the problem of building an evaluator for expressions in the new language, for example by transforming expressions in the new syntax into XQuery expressions which the XQuery engine or an XSLT processor evaluates, or by writing an interpreter for the new language in XQuery or XSLT.

  • A third problem is finding a good way to make the queries faster.

    I’ve been experimenting with building user-level indices to help with this. By user-level indices I mean user-constructed XML documents which serve the same purpose as dbms-managed indices: they contain a subset of the information in the primary (or ‘real’) documents, typically in a different structure, and they can make certain queries faster. They are not to be confused with the indices that most database management systems can build on their own, with or without user action. Preliminary results are encouraging.
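In the same spirit as the query above, the first problem has a serviceable workaround even without regular expressions over elements: serialize the pos attributes of a sequence of sibling w elements into a string, one character per element, and run an ordinary character regex, in which “two or more adjectives” is simply AA+. A hypothetical Python sketch (the tag sets here are abbreviated subsets of the CLAWS lists in the query above):

```python
import re

# One character per w element: D(eterminer), A(djective), N(oun), x(other).
# Abbreviated tag sets for illustration; the full lists appear in the query.
DET  = {'AT', 'AT1', 'DD', 'DD1', 'DD2', 'DDQ', 'DDQGE', 'DDQV'}
ADJ  = {'JJ', 'JJR', 'JJT', 'JK'}
NOUN = {'NN', 'NN1', 'NN2'}

def find_det_adj_noun(pos_tags):
    """Return (start, end) index pairs of runs matching
    determiner, two or more adjectives, noun."""
    s = ''.join('D' if t in DET else
                'A' if t in ADJ else
                'N' if t in NOUN else 'x'
                for t in pos_tags)
    return [(m.start(), m.end()) for m in re.finditer(r'DAA+N', s)]
```

On the pos sequence of the COLT utterance quoted above (… AT JJ JJ NN1 …), this finds exactly the span covering “the whole bloody book”, and, unlike the XQuery, it handles any number of adjectives greater than one.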

More on these individual problems in other posts.

URI casuistry

[4 March 2013]

I’ve been working lately on improving a DTD parser I wrote some time ago.

It’s been instructive to work with libcurl, the well-known URL transfer library by Daniel Stenberg and others, and with uriparser, a library for parsing and manipulating URIs written by Weijia Song and Sebastian Pipping (to all of whom thanks); I use the latter to absolutize relative references in entity declarations and the former to dereference the resulting absolute URIs.

A couple of interesting problems arise.

Relative references in parameter-entity declarations

When a parameter entity declaration uses a relative reference in its system identifier, you need to resolve that relative reference against the base URI. Section 5 of RFC 3986 is clear that the base URI should come from the URI of the DTD file in which the relative reference is used. So if the DTD file doc.dtd contains a parameter entity declaration of the form

<!ENTITY % chapters SYSTEM "chapters.dtd">

then the relative reference chapters.dtd is resolved against the URI of doc.dtd to produce the absolutized form of the system identifier. This is true even if the reference to the parameter entity chapters occurs not in doc.dtd but in some other file: the logic is, as I understand it, that the relative reference is actually used when the parameter entity is declared, not when it is referenced, and the base URI comes from the place where the relative reference is used. Of course, in many or most cases the declaration and the reference to an external parameter entity will occur in the same resource.
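For illustration, this is exactly the resolution algorithm of RFC 3986 §5 as exposed by Python’s urllib.parse.urljoin (the URIs here are hypothetical, invented for the example):

```python
from urllib.parse import urljoin

# Hypothetical URIs: suppose the declaration using the relative
# reference "chapters.dtd" occurs in a DTD retrieved from this base URI.
base = "http://example.com/dtds/doc.dtd"

# RFC 3986 resolution replaces the last path segment of the base URI:
resolved = urljoin(base, "chapters.dtd")
print(resolved)  # http://example.com/dtds/chapters.dtd
```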

I should qualify my statement; this is what I believe to be correct and natural practice, and what I believe to be implied by RFC 3986. I have occasionally encountered software which behaved differently; I have put it down to bugs, but it may mean that some developers of XML software interpret the base-URI rules of RFC 3986 differently. And one could also argue that use is not the issue; the base URI to be used is the URI of the resource within which the relative reference physically occurs; in this case it amounts to the same thing.

I’m not sure, however, what ought to happen if we add a level of indirection. Suppose …

  • DTD file a.dtd contains the declaration <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "chapters.dtd">'> (not a declaration of the parameter entity chapters, but the declaration of a parameter entity containing the declaration of that parameter entity).
  • DTD file b.dtd contains a parameter entity reference to %chapdecl; (and thus, logically, it is this DTD file that contains the actual declaration of chapters and the actual use of the relative reference).
  • DTD file c.dtd contains a reference to %chapters;.

Which DTD file should provide the base URI for resolving the relative reference? I think the declaration/reference logic rules out the third candidate. If we say that we should take the base URI from the entity in which the relative reference was used, and that the relative reference is used when the parameter entity chapters is declared, then the second choice (b.dtd) seems to have the strongest claim. If we say that we should take the base URI from the entity in which the relative reference appears, and that the relative reference here clearly appears in the declaration of the parameter entity chapdecl, then the first choice (a.dtd) seems to have the strongest claim.

I am sometimes tempted by the one and sometimes by the other. The logic that argues for a.dtd has, however, a serious problem: the declaration of chapdecl might just as easily look like this: <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "%fn_pt1;%fn_pt2;.%fn_ext;">'>, with the relative reference composed of references to three parameter entities each perhaps declared in a different resource with a different URI. Where does the relative reference “appear” in that case? So on the whole the second choice (b.dtd) seems the right one. But my code actually chooses a.dtd for the base URI in this case: each parameter entity carries a property that indicates what base URI to use for relative references that occur within the text of that parameter entity, and in the case of internal entities like chapdecl the base URI is inherited from the entity on top of the entity stack when the declaration is read. Here, the top of the stack is chapdecl, which as an internal entity inherits its base-URI-for-children property from a.dtd. Changing to use the base URI of the resource within which the entity declaration appears (to get b.dtd as the answer) would require adding new code to delay the calculation of the base URI: possible, but fiddly, easy to get wrong, and not my top priority. (I am grateful that in fact these cases never appear in the DTDs I care about, though if they did I might have intuitions about what behavior to regard as correct.)


A similar complication arises when we wish to follow the advice of some commentators on the W3C system team’s blog post on excessive DTD traffic and provide an HTTP_REFERER value that indicates the source of the URI from which we are retrieving a DTD. In the case given above, which file should be listed as the source of the reference? Is it a.dtd, b.dtd, or c.dtd?

It may be that answers to this are laid out in a spec somewhere. But … where?

Posted in XML