Dumbing grammars down, cont’d

[30 May 2009; diagrams added 31 May 2009]

Last April I posted an entry on the possibility of systematically producing regular languages which are subsets or supersets of a context-free language. The process I outlined involves creating a finite state automaton, augmented with counters and guards and whatnot (or equivalently, creating an attributed regular grammar with attributes and semantic conditions). My description of that process involved a certain amount of hand-waving, because I knew intuitively how to build such an automaton by hand (at least for simple cases), but had not yet figured out a good way to describe it more precisely.

I’m still not there yet, but I’ve made some progress today. Recently I spent some relaxing hours reading about Earley parsers and implementing an Earley parser. And today it occurred to me: a simpler way to make the attributed regular grammar / augmented FSA I need is to make one state for each position in each rule of the grammar, so that each state in the automaton (each non-terminal in the attributed regular grammar) corresponds to the rule-and-position part of an Earley item. (The Earley item also keeps track of the starting position in the input of the non-terminal whose rule is being recognized; I don’t need that for the FSA.)

So I can now give a slightly simpler account of the FSA construction.

  1. Let the states of the automaton be the positions in the grammar. (Each rule of length n has n+1 positions.)
  2. For each position where the rule has a terminal symbol, include an arc from that position to the next position in that rule, labeled with the terminal symbol.
  3. For each position p where the rule has a non-terminal symbol N, (a) add arcs labeled with the empty string from p to the first position of each rule in the grammar where N appears on the left-hand side, and (b) add arcs labeled with the empty string from the final positions in each rule for N to the next position after p.
  4. Add a new starting state; add arcs labeled with the empty string from the starting state to the first positions of the rules for the start symbol of the grammar.
  5. The accepting states of the automaton are the final positions of the rules for the grammar’s start symbol.

I have not yet attempted anything like a proof, but I believe that the automaton thus constructed recognizes what my earlier post called the smallest regular superlanguage of the language you started with. That is, every sentence of the original language will be recognized by the automaton, and some non-sentences will also be recognized.

The set of sentences recognized by the automaton can be characterized simply, by reference to the pumping lemma for context-free languages. In simple terms, this lemma says that for long enough sentences in a language, the sentence s can be partitioned into subsequences u, v, w, x, and y, such that s is the concatenation uvwxy, and such that for any positive integer n, the string uvnwxny is also in the language. The automaton recognizes the slightly larger set of sentences uvnwxmy for positive integers n and m. That is to say, it does not ensure (for example) that closing parens match opening parens in number or kind. But otherwise it captures the language.

To take a very simple example, consider the grammar:

  • s ::= ‘a’
  • s ::= ‘(‘ s ‘)’
  • s ::= ‘[‘ s ‘]’

The automaton has ten states, which can be represented in the style of Earley items:

  1. s ::= · ‘a’
  2. s ::= ‘a’ ·
  3. s ::= · ‘(‘ s ‘)’
  4. s ::= ‘(‘ · s ‘)’
  5. s ::= ‘(‘ s · ‘)’
  6. s ::= ‘(‘ s ‘)’ ·
  7. s ::= · ‘[‘ s ‘]’
  8. s ::= ‘[‘ · s ‘]’
  9. s ::= ‘[‘ s · ‘]’
  10. s ::= ‘[‘ s ‘]’ ·

Plus a new starting state I’ll call 0.

Arcs:

  • 0: λ → 1
  • 0: λ → 3
  • 0: λ → 7
  • 1: “a” → 2
  • 2: λ → 5
  • 2: λ → 9
  • 3: “(” → 4
  • 4: λ → 1
  • 4: λ → 3
  • 4: λ → 7
  • 5: “)” → 6
  • 6: λ → 5
  • 6: λ → 9
  • 7: “[” → 8
  • 8: λ → 1
  • 8: λ → 3
  • 8: λ → 7
  • 9: “]” → 10
  • 10: λ → 5
  • 10: λ → 9

Accepting states are 2, 6, and 10.

Or in pictures:

Drawing of the finite-state automaton described in the text, with circles for the states and arrows for the transitions.

Eliminating epsilon transitions and minimizing the automaton gives us a smaller one with two states, 1 (the start state) and 2 (the accepting state), with transitions:

  • 1: “a” → 2
  • 1: “(” → 1
  • 1: “[” → 1
  • 2: “)” → 2
  • 2: “]” → 2

Drawing of the two-state finite-state automaton described in the text, with circles for the states and arrows for the transitions.

The regular expression “[\[(]*a[\])]*” provides a compact equivalent. As you can see, it accepts a superset of the example language, but it is the smallest regular superset and does impose at least some of the constraints of the original grammar.

Unfinished business (anyone who needs a puzzle to solve can feel free to take these off my hands):

  • Prove that the FSA whose construction is described above accepts a superset of the language accepted by the grammar from which it’s constructed.
  • Describe (with proof of correctness) what augmentations to the mechanism would suffice to use this automaton to parse the original grammar. (Hint: my earlier conjecture that counters would suffice is not true; the example grammar illustrates why.)
  • The augmented FSA that correctly recognizes the original language is, presumably, a re-invention of some well known parsing technique for context-free languages; which?
  • Prove that the language recognized by the FSA is essentially the language uvnwxmy for every uvwxy decomposition of sentences in the original language.
  • Can the largest practicable regular sublanguage be constructed using this FSA?
  • Earley parsing can, I think, be described in terms of this automaton, in the sense that the actions of an Earley parser can be translated into state changes in the automaton (with an external control that prevents making an illegal move). Can other parsing techniques also be?

Sustainability, succession plans, and PURLs — Burial societies for libraries?

[24 May 2009]

At the Summer Institute on Data Curation in the Humanities (SIDCH) this past week in Urbana (see previous post), Dorothea Salo surveyed a variety of threats to the longevity of humanities data, including lack or loss of institutional commitment, and/or death (failure) of the institution housing the data. People serious about maintaining data accessible for long periods need to make succession plans: what happens to the extensive collection of digital data held by the XYZ State University’s Institute for the History of Pataphysical Research when the state legislature finally notices its existence and writes into next year’s budget a rule forbidding university administrators to fund it in any year which in the Gregorian calendar is either (a) a leap year or (b) not a leap year, and (c) requiring the adminstrators to wash their mouths out with soap for having ever funded the Intitute in the first place?

Enough centers for computing in the humanities have been born in the last fifty years, flourished some years, and later died, that I can assure the reader that the prospect of going out of existence should concern not only institutes for the history of pataphysics, but all of us.

It’s good if valuable data held by an organization can survive its end; from the point of view of URI persistence it would be even better if the URL used to refer to the data didn’t have to change either.

I have found myself thinking, the last couple of days, about a possible method of mitigating this threat, that runs something like this.

  • A collection of reasonably like-minded organizations (or individuals) forms a mutual assistance pact for the preservation of data and URIs.
  • The group sets up and runs a PURL server, to provide persistent URLs for the data held by members of the group. [Alternate approach: they all agree to use OCLC’s PURL server.]
  • Using whatever mechanism they choose, the members of the group arrange to mirror each other’s data in some convenient way. Some people will use rsync or similar tools; Dorothea Salo observed that LOCKSS software can also do this kind of job with very low cost in time and effort.
  • If any of the partners in the mutual assistance pact lose their funding or go out of existence for other reasons, the survivors agree on who will serve the decedent’s data. The PURL resolution tables are updated to point to the new location.
  • Some time before the count of partners is down to one, remaining partners recruit new members. (Once the count hits zero, the system has failed.)

    [Wendell Piez observed, when we got to this point of our discussion of this idea, “There’s a Borges story in that, just waiting for someone to write it.” I won’t be surprised if Enrique is working on one even as I write.]

In some cases, people will not want to use PURLs, because when they make available the kind of resources whose longevity is most obviously desirable, the domain name in the URLs performs a sort of marketing or public awareness function for their organization. I suppose one could allow the use of non-PURL domains, if the members of the group can arrange to ensure that upon the demise of an individual member the ownership of their domains passes seamlessly to some other member of the group, or to to the group as a whole. But this works only for domain owners, and only if you can find a way to ensure the orderly transfer of domain ownership. Steve Newcomb, my colleague in the organizing committee for the Balisage conference on the theory and practice of markup, points out a difficulty here: in cases of bankruptcy, the domain name may be regarded as an asset and it may therefore be impossible to transfer it to the other members of the mutual assistance society.

It’s a little bit like the burial societies formed by immigrants in a strange land for mutual assurance, to ensure that when one dies, there will be someone to give them a decent burial according to the customs of their common ancestral homeland, so that their souls will not wander the earth restlessly in perpetuity.

It would be nice to think that the data collections we create will have a home in the future, lest their ghosts wander the earth restlessly, bewailing their untimely demise, accusing us of carelessness and negligence in letting them die, and haunting us in perpetuity.

Data curation for the humanities

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation in the Humanities (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana/Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Craigin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown Univerisity Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result, when data are migrated from one representation to another. This certainly works in theory, and almost certainly it will work in practice, although the tools still need to be built to test the idea in practce. When I showed thd screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project (the name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curations; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

VC-Filter

[6 May 2009]

Short version: I’ve made a new toy for playing with one aspect of XSD 1.1, namely the conditional inclusion of elements in schema documents, controlled by attributes in the vc (version control) namespace. The experience has reminded me of reasons to like the vc:* design and of reasons to like (and hate) Javascript in the browser.

Continue reading