Schmidt on networking

[26 October 2008]

I’ve recently set up a LinkedIn profile for myself, and I’ve been assiduously searching LinkedIn for old and current friends and colleagues and ‘building my network’ — so I was rather struck by the following slightly sobering remarks in Helmut Schmidt’s recent book Außer Dienst, which I picked up in Germany a couple of weeks ago and have been reading at odd moments on airplanes:

Den Ausdruck Netzwerk hat es zu meiner Zeit noch nicht gegeben. Aber natürlich hat man vielfältige persönliche Kontakte geknüpft und sie langfristig aufrechterhalten. Wer sich gegen seine Zeitgenossen abschließt, hat es schwerer, zu abgewogenen Urteilen zu gelangen, als einer, der sich öffnet und Kontakt und Austausch sucht. … Mir wollen weniger jene Netzwerke wichtig erscheinen, welche einem bestimmten Interesse oder der eigenen Karriere dienen, als vielmehr solche, die der geistigen Anregung und dem gedanklichen Austausch förderlich sind.

Or (my translation):

In my day the expression ‘network’ did not yet exist. But of course people made a lot of personal contacts and kept them up over long periods. Anyone who is closed off from contact with their contemporaries will find it harder to reach well-founded judgements than someone who is open to new contacts and to the exchange of views. … Networks of contacts which serve only a particular interest or which are made only in the interest of one’s career seem to me less important than contacts which serve to provide intellectual stimulation and promote the exchange of thoughts.

Of course, Schmidt has things like the Mittwochsgesellschaft (and a Freitagsgesellschaft in Hamburg) in mind, which set an awfully high bar. But still — it makes me wonder not just how well LinkedIn and other social networking sites measure up to these high standards, but how one might use them to pursue the same kinds of intellectual cross-fertilization and mutual education Schmidt describes.

… I’ve been wandering late … (travels, part 2)

[26 October 2008]

This is the second in a series of posts recording some of my impressions from recent travels.

After the XSLT meetings described in the previous post, and then a week at home, during which I was distracted by events that don’t need to be described here, I left again in early October for Europe. During the first half of last week [week before last, now], I was in Mannheim attending a workshop organized by the electronic publications working group of the Union of German Academies of Science. Most of the projects represented were dictionaries of one stripe or another, many of them historical dictionaries (the Thesaurus Linguae Latinae, the Dictionnaire Etymologique de l’Ancien Français, the Deutsches Rechtswörterbuch, the Qumran-Wörterbuch, the Wörterbuch der deutschen Winzersprache, both an Althochdeutsches Wörterbuch and an Althochdeutsches Etymologisches Wörterbuch, a whole set of dialect dictionaries, and others too numerous to name).

Some of the projects are making well-planned, effective use of information technology (the Qumran dictionary in Göttingen sticks particularly in my mind), but many suffer from the huge weight of a paper legacy, or from short-sighted decisions made years ago. I’m sure it seemed like a good idea at the time to standardize on Word 6, and to build the project work flow around a set of Word 6 macros which are thought not to run properly in Word 7 or later versions of Word, and which were built by a clever participant in the project who is now long gone, leaving no one who can maintain or recreate them. But however good an idea it seemed at the time, it was in reality a foolish decision for which the project is now paying a price (being stuck in outdated software, without the ability to upgrade, and with increasing difficulty finding support), and for which the academy sponsoring the project, and the future users of the work product, will continue paying for many years to come.

I gave a lecture Monday evening under the title “Standards in der geisteswissenschaftlichen Textdatenverarbeitung: Über die Zukunftssicherung von Sprachdaten” (roughly: “Standards in humanities text processing: on securing the future of language data”), in which I argued that the IT practice of projects involved with the preservation of our common cultural heritage must attend to a variety of problems that can make their work inaccessible to posterity.

The consistent use of suitably developed and carefully chosen open standards is by far the best way to ensure that the data and tools we create today can still be used by the intended beneficiaries in the future. I ended with a plea for people with suitable backgrounds in the history of language and culture to participate in standardization work, to ensure that the standards developed at W3C or elsewhere provide suitable support for the cultural heritage. The main problem, of course, is that the academy projects are already stretched thin and have no resources to spare for extra work. But I hope that the academies will realize that they have a role to play here, which is directly related to their mission.

It’s best, of course, if appropriate bodies join W3C as members and provide people to serve in Working Groups. More universities, more academies, more user organizations need to join and help guide the Web. (Boeing and Chevron and other user organizations within W3C do a lot for all users of the Web, but there is only so much they can accomplish as single members; they need reinforcements!) But even if an organization does not or cannot join W3C, an individual can have an impact by commenting on draft W3C specifications. All W3C working groups are required by the W3C development process to respond formally to comments received on Last-Call drafts, and to make a good-faith effort to satisfy the originator of the comment, either by doing as suggested or by providing a persuasive rationale for not doing so. (Maybe it’s not a good idea after all, maybe it’s a good idea but conflicts with other good ideas, etc.) It is not unknown for individuals outside the working group (and outside W3C entirely) to have a significant impact on a spec just by commenting on it systematically.

Whether anyone in the academies will take the invitation to heart remains to be seen, though at least a couple of people asked hesitantly after the lecture how much membership dues actually run. So maybe someday …

I’ve been wandering early …

[21 October 2008]

I’ve been traveling a good deal lately.

This is the first in a series of posts recording some of my impressions.

In late September the XSL Working Group held a one-week meeting in Prague to work on revisions of XSLT to make it easier to support streaming transformations. By streaming (a small sketch after the following list illustrates these properties), the WG means:

  • The transformation can run in memory independent of document size (sometimes constant memory, sometimes memory proportional to document depth, sometimes memory proportional to the size of discrete windows of data in the document).
  • The transformation can begin delivering results before all of the input is available (e.g. can work on so-called ‘infinite’ XML documents like streams of stock quotations).
  • The transformation can be performed in a single pass over the document.
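
To make those properties concrete, here is a minimal sketch of my own, not anything the WG has specified: a single-pass transformation in Python, using the standard library’s xml.sax, which renames elements as the parse events arrive. Output begins before the input is complete, and nothing is buffered, so memory use does not grow with document size; the element name ‘quote’ is invented for the example.

    # Event-driven, single-pass transformation: rename (hypothetical)
    # 'quote' elements to 'tick' and copy everything else through.
    import sys
    import xml.sax
    from xml.sax.saxutils import escape, quoteattr

    class RenameQuotes(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            out = 'tick' if name == 'quote' else name
            attr_text = ''.join(' %s=%s' % (k, quoteattr(v))
                                for k, v in attrs.items())
            sys.stdout.write('<%s%s>' % (out, attr_text))
        def endElement(self, name):
            sys.stdout.write('</%s>' % ('tick' if name == 'quote' else name))
        def characters(self, content):
            sys.stdout.write(escape(content))

    if __name__ == '__main__':
        xml.sax.parse(sys.stdin.buffer, RenameQuotes())

Fed an unbounded stream of stock quotations on standard input, a handler like this would emit results incrementally, which is just the second bullet’s point.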

It turns out that for different use cases it can be necessary or useful to:

  • declare a particular input document as streamable / to be streamed if possible
  • declare a particular set of templates as streamable
  • declare that particular parts of the document need to be available in full (buffered for random access) for part of the transform, but can then be discarded (e.g. for windowing use cases; the sketch after this list shows the pattern)
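
The windowing case maps naturally onto a familiar coding pattern: build each window’s subtree in full, work on it with random access, then throw it away. Here is another sketch of mine, in Python with ElementTree’s iterparse, rather than any syntax the WG has adopted; the ‘record’ and ‘value’ element names are invented:

    # Each <record> is buffered whole (random access within the window),
    # summarized, and then cleared, so memory is proportional to the
    # size of a window, not the size of the document.
    import sys
    import xml.etree.ElementTree as ET

    def summarize(stream):
        for event, elem in ET.iterparse(stream, events=('end',)):
            if elem.tag == 'record':
                values = [float(v.text) for v in elem.findall('value')]
                if values:
                    print(elem.get('id'), max(values) - min(values))
                elem.clear()   # discard the window once it has been used

    if __name__ == '__main__':
        summarize(sys.stdin.buffer)

(A more careful version would also detach each cleared element from its parent, since empty shells otherwise accumulate under the root.)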

Some members of the WG may have been apprehensive at the thought of five straight days of WG discussions. Would we have enough to do, or would we run out of ideas on Tuesday and spend Wednesday through Friday staring at the floor in embarrassment while the chair urged us to get some work done? (If no one else harbored these fears, I certainly did.) But in fact we had a lively discussion straight through to the end of the week, and made what I think was good progress toward concrete proposals for the spec.

Among the technical ideas with the most legs is (I think) the idea that sometimes what you want to do with a particular node in the input tree can actually be done with a partial copy of the input tree, and that different kinds of partial copy may be appropriate in different situations.

If you perform a deep copy of an element node in an XDM data model instance, for example, you have access to the entire subtree rooted at that node, but not to any of its siblings or ancestors, nor to anything else in the tree from which it came. For cases where you wish to (or must) allow access to the subtree rooted in a node, but to nothing else, such a deep copy is ideal: it preserves the information you want to have available, and it makes all other information inaccessible. (This is essentially the way that XSD 1.1 restricts assertions to the subtree rooted in a given node: logically speaking the assertions are evaluated against a copy of the node, not against the node itself.)

Several kinds of copy can be distinguished. In the terminology of the XSL Working Group (using terms introduced mostly by Michael Kay and Mohamed Zergaoui; a small sketch after the list makes some of these concrete):

  • Y-copy: contains the subtree rooted in the node being copied, and also all of its ancestor nodes and their attributes, but none of their siblings. It is thus shaped like an upside-down uppercase Y.
  • Nabla-copy: contains just the subtree rooted in the node being copied. It is thus shaped like an upside-down nabla. (Yes, also like a right-side-up delta, but since the Y-copy requires the Y to be inverted, we say nabla copy not delta copy. Besides, a delta copy would sound more like something used in change management.)
  • Dot-copy: contains just the node being copied, itself, and its attributes if any.
  • Yen-copy: like a Y-copy, but includes the siblings of each ancestor together with their attributes (although not their children).
  • Spanish exclamation-point copy: contains just the node being copied, and its ancestors, together with their attributes. Shaped like an exclamation point (dot, with something above it), or like an upside-down Spanish exclamation point.
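
To make the taxonomy concrete, here is a toy implementation of four of the five copies (the Yen-copy, which adds each ancestor’s siblings, is omitted). This is my own sketch in Python, not anything normative from the WG; the point is that each copy preserves exactly the information its definition names, with ancestors reachable through parent pointers.

    # Toy node type; each copy carries the attributes of every node it
    # includes, and ancestors are reachable via the parent pointer.
    class Node:
        def __init__(self, name, attrs=None, children=()):
            self.name, self.attrs = name, dict(attrs or {})
            self.children = list(children)
            self.parent = None
            for c in self.children:
                c.parent = self

    def dot_copy(n):
        """Dot-copy: just the node itself and its attributes."""
        return Node(n.name, n.attrs)

    def nabla_copy(n):
        """Nabla-copy: the whole subtree rooted in n; no ancestors, no siblings."""
        return Node(n.name, n.attrs, [nabla_copy(c) for c in n.children])

    def _with_ancestors(orig, copy):
        """Rebuild the chain of ancestors (with attributes) above a copied node."""
        cp = copy
        while orig.parent is not None:
            cp = Node(orig.parent.name, orig.parent.attrs, [cp])
            orig = orig.parent
        return copy   # the copy of the original node; ancestors via .parent

    def y_copy(n):
        """Y-copy: the subtree rooted in n, plus its ancestors and their attributes."""
        return _with_ancestors(n, nabla_copy(n))

    def excl_copy(n):
        """Exclamation-point copy: the node and its ancestors only."""
        return _with_ancestors(n, dot_copy(n))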

I’ve been quite taken recently by one possible application of these ideas outside of the streaming XSLT work. In the current draft, assertions in XSD 1.1 are restricted to / are evaluated against a nabla-copy of the element or attribute being validated, and the conditions used for conditional type assignment are evaluated against a dot-copy of the element. These restrictions are painful, especially the latter, since it makes it impossible to select the type of an element depending on its xml:lang value (which is inherited from an ancestor if not specified locally). But XSD 1.1 could readily support conditions on the nearest value of xml:lang if conditions were evaluated on a Spanish-exclamation-point copy, instead of a dot-copy, of the element in question. I don’t know whether the XML Schema WG will buy this approach, but the possibility does suggest that there is value in thinking about things in terms of the invariants preserved by different kinds of node-copying operations.
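
Continuing the sketch above (this snippet reuses Node, excl_copy, and dot_copy from there), here is the xml:lang argument in miniature: the nearest inherited value survives an exclamation-point copy, but a dot-copy loses it.

    # xml:lang is inherited from the nearest ancestor that declares it.
    def nearest_lang(n):
        while n is not None:
            if 'xml:lang' in n.attrs:
                return n.attrs['xml:lang']
            n = n.parent
        return None

    doc = Node('text', {'xml:lang': 'de'}, [Node('p')])
    para = doc.children[0]
    print(nearest_lang(excl_copy(para)))   # 'de': the ancestor chain survives
    print(nearest_lang(dot_copy(para)))    # None: the copy has no ancestors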

Another way to define and validate overlapping structures

[17 October 2008]

At a meeting in Edinburgh organized by the e-Science center here, I encountered for the first time a piece of work done some years ago by Siu-wai Leung and others (Siu-wai Leung, Chris Mellish, and Dave Robertson, “Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences,” Bioinformatics 17.3 [2001]: 226-236). The ‘Basic Gene Grammar’ formalism introduced by Leung distinguishes several kinds of arrows separating the left-hand and right-hand sides of grammar rules, so:

A ---> B, C, D

means, just as in conventional grammar notations, that the lexical form of the non-terminal A is the concatenation of lexical forms of B, C, and D, in order. The rules

X ===> B, C, D
Y <=== B, C, D

mean (if I have correctly understood the paper, which is not a given) that the lexical form of an X and the lexical form of a Y are regions of characters in which a B, a C, and a D occur, but in which the B, the C, and the D may overlap. The arrow ===> means that the B must start before the C, and the C must start before the D; the arrow <=== means the ends of the B, C, and D regions must occur in that order. (I don’t know how to say that both the beginnings and the ends must occur in order, without using additional non-grammatical constraints, or that the order is unconstrained; I’ll have to buttonhole Mr. Leung during a break and ask.)

[Answer: yes, use the mechanism for additional constraints. To say that A, B, and C can occur in any order, use multiple rules.]
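
Stated operationally, my reading of the two arrows comes down to an ordering check over regions. Here is a toy checker in Python, under that possibly mistaken reading; the (start, end) representation of regions is my own invention, and I do not know whether the paper permits ties:

    # A region is a (start, end) pair of offsets into the sequence;
    # regions on the right-hand side of a rule may overlap freely.
    def starts_in_order(regions):
        """X ===> B, C, D : each region starts before the next one."""
        return all(a[0] < b[0] for a, b in zip(regions, regions[1:]))

    def ends_in_order(regions):
        """Y <=== B, C, D : each region ends before the next one."""
        return all(a[1] < b[1] for a, b in zip(regions, regions[1:]))

    # B and C overlap; these regions satisfy both orderings.
    b, c, d = (0, 10), (5, 12), (11, 20)
    print(starts_in_order([b, c, d]))   # True
    print(ends_in_order([b, c, d]))     # True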

This means I now know of three different formalisms for specifying and validating overlapping structures: the Basic Gene Grammars of Leung’s MSc thesis (1993, and the 2001 paper cited above), the Creole notation described by Jeni Tennison at XTech 2007, and the rabbit/duck grammars I described at Extreme Markup Languages 2006. (There are obviously other possibilities for specification and validation of overlap, but I don’t know of worked-out proposals.)

Several questions arise:

  1. What classes of languages (viewed as string sets) do these mechanisms describe?
  2. What classes of languages (viewed as tree sets) do these mechanisms describe?
  3. How do the expressive powers of the different notations relate to each other? Are they the same? Do they overlap? Are there subset/superset relations?
  4. For various common textual phenomena (e.g. page structure and formal structure of the text, or verse structure and dramatic structure of verse drama), what would a definition of the structure look like in each of these formalisms?
  5. Are there obvious correspondences between idioms in the different languages?
  6. Are there differences in the likely or possible performance of parsing / validation algorithms for the different formalisms? (The Basic Gene Grammar mechanism is easy to understand in terms of the underlying chart parser, but it would be nice to have a less expensive method of validation than exhaustive checking of all contiguous subsequences in the input. I guess that for markup applications, the use of explicit start- and end-tags might reduce the expected cost of a naive implementation: bracketed languages are cheaper to parse. But I haven’t worked it out clearly.)

All of the talks at this meeting have been interesting and informative, but even had they not been, learning about this formalism would have made me glad to have come to the meeting.