Notes on Steve Pepper’s Italian Opera topic map

[30 November 2009]

As mentioned recently, I have been spending a lot of time thinking about topic maps. And, as not infrequently happens, thinking about one thing has taught me something interesting about something else.

Concretely, as a way of learning more about the technology and the way things work, I have spent some rewarding time studying a topic map first put together some time ago by Steve Pepper (then with Ontopia), on the subject of Italian opera, and still actively maintained by him. I’m not sure whether the Italian opera topic map has a canonical or master location, but it can be found in several places on the web, including the Ontopedia web site, which Steve appears to maintain together with Dmitry Bogachev (where it can be inspected using the Ontopia Omnigator), and the Topic Maps Lab web site (where it is made accessible through Maiana, the new ‘social topic maps’ tool offered as a service by the Topic Maps Lab in Leipzig). It also ships as one of the examples in any distribution of the Ontopia software, so if you install Ontopia you will have a copy on your disk.

Working with the Italian opera topic map has given me occasion to think about a couple of principles applicable far beyond topic maps.

Keeping examples real

Steve Pepper’s topic map has the great virtue that it’s small enough to be perspicuous, but large enough and realistic enough to be interesting and fun to read through and to illustrate some interesting problems for modeling. (An opera has one composer, right? And one librettist — no, wait, scratch that, one or more librettists. And hang on, we may have more than one composer involved: Puccini died before finishing Turandot; it was completed by Franco Alfano. On the other hand, there is always a single composer given credit for the opera as a whole. Or at least there appears to be, in this body of material. So: one composer who is The composer, and possibly another who completes the work. Or are there any works where more than one hand is involved in the completion? [A. Not in this collection of operas, no.] Etc., etc.)

Also, while for simplicity the information is selective and incomplete, it’s real.

This turns out to matter. I have sometimes read discussions of database design where examples were given using bibliographic data so ludicrously oversimplified that they distracted me from the point being made. How can anyone take seriously a discussion in which a database schema consisting of author, title, and date is offered as a plausible representation even for very simple bibliographic data? Or in which we make the simplifying assumption that suppliers never ever have locations in more than one city?

The Italian Opera topic map is certainly simplified vis-a-vis reality, or even the full complement of information found in opera lexica. But it works with real data, and it takes the data seriously, and that makes it very satisfying to work with.

XQuery for data exploration

Both Maiana and Omnigator make it easy to click around in the topic map, passing from a page about the topic type Opera to a page about the specific opera Don Carlos, to the Paris Opera (where it had its premiere), to Verdi’s Jerusalem (also premiered there), to the role of the Emir of Ramla, which is (predictably) a bass, to a page about the topic of basses, with a list of all the bass roles in the topic map, to the role of Mefistofele, to … you get the idea. Unfortunately, neither of them makes it as easy as one might wish to get the kind of design overview I have been looking for. They make an effort, of course. Some of my questions are easily answered.

For example: What kinds of entities are represented by topics in the topic map? This one is easy, and hard. Easy, because both tools provide a list of topic types; hard (at least I found it so) because there are so many of them, and the list mixes domain-oriented types (Opera, Character, Voice Type) central to the concerns of the topic map with others of peripheral interest (Publisher, Broadcasting company, City, Country, Place), and a few that belong to the meta-level (Subordinate Role Type, Superordinate Role Type).

I found it helpful to export the topic map in an XML form (XTM 2.0 seems to be a widely supported XML syntax for topic maps) and load it into an XQuery engine with an interactive sandbox interface, so that I could get a better sense of how many topics there are of various types. That way, I could focus on learning about how the most important topic types (or at least the most common ones) are represented, and leave the oddball special cases for later (including a few types which have only a few instances and are used to describe things like what syntax the topic map itself is maintained in).
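
The queries involved are nothing fancy. Something along the following lines is enough to produce a frequency table of topic types (a sketch only: the file name is invented, and it assumes that in the XTM 2.0 export each topic points to its type with instanceOf/topicRef elements, which is how I read that vocabulary):

    (: Count topics by type in an XTM 2.0 export.
       The file name is invented, and the reading of instanceOf/topicRef
       as pointing to a topic's type is an assumption on my part. :)
    declare default element namespace "http://www.topicmaps.org/xtm/";

    let $tm := doc("ItalianOpera.xtm")/topicMap
    for $type in distinct-values($tm/topic/instanceOf/topicRef/@href)
    let $n := count($tm/topic[instanceOf/topicRef/@href = $type])
    order by $n descending
    return <topic-type ref="{$type}" count="{$n}"/>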

When trying to understand the structure of some collection of information, there is no substitute for looking at the data. And there is a lot to be said for having a tool to make it easy to look at the data in a lot of different ways. XQuery and XSLT have no peers here.

Some ongoing challenges

It has proven a lot harder to get a good overview of the different kinds of associations between topics, which have always seemed to me one of the key strengths of topic maps.

Like RDF, Topic Maps can describe relationships between things; unlike RDF and like the relational model, Topic Maps can describe n-ary relations without artificially dissolving them into a set of n – 1 binary relations. There seems to me an obvious, very natural, and important relation between the associations in a topic map (which allow the representation of propositions like “Floria Tosca kills Baron Scarpia by stabbing”), the relations in an RDBMS representation of the material, and the n-ary predicates one might use to formulate the propositional content in symbolic logic or in a logic programming language like Prolog.

So I found myself wanting a nice, concise overview of the predicates captured by the topic map: what association types are there, and what roles are involved in each? And what types of things are supposed to play those roles?

Question for the makers of Topic-Map tools: why is this so hard? (Or: what am I missing?)

With sufficient persistence and the help of (a) the schema for the topic map and (b) my trusty XQuery sandbox, I have begun to get an overview of the design of the topic map. If time permits, I may record it in subsequent posts, partly for later use and partly so that people who understand the topic map better than I do can correct my misapprehensions.
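
For the record, the sort of query involved is simple enough. Something like the following produces a rough inventory of association types and the role types that occur in each (again only a sketch, with the same invented file name and the same assumptions about the XTM 2.0 vocabulary as in the query above; it does not, of course, tell me what types of topics are supposed to play those roles):

    (: For each association type, list the role types that occur in
       associations of that type. Same assumptions as in the query above. :)
    declare default element namespace "http://www.topicmaps.org/xtm/";

    let $tm := doc("ItalianOpera.xtm")/topicMap
    for $atype in distinct-values($tm/association/type/topicRef/@href)
    let $assocs := $tm/association[type/topicRef/@href = $atype]
    return
      <association-type ref="{$atype}" count="{count($assocs)}">{
        for $rtype in distinct-values($assocs/role/type/topicRef/@href)
        return <role ref="{$rtype}"/>
      }</association-type>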

The biggest set of open questions remains: how does modeling a collection of information with Topic Maps differ from modeling it using some other approach? Are there things we can do with Topic Maps that we can’t do, or cannot do as easily, with a SQL database? With a fact base in Prolog? With colloquial XML? It might be enlightening to see what the Italian Opera topic map might look like, if we designed a bespoke XML vocabulary for it, or if we poured it into SQL. (I have friends who tell me that SQL is really not suited for the kinds of things you can do with Topic Maps, but so far I haven’t understood what they mean; perhaps a concrete example will make it easier to compare the two.)

What is a character?

[25 November 2009]

The other day I posted about a proposal to use Wikipedia as a rough-and-ready guide to the ontology of most public entities. I’ve been thinking about it, and wondering why it felt somehow familiar.

Eventually, I decided that the proposal reminds me of the way in which some people (including me) came to dispose of the thorny question of what to count as a ‘character’ when analysing writing systems. (For example: when are e and é to be regarded as the same character, and when as two? Or is the latter a sequence of two characters, e and an acute accent?) The answer some people eventually converged upon is simple:

For virtually all engineering purposes, treat something as a character if and only if there is a code point for it in the Universal Character Set (UCS) defined by Unicode and ISO 10646.

Some exceptions may need to be made in principle for the private use area, or for particular special cases. But unless you and your project are the kind of people who actually run into, or identify, special cases related to subtle issues in the history of the world’s writing systems (which rules out 99.999% of the world’s population, and at least 50% of the readers of this blog), you don’t need to worry about exceptions.

The reasoning is not that the Unicode Consortium and SC 2 got the answers right. On the contrary, any reasonable observer will agree that they got some of them wrong. Many members of the relevant committees will agree. (The UCS answers, for example, that é is BOTH a single character and a sequence of two characters. Thank you very much; may I have another drink now?) It’s not likely, of course, that any two reasonable observers will agree on which questions the UCS gets right, and which it gets wrong.

But some questions are just hard to answer in a universally satisfactory way; if you decide for yourself what counts as a character and what does not, your answers may differ in detail from those enshrined in the UCS, but they will not be more persuasive on the whole: there are no answers to that question that are persuasive in every regard.

The definition of ‘character’ embodied in the UCS is as good an answer to the question “What is a character?” as we as a community are going to get, and for those for whom that question is incidental to other more important concerns, it’s far better to accept that answer and move on than to try to provide a better one.

If the question is not incidental, but central to your concerns (if, for example, you are a historian of writing systems, or of a particular writing system), then a standardized answer is not much use to you, except perhaps as an object of study.

Hmm. Perhaps one of the main purposes of standardization is to allow us to ignore things we are not particularly interested in, at the moment? Is the purpose of standards to make things boring and ignorable?

That could affect whether we think it’s a good idea to adopt such a de facto standard for ontology, or whether we think such standardization is just one step along a slippery slope with thought police at the bottom.

Gavagai and topic maps

[23 November 2009]

In Leipzig earlier this month, at TMRA 2009, there was (understandably, in light of the conference motto “Linked Topic Maps”) a certain amount of discussion of using public subject identifiers for topics, to increase the likelihood that information collected by different people could be merged usefully, without preconcertation by the parties involved. In principle, this raises all sorts of thorny questions about ontology (of both the philosophical and the engineering kinds). In practice, however, people would like to be able to build systems and share data without waiting for the philosophers and engineers of the world to agree on an answer to the question “What exists?”, and they seem to be willing to risk making a mistake or two along the way.

At some point (I’m now a bit hazy about when this happened), someone mentioned a proposal made by Vegard Sandvold, an interesting sort of rough-and-ready approach to the problem: use Wikipedia (or dbpedia) as a sort of barometer of a reasonably democratic consensus on the question.

Steve Pepper has responded that one problem with that (among others of no concern here) is that if he wants to say something about (for example) the International Socialists organization in the UK (founded in 1962), a restriction to Wikipedia as a source of public subject identifiers would make it impossible: Wikipedia redirects from International Socialists (UK) to a page about the Socialist Workers Party (Britain). This organization, founded in 1977, is really not the same thing, said Steve (even if Wikipedia says the SWP was formed by renaming the IS, which suggests a continuity of essential identity).

Steve appears to argue that there is a simple fact of the matter in this case, and that treating the International Socialists and the Socialist Workers Party as the same is simply wrong. He may be right. But (without wanting to argue the ins and outs of this particular case) the example seems to me to be the kind of question which does not always have a determinate answer. It illustrates nicely a kind of ontological indeterminacy which necessarily haunts both RDF and Topic Maps and shows why accounts of how those technologies can combine information from multiple sources can strike careful readers as naive.

Within a given discourse, for a given purpose, we may decide to analyse a particular portion of reality as consisting of several distinct entities which succeed each other in time; in another discourse, for another purpose, we may choose to analyse that same portion of reality as consisting of a single entity which undergoes some changes of (accidental) properties. Concretely: sometimes we want to treat the International Socialists as distinct from the Socialist Workers Party, and sometimes we want to treat them as two names for the same thing. Sometimes I want to treat the journal Computers and the Humanities and the journal Language resources and evaluation as two quite distinct journals; at other times, I want to treat them as the same journal, published under one title through 2004 and then under a new title. There is not a simple fact of the matter; it depends on what, exactly, I want to refer to when I refer to things.

It reminds me of W.V.O. Quine’s question: if a field linguist sees a rabbit run by and hears a native informant say “gavagai”, how is the linguist to determine whether this utterance means ‘Look, a rabbit!’ or ‘Food!’ or ‘Undetached rabbit-parts’? Quine talked about the indeterminacy of translation, suggesting that he’s concerned essentially with the problem of understanding what someone else really means by something (oddly, he seems to think the difficulty arises primarily with foreign languages, which seems to me optimistic), but I think the difficulty reflects the fact that the field linguist is likely to confront an analysis of reality that does not match his own. This happens to the rest of us, too, not just to field linguists.

When I say the phrase “International Socialists (UK)”, can you reliably determine whether I intend to denote an organization which ceased to exist in 1977, or an organization which continues to exist today?

You can ask me, of course, and maybe I’ll answer. And if I answer, maybe I’ll tell you the truth. But if you don’t have a chance to ask me?

Topic maps on my mind

[20 November 2009, additional pointer 23 November]

I spent last week in Leipzig, attending the Topic Maps Research and Applications 2009 conference organized by the Topic Maps Lab at the university there. And since I returned, I have found myself spending a lot of time thinking about topic maps. (Enough that I really need to stop for a bit and get back to other work.)

I’ve written a short trip report on the conference, which can be read in the archives of the Topic Maps in LIS list run by Kevin Trainor. [23 November. The Topic Maps Lab has posted a version of the trip report that may be easier to read. Many thanks to them.] It doesn’t really do the conference justice, but perhaps it’s better than nothing. Other trip reports are [also] pointed to from the TM Lab page.

The short version of my report is: “Gosh, that was fun! And wow! is Leipzig ever worth a visit!”

The joy of testing

[5 November 2009, some additions 6 November 2009]

I’m using Jeni Tennison’s xspec to develop tests for a simple stylesheet I’m writing. An xspec test takes the form of a scenario along the lines of:

  • When you match a foo element, do this.
  • When you call function bar with these arguments, expect this result.
  • When you call this named template, expect this result.
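
Concretely, a test file looks something like the following (a sketch only, not a transcription of my actual tests: the stylesheet name, the my namespace, and the function are invented, and the details of the vocabulary should be checked against the xspec documentation rather than against my memory):

    <!-- A minimal xspec sketch; stylesheet, namespace, and function
         names are made up for illustration. -->
    <x:description xmlns:x="http://www.jenitennison.com/xslt/xspec"
                   xmlns:my="http://example.org/my"
                   stylesheet="my-stylesheet.xsl">

      <!-- when you match a foo element, do this -->
      <x:scenario label="when matching a foo element">
        <x:context>
          <foo>some text</foo>
        </x:context>
        <x:expect label="it is wrapped in a bar element">
          <bar>some text</bar>
        </x:expect>
      </x:scenario>

      <!-- when you call a function with these arguments, expect this result -->
      <x:scenario label="when calling my:double with 2">
        <x:call function="my:double">
          <x:param select="2"/>
        </x:call>
        <x:expect label="the result is 4" select="4"/>
      </x:scenario>

    </x:description>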

It’s a relatively young project, and the documentation is perhaps best described as nascent. Working from the documentation (it does exist, which makes for a nice change from some things I work with), I first wrote nine or ten tests to describe the behavior of an existing stylesheet; when I ran the tests against that stylesheet, all of them reported failures, because my formulation of the expected results violated various silent assumptions of the xspec code. That might indicate opportunities for making the xspec documentation more informative. I’ve spent an enjoyable hour or two this evening, however, looking at the xspec code and figuring out how my test cases are confusing it, reformulating them, and watching the bars of red in the test report change, one by one, to green. It’s nice to have a visible sign of forward progress.

There are other XSLT test frameworks that I haven’t tried, so I can’t compare xspec to any of them. But I can say this: if you are developing XSLT stylesheets and aren’t using any of the available test frameworks, you really ought to look into xspec.

A helpful page about XSLT testing is maintained by Tony Graham of Menteith Consulting. If xspec doesn’t work out for you, check out the other frameworks he lists there.