Notes on Steve Pepper’s Italian Opera topic map

[30 November 2009]

As mentioned recently, I have been spending a lot of time thinking about topic maps. And, as not infrequently happens, thinking about one thing has taught me something interesting about something else.

Concretely, as a way of learning more about the technology and the way things work, I have spent some rewarding time studying a topic map first put together some time ago by Steve Pepper (then with Ontopia), on the subject of Italian opera, and still actively maintained by him. I’m not sure whether the Italian opera topic map has a canonical or master location, but it can be inspected in several places on the web, including the Ontopedia web site, which Steve appears to maintain together with Dmitry Bogachev (where it can be browsed using the Ontopia Omnigator), and the Topic Maps Lab web site (where it is made accessible through Maiana, the new ‘social topic maps’ tool offered as a service by the Topic Maps Lab in Leipzig). It also ships as one of the examples in any distribution of the Ontopia software, so if you install Ontopia you will have a copy on your disk.

Working with the Italian opera topic map has given me occasion to think about a couple of principles applicable far beyond topic maps.

Keeping examples real

Steve Pepper’s topic map has the great virtue that it’s small enough to be perspicuous, but large enough and realistic enough to be interesting and fun to read through and to illustrate some interesting problems for modeling. (An opera has one composer, right? And one librettist — no, wait, scratch that, one or more librettists. And hang on, we may have more than one composer involved: Puccini died before finishing Turandot; it was completed by Franco Alfano. On the other hand, there is always a single composer given credit for the opera as a whole. Or at least there appears to be, in this body of material. So: one composer who is The composer, and possibly another who completes the work. Or are there any works where more than one hand is involved in the completion? [A. Not in this collection of operas, no.] Etc., etc.)
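The modeling decision just described can be sketched as a small data structure. This is only my own illustration, not anything drawn from the topic map’s actual schema; the field names are invented.

```python
from dataclasses import dataclass, field
from typing import Optional

# A minimal sketch of the modeling decision discussed above. Field names
# are my own invention, not taken from the topic map itself.
@dataclass
class Opera:
    title: str
    composer: str                       # exactly one credited composer
    librettists: list[str] = field(default_factory=list)  # one or more
    completed_by: Optional[str] = None  # rare: a second hand finishes the work

turandot = Opera(
    title="Turandot",
    composer="Giacomo Puccini",
    librettists=["Giuseppe Adami", "Renato Simoni"],
    completed_by="Franco Alfano",
)
```

The `completed_by` field stays `None` for almost every opera, which is exactly the kind of asymmetry (one slot always filled, another filled once in a blue moon) that real data forces you to confront.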

Also, while for simplicity the information is selective and incomplete, it’s real.

This turns out to matter. I have sometimes read discussions of database design where examples were given using bibliographic data so ludicrously oversimplified that they distracted me from the point being made. How can anyone take seriously a discussion in which a database schema consisting of author, title, and date is offered as a plausible representation even for very simple bibliographic data? Or in which we make the simplifying assumption that suppliers never ever have locations in more than one city?

The Italian Opera topic map is certainly simplified vis-a-vis reality, or even the full complement of information found in opera lexica. But it works with real data, and it takes the data seriously, and that makes it very satisfying to work with.

XQuery for data exploration

Both Maiana and Omnigator make it easy to click around in the topic map, passing from a page about the topic type Opera to a page about the specific opera Don Carlos, to the Paris Opera (where it had its premiere), to Verdi’s Jerusalem (also premiered there), to the role of the Emir of Ramla, which is (predictably) a bass, to a page about the topic of basses, with a list of all the bass roles in the topic map, to the role of Mefistofele, to … you get the idea. Unfortunately, neither of them makes it as easy as one might wish to get the kind of design overview I have been trying to get. They make an effort, of course. Some of my questions are easily answered.

For example: What kinds of entities are represented by topics in the topic map? This one is easy, and hard. Easy, because both tools provide a list of topic types; hard (at least I found it so) because there are so many of them, and the list mixes domain-oriented types (Opera, Character, Voice Type) central to the concerns of the topic map with others of peripheral interest (Publisher, Broadcasting company, City, Country, Place), and a few that belong to the meta-level (Subordinate Role Type, Superordinate Role Type).

I found it helpful to export the topic map in an XML form (XTM 2.0 seems to be a widely supported XML syntax for topic maps) and load it into an XQuery engine with an interactive sandbox interface, so that I could get a better sense of how many topics there are of various types. That way, I could focus on learning about how the most important topic types (or at least the most common ones) are represented, and leave the oddball special cases for later (including a few types which have only a few instances and are used to describe things like what syntax the topic map itself is maintained in).
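The kind of census I ran in the XQuery sandbox can be sketched in a few lines; here is a rough Python equivalent, self-contained for illustration. It assumes XTM 2.0 markup, in which a topic’s type is given as an <instanceOf> element containing a <topicRef>; the sample data is a tiny invented fragment, not an excerpt from the real topic map.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by XTM 2.0 documents (assumed here).
XTM = "{http://www.topicmaps.org/xtm/}"

# A tiny invented fragment standing in for the exported topic map.
SAMPLE = """<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="puccini"><instanceOf><topicRef href="#composer"/></instanceOf></topic>
  <topic id="verdi"><instanceOf><topicRef href="#composer"/></instanceOf></topic>
  <topic id="tosca"><instanceOf><topicRef href="#opera"/></instanceOf></topic>
</topicMap>"""

def census(xtm_text):
    """Count topics per topic type: the 'how many of what?' question."""
    root = ET.fromstring(xtm_text)
    counts = Counter()
    for topic in root.findall(f"{XTM}topic"):
        for ref in topic.findall(f"{XTM}instanceOf/{XTM}topicRef"):
            counts[ref.get("href").lstrip("#")] += 1
    return counts

print(census(SAMPLE))  # Counter({'composer': 2, 'opera': 1})
```

A one-line census like this is enough to separate the central topic types (hundreds of instances) from the oddball special cases (two or three).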

When trying to understand the structure of some collection of information, there is no substitute for looking at the data. And there is a lot to be said for having a tool to make it easy to look at the data in a lot of different ways. XQuery and XSLT have no peers here.

Some ongoing challenges

It has proven a lot harder to get a good overview of the different kinds of associations between topics, which seem to me to be one of the key strengths of topic maps.

Like RDF, Topic Maps can describe relationships between things; unlike RDF and like the relational model, Topic Maps can describe n-ary relations without artificially dissolving them into a set of n – 1 binary relations. There seems to me an obvious, very natural, and important relation between the associations in a topic map (which allow the representation of propositions like “Floria Tosca kills Baron Scarpia by stabbing”), the relations in an RDBMS representation of the material, and the n-ary predicates one might use to formulate the propositional content in symbolic logic or in a logic programming language like Prolog.
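The contrast can be made concrete with a sketch: the same three-role proposition, first represented whole, then dissolved into binary statements. One common RDF-style decomposition (the one sketched here; there are others) mints a node for the relation itself and emits one binary statement per role. The role names and the event identifier are my own inventions.

```python
# One association, three roles, represented whole, as a topic map would.
killing = {
    "type": "killed-by",
    "roles": {"victim": "Baron Scarpia",
              "perpetrator": "Floria Tosca",
              "method": "stabbing"},
}

# Binary decomposition: mint a node for the event, then one binary
# statement per role. "event-1" is an invented identifier; nothing in
# the proposition itself demands that such a thing exist.
event = "event-1"
triples = [(event, role, player) for role, player in killing["roles"].items()]
print(triples)
```

The invented `event-1` node is precisely the artificiality the topic-map representation avoids: the association needs no identity of its own for the proposition to be stated.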

So I found myself wanting a nice, concise overview of the predicates captured by the topic map: what association types are there, and what roles are involved in each? And what types of things are supposed to play those roles?
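The overview I was after can itself be computed from the XTM export; here is a sketch, again assuming XTM 2.0 markup, in which an association carries a <type> child and <role> children, each typed by a <topicRef>. The sample fragment and its identifiers are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace used by XTM 2.0 documents (assumed here).
XTM = "{http://www.topicmaps.org/xtm/}"

# A tiny invented fragment: one association of type 'composed-by'
# with roles 'work' and 'composer'.
SAMPLE = """<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <association>
    <type><topicRef href="#composed-by"/></type>
    <role><type><topicRef href="#work"/></type><topicRef href="#tosca"/></role>
    <role><type><topicRef href="#composer"/></type><topicRef href="#puccini"/></role>
  </association>
</topicMap>"""

def association_signatures(xtm_text):
    """For each association type, collect the set of role types it uses."""
    root = ET.fromstring(xtm_text)
    sigs = defaultdict(set)
    for assoc in root.findall(f"{XTM}association"):
        a_type = assoc.find(f"{XTM}type/{XTM}topicRef").get("href").lstrip("#")
        for role in assoc.findall(f"{XTM}role"):
            r_type = role.find(f"{XTM}type/{XTM}topicRef").get("href").lstrip("#")
            sigs[a_type].add(r_type)
    return dict(sigs)

print(association_signatures(SAMPLE))
```

Run over the whole export, this yields exactly the concise catalogue of predicates and their role signatures that the browsing tools decline to provide.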

Question for the makers of Topic-Map tools: why is this so hard? (Or: what am I missing?)

With sufficient persistence and the help of (a) the schema for the topic map and (b) my trusty XQuery sandbox, I have begun to get an overview of the design of the topic map. If time permits, I may record it in subsequent posts, partly for subsequent use and partly so people who understand the topic map better than I do can correct misapprehensions.

The biggest set of open questions remains: how does modeling a collection of information with Topic Maps differ from modeling it using some other approach? Are there things we can do with Topic Maps that we can’t do, or cannot do as easily, with a SQL database? With a fact base in Prolog? With colloquial XML? It might be enlightening to see what the Italian Opera topic map might look like, if we designed a bespoke XML vocabulary for it, or if we poured it into SQL. (I have friends who tell me that SQL is really not suited for the kinds of things you can do with Topic Maps, but so far I haven’t understood what they mean; perhaps a concrete example will make it easier to compare the two.)
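One place the comparison could start: a bespoke relational schema for the same material, sketched here in in-memory SQLite. The table and column names are my own guesses at a plausible design, not derived from the topic map’s schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE opera (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    composer TEXT NOT NULL,        -- the single credited composer
    completed_by TEXT              -- NULL for almost every opera
);
CREATE TABLE librettist (          -- one-to-many: an opera, several hands
    opera_id INTEGER REFERENCES opera(id),
    name TEXT NOT NULL
);
""")
conn.execute("INSERT INTO opera VALUES "
             "(1, 'Turandot', 'Giacomo Puccini', 'Franco Alfano')")
conn.executemany("INSERT INTO librettist VALUES (1, ?)",
                 [("Giuseppe Adami",), ("Renato Simoni",)])
rows = conn.execute("""SELECT o.title, l.name FROM opera o
                       JOIN librettist l ON l.opera_id = o.id""").fetchall()
print(rows)
```

Even this toy schema makes one difference visible immediately: the relational design must decide up front which relationships get their own tables, while the topic map can add a new association type without touching anything already there.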

Gavagai and topic maps

[23 November 2009]

In Leipzig earlier this month, at TMRA 2009, there was (understandably, in light of the conference motto “Linked Topic Maps”) a certain amount of discussion of using public subject identifiers for topics, to increase the likelihood that information collected by different people could be merged usefully, without preconcertation by the parties involved. In principle, this raises all sorts of thorny questions about ontology (of both the philosophical and the engineering kinds). In practice, however, people would like to be able to build systems and share data without waiting for the philosophers and engineers of the world to agree on an answer to the question “What exists?” and they seem to be willing to risk making a mistake or two along the way.

At some point (I’m now a bit hazy when this happened), someone mentioned a proposal made by Vegard Sandvold, an interesting sort of rough-and-ready approach to the problem: use Wikipedia (or dbpedia) as a sort of barometer of a reasonably democratic consensus on the question.

Steve Pepper has responded that one problem with that (among others of no concern here) is that if he wants to say something about (for example) the International Socialists organization in the UK (founded in 1962), a restriction to Wikipedia as a source of public subject identifiers would make it impossible: Wikipedia redirects from International Socialists (UK) to a page about the Socialist Workers Party (Britain). This organization, founded in 1977, is really not the same thing, said Steve (even if Wikipedia says the SWP was formed by renaming the IS, which suggests a continuity of essential identity).

Steve appears to argue that there is a simple fact of the matter in this case, and that treating the International Socialists and the Socialist Workers Party as the same is simply wrong. He may be right. But (without wanting to argue the ins and outs of this particular case) the example seems to me to be the kind of question which does not always have a determinate answer. It illustrates nicely a kind of ontological indeterminacy which necessarily haunts both RDF and Topic Maps and shows why accounts of how those technologies can combine information from multiple sources can strike careful readers as naive.

Within a given discourse, for a given purpose, we may decide to analyse a particular portion of reality as consisting of several distinct entities which succeed each other in time; in another discourse, for another purpose, we may choose to analyse that same portion of reality as consisting of a single entity which undergoes some changes of (accidental) properties. Concretely: sometimes we want to treat the International Socialists as distinct from the Socialist Workers Party, and sometimes we want to treat them as two names for the same thing. Sometimes I want to treat the journal Computers and the Humanities and the journal Language resources and evaluation as two quite distinct journals; at other times, I want to treat them as the same journal, published under one title through 2004 and then under a new title. There is not a simple fact of the matter; it depends on what, exactly, I want to refer to when I refer to things.

It reminds me of W.V.O. Quine’s question: if a field linguist sees a rabbit run by and hears a native informant say “gavagai”, how is the linguist to determine whether this utterance means ‘Look, a rabbit!’ or ‘Food!’ or ‘Undetached rabbit-parts’? Quine talked about the indeterminacy of translation, suggesting that he’s concerned essentially with the problem of understanding what someone else really means by something (oddly, he seems to think the difficulty arises primarily with foreign languages, which seems to me optimistic), but I think the difficulty reflects the fact that the field linguist is likely to confront an analysis of reality that does not match his own. This happens to the rest of us, too, not just to field linguists.

When I say the phrase “International Socialists (UK)”, can you reliably determine whether I intend to denote an organization which ceased to exist in 1977, or an organization which continues to exist today?

You can ask me, of course, and maybe I’ll answer. And if I answer, maybe I’ll tell you the truth. But if you don’t have a chance to ask me?

Topic maps on my mind

[20 November 2009, additional pointer 23 November]

Last week I spent in Leipzig, attending the Topic Maps Research and Applications 2009 conference organized by the Topic Maps Lab at the university there. And since I returned, I have found myself spending a lot of time thinking about topic maps. (Enough that I really need to stop for a bit and get back to other work.)

I’ve written a short trip report on the conference, which can be read in the archives of the Topic Maps in LIS list run by Kevin Trainor. [23 November. The Topic Maps Lab has posted a version of the trip report that may be easier to read. Many thanks to them.] It doesn’t really do the conference justice, but perhaps it’s better than nothing. Other trip reports are [also] pointed to from the TM Lab page.

The short version of my report is: “Gosh, that was fun! And wow! is Leipzig ever worth a visit!”

Persistence and dereferenceability

[31 March 2009]

My esteemed former colleague Thomas Roessler has posted a musing on the fragility of the electronic historical record and the difficulties of achieving persistence, when companies go out of existence and coincidentally stop maintaining their Web sites or paying their domain registration fees.

After reading Thomas’s post, my evil twin Enrique came close to throwing a temper tantrum. (Actually, that’s quite unfair. For Enrique, he was remarkably well behaved.)

“The semantic web partisans,” he shouted, “have spent the last ten years or more telling us that URLs are the perfect naming mechanism: a single, integrated space of names with distributed naming authority. Haven’t they?”

“Well,” I said, “strictly speaking, I think they have mostly been talking about URIs, for the last few years at least.” He ignored this.

“They have been telling us we should use URLs for naming absolutely everything. Including everything we care about. Including Aeschylus and Euripides! Homer! Sappho! Including Shelley, and Keats, and Pope!”

I couldn’t help starting to hum ‘Brush up your Shakespeare’ at this, but he ignored me. This in itself was unusual; he is usually a sucker for Cole Porter. I guess he really was kind of worked up.

“And when anyone expressed concern about (a) the fact that the power to mint URLs is tied up with the regular payment of fees, so it’s really not equally accessible to everyone, or (b) the possibility that URLs don’t have the kind of longevity needed for real persistence, they just told us again, louder, that we should be using URLs for everything.”

“Now, don’t bring up URNs!” I told him, in a warning tone. “We don’t want to open those old wounds again, do we?”

“And why the hell not?” he roared. “What do the SemWeb people think they are playing at?!”

“Well,” I said.

“Either they are surprised at this problem, in which case you have to ask: ‘How can they be surprised? What kind of idiots must they be not to have seen this coming?’”

“Well,” I said.

“Or else they aren’t surprised, in which case you have to ask what they are smoking! Is their attention span so short that it has never occurred to them that names sometimes need to last for longer than Netscape, Inc., happens to be in business?”

“Well,” I said. I realized I didn’t really have a good answer.

“And you?!” he snarled, turning on me and grabbing my lapels. “You were there for years — you couldn’t take a moment to point out to them that a naming convention can be used for everything we care about only if it can be used for the monuments of human culture? You couldn’t be bothered to point out that URLs can be suitable for naming parts of our cultural heritage only if they can last for a few hundred, preferably a few tens of thousands, of years? What use are you?!”

“Well,” I said.

“What use are URLs and their much hyped dereferenceability, if they can break this fast?”

“Well,” I said.

Long pause.

I am not sure Enrique’s complaints are entirely fair, but I also didn’t know how to answer them. I fear he is still waiting for an answer.

Thoughts to come back to …

[27 August 2008]

Observation: both RDF and Topic Maps seem to aspire to make it easy to find all the ways in which a given thing (topic or resource) may appear in a given body of data.

In both, the basic goal appears to be that if you look for Essex or Washington, you should be able to specify whether you mean the human being or the geographic entity (and probably be able to distinguish the state of Washington from the various cities named Washington), and find it no matter where it appears in the system. In RDF terms, this would mean no matter which triples it appears in, and no matter whether it appears as subject or object of the triples; in topic map terms, it would mean no matter which roles it plays in which associations.
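The aspiration can be stated as a one-line query. Here is a sketch in RDF-ish terms: given a set of triples, find every statement in which a given resource appears, whether as subject or as object. The identifiers are invented for the sketch.

```python
# Invented toy data: Washington the state, a city in it, and the person.
triples = {
    ("Washington_state", "locatedIn", "USA"),
    ("Seattle", "locatedIn", "Washington_state"),
    ("George_Washington", "bornIn", "Virginia"),
}

def occurrences(resource, triples):
    """All triples in which the resource appears, as subject or object."""
    return {t for t in triples if resource in (t[0], t[2])}

hits = occurrences("Washington_state", triples)
print(len(hits))  # 2: once as subject, once as object
```

Because the state and the person are distinct resources, asking about one never drags in statements about the other, however similar their names.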

Observation: Codd insists firmly that, to be acceptable in his eyes, relational database management systems must provide built-in system-level support for domains, which may be regarded as a form of extended data type with slightly more semantics than the basic types in a typical programming language, so that you can distinguish people from places even if you use VARCHAR(60) columns to represent both. They must also include ways of finding all the values actively in use for a given domain, regardless of relation or column, and of finding all the occurrences of a particular value of a particular domain, without getting it mixed up with any values from different domains which happen to use the same underlying basic datatype in their representation. (For those with The relational model for database management, version 2 (Reading: Addison-Wesley, 1990) on the shelf, I’m looking at sections 3.4.1 The FAO_AV Command and 3.4.2 The FAO_LIST Command.)
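Codd’s requirement, as I read it, can be sketched like this: columns are tagged with a domain, and the system can list every value of a given domain in active use, across all relations and columns. The tables, domains, and data here are invented for the sketch, not taken from Codd.

```python
# Toy catalogue: each table records its columns' domains alongside rows.
tables = {
    "premiere": {"columns": {"opera": "OperaName", "city": "CityName"},
                 "rows": [("Tosca", "Rome"), ("Turandot", "Milan")]},
    "birthplace": {"columns": {"person": "PersonName", "city": "CityName"},
                   "rows": [("Puccini", "Lucca")]},
}

def active_values(domain, tables):
    """All values of the given domain in active use, across all tables."""
    values = set()
    for table in tables.values():
        col_domains = list(table["columns"].values())
        for row in table["rows"]:
            for value, col_domain in zip(row, col_domains):
                if col_domain == domain:
                    values.add(value)
    return values

print(sorted(active_values("CityName", tables)))  # ['Lucca', 'Milan', 'Rome']
```

Note that the query ranges over both tables: it is the domain tag, not the table or column name, that determines what counts as a city, which is exactly the point of contact with the RDF and topic-map aspiration above.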

Question: Are the RDF and TM communities and Codd after essentially the same thing here? Is there some sense in which they fulfil (or are trying to fulfil) this part of Codd’s ideal for data management better than SQL systems do?

What exactly is the relation between this aspect of both RDF and TM on the one hand, and Codd’s notion of domains or extended data types on the other?

I’ve wanted to think about this for years, but have not managed to find anyone to discuss it with who had (a) sufficient knowledge of both semweb or Topic-Map technology and relational theory (or rather Codd’s particular doctrines) and (b) the time and inclination to go into it, at (c) a time when I myself had the time and inclination. But someday, perhaps …