Information modeling | Messages in a Bottle

XML as a sort-of open format

[3 December 2009]

I just encountered the following statements in technical documentation for a family of products which I’ll leave nameless.

This document does not describe the complete XML schema for either [Application 1] or [Application 2]. The complete XML schema for both applications is not available and will not be made public.

Perhaps there can be good reasons for such a situation. Perhaps the developers really don’t know how to use any existing schema language to describe the set of documents they actually accept; perhaps only a Turing machine can actually identify the set of documents accepted, and the developers were unwilling to work with a simpler set whose membership could be more cheaply decided. (Well, wait, those may be reasons, but they don’t actually qualify as “good”.)

I wonder whether this is an insidious attempt to look like the products have an open format (See? it’s XML! How much more open can you get?) while ensuring that the commercial products in question remain the only arbiters of acceptable documents? Or whether the programmers in question were just too lazy to specify a clean vocabulary and ensure that their software handles all documents which meet some standard of validity that does not require Turing completeness?

Having a partially defined XML format is, at least for me, still a great deal more convenient than having the format be binary and completely undocumented. But it certainly seems to fall a long distance short of what XML could make possible.

Notes on Steve Pepper’s Italian Opera topic map

[30 November 2009]

As mentioned recently, I have been spending a lot of time thinking about topic maps. And, as not infrequently happens, thinking about one thing has taught me something interesting about something else.

Concretely, as a way of learning more about the technology and the way things work, I have spent some rewarding time studying a topic map first put together some time ago by Steve Pepper (then with Ontopia), on the subject of Italian opera, and still actively maintained by him. I’m not sure whether the Italian opera topic map has a canonical or master location, but it can be inspected in several locations on the web, including the Ontopedia web site which Steve appears to maintain together with Dmitry Bogachev (where it can be inspected using the Ontopia Omnigator), and the Topic Maps Lab web site (where it is made accessible through Maiana, the new ‘social topic maps’ tool offered as a service by the Topic Maps Lab in Leipzig). It also ships as one of the examples in any distribution of the Ontopia software, so if you install Ontopia you will have a copy on your disk.

Working with the Italian opera topic map has given me occasion to think about a couple of principles applicable far beyond topic maps.

Keeping examples real

Steve Pepper’s topic map has the great virtue that it’s small enough to be perspicuous, but large enough and realistic enough to be interesting and fun to read through and to illustrate some interesting problems for modeling. (An opera has one composer, right? And one librettist — no, wait, scratch that, one or more librettists. And hang on, we may have more than one composer involved: Puccini died before finishing Turandot; it was completed by Franco Alfano. On the other hand, there is always a single composer given credit for the opera as a whole. Or at least there appears to be, in this body of material. So: one composer who is The composer, and possibly another who completes the work. Or are there any works where more than one hand is involved in the completion? [A. Not in this collection of operas, no.] Etc., etc.)
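To make the cardinalities in that parenthesis concrete, here is a minimal sketch in Python. The class and its field names are my own invention for illustration and are not taken from the topic map; the constraints encoded are only the ones that appear to hold in this body of material (one credited composer, one or more librettists, at most one completing hand).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Opera:
    """Hypothetical record type; field names are illustrative only."""
    title: str
    composer: str                       # exactly one composer gets the credit
    librettists: List[str]              # one or more
    completed_by: Optional[str] = None  # at most one, in this collection

    def __post_init__(self) -> None:
        # Enforce the "one or more librettists" rule at construction time.
        if not self.librettists:
            raise ValueError(f"{self.title}: at least one librettist is required")


turandot = Opera(
    title="Turandot",
    composer="Giacomo Puccini",
    librettists=["Giuseppe Adami", "Renato Simoni"],
    completed_by="Franco Alfano",
)
print(turandot)
```

A work completed by two hands would break this model, which is exactly the kind of question real data forces one to ask.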

Also, while for simplicity the information is selective and incomplete, it’s real.

This turns out to matter. I have sometimes read discussions of database design where examples were given using bibliographic data so ludicrously oversimplified that they distracted me from the point being made. How can anyone take seriously a discussion in which a database schema consisting of author, title, and date is offered as a plausible representation even for very simple bibliographic data? Or in which we make the simplifying assumption that suppliers never ever have locations in more than one city?

The Italian Opera topic map is certainly simplified vis-a-vis reality, or even the full complement of information found in opera lexica. But it works with real data, and it takes the data seriously, and that makes it very satisfying to work with.

XQuery for data exploration

Both Maiana and Omnigator make it easy to click around in the topic map, passing from a page about the topic type Opera to a page about the specific opera Don Carlos, to the Paris Opera (where it had its premiere), to Verdi’s Jerusalem (also premiered there), to the role of the Emir of Ramla, which is (predictably) a bass, to a page about the topic of basses, with a list of all the bass roles in the topic map, to the role of Mefistofele, to … you get the idea. Unfortunately, neither of them makes it as easy as one might wish to get the kind of design overview I have been trying to get. They make an effort, of course. Some of my questions are easily answered.

For example: What kinds of entities are represented by topics in the topic map? This one is easy, and hard. Easy, because both tools provide a list of topic types; hard (at least I found it so) because there are so many of them, and the list mixes domain-oriented types (Opera, Character, Voice Type) central to the concerns of the topic map with others of peripheral interest (Publisher, Broadcasting company, City, Country, Place), and a few that belong to the meta-level (Subordinate Role Type, Superordinate Role Type).

I found it helpful to export the topic map in an XML form (XTM 2.0 seems to be a widely supported XML syntax for topic maps) and load it into an XQuery engine with an interactive sandbox interface, so that I could get a better sense of how many topics there are of various types. That way, I could focus on learning about how the most important topic types (or at least the most common ones) are represented, and leave the oddball special cases for later (including a few types which have only a few instances and are used to describe things like what syntax the topic map itself is maintained in).
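For readers without an XQuery engine to hand, the same kind of count can be sketched in Python with the standard library. This is not the query I ran; it is a rough analogue, and the XTM 2.0 fragment in it is invented for illustration rather than excerpted from the real topic map. It assumes that topic types are given by instanceOf/topicRef elements in the XTM 2.0 namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"x": "http://www.topicmaps.org/xtm/"}  # XTM 2.0 namespace

# Invented fragment for illustration; not an excerpt from the real topic map.
SAMPLE = """\
<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="opera"/>
  <topic id="composer"/>
  <topic id="tosca"><instanceOf><topicRef href="#opera"/></instanceOf></topic>
  <topic id="turandot"><instanceOf><topicRef href="#opera"/></instanceOf></topic>
  <topic id="puccini"><instanceOf><topicRef href="#composer"/></instanceOf></topic>
</topicMap>
"""


def count_topic_types(xtm_text: str) -> Counter:
    """Count topics per type, keyed by the href of each instanceOf/topicRef."""
    root = ET.fromstring(xtm_text)
    counts = Counter()
    for topic in root.findall("x:topic", NS):
        for ref in topic.findall("x:instanceOf/x:topicRef", NS):
            counts[ref.get("href")] += 1
    return counts


# Most common types first, so the oddball special cases sink to the bottom.
for type_ref, n in count_topic_types(SAMPLE).most_common():
    print(f"{n:5d}  {type_ref}")
```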

When trying to understand the structure of some collection of information, there is no substitute for looking at the data. And there is a lot to be said for having a tool to make it easy to look at the data in a lot of different ways. XQuery and XSLT have no peers here.

Some ongoing challenges

It has proven a lot harder to get a good overview of the different kinds of associations between topics, which always seem to me one of the key strengths of topic maps.

Like RDF, Topic Maps can describe relationships between things; unlike RDF and like the relational model, Topic Maps can describe n-ary relations without artificially dissolving them to a set of n – 1 binary relations. There seems to me an obvious, very natural, and important relation between the associations in a topic map (which allow the representation of propositions like “Floria Tosca kills Baron Scarpia by stabbing”), the relations in an RDBMS representation of the material, and the n-ary predicates one might use to formulate the propositional content in symbolic logic or in a logic programming language like Prolog.
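A small sketch may make the contrast concrete. The role names (perpetrator, victim, means) and the event identifier are hypothetical, chosen for illustration and not taken from the topic map. The first form keeps the proposition whole, much as a Prolog clause or a row in a relation would; the second shows one common way of breaking it into binary statements, here by inventing a node to stand for the event.

```python
from typing import List, NamedTuple, Tuple


class Kills(NamedTuple):
    """One ternary fact, comparable to a Prolog clause kills(Perpetrator, Victim, Means)."""
    perpetrator: str
    victim: str
    means: str


fact = Kills("Floria Tosca", "Baron Scarpia", "stabbing")


def to_binary(event_id: str, fact: Kills) -> List[Tuple[str, str, str]]:
    """Dissolve the fact into binary (subject, property, value) statements
    about an invented event node; the proposition is no longer in one piece."""
    return [(event_id, role, player) for role, player in fact._asdict().items()]


print(fact)
for statement in to_binary("killing-1", fact):
    print(statement)
```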

So I found myself wanting a nice, concise overview of the predicates captured by the topic map: what association types are there, and what roles are involved in each? And what types of things are supposed to play those roles?

Question for the makers of Topic-Map tools: why is this so hard? (Or: what am I missing?)

With sufficient persistence and the help of (a) the schema for the topic map and (b) my trusty XQuery sandbox, I have begun to get an overview of the design of the topic map. If time permits, I may record it in subsequent posts, partly for subsequent use and partly so people who understand the topic map better than I do can correct misapprehensions.
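The kind of overview I have in mind can be sketched roughly as follows, again in Python rather than XQuery and again over an invented XTM 2.0 fragment whose type and role names are hypothetical. For each association type it collects the role types that occur and the types of the topics that play them. It assumes that players are referenced by local fragment identifiers such as #tosca and that their types are declared with instanceOf.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"x": "http://www.topicmaps.org/xtm/"}  # XTM 2.0 namespace

# Invented fragment; the type and role names are hypothetical.
SAMPLE = """\
<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="tosca"><instanceOf><topicRef href="#opera"/></instanceOf></topic>
  <topic id="puccini"><instanceOf><topicRef href="#composer"/></instanceOf></topic>
  <association>
    <type><topicRef href="#composed-by"/></type>
    <role><type><topicRef href="#work"/></type><topicRef href="#tosca"/></role>
    <role><type><topicRef href="#creator"/></type><topicRef href="#puccini"/></role>
  </association>
</topicMap>
"""


def association_signatures(xtm_text):
    """Map each association type to its role types and the types of their players."""
    root = ET.fromstring(xtm_text)
    # Index: "#topic-id" -> set of type references declared for that topic.
    types_of = {}
    for topic in root.findall("x:topic", NS):
        refs = topic.findall("x:instanceOf/x:topicRef", NS)
        types_of["#" + topic.get("id")] = {ref.get("href") for ref in refs}

    signatures = defaultdict(lambda: defaultdict(set))
    for assoc in root.findall("x:association", NS):
        assoc_type = assoc.find("x:type/x:topicRef", NS).get("href")
        for role in assoc.findall("x:role", NS):
            role_type = role.find("x:type/x:topicRef", NS).get("href")
            player = role.find("x:topicRef", NS).get("href")  # direct child only
            signatures[assoc_type][role_type] |= types_of.get(player, set())
    return signatures


# Print each association type as a predicate-like signature.
for assoc_type, roles in sorted(association_signatures(SAMPLE).items()):
    args = ", ".join(
        f"{role}: {' | '.join(sorted(players)) or '?'}"
        for role, players in sorted(roles.items())
    )
    print(f"{assoc_type}({args})")
```

The result reads rather like a list of predicate signatures, which is the form in which I find the design easiest to take in.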

The biggest set of open questions remains: how does modeling a collection of information with Topic Maps differ from modeling it using some other approach? Are there things we can do with Topic Maps that we can’t do, or cannot do as easily, with a SQL database? With a fact base in Prolog? With colloquial XML? It might be enlightening to see what the Italian Opera topic map might look like, if we designed a bespoke XML vocabulary for it, or if we poured it into SQL. (I have friends who tell me that SQL is really not suited for the kinds of things you can do with Topic Maps, but so far I haven’t understood what they mean; perhaps a concrete example will make it easier to compare the two.)
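As a first, very tentative step toward such a comparison, here is how a single n-ary association might look when poured into SQL, using SQLite through Python. The table and column names are my own assumptions and not a proposal for the real schema. It shows only that one proposition fits naturally in one row; it says nothing yet about where SQL might fall short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical schema: one table of characters, one ternary relation.
    CREATE TABLE opera_character (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE killing (
        perpetrator TEXT NOT NULL REFERENCES opera_character(id),
        victim      TEXT NOT NULL REFERENCES opera_character(id),
        means       TEXT NOT NULL
    );
    INSERT INTO opera_character VALUES ('tosca', 'Floria Tosca');
    INSERT INTO opera_character VALUES ('scarpia', 'Baron Scarpia');
    INSERT INTO killing VALUES ('tosca', 'scarpia', 'stabbing');
""")

# Resolve both role players back to their names in a single query.
rows = con.execute("""
    SELECT p.name, v.name, k.means
    FROM killing AS k
    JOIN opera_character AS p ON p.id = k.perpetrator
    JOIN opera_character AS v ON v.id = k.victim
""").fetchall()

for perpetrator, victim, means in rows:
    print(f"{perpetrator} kills {victim} by {means}")
```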

Messages in a Bottle

CMSMcQ's klog
