graphic with four colored squares
Cover page image (keys)

What is schema mapping,
and why should portals care?

C. M. Sperberg-McQueen

<philtag n="7"/>
Trier, 13 October 2008

Overview

N.B. this talk does not provide full or correct answers.
It doesn't even provide incomplete or incorrect answers.
It may, possibly, provide a useful question.

What is schema mapping?

schema mapping
the translation of data from one schema (or: vocabulary; or: markup language) to another
data exchange
taking data structured as an instance of one schema and representing it as accurately as possible as an instance of another
data integration
bringing together data from disparate sources and providing access to it as though it were a unified collection

A problem for portals

A portal presenting a collection should
  • Provide access to the entire collection through a single unified interface
  • Provide full access to each item in the collection

A problem for portals (2)

But the collection has variable structure.
  Tag 1Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Tag 7 Tag 8
Text 1 XX       
Text 2 XXXX     
Text 3 X  X     
Text 4 X  X  X  
Text 5 X  X   X 
Text 6 XX X    X
Text 7 XX  X  XX
Text 8 XX       
Text 9 XXXX     
Text 10 X  X     
Text 11 X  X  X  
Text 12 X  X   X 
Text 13 XX X    X
Text 14 XX  X  XX

A Procrustean approach

We can analyse the variation:
  Tag 1 Tag 4 Tag 2 Tag 3 Tag 5 Tag 6 Tag 7 Tag 8
Text 2 XXXX     
Text 6 XXX     X
Text 9 XXXX     
Text 13 XXX     X
Text 3 XX       
Text 4 XX    X  
Text 5 XX     X 
Text 10 XX       
Text 11 XX    X  
Text 12 XX     X 
Text 1 X X      
Text 7 X X X  XX
Text 8 X X      
Text 14 X X X  XX

Now it's simple

And trim it away. Now it's simple.
  Tag 1 Tag 4 Tag 2
Text 2 XXX
Text 6 XXX
Text 9 XXX
Text 13 XXX
Text 3 XX 
Text 4 XX 
Text 5 XX 
Text 10 XX 
Text 11 XX 
Text 12 XX 
Text 1 X X
Text 7 X X
Text 8 X X
Text 14 X X

A better approach

Two incompatible goals?
Try two solutions:
  • One common interface (‘least common denominator’) for entire collection.
    E.g. all texts consist of book, chapter, paragraph.
    Cf. Arras (John B. Smith, ca. 1980):
    • text, chapter, paragraph, sentence, word
    • text, volume, page, line, word
  • Specialized interfaces for subcollections with consistent content / tagging.

Multiple UIs?

Some user-interface specialists* recommend multiple user interfaces.
  • Different users work differently.
  • Each user works differently at different times.
  • Multiple interfaces → clean separation of UI and back end.
* See e.g. N. Borenstein, Programming as if people mattered.
But how do we make that work?

A simple solution

The simplest solution is brute force.
  • Transform the data.
  • Keep the original.
  • Keep the copy.
  • Buy a larger disk.
“If brute force is not working, you're not applying enough brute force.” -Norm Walsh

The simple solution

A cleverer solution

Invert the mapping and apply to query.

When are mappings reversible?

Obvious answer: when they lose no information.
  • ... but how do we define that, in turn?
  • M · M' = identity

Some obvious cases:

Less obvious cases

Related work in database theory

Schema mapping for relations has occupied some very bright minds.
Summary: Gosh, that's hard!

Also a topic in RDF-based data integration.

Summary: if you keep it simple, it can be simple.

Mapping inversion ≠ query inversion

It may not be as bad as it looks.
  • Even lossy mappings can be query-reversible:
    • Map emph, foreign, mentioned, hiem.
    • Query //hi//*[self::emph or self::foreign or self::mentioned or self::hi]
  • Reverse mapping may overgenerate, and excess results may be filtered out later in user (target) schema.
  • N.B. mapping must cover the documents present, not necessarily all documents in the source schema.

Mapping inversion ≠ query inversion 2

It may be worse than it looks.
  • Results in isolation may lack context necessary to filter out excess hits (or translate correctly).

Shorter term practice

  • Materialize the UI-specific views as needed. (Disk is cheap.)
  • Assign stable identifiers to elements.
  • Keep IDs aligned in all versions.
  • Keep clear which version is the master copy.
  • Derive all other copies mechanically from the master.

Longer term challenge

Understanding document-schema mappings: when are they
  • reversible?
  • query-reversible?
  • applicable to results out of context?
For now, inversion of arbitrary XSLT infeasible.
But particular transformations are clearly reversible in general, or for querying.
Are there suitable subsets? Alternative formalisms?
There is a lot to learn here.