Some notes on metadata at W3C

C. M. Sperberg-McQueen

8 January 2002



This document outlines some tentative thoughts on things W3C could do to exploit better metadata on our site. As a side effect, such uses would help demonstrate the utility of metadata to others.

1. Principles

Some basic assumptions should probably be made explicit at the start.

1.1. Gradual changeover, not sudden

Many of the ideas here involve adding a little metadata, and thus require a little extra work, on each document or HTML page on the W3C site.
If we try to do that extra work for the entire body of existing pages at once, the amount of work will be so large that we can easily foresee its horrible impact on getting anything else done, and thus that we'll never finish the job. And so we'll never start.
But if we plan the transition differently, it is manageable.
Instead of doing the extra work for every existing page, we do it only
  • when we create a new page
  • when we revise an existing page
At the beginning, very little of the Web site will have the extra metadata. Over time, however, as the Web site develops, more and more of the site will have the new metadata. The areas which are being updated most actively get the new metadata first; areas which have slower rates of change get the new metadata later. Eventually, the only parts of the site which don't have the metadata will be the parts we aren't changing anymore. At that point, we can decide whether it's worth bothering about that part of the site or not. [1]

1.2. Think globally, act locally

Many documents on our site are written and edited not by members of W3C staff but by volunteers from our member organizations.
We are perhaps not really in a position to require that they provide metadata in any particular form, except in documents being published as technical reports, for which we already impose publication rules.
So I try not to assume that every document on our site will have the metadata I'm describing. I think we may benefit from having the metadata in some of our documents; certainly we need to look for benefits that will flow from having even incomplete metadata.

1.3. Carrots not sticks

The best way to get better metadata is not simply to require it (though I recommend that in some cases) but to make it to the authors' and editors' advantage to provide metadata.
I think this means we want to have systems which exploit metadata to make documents that carry it easier to find and use than documents lacking metadata.
The first thing that occurs to me is a site-wide search engine; this may be the lowest hanging fruit.

2. Dublin Core information

To start with, we should put Dublin Core metadata into the HTML headers of our documents, and our site software should exploit it.
The Dublin Core allows us to identify
  • title of the document
  • creator (author, editor, revisor, ...)
  • subject (expressed as keywords, subject phrases, or classification codes)
  • description (table of contents, abstract, ...)
  • publisher (in our case, always W3C)
  • contributor (I assume this is weaker than author)
  • date (of publication, of revision, ...)
  • type of resource[2]
  • format (Dublin Core recommends some controlled vocabulary and mentions MIME types)
  • identifier (for our purposes, presumably this should be the URI)
  • language
  • relation to other resources
  • coverage (I'm not sure how often this will apply to us)
  • rights (this can be a standard reference to our IPR and reproduction policy)

2.1. Dublin Core in HTML documents

Put it into the head ... examples ...
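As one possible illustration, here is a minimal sketch of how site software might generate such header lines. It assumes the common convention of meta elements with DC.-prefixed names plus a link element declaring the schema; the function name dc_head_lines is hypothetical, and the language value "en" in the sample is my assumption.

```python
from html import escape

# Namespace URI of the Dublin Core element set, version 1.1.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"

def dc_head_lines(metadata):
    """Return HTML head lines carrying Dublin Core metadata.

    `metadata` maps Dublin Core element names (e.g. "title") to a string
    or a list of strings; a repeated element gets one meta tag per value.
    (Hypothetical helper, not part of any existing W3C tool.)
    """
    lines = ['<link rel="schema.DC" href="%s">' % DC_SCHEMA]
    for element, values in metadata.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<meta name="DC.%s" content="%s">'
                % (escape(element, quote=True), escape(value, quote=True))
            )
    return lines

# Sample values taken from this document; "en" is an assumption.
example = {
    "title": "Some notes on metadata at W3C",
    "creator": "C. M. Sperberg-McQueen",
    "date": "2002-01-08",
    "publisher": "W3C",
    "language": "en",
}
print("\n".join(dc_head_lines(example)))
```

The output is meant to be pasted or templated into the head of the page; attribute values are escaped so that quotation marks in titles do not break the markup.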

2.2. Dublin Core in XML documents

We don't have (that I know of) any complete list of the SGML and XML DTDs, other than HTML, currently used on our site. I know of two: TEI Lite (for which I may be the only user) and specprod.dtd, which is used (with lots of minor variations) for several specs.
If these DTDs don't have Dublin Core information in them, they should. For those we control, we should modify the DTD as needed to make this true. For those we don't control, we should encourage the owner to do so.
The methods used to translate from them into HTML should put the Dublin Core information into the HTML as shown above.
... examples ...
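A translation step of this kind might look roughly like the following sketch. It assumes, purely for illustration, that the source XML carries its metadata as elements in the Dublin Core namespace; the sample document, its spec and header elements, and the function name dc_meta_from_xml are all hypothetical and do not reflect the actual structure of specprod.dtd or TEI Lite.

```python
import xml.etree.ElementTree as ET
from html import escape

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_meta_from_xml(xml_text):
    """Collect every element in the Dublin Core namespace, in document
    order, and render each one as an HTML meta tag."""
    root = ET.fromstring(xml_text)
    prefix = "{%s}" % DC_NS
    tags = []
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith(prefix):
            name = elem.tag[len(prefix):]          # e.g. "title"
            value = (elem.text or "").strip()
            tags.append(
                '<meta name="DC.%s" content="%s">'
                % (name, escape(value, quote=True))
            )
    return tags

# Hypothetical source document; element names outside the dc: namespace
# are invented for this example.
sample = """<spec xmlns:dc="http://purl.org/dc/elements/1.1/">
  <header>
    <dc:title>Example specification</dc:title>
    <dc:creator>A. N. Editor</dc:creator>
    <dc:date>2002-01-08</dc:date>
  </header>
  <body><p>...</p></body>
</spec>"""
print("\n".join(dc_meta_from_xml(sample)))
```

In practice the translation would more likely live in whatever stylesheet or script already produces the HTML; the point of the sketch is only that the mapping from namespaced elements to meta tags is mechanical.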

3. Search engines

One of the easiest ways to make metadata pay off is to make it visible in search screens and in search results.
A search engine that allowed us to do fielded search for documents on author, title, date, and subject keyword fields, as well as full-text search, would help me personally a lot.[3]
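To make the idea of fielded search concrete, here is a small sketch under simplifying assumptions: each document is a dictionary of metadata fields plus a body, and matching is a case-insensitive substring test. The function name fielded_search, the field names, and the body texts are hypothetical; a real engine would use an index rather than a linear scan.

```python
def fielded_search(records, text=None, **fields):
    """Return the records whose named metadata fields contain the given
    substrings and, if `text` is given, whose body contains `text`.
    Matching is case-insensitive; a record lacking a field never
    matches a query on that field."""
    hits = []
    for record in records:
        ok = all(
            wanted.lower() in str(record.get(field, "")).lower()
            for field, wanted in fields.items()
        )
        if ok and text is not None:
            ok = text.lower() in record.get("body", "").lower()
        if ok:
            hits.append(record)
    return hits

# Titles and dates follow the sample result list in this section;
# the body strings are placeholders.
records = [
    {"author": "Murata Makoto",
     "title": "Syntax for regular-but-non-local schemata for structured documents",
     "date": "1999-02-17", "body": "placeholder text about schemata"},
    {"author": "Scott Lawrence",
     "title": "Schemas vs DTDs as set definitions",
     "date": "1999-02-16", "body": "placeholder text about DTDs"},
    {"title": "A Note on Inheritance and Specialization",
     "body": "placeholder text about inheritance"},
]
for hit in fielded_search(records, author="lawrence", date="1999-02"):
    print(hit["title"])   # prints: Schemas vs DTDs as set definitions
```

Note that the third record, which has no author or date, can still be found by full-text search but never by a fielded query on the missing fields, which is exactly the pressure toward providing metadata discussed here.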
A search engine that displayed metadata in the results of a search, so the user could tell from the short results screen the author, the title, the date of first publication, the date of most recent revision, etc., would make it easier for me to find the documents I'm looking for among those listed. If it also made clear when a hit had no metadata, there would soon be peer and personal pressure on all those preparing documents to provide useful metadata. I suspect that after seeing lists like the following for a while, those responsible for the documents with "[Author unknown]" will feel an urge to provide better metadata for them.
  1. Murata Makoto. Syntax for regular-but-non-local schemata for structured documents. 17 February 1999. XML Schema WG.
  2. Mike Spreitzer. Some Extensibility Issues for XML Schema. 17 February 1999. XML Schema WG.
  3. Scott Lawrence. Schemas vs DTDs as set definitions. Rev 2. 16 Feb 1999. XML Schema WG.
  4. Shriram V Revankar. Data types and basic operations. 15 February 1999. XML Schema WG.
  5. [Author unknown]. Associating Schemas with XML Instance Documents. [Date unknown]. [WG unknown].
  6. [Author unknown]. Inheritance in Documents, Inheritance in Data. [Date unknown]. [WG unknown].
  7. [Author unknown]. An outline proposal for an XML Schema version of ID/IDREF. [Date unknown]. [WG unknown].
  8. [Author unknown]. A Note on Inheritance and Specialization. [Date unknown]. [WG unknown].
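Producing entries like those in the list above is straightforward once the metadata is available as fields. The sketch below assumes each hit is a dictionary with optional author, title, date, and wg keys; the function name format_hit and the key names are hypothetical, and the "[Title unknown]" placeholder is my own addition by analogy with the others.

```python
def format_hit(record):
    """Render one search hit as a one-line entry, substituting a visible
    placeholder for every metadata field that is missing or empty."""
    author = record.get("author") or "[Author unknown]"
    title = record.get("title") or "[Title unknown]"
    date = record.get("date") or "[Date unknown]"
    group = record.get("wg") or "[WG unknown]"
    return "%s. %s. %s. %s." % (author, title, date, group)

hits = [
    {"author": "Mike Spreitzer",
     "title": "Some Extensibility Issues for XML Schema",
     "date": "17 February 1999", "wg": "XML Schema WG"},
    {"title": "A Note on Inheritance and Specialization"},
]
for number, hit in enumerate(hits, start=1):
    print("  %d. %s" % (number, format_hit(hit)))
```

Making the placeholders loud rather than silently omitting the field is the design choice that matters: the gap has to be visible for the peer pressure to work.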
It would be very nice if the search engine could accept standoff metadata, that is, metadata stored apart from the documents it describes, so that we could provide metadata for old documents which aren't being revised without touching them.
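One way an indexer might combine the two sources is sketched below. It assumes the standoff metadata is kept in a table keyed by document URI, and that metadata embedded in a document takes precedence over the standoff record for the same field; that precedence rule is my assumption, and the function name, URI, and sample values are hypothetical.

```python
def effective_metadata(uri, embedded, standoff):
    """Merge the standoff record for `uri` (if any) with the metadata
    embedded in the document itself. Non-empty embedded values win,
    so a revised document overrides its old standoff record."""
    merged = dict(standoff.get(uri, {}))
    merged.update({key: value for key, value in embedded.items() if value})
    return merged

# Hypothetical standoff table, maintained separately from the documents.
standoff = {
    "http://example.org/old/note.html": {
        "creator": "A. N. Author",
        "date": "1999-02-15",
    },
}

# The old document itself supplies only a title.
embedded = {"title": "An old note", "creator": ""}
print(effective_metadata("http://example.org/old/note.html", embedded, standoff))
```

Letting embedded values win fits the gradual changeover described in section 1.1: once a page is revised and given its own metadata, the standoff entry simply stops mattering.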

Notes

[1] This idea of changing things gradually, in the course of processing you'd be doing anyway, is one I learned from the books of the library scientist S. R. Ranganathan, who labels it osmosis.
[2] The Dublin Core initiative's type vocabulary WG recommends using one of the terms collection, dataset, event, image, interactive resource, service, software, sound, or text.
[3] Substantially similar mechanisms in the search engine should make it possible to do fielded search on email logs, using the fields in the RFC 822 mail headers to allow the searcher to distinguish the name of the mailing list on which the note appeared from the name of a different mailing list mentioned in the note.
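A minimal sketch of the distinction drawn in note [3], using Python's standard email package to separate headers from body: the function name mail_field_matches, the addresses, and the message text are hypothetical.

```python
from email import message_from_string

def mail_field_matches(raw_message, field, wanted):
    """Return True if any occurrence of header `field` contains `wanted`
    (case-insensitive). The message body is never consulted."""
    message = message_from_string(raw_message)
    values = message.get_all(field, [])
    return any(wanted.lower() in value.lower() for value in values)

raw = (
    "From: someone@example.org\n"
    "To: list-a@example.org\n"
    "Subject: Cross-posting question\n"
    "\n"
    "Was this already discussed on list-b@example.org?\n"
)
print(mail_field_matches(raw, "To", "list-a"))  # True: the list it was sent to
print(mail_field_matches(raw, "To", "list-b"))  # False: only mentioned in the body
```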