What is a character?

[25 November 2009]

The other day I posted about a proposal to use Wikipedia as a rough-and-ready guide to the ontology of most public entities. I’ve been thinking about it, and wondering why it felt somehow sort of familiar.

Eventually, I decided that the proposal reminds me of the way in which some people (including me) eventually disposed of the thorny question of what to count as a ‘character’ when analysing writing systems. (For example: when are e and é to be regarded as the same character, when as two? Or is the latter a sequence of two characters, e and an acute accent?) The answer some people eventually converged upon is simple:

For virtually all engineering purposes, treat something as a character if and only if there is a code point for it in the Universal Character Set (UCS) defined by Unicode and ISO 10646.

Some exceptions may need to be made in principle for the private use area, or for particular special cases. But unless you and your project are the kind of people who actually run into, or identify, special cases related to subtle issues in the history of the world’s writing systems (that means, for 99.999% of the world’s population, and at least 50% of the readers of this blog), you don’t need to worry about exceptions.
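For the engineering-minded, the rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_ucs_character` is made up for this example, the private use area and surrogates are treated as the exceptions, and the answer depends on whichever version of the Unicode database ships with your Python interpreter.

```python
import unicodedata


def is_ucs_character(s: str) -> bool:
    """Hypothetical helper: True if s is exactly one assigned code point.

    Assumption for this sketch: private-use (Co), surrogate (Cs) and
    unassigned (Cn) code points do not count as characters.
    """
    # A Python str is a sequence of code points, so len() counts code points.
    if len(s) != 1:
        return False
    return unicodedata.category(s) not in {"Cn", "Co", "Cs"}


print(is_ucs_character("e"))        # plain letter e
print(is_ucs_character("\u00e9"))   # precomposed e with acute
print(is_ucs_character("e\u0301"))  # e followed by a combining acute accent
print(is_ucs_character("\ue000"))   # a private-use code point
```

Under these assumptions the first two calls report a character and the last two do not: the third because it is two code points, the fourth because it falls in the private use area.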

The reasoning is not that the Unicode Consortium and SC 2 got the answers right. On the contrary, any reasonable observer will agree that they got some of them wrong. Many members of the relevant committees will agree. (They answer, for example, that é is BOTH a single character and a sequence of two characters. Thank you very much; may I have another drink now?) It’s not likely, of course, that any two reasonable observers will agree on which questions the UCS gets right, and which it gets wrong.
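The é case is easy to see for yourself. The following is a small sketch using Python's standard `unicodedata` module; the variable names are mine, and the two spellings of é are written as escapes so that no editor can quietly normalize them along the way.

```python
import unicodedata

# Two spellings of what a reader sees as the same letter.
precomposed = "\u00e9"    # one code point
decomposed = "e\u0301"    # letter e followed by a combining accent

# As raw code point sequences they differ, in content and in length.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# The names the UCS assigns to the code points involved.
print(unicodedata.name(precomposed))
print(unicodedata.name(decomposed[1]))

# Normalization maps each spelling onto the other:
# NFC composes, NFD decomposes.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed)
print(decomposed_again == decomposed)
```

So the two spellings are different strings that normalize to one another, which is about as close as software gets to answering "both".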

But some questions are just hard to answer in a universally satisfactory way; if you decide for yourself what counts as a character and what does not count as a character, your answers may differ in details from those enshrined in the UCS, but they will not be more persuasive on the whole: there are no answers to that question that are persuasive in every regard.

The definition of ‘character’ embodied in the UCS is as good an answer to the question “What is a character?” as we as a community are going to get, and for those for whom that question is incidental to other, more important concerns, it’s far better to accept that answer and move on than to try to provide a better one.

If the question is not incidental, but central to your concerns (if, for example, you are a historian of writing systems, or of a particular writing system), then a standardized answer is not much use to you, except perhaps as an object of study.

Hmm. Perhaps one of the main purposes of standardization is to allow us to ignore things we are not particularly interested in, at the moment? Is the purpose of standards to make things boring and ignorable?

That could affect whether we think it’s a good idea to adopt such a de facto standard for ontology, or whether we think such standardization is just one step along a slippery slope with thought police at the bottom.