[25 November 2009]
The other day I posted about a proposal to use Wikipedia as a rough-and-ready guide to the ontology of most public entities. I’ve been thinking about it, and wondering why it felt somehow sort of familiar.
Eventually, I decided that the proposal reminds me of the way in which some people (including me) eventually disposed of the thorny question of what to count as a ‘character’ when analysing writing systems. (For example: when are e and é to be regarded as the same character, and when as two distinct characters? Or is the latter a sequence of two characters, e and an acute accent?) The answer some people eventually converged upon is simple:
For virtually all engineering purposes, treat something as a character if and only if there is a code point for it in the Universal Character Set (UCS) defined by Unicode and ISO 10646.
Some exceptions may need to be made in principle for the private use area, or for particular special cases. But unless you and your project are the kind of people who actually run into, or identify, special cases related to subtle issues in the history of the world’s writing systems (and 99.999% of the world’s population are not, along with at least 50% of the readers of this blog), you don’t need to worry about exceptions.
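In code, the rule is nearly a one-liner. Here is a minimal Python sketch (the function name is mine, nothing official): a string counts as a single character just in case it is one code point and that code point is assigned in the UCS. It even flags the private use area mentioned above, since the PUA has its own general category.

```python
import unicodedata

def is_single_ucs_character(s: str) -> bool:
    """Rough engineering test: one code point, assigned in the UCS.
    Category 'Cn' means unassigned; 'Co' marks the private use area,
    which is exactly where the in-principle exceptions live."""
    return len(s) == 1 and unicodedata.category(s) != "Cn"

print(is_single_ucs_character("\u00E9"))    # True: é has a code point (U+00E9)
print(is_single_ucs_character("e\u0301"))   # False: two code points
print(unicodedata.category("\uE000"))       # 'Co': assigned, but private use
```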
The reasoning is not that the Unicode Consortium and SC 2 got the answers right. On the contrary, any reasonable observer will agree that they got some of them wrong. Many members of the relevant committees will agree. (The UCS’s answer, for example, is that é is BOTH a single character and a sequence of two characters. Thank you very much; may I have another drink now?) It’s not likely, of course, that any two reasonable observers will agree on which questions the UCS gets right, and which it gets wrong.
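The é complaint, by the way, is easy to reproduce. A short Python sketch (mine, not anything the committees endorse): both encodings of é are legal, and normalization maps back and forth between them.

```python
import unicodedata

precomposed = "\u00E9"    # é as a single code point, U+00E9
decomposed = "e\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

print(len(precomposed))   # 1
print(len(decomposed))    # 2
print(precomposed == decomposed)  # False: different code point sequences

# NFC composes the pair into one code point; NFD decomposes it again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```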
But some questions are just hard to answer in a universally satisfactory way; if you decide for yourself what counts as a character and what does not count as a character, your answers may differ in details from those enshrined in the UCS, but they will not be more persuasive on the whole: there are no answers to that question that are persuasive in every regard.
The definition of ‘character’ embodied in the UCS is as good an answer to the question “What is a character?” as we as a community are going to get, and for those for whom that question is incidental to other more important concerns, it’s far better to accept that answer and move on than to try to provide a better one.
If the question is not incidental, but central to your concerns (if, for example, you are a historian of writing systems, or of a particular writing system), then a standardized answer is not much use to you, except perhaps as an object of study.
Hmm. Perhaps one of the main purposes of standardization is to allow us to ignore things we are not particularly interested in, at the moment? Is the purpose of standards to make things boring and ignorable?
That could affect whether we think it’s a good idea to adopt such a de facto standard for ontology, or whether we think such standardization is just one step along a slippery slope with thought police at the bottom.
Just for the fun of it, I diffidently point out that this is only one of the four definitions given in the Unicode Glossary, viz. the third:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin.
The term abstract character, unlike character, is formally defined in the Unicode Standard itself, thus:

Abstract character. A unit of information used for the organization, control, or representation of textual data. (Definition D7.)
An addition to John’s list:
Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)
ISO/IEC 29500 has adopted this phrase for the places where we want to make it clear that a surrogate pair is a single thing rather than two things.
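To make the scalar-value point concrete, here is a small Python sketch (the helper name is mine): a scalar value is any code point outside the surrogate range, and a code point above U+FFFF is one scalar value even though it occupies two UTF-16 code units.

```python
def is_scalar_value(cp: int) -> bool:
    """True iff cp is a Unicode scalar value (definition D76):
    a code point that is not a high or low surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

print(is_scalar_value(0x00E9))    # True  (é)
print(is_scalar_value(0xD800))    # False (high surrogate)
print(is_scalar_value(0x10FFFF))  # True  (last code point)

# U+1D11E MUSICAL SYMBOL G CLEF: one scalar value, two UTF-16 code units.
s = "\U0001D11E"
print(len(s))                          # 1 code point
print(len(s.encode("utf-16-be")) // 2) # 2 code units (a surrogate pair)
```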
Hmm. It occurs to me that perhaps you mean that Ä (which can be encoded either with one code point or with two) is a single character by nature, whereas f́ (which can only be encoded with two code points) is two characters by nature. If so, I think that’s perverse.
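The contrast is easy to check with a normalizer; a short Python sketch: NFC collapses Ä to one code point because a precomposed form exists in the UCS, but it has no precomposed form to offer f́.

```python
import unicodedata

# Ä has a precomposed code point (U+00C4), so NFC composes A + diaeresis.
print(len(unicodedata.normalize("NFC", "A\u0308")))  # 1
# There is no precomposed f-with-acute, so NFC leaves f + acute as two.
print(len(unicodedata.normalize("NFC", "f\u0301")))  # 2
```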
John, I’m not sure whether your remark in comment 3 is addressed to Murata-san or to me. If to me, then I think what I had in mind in this post is that for engineering purposes, the right solution is to appeal to the UCS.
It is not the case that the UCS’s answers are entirely satisfactory. But for pretty much all engineering purposes, the choice is between (a) aligning with the UCS and having answers that are not entirely satisfactory, or (b) deciding all the tricky cases yourself, which means adding extravagant amounts of complexity to your project, thus probably making your project late, and ending up with answers that are … not entirely satisfactory. Also with answers not aligned with most other projects’ answers, which will be a trial to users. (If you’re going to make me put up with unsatisfactory analyses of writing systems, at least have the grace to content yourselves with the same unsatisfactory analyses I’m already putting up with, and refrain from adding new ones to my portfolio!)
If any engineering project aimed at building something other than a coded character set thinks it is going to produce answers which are better than those of the UCS by any measurable margin (let alone a margin large enough to earn forgiveness for deviation from the norm), they are smoking crack. And if the project is aimed at building a new polyglot character set to replace the UCS? Then they are smoking crack, too.
So is Ä one character or two, by nature? My recommended answer is: yes, in the UCS it is one character, or two, thanksbye.