[26 April 2009]
I had an interesting exchange with a friend recently. He wrote
Presumably a binary octet is a sequence of bits. A sequence of sequences of bits is not a sequence of bits, since a bit is not a sequence of bits. (Please, let’s not violate the axiom of regularity!) Therefore, a finite-length sequence of zero or more binary octets is not a sequence of bits.
The upshot, he said, was that the description of the value space of the base64Binary and hexBinary datatypes was wrong in both XSD 1.0 and XSD 1.1. The 1.0 spec says (in section 3.2.16):
The value space of base64Binary is the set of finite-length sequences of binary octets.
XSD 1.1 says, very similarly (but with pedantic attention to details on which people have professed uncertainty):
The value space of base64Binary is the set of finite-length sequences of zero or more binary octets. The length of a value is the number of octets.
But if my friend is right, and the binary datatypes are intended to have value spaces which are sequences of bits, then those descriptions are wrong. So, my friend continues, the description of the value space really ought to be:
… the set of finite-length concatenations of sequences of zero or more binary octets.
Shouldn’t it?
This sounds plausible at first glance, but the answer is no, it shouldn’t. The two binary datatypes can certainly be used, in fairly obvious ways, to encode sequences of bits, but their value space is not the set of bit sequences, not even the set of bit sequences whose length is an integer multiple of eight (and whose length for purposes of the XSD length facet would be their bit length divided by eight).
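To make the arithmetic concrete, here is a minimal sketch in Python (nothing the XSD spec prescribes, just an illustration of my own): one and the same two-octet value can be written as hexBinary or as base64Binary, and its length for purposes of the length facet is two octets, not sixteen bits and not four characters of the lexical form.

```python
import base64

# The hexBinary literal "B2FF" and the base64Binary literal "sv8="
# both denote the same two-octet value.
from_hex = bytes.fromhex("B2FF")
from_b64 = base64.b64decode("sv8=")

assert from_hex == from_b64 == b"\xb2\xff"

# For XSD's length facet, the length of this value is 2 (octets),
# not 16 (bits) and not 4 (characters in the lexical form).
print(len(from_hex))  # 2
```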
My friend’s argument suffers, I think, from two faults. Most important, he seems to assume
- That an octet is a sequence of bits.
- That the purpose of base64Binary (and of the Base 64 Encoding on which it is based) is to encode sequences of bits.
I don’t think either of these is true in detail.
Are octets sequences of bits?
Is an octet a sequence of bits? Certainly it’s often thought of that way (e.g. in the Wikipedia article on ‘octet’). But strictly speaking I think the term ‘octet’ is best taken as denoting a group of bits, without assuming a geometry, in which (for purposes of most network transmission protocols) each bit is associated with a power of two.
But if we number the bits with their powers as 0 .. 7, is the octet the sequence b0 b1 b2 b3 b4 b5 b6 b7? or b7 b6 b5 b4 b3 b2 b1 b0? Or some other sequence? On architectures where the byte is the smallest addressable unit, there is no requirement that the bits be thought of as being in any particular order, although the design of big- and little-endian machines makes better intuitive sense if we assume least-significant-bit-first order for little-endian, and most-significant-first for big-endian. I believe that some serial-port protocols specify least-first order, others greatest-first (with least-first most common). But I suspect that most networking protocols today (and for a long time since) assume parallel transmission of bits, in which case asking about the sequence of bits within an octet is nothing but a category error.
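To see why the question of order is separate from the octet itself, here is a small sketch in Python (my own example, not anything from the specs): the same octet, read out most-significant-bit first and least-significant-bit first, yields two different bit sequences, while the octet, as a group of bits each carrying a power of two, is one and the same value.

```python
octet = 0xB2  # the bits with powers 7 .. 0 are 1, 0, 1, 1, 0, 0, 1, 0

# Read the same octet out as a sequence in two different orders.
msb_first = [(octet >> power) & 1 for power in range(7, -1, -1)]
lsb_first = [(octet >> power) & 1 for power in range(8)]

print(msb_first)  # [1, 0, 1, 1, 0, 0, 1, 0]
print(lsb_first)  # [0, 1, 0, 0, 1, 1, 0, 1]

# Both serializations describe the same octet; the octet itself
# carries no preferred order, only a power of two for each bit.
assert sum(bit << power for power, bit in enumerate(lsb_first)) == octet
assert sum(bit << (7 - i) for i, bit in enumerate(msb_first)) == octet
```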
But IANAEE. Am I wrong?
Does base 64 encoding encode bits?
RFC 3548, which defines Base 64 encoding, says
The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.
It uses similar wording for the base 32 and base 16 encodings it also defines.
Note the choice of words: octets, not bits.
I wasn’t there when it was developed, but I’m guessing that base 64 notation is carefully formulated to be agnostic on the sequence of bits within an octet, and to be equally implementable on big-endian, little-endian, and other machines. That would be one reason for it to talk about encoding sequences of octets, not sequences of bits.
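A small sketch may help here; the Python standard library’s base64 module serves only as a convenient illustration, not as a normative reference. The encoder consumes octets and produces characters of the 64-character alphabet; the bit order inside each octet never enters into the operation. (The three octets below happen to be the ASCII codes for “Man”.)

```python
import base64

octets = bytes([0x4D, 0x61, 0x6E])  # three octets, the ASCII string "Man"

# Base 64 maps each group of three octets to four characters of the
# 64-character alphabet; it is defined on octets, not on bits.
encoded = base64.b64encode(octets)
print(encoded)                       # b'TWFu'
assert base64.b64decode(encoded) == octets
```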
One consequence of such agnosticism is that if one wanted to specify a sequence of bits using base64Binary, or hexBinary, one would need to specify what sequence to assign to the eight bits of each octet. And indeed RFC 2045 specifies that “When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first.” I believe that that stipulation is necessary precisely because it doesn’t follow from the nature of octets and thus doesn’t go without saying. The RFCs also don’t say, when you are encoding a sequence of six bits, for example, whether it should be left-aligned or right-aligned in the octet.
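If one does want to carry a bit stream, the consequence is that the packing has to be specified explicitly. The following sketch (my own, with the left-alignment chosen arbitrarily, since as just noted the RFCs leave it open) packs a six-bit sequence most-significant-bit first, as RFC 2045 stipulates, and then base64-encodes the resulting octet.

```python
import base64

bits = [1, 0, 1, 1, 0, 0]  # a six-bit stream to be carried in base64Binary

# Pack most-significant-bit first, as RFC 2045 stipulates for bit streams.
# Whether the six bits are left-aligned (followed by two zero bits) or
# right-aligned (preceded by two zero bits) in the octet is a separate
# choice the RFCs do not make for us; here we left-align.
octet = 0
for bit in bits:
    octet = (octet << 1) | bit
octet <<= 8 - len(bits)                   # left-align: pad with zero bits

print(f"{octet:08b}")                     # 10110000
print(base64.b64encode(bytes([octet])))   # b'sA=='
```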
Bottom line: I think the XSD spec is right to say the value spaces of the binary types are sequences of octets, not of bits.
[In preparing this post, I notice (a) that RFC 3548 appears to have been obsoleted by RFC 4648, and (b) that XSD 1.1 still cites RFC 2045. I wonder if anyone cares.]
I think there is no doubt that octet means a group of eight things (bits, in this case) rather than a sequence of eight things. Perhaps the use of the word to mean ‘the first eight lines of a sonnet’ is an exception.
So the question has to be asked, if the datatypes have a value space which consists of (sequences of) octets rather than bits, why does base64Binary
have “Binary” as part of its name? It may well be this bad naming on the part of XSD that led your friend to assume that it encodes sequences of “bits” (i.e. a binary type with 2 values) rather than octets, which aren’t binary.
I agree with you that the datatype should be seen as encoding octets, and don’t suggest that the name be changed now…