Base 64 binary, hex binary, and bit sequences

[26 April 2009]

I had an interesting exchange with a friend recently. He wrote

Presumably a binary octet is a sequence of bits. A sequence of sequences of bits is not a sequence of bits, this web since a bit is not a sequence of bits. (Please, drugs let’s not violate the axiom of regularity!) Therefore, website a finite-length sequence of zero or more binary octets is not a sequence of bits.

The upshot, he said, was that the description of the value space of the base64Binary and hexBinary datatypes was wrong in both XSD 1.0 and XSD 1.1. The 1.0 spec says (in section 3.2.16):

The value space of base64Binary is the set of finite-length sequences of binary octets.

XSD 1.1 says, very similarly (but with pedantic attention to details on which people have professed uncertainty):

The value space of base64Binary is the set of finite-length sequences of zero or more binary octets. The length of a value is the number of octets.

But if my friend is right, and the binary datatypes are intended to have value spaces which are sequences of bits, then those descriptions are wrong. So, my friend continues, the description of the value space really ought to be:

… the set of finite-length concatenations of sequences of zero or more binary octets.

Shouldn’t it?

This sounds plausible at first glance but the answer is no, it shouldn’t. The two binary datatypes can certainly be used, in fairly obvious ways, to encode sequences of bits, but their value space is not the set of bit sequences, not even the set of bit sequences whose length is an integer multiple of eight (and whose length for purposes of the XSD length facet would be their bit length divided by eight).

My friend’s argument suffers, I think, from two faults. Most important, he seems to assume

  1. That an octet is a sequence of bits.
  2. That purpose of base64Binary (and of the Base 64 Encoding on which it is based) is to encode sequences of bits.

I don’t think either of these is true in detail.

Are octets sequences of bits?

Is an octet a sequence of bits? Certainly it’s often thought of that way (e.g. in the Wikipedia article on ‘octet’). But strictly speaking I think the term ‘octet’ is best taken as denoting a group of bits, without assuming a geometry, in which (for purposes of most network transmission protocols) each bit is asociated with a power of two.

But if we number the bits with their powers as 0 .. 7, is the octet the sequence of b0 b1 b2 b3 b4 b5 b6 b7? or b7 b6 b5 b4 b3 b2 b1 b0? Or some other sequence? On architectures where the byte is the smallest addressable unit, there is no requirement that the bits be thought of as being in any particular order, although the design of big- and little-endian machines makes better intuitive sense if we assume least-significant-bit first order for little-endian, and most-significant-first for big-endian. I believe that some protocols for serial port protocols specify least-first, others greatest-first, order (with least-first most common). But I suspect that most networking protocols today (and for a long time since) assume parallel transmission of bits, in which asking about the sequence of bits within an octet is nothing but a category error.

But IANAEE. Am I wrong?

Does base 64 encoding encode bits?

RFC 3548, which defines Base 64 encoding, says

The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.

It uses similar wording for the base 32 and base 16 encodings it also defines.

Note the choice of words: octets, not bits.

I wasn’t there when it was developed, but I’m guessing that base 64 notation is carefully formulated to be agnostic on the sequence of bits within an octet, and to be equally implementable on big-endian, little-endian, and other machines. That would be one reason for it to talk about encoding sequences of octets, not sequences of bits.

One consequence of such agnosticism is that if one wanted to specify a sequence of bits using base64Binary, or hexBinary, one would need to specify what sequence to assign to the eight bits of each octet. And indeed RFC 2045 specifies that “When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first.” I believe that that stipulation is necessary precisely because it doesn’t follow from the nature of octets and thus doesn’t go without saying. The RFCs also don’t say, when you are encoding a sequence of six bits, for example, whether it should be left-aligned or right-aligned in the octet.

Bottom line: I think the XSD spec is right to say the value spaces of the binary types are sequences of octets, not of bits.

[In preparing this post, I notice (a) that RFC 3548 appears to have been obsoleted by RFC 4648, and (b) that XSD 1.1 still cites RFC 2045. I wonder if anyone cares.]

For collectors of headline ambiguities …

[dd Mmmm 2009]

[Names changed to protect the guilty. I don’t know for sure what either of these people were talking about, stuff and I find the exchange so surrealistically satisfying that I don’t actually want to know.]

Vladimir: Estragon – do you understand what the HTML 5 people are doing?

Estragon: Vladimir, could you be more specific?

Vladimir: Estragon – would you like me to be more specific about anything in particular?

[There is a long pause. Gestures. Facial expressions.]

Estragon: What they are doing is trying to cling to the happiness and safety that they associate with their childhood/teenage experiences of learning about the Web.

[allegorical programming?]

allegory: you tell the story on one level, look but the whole point is the interpretation on another level.

programming: you manipulate bits, clinic but the whole point is the customer and the goods and the legal transactions, not the bits

[allegorical programming?]

allegory: you tell the story on one level, apoplectic
but the whole point is the interpretation on another level.

programming: you manipulate bits, but the whole point is the customer
[22 April 2009]

Linguists (and others) like to collect cases in which the compressed, salve telegraphic style of newspaper headlines lead to unexpected syntactic ambiguities.

The Albuquerque Journal and the Santa Fe New Mexican both carried stories yesterday about the results of a new University of New Mexico study on the incidence of spinal injuries in the U.S.

I showed them to Enrique, who glanced at the headline in the Journal:

Paralysis More Widespread Than Thought

and asked “When did the Albuquerque Journal start covering W3C and ISO Working Groups?”

Grant and contract-supported software development

[7 April 2009]

Bob Sutor asks, page in a blog post this morning, some questions about government funding and open source software. Since some of them, at least, are questions I have thought about for a while as a reviewer for the National Endowment for the Humanities and other funding agencies, I think I’ll take a shot at answering them. To increase the opportunity for independent thought to occur, I’ll answer them before I read Bob Sutor’s take on them; if we turn out to agree or disagree in ways that require comment, that can be separate.

He asks:

  • When a government provides funding to a research project, should any software created in the project be released under an open source license?

It depends.

In practice, I think it almost always should, but in theory I recognize the possibility of cases in which it needn’t.

When I review a funding proposal, I ask (among other things): what is the quid pro quo? The people of the country fund this proposal; what are they buying with that money? A reliable survey of the work of Ramon Llull and its relevance to today? Sounds good (assuming I think the applicant is actually in a position to produce a reliable survey, and the cost is not exorbitant). A better tool for finding and studying emblem books? Insight into methods of performing some important task? (Digitizing cultural artefacts, archiving digital research results for posterity, creating reliable language corpora, handling standoff annotation, … there are a whole lot of tasks it would be good to know better how to do.) How interesting is what we would be learning about? How much are we likely to learn?

My emphasis on what we get for the money sometimes leads other reviewers or panelists to regard me as cold and mean-hearted, insufficiently concerned with encouraging a movement here, nurturing a career there. But I have noticed that the smartest and most attractive members of panels I’ve been on are almost always even tougher graders than I am. When funds are as tight as they typically are, you really do need to put them where they will do the most good.

If the value proposition of the funding proposal is “we’ll develop this cool software”, then as a reviewer I want the public to own that software. Otherwise, what did we buy with that money?

If the value proposition is “we’ll develop these cool algorithms and techniques, and write them up, so the community now has better knowledge of how to do XYZ — oh, and along the way we will write this software, it’s necessary to the plan but it’s not what this grant is buying”, then I don’t think I want to insist on open-sourcing the software. But it does make it harder for the applicant to make the case that the results will be worth the money.

Stipulating that software produced in a project will be open-source does usually help persuade me that its benefit will be greater and more permanent. If the primary deliverable I care about is insight, or an algorithm, open-sourcing the software may not be essential. But it helps guarantee that there will be a mercilessly complete account of the algorithm with all details. (It does have the potential danger, though, that it may allow other reviewers or the applicants to believe that the source code provides an adequate account of the algorithm and there is no need for a well written paper or series of papers on the technical problem. I am told that some programmers write source code so clear and beautiful that it might suffice as a description of the algorithm. I say, if writing documentation as well as source code is good enough for Donald Knuth, it’s good enough for the rest of us.)

On the other hand, I don’t think deciding not to open-source the software is necessarily an insuperable barrier. The question is: what value is the nation or the world going to get from this funding? Sometimes the value clearly lies with the software people are proposing to develop, sometimes it clearly lies elsewhere and the software plays a purely subordinate, if essential, role. (But although I admit this in principle, I am not sure that in practice I have ever liked a proposal that proposed to spend a lot of effort on software but not to make it generally available. So maybe my generosity toward non-open-source projects is a purely theoretical quantity, not observable in practice.)

If software is involved, you also have to ask yourself as a reviewer how well it is likely to be engineered and whether the release of the software will serve the greater good, or whether it will act like a laboratory escape, not providing good value but inhibiting the devotion of resources to creating better software.

The chances and consequences of suboptimal engineering vary, of course, with whether the research in question is focused specifically on computer science and software engineering, or on an application domain, in which case there is a long and often proud history of good science being performed with software that would make any self-respecting software engineer gag. (A long time ago, I worked at a well known university where the music department burned more CPU cycles on the university mainframe than any other department. Partly this was because Physics had its own machines, of course, and partly it was because the music people were doing some really really cool and interesting stuff. But was it also partly because they were lousy programmers who ran the worst optimized code east of the Mississippi? I never found out.)

  • Does this change if commercial companies are involved? How?

If the work is being done by a commercial company, they are historically perhaps less likely to want to make the software they develop open-source. That’s one way the process is affected.

But also, if a government agency is contracting with a commercial organization to develop some software, there may be a higher chance that the agency wants some software for particular parties to use, and the main benefit to be gained is the availability to those parties of the software involved. In some cases, the benefit may be the existence of commercially viable organizations willing and able to support software of a particular class and develop it further.

There are plenty of examples of commercial codebases developed in close consultation with an initial client or with a small group of initial clients. The developer gets money with which to do the development; the initial clients get to help shape the product and ensure that at least one commercial product on the market meets their needs. In the cases I have heard of, the clients don’t typically turn around and demand that the code base be open-source.

It’s not clear to me that government funding agencies should be barred from acting as clients in scenarios like this. This kind of arrangement isn’t precisely what I tend to think of as “research”, but whether it’s appropriate or not in a given research program really depends on the terms of reference of that program, and not on what counts as research in the institutions that trained me.

I have been told on what I think is good authority that if it had not been for contracts let by various defence agencies, the original crop of SGML software might never have been commercially viable. (And since it was that crop of software that demonstrated the utility of descriptive markup and made XML possible, I wouldn’t like to try to outlaw whatever practices led to that software being developed.)

  • Does this change if academic institutions are involved? How?

I don’t think so.

  • How should the open source license be chosen? Who gets to decide?


Two umbrellas and a prime number.

I think I mean “Huh?” Is this a trick question?”

To the extent that we think of funded research as the purchase (on spec) of certain research products we hope the funding will produce, then the funding agency can certainly say “We want the … license”. And then the Golden Rule of Arts and Sciences applies. Or the people writing the proposal can say “We want to use the … license; take it or leave it.” And the funding agency, guided by its reviewers and panelists and staff and the native wit of those responsible for the final decision, will either leave it or take it.

The only thing that would make me more suspicious and worried than this chaotic back and forth would be an attempt to make an iron-clad rule to cover all cases, for all projects, for all governmental funding agencies.

Balisage: The markup conference 2009

[3 April 2009]

Three weeks to go until the 24 April deadline for papers for the 2009 edition of Balisage: The markup conference.

We want your paper. So give it to us, visit this already!

This is a peer-reviewed conference that seeks to be of interest both to the theorist and to the practitioner of markup. That makes it a lot of fun (at least for people like me who are interested both in theory and in practice, and who like to see them informing each other). And the peer reviews are unusual (at least in my experience of conference paper submissions) in the detail and passion of their comments.

If you have markup-related work to report on, you will not get better feedback from any conference on the planet. (Disclaimer: I am one of the organizers, and have been known to have a small soft spot in my heart for this conference. But don’t take my word for it: ask anyone who has spoken at Balisage how the peer review and the questions at the conference compared with other conferences.)

Details of submission procedure, of course, in the call for participation.

I look forward to reading your papers.

Heinrich Hertz and the empty set of tomatoes

[2 April 2009]

Why does Nelson Goodman want to work so hard just to avoid talking about classes or sets?

Earlier this year I spent some time reading the section on the calculus of individuals in Nelson Goodman’s The structure of appearance (3d ed. Boston: Reidel, approved 1977) and the paper Goodman wrote on the subject with Henry S. Leonard (Henry S. Leonard and Nelson Goodman, “The calculus of individuals and its uses” The journal of symbolic logic 5.2 (1940): 45-55).

I was struck by the lengths Goodman goes to in order to avoid talking about sets, although his compound individuals which contain other individuals seem to be doing very much the same work as sets. Indeed, the 1940 paper makes a selling point of this fact. On page 46, Leonard and Goodman write “To any analytic proposition of the Boolean algebra will correspond a postulate or theorem of this calculus provided that …” (In other words, with some few provisos, if you can make a true statement about sets, you can make a corresponding true statement about individuals in the calculus of individuals. The provisos aren’t even statements you can’t make, just restrictions on the form you make them in. Instead of saying “the intersection of x and y is the empty set” you have to say they are discrete. And so on.) And the concluding sentence of the paper (p. 55) is: “The dispute between nominalist and realist as to what actual entities are individuals and what are classes is recognized as devolving upon matters of interpretative convenience rather than upon metaphysical necessity.“

In other words, Goodman seems at first glance to be simplifying the world by eliminating the notion of sets and classes, and then to be complicating it again in precisely similar ways by taking all of the fundamental ideas we have about sets or classes, and reconstructing them as funny ways of talking about individuals. Cui bono?

This afternoon I saw a review by Anthony Gottlieb, in the New Yorker, of a recent book about the Wittgenstein family (Alexander Waugh, The House of Wittgenstein: A family at war), which seems to suggest a solution. Gottlieb quotes a suggestion from the physicist Heinrich Hertz:

Hertz had suggested a novel way to deal with the puzzling concept of force in Newtonian physics: the best approach was not to try to define it but to restate Newton’s theory in a way that eliminates any reference to force. Once this was done, according to Hertz, “the question as to the nature of force will not have been answered; but our minds, no longer vexed, will cease to ask illegitimate questions.”

(Throws a new light on Wittgenstein’s remark about not wanting to solve problems but to dissolve them, doesn’t it?)

It’s true that once you rebuild the ideas of set union, intersection, difference, etc. as ideas about individuals which can overlap or contain other individuals, and eliminate the word ‘set’, it becomes a lot harder to describe a set which contains as members all sets which are members of themselves, or a set which contains as members all sets which are not members of themselves. The closest you can conveniently get are statements about individuals which overlap themselves (they all do) or which do not overlap themselves (no such individual). Good-bye, Russell’s Paradox!

And consider the surrealist joke I ran into the other day:

Q. What is red and invisible?
A. No tomatoes.

A user of the calculus of individuals can enjoy this on its own terms, without having to worry about whether it’s a veiled reference to the fact that some typed logics end up with multiple forms of empty set, one for each type in the system. One for integers, if you’re going to reason about integers. One for customer records, if you’re going to reason about customers. And … one for tomatoes?

Q. What is red and invisible?
A. The empty set of tomatoes.