(Note: this is written 2 years after the bulk of the work was done on this codebase, so my thoughts will be basic, muddled, confused, and probably fabricated from the lies one tells oneself. You have been warned.)
If you know anything about transferring binary information from one place to another, you know that the trip can be hazardous. That is, hazardous to your data: maybe you need to transfer your blob over a channel meant solely for ASCII text, and every byte that would represent '\n' in a C-style string suddenly gains a sibling, '\r'. Or, your transport layer is just really, really into null-terminated strings, and you happen to have 8 consecutive 0 bits on a byte boundary, even though they fall in the middle of a 32-bit integer within the blob, and you end up with only half your data.
One way to get around this is base64, which uses the digits and the upper- and lowercase alphabet to encode the data (plus two more characters, but we'll ignore those for now). Now you don't have any problems with your transport layer trying to interpret your bytes, because everyone handles the alphanumeric characters just fine, and if they don't, you probably don't want to be using them.
Of course, after using base64 for a while, you might notice that while it's fantastic that your data no longer mutates at the whims of a capricious god that thinks he's helping, you're giving up efficiency. For every 3 bytes in your binary blob, there are 4 bytes of alphanumeric characters to transmit, so you're paying for data protection with a 33% size levy: if the mob could skim 33% off in protection rackets, you bet they would.
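The 3-bytes-in, 4-bytes-out ratio is easy to check with Python's standard base64 module:

```python
import base64

blob = bytes(range(30))          # 30 bytes of arbitrary binary data
encoded = base64.b64encode(blob)

# 30 input bytes become 40 output bytes: 4 for every 3, a 33% levy.
print(len(blob), len(encoded))   # 30 40
```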
Staring at base64, you might also notice that there are many more displayable ASCII characters than the 64 it uses, which could be put to work to raise the transport density. Thus was base85 born, using… you guessed it, 85 characters! This time around, 4 bytes of binary data get encoded into 5 ASCII characters, which brings the transport size overhead down to 20%. But what can you do? It's not like there are any more displayable ASCII… oh, wait.
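The 20% figure checks out just as easily, since Python's base64 module also ships a base85 codec:

```python
import base64

blob = bytes(range(20))          # 20 bytes of arbitrary binary data
encoded = base64.b85encode(blob)

# 20 input bytes become 25 output bytes: 5 for every 4, a 20% levy.
print(len(blob), len(encoded))   # 20 25
```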
So it appears that no one had tried to take the obvious next step beyond base85, so I decided to try my hand at making base92.
We take a similar approach to the base64 design, and encode a number of bits from the binary stream into each ASCII character. However, instead of using 6 bits per character (as base64 does), we use… 6.5? That's kind of weird! What do you do with the 0.5 bit? Here we take a page from base85's book: instead of taking the raw bit stream directly as the basis for the resulting characters, you convert a chunk of bits to an actual number (13 bits in our case, 32 in base85's), and then read off the digits in base whatever.
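In miniature, the digit-extraction step looks something like this; it's a sketch of the idea, not the library's actual implementation:

```python
def to_base91_digits(bits13):
    """Split a 13-bit integer into two base-91 digit values.

    91 is the smallest base where two digits (91 * 91 = 8281)
    can cover all 2**13 = 8192 possible 13-bit values.
    """
    return divmod(bits13, 91)

print(to_base91_digits(0))     # (0, 0)
print(to_base91_digits(8191))  # (90, 1)
```

Each digit value then indexes into a table of 91 characters, giving 2 output characters per 13 input bits.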
A nice property of using only 13 bits is that base92 is more resistant to transport corruption than base85: changing one character in the base92 transport will only affect 2-3 bytes after decoding. I haven't actually run the numbers on whether a 1-bit flip in the transport material can cause widespread corruption across 4 consecutive bytes of base85 output, but it seems plausible.
A decision remains to be made: which characters should we use? There are in fact 94 traditionally displayable ASCII characters, and with 6.5 bits per character we only need 91 of them, so which three do we leave out? Making the choice at random is hardly optimal: it makes no difference to a computer, but if the data has to pass through a visual or voice channel, we can reduce some ambiguity. For instance, using only one of 1 and l makes life easier with fonts that render those characters nearly identically. Instead, I chose to leave out the quotes, ', " and `, since they usually look very similar, and one of them is merely a doubling of another, a recipe for readability disaster.
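As an illustration (the actual base92 alphabet may be ordered or chosen differently, so treat this as hypothetical), dropping the three quotes from the 94 displayable ASCII characters leaves exactly the 91 we need:

```python
# The displayable ASCII range is '!' (0x21) through '~' (0x7e): 94 characters.
# Dropping the three quote-like characters leaves the 91 needed for base92.
QUOTES = {"'", '"', '`'}
ALPHABET = ''.join(chr(c) for c in range(0x21, 0x7f) if chr(c) not in QUOTES)

print(len(ALPHABET))  # 91
```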
Finally, as noted in the README, base91 was just too ugly a name, so I added an optional extra character to get base92.
In keeping with my roots, I wrote the Python version first. I know, boo, hiss: why bother saving time on the wire if you're going to spend it in the encode/decode phase? Well, something-something-move-fast-by-moving-slow. Or something.
I believe the library was written in a TDD style, with the handy doctest module. It's nifty, keeping your tests with your code and making them available to all only a help() call away, but it does bloat up your docstrings and makes help() less… helpful than a nice, concise docstring that isn't trying to test every facet of the functionality.
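For the unfamiliar, a doctest is just an interpreter transcript embedded in a docstring; this toy example (not from the base92 library itself) shows the shape:

```python
def double(x):
    """Double a number.

    >>> double(2)
    4
    >>> double(-3)
    -6
    """
    return 2 * x

if __name__ == '__main__':
    # Finds every >>> example in this module's docstrings, runs it,
    # and compares the result against the expected output shown.
    import doctest
    doctest.testmod()
```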
I will confess, using Python's binary string formatting makes me sad inside. If you can believe it, doing massive string munging makes me sadder than the code not being in C. One of these days, I'll hook the C library up with a Python wrapper. One of these days.
On the bright side, I believe this was the second project I uploaded to PyPI!
Currently, the repo's only other implementation is a companion C library. I know, with a fine C library I could just write wrappers for everything else. Uh, maybe you can contribute! I'll see you in a bit!
Still here? Well… the library itself was written using the sort-of-kind-of framework that is autotools, which comes out of the GNU tools tradition. I know people rag on makefiles, but it worked out. On the other hand, I haven't used autotools since this project. On the gripping hand, I haven't written a project in C since this project either.
Speaking of writing in C, it was my first time doing TDD with C code. I vaguely remember being in pain while trying to write C that wouldn't segfault, and the git log seems to bear this out. Go look: it's somewhat amusing.
It turns out that I fail at scholarship: there had already been an effort to create a basE91 over at SourceForge, and that project started in 2006, way before base92 was even a glimmer in my young, impressionable eyes. Oh well.
Eh, you actually want to use this? Well, if you're interested in using base92, you can grab it from GitHub. The README has details on how to use it, so I won't repeat myself here.