These past few days, I’ve explored some of the ways we could use Bencode as a data format “over the wire”, both in very limited embedded µC contexts and in scripting languages.
Calling it “over the wire” is a bit more down-to-earth than calling Bencode a data serialisation format. Same thing really: a way to transform a possibly complex nested (but acyclic) data structure into a sequence of bytes which can be sent (or stored!) and then decoded at a later date. The world is full of these things, because where there is communication, there is a need to get stuff across, and well… getting stuff across in the world of computers tends to happen byte-by-byte (or octet-by-octet, if you prefer).
When you transfer things as bytes, you have to delimit the different pieces somehow. The receiver needs to know where one “serialised” piece of information ends and the next starts.
There are three ways to send multi-byte chunks and keep track of those boundaries:
- send a count, send the data, rinse and repeat
- send the bytes, then add a special byte marker at the end
- send the bytes and use some “out-of-band” mechanism to signal the other side
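To make the first two approaches concrete, here is a minimal Python sketch. The function names, the 1-byte count, and the newline as end marker are all assumptions picked for illustration, not part of any particular protocol.

```python
def frame_counted(chunks):
    """Approach 1: send a count, then the data (1-byte count, so max 255 bytes)."""
    out = bytearray()
    for chunk in chunks:
        if len(chunk) > 255:
            raise ValueError("chunk too long for a 1-byte count")
        out.append(len(chunk))
        out += chunk
    return bytes(out)


def unframe_counted(data):
    """Walk the stream: read a count, take that many bytes, repeat."""
    chunks, pos = [], 0
    while pos < len(data):
        count = data[pos]
        chunks.append(data[pos + 1 : pos + 1 + count])
        pos += 1 + count
    return chunks


def frame_delimited(chunks, marker=b"\n"):
    """Approach 2: send the bytes, then a special end marker."""
    for chunk in chunks:
        if marker in chunk:
            raise ValueError("end marker must not appear inside the data")
    return b"".join(chunk + marker for chunk in chunks)


def unframe_delimited(data, marker=b"\n"):
    """Split on the marker; the piece after the last marker is empty, so drop it."""
    return data.split(marker)[:-1]


print(frame_counted([b"abc", b"de"]))    # b'\x03abc\x02de'
print(frame_delimited([b"abc", b"de"]))  # b'abc\nde\n'
```

Note how the counted version happily carries any byte value, while the delimited version has to refuse data containing its own marker.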
Each of them has major implications and trade-offs for how a transmission works. With counts, if there is any sort of error, we’re hosed – because we lose sync and have no guaranteed way to ever recover from it.
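The loss of sync is easy to demonstrate. The sketch below assumes a made-up format with 1-byte counts (the `split_counted` name is hypothetical too) and flips a single count byte from 3 to 4:

```python
def split_counted(data):
    """Split a byte stream of <count><payload> records, using 1-byte counts."""
    chunks, pos = [], 0
    while pos < len(data):
        count = data[pos]
        chunks.append(data[pos + 1 : pos + 1 + count])
        pos += 1 + count
    return chunks


good = b"\x03abc\x02de\x04fghi"
bad = b"\x04" + good[1:]  # one corrupted count byte: 3 became 4

print(split_counted(good))  # [b'abc', b'de', b'fghi']
print(split_counted(bad))   # [b'abc\x02', b'e\x04fghi']
```

After the damaged count, the decoder treats the letter "d" as the next count, and every boundary from there on is wrong.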
With the second approach, we need to reserve some character code as end marker. That means it can’t appear inside the data. So then the world came up with escape sequences to work around this limitation. That’s why to enter a quote inside a string in C, you have to use a backslash:
"this is a quoted \" inside a string" – and then the backslash itself becomes special, so a literal backslash has to be written as \\. It’s all solvable, of course… but messy.
The third approach uses a different trick: we send whatever we like, and then we use a separate means of communication to signal the end or some other state change. We could use two separate communication lines for example, sending data over one and control information over the other. Or close the socket when done, as with TCP/IP.
If you don’t get this stuff right, you can get into a lot of trouble. Like in the 1960s, when telephone companies used “in-band” tones on a telephone line to pass along routing or even billing information. Some clever guys got pretty famous for exploiting that – simply by inserting a couple of tones into the conversation!
So how about Bencode, eh?
Well, I think it hits the sweet spot in tradeoffs. It’s more or less based on the second mechanism, using a few delimiters and special characters to signal the start and end of various types of data, while switching to a byte-counted prefix for the things that matter: strings with arbitrary content (hence including any bit pattern). And it sure helps that we often tend to know the sizes of our strings up front.
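For reference, the format itself is tiny: integers are sent as `i<digits>e`, strings as `<length>:<bytes>`, lists as `l…e`, and dictionaries as `d…e` with byte-string keys in sorted order. Here is a minimal encoder sketch in Python. The `bencode` name and the choice to send text as UTF-8 are my own assumptions, not something the format prescribes.

```python
def bencode(value):
    """Encode ints, byte strings, text, lists and dicts as Bencode bytes."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        # delimiters: 'i' ... 'e' around plain decimal digits
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")  # assumption: text goes out as UTF-8
    if isinstance(value, (bytes, bytearray)):
        # byte-counted prefix: the payload may contain any bit pattern
        return str(len(value)).encode("ascii") + b":" + bytes(value)
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        pairs = {}
        for key, item in value.items():
            if isinstance(key, str):
                key = key.encode("utf-8")
            if not isinstance(key, bytes):
                raise TypeError("dictionary keys must be strings")
            pairs[key] = item
        out = b"d"
        for key in sorted(pairs):  # keys are emitted in sorted order
            out += bencode(key) + bencode(pairs[key])
        return out + b"e"
    raise TypeError("cannot bencode " + type(value).__name__)


print(bencode([1, "abc", {"k": -5}]))  # b'li1e3:abcd1:ki-5eee'
```

Only the string case uses a count. Everything else is bracketed by a start character and a closing `e`.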
With Bencode, you don’t have to first build up the entire message in memory (or generate it twice) to find out how many bytes will be sent – as would be required if the whole message had to carry a size prefix. Yet the receiver can still prepare for the bigger memory requirements, because each string is prefixed with the number of bytes to come.
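Both halves of that argument can be sketched in a few lines of Python. The helper names (`write_int`, `send_readings`, `read_bytes`) are hypothetical, and `io.BytesIO` merely stands in for a serial port or socket.

```python
import io


def write_int(out, number):
    out.write(b"i" + str(number).encode("ascii") + b"e")


def send_readings(out, readings):
    """Emit a Bencode list item by item; the total size is never computed."""
    out.write(b"l")
    for reading in readings:  # may be a generator of unknown length
        write_int(out, reading)
    out.write(b"e")


def read_bytes(inp):
    """Read one Bencode string; the length arrives first, so the
    receive buffer can be allocated before any payload shows up."""
    digits = b""
    while True:
        char = inp.read(1)
        if char == b":":
            break
        if not char.isdigit():  # also catches end of input (empty read)
            raise ValueError("malformed string length")
        digits += char
    size = int(digits)
    buffer = bytearray(size)  # sized up front from the prefix
    if inp.readinto(buffer) != size:
        raise ValueError("input ended inside a string")
    return bytes(buffer)


wire = io.BytesIO()
send_readings(wire, (x * x for x in range(4)))
print(wire.getvalue())                         # b'li0ei1ei4ei9ee'
print(read_bytes(io.BytesIO(b"5:hello...")))   # b'hello'
```

The sender starts transmitting before it knows how many readings there will be, and the receiver still knows the size of every string before its first byte arrives.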
Also, having an 8-bit clean data path really offers a lot of convenience, because any set of bytes can be pushed through without any processing: 32-bit or 64-bit floats, binaries, ZIP files, MP3s, video files – anything.
Another pretty clever little design choice is that neither string lengths nor signed integers are limited in size or magnitude in this protocol. They both use the natural decimal notation we all use every day. A bigger number is simply a matter of sending more digits. And if you want to send data in multiple pieces: send them as a list.
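On the decoding side, a language with arbitrary-precision integers gets this for free. Here is a minimal decoder sketch in Python. The `bdecode` name and its `(value, next_position)` return convention are assumptions of mine, and it is lenient about malformed input, so treat it as an illustration rather than a validating parser.

```python
def bdecode(data, pos=0):
    """Decode one Bencode value from bytes; return (value, next_position)."""
    char = data[pos : pos + 1]
    if char == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1 : end]), end + 1  # any number of digits
    if char == b"l":
        pos += 1
        items = []
        while data[pos : pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if char == b"d":
        pos += 1
        result = {}
        while data[pos : pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            result[key], pos = bdecode(data, pos)
        return result, pos + 1
    if char.isdigit():
        colon = data.index(b":", pos)
        size = int(data[pos:colon])
        start = colon + 1
        return data[start : start + size], start + size
    raise ValueError("unexpected byte at position %d" % pos)


big, _ = bdecode(b"i123456789012345678901234567890e")
print(big)  # 123456789012345678901234567890

pieces, _ = bdecode(b"l3:abc3:defe")  # data sent in multiple pieces, as a list
print(b"".join(pieces))  # b'abcdef'
```

On a small µC you would of course have to pick a maximum integer width, but that is a limit of the receiver, not of the format.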
Lastly, this format has the property that if all you send is numerical and plain ASCII data, then the encoded string will also only consist of plain text. No binary codes or delimiters in sight, not even for the string sizes. That can be a big help when trying to debug things.
Yep – an elegant set of compromises and design choices indeed, this “Bencode” thing!