Computing stuff tied to the physical world

What’s the deal with Bencode?

In Software on Oct 5, 2012 at 00:01

These past few days, I’ve explored some of the ways we could use Bencode as data format “over the wire” in both very limited embedded µC contexts and in scripting languages.

Calling it “over the wire” is a bit more down-to-earth than calling Bencode a data serialisation format. Same thing really: a way to transform a possibly complex nested (but acylic) data structure into a sequence of bytes which can be sent (or stored!) and then decoded at a later date. The world is full of these things, because where there is communication, there is a need to get stuff across, and well… getting stuff across in the world of computers tends to happen byte-by-byte (or octet-by-octet, if you prefer).

When you transfer things as bytes, you have to delimit the different pieces somehow. The receiver needs to know where one “serialised” piece of information ends and the next starts.

There are three ways to send multi-byte chunks and keep track of those boundaries:

  • send a count, send the data, rinse and repeat
  • send the bytes, then add a special byte marker at the end
  • send the bytes and use some “out-of-band” mechanism to signal the other side

Each of them has major implications and trade-offs for how a transmission works. With counts, if there is any sort of error, we’re hosed – because we lose sync and no guaranteed way to ever recover from it.

With the second approach, we need to reserve some character code as end marker. That means it can’t appear inside the data. So then the world came up with escape sequences to work around this limitation. That’s why to enter a quote inside a string in C, you have to use a backslash: "this is a quoted \" inside a string" – and then you lose the backslash. It’s all solvable, of course… but messy.

The third approach uses a different trick: we send whatever we like, and then we use a separate means of communication to signal the end or some other state change. We could use two separate communication lines for example, sending data over one and control information over the other. Or close the socket when done, as with TCP/IP.

If you don’t get this stuff right, you can get into a lot of trouble. Like when in the 60’s, telephone companies used “in-band” tones on a telephone line to pass along routing or even billing information. Some clever guys got pretty famous for that – simply inserting a couple of tones into the conversation!

Closer to home, as noted recently – even the Arduino bootloader got bitten by this.

So how about Bencode, eh?

Well, I think it hits the sweet spot in tradeoffs. It’s more or less based on the second mechanism, using a few delimiters and special characters to signal the start and end of various types of data, while switching to a byte-counted prefix for the things that matter: strings with arbitrary content (hence including any bit pattern). And it sure helps that we often tend to know the sizes of our strings up front.

With Bencode, you don’t have to first build up the entire message in memory (or generate it twice) to find out how many bytes will be sent – as required if we had to use a size prefix. Yet the receiver also can prepare for all the bigger memory requirements, because strings are still prefixed with the number of bytes to come.

Also, having an 8-bit clean data path really offers a lot of convenience. Because any set of bytes can be pushed through without any processing. Like 32-bit or 64-bit floats, binaries, ZIP’s, MP3’s, video files – anything.

Another pretty clever little design choice is that neither string lengths nor signed integers are limited in size or magnitude in this protocol. They both use the natural decimal notation we all use every day. A bigger number is simply a matter of sending more digits. And if you want to send data in multiple pieces: send them as a list.

Lastly, this format has the property that if all you send is numerical and plain ASCII data, then the encoded string will also only consist of plain text. No binary codes or delimiters in sight, not even for the string sizes. That can be a big help when trying to debug things.

Yep – an elegant set of compromises and design choices indeed, this “Bencode” thing!

  1. There is a way to counter errors in data transmission or storage: by using error detecting codes such as a checksum, or even error correcting codes if you can afford the overhead. IPv4 packets described in RFC791 http://www.ietf.org/rfc/rfc791.txt protect the packet header in this way and if you want more reliability you use the Transmission Control Protocol (TCP) described in RFC793 http://www.ietf.org/rfc/rfc793.txt on top which protects the data with a checksum too.

    • Good point. Bencode does not address this issue, it’s only a serialisation format. Proper transmission needs to be verified either at a lower level (blocks of bytes with a checksum) or at a higher level (sending out a header and/or footer with a checksum. For example: send a special-purpose second message with the checksum of the one preceding it. This could again use Bencode (or specific flag bytes to create an out-of-band escape mechanism).

      I didn’t add it here, because the data will also be sent across sockets a lot. Real “physical” hardware comms will need it, whenever there’s a chance of data loss or corruption.

      Error correction is probably more a design decision tied to the actual hardware and required level of protection.

  2. Nice articles, how does BSON (http://bsonspec.org/) compare? It seems a bit more complex due to datatype support.

    “BSON is a bin­ary-en­coded seri­al­iz­a­tion of JSON-like doc­u­ments. BSON is designed to be lightweight, traversable, and efficient. BSON, like JSON, supports the embedding of objects and arrays within other objects and arrays.”

    • A protocol with datatype support is bound to need revision from time to time (“no MONEY type?”, “no SHA1 hash?”, etc). It’s already happening a bit, it seems, as some types are currently being deprecated, according to the BSON spec.

      There’s a problem with that: any new datatype added makes the protocol unusable by older versions.

      BSON is halfway between a non-descriptive format (such as Protocol buffers), where you need external knowledge to parse the data, and a fully self-descriptive format which tells the decoder exactly what’s coming. It embeds some decisions as fixed codes (to save space), but once you mess with those codes, you break the mechanism.

      In Bencode, there is absolutely minimal datatyping. Everything else is pushed back to a higher level, which can encode its data as either numbers or bytes, and has to find some conventions for passing type information along (either implied, or explicitly sent as part of the data stream). So Bencode decoders which don’t understand certain types added on top can still ignore them (or pass them along opaquely), without missing a beat.

      In short: Bencode is more basic than BSON. But this also makes the protocol independent of whatever datatypes one would like to use. You get two types of scalars (integers and byte-strings) and you get two aggregation mechanisms (lists and dictionaries). That’s it – everything else needs to be mapped onto those basic types. None of them will ever become obsolete, so the decoders never need to be versioned to deal with changes to this serialisation choice.

      Having said that. Nothing prevents you from BSON-encoding some data, and passing it along as a bunch of bytes inside Bencode. Or vice-versa.

  3. Good break-down, thanks. I wonder what your use-case for Bencode in JeeLabs is, is this about hopping or even mesh-forwarding a data set across an unreliable-powered set of unreliably-connected nodes?

    • The main purpose of this is to use it in combination with ZeroMQ, i.e. on a Linux host.

      For packet data, this is still too verbose (e.g. room node: bencoded 20b, packed C struct 4b). I’ve been exploring an absolutely-minimal “packed Bencode” idea, but that’s still 8b. For comparison: JSON 16b, protocol buffers 11b. Assuming about 10-bytes overhead in the RF12 driver, sending 14 vs 30 bytes, is still a reduction in transmit power consumption by more than half.

      Yep, this is yak-shaving in the extreme :)

Comments are closed.