Computing stuff tied to the physical world

Decoding Bencode data

In AVR, Software on Oct 3, 2012 at 00:01

After the encoding and collection of Bencoded data, all the pieces are finally in place to actually process that incoming data again.

I’ve extended the EmBencode library and the serialReceive demo sketch to illustrate the process:

Screen Shot 2012 09 30 at 15 15 53

Let’s try it with the same test data, i.e. the following text (all on one line):

    5:abcde3:123i12345eli987eli654eei321ee
    i999999999ed3:onei11e3:twoi22ee4:bye!

Here is the output:

Screen Shot 2012 09 30 at 15 18 03

You can see the pieces being separated, and all the different parts of the message being decoded properly. One convenient feature is that the asString() and asNumber() calls can be mixed at will. So strings that come in but are actually numeric can be decoded as numbers, and numbers can be extracted as strings. For binary data, you can get at the exact string length, even when there are 0-bytes in the data.

The library takes care of all string length manipulations, zero-byte termination (which is not in the encoded data!), and buffer management. The incoming data is in fact not copied literally to the buffer, but stored in a convenient way for subsequenct use. This is done in such a way that the amount of buffer space needed never exceeds the size of the incoming data.

The general usage guideline for this decoder is:

  • while data comes in, pass it to the decoder
  • when the decoder says it’s ready, switch to getting the data out:
    • call nextToken() to find out the type of the next data item
    • if it’s a string or a number, pull out that value as needed
  • the last token in the buffer will always be T_END
  • reset the decoder to prepare for the next incoming data

Note that dict and list nesting is decoded, but there is no processing of these lists – you merely get told where a dict or list starts, and where it ends, and this can happen in a nested fashion.

Is it primitive? Yes, quite… but it works!

But the RAM overhead is just the buffer and a few extra bytes (8 right now). And the code is also no more than 2..3 KB. So this should still fit fairly comfortably in an 8-bit ATmega (or perhaps even ATtiny).

Note that the maximum buffer size (hence packet size) is about 250 bytes. But that’s just this implementation – Bencoded data has no limitations on string length, or even numeric accuracy. That’s right: none, you could use Bencode for terabytes if you wanted to.

PS – It’s worth pointing out that this decoder is driven as a push process, but that actual extraction of the different items is a pull mechanism: once a packet has been received, you can call nextToken() as needed without blocking. Hence the for loop.

  1. Neat format. But couldn*t you use a semicolon or a period or something equally visually distinctive, instead of “e”? That would make the bare strings much more readable for humans, Also I would require a delimiter at the end of strings. It’s redundant, but redundancy is a good thing when communication is not 100% lossless.

    • Unfortunately it’s not my choice to make, since this is an existing format. Changing it means we’d lose compatibility with all the existing implementations. Couple of them coming up tomorrow, BTW.

      What I do intend to implement is a “relaxed” format, with whitespace and indentation inserted, to make this much easier to read for us mortals. This variation would not be recognised by strict decoders, but those which skip everything they don’t recognise would be able to read it. Such as the one in this post, for example.

  2. Nice implementation!

    At first I could not find the T_END but then I discovered that you add it yourself to the buffer to check for end of decoding ;-)

    And the next blog is about the next layer, ie the one that does the actual interpretation ???

Comments are closed.