Computing stuff tied to the physical world

Collecting Bencode data

In AVR, Software on Oct 1, 2012 at 00:01

After having constructed some Bencode-formatted data, let’s try and read this stuff back.

This is considerably more involved. We’re not just collecting incoming bytes, we’re trying to make sense of it, figuring out where the different strings, ints, lists, and dicts begin and end, and – not obvious either perhaps – we’re trying to manage the memory involved in a maximally frugal way. RAM is a scarce resource on a little embedded µC, so pointers and malloc’s have to be avoided where possible.

Let’s start with the essential parsing task of figuring out when an incoming byte stream contains a complete Bencoded data structure. IOW, I want to keep on reading bytes until I have a complete “packet”.

The following serialReceive sketch does just that:

Screen Shot 2012 09 30 at 11 46 31

All of the heavy-lifting is done in the EmBencode library. What we’re doing here is giving incoming bytes to the decoder, until it tells us that we have a complete packet. Here’s the output, using the test data created yesterday:

Screen Shot 2012 09 30 at 11 46 07

Looks obvious, but this thing has fully “understood” the incoming bytes to the point that it knows when the end of each chunk has been reached. Note that in the case of a nested list, all the nesting is included as part of one chunk.

There’s more to this than meets the eye.

First of all, this is a “push” scanner (aka “lexer”). Think about how you’d decode such a data stream. By far the simplest would be something like this (in pseudo code):

  • look at the next character
  • if it’s a digit:
    • get the length, the “:”, and then grab the actual string data
  • if it’s an “i”:
    • convert the following chars to a signed long and drop the final “e”
  • if it’s a “d” or an “l”:
    • recursively grab entries, until we reach an “e” at the end of this dict / list

But that assumes you’re in control – you’re asking for more data, IOW you’re waiting until that data comes in.

What we want though, is to treat this processing as a background task. We don’t know when data comes in and maybe we’d like to do other stuff or go into low-power mode in the meantime. So instead of the scanner “pulling” in data, we need to turn its processing loop inside out, giving it data when there is some, and having it tell us when it’s done.

The code for this is in EmBencode.cpp:

Screen Shot 2012 09 30 at 12 01 52

It’s written in (dense) C++ and implemented as a finite state machine (FSM). This means that we switch between a bunch of “states” as scanning progresses. That state is saved between calls, so that we’ll know what to do with the next character when it comes in.

There’s a fair amount of logic in the above code, but it’s a useful technique to have in your programming toolset, so I’d recommend going through this if FSM’s are new to you. It’s mostly C really, if you keep in mind that all the variables not declared locally in this code are part of a C++ object and will be saved from call to call. The “EMB_*” names are just arbitrary (and unique) constants. See if you can follow how this code flows as each character comes in.

The above code needs 7 bytes of RAM, plus the buffer used to store the incoming bytes.

There are tools such as re2c which can generate similar code for you, given a “token grammar”, but in simple cases such as this one, being able to wrap your mind around such logic is useful. Especially for embedded software on limited 8-bit microcontrollers, where we often don’t have the luxury of an RTOS with multiple “tasks” operating in parallel.

Halfway there: we’ve identified the pieces, but we’re not yet doing anything with ’em!

PS – The new JeeLabs Café website is now alive and kicking.

  1. p>Are the ” hr lines ” leftovers from the conversion?

    Example in

  2. When decoding an INT, it seems that for each character received, you calculate the ‘value’ count=10*count+(ch-(‘0’)). But this value is not used for anything? Then the calculation should perhaps be avoided as it takes quite a few processing cycles.

    Please correct me if I did not understand it correctly.

    • Eagle-eyed reader! :) – yes, this is only used when calculating the string length, and thrown away otherwise.

      This code was merely a quick demo. The latest version does things a bit differently, as you can already see on github. I’ll post about the complete decoder in the next few days, when it gets used to extract incoming data in a more meaningful way.

  3. Did you want to keep around the content such as this, on the new site?

    • Ah, yes, of course! I see that it wasn’t listed on the page (even though that page existed), so we missed it. Fixed. New site makes it easier to auto-list all child pages, so this can’t happen again. Thx!

  4. Thanks! By the way, the new page contains this sentence: “…That means averaging (n) readings together gives you less than SQRT improvement in noise.” The original read SQRT(n) but it looks like the (n) got removed somehow, maybe it looks like some kind of formatting markup(?)

    • It must be Eagle-eye day, today!

      Very odd – I wonder if we’ve uncovered some hidden Textile markup feature. Have added @’s around the text to get it treated as code. Thx for checking.

  5. ALLCAPS(any tag) will trigger html ‘acronym’. Neat effect with “tag” now floating, but mostly undesirable for code snippettes etc…

Comments are closed.