After having constructed some Bencode-formatted data, let’s try and read this stuff back.
This is considerably more involved. We’re not just collecting incoming bytes, we’re trying to make sense of it, figuring out where the different strings, ints, lists, and dicts begin and end, and – not obvious either perhaps – we’re trying to manage the memory involved in a maximally frugal way. RAM is a scarce resource on a little embedded µC, so pointers and malloc’s have to be avoided where possible.
Let’s start with the essential parsing task of figuring out when an incoming byte stream contains a complete Bencoded data structure. IOW, I want to keep on reading bytes until I have a complete “packet”.
The following serialReceive sketch does just that:
All of the heavy-lifting is done in the EmBencode library. What we’re doing here is giving incoming bytes to the decoder, until it tells us that we have a complete packet. Here’s the output, using the test data created yesterday:
Looks obvious, but this thing has fully “understood” the incoming bytes to the point that it knows when the end of each chunk has been reached. Note that in the case of a nested list, all the nesting is included as part of one chunk.
There’s more to this than meets the eye.
First of all, this is a “push” scanner (aka “lexer”). Think about how you’d decode such a data stream. By far the simplest would be something like this (in pseudo code):
- look at the next character
- if it’s a digit:
- get the length, the “:”, and then grab the actual string data
- if it’s an “i”:
- convert the following chars to a signed long and drop the final “e”
- if it’s a “d” or an “l”:
- recursively grab entries, until we reach an “e” at the end of this dict / list
But that assumes you’re in control – you’re asking for more data, IOW you’re waiting until that data comes in.
What we want though, is to treat this processing as a background task. We don’t know when data comes in and maybe we’d like to do other stuff or go into low-power mode in the meantime. So instead of the scanner “pulling” in data, we need to turn its processing loop inside out, giving it data when there is some, and having it tell us when it’s done.
The code for this is in EmBencode.cpp:
It’s written in (dense) C++ and implemented as a finite state machine (FSM). This means that we switch between a bunch of “states” as scanning progresses. That state is saved between calls, so that we’ll know what to do with the next character when it comes in.
There’s a fair amount of logic in the above code, but it’s a useful technique to have in your programming toolset, so I’d recommend going through this if FSM’s are new to you. It’s mostly C really, if you keep in mind that all the variables not declared locally in this code are part of a C++ object and will be saved from call to call. The “EMB_*” names are just arbitrary (and unique) constants. See if you can follow how this code flows as each character comes in.
The above code needs 7 bytes of RAM, plus the buffer used to store the incoming bytes.
There are tools such as re2c which can generate similar code for you, given a “token grammar”, but in simple cases such as this one, being able to wrap your mind around such logic is useful. Especially for embedded software on limited 8-bit microcontrollers, where we often don’t have the luxury of an RTOS with multiple “tasks” operating in parallel.
Halfway there: we’ve identified the pieces, but we’re not yet doing anything with ‘em!