Computing stuff tied to the physical world

RF bootstrap design

In Software on May 24, 2011 at 00:01

After some discussion on the forum, I’d like to present a draft design for an over-the-air bootstrap mechanism, IOW: being able to upload a sketch to a remote JeeNode over wireless.

Warning: there is no release date. It’ll be announced when I get it working (unless someone else gets there first). This is just to get some thoughts down, and have a first mental design to think about and shoot at.

The basic idea is that each remote node contacts a boot server after power up, or when requested to do so by the currently running sketch.

Each node has a built-in unique 2-byte remote ID, and is configured to contact a specific boot server (i.e. RF12 band, group, and node ID).

STEP 1

First we must find out what sketch should be running on this node. This is done by sending out a wireless packet to the boot server and waiting for a reply packet:

  • remote -> server: initial request w/ my remote ID
  • server -> remote: reply with 12 bytes of data

These 12 bytes are encrypted using a pre-shared secret key (PSK), which is unique for each node and known only to that node and the boot server. No one but the boot server can send a valid reply, and no one but the remote node can decode that reply properly.

The reply contains 6 values:

  1. remote ID
  2. sketch ID
  3. sketch length in bytes
  4. sketch checksum
  5. extra sketch check
  6. checksum over the above values 1..5
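As a rough illustration, the six reply values listed above could be laid out and checked as follows. Several things here are assumptions rather than part of the design: that each value is 16 bits wide (which is what makes six values fit in 12 bytes), that the checksum is a CRC-16 with polynomial 0xA001, and all of the names (`BootReply`, `boot_reply_seal`, `boot_reply_valid`). Decryption with the PSK is left out, so the code starts from an already decoded reply.

```c
#include <stddef.h>
#include <stdint.h>

/* The six reply values, assumed here to be 16 bits each (6 x 2 = 12 bytes). */
typedef struct {
    uint16_t remote_id;    /* 1. remote ID */
    uint16_t sketch_id;    /* 2. sketch ID */
    uint16_t sketch_len;   /* 3. sketch length in bytes */
    uint16_t sketch_crc;   /* 4. sketch checksum */
    uint16_t extra_check;  /* 5. extra sketch check */
    uint16_t header_crc;   /* 6. checksum over values 1..5 */
} BootReply;

/* CRC-16, polynomial 0xA001: an assumed choice, the post names no algorithm. */
static uint16_t crc16_update(uint16_t crc, uint8_t data) {
    crc ^= data;
    for (int i = 0; i < 8; ++i)
        crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    return crc;
}

/* Checksum over values 1..5, fed low byte first so the result does not
   depend on the host's byte order. */
static uint16_t boot_reply_crc(const BootReply *r) {
    const uint16_t v[5] = { r->remote_id, r->sketch_id, r->sketch_len,
                            r->sketch_crc, r->extra_check };
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < 5; ++i) {
        crc = crc16_update(crc, (uint8_t)(v[i] & 0xFF));
        crc = crc16_update(crc, (uint8_t)(v[i] >> 8));
    }
    return crc;
}

/* Server side: fill in value 6 before encrypting and sending. */
static void boot_reply_seal(BootReply *r) {
    r->header_crc = boot_reply_crc(r);
}

/* Remote side: accept a decoded reply only if it is addressed to us
   and its checksum holds. */
static int boot_reply_valid(const BootReply *r, uint16_t my_id) {
    return r->remote_id == my_id && r->header_crc == boot_reply_crc(r);
}
```

A reply decoded with the wrong key would come out as garbage, so in this illustration both the remote ID comparison and the checksum would be expected to fail for it.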

After decoding this info, the remote knows:

  • that the reply is valid and came from a trusted boot server
  • what sketch should be present in flash memory
  • how to verify that the stored sketch is complete and correct
  • how to verify the next upload, if we decide to start one

The remote has a sketch ID, length and checksum stored in EEPROM. If they match with the reply and the sketch in memory has the correct checksum, then we move forward to step 3.

If no reply comes in within a reasonable amount of time, we also jump to step 3.

STEP 2

Now we need to update the sketch in flash memory. We know the sketch ID to get, we know how to contact the boot server, and we know how to verify the sketch once it has been completely transferred to us.

So this is where most of the work happens: send out a request for some bytes, and wait for a reply containing those bytes – then rinse and repeat for all bytes:

  • remote -> server: request data for block X, sketch Y
  • server -> remote: reply with a check value (X ^ Y) and 64 bytes of data

The remote node gets data 64 bytes at a time, and burns them to flash memory. The process repeats until all data has been transferred. Timeouts and bad packets lead to repeated requests.

The last reply contains 0..63 bytes of data, indicating that it is the final packet. The remote node saves this to flash memory, and goes to step 3.
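The request loop of step 2 might look roughly like this. The radio is replaced by a function pointer (`FetchFn`), and `demo_fetch` is a made-up stand-in for the boot server so that the loop can be run on a desktop machine. The retry limit (`max_retries`) is an assumption as well, since the design above only says that failed requests are repeated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_SIZE = 64 };

/* Hypothetical transport hook: asks for block X of sketch Y, returns the
   payload length (0..64), or -1 on a timeout. */
typedef int (*FetchFn)(uint16_t block, uint16_t sketch,
                       uint16_t *check, uint8_t *buf);

/* Fetch all blocks into 'flash'. Returns the sketch length, or -1 if one
   block fails more than max_retries times in a row. */
static long download_sketch(uint16_t sketch_id, FetchFn fetch,
                            uint8_t *flash, size_t flash_size,
                            int max_retries) {
    uint8_t buf[BLOCK_SIZE];
    size_t total = 0;
    uint16_t block = 0;
    int failures = 0;
    for (;;) {
        uint16_t check = 0;
        int n = fetch(block, sketch_id, &check, buf);
        int ok = n >= 0 && n <= BLOCK_SIZE
              && check == (uint16_t)(block ^ sketch_id)  /* the X ^ Y check */
              && total + (size_t)n <= flash_size;
        if (!ok) {
            if (++failures > max_retries)
                return -1;
            continue;                    /* repeat the same request */
        }
        failures = 0;
        memcpy(flash + total, buf, (size_t)n);
        total += (size_t)n;
        if (n < BLOCK_SIZE)              /* a short packet is the final one */
            return (long)total;
        ++block;
    }
}

/* Made-up boot server holding a small test image, for illustration only. */
static uint8_t demo_image[130];
static size_t demo_len = sizeof demo_image;
static int demo_requests;

static int demo_fetch(uint16_t block, uint16_t sketch,
                      uint16_t *check, uint8_t *buf) {
    size_t off = (size_t)block * BLOCK_SIZE;
    ++demo_requests;
    if (off > demo_len)
        return -1;
    size_t n = demo_len - off;
    if (n > BLOCK_SIZE)
        n = BLOCK_SIZE;
    memcpy(buf, demo_image + off, n);
    *check = (uint16_t)(block ^ sketch);
    return (int)n;
}
```

One consequence of ending on a short packet: a sketch whose length is an exact multiple of 64 needs one extra request, answered with 0 bytes of data. A 128-byte sketch therefore takes three requests, the same number as a 130-byte one.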

STEP 3

Now we have the proper sketch, unless something went wrong earlier.

The final step is to verify that the sketch in flash memory is correct, by calculating its checksum and comparing it with the value in EEPROM.

If the checksum is bad, we set a watchdog timer to reset us in a few seconds, and … power down. All our efforts were in vain, so we will retry later.

Else we have the proper sketch and it’s available in flash memory, so we leave bootstrap mode and launch it.
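The decision in step 3 comes down to one checksum comparison. The following is only an illustration: the CRC-16 (polynomial 0xA001) is an assumed stand-in for the unspecified sketch checksum, and `step3_verify` and `BootAction` are made-up names. The flash contents are passed in as a plain byte array so the logic can be tried off-target.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { BOOT_LAUNCH, BOOT_RESET_AND_RETRY } BootAction;

/* CRC-16, polynomial 0xA001: assumed, the post does not name an algorithm. */
static uint16_t crc16_update(uint16_t crc, uint8_t data) {
    crc ^= data;
    for (int i = 0; i < 8; ++i)
        crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    return crc;
}

/* Checksum over the stored sketch image. */
static uint16_t sketch_checksum(const uint8_t *flash, size_t len) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i)
        crc = crc16_update(crc, flash[i]);
    return crc;
}

/* Compare against the value kept in EEPROM and pick the outcome:
   launch the sketch, or arm the watchdog and power down until the reset. */
static BootAction step3_verify(const uint8_t *flash, size_t len,
                               uint16_t expected) {
    return sketch_checksum(flash, len) == expected ? BOOT_LAUNCH
                                                   : BOOT_RESET_AND_RETRY;
}
```

On an ATmega, the `BOOT_RESET_AND_RETRY` branch would presumably arm the watchdog (avr-libc provides `wdt_enable()` for this) and then put the chip to sleep.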

That’s all!

ROBUSTNESS

This scheme requires a working boot server. If none is within range, the bootstrap will not find out about a new sketch to load, and will either launch the current sketch (if valid), or hit a reset and try booting again a few seconds later.

Not only do we need a working boot server, that server must also have an entry for our remote ID (and our PSK) to be able to generate a properly encrypted reply. The remote ID of a node can be recovered if lost, by resetting the node and listening for the first request it sends out.

If the sketch hangs, then the node will hang. But even then a hard reset or power cycle of the node will again start the boot sequence, and allow us to get a better sketch loaded into the node. The only drawback is that it needs a hard reset, which can’t be triggered remotely (unless the crashing sketch happens to trigger the reset, through the watchdog or otherwise).

Errors during reception lead to a failed checksum at the end, which then leads to a reset and a new boot loading attempt. There is no resume mechanism, so such a case does mean we have to fetch all the data blocks again.

SECURITY

This is the hard part. Nodes which end up running some arbitrary sketch have the potential to cause a lot of damage if they also control real devices (lights are fairly harmless, but thermostats and door locks aren’t!).

The first line of defense comes from the fact that it is the remote node which decides when to fetch an update. You can’t simply send packets and make remote nodes reflash themselves if they don’t want to.

You could interrupt AC mains and force a reset in mains-powered nodes, but I’m not going to address that. Nor am I going to address the case of physically grabbing hold of a node or the boot server and messing with it.

The entire protection is based on that initial reply packet, which tells each remote node what sketch it should be running. Only a boot server which knows the remote node’s PSK is able to send out a reply which the remote node will accept.

It seems to me that the actual sketch data need not be protected, since these packets are only sent out in response to requests from a remote node (which asks for a specific sketch ID). Bad packets of any kind will cause the final checksums to fail, and prevent such a sketch from ever being started.

As for packets flying around in a fully operational home network: that level of security is a completely separate issue. Sketches can implement whatever encryption they like, to secure day-to-day operation. In fact, the RF12 library includes an encryption mechanism based on XTEA for just that purpose – see this weblog post.

But for a bootstrap mechanism, which has to fit in 4 KB including the entire RF12 wireless packet driver, we don’t have that luxury. Which is why I hope that the above will be enough to make it practical – and safe!

  1. looks nice, but will the original sketch loading mechanism (usb/bub) still be there? If not then you have no other means to upload a sketch to the node except to reprogram another bootstrap.

    • That’s indeed an important issue. The way it looks so far, I don’t think I’ll be able to fit both into the bootstrap area.

      The way to handle this is through ISP re-programming of the ATmega, probably with a modified variant of the “Opti-rebooter” (which still isn’t quite right, btw).

      But we ain’t there yet!

    • It would be nice if both could fit, but if you think about how nodes are used and deployed, you could use the standard optiboot loader whilst you’re developing, and then once it’s ready change to the wireless version, put it in a box and screw it into a location.

      That little bit of initial extra work would make life so much easier when the inevitable “Oh, damn, there’s a bug” sneaks up and bites you after you have deployed 4 round your house!

      From my point of view/use I’ve always got a spare node for dev work, because I’m always fiddling, so that would be my standard FTDI loader JeeNode. Once I’m happy with a sketch I would feed it into the sketch deployment system and either leave it for a few minutes/hours (depending on the remote node’s refresh period) or take a quick walk round the house and power cycle all the nodes to force an update request.

  2. Some thoughts about the security concept:

    Shouldn’t the nodes send some random value along with their initial request that is used in the encryption to prevent replay attacks? Think of an attacker that sniffs the reply packets and sketch data of several subsequent updates and can then downgrade your node to a previous version.

    Depending on the strength of the checksum algorithm being used for the actual data, an attacker might even be able to make up a sketch that matches the length and checksum of a previously sniffed version and then replay that header along with his own data.

    One option to prevent downgrading by replay would be to add versioning and make the boot code only accept versions that are higher than the current one.

    • Ah – good one, I hadn’t thought about replay to mimic an obsolete upload.

      Agree, making sure sketch numbers Y always have to increase should solve this. Reverting to an older sketch can still be done, by re-issuing a new version number with the older code.

      (first tests tell me that memory is going to be tight, so simplicity on the loader side is really crucial)
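The "sketch IDs must always increase" rule is small enough to fit in a tight loader. The following is only an illustration: `judge_sketch_id` and the verdict names are made up, and treating 0xFFFF as "no sketch stored yet" is an assumption based on erased EEPROM reading as all ones.

```c
#include <stdint.h>

/* Assumed marker for "nothing stored": erased EEPROM reads as all ones. */
enum { NO_SKETCH = 0xFFFF };

typedef enum { ID_KEEP, ID_UPLOAD, ID_REJECT } IdVerdict;

/* Decide what to do with the sketch ID from a validated reply. */
static IdVerdict judge_sketch_id(uint16_t stored_id, uint16_t offered_id) {
    if (offered_id == NO_SKETCH)
        return ID_REJECT;   /* reserved value, never a real sketch */
    if (stored_id == NO_SKETCH)
        return ID_UPLOAD;   /* fresh node: accept the first sketch */
    if (offered_id == stored_id)
        return ID_KEEP;     /* already have it, only verify the checksum */
    if (offered_id > stored_id)
        return ID_UPLOAD;   /* newer sketch */
    return ID_REJECT;       /* older ID: possibly a replayed reply */
}
```

Rolling back to older code then means publishing that code again under a new, higher ID, as described above.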

    • “Depending on the strength of the checksum algorithm being used for the actual data, an attacker might even be able to make up a sketch that matches the length and checksum of a previously sniffed version and then replay that header along with his own data.”

      There’s a second checksum, which depends on the PSK and differs for each remote node. So although the content, size, and checksum are public, that second check can’t be calculated, only brute-force’d. My thought was to use the checksum of the PSK as seed for calculating that second checksum.
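The seeding idea can be illustrated as follows. Everything concrete here is an assumption: the CRC-16 (polynomial 0xA001) stands in for whatever checksum ends up being used, the 0xFFFF starting value is arbitrary, and `extra_check` is a made-up name. The PSK length is left as a parameter because it is not specified.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16, polynomial 0xA001: assumed, no algorithm has been chosen yet. */
static uint16_t crc16_update(uint16_t crc, uint8_t data) {
    crc ^= data;
    for (int i = 0; i < 8; ++i)
        crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    return crc;
}

/* Run the CRC over a buffer, starting from a given value. */
static uint16_t crc16_buf(uint16_t crc, const uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; ++i)
        crc = crc16_update(crc, p[i]);
    return crc;
}

/* Public checksum: identical for every node that runs this sketch. */
static uint16_t sketch_checksum(const uint8_t *sketch, size_t len) {
    return crc16_buf(0xFFFF, sketch, len);
}

/* Second check: the checksum of the PSK is used as the seed, so the
   result differs per node and cannot be computed without the PSK. */
static uint16_t extra_check(const uint8_t *psk, size_t psk_len,
                            const uint8_t *sketch, size_t len) {
    uint16_t seed = crc16_buf(0xFFFF, psk, psk_len);
    return crc16_buf(seed, sketch, len);
}
```

With a 16-bit result there are only 65536 possible values, which is the brute-force limit mentioned above.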
