After some discussion on the forum, I’d like to present a draft design for an over-the-air bootstrap mechanism, IOW: being able to upload a sketch to a remote JeeNode over wireless.
Warning: there is no release date. It’ll be announced when I get it working (unless someone else gets there first). This is just to get some thoughts down, and have a first mental design to think about and shoot at.
The basic idea is that each remote node contacts a boot server after power up, or when requested to do so by the currently running sketch.
Each node has a built-in unique 2-byte remote ID, and is configured to contact a specific boot server (i.e. RF12 band, group, and node ID).
STEP 1
First we must find out what sketch should be running on this node. This is done by sending out a wireless packet to the boot server and waiting for a reply packet:
- remote -> server: intial request w/ my remote ID
- server -> remote: reply with 12 bytes of data
These 12 bytes are encrypted using a pre-shared secret key (PSK), which is unique for each node and known only to that node and the boot server. No one but the boot server can send a valid reply, and no one but the remote node can decode that reply properly.
The reply contains 6 values:
- remote ID
- sketch ID
- sketch length in bytes
- sketch checksum
- extra sketch check
- checksum over the above values 1..5
After decoding this info, the remote knows:
- that the reply is valid and came from a trusted boot server
- what sketch should be present in flash memory
- how to verify that the stored sketch is complete and correct
- how to verify the next upload, if we decide to start one
The remote has a sketch ID, length and checksum stored in EEPROM. If they match with the reply and the sketch in memory has the correct checksum, then we move forward to step 3.
If no reply comes in within a reasonable amount of time, we also jump to step 3.
STEP 2
Now we need to update the sketch in flash memory. We know the sketch ID to get, we know how to contact the boot server, and we know how to verify the sketch once it has been completely transferred to us.
So this is where most of the work happens: send out a request for some bytes, and wait for a reply containing those bytes – then rinse and repeat for all bytes:
- remote -> server: request data for block X, sketch Y
- server -> remote: reply with a check value (X ^ Y) and 64 bytes of data
The remote node gets data 64 bytes at a time, and burns them to flash memory. The process repeats until all data has been transferred. Timeouts and bad packets lead to repeated requests.
The last reply contains 0..63 bytes of data, indicating that it is the final packet. The remote node saves this to flash memory, and goes to step 3.
STEP 3
Now we have the proper sketch, unless something went wrong earlier.
The final step is to verify that the sketch in flash memory is correct, by calculating its checksum and comparing it with the value in EEPROM.
If the checksum is bad, we set a watchdog timer to reset us in a few seconds, and … power down. All our efforts were in vain, so we will retry later.
Else we have the proper sketch and it’s available in flash memory, so we leave bootstrap mode and launch it.
That’s all!
ROBUSTNESS
This scheme requires a working boot server. If none is found or in range, then the bootstrap will not find out about a new sketch to load, and will either launch the current sketch (if valid), or hit a reset and try booting again a few seconds later.
Not only do we need a working boot server, that server must also have an entry for our remote ID (and our PSK) to be able to generate a properly encrypted reply. The remote ID of a node can be recovered if lost, by resetting the node and listening for the first request it sends out.
If the sketch hangs, then the node will hang. But even then a hard reset or power cycle of the node will again start the boot sequence, and allows us to get a better sketch loaded into the node. The only drawback is that it needs a hard reset, which can’t be triggered remotely (unless the crashing sketch happens to trigger the reset, through the watchdog or otherwise).
Errors during reception lead to a failed checksum at the end, which then leads to a reset and a new boot loading attempt. There is no resume mechanism, so such a case does mean we have to fetch all the data blocks again.
SECURITY
This is the hard part. Nodes which end up running some arbitrary sketch have the potential to cause a lot of damage if they also control real devices (lights are fairly harmless, but thermostats and door locks aren’t!).
The first line of defense comes from the fact that it is the remote node which decides when to fetch an update. You can’t simply send packets and make remote nodes reflash themselves if they don’t want to.
You could interrupt AC mains and force a reset in mains-powered nodes, but I’m not going to address that. Nor am I going to address the case of physically grabbing hold of a node or the boot server and messing with it.
The entire protection is based on that initial reply packet, which tells each remote node what sketch it should be running. Only a boot server which knows the remote node’s PSK is able to send out a reply which the remote node will accept.
It seems to me that the actual sketch data need not be protected, since these packets are only sent out in response to requests from a remote node (which asks for a specific sketch ID). Bad packets of any kind will cause the final checksums to fail, and prevent such a sketch from ever being started.
As for packets flying around in a fully operational home network: that level of security is a completely separate issue. Sketches can implement whatever encryption they like, to secure day-to-day operation. In fact, the RF12 library includes an encryption mechanism based on XTEA for just that purpose – see this weblog post.
But for a bootstrap mechanism, which has to fit in 4 Kb including the entire RF12 wireless packet driver, we don’t have that luxury. Which is why I hope that the above will be enough to make it practical – and safe!