Post-mortem of a bug Mar 2017
As everyone knows, the later a bug shows up in production, the more trouble it is to fix…
A while ago, a batch of fresh JeeNode Zero rev4 boards was assembled, in itself the result of quite a bit of experimentation, design work, writing code, trying out things, rearranging the pins - and all of that across a couple of iterations.
Revision 4 is intended to be very close to the “final” design of the JeeNode Zero, and the only way to figure out those last details is to get it out there and push it further.
After assembly, the Mecrisp Forth software was installed, along with a fair amount of code from the embello/explore/1608-forth/jz4 directory on GitHub. So far, so good, it all worked like a charm, the boards worked, the software did its thing, and the radio echo test also passed.
Ready to go, right? So last week, 140 boards were sent off to the UK, where Digital Smarties handles orders, fulfilment, and everything that comes along with that.
For one order in the NL, it was much simpler to send it directly, so one unwitting guinea pig got the board a couple of days early …
… and immediately ran into a bug, right with his very first exploration!
One board worked fine, the other consistently failed. That’s a hardware issue, right? But we’d tested them all, and we were certain the hardware worked (that’s the point of tests after all!).
Then, just to try and understand it, a board was set up at JeeLabs, and … the same test failed.
It wasn’t consistent though, the failure seemed intermittent. And the odd bit: once it went away, it never came back. Hardware? Software? It made no sense!
A trivial test like
1 2 + . worked fine (printing 3), every single time. The
board also always came up fine after power-up and after reset, every single
At this point, climbing up a wall became an attractive thought - how can you debug something, which works on some boards, not others, goes away (!) after a while, and shows that both the hardware and software are performing as expected just about always? It just made no sense.
(which is of course not uncommon with non-trivial bugs …)
The next step luckily started to shed some light. The original failure report was as follows:
adc-init ok. pa0 adc . 38 ok. : forever begin cr pa0 adc . 1000 ms again ; forever Unhandled Interrupt 00000003 ! Stack: [0 ] TOS: 0000002A *> Calltrace: 00000000 00005285 ( 00005256 + 0000002E ) ct-irq [etc...]
Keep in mind that this code didn’t fail on every board, and that even on the same board it sometimes passed, and then never failed again, not even after a reset.
Could it be the power supply? The rev4 has a larger capacitor than rev3 (100 µF iso 22 µF). Nope, that’s not it: the datasheet says that the rise time of Vcc is allowed to be arbitrarily long.
Could it be the ADC, or the systick timer interrupt? Nope, even this tiny test failed:
: forever begin again ; forever
Ehm… sometimes. Getting the failure to happen took some effort. And then it became clear: only a power down of at least a few seconds would trigger the problem, and then in fact it turned out to be fully reproducible, at least on the same board!
Could it be a bug in all that embello code added to Mecrisp and stored in flash? After
init word override does a few things on each reset. This was ruled
out by erasing all words added to flash. And that
forever loop kept failing.
And now the crazy part - this code passed every single time:
: a ; : forever begin again ; forever
Yet this failed:
: a ; a
Finally, a first hypothesis emerged for what might be happening: maybe the generated code was bad, and maybe it was only bad right after power up. But not all the time, else this would have become apparent a long time ago (Mecrisp Forth’s compiler kicks into action all the time, for anyone typing word defintions into it!).
But how can this be? If the generated code was bad, wouldn’t that end up getting compiled into flash memory and lead to builds which would consistently fail?
The next clue came from the disassembler, which is available as an add-on in Mecrisp. By loading this into flash memory, we can disassemble the code before running it. And yep, something strange turned up:
Mecrisp-Stellaris RA 2.3.3 with M0 core for STM32L053C8 by Matthias Koch : a ; ok. see a 20000454: AEAD add r6 sp #2B4 20000456: 2600 movs r6 #0 20000458: 43F6 mvns r6 r6 2000045A: E000 b 2000045E 2000045C: 2600 movs r6 #0 2000045E: 4770 bx lr ok. : a ; Redefine a. ok. see a 20000468: 4770 bx lr ok.
The first definition of “
a” generates different code from the second one. That
can’t be right.
And now the puzzle pieces fall into place:
- if there is a problem with uninitialised RAM in Mecrisp, then that would only happen on some units, and it would not reappear once the init is done
- a > 5s power down will reset RAM to the same state, whereas a brief power loss will not (bits in RAM tend to linger a bit in the same state after power loss)
- the first word definition generates bad code, but once initialised, all is well
Time to get in touch with Matthias - Mecrisp’s author - who was just as surprised by this, and immediately dove in and started checking all the code paths into the code optimiser.
And within hours, he nailed it:
initis called on power up (which is not the case on original Mecrisp’s, but always the case for the “spezial” mods used on all embello builds), then there is one path where an essential compiler init step is bypassed (the Mecrisp core is all assembly code)
After that, it was a matter of time before Matthias had a fix, the fix got tested, verifying that it handled the exact case we ran into, and… Mecrisp 2.3.5 was released, less than an hour later!
And now the race was on, as always with such “late” issues: generating the new image to flash onto a JeeNode Zero rev4, making absolutely sure that this is “the fix and nothing but the fix”, and then collaborating with the UK shop to get an upload mechanism going, getting the code onto each unit, and verifying that the upload succeeded and that each unit starts up properly.
The result is that no-one receiving JNZ’s from the UK shop will ever run into this!
It’s been an interesting experience. It’s also a fantastic tribute to open source, with Matthias stepping in on extremely short notice, and exactly identifying and squashing this nasty bug.
And for once, the shipping delay from NL to UK has allowed us to beat the odds :)