I’ve been wondering for some time whether the power consumption of an ATmega varies depending on the code it is running. Obviously, sleep modes and clock rate changes have a major impact – but how about plain loops?
To test this, I uploaded the following sketch into a JeeNode:
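In outline, the sketch looks like this – a minimal reconstruction, with the operand values (123 and 321) and the loop counts inferred from the figures discussed below:

```
// Reconstructed outline of the test sketch: two busy loops, one doing
// 16-bit multiplies, the other doing variable shifts, interrupts off.
// Operands and loop counts are inferred from the rest of this post.

volatile word a = 123;   // volatile, so the compiler can't optimize
volatile word b = 321;   // these calculations away
volatile word c;

void setup () {
    cli();  // no interrupts, so the 1024 µs timer tick can't interfere
}

void loop () {
    for (word i = 0; i < 50000; ++i)
        c = a * b;      // the multiplication loop
    for (word i = 0; i < 5000; ++i)
        c = a << b;     // the variable-shift loop
}
```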
Interrupts were turned off to prevent the normal 1024 µs timer tick from firing and running in between. And I’m using “volatile” variables to make sure the compiler doesn’t optimize these calculations away (as they’re not used).
The result is that the code pattern does indeed show up in the chip’s total current consumption:
The value displayed is the voltage measured over a 10 Ω resistor in series with VCC (the JeeNode I used had no regulator, and was running directly off a 3x AA battery pack @ 3.95V).
What you can see is that the power consumption cycles between about 8.4 mA and 8.8 mA (i.e. 84 to 88 mV across the 10 Ω resistor), just by being in different parts of the code. Surprising perhaps, but it’s clearly visible!
The shifts in the second loop are very slow, because the ATmega has no barrel shifter: it has to use a little loop to shift by N bits. To get a nice picture, those shifts are performed only 5,000 times instead of 50,000.
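In C terms, each variable shift behaves as if it went through a little helper like this (an illustrative sketch, not the actual library code):

```
// Without a barrel shifter, a shift by N costs N single-bit shifts.
word shiftLeft (word value, byte count) {
    while (count--)
        value <<= 1;    // one bit position per loop pass
    return value;
}
```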
The high power consumption is during the multiplication loop, the low consumption is during the shift loop.
In the end, I had to use a couple of tricks to create the above oscilloscope capture, because there was a substantial amount of 50 Hz hum on the measured input signal. Since the repetition rate of the signal I was interested in was not locked to that 50 Hz AC mains signal, most of the “noise” went away by averaging the signal over 128 triggers.
The other trick was to use the scope’s “DC offset” facility to lower the signal by 80 mV. This allows bumping the input sensitivity all the way up to 2 mV/div without the trace running off the screen. An alternative would be to use AC coupling on the input, but then I’d lose the information about the actual DC levels being measured.
What else can we deduce from the above screen shot?
- loop 1 takes 93.48 ms for 50,000 iterations, so one iteration runs in ≈ 1.87 µs
- loop 2 takes 108.52 ms for 5,000 iterations, so one iteration runs in ≈ 21.7 µs
As you can see, shifts by a variable number of bits do take quite a lot of time on an ATmega, relatively speaking!
Update – As noted in the comments, a shift by “321” ends up being done modulo 256, i.e. 65 times. If I change the shift to 3, the run times drop to being comparable to a multiply. The power consumption effect remains.
Cool to see this being done, but what’s really surprising is that bit shifts are an order of magnitude slower than multiplications. I didn’t expect that. Does this mean there are still some inefficiencies in the RF12 driver code? Because there’s a lot of bit shifting going on there. Maybe the compiler optimizes it all?
Very few bit shifts are by a variable number, so I assume that the compiler can properly optimize most of ’em, if not all.
First of all, we are comparing two things that are not equivalent. Only a multiplication by a power of two can be substituted by shifting. And then it depends …
@Vliegendehuiskat: shifts are not in general “slower” than multiplications – it would be wrong to memorize that as a result. It all depends on the architecture of the controller. If the controller has no hardware multiplier and no barrel shifter, a multiplication by a power of two can in general be done more quickly by shifting than by calling a multiplication subroutine.

@jcw: bit shifts by a constant number can in most cases be very nicely optimized by the compiler, but the same is also true for multiplications by powers of two. I do in fact use variable bit shifts quite often (searching bitwise through a variable which holds status information).

I would mostly just let the compiler decide what is best in a given situation – GCC is very good at this. If it finds a situation where it can safely substitute a shift for a multiplication, it will just do it. Compilers are in most cases better than humans at making this sort of decision (ok, I might only be talking about me and other 50+ people here :-)).
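That’s easy to verify on a small case (the function below is just for illustration):

```
// Multiplication by a power of two: gcc strength-reduces this to plain
// shifts on its own, so hand-writing "x << 3" gains nothing.
word timesEight (word x) {
    return x * 8;   // avr-gcc emits three single-bit left shifts here
}
```

Disassembling the result with avr-objdump -S, as suggested further down, shows the generated shifts directly.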
You’re also shifting a register that’s just 16 bits wide by 321 bits (shifting by more than the width of the type is undefined in C anyway). If you changed your program to shift a sensible number of times, such as 8 or 15, the timing would be less skewed in favor of the multiply.
That comes as a surprise to me, jcw. Can you nail it down to which instructions are using more current? Not that there’s much practical use for it, but it’s still interesting.
This is the basis of quite a bit of crypto research – not something most of us need to care about, but serious work gets done here. (Also: monitoring the EM emissions from the CPU as it executes different instructions.)
I did some simple compiler efficiency research here. The compiler seems remarkably efficient. Use

avr-objdump -S

on the generated .elf files.

Our mental abstraction of the chip internals is perhaps dominated by those neat diagrams of buses and ALUs – at the lowest level, registers are being loaded, enabled, and disabled. So the current draw depends on both the data and the instruction being executed.
This effect can actually be used for chip validation by comparing signatures between known good and unknown samples. Measuring the supply current draw accurately is supported by on-chip sensors accessed before bonding and gives the technique its name, IDDQ.
You seem to be shifting by 321 bits! In the real world, I doubt anyone would ever shift by more than 16 – how does the graph change if you are doing saner bitshifts? I assume that the time the loop takes is proportional to the number of shifts made – suggesting that shifting by 30 bits would be similar in magnitude to a multiplication? Or is it not that simple?
I had a look at the assembler. The sequence ends up being a bit like this (avr-gcc 4.7.0):
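(Reconstructed below from libgcc’s __ashlhi3, the helper avr-gcc calls for a variable 16-bit shift.)

```
__ashlhi3:          ; shift r25:r24 left by r22 bits
    rjmp 2f         ; start with the count test
1:  lsl r24         ; shift the low byte, top bit into carry
    rol r25         ; rotate the carry into the high byte
2:  dec r22         ; count down
    brpl 1b         ; loop while the count is still >= 0
    ret
```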
with LO8(321) in r22 and 123 in r24:r25. So the first thing to notice is that it’s actually only shifting 321 & 0xFF = 65 times, not 321 times as we first imagined. However, it’s also easy to see that the time is proportional to the number of shifts plus a fixed overhead, so a shift by 15 will take only about 1/4 the time, and a shift by 7 only about 1/8 the time.

That’s one of the things that make it hard to write good crypto code… And it’s one of the reasons why “crypto chips” are *much* more expensive than standard ones: they implement hardware countermeasures. Look for DPA (Differential Power Analysis) and the more general “side channel attacks” (which cover DPA, timing, and many others).
It might also be worthwhile to point out that the one shift operation (of 321 bits), while consuming less instantaneous power, consumes more energy than one multiplication. The shift consumes 3.95 V × 8.4 mA × 21.8 µs ≈ 723 nJ, while the multiplication consumes only 3.95 V × 8.8 mA × 1.8 µs ≈ 62 nJ, which is about 12 times less.
Would be interesting to see the resulting energy for a multiplication by two and a shift by one bit.
This is how those little 8-bit micros on credit cards are hacked, isn’t it? By monitoring their power consumption and deducing what’s going on inside the micro. It’s called a side channel attack, as I recall…
This is what Simon Foster was hinting at – it’s a cool technique, unless you are trying to defend yourself from it.
http://www.technologyreview.com/news/427139/eavesdropping-antennas-can-steal-your-smart/
Sorry to distract from the main point of the story with an outrageous shift count. I’ve added a note to the post to put those two values in better perspective.
If you wrote the code to explicitly multiply as a series of additions, I doubt there would be any difference, but the multiply instruction triggers a hardware multiplication unit which wakes up and starts working hard.
Since the power consumption will be proportional to the number of gates transitioning, this should be expected.
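For comparison, a multiply written “as a series of additions” would be something like this sketch – ordinary ALU activity only, without waking the hardware multiplier:

```
// Multiply via repeated addition - no hardware MUL instruction used,
// so no extra current spike from the multiplier unit.
word addMul (word a, word b) {
    word r = 0;
    while (b--)
        r += a;     // one addition per count of b
    return r;
}
```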
And listed on hackaday again! http://hackaday.com/2012/06/14/the-effect-of-code-on-power-consumption/
Nice one Jcw, grats!