Computing stuff tied to the physical world

Flippin’ bits – revisited

In Software on Nov 14, 2010 at 00:01

For one of the infinite number of projects here at Jee Labs, I wanted to know how fast you can toggle an I/O pin. This was covered in a previous weblog post, but this time I’m going to actually measure the results.

I’m going to focus on sending out 8 pulses in a row, because my goal is to transfer one byte in or out of the system (i.e. software-based SPI).

Here’s the simplest possible way to do it:

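In outline, it looks like this – a sketch of the approach, not the original code, with digital pin 4 assumed throughout to match the bit(4) used in the PIND trick further down:

```c
// Simplest approach: 8 pulses per loop() pass, using digitalWrite().
// Digital pin 4 (PD4 on an ATmega328) is an assumption for illustration.
void setup () {
    pinMode(4, OUTPUT);
}

void loop () {
    for (byte i = 0; i < 8; ++i) {
        digitalWrite(4, HIGH);
        digitalWrite(4, LOW);
    }
}
```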

Measured result = 124.60 kHz, i.e. ≈ 8 µs per bit.

Some of that is loop overhead, and one trick to avoid that is to “unroll” the loop:

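Unrolled, the same sketch becomes (again with digital pin 4 as an assumed example):

```c
// Same 8 pulses, with the for-loop unrolled by hand.
void loop () {
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
    digitalWrite(4, HIGH); digitalWrite(4, LOW);
}
```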

Measured result = 129.14 kHz – no big difference?

The reason is that the Arduino library’s “digitalWrite()” is very slow. So let’s go back to the loop and use a faster mechanism:

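That faster mechanism is writing to the PORTD register directly – a sketch, assuming PD4, i.e. digital pin 4:

```c
// Direct port access: set and clear bit 4 of PORTD.
// With a constant single-bit mask, gcc compiles these read-modify-writes
// down to the 2-cycle SBI / CBI instructions.
void loop () {
    for (byte i = 0; i < 8; ++i) {
        PORTD |= bit(4);    // pin high
        PORTD &= ~ bit(4);  // pin low
    }
}
```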

Measured result = 1.72 MHz – whoa, 0.58 µs per bit!

Now let’s unroll that loop again:

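Same idea as before – the direct port writes, unrolled by hand (sketch, assumed pin PD4):

```c
// Unrolled version of the direct port access loop: 8 pulses, no loop overhead.
void loop () {
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
    PORTD |= bit(4); PORTD &= ~ bit(4);
}
```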

Measured result = 3.03 MHz – yep, now the loop overhead makes a big difference…

Can we do better? Yes, we can (heh) – using the PIND trick to toggle an output:

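On the ATmega, writing a 1 to a bit in the PINx input register toggles the corresponding output bit, so a single write flips the pin. A sketch of the loop (same assumed pin):

```c
// PIND trick: writing a 1 to PINx toggles the matching PORTx output bit.
// Two toggles per pulse, 8 pulses per loop() pass.
void loop () {
    for (byte i = 0; i < 8; ++i) {
        PIND = bit(4);  // toggle: pin goes high
        PIND = bit(4);  // toggle: pin goes low
    }
}
```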

Measured result = 2.16 MHz – that’s ≈ 0.46 µs per bit.

And now, at these speeds, loop unrolling makes an even bigger difference:

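Fully unrolled, the 8 pulses become 16 back-to-back toggles (sketch, assumed pin PD4):

```c
// 16 toggles in a row, i.e. 8 complete pulses, with zero loop overhead.
void loop () {
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
    PIND = bit(4); PIND = bit(4);
}
```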

Measured result = 4.71 MHz!

That’s a bit odd though. The JeeNode is running at 16 MHz, and 4.71 MHz is not a very clear multiple or divisor of anything. How can a regular sequence of statements generate such an irregular frequency? The puzzle is solved by looking at the bigger picture with a scope:

(scope screenshot: short bursts of 8 fast pulses, with gaps in between)

(as you can see, my scope can’t sample 8 MHz square waves with good fidelity)

The scope measurement says it all: these are short bursts, because now the calling overhead of “loop()” is taking almost as much time as the 8 pulses.

If I unroll the loop further to contain several hundred “PIND = bit(4);” statements, the frequency readout increases to 7.82 MHz. IOW, each C statement takes one processor cycle!

Quite an amazing feat – the AVR MCUs are clearly RISC-class processors. And the gcc compiler is generating optimal code in this case.

So there you have it. A lowly JeeNode (or Arduino) can generate multi-megahertz signals on an I/O pin, just by writing a few lines of C code!

Note that these are limiting values. As soon as you start adding more logic to these loops – whether unrolled or not – the maximum attainable frequency will quickly drop. But still: not too shabby for them little chips!

Update – measurements were slightly off, because not all loops were 8 long. Fixed now, same conclusions.

  1. Thanks for the very informative comparison!

  2. A very interesting experiment indeed.

It’s often forgotten that things which make the source code nice and compact can introduce a processing overhead.

With a loop, at the end of each iteration your loop variable is incremented and compared to the max value, and then, if it is less, a jump is made back to the top of the loop again.

    It would be interesting to see what difference is made by changing the loop to

    for (byte i = 8; i != 0; --i)

    Assuming it will compile of course.

    In theory this would avoid the comparison stage at the end as decrementing the value should automatically set the Zero flag when it reaches zero so a “branch on zero set” (sorry, don’t have the instruction sheet to hand) would suffice.

    • Sorry, that should be “Branch on zero clear”… Zero being set would mean the end of the loop.

  3. for reliability, do not forget to disable interrupts before generating the signal (and re-enable afterwards).

  4. Is gcc (or some kind of pre-processor) not able to “unroll” the simple loops? That would keep code readable and improve efficiency…

    • Actually I found out myself… In the gcc manpage you have the option -funroll-loops! Would you give it a try on your first code with your setup? Regards.

    • Neat, but you wouldn’t want it to unroll a loop of 1000 ;-)

      “Where’s all my memory gone?!” :-(

  5. Looking at the assembler output from gcc clarifies a lot when examining tight loops. I think option -S generates assembly. Setting up a function call usually eats a lot of cycles. Getting gcc to build small functions inline can help, and may not use more memory since making a function call also costs memory.

    Also note that after a jump the processor may need to read the instructions from memory, for which the memory needs to be set up (CAS/RAS delays). Although the ATMega may not have that problem due to the small memory?

  6. @Hamlet – Setting gcc command-line options is not so easy with the Arduino IDE, I’m afraid. I still use that most of the time, because that’s probably what most people use. IMO, a bit of manual unrolling is not too bad – and in many cases I suspect that you really only have to do this in one or two places when you start optimizing things.

    @John – First of all: same comment. Luckily, gcc appears to do a very good job of deciding what to inline if you make the functions “static”. Doesn’t work outside single source files of course. As for the delay-after-branch cache miss: I’d guess that this doesn’t apply to the (non-pipelined?) AVR chips, with on-board static RAM. They really appear to achieve one instruction per clock cycle.

Comments are closed.