Computing stuff tied to the physical world

Pin I/O performance

In AVR, Hardware, Software on Jan 6, 2010 at 00:01

There was a discussion on the Arduino developers’ mailing list about the impact of a small change to the digitalWrite() function, and for some time I’ve been hearing that digitalWrite() has a huge amount of overhead.

Time to find out.

Here is the sketch I used to measure how often a pin I/O command can be issued using various mechanisms:

[Screenshot of the sketch: Screen shot 2010-01-05 at 11.42.53.png]

The logic is that I’m counting how many times the same command can be called between two timer overflows, i.e. within 1024 µs (an 8-bit counter incrementing at 16 MHz / 64, so 256 steps of 4 µs each), before the timer tick count changes again.
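The counting idea can be sketched in a few lines. This is not the sketch from the screenshot, only a minimal illustration of the technique: the names `countCallsPerTick`, `tick` and `op` are hypothetical, and the tick source is passed in as a parameter so the logic can also be tried on a desktop compiler.

```cpp
// Counts how many times op() runs during one full tick of tick().
// Hypothetical helper for illustration, not the code from the post.
template <typename TickFn, typename OpFn>
unsigned long countCallsPerTick(TickFn tick, OpFn op) {
    const auto before = tick();
    auto current = tick();
    while (current == before)       // wait for a tick boundary, so the
        current = tick();           // measurement starts on a fresh tick

    unsigned long calls = 0;
    while (tick() == current) {     // stop as soon as the tick count changes
        op();                       // the command being measured
        ++calls;
    }
    return calls;
}
```

On an Arduino, `tick` would wrap whatever exposes the timer overflow count and `op` would wrap the call under test, such as digitalRead(). Note that the returned count includes the cost of the loop and of reading the tick itself, not just the cost of `op`.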

And here’s the sample output:

[Screenshot of the sample output: Screen shot 2010-01-05 at 11.42.14.png]

There’s a small amount of jitter, which tells me the loops are syncing up almost exactly on the timer ticks. Interrupts have not been disabled, so the timer interrupt is indeed being serviced – once for each loop.

What these values tell me is that we can do about:

  • 10 analog 10-bit readings per millisecond with analogRead()
  • 128 PWM settings per millisecond with analogWrite()
  • 220 pin reads per millisecond with digitalRead()
  • 224 pin writes per millisecond with digitalWrite()
  • 1056 pin reads per millisecond with direct port reads
  • 1059 pin writes per millisecond with direct port writes

(I’ve scaled the counts by 1000/1024 to arrive at these per-millisecond values.)
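The 1024 µs window and the 1000/1024 scaling follow directly from the timer settings. Here is a small sketch of that arithmetic; the constant and function names are hypothetical, and only the 16 MHz clock, the /64 prescaler and the one-byte counter come from the post.

```cpp
// Timer arithmetic for a 16 MHz clock, a /64 prescaler and an 8-bit counter.
constexpr unsigned long kClockHz = 16000000UL;
constexpr unsigned long kPrescaler = 64;
constexpr unsigned long kCounterSteps = 256;   // one byte

// Microseconds between two timer overflows: 256 * 64 cycles at 16 cycles
// per microsecond, i.e. 1024 microseconds.
constexpr unsigned long overflowPeriodMicros() {
    return kCounterSteps * kPrescaler / (kClockHz / 1000000UL);
}

// Scale a count taken over one overflow period to a count per millisecond.
constexpr double perMillisecond(unsigned long countPerOverflow) {
    return countPerOverflow * 1000.0 / overflowPeriodMicros();
}
```

For instance, a loop that runs 1024 times per overflow period is running exactly 1000 times per millisecond.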

So the Arduino’s digital I/O in IDE version 0017 runs at roughly 1/5th the speed of direct port access on a 16 MHz ATmega328.

But WAIT! – There’s a large systematic error in the above calculations, due to the loop overhead. It looks like each loop iteration carries 1024000/1251 ≈ 819 ns of overhead, so the actual values are quite different: digitalRead() -> 3712 ns, direct port read -> 151 ns. Now the ratio is more like 1/25th!
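The correction amounts to subtracting the loop's own cost from the average time per iteration. A minimal sketch, with hypothetical function names and the window length and the 1251 count taken from the text above:

```cpp
constexpr double kWindowNs = 1024000.0;  // one timer overflow period, in ns

// Average duration of one loop iteration, given the iterations counted
// during one window.
constexpr double nsPerIteration(unsigned long count) {
    return kWindowNs / count;
}

// Time left for the I/O call itself once the loop overhead is subtracted.
constexpr double nsPerCall(unsigned long count, double loopOverheadNs) {
    return nsPerIteration(count) - loopOverheadNs;
}
```

Here `nsPerIteration(1251)` comes out at about 818.5 ns, which rounds to the 819 ns used above. Working backwards, the 3712 ns figure for digitalRead() corresponds to about 4531 ns per iteration before the correction.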

So let’s redo this with more I/O in each loop iteration (all 4 ports):

[Screenshot of the revised sketch: Screen shot 2010-01-05 at 11.55.31.png]

The sample output now becomes:

[Screenshot of the revised output: Screen shot 2010-01-05 at 11.56.59.png]

With these results we get: one digitalRead() takes 4134 ns, one direct port read takes 83 ns (again correcting for 819 ns loop overhead). The conclusion would be that digitalRead() is about 50x slower than a direct port read.

Which one is correct? I don’t know for sure. I retried the direct port read with 16 entries per loop, and got 67 ns, which seems to indicate that a direct port read takes one processor cycle (62.5 ns), as I would indeed expect.
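Putting several calls in each loop iteration spreads the loop overhead over more work, and converting the result to clock cycles makes the comparison with the hardware easier. A rough sketch of that arithmetic, assuming the 16 MHz clock and 1024 µs window from above; the function names are hypothetical.

```cpp
constexpr double kCpuHz = 16000000.0;

// Duration of one CPU cycle in nanoseconds: 62.5 ns at 16 MHz.
constexpr double nsPerCycle() {
    return 1e9 / kCpuHz;
}

// Per-call time when each loop iteration contains several identical calls.
// count is the number of iterations seen in one 1024000 ns window.
constexpr double nsPerCallUnrolled(unsigned long count,
                                   unsigned callsPerLoop,
                                   double loopOverheadNs) {
    return (1024000.0 / count - loopOverheadNs) / callsPerLoop;
}

// Express a duration as a (possibly fractional) number of CPU cycles.
constexpr double nsToCycles(double ns) {
    return ns / nsPerCycle();
}
```

With these helpers, `nsToCycles(67.0)` is a little over one cycle, which is why the 16-entry result looks like a single-cycle read.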

Conclusion: if performance is the goal, then we may need to ditch the Arduino approach.

Update – Based on JimS’s timing code (see comments): digitalRead() = 58 cycles and direct pin read = 1 cycle.

Update #2 – The “1 cycle” mentioned above is indeed what I measured, but incorrect. The bit extraction was probably optimized away. So it looks like direct pin access can’t be more than 29x faster than digitalRead(). As pointed out by WestfW in the comments, digitalRead() and digitalWrite() have predictable performance across all use cases, including when the pin number is variable. In some cases that may matter more than raw speed.
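To make the trade-off concrete, here is a desktop simulation of the two access styles. It is only an illustration under stated assumptions: the `simulatedPins` array stands in for the AVR input registers, all names are hypothetical, the pin mapping assumed is the usual ATmega328 one (pins 0-7 on port D, 8-13 on port B, 14-19 on port C), and the real Arduino core uses lookup tables rather than this if-chain.

```cpp
#include <stdint.h>

// Simulated input registers standing in for PINB, PINC and PIND.
// On real hardware these would be the AVR I/O registers.
static uint8_t simulatedPins[3] = {0, 0, 0};
constexpr uint8_t kPortB = 0;
constexpr uint8_t kPortC = 1;
constexpr uint8_t kPortD = 2;

// Run-time lookup in the spirit of digitalRead(): the pin number is a
// variable, so port and bit have to be worked out on every call.
// No bounds checking; pin numbers above 19 are not meaningful here.
bool readPinByNumber(uint8_t pin) {
    uint8_t port, bit;
    if (pin < 8)       { port = kPortD; bit = pin; }
    else if (pin < 14) { port = kPortB; bit = static_cast<uint8_t>(pin - 8); }
    else               { port = kPortC; bit = static_cast<uint8_t>(pin - 14); }
    return (simulatedPins[port] & (1u << bit)) != 0;
}

// Direct access: port and bit are compile-time constants, which leaves
// one register read plus a mask.
template <uint8_t Port, uint8_t Bit>
bool readPinDirect() {
    return (simulatedPins[Port] & (1u << Bit)) != 0;
}
```

Both functions return the same value for the same pin, e.g. `readPinByNumber(13)` and `readPinDirect<kPortB, 5>()`. The difference is that the first accepts any pin number at run time, while the second fixes the pin at compile time in exchange for less work per call.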

Update #3 – Another caveat – Lies, damn lies, and statistics! – is that the register allocations for the above loops make it extremely difficult to draw exact conclusions. Let me just conclude with: there are order-of-magnitude performance implications, depending on how you do things. As long as you keep that in mind, you’ll be fine.

  1. Also keep in mind that an operation that takes 50 times as long uses 50 times as much power. So you might reach the same conclusion if power saving is the goal, which seems to be a more appropriate one for many of your use cases.

  2. It would be interesting to just measure power consumption for long periods of time to determine which read is using the most. You might get some correlation since the looping instruction work would cancel out between I/O types.

    • I’d expect power drain to be the same for each. Power savings usually come from putting the MPU in a low-power sleep mode. So the quicker the work is done, the quicker the chip can be put back to sleep again. But perhaps I’m misunderstanding what you mean…

  3. Very good experiment and write-up. I’m an AVR C fan who uses Arduino just for the convenient access to hardware. I’ve written a C library (not technically a library just yet) with an easy interface like the Arduino language, but in raw C. I’ve tried to take performance into consideration. I’d be interested to see what you think.

    http://tinkerish.com/wiki/

    • Thx – and thanks for the pointer. This looks like an interesting library, I’m definitely going to look into your “libarduino” in some more detail.

  4. Not quite an important question, but… Which kind of IDE are you using for writing your code? It has a very useful color scheme (for the text), as I can see from your screenshots, and I like it a lot.

    • I’m using TextMate on Mac OS X. These are screen dumps, based on a variation of the “IDLE” syntax coloring scheme. The font is Menlo 10 pt, which kind of fits in nicely with the rest of the text, IMO.

  5. You can get clock accurate timing of short operations using the technique in this post: http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1175115259

    • Ah, thanks – that’s a great tip! I’ve already done some more timing experiments using the above approach (to be posted next week), but will definitely try out your trick next time around.

  6. @JimS – I’ve updated the post with latest timing results using your trick. Thx.

  7. I think it’d be hard to do a direct-read equivalent in “1 cycle”; the single-cycle “IN” instruction reads the whole port, and you’re going to need an extra cycle or two to isolate a single pin (even for constants). Adding a couple of cycles may not sound like much, but it changes that multiplier from “Arduino library is 50 times slower” to more like only “20 times slower…”

    It’s also important to recognize just what you get for the extra cycles: 1) The port being read is a variable 2) the bit being read is a variable 3) the “pin number” is variable and is mapped at runtime to the appropriate port and bit. Given the fundamental underlying architecture, these are relatively “expensive” features to implement. One of the reasons that faster alternatives are resisted in the core libraries is that they’d tend to make timing “complicated.”

  8. Good points, thx. In hindsight, I found the “1 cycle” result a bit hard to believe. I think what happens is that the “IN” instruction is generated, but the bit extraction isn’t because the value isn’t used in this synthesized example. So at best, the conclusion would be 58:2, i.e. 29x.

    Yes, the variable bit access is what sets digitalRead() and digitalWrite() apart. And the table lookup they use makes their performance predictable. I’m not suggesting that is bad. I merely wanted to point out that direct bit access is substantially faster. For some use cases, 29x more overhead could be a show stopper.

  9. Any idea how the speed of port operations compares? I don’t know how Arduino implements them, but I’d guess they’re a lot closer to the bit operations.
