A recent post described the performance loss in the Arduino’s digitalRead() and digitalWrite() functions, compared to raw pin access.
Can we do better – i.e. hide the details, yet still get the benefits of raw I/O? Sure.
If you’ve used JeeNodes and in particular the “Ports” library, you’ll have noticed that there is a C++ class which hides the details of each port (i.e. JeeNode port, not ATmega port). Let’s look at that first:
I’ve omitted the implementation, but there are still lots of secondary details.
The main point is that this is now C++, and uses a “Port” object as central mechanism. Each object has one byte of data, containing the port number (1..4).
Due to heavy inlining, there is almost no additional overhead for using the Port class over using digitalRead() and digitalWrite(), on which they are based. I verified it by running similar tests as in the recent post about pin I/O:
Using the definition “Port orig (1);” – and sure enough the results are nearly the same.
There are two issues which make this approach sub-optimal: using the slow digital read/write calls, and storing the port number in a memory location which needs to be accessed at run time. There is no way for the compiler to optimize such calls, even “orig.digiRead()” should be the same as writing “bitRead(PORTD, 4)” in this example.
That’s where C++ templates come in. Check out this definition of a new “XPort” class (named that way to avoid a name conflict) and an example of use for port 1:
(As you can see, I’m switching to a different, and hopefully clearer, API along the way)
There’s some funky <…> stuff going on. We’re in fact not declaring one class, but a whole family of classes, parametrized by the integer included in the <…> notation on the last line.
The big difference, is that each class now has that integer value “built-in”, so to speak. So we can define member functions which directly pass that value on to the corresponding bitRead() and bitWrite() macros. And then all of a sudden, all the overhead vanishes: since the member needs no access to object state, it can be made static, and since all the info is known in the header, it can be made inline as well.
So the above template is C++’s modern way of doing far more at compile time, allowing the optimizer to generate much better code.
Note that templates come with some pitfalls: first of all, it’s very easy to inadvertently generate huge amounts of code, so very careful inlining and base class derivation is essential. The second problem is that templates tend to be “instantiated” as late as possible by the compiler, which can lead to confusing error messages when the templates are wrong or used wrongly.
I’m still just exploring this approach for embedded use. The potential performance gains are substantial enough to give it a serious try. My hope is that the hard work can be done in a library, so that everyone else can just use it and benefit from these gains without having to think much about templates, let alone implement new ones. The “one” object declared above acts like any other C++ object, so using it will be just as easy as non-template objects.
Does the above lead to fast code? You bet. Here’s a test sketch:
And here’s some sample output:
As you can see, values 5 and 6 are virtually the same as values 7 and 8. We’ve obtained the performance of direct pin access while using a high-level port-style notation to access those pins. This is why templates are so attractive for embedded use.
The timings are different from the previous post because the loops are coded differently. In this case, only the relative comparisons are relevant.