All microcontrollers have I/O pins to connect to the outside world. These can be used as inputs or outputs, and tend to have various interesting capabilities. Here, I’ll treat these “General Purpose I/O” pins as outputs and try to toggle them as quickly as possible.

These examples will use the STM32F407. Other families may differ slightly (the GPIOs in the older F1 series are somewhat less configurable and less flexible, for example).

Raw GPIO pin access

On STM32 chips, each pin is a bit in a hardware port, mapped to a specific memory location. Naming is straightforward: each port is a letter (“A” to “K”, depending on the chip; the F407 goes up to “I”), with up to 16 pins (0 to 15). The STM32 naming convention for pin 13 of port B is “PB13”.

As stated in the STM32F4 reference manual (RM0090, p.65), the port B hardware registers are mapped to the address range 0x40020400..0x400207FF. Each port has several registers, to set its pin modes and to get / set the current pin state.
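The port addresses follow a regular pattern, so they can be computed rather than hard-coded. Here is a minimal sketch, assuming the STM32F4 memory map from RM0090 (GPIO ports start at 0x40020000 and are spaced 0x400 bytes apart). The names `portBase` and `regAddr` are made up for this illustration and do not come from any library:

```cpp
#include <cstdint>

// Assumed STM32F4 layout (per RM0090): port A starts at 0x40020000,
// and each following port sits 0x400 bytes higher.
constexpr uint32_t GPIO_BASE   = 0x40020000;
constexpr uint32_t PORT_STRIDE = 0x400;

// Offsets of two registers within a port's address range.
constexpr uint32_t MODER_OFFSET = 0x00;  // mode register
constexpr uint32_t ODR_OFFSET   = 0x14;  // output data register

// Hypothetical helper: base address of port 'A', 'B', ...
constexpr uint32_t portBase(char port) {
    return GPIO_BASE + PORT_STRIDE * static_cast<uint32_t>(port - 'A');
}

// Hypothetical helper: address of one register of a port.
constexpr uint32_t regAddr(char port, uint32_t offset) {
    return portBase(port) + offset;
}

static_assert(portBase('B') == 0x40020400, "port B base address");
```

Since everything is `constexpr`, the compiler does the arithmetic and the generated code is the same as with a literal address.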

The output mode is set in the “GPIO port mode register” (MODER) at offset 0x0, with 2 configuration bits per pin. For pin 13, bit positions 27..26 will need to be set to binary “01” (0b01), as follows:

volatile uint32_t* MODER_B = (volatile uint32_t*) 0x40020400;
*MODER_B &= ~(0b11<<26);  // clear bits 27 and 26 to zero
*MODER_B |= (0b01<<26);   // "or" in the new value

This messy bit-fiddling is needed to avoid changing any bits other than bits 27 and 26.
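The same clear-then-set pattern works for any pin. The following is a sketch only: `withPinMode` is a hypothetical helper, not part of any library, which computes the new MODER value from the old one. Because it works on plain values, it can be checked on a desktop machine without touching hardware.

```cpp
#include <cstdint>

// Hypothetical helper: return `moder` with the 2-bit mode field of `pin`
// (0..15) replaced by `mode` (0..3), leaving all other bits untouched.
constexpr uint32_t withPinMode(uint32_t moder, unsigned pin, uint32_t mode) {
    return (moder & ~(0x3u << (2 * pin)))   // clear the pin's two mode bits
         | ((mode & 0x3u) << (2 * pin));    // "or" in the new value
}

// Pin 13 as output (mode 01) lands in bits 27..26.
static_assert(withPinMode(0x00000000, 13, 1) == 0x04000000, "sets bit 26");
static_assert(withPinMode(0xFFFFFFFF, 13, 1) == 0xF7FFFFFF, "clears bit 27 only");
```

On the target, this could be used as `*MODER_B = withPinMode(*MODER_B, 13, 1);`.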

Note: the volatile keyword is essential when modifying hardware registers. It tells the compiler NOT to optimise away any code which might appear redundant, but is in fact required to trigger the requested side-effects within the hardware of the chip.

Here is an example of what’s in the STM32F407 datasheet (image not reproduced here).

The pin output state can be controlled through another register, the “GPIO output data register” (ODR), by simply setting or clearing that bit (note the different register address):

volatile uint32_t* ODR_B = (volatile uint32_t*) 0x40020414;
*ODR_B |= (1<<13);   // set pin high, this turns the LED "off"
*ODR_B &= ~(1<<13);  // set pin low, this turns the LED "on"

On the board I’m using, the LED is connected between the I/O pin and +3.3V (via a current-limiting resistor). This makes that I/O signal “active low”, i.e. its active state (LED on) is when the GPIO pin is low (“0”). Hence the inverted logic.
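One way to keep the inverted logic out of the rest of the code is to hide it in a small helper. This is only a sketch: `LED_PIN`, `LED_ACTIVE_LOW` and `odrWithLed` are names invented here, and the function works on a plain value instead of the real register, so it runs anywhere.

```cpp
#include <cstdint>

constexpr unsigned LED_PIN = 13;       // PB13, as on the board described above
constexpr bool LED_ACTIVE_LOW = true;  // LED sits between the pin and +3.3V

// Hypothetical helper: return the ODR value that turns the LED on or off.
constexpr uint32_t odrWithLed(uint32_t odr, bool on) {
    // For an active-low LED, "on" means driving the pin low.
    return (on != LED_ACTIVE_LOW) ? (odr | (1u << LED_PIN))
                                  : (odr & ~(1u << LED_PIN));
}
```

With this, `*ODR_B = odrWithLed(*ODR_B, true);` would light the LED regardless of how it is wired, as long as `LED_ACTIVE_LOW` matches the board.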

So much for the really low-level (and tedious!) way of controlling a GPIO pin.

GPIO pins in C++

Clearly, this detailed level of coding will get real old, real fast.

In the JeeH library, pins can be mapped to variables using a special notation in C++, so with our LED on pin PB13, a variable named “led” can be declared as follows:

PinB<13> led;

This uses C++ templates, hence the <> notation in there.

Setting the pin to behave as an output is now one call:

led.mode(Pinmode::out);

Setting a pin high or low is also concise, using C++ “overloading” of the “=” operator:

led = 1;
led = 0;

Or, if you prefer an explicit function call notation to convey the meaning:

led.write(1);
led.write(0);

And lastly, there’s a way to specify toggling the pin between its two states:

led.toggle();

Under the hood, the same raw pin register manipulations as described earlier still take place, but this wraps it all in a much more compact and readable notation.
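To give an idea of how such a wrapper can be built, here is a simplified sketch. It is NOT the JeeH implementation: `FakePort` and `SketchPin` are hypothetical names, the "registers" are ordinary variables so that the code runs on any machine, and writes go through ODR only.

```cpp
#include <cstdint>

// Hypothetical stand-in for a port's registers (only the two used here).
struct FakePort {
    uint32_t moder = 0;
    uint32_t odr = 0;
};

// Hypothetical pin wrapper: the pin number is a template parameter,
// so all shifts and masks are compile-time constants.
template <int N>
struct SketchPin {
    FakePort& port;

    constexpr void modeOut() const {
        port.moder = (port.moder & ~(3u << (2 * N))) | (1u << (2 * N));
    }
    constexpr void write(int v) const {
        port.odr = v ? (port.odr | (1u << N)) : (port.odr & ~(1u << N));
    }
    constexpr void toggle() const { port.odr ^= (1u << N); }

    // Overloaded "=" so that "led = 1" works like write(1).
    constexpr void operator=(int v) const { write(v); }
};

// Small demo: returns the simulated ODR after a few operations.
constexpr uint32_t demoOdr() {
    FakePort b{};
    SketchPin<13> led{b};
    led.modeOut();
    led = 1;       // bit 13 set
    led.toggle();  // bit 13 cleared
    led.toggle();  // bit 13 set again
    return b.odr;
}
```

A real implementation would point at the hardware addresses through `volatile` accesses instead of a `FakePort` object; the structure of the wrapper stays the same.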

How fast is it?

This is where it gets interesting. We’re manipulating the hardware directly in code, at the very lowest hardware register level (even with C++’s notational conveniences).

So let’s write a loop and look at the code generated by the C++ compiler:

    while (true)
        led.toggle();

Here is the corresponding machine + assembly code:

$ arm-none-eabi-objdump -d .pioenvs/f407/src/main.o
[...]
  84:	680a      	ldr	r2, [r1, #0]
  86:	f412 5f00 	tst.w	r2, #8192	; 0x2000
  8a:	bf14      	ite	ne
  8c:	f04f 5200 	movne.w	r2, #536870912	; 0x20000000
  90:	f44f 5200 	moveq.w	r2, #8192	; 0x2000
  94:	601a      	str	r2, [r3, #0]
  96:	e7f5      	b.n	84 <main+0x84>

In short: check the current pin value, switch to the other state (0 or 1), set it, and jump back to the beginning of the loop. Rinse and repeat forever. By the way, it’s interesting to actually see what that generated ARM machine code looks like, all 20 bytes of ‘em.
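Note that the loop reads from one register (via r1) and writes to another (via r3). The two constants suggest that the write goes to the port's bit set/reset register (BSRR), where a 1 in bits 0..15 sets a pin and a 1 in bits 16..31 clears it. That is an interpretation of the listing, not something the listing itself states. Under that assumption, the logic can be sketched in C++ on plain values (the function names are made up for this example):

```cpp
#include <cstdint>

constexpr uint32_t PIN13_SET   = 1u << 13;         // 0x2000
constexpr uint32_t PIN13_RESET = 1u << (13 + 16);  // 0x20000000

// Mirrors tst / ite / movne / moveq: pick the value to write,
// based on the current state of pin 13.
constexpr uint32_t bsrrForToggle(uint32_t odr) {
    return (odr & PIN13_SET) ? PIN13_RESET : PIN13_SET;
}

// Host-side model of the effect a BSRR write has on the output register:
// the upper half clears pins, the lower half sets them.
constexpr uint32_t applyBsrr(uint32_t odr, uint32_t bsrr) {
    return (odr & ~(bsrr >> 16)) | (bsrr & 0xFFFFu);
}
```

The benefit of writing set/reset bits is that no read-modify-write of the output register is needed: other pins of the port are left alone by the hardware.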

The LED will blink FAR too quickly to see, but with the F407 clock running at its 16 MHz default and a frequency meter, the pin’s output frequency can be measured: it’s 0.79 MHz. That is about 20 clock cycles per full on/off period, and since two toggles make one period, each loop iteration needs 10 clock cycles.
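The arithmetic behind that estimate, as a small sketch (`cyclesPerToggle` is just a name chosen for this example):

```cpp
// Clock cycles spent per pin state change, given the CPU clock and the
// measured pin frequency (one full on/off period contains two changes).
constexpr double cyclesPerToggle(double clockHz, double pinHz) {
    return clockHz / pinHz / 2.0;
}

// 16 MHz clock, 0.79 MHz measured: a little over 10 cycles per toggle.
constexpr double measured = cyclesPerToggle(16e6, 0.79e6);
```

The result is not exactly 10, which is to be expected given the precision of the measured frequency.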

Some overhead is due to the loop itself. To see how much, we can try to “unroll” this loop:

    while (true) {
        led.toggle();
        led.toggle();
        led.toggle();
        led.toggle();
        led.toggle();
        led.toggle();
        led.toggle();
        led.toggle();
    }

Note that the blink rate will no longer be 100% constant (it takes just a tad longer when the end of the loop is reached). It turns out that the toggle rate is now … 0.49 MHz?

Compiler effects

That last outcome is highly counter-intuitive. We’ve reduced the loop overhead, and yet the pin toggle rate has gone down?

The reason appears to be that the compiler has switched from inline code to a call to the toggle routine (most probably for code size reasons). As a result, each toggle now incurs the full call overhead, as well as a loss of locality w.r.t. caching (more on that later).

This can be seen in the generated code (unfortunately, the assembly code is now also more complex). Suffice it to say that inlining can be forced by writing the code differently, but with optimising compilers, loop unrolling does not always lead to the desired effect.

More speed

Still, it is possible to further speed up the toggle rate, since there’s no need to check the pin’s state at run time - it’s always known. The following code will toggle just fine:

    while (true) {
        led = 0;
        led = 1;
    }

The corresponding machine + assembly code is considerably simpler:

  8a:	6019      	str	r1, [r3, #0]
  8c:	601a      	str	r2, [r3, #0]
  8e:	e7fc      	b.n	8a <main+0x8a>

The resulting pin output frequency is 3.16 MHz, but this time it’s highly asymmetric.

The “on” and “off” times differ, because one of the two state changes is always followed by a jump, back to the beginning of the loop. This loop takes 5 cycles (16 MHz / 3.16 MHz ≈ 5).

This can be mitigated by unrolling the loop again; the output then becomes a “bursty” 8 MHz.

Now the compiler does the right thing, inlining each pin change. But a continuous 50% duty cycle is not possible with a loop. For that, you will need a hardware timer.

Nevertheless, not bad - a toggle rate approaching 8 MHz is as good as it can possibly be, using a single machine instruction to change each pin state.

There’s more going on

If we increase the clock speed to 168 MHz, the above will scale accordingly, i.e. a toggle rate approaching 84 MHz can be achieved.

But this masks an important feature of the STM32 µC chips.

Normally, code is executed from flash memory. It is perhaps not widely known that flash memory is quite slow, at least compared to the ≈ 6 ns (168 MHz) cycle times of an F407 running at full speed. So slow, in fact, that at 168 MHz, 5 extra “wait states” (i.e. clock cycles doing nothing) have to be inserted for each flash memory access.
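To put numbers on this, here is a back-of-the-envelope calculation. It assumes that a flash access with N wait states takes N + 1 clock cycles; the function names are made up for this sketch.

```cpp
// Duration of one clock cycle in nanoseconds.
constexpr double cycleNs(double clockHz) {
    return 1e9 / clockHz;
}

// Time for one flash access, assuming it takes (waitStates + 1) cycles.
constexpr double flashAccessNs(double clockHz, int waitStates) {
    return (waitStates + 1) * cycleNs(clockHz);
}

// At 168 MHz: just under 6 ns per cycle, and roughly 36 ns per
// flash access with 5 wait states.
constexpr double cycle  = cycleNs(168e6);
constexpr double access = flashAccessNs(168e6, 5);
```

In other words, without some form of caching or prefetching, every flash access would cost as much as six instructions' worth of clock cycles.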

The optimised solution created by ST is the “Adaptive real-time memory accelerator” (ART Accelerator). Here is what it does, taken from the STM32F407 datasheet:

To release the processor full 210 DMIPS performance at this frequency, the accelerator implements an instruction prefetch queue and branch cache, which increases program execution speed from the 128-bit Flash memory. Based on CoreMark benchmark, the performance achieved thanks to the ART accelerator is equivalent to 0 wait state program execution from Flash memory at a CPU frequency up to 168 MHz.

The F407’s ART logic is not enabled on power-up. Three hardware bits have to be set:

    *((volatile uint32_t*) 0x40023C00) |= (0b111<<8);
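The three bits are easier to recognise with names. The sketch below uses the bit names from the flash access control register (FLASH_ACR) description in RM0090; `withArtEnabled` is a hypothetical helper that works on a plain value, so it can be verified without hardware. If the vendor's device header is in use, it is likely to define similar names already, in which case these constants would clash with it.

```cpp
#include <cstdint>

// Bit positions in FLASH_ACR (the register at 0x40023C00).
constexpr uint32_t FLASH_ACR_PRFTEN = 1u << 8;   // prefetch enable
constexpr uint32_t FLASH_ACR_ICEN   = 1u << 9;   // instruction cache enable
constexpr uint32_t FLASH_ACR_DCEN   = 1u << 10;  // data cache enable

// Hypothetical helper: return `acr` with all three features enabled,
// leaving the other bits (such as the wait state setting) untouched.
constexpr uint32_t withArtEnabled(uint32_t acr) {
    return acr | FLASH_ACR_PRFTEN | FLASH_ACR_ICEN | FLASH_ACR_DCEN;
}

static_assert(withArtEnabled(0) == (0x7u << 8), "same bits as 0b111<<8");
```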

When all instructions fit into the built-in caches, flash memory access times won’t slow us down. As expected, the loop instruction counts match the number of clock cycles.

For a chip of a few dollars, such multi-MHz performance levels are quite impressive.
