Comparing µC libraries

There are many ways to build firmware for microcontrollers. I’m going to look into a few frameworks and compare them in the context of implementing a trivial “blinking LED” demo.

The examples are built with PlatformIO, which conveniently supports all of these frameworks out of the box. The code targets a low-end Nucleo-L432KC dev board from STM, which carries an ARM Cortex-M4 microcontroller.

From Arduino to Zephyr

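Here is the source code for each framework, in this order: Arduino, CMSIS, libopencm3, Mbed, STM32Cube, and Zephyr. All each version does is blink the LED in an approx 500 ms ON / OFF pattern: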

#include <Arduino.h>

void setup () {
    pinMode(LED_BUILTIN, OUTPUT);
}

void loop () {
    digitalWrite(LED_BUILTIN, HIGH);
    delay(500);
    digitalWrite(LED_BUILTIN, LOW);
    delay(500);
}

#include <stm32l4xx.h>

void delayLoop (int n) {
    for (int i = 0; i < n * 1000; ++i)
        asm ("");
}

int main () {
    RCC->AHB2ENR |= RCC_AHB2ENR_GPIOBEN;
    GPIOB->MODER &= ~GPIO_MODER_MODE3_Msk;
    GPIOB->MODER |= GPIO_MODER_MODE3_0;

    while (true) {
        GPIOB->ODR |= 1<<3;
        delayLoop(500);
        GPIOB->ODR &= ~(1<<3);
        delayLoop(500);
    }
}

#include <libopencm3/stm32/rcc.h>
#include <libopencm3/stm32/gpio.h>

void delayLoop (int n) {
    for (int i = 0; i < n * 1000; ++i)
        asm ("");
}

int main () {
    rcc_periph_clock_enable(RCC_GPIOB);
    gpio_mode_setup(GPIOB, GPIO_MODE_OUTPUT, GPIO_PUPD_NONE, GPIO3);

    while (true) {
        gpio_set(GPIOB, GPIO3);
        delayLoop(500);
        gpio_clear(GPIOB, GPIO3);
        delayLoop(500);
    }
}

#include <mbed.h>

DigitalOut led (LED1);

int main () {
    while (true) {
        led = 1;
        wait_us(500000);
        led = 0;
        wait_us(500000);
    }
}

#include "stm32l4xx_hal.h"

int main () {
    HAL_Init();
    __HAL_RCC_GPIOB_CLK_ENABLE();

    GPIO_InitTypeDef GPIO_InitStruct;
    GPIO_InitStruct.Pin = GPIO_PIN_3;
    GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
    GPIO_InitStruct.Pull = GPIO_PULLUP;
    GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_MEDIUM;
    HAL_GPIO_Init(GPIOB, &GPIO_InitStruct); 

    while (true) {
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);
        HAL_Delay(500);
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET);
        HAL_Delay(500);
    }
}

extern "C" void SysTick_Handler () {
    HAL_IncTick();
}

#include <zephyr.h>
#include <device.h>
#include <devicetree.h>
#include <drivers/gpio.h>

#define LED0_NODE DT_ALIAS(led0)

#define LED0    DT_GPIO_LABEL(LED0_NODE, gpios)
#define PIN DT_GPIO_PIN(LED0_NODE, gpios)
#define FLAGS   DT_GPIO_FLAGS(LED0_NODE, gpios)

void main () {
    struct device const* dev = device_get_binding(LED0);
    if (dev == NULL)
        return;
    if (gpio_pin_configure(dev, PIN, GPIO_OUTPUT_ACTIVE|FLAGS) < 0)
        return;

    while (true) {
        gpio_pin_set(dev, PIN, 1);
        k_msleep(500);
        gpio_pin_set(dev, PIN, 0);
        k_msleep(500);
    }
}

PlatformIO package versions used in these examples:

ststm32                   13.0.0
toolchain-gccarmnoneeabi  1.70201.0
framework-mbed            6.60900.210318
framework-cmsis           2.50501.200527
framework-cmsis-stm32l4   1.6.1
framework-libopencm3      1.10000.200730
framework-arduinoststm32  4.10900.200819
framework-stm32cubel4     1.16.0
framework-zephyr          2.20500.210226

The frameworks differ greatly in the amount of code they use and in the number of libraries that must be linked into the final firmware image. First-build times include compiling all these libraries, but in normal use you wouldn’t have to re-compile them every time, just link them in. The first table below shows rebuild times, the second shows first-build times:

Environment    Status    Duration
-------------  --------  ------------
arduino        SUCCESS   00:00:01.074
cmsis          SUCCESS   00:00:00.394
libopencm3     SUCCESS   00:00:00.381
mbed           SUCCESS   00:00:03.265
stm32cube      SUCCESS   00:00:00.740
zephyr         SUCCESS   00:00:00.774

Environment    Status    Duration
-------------  --------  ------------
arduino        SUCCESS   00:00:07.071
cmsis          SUCCESS   00:00:00.760
libopencm3     SUCCESS   00:00:01.509
mbed           SUCCESS   00:01:49.859
stm32cube      SUCCESS   00:00:04.394
zephyr         SUCCESS   00:00:10.338

(All builds were performed on a MacBook Air M1, running PlatformIO 5.2.0a6)

Another aspect in which these frameworks differ substantially is the amount of generated code:

   text       data        bss        dec        hex    filename
  10632        160       2328      13120       3340    .pio/build/arduino/firmware.elf
    796          8       1568       2372        944    .pio/build/cmsis/firmware.elf
    776          0          0        776        308    .pio/build/libopencm3/firmware.elf
  32484        568       7520      40572       9e7c    .pio/build/mbed/firmware.elf
   1556         20       1572       3148        c4c    .pio/build/stm32cube/firmware.elf
  13844        712       4042      18598       48a6    .pio/build/zephyr/firmware.elf
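
The columns can be turned into approximate flash and RAM footprints: text and data both occupy flash (data holds the initial values that get copied to RAM at startup), while data and bss both occupy RAM. Here is a minimal sketch of that arithmetic, using the figures from the table above. The struct and function names are made up for illustration, and stack or heap space that is not reserved in a section is not counted:

```cpp
#include <cstdint>

// One row of the size output shown above (names are illustrative only).
struct SizeRow {
    const char* name;
    uint32_t text, data, bss;
};

// Flash holds the code plus the initial values of initialised variables.
constexpr uint32_t flashBytes (const SizeRow& r) { return r.text + r.data; }

// RAM holds the initialised variables plus the zero-initialised ones.
constexpr uint32_t ramBytes (const SizeRow& r) { return r.data + r.bss; }

constexpr SizeRow rows [] = {
    { "arduino",    10632, 160, 2328 },
    { "cmsis",        796,   8, 1568 },
    { "libopencm3",   776,   0,    0 },
    { "mbed",       32484, 568, 7520 },
    { "stm32cube",   1556,  20, 1572 },
    { "zephyr",     13844, 712, 4042 },
};

// Checked at compile time: Mbed needs about 32 kB of flash and 8 kB of RAM.
static_assert(flashBytes(rows[3]) == 33052, "mbed flash");
static_assert(ramBytes(rows[3]) == 8088, "mbed RAM");
```
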

Mbed in particular creates a relatively large executable, and needs quite a bit of RAM (or at least reserves it). Not all these values can be easily compared, however:

  • Both CMSIS and STM32Cube reserve 1.5 kB of RAM as minimum stack + heap space. This is simply a way to make sure that the resulting code has some space to actually run in.

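  • Mbed and Zephyr include an RTOS, i.e. a multi-tasking scheduler. This also explains why the 0.5 sec waits in their source code look different: Zephyr’s k_msleep() suspends the calling thread so that others can run, instead of spinning in a busy loop. (Mbed’s wait_us() is documented as a spinning wait; ThisThread::sleep_for() is its yielding counterpart.)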
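
To make the difference with the hand-written delayLoop() more concrete, here is a rough sketch of the general tick-counter technique that timed waits are usually built on. It is not the actual code of any of the frameworks above, and all names in it are hypothetical. The counter is advanced by a plain function call so that the logic can be tried on a desktop machine:

```cpp
#include <cstdint>

// Hypothetical millisecond counter. On real hardware a timer interrupt
// (such as SysTick) would advance it once per millisecond.
static volatile uint32_t ticks = 0;

// Stand-in for the timer interrupt handler.
void onTimerInterrupt () { ticks = ticks + 1; }

// True once at least `wait` ms have passed since `start`.
// Unsigned subtraction keeps this correct when the counter wraps around.
bool hasElapsed (uint32_t start, uint32_t wait) {
    return ticks - start >= wait;
}
```

A simple blocking delay would spin on hasElapsed() until it returns true, whereas an RTOS can park the calling thread and run something else until the deadline is reached.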

Build result summary

I’m not going to draw any conclusions from this fairly ad-hoc set of builds; the figures speak for themselves. But I will point out a few things that stood out for me:

  • The source code differs greatly across frameworks, with Mbed taking most advantage of C++ syntax (e.g. led = 1) and Arduino arguably easiest to read with little C knowledge.

  • Rebuilds are quick across the board, but nevertheless: Mbed takes almost an order of magnitude more time to rebuild than CMSIS and libopencm3 (about 3.3 s versus 0.4 s), and its first build takes nearly two minutes.

  • Speaking of which: libopencm3 stands out as generating the smallest binary and using no RAM at all (other than a bit of C stack, as all code does, which is not shown here).

No runtime at all

It’s also possible to talk directly to the hardware and bypass all runtime support:

#include <cstdint>

void delayLoop (int n) {
    for (int i = 0; i < n * 1000; ++i)
        asm ("");
}

int main () {
    // see RM0394 Rev 4
    const auto RCC   = (volatile uint32_t*) 0x4002'1000; // p.68
    const auto GPIOB = (volatile uint32_t*) 0x4800'0400; // p.68
    enum { AHB2ENR=0x4C };        // p.244
    enum { MODER=0x00,ODR=0x14 }; // pp.274

    RCC[AHB2ENR/4] |= (1<<1);     // GPIOBEN, p.244
    GPIOB[MODER/4] &= ~(0b11<<6); // clear bits 6..7
    GPIOB[MODER/4] |= (0b01<<6);  // output mode, p.267

    while (true) {
        GPIOB[ODR/4] |= 1<<3;
        delayLoop(500);
        GPIOB[ODR/4] &= ~(1<<3);
        delayLoop(500);
    }
}

This code adopts a few modern C++ conveniences: auto (C++11), plus binary literals such as 0b11 and the “split” 0x4002'1000 digit-separator notation (both C++14).

This requires more work, and careful attention to detail:

  • the relevant hardware register areas must be defined as volatile pointers, otherwise the compiler may try to “optimise away” any memory accesses it considers redundant
  • access to the different hardware registers can be done via indexing, when properly offset (i.e. divided by 4 to account for the width of a uint32_t)
  • all hardware addresses, offsets, and bit positions can be found in STM’s (1600-page!) reference manual - in this case the STM32L4xx series, i.e. RM0394 from https://st.com
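
The indexing and bit arithmetic from these points can be tried out on a desktop machine by pointing the same kind of volatile pointer at an ordinary array. This is only a sketch: fakeGpio, setOutputMode, and writePin are illustrative names, not part of the original code or of any library.

```cpp
#include <cstdint>

// Byte offsets as listed in the reference manual (same values as above).
enum { MODER = 0x00, ODR = 0x14 };

// Hypothetical stand-in for a GPIO register block, so this runs on a PC.
// On the real chip the pointer would be 0x4800'0400 (GPIOB) instead.
static uint32_t fakeGpio [0x400 / 4];

// Configure a pin as output: each pin owns two MODER bits, 0b01 = output.
void setOutputMode (volatile uint32_t* gpio, int pin) {
    gpio[MODER/4] &= ~(0b11u << (2 * pin)); // clear both mode bits
    gpio[MODER/4] |=  (0b01u << (2 * pin)); // select output mode
}

// Drive a pin high or low via the output data register.
void writePin (volatile uint32_t* gpio, int pin, bool on) {
    if (on)
        gpio[ODR/4] |= 1u << pin;
    else
        gpio[ODR/4] &= ~(1u << pin);
}
```

For pin 3 this touches MODER bits 6..7 and ODR bit 3, i.e. the same bits as the hard-coded shifts in main() above. Dividing the byte offset by 4 turns it into an index into an array of 32-bit words.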

This code resembles the CMSIS version, which is mostly a header with #define’s. As was to be expected, the resulting firmware image is smaller than any of the runtime-based versions:

   text    data     bss     dec     hex filename
    688       0       0     688     2b0 .pio/build/bare/firmware.elf

(this still uses libopencm3 as runtime, but includes none of its header files)

What’s in the firmware image?

This is the resulting memory map. I’ve filtered out the interrupt handlers which are mapped to the same address and therefore take up no extra space:

$ arm-none-eabi-nm -CnS .pio/build/bare/firmware.elf | grep -v _isr
08000000 000001ac T vector_table
080001ac 00000012 T delayLoop(int)
080001c0 00000050 T main
08000210 00000002 T blocking_handler
08000212 00000002 W debug_monitor_handler
08000214 0000009c W reset_handler
080002b0 D __exidx_end
080002b0 D __exidx_start
080002b0 D __fini_array_end
080002b0 D __fini_array_start
080002b0 D __init_array_end
080002b0 D __init_array_start
080002b0 D __preinit_array_end
080002b0 D __preinit_array_start
080002b0 A _data_loadaddr
080002b0 T _etext
10000000 T end
20000000 D _data
20000000 B _ebss
20000000 D _edata
2000c000 T _stack

The main() app is 0x50, i.e. 80 bytes. The rest is used mostly for the vector dispatch table (0x1AC, i.e. 428 bytes) and low-level startup code in reset_handler (0x9C, i.e. 156 bytes).
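
As a cross-check, the symbol sizes can be added up and compared with the 688 bytes reported earlier. The sketch below copies the addresses and sizes from the listing above; the struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Address and size of each symbol, copied from the nm listing above.
struct Symbol { const char* name; uint32_t addr, size; };

constexpr Symbol symbols [] = {
    { "vector_table",          0x08000000, 0x1ac },
    { "delayLoop(int)",        0x080001ac, 0x012 },
    { "main",                  0x080001c0, 0x050 },
    { "blocking_handler",      0x08000210, 0x002 },
    { "debug_monitor_handler", 0x08000212, 0x002 },
    { "reset_handler",         0x08000214, 0x09c },
};
constexpr int numSymbols = sizeof symbols / sizeof symbols[0];

// Sum of the symbol sizes themselves.
constexpr uint32_t totalSize (int i = 0) {
    return i == numSymbols ? 0 : symbols[i].size + totalSize(i + 1);
}

// Bytes from the start of flash to the end of the last symbol.
constexpr uint32_t imageSize () {
    return symbols[numSymbols-1].addr + symbols[numSymbols-1].size
         - symbols[0].addr;
}

static_assert(totalSize() == 686, "sum of the listed symbol sizes");
static_assert(imageSize() == 688, "0x2b0, the address of _etext");
```

The two-byte difference is the gap between the end of delayLoop (0x080001be) and the start of main (0x080001c0).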

The various interrupt handlers add no overhead and there are no C++ initialisers or finalisers.

Generated assembly code

For a brief but really deep dive, here’s the assembly code produced by the g++ compiler:

080001c0 <main>:
 80001c0:   4a10        ldr r2, [pc, #64]   ; (8000204 <main+0x44>)
 80001c2:   b508        push    {r3, lr}
 80001c4:   6813        ldr r3, [r2, #0]
 80001c6:   f043 0302   orr.w   r3, r3, #2
 80001ca:   6013        str r3, [r2, #0]
 80001cc:   4b0e        ldr r3, [pc, #56]   ; (8000208 <main+0x48>)
 80001ce:   681a        ldr r2, [r3, #0]
 80001d0:   f022 02c0   bic.w   r2, r2, #192    ; 0xc0
 80001d4:   601a        str r2, [r3, #0]
 80001d6:   681a        ldr r2, [r3, #0]
 80001d8:   f042 0240   orr.w   r2, r2, #64 ; 0x40
 80001dc:   601a        str r2, [r3, #0]
 80001de:   4a0b        ldr r2, [pc, #44]   ; (800020c <main+0x4c>)
 80001e0:   6813        ldr r3, [r2, #0]
 80001e2:   f043 0308   orr.w   r3, r3, #8
 80001e6:   6013        str r3, [r2, #0]
 80001e8:   f44f 70fa   mov.w   r0, #500    ; 0x1f4
 80001ec:   f7ff ffde   bl  80001ac <_Z9delayLoopi>
 80001f0:   6813        ldr r3, [r2, #0]
 80001f2:   f023 0308   bic.w   r3, r3, #8
 80001f6:   6013        str r3, [r2, #0]
 80001f8:   f44f 70fa   mov.w   r0, #500    ; 0x1f4
 80001fc:   f7ff ffd6   bl  80001ac <_Z9delayLoopi>
 8000200:   e7ee        b.n 80001e0 <main+0x20>
 8000202:   bf00        nop
 8000204:   4002104c    .word   0x4002104c
 8000208:   48000400    .word   0x48000400
 800020c:   48000414    .word   0x48000414

The constants from the C++ source code can be found at the end (as full register addresses, i.e. base address plus offset), and with a bit of knowledge of assembly language, the C++ code logic can also be (sort of …) recognised in there.
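
The link between the ldr instructions and the constants at the end can be checked with a little arithmetic. In Thumb code, a PC-relative load reads the PC as the instruction’s address plus 4, rounded down to a multiple of 4, and then adds the immediate. Here is a small sketch of that calculation (literalAddress is an illustrative name), with the addresses taken from the listing above:

```cpp
#include <cstdint>

// Target address of a Thumb "ldr rX, [pc, #imm]" instruction: the PC reads
// as the instruction address + 4, rounded down to a multiple of 4.
constexpr uint32_t literalAddress (uint32_t instrAddr, uint32_t imm) {
    return ((instrAddr + 4) & ~3u) + imm;
}

// The three loads in main() point at the three .word entries at its end.
static_assert(literalAddress(0x80001c0, 64) == 0x8000204, "first load");
static_assert(literalAddress(0x80001cc, 56) == 0x8000208, "second load");
static_assert(literalAddress(0x80001de, 44) == 0x800020c, "third load");

// Those words are the base addresses plus register offsets from the source.
static_assert(0x4002'1000 + 0x4C == 0x4002104c, "RCC + AHB2ENR");
static_assert(0x4800'0400 + 0x00 == 0x48000400, "GPIOB + MODER");
static_assert(0x4800'0400 + 0x14 == 0x48000414, "GPIOB + ODR");
```

The third load shows why the rounding matters: 0x80001de + 4 is 0x80001e2, which is first rounded down to 0x80001e0 before the 44 is added.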

Compiler Explorer

There is an amazing online tool called “Compiler Explorer” (github) to investigate exactly what a compiler does with source code and what output it generates, in a colourful annotated way.

The above code, for example, can also be examined in Compiler Explorer. Each source code line is colour-coded side-by-side with the generated assembly code, which is colour-coded in the same way. This makes it very easy to see how each C++ source line was compiled.

Note that this will not work with the above runtime libraries, as CE won’t find their headers to compile with. But for a “bare” compile, Matt Godbolt’s tool is magical (see 1h YouTube video).