Some µC speed measurements May 2016
Not long ago, Ken Boak very generously donated one of his assembled PCB designs to JeeLabs:
This is a break-out board for the STM32F746VG, an ARM Cortex M7 CPU with floating point and a whopping 1 MB flash + 320 KB RAM, all in a 100-pin SMD package.
Lots of I/O hardware, including USB and Ethernet, lots of analog I/O with three ADCs capable of millions of samples per second each, and a dual DAC. Lots of UART/I2C/SPI too, of course.
But the most interesting aspect of this chip, versus the lowly STM32F103 chip used in the HyTiny and Olimexino, is perhaps its speed: the STM32F7 series can run at up to 216 MHz, three times the 72 MHz clock of the F103. At first glance, it might seem that this would translate to “simply” running three times as many instructions in the same amount of time. Not so:
This is what the different columns represent:
µs/10k = microseconds to run 10,000 iterations of the loop
clk/loop = processor clock cycles per single loop iteration
iter/µs = iterations per µs (the same as: million iterations per second)
speedup = performance increase of F746 @ 216 MHz over F103 @ 72 MHz
efficiency = performance increase specific to Cortex M7 vs Cortex M3
That last column is the most interesting one: it compares the measured performance of some simple loops in Mecrisp Forth while dividing out the clock rate. So an empty loop runs about 4 times faster than could be explained by the clock speed difference alone.
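To make the relationship between the columns concrete, here is a minimal Python sketch of the arithmetic as described above. It is not the code used to produce the table: the function names are made up for illustration, and the two timings at the bottom are placeholder inputs, not measured values, chosen only so that the result lands on the "about 4 times" figure mentioned for the empty loop.

```python
# Sketch of how the table columns relate to each other.
# Function names are hypothetical; the sample timings are PLACEHOLDERS,
# not values taken from the measurements.

ITERATIONS = 10_000  # loop count used by each benchmark word


def clk_per_loop(us_per_10k: float, clock_mhz: float) -> float:
    """Clock cycles per loop iteration (1 µs at 1 MHz is one cycle)."""
    return us_per_10k * clock_mhz / ITERATIONS


def iter_per_us(us_per_10k: float) -> float:
    """Loop iterations per microsecond."""
    return ITERATIONS / us_per_10k


def speedup(us_f103: float, us_f746: float) -> float:
    """How many times faster the F746 finishes the same loop."""
    return us_f103 / us_f746


def efficiency(us_f103: float, us_f746: float,
               mhz_f103: float = 72.0, mhz_f746: float = 216.0) -> float:
    """Speedup with the clock-rate ratio (216 / 72 = 3) divided out."""
    return speedup(us_f103, us_f746) / (mhz_f746 / mhz_f103)


# Placeholder timings in µs per 10,000 iterations (assumed, not measured).
f103_us, f746_us = 1200.0, 100.0
print("clk/loop F103:", clk_per_loop(f103_us, 72.0))
print("clk/loop F746:", clk_per_loop(f746_us, 216.0))
print("speedup:", speedup(f103_us, f746_us))
print("efficiency:", efficiency(f103_us, f746_us))
```

Put differently, the efficiency is simply the ratio of the two clk/loop values: it says how many fewer clock cycles the M7 needs for the same loop.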
The most likely explanation is a better cache, a better processing pipeline, or better instruction lookahead (prefetch or branch prediction) - or, more likely still, a mix of all of these. Getting to the bottom of this would require much more investigation. For now, the point is simply to show how advances in µC technology can lead to more-than-linear performance increases.
The code used for the above timing results was as follows (running from RAM):
10000 buffer: buf
: j0 micros 10000 0 do                       loop micros swap - . ;
: j1 micros 10000 0 do nop                   loop micros swap - . ;
: j2 micros 10000 0 do 1 i buf + c!          loop micros swap - . ;
: j3 micros 10000 0 do buf c@ drop           loop micros swap - . ;
: j4 micros 10000 0 do i buf + c@ drop       loop micros swap - . ;
: j5 micros 10000 0 do i buf + c@ dup + drop loop micros swap - . ;
: j6 micros 10000 0 do i buf + c@ dup * drop loop micros swap - . ;
: j7 micros 10000 0 do i buf + c@ dup / drop loop micros swap - . ;
: jn j0 j1 j2 j3 j4 j5 j6 j7 ;
It’s not a very comprehensive timing suite - just a quick set of explorations which came to mind. Let’s not even try to suggest that this would be representative in any way or for any purpose.
One aspect stands out, though: the amazing speed of this code. It can be typed into the console interactively, yet because Mecrisp Forth compiles each definition to native machine code, the resulting performance is orders of magnitude higher than that of other interactive languages, which tend to be interpreted (especially in such a constrained µC context).
The range of power consumption modes is equally impressive: from a few dozen mA with the F103 running at 72 MHz, or about 150 mA with the F746 running at 216 MHz, down to just a few microamps in standby mode. Computers have come a long way since the PDP-8!
P.S. - Here is a different kind of performance comparison: running 1,000,000,000 iterations of an empty loop takes about 26 s on an STM32F746 @ 216 MHz, 7 s on a Core i7 @ 2.8 GHz using QEMU in a Linux VM (via qemu-arm-static), and 1 s on an Odroid C1+ @ 1.7 GHz. The last two both use the Linux ARM build (all these tests were done with “Mecrisp 2.2.5 RA”).
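For a rough sense of scale, the timings above can be converted to clock cycles per loop iteration. The Python sketch below is not part of the original tests; it only redoes the arithmetic off-target, and the function name `cycles_per_iteration` is made up for illustration. Since the timings are approximate ("about 26 s"), so are the results.

```python
# Convert the P.S. timings into clock cycles per loop iteration.
# The seconds and clock rates are the approximate figures quoted in the text.

ITERATIONS = 1_000_000_000  # empty-loop iterations in each run


def cycles_per_iteration(seconds: float, clock_hz: float,
                         iterations: int = ITERATIONS) -> float:
    """Clock cycles that elapse per loop iteration."""
    return seconds * clock_hz / iterations


runs = {
    "STM32F746 @ 216 MHz": (26.0, 216e6),
    "Core i7 @ 2.8 GHz (QEMU)": (7.0, 2.8e9),
    "Odroid C1+ @ 1.7 GHz": (1.0, 1.7e9),
}

for name, (secs, hz) in runs.items():
    print(f"{name}: {cycles_per_iteration(secs, hz):.1f} cycles/iteration")
```

Note that the Core i7 figure counts host cycles per emulated ARM iteration, so it mostly reflects the cost of running under QEMU rather than the raw speed of the i7 itself.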