Performance and I/O toggling Feb 2017

With all those source lines being sent to Mecrisp, it’s time to look a little more under the hood.

Let’s start by pointing out that this article focuses on an STM32F103 µC running at 72 MHz, which is an ARM Cortex M3. Performance figures for the JeeNode Zero’s STM32L052 µC will differ, because it runs at a lower clock speed and because it’s a “simpler” ARM Cortex M0+.

Many Forth implementations use a “threaded interpreter” design, but Mecrisp is different in that it compiles Forth directly to ARM machine code. The code in embello includes a “micros” word, which returns the time of the systick hardware counter as microseconds. We can use this to measure the duration of operations - here’s how to find out how long “nop” takes:

: a micros nop micros swap - . ; a 1  ok.

So that was quick: < 1 µs! - and not so surprising, considering that the clock runs at 72 MHz.

Let’s go through the above code, just so that the logic is made very clear:

It’s very important to compile the code into a definition and then run it, otherwise we’ll be measuring the lookup and compilation time. Interesting, but not what we’re after here:

micros nop micros swap - . 2483  ok.

Now we can have a look at raw performance. Let’s run an empty loop a million times:

: a micros 1000000 0 do loop micros swap - . ; a 83386  ok.

That’s 83.4 ns per loop iteration, i.e. 12 MHz. Since the system clock is 72 MHz, we can deduce that Mecrisp compiles the empty loop to something which takes 6 clock cycles per iteration. It generated this code in just a few milliseconds, by the way - i.e. unnoticeble in interactive use.

More interesting, is how fast we can toggle an I/O pin. Let’s take PC13, the LED on a Blue Pill:

: a micros 1000000 0 do pc13 iox! loop micros swap - . ; a 236240  ok.

That’s 236 ns per iteration, so the pin toggles over 4 million times per second. If we discount for the 6-cycle loop overhead, the pin toggling itself takes 153 ns.

Actually, iox! is still a bit slower than just setting or clearing an I/O pin:

: a micros 1000000 0 do pc13 ios! loop micros swap - . ; a 166752  ok.

Again, without the loop overhead, setting a pin takes a mere 83 nanoseconds. Not too shabby for a self-contained µC board with an interactive prompt - all implemented in 20 KB of code!

This super performance is only possible because Matthias Koch used his expertise to help the compiler perform optimally on io@, ios!, ioc!, and iox! - so now everyone benefits from it.

Now let’s dive in deeper… a lot deeper. There’s a disassembler (written in Forth, what else?), which we can load into flash memory with the following commands (it uses about 6 KB):

compiletoflash
!s ../flib/mecrisp/disassembler-m3.fs
compiletoram

With this, we can examine the actual machine code generated for “a” by Mecrisp’s compiler:

: a micros nop micros swap - . ;  ok.
see a
0000CDE0: B500  push { lr }
0000CDE2: F7FA  bl  00006E12  --> micros
0000CDE4: F816
0000CDE6: F7F7  bl  00004762  --> nop
0000CDE8: FCBC
0000CDEA: F7FA  bl  00006E12  --> micros
0000CDEC: F812
0000CDEE: CF08  ldmia r7 { r3 }
0000CDF0: 1AF6  subs r6 r6 r3
0000CDF2: F7F7  bl  0000434E  --> .
0000CDF4: FAAC
0000CDF6: BD00  pop { pc }

That’s probably a bit too much assembler code to take in, so let’s start a bit simpler:

: a ( n -- n ) 17 + ;  ok.
see a
20000454: 3611  adds r6 #11
20000456: 4770  bx lr

This word takes the top of the stack and adds 17 to it. It gets compiled into a routine with just two machine instructions (the stack top is always in register 6): add 17 (hex 11) and return.

Things are about to get a bit more interesting:

: b ( n -- n ) 7 + 10 + ;  ok.
see b
20000470: 3607  adds r6 #7
20000472: 360A  adds r6 #A
20000474: 4770  bx lr

No surprises yet: we first added 7 and then we added 10 (hex A) more. Now watch this:

: c ( n -- n ) 7 10 + + ;  ok.
see c
20000498: 3611  adds r6 #11
2000049A: 4770  bx lr

Mecrisp does constant folding: the compiler can keep track of the topmost few items on the stack, and decide to optimise things if they are constants! (calculating 7+10 at compile time)

To show some more code generation at work, here are three equivalent definitions:

: d ( n -- n ) 2 * ;  ok.
see d
200004B4: 2302  movs r3 #2
200004B6: 435E  muls r6 r3
200004B8: 4770  bx lr

: e ( n -- n ) 2* ;  ok.
see e
200004D0: 0076  lsls r6 r6 #1
200004D2: 4770  bx lr

: f ( n -- n ) dup + ;  ok.
see f
200004E4: 19B6  adds r6 r6 r6
200004E6: 4770  bx lr

To see how constant folding actually works, we can look at the definition of bit, which takes a number on the stack and returns an int with just that bit set:

: bit ( u -- u ) 1 swap lshift  1-foldable ;
see bit
000058DE: 2301  movs r3 #1
000058E0: 40B3  lsls r3 r6
000058E2: 461E  mov r6 r3
000058E4: 4770  bx lr

The special trick here is that we’ve defined this word as “1-foldable”, meaning: if the topmost 1 elements (i.e. just the top) of the stack is known to be a constant, then instead of compiling the code, Mecrisp will execute that code on the constant value right away, and use the resulting value instead. It’s optimising away the operation, helped by the unique property that run-time and compile-time contexts are very similar in Forth. Here is an example of using bit:

: g ( u -- u ) 3 bit and ;  ok.
see g
0000CEF8: F016  ands r6 r6 #8
0000CEFA: 0608
0000CEFC: 4770  bx lr

This really only scratches the surface of what Mecrisp is doing under the hood. Here are a few more examples - you’ll have to know something about ARM machine code and the assembly mnemonics, but as you can see, it’s all very compact:

: aa ( -- ) 10 0 do loop ;  ok.
see aa
0000CE8E: B500  push { lr }
0000CE90: F847  str r6 [ r7 #-4 ]!
0000CE92: 6D04
0000CE94: B430  push { r4  r5 }
0000CE96: 2400  movs r4 #0
0000CE98: 250A  movs r5 #A
0000CE9A: CF40  ldmia r7 { r6 }
0000CE9C: 3401  adds r4 #1
0000CE9E: 42AC  cmp r4 r5
0000CEA0: D1FC  bne 0000CE9C
0000CEA2: BC30  pop { r4  r5 }
0000CEA4: BD00  pop { pc }

: bb ( n -- n ) if 2 else 3 then ;  ok.
see bb
0000CEB2: B500  push { lr }
0000CEB4: 2E00  cmp r6 #0
0000CEB6: CF40  ldmia r7 { r6 }
0000CEB8: D003  beq 0000CEC2
0000CEBA: F847  str r6 [ r7 #-4 ]!
0000CEBC: 6D04
0000CEBE: 2602  movs r6 #2
0000CEC0: E002  b 0000CEC8
0000CEC2: F847  str r6 [ r7 #-4 ]!
0000CEC4: 6D04
0000CEC6: 2603  movs r6 #3
0000CEC8: BD00  pop { pc }

: cc ( -- ) 2 3 * 4 + drop ;  ok.
see cc
0000CED6: 4770  bx lr

: dd ( -- ) 2 3 * 4 + . ;  ok.
see dd
0000CEE2: F847  str r6 [ r7 #-4 ]!
0000CEE4: 6D04
0000CEE6: 260A  movs r6 #A
0000CEE8: B500  push { lr }
0000CEEA: F7F7  bl  0000434E  --> .
0000CEEC: FA30
0000CEEE: BD00  pop { pc }

It may seem overdone to have Mecrisp generate machine code and do all the optimisation on the fly, but when the lower levels are so performant, it becomes feasible to construct an entire application on top. Even if deeply nested and layered as many tiny “words” calling each other.

Which is a good thing in ultra low-power designs: the sooner the µC is done with its periodic tasks, the sooner it can return to its sub-µA sleep mode.

The best of both worlds: compiler-grade performance, with interpreter-style instant coding!

Weblog © Jean-Claude Wippler. Generated by Hugo.