Generated code & performance May 2017

Let’s look at how the tricks in the previous article end up in code, and their performance effects. First the trace.fs utility, which needs to insert special calls at the start and end of each word:

: a 2 . ;  ok.
see a
20000394: F847  str r6 [ r7 #-4 ]!
20000396: 6D04
20000398: 2602  movs r6 #2
2000039A: B500  push { lr }
2000039C: F244  movw r0 #434F
2000039E: 304F
200003A0: 4780  blx r0  --> .
200003A2: BD00  pop { pc }
ok.

That was the normal untraced code. Here is a definition with tracing added:

: ta 2 . ;  ok.
see ta
20002892: B500  push { lr }
20002894: F7FF  bl  2000269E  --> trace-entry
20002896: FF03
20002898: F847  str r6 [ r7 #-4 ]!
2000289A: 6D04
2000289C: 2602  movs r6 #2
2000289E: F244  movw r0 #434F
200028A0: 304F
200028A2: 4780  blx r0  --> .
200028A4: F7FF  bl  20002736  --> trace-exit
200028A6: FF47
200028A8: BD00  pop { pc }

Even if you’re not familar with ARM assembly code, you can see that two calls have been inserted, for a total of 8 extra bytes code for each compiled word. There’s not much point investigating performance, since traced code will slow down enormously while printing out all the trace messages. It’s usually not a good idea to enable tracing on lots of nested words.

Next, let’s see how profile.fs alters the way compilation works:

: a ;  ok.
see a
20000394: 4770  bx lr
ok.

: ta ;  ok.
see ta
2000263A: F242  movw r0 #2600
2000263C: 6000
2000263E: F2C2  movt r0 #2000
20002640: 0000
20002642: 6AC3  ldr r3 [ r0 #2C ]
20002644: 3301  adds r3 #1
20002646: 62C3  str r3 [ r0 #2C ]
20002648: 4770  bx lr
ok.

In this case, 14 bytes get added, to load the address of a variable, fetch it, increment it, and save the result. Plus another 22 bytes for each “(cnt)“ variable that gets created in the dictionary. So this does add up a bit more: we will need 36 bytes per word extra when enabling profiling.

To see the impact on performance, the micros word can be used to measure the elapsed time when calling a word say 1,000,000 times. The general structure for this type of measurement code is as follows (it has to be compiled since there is a loop involved):

: blah micros 1000000 0 do <some-operation> loop micros swap - . ; blah

This saves the current time (in µs) on the stack, performs the loop, saves the new time, reverses the top two stack elements, calculates their difference, and prints the result:

: dummy micros 1000000 0 do loop micros swap - . ; dummy 83385  ok.

As you can see, the overhead of a dummy loop is 83.4 nanoseconds - this is on an STM32F103 running at 72 MHz. Let’s repeat this for 1,000,000 calls of the unprofiled empty word:

: t1 micros 1000000 0 do a loop micros swap - . ; t1 83385  ok.

Looks like that call of an empty word got optimised away - oh well. Now the profiled version:

: t2 micros 1000000 0 do ta loop micros swap - . ; t2 222359  ok.

So profiling will add 139 ns to every call where profiling has been turned on. Let’s verify:

profile
20002794 00000001 t2
2000272C 00000001 t1
200026D4 00000000 tc
2000267C 00000000 tb
20002630 000F4240 ta
 ok.

Yep, t1 and t2 were called once, and ta was called $F4240 times, i.e. 1,000,000 in decimal.

Keep in mind that words tend to be small and numerous in well-written Forth code, so this will add up. But if you define all the low-level words before including the profile.fs code, and don’t need to track their counts, then in many cases this sort of overhead should be no problem at all: 7 million calls will still add less than one second to the execution time.

Not bad for a compiler + runtime + interactive prompt, running on a tiny little 64K flash µC!

The last utility we’ll look at here is the routes.fs code, which allows after-the-fact patching of words compiled to flash memory:

: ta ;  ok.
see ta
200027EE: F242  movw r0 #2780
200027F0: 7080
200027F2: F2C2  movt r0 #2000
200027F4: 0000
200027F6: 6E03  ldr r3 [ r0 #60 ]
200027F8: F847  str r6 [ r7 #-4 ]!
200027FA: 6D04
200027FC: 461E  mov r6 r3
200027FE: B500  push { lr }
20002800: 2E00  cmp r6 #0
20002802: D001  beq 20002808
20002804: F847  str r6 [ r7 #-4 ]!
20002806: 6D04
20002808: 2E00  cmp r6 #0
2000280A: CF40  ldmia r7 { r6 }
2000280C: D003  beq 20002816
2000280E: F642  movw r0 #2F23
20002810: 7023
20002812: 4780  blx r0  --> execute
20002814: BD00  pop { pc }
 ok.

There’s no need to show the compiled code for a again, it’s the same as in the profiled case. The patchable version now uses 40 bytes instead of 2, plus again 22 bytes for each “(vec)” variable definition, for a total of 60 bytes of overhead per word.

Note that these examples are for RAM-based code. When compiled to flash, all addresses will end up in the lower 64K of memory, reducing calls and variables from 8-byte to 4-byte code.

The overhead for re-routable code is as follows:

: t micros 1000000 0 do td loop micros swap - . ; t 555866  ok.

That’s 472 ns overhead per call to a re-routable word. The reason for this is probably that the logic in each word is more complex: we have to not just fetch the patched value, we also have to check whether it is set (i.e. ≠ 0), and if so: call the replacement code and exit our original code prematurely. This requires an extra -dup, if/then, and exit. As you can see, this adds up.

But let’s keep it all in perspective: both profiling and re-routing add well under a microsecond of overhead. It’s still virtually unnoticable - until you start making millions of “instrumented” calls.

Weblog © Jean-Claude Wippler. Generated by Hugo.