Computing stuff tied to the physical world

TFoC – A wide performance range

(This article is part of The Fabric of Computing series – “in search of simplicity”)

It’s a tricky business to try and define what constitutes the fabric & essence of computing. But it’s probably reasonable to say that computing is where physics and information meet:

  • a transistor is not a computing device (but it can be used to create one)
  • likewise, a logic gate is not a computer, but it sure forms part of it
  • a software package (of any kind) is not a computer, but it does run on it

So where does the FPGA fit in? Well, in a way, an FPGA could well be called the clearest example of what makes a computer today: it has the logic, the interconnects which make that logic sing, and the memory where computations take place – if not all memory, then at least the accumulators and registers at the heart of a computer’s Central Processing Unit.

Should we throw out everything and switch to the FPGA for all computer implementations?

No, not any time soon. An FPGA is not nearly as efficient (in terms of silicon die size as well as power consumption) as a design created for a specific task, such as a microcontroller. Also, modern µCs include, as a bonus, quite advanced and powerful analog hardware in the form of ADCs, DACs, comparators, and sometimes even op-amps and PGAs (which here stands for Programmable Gain Amplifiers – not in any way related to (F)PGAs).

Only small FPGAs with limited hardware gimmicks fall in the $10..100 hobbyist range.

And besides… who needs processing power in a remote node, sending out a temperature reading every few minutes, and going into ultra low-power sleep the rest of the time?

No, FPGAs are not a replacement. Nor would you want an FPGA to drive your laptop. Here are some timing results of “idle loops” implemented in various ways, to try and grasp the immense range of performance available to us with various computing technologies:

  • the PDP-8/S introduced in 1965 had a cycle time of 10 µs – so at best, it might run an idle loop about 100,000 times per second, although a jump very likely took considerably more than one cycle

  • the Z80 introduced in 1976 could run at 4 MHz, but the “jump” instruction for creating an infinite loop took 10 cycles, so that’s 400,000 idle loops per second

  • four decades later, on a 2015-era 4-core 2.8 GHz i7 CPU with its advanced pipelining and branch prediction, each of the cores can process billions of instructions per second – with an optimising gforth compiler for example, the “1000000000 0 do loop” takes around 2 seconds – that’s 2 nanoseconds per loop iteration
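
To get a feel for how such figures are obtained, here is a minimal timing sketch in C – not the original gforth benchmark, just an equivalent empty loop. The volatile counter is only there to keep an optimising compiler from deleting the loop altogether:

    // Time one billion "idle" loop iterations and report ns per iteration.
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long N = 1000000000L;     // one billion iterations
        volatile long counter = 0;      // volatile: don't optimise the loop away

        clock_t start = clock();
        for (long i = 0; i < N; ++i)
            ++counter;                  // the "idle" work
        clock_t end = clock();

        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        printf("%.2f s total, %.2f ns/iteration\n", secs, secs / N * 1e9);
        return 0;
    }

On a CPU in the i7’s class, this should land in the same few-nanoseconds-per-iteration ballpark as the gforth figure above.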

And where does the FPGA stand on this performance scale? Well, it depends:

  • with a low-end EP2C5 implementing a Z80 soft core running at 25 MHz, we see these idle loop times: 40 µs (DX Forth), 70 µs (BBC Basic), and 100 µs (MBASIC)

  • but when a slightly newer Spartan-6 FPGA implements the SwapForth engine at 100 MHz, the idle loop time drops dramatically – to a very impressive 100 nanoseconds

When interpreters for high-level programming languages are involved, we get a completely different range of processing speeds. Here are some quick timing results:

  • the Espruino JavaScript engine, running on a 72 MHz STM32F103, can perform around 1500 iterations per second (note that Espruino does no tokenisation: it parses and interprets each line of source code, including loops, over and over again)

  • for comparison, the Tiny Basic Plus code (in C) running on that same 72 MHz STM32F103 does some 100,000 iterations per second – that’s 10 µs per loop iteration

  • and lastly, probably the silliest example of all: an STM32F103 at 72 MHz, emulating a 6502 MPU, which in turn is running a 10K Basic interpreter: about 1,000 loop iterations per second
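
All the µC figures above boil down to the same kind of measurement: run many iterations, divide by elapsed time. On an STM32F103 (a Cortex-M3), the DWT cycle counter is a convenient way to take such timings. Here is a hedged sketch, assuming the CMSIS headers – the header name and the body() callback are placeholders, not code from any of the projects mentioned:

    // Timing a code fragment via the Cortex-M3 DWT cycle counter (CMSIS).
    #include "stm32f1xx.h"              // CMSIS device header (name may vary)

    static void cycle_counter_init(void) {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable trace/DWT
        DWT->CYCCNT = 0;                                 // reset the counter
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start counting
    }

    // Returns the average number of CPU cycles per call of body().
    // At 72 MHz, 72 cycles correspond to 1 µs.
    uint32_t cycles_per_iteration(void (*body)(void), uint32_t n) {
        cycle_counter_init();
        uint32_t start = DWT->CYCCNT;
        for (uint32_t i = 0; i < n; ++i)
            body();                     // e.g. one interpreter loop step
        return (DWT->CYCCNT - start) / n;
    }

Divide 72,000,000 by the result and you get iterations per second, i.e. the sort of figures quoted above.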

So what we’re seeing here is a spread of nearly six orders of magnitude – from 2 ns per iteration on the i7 down to 1 ms on the emulated 6502. Does it matter?

Actually, that’s not so clear cut. Even that very last silly example, needing 1 millisecond to perform a single BASIC loop iteration, may be more than fast enough to handle a periodic sensor readout and wireless node transmission. In terms of speed, that is.

In terms of power consumption, less so, probably: if the node needs 10 ms to do its thing once a second, then that’s 1 % of the time, at full speed, drawing at least a few milliamps. Not so great for truly low-power nodes, i.e. if you want to run it for years on a coin cell.

But even just one or two orders of magnitude faster, and the on-time of such a node would drop accordingly. Tiny Basic Plus on the STM32 µC might already be fine, for example.
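
A quick back-of-the-envelope estimate makes this concrete. The current figures in this C sketch are assumptions picked for illustration (a few tens of mA awake, a few µA asleep), not measurements:

    // Estimate average current and coin cell life for a 1% duty-cycle node.
    #include <stdio.h>

    int main(void) {
        double active_ma = 20.0;    // assumed current while awake, in mA
        double sleep_ua  = 5.0;     // assumed deep-sleep current, in µA
        double on_time_s = 0.010;   // 10 ms awake ...
        double period_s  = 1.0;     // ... once a second, i.e. 1% duty cycle
        double cell_mah  = 230.0;   // typical CR2032 coin cell capacity

        double duty   = on_time_s / period_s;
        double avg_ma = active_ma * duty + (sleep_ua / 1000.0) * (1.0 - duty);
        printf("average draw: %.3f mA\n", avg_ma);    // about 0.205 mA
        printf("coin cell life: %.0f days\n", cell_mah / avg_ma / 24.0);
        return 0;
    }

With these numbers the cell lasts around 47 days – a far cry from the multi-year life we’re after, which is why shaving an order of magnitude or two off the on-time matters so much.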

It’s hard to draw clear conclusions from this quick, and somewhat arbitrary, comparison. Clearly, µCs are more than enough for the bulk of home monitoring & automation tasks, even when running an interpreter of some sort. Espruino may not be quite there (yet?).

FPGAs are total overkill, unless you’re implementing a soft core for a specific purpose and need to perform a high-speed task on the side, such as some audio or video processing. But they do offer the fascinating capability of shape-shifting into whatever type of computing device you need. And they are reconfigurable on the fly, which has intriguing implications.

Lastly, we should not underestimate the performance of modern laptop / desktop chips. They are totally beyond the embedded µC’s league. Then again, so are their prices…

So the concluding remark for now has to be “it depends”. With speed not really a major factor, and the performance loss of an interpreter no longer a show stopper, maybe we can now put software development ease and long-term flexibility first?

This is the real art of good engineering – balancing all the constraints (cost, convenience, configurability, etc.) to derive an optimum solution.