UART over DMA

Building on top of the new code using C++ on bare metal, this article describes a set of drivers which really puts some of the more advanced hardware through its paces.

A new version of JeeH is in the works …

This code is now being written as part of a new JeeH library, which should eventually become version 2. See https://git.jeelabs.org/jeeh/?h=dev, which is where all the new code is being developed (as time permits).

Three serial I/O drivers

One of the oldest and most widespread ways to transfer data over some distance is the serial port, using an asynchronous protocol. The hardware for this is called a UART (or USART, if it also supports a synchronous communication mode).

Every modern µC has at least one UART nowadays. UARTs work on a byte-by-byte basis: you transmit data by feeding it bytes, and you receive data by pulling bytes out when available. At the common modern speed of 115,200 baud, the transfer of each byte takes about 87 µs.
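The 87 µs figure can be checked with a few lines of code. This is a minimal sketch which assumes the usual 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bit times per byte); the name byteTimeUs is made up for this illustration:

```cpp
#include <stdint.h>

// Time on the wire for one byte, in microseconds.
// Assumes 8N1 framing: 1 start + 8 data + 1 stop = 10 bit times per byte.
constexpr double byteTimeUs (uint32_t baud, uint32_t bitsPerFrame =10) {
    return 1e6 * bitsPerFrame / baud;
}

// compile-time checks of the figures used in this article
static_assert(byteTimeUs(115'200) > 86.8 && byteTimeUs(115'200) < 86.9);
static_assert(byteTimeUs(1'000'000) == 10.0);
```

With a parity bit or a second stop bit, each byte takes proportionally longer.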

Polled

The simplest way (in terms of code complexity) to transfer data is by polling the UART: whenever the transmit buffer is empty, we can feed it a new byte, and whenever the receive buffer has data, we can pull out one received byte.

Here is the code to send individual bytes (and wait while busy):

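void putchar (int c) {
    constexpr IoReg<0x4001'3800> UART1;  // UART1 base address
    enum { ISR=0x1C, TDR=0x28 };         // status and TX data register offsets

    while (!UART1[ISR](7)) {}            // wait for bit 7, "TX buffer empty"
    UART1[TDR] = c;                      // store the byte to send
}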

Setup is slightly more involved, but there is clearly not much to sending bytes:

  1. wait until a flag bit in the UART reports “TX buffer empty”
  2. store the new byte to send in the “TX data register”

The drawback of this approach is that the code must busy-wait, i.e. it can’t do anything else while the bytes are being sent out. For reception, this is even more inconvenient, as the app must check at least once every 87 µs to see whether new data has arrived, and then take it out of the receive buffer. Failure to get the data out in time causes a buffer overrun and data loss …

Interrupt-driven

A more advanced approach is to use interrupts: the UART can interrupt the CPU when a byte has been received or when the transmit buffer has been emptied. This is probably the most common technique. The interrupt handlers then take care of saving the incoming data in a larger (circular) buffer, and of feeding waiting output from another buffer (again, usually circular). See this code for an example of such a driver.

Interrupt-driven I/O allows the application to spend less time checking and waiting. Input will be saved in a buffer area, and although it still needs to be picked up and processed, the within-87-µs requirement is gone. Likewise, entire lines of text can be stored in an output buffer, with interrupts getting the data out while the application goes doing more interesting things.
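The circular buffers mentioned above could look roughly as follows. This is a hypothetical sketch, not the code of the driver linked earlier: the Ring name and its layout are assumptions, and it relies on there being one producer and one consumer on a single-core µC (for example an interrupt handler on one side and the application on the other).

```cpp
#include <stdint.h>

// Hypothetical single-producer/single-consumer byte queue (not JeeH code).
// One slot is kept unused, so "empty" and "full" can be told apart.
template <uint32_t N>
struct Ring {
    volatile uint32_t head = 0;  // next slot to write, changed by the producer only
    volatile uint32_t tail = 0;  // next slot to read, changed by the consumer only
    uint8_t buf [N];

    bool empty () const { return head == tail; }
    bool full () const { return (head + 1) % N == tail; }

    bool put (uint8_t b) {       // e.g. called from the RX interrupt handler
        if (full())
            return false;        // overrun: the byte is dropped
        buf[head] = b;
        head = (head + 1) % N;
        return true;
    }

    int get () {                 // e.g. called from the application
        if (empty())
            return -1;           // nothing available
        uint8_t b = buf[tail];
        tail = (tail + 1) % N;
        return b;
    }
};
```

For transmit, the roles are swapped: the application calls put() and the interrupt handler calls get() to fetch the next byte to send.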

Another potential benefit is power consumption: since the CPU will get interrupts whenever action is required, it can switch to a low-power sleep mode whenever there is nothing else to do. In the case of an application waiting for commands, this could in fact end up being most of the time, so the power savings can be very substantial. In polled mode, the app would have to stay active, polling the UART at least every 87 µs in case a data byte just came in.

In terms of CPU processing, interrupts offer a clear advantage: a few µs of processing to handle each interrupt, versus continuous polling. Even if the interrupts take 10 µs each, that still leaves the CPU free to do other things (or enter low-power mode) almost 90% of the time.

But there are limits to what interrupts can do. At higher baudrates, the interrupt response times and overhead can become critical. At 1 Mbaud, there is only 10 µs time to get the next incoming byte. If the CPU is currently handling another interrupt, or running with interrupts disabled in a critical piece of code, things may well go wrong. And the overhead will take more and more of the CPU’s time.
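The load figures in this section follow from simple arithmetic. The sketch below assumes 10 bit times per byte (8N1 framing), bytes arriving back-to-back, and a fixed cost per interrupt; irqLoad is a made-up name for this illustration:

```cpp
#include <stdint.h>

// Fraction of CPU time spent in the UART interrupt handler when bytes
// arrive back-to-back. Assumes 10 bit times per byte (8N1 framing) and a
// fixed cost of isrUs microseconds per interrupt.
constexpr double irqLoad (uint32_t baud, double isrUs) {
    return isrUs * baud / (10 * 1e6);
}
```

With a 10 µs handler, irqLoad(115'200, 10) is about 0.115, which leaves the CPU free almost 90% of the time. At 1 Mbaud the same handler gives 1.0: the CPU would do nothing but service UART interrupts.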

Direct Memory Access

This is where the DMA controller comes in: special dedicated hardware which takes over the task of shuttling data between a hardware peripheral such as a UART and memory.

With DMA, you set things up to start a transfer, and it ends with an interrupt once this transfer is done. With memory buffers of say 100 bytes, the CPU gets involved once per 100 bytes instead of once per byte, i.e. 1% as often. More importantly, the allowed response time increases 100-fold: the CPU can now ignore the UART for several milliseconds at a time, without any data getting lost.

Setting up DMA is quite involved. There are several new issues to deal with. Separate DMA “streams” are needed for receive and transmit. But the benefits are substantial: even rates over 1 Mbaud can easily be handled with DMA, as long as the buffers are sized appropriately. And since the CPU needs to do less, it can stay (much) longer in a low-power sleep mode.

DMA on STM32

I’ve now implemented DMA-based UART drivers for a range of STM32 families: F1/F3/F4/F7, H7, and L0/L4. Once you figure out the intricacies of DMA hardware, the actual differences across variants are relatively small. These drivers are ≈ 200 lines of (dense) C++17 code.

Writing interrupt-based drivers is tricky business. Interrupts have a habit of occasionally being fired at the most unexpected (and inconvenient) times. Finding all cases can be fiendishly hard. In the end, massive testing in lengthy and varied scenarios is the only way to “get it right”.

Transmit DMA

The transmit side is the simpler one to implement: place the data to send in a contiguous area of memory, prepare the DMA TX stream, start it, and prevent further sends until the DMA “transmit complete” interrupt comes in. The full implementation can be found here - I’ll include a few snippets to illustrate the process:

dmaTX(CPAR) = dev.uart + TDR;
dmaTX(CMAR) = (uint32_t) txBuf;
dmaTX(CNDTR) = txCnt = len;
dmaTX(CCR) = 0b1001'0011; // MINC DIR TCIE EN

In prose: tell the DMA stream where the TX data register is, where in memory to take data from, and how many bytes to transfer. The final write sets a 1-byte “memory increment”, specifies “outbound” as the transfer direction, enables the “transfer complete” interrupt (TCIE), and enables the stream, which starts the transfer.

When complete, an interrupt will be generated and the DMA stream will stop.

There is a lot more to it than this in terms of setup, however:

  • each DMA device (there are two) has up to 8 “streams” and 8 “channels”
  • an interrupt handler has to be installed for this particular stream
  • each UART (and other devs, e.g. SPI) is tied to specific (hard-coded) streams & channels

I’ll omit these details here; see the source code for the exact implementation. The point here is that once the setup is finished, starting new transfers is as simple as the code shown above. From then on, the CPU is no longer involved.

But there is a small imperfection in all this: it takes a bit of CPU involvement to keep the output going full-speed from one buffer to the next. At very high rates, there will be small gaps where the UART is not sending data out.

This can easily be improved. I’ll just point out the general idea here:

  • instead of a fixed buffer, use a circular buffer and place new data into an area which is not currently being sent out via DMA
  • when the completion interrupt comes in, the interrupt handler can immediately set up a new transfer, if there is more data waiting to be sent
  • since the output buffer is now circular, some care is needed to wrap around (and send the data in two separate transfers)

Note that there can still be a period between DMA-complete and next-byte-sent, but now it’s all done within the interrupt handler, which tends to run as soon as the interrupt occurs. Given that a UART tends to still have a byte “in transit” when its TX buffer goes empty, the actual effect is that if the interrupt handling is quick, then there will be no gap in the output stream.

This restart approach avoids waiting for the app to respond to a DMA-done event. For the app, filling the buffer within a few milliseconds will be enough to keep the output going at full pace.
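The bookkeeping for such a restart could be sketched as follows. This is a hypothetical model which runs without any hardware, not the actual driver code: the TxRing name and its fields are assumptions, and the place where real code would program the DMA stream is only marked by a comment.

```cpp
#include <stdint.h>

// Hypothetical bookkeeping for a circular transmit buffer (not JeeH code).
// Bytes from take up to fill are waiting to go out; both indices wrap at N.
// fill == take means "nothing to send", so at most N-1 bytes can be queued.
template <uint32_t N>
struct TxRing {
    uint8_t buf [N];
    uint32_t fill = 0;   // where the app places its next byte
    uint32_t take = 0;   // where the next DMA transfer starts
    uint32_t busy = 0;   // number of bytes in the DMA transfer now running

    // Length of the next contiguous run which can be handed to the DMA stream.
    // A run never crosses the end of the buffer, so wrapped data takes two transfers.
    uint32_t nextRun () const {
        return fill >= take ? fill - take : N - take;
    }

    // Called when the stream is idle, and from the completion interrupt handler.
    // Returns the size of the transfer started, or 0 if there is nothing to send.
    uint32_t restart () {
        take = (take + busy) % N;   // release what the finished transfer sent
        busy = nextRun();
        // real code would now point the stream at buf+take, set its count
        // to busy, and enable it (only if busy > 0)
        return busy;
    }
};
```

When the pending data wraps around the end of the buffer, the first restart() sends the part up to the end, and the next completion interrupt sends the remainder from the start of the buffer.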

That’s all there is to using DMA for UART transmission. With just a 100-byte output buffer, the CPU overhead drops to 1% of what it would be with an interrupt-based driver. The app can keep output going even if it’s very busy doing other things in between (or go to sleep to reduce power consumption). And all of this will also work with multiple UART streams.

Receive DMA

Reception is very different: first of all, “completion” is not well-defined when it comes to free-form input, such as lines of text of varying lengths. Another complication is that reception really needs to deal with that 87 µs maximum response time under all conditions, even when the input buffer just filled up.

There are two hardware features which address these issues: circular DMA, and idle interrupts:

  • Circular DMA is a mechanism whereby the receive stream keeps on running forever, wrapping around with a buffer of known size. The DMA stream is set up once, and then keeps on placing received bytes into memory. There is still the potential of buffer overrun, so it’s up to the app to read data sufficiently often (I’m not going to go into flow control).

  • Idle interrupts are a special feature of the UART hardware, which generates an interrupt after a byte has been received and then no data is coming in for the duration of at least one more byte. The serial line has gone “idle”, so to speak. This happens when sending slower than what a connection could handle (e.g. typing characters on a keyboard).

One more hardware feature will be used to make DMA reception seamless:

  • Half-full interrupts can be generated when the DMA pointer into a circular buffer passes the midway point, i.e. when half of the buffer has data. As shown below, this gives the driver time to process new data without having to immediately make room for new data.

So how can this all be made to work together? It turns out to be surprisingly simple. First, prepare the DMA receive stream for continuous circular input:

dmaRX(CNDTR) = sizeof rxBuf;
dmaRX(CPAR) = dev.uart + RDR;
dmaRX(CMAR) = (uint32_t) rxBuf;
dmaRX(CCR) = 0b1010'0111; // MINC CIRC HTIE TCIE EN

This code very much resembles the transmit case. A few more flags, data moves the other way, and that’s basically it. Better still, the DMA receive stream is never reconfigured again - it just keeps receiving and wrapping forever. The real work happens in the interrupt handler.

There will be three ways the receive interrupt gets triggered: when the UART is idle, when the DMA stream passes the midway point, and when it wraps around the end. All of these cases are handled by the same interrupt handler:

  • we keep track of the last byte read (or rather: the next byte unread)
  • when the interrupt happens, figure out how much new data there is
  • it may have wrapped, so there may be two contiguous ranges of new data
  • once processed, adjust the last-byte-read position accordingly

And that’s it: DMA fills and fills, and the driver gets interrupts to allow it to figure out how much came in, and where that data is in the input buffer. If the app fails to read data quickly enough, the DMA will just keep going, and the app will miss some of the incoming bytes (without error).
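The "how much new data" step could be sketched like this. It assumes that the stream's count register (CNDTR in the setup code) counts down from the buffer size and reloads when the stream wraps, and that fewer than a full buffer of bytes arrived since the last call. The Range type and the newData name are made up for this illustration:

```cpp
#include <stdint.h>

struct Range { uint32_t pos, len; };   // hypothetical helper type

// Work out which bytes arrived since the last interrupt.
// size:  total size of the circular receive buffer
// next:  index of the next unread byte
// ndtr:  current value of the DMA count register, which counts down from
//        size and reloads when the stream wraps around
// Returns the number of contiguous ranges stored in out (0, 1, or 2).
int newData (uint32_t size, uint32_t next, uint32_t ndtr, Range out [2]) {
    uint32_t fill = (size - ndtr) % size;   // where DMA writes its next byte
    if (fill == next)
        return 0;                           // nothing new
    if (fill > next) {
        out[0] = { next, fill - next };     // one range, no wrap
        return 1;
    }
    out[0] = { next, size - next };         // tail end of the buffer
    if (fill == 0)
        return 1;                           // wrapped exactly at the end
    out[1] = { 0, fill };                   // the rest, at the start of the buffer
    return 2;
}
```

After processing the reported ranges, the caller sets next to the write position, i.e. (size - ndtr) % size, so that the following interrupt only sees newer data.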

The timing is quite interesting: every time a receive interrupt arrives, and assuming we properly dealt with preceding ones, a very useful invariant holds: at least half the input buffer is still free, so the app has a relatively large amount of time to deal with the new data. While the app processes that data, there will always be room for the DMA controller to save more bytes.

Another way to look at this is that the app never gets more than half a buffer of input data. And if the input goes idle, then the app may well get only single bytes each time it sees new data.

Managing buffers

Circular buffers are very useful on both the transmit and the receive side, but they do add some complexity, since “contiguous” is now a relative term. There might be 90 bytes free in the transmit buffer, even though the next area is smaller because the end of the buffer requires wrapping back to the beginning. Likewise, there may be 75 bytes in the input buffer, waiting to be read out, even though only a few bytes remain before the end is reached, with the rest waiting at the beginning of the input buffer.

This “semi-contiguous” aspect of buffers can grossly complicate the code, not just in the driver but also in the application, for transmit as well as receive. One way to solve this is to allocate additional memory and copy data to make the “wrap split” disappear, but that increases memory use, adds CPU time, and seems a bit wasteful.

Instead, I’m using a different approach: on transmit, the driver reports the place and size of the next “slot” in the buffer which can be filled. If the end is reached, it will report the free buffer space as two separate slots. Similarly, on receive, the driver only reports the next contiguous area, even if more data is already present. This approach is combined with a second call to the driver, where the app reports how much data it gave (on TX) or took (on RX).

The most useful effect of this is on the receive side, because you can read just what you need, for example only up to the next newline character. In normal POSIX-like read() calls this is not possible: you have to read either byte-by-byte and stop when reaching the newline, or read bigger chunks and then (in the app) save the part which is not yet being processed.

It’s perhaps easier to illustrate all this using the actual API - here is the transmit side:

uint32_t uart::txSlot (uint8_t** pp =nullptr);
void uart::txPushed (uint32_t len);

The code to send an arbitrary number of bytes – as in a POSIX write() – is as follows:

void uart::send (void const* ptr, uint32_t len) {
    while (len > 0) {
        uint8_t* p;
        auto n = txSlot(&p);
        if (n > 0) {
            if (n > len)
                n = len;
            memcpy(p, ptr, n);
            txPushed(n);
            ptr = (uint8_t const*) ptr + n;
            len -= n;
        } else
            yield();
    }
}

In other words: determine what free buffer space there is, copy data into it, tell the driver how much was added, and then repeat until all data has been sent. Given that it won’t always fit right away, yield() gets called to pass the time (it can be a no-op, a delay, or a task switch).

The point to note here is that there is no additional memory involved: the app generates data and places as much as it can into the output buffer, as soon as it can.
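To make the "two separate slots" behaviour concrete, here is a host-side model of the two transmit calls. It is a hypothetical sketch and not the actual driver: only the call signatures are taken from the API shown above, while the TxModel name, its fields, and the one-unused-byte convention are assumptions.

```cpp
#include <stdint.h>

// Host-side model of the slot calls (hypothetical, the real driver differs).
// One byte is kept unused so a full buffer can be told apart from an empty one.
template <uint32_t N>
struct TxModel {
    uint8_t buf [N];
    uint32_t fill = 0;   // next byte to be written by the app
    uint32_t take = 0;   // next byte to be sent out by DMA

    // Report the next contiguous free area; it never crosses the buffer end.
    uint32_t txSlot (uint8_t** pp =nullptr) {
        if (pp != nullptr)
            *pp = buf + fill;
        if (fill < take)
            return take - fill - 1;        // free area ends just before take
        // free area runs to the end of the buffer, minus the unused byte
        // if that byte would otherwise be the last one in the buffer
        return take == 0 ? N - fill - 1 : N - fill;
    }

    // The app reports how many bytes it placed into the slot.
    void txPushed (uint32_t len) { fill = (fill + len) % N; }
};
```

With an 8-byte buffer, fill at 7 and take at 5, the first txSlot() call reports 1 byte at the end of the buffer. After txPushed(1), the next call reports 4 bytes at the start: the free space comes back as two slots, and a loop like send() above handles this without any special-casing.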

The receiver-side API is very similar:

uint32_t uart::rxSlot (uint8_t** pp =nullptr);
void uart::rxPulled (uint32_t len);

The rxSlot call reports how much data there is, and where it is located (the pp arg gets set during the call), then the app can take as much as it wants of that data (including nothing at all). If any data was taken out, the rxPulled call then lets the driver know it can re-use that part of the buffer (i.e. advance its internal buffer position).

There is no receive() example (yet), because it really depends on the app how much data needs to be retrieved. Some scenarios will want to read all there is, others up to a fixed number of bytes, and others still up to some terminator such as a newline character.
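As one illustration of the newline case, a helper along these lines might work. It is a hypothetical sketch: readLine is not part of the driver, and FakeUart is a simplified stand-in (canned data, no wrap-around) which only exists so the code can run on a desktop machine. A line which straddles the wrap point of the real circular buffer would never show up in a single slot and needs extra handling, which is left out here.

```cpp
#include <stdint.h>
#include <string.h>

// Stand-in for the real driver, so this sketch can run on a desktop machine.
// Only the two receive calls are modelled; there is no wrap-around.
struct FakeUart {
    uint8_t buf [16];
    uint32_t fill = 0, next = 0;   // bytes received, bytes consumed

    uint32_t rxSlot (uint8_t** pp =nullptr) {
        if (pp != nullptr)
            *pp = buf + next;
        return fill - next;
    }
    void rxPulled (uint32_t len) { next += len; }
};

// Hypothetical helper: copy bytes up to and including the next newline.
// Returns the number of bytes copied, or 0 if no complete line has arrived
// yet or if it would not fit, in which case nothing is consumed.
template <typename Uart>
uint32_t readLine (Uart& uart, char* line, uint32_t max) {
    uint8_t* p;
    uint32_t n = uart.rxSlot(&p);
    auto nl = (uint8_t*) memchr(p, '\n', n);
    if (nl == nullptr)
        return 0;                  // no newline in this slot (yet)
    uint32_t len = nl - p + 1;
    if (len > max)
        return 0;                  // line too long for the caller's buffer
    memcpy(line, p, len);
    uart.rxPulled(len);            // release only the part which was taken
    return len;
}
```

Data after the newline stays in the driver's buffer until a later call, so the app does not have to keep a partial line around itself.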

Multi-tasking

The current API leaves all waiting up to the app (as in the yield() call above). Or to put it differently: the current UART-DMA driver code does not know or care what sort of multi-tasking environment is in place (if any).

This is workable but not optimal. Ideally, a task which needs to wait for TX output to drain or RX input to arrive should be suspended and queued so that the driver itself can resume such tasks when appropriate. Right now, the task can at best just block and unblock unconditionally - there is no notion of “appropriate”.

A better solution will have to wait until I figure out the details. Given that these drivers are µC-specific, whereas multi-tasking is not (and should also work in non-embedded builds), the proper approach is to implement multi-tasking outside of the JeeH library which contains UART-DMA drivers such as the ones described above.