chunked transfers for a big improvement in performance #97

Open · wants to merge 9 commits into master from chunked-transfer

Conversation

@eringerli (Contributor) commented May 7, 2022

The data is prepared into chunks, and the reads and writes are transmitted chunk by chunk instead of byte by byte. A constant defines the chunk size.

This is a tradeoff between RAM/CPU and bus speed: on the 'bigger' Arduino platforms the CPU is a lot faster than the SPI bus and there is plenty of RAM available, so spending extra RAM and CPU cycles to prepare the data and then letting the DMA do its work is the way to go.

The chunked transfers also combine the reads and writes, so the dead time in between is removed, which is especially important for register reads of SPI-attached chips.

Chunked transfers give an improvement of about 40% over bytewise ones, plus an additional ~5% for small reads/writes as used in BusIO_Register, by removing the dead time between writing and reading. The ESP32 can transfer buffers without an inter-byte delay; the M4 and AVR can't, as their Arduino cores do the transmission byte by byte. However, since the inner loop sits farther down the stack when using the core's buffer transfer, this delay can still be shortened.

The AVR uses a chunk size of 32 bytes, all other platforms 64 bytes. The ESP32 in particular does some chunking internally and also uses 64 bytes, so this shouldn't burden the hardware with additional overhead. The AVR and M4 don't chunk internally, so the size is fairly arbitrary, but since the AVR has limited resources and most reads/writes are in the range of a typical sensor register read, 32 bytes were chosen.
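
To make the chunking concrete, here is a minimal sketch of a chunked write (hypothetical names, not the PR's actual code), assuming the multi-byte SPIClass::transfer(buf, count) overload that the common Arduino cores provide and a chunk-size constant as described above:

```cpp
#include <Arduino.h>
#include <SPI.h>

// Hypothetical chunk-size constant; the PR uses 32 on AVR and 64 elsewhere.
static const size_t SPI_CHUNK_SIZE = 64;

// Write a const buffer in chunks. transfer(buf, count) overwrites buf with
// the received bytes, so each chunk is first copied into a scratch buffer
// to keep the caller's buffer untouched.
bool writeChunked(SPIClass &spi, const uint8_t *buffer, size_t len) {
  uint8_t chunk[SPI_CHUNK_SIZE];
  size_t pos = 0;
  while (pos < len) {
    size_t n = (len - pos) > SPI_CHUNK_SIZE ? SPI_CHUNK_SIZE : (len - pos);
    memcpy(chunk, buffer + pos, n); // preserve constness of the source
    spi.transfer(chunk, n);         // one multi-byte call per chunk
    pos += n;
  }
  return true;
}
```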

Logic Analyser Traces

The signal "DIO 11" is used as the actually controlled CS of Adafruit_SPIDevice, "DIO 12" is set/cleared right before/after calling the member. I used "DIO 12" to measure the performance to get the overhead.

ESP32

Large Combined Transfer

Here is a transfer of write_and_read() with 70 bytes to write and 70 to read. Without this PR it takes 216us to transmit; with it, only 135us. This is an improvement of 37%. As the ESP32 has a special case in the code, the writing half looks solid black, with a small interruption caused by the internal chunking.

Before

wr-70+70-master

After

wr-70+70-chunk

Small Combined Transfer

Here is a transfer of write_and_read() with 2 bytes to write and 12 to read. Without this PR it takes 41us to transmit; with it, only 26us. This is an improvement of 35%.

The effect of a bytewise transfer in a for() loop on performance is plainly visible. This is the case on all platforms except for buffer writes on the ESP32.

Before

wr-2+12-master

After

wr-2+12-chunked

Writing a single Large Buffer

Here is a transfer of write() with 70 bytes. There is a small performance degradation, as the chunking is now done a layer up instead of in the Arduino core.

Before

70-master

After

70-chunk

Writing a single Small Buffer

Here is a transfer of write() with 12 bytes. There is a small performance degradation, at least on an ESP32.

Before

w-12-master

After

w-12-chunked

Writing two Small Buffers (2+12 bytes)

Here is a transfer of write() with 2+12 bytes. The performance is roughly the same, at least on an ESP32.

Before

2+12-master

After

2+12-chunk

Writing two Small Buffers (1+1 bytes)

Here is a transfer of write() with 1+1 bytes. The performance is roughly the same, at least on an ESP32.

Before

1+1-master

After

1+1-chunked

Feather M4 Express

Info

Here is a call to write() followed by write_then_read(), with a 2-byte and a 9-byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. The bus speed is 10MHz. The unchunked transfer takes ~25us, the chunked one ~20us; this is an improvement of 20%.

Without chunking the inter-byte time is 1.17us; with the data prepared in chunks it is 0.5us. This is a direct result of having the Arduino core run the inner loop instead of calling through pointers and APIs. The SERCOM implementation only has a method to transfer single bytes, so the Arduino core calls that method for each byte individually. I tested adding a method for buffer transfers; this lowers the inter-byte time to 350ns but doesn't remove it. Removing it entirely can be done, but requires DMA.

Before

Screenshot_20220515_214551-m4-master

After

Screenshot_20220515_214506-m4-chunk

Feather 32u4

Info

Here is a call to write() followed by write_then_read(), with a 2-byte and a 9-byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. As the AVR is only clocked at 8MHz, I reduced the bus speed to 2MHz. The unchunked transfer takes ~168us, the chunked one ~140us; this is an improvement of 17%.

Without chunking the inter-byte time is 8.5us; with chunking it is 1.25us. This is a direct result of having the Arduino core run the inner loop instead of calling through pointers and APIs.

Before

Screenshot_20220515_215521-avr-master

After

Screenshot_20220515_215428-avr-chunk

@eringerli eringerli force-pushed the chunked-transfer branch from a670b04 to fc5c987 Compare May 7, 2022 17:41
@ladyada ladyada requested a review from caternuson May 8, 2022 14:44
@ladyada (Member) commented May 8, 2022

@caternuson wanna take a look?

@caternuson (Contributor) commented:

Could this be done more generically, similar to how I2C does chunks?

while (pos < len) {
  size_t read_len =
      ((len - pos) > maxBufferSize()) ? maxBufferSize() : (len - pos);
  bool read_stop = (pos < (len - read_len)) ? false : stop;
  if (!_read(buffer + pos, read_len, read_stop))
    return false;
  pos += read_len;
}

@eringerli (Contributor, Author) commented May 9, 2022

No, Arduino's I2C implementation caches the data internally; the SPI one doesn't. SPI is all about throughput, so there is only a thin layer around the hardware, and transfer() is the only method for multi-byte transfers in 'standard' Arduino code. If you want to do write-only or read-only transfers, you have to buffer the data somehow. That's where my code comes in.

@caternuson (Contributor) commented:

Right. But couldn't you work directly on the buffer passed in:

  bool read(uint8_t *buffer, size_t len, uint8_t sendvalue = 0xFF);
  bool write(const uint8_t *buffer, size_t len,
             const uint8_t *prefix_buffer = nullptr, size_t prefix_len = 0);
  bool write_then_read(const uint8_t *write_buffer, size_t write_len,
                       uint8_t *read_buffer, size_t read_len,
                       uint8_t sendvalue = 0xFF);
  bool write_and_read(uint8_t *buffer, size_t len);

instead of making a new buffer:

   std::array<uint8_t, maxBufferSizeForChunkedTransfer> chunkBuffer;

@eringerli (Contributor, Author) commented May 10, 2022

It's the principle of least surprise: if I overwrite the buffer given for writing with the data read from SPI, I break code out there that expects the buffer not to change between transmits. The buffers are also passed as const, so editing them isn't possible without a rather inelegant const_cast<>. Using a new buffer is backwards compatible and fast, as it is allocated on the stack.

Then the optimisation of combining the reads and writes comes into play: the latency for either reading or writing is about 1us, and the same gap appears between the read and write in write_then_read() and between the two writes in write(). In 1us you can transmit 10 bits at 10MHz. In the old code, reading is done with the single-byte overload of transfer(). That is also why the reads were so interrupted: the cost of calling into the underlying Arduino core is about 1us per byte.

@eringerli (Contributor, Author) commented:

This is clearly visible in "Small Combined Transfer" - "Before": to preserve the data in the write buffer (and its constness), transfer() is called for each byte individually. As the overhead of calling into the underlying Arduino core is about 1us, this results in the interrupted transfer. If you copy the data into a chunk buffer, you can call transfer() with the whole buffer at once. This is the only method in the Arduino core that can do multi-byte transfers on all platforms, as it is the most "natural" mode of SPI: for each byte transferred, one is received. This reduces the overhead dramatically; the statistics are in the PR description.
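
As an illustration of the combined transfer described here (hypothetical names and signature, not the PR's actual implementation), a small write_then_read that fits in one chunk could look roughly like this:

```cpp
#include <Arduino.h>
#include <SPI.h>

static const size_t SPI_CHUNK_SIZE = 64; // hypothetical constant

// Combine a small write and read into a single transfer() call: the write
// bytes and the filler bytes for the read phase share one chunk buffer, so
// there is no dead time between writing and reading.
bool writeThenReadSmall(SPIClass &spi, const uint8_t *write_buffer,
                        size_t write_len, uint8_t *read_buffer,
                        size_t read_len, uint8_t sendvalue = 0xFF) {
  uint8_t chunk[SPI_CHUNK_SIZE];
  if (write_len + read_len > sizeof(chunk))
    return false; // larger transfers would loop over several chunks instead
  memcpy(chunk, write_buffer, write_len);           // command/register bytes
  memset(chunk + write_len, sendvalue, read_len);   // filler clocks out the read
  spi.transfer(chunk, write_len + read_len);        // one multi-byte transfer
  memcpy(read_buffer, chunk + write_len, read_len); // received bytes land here
  return true;
}
```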

@caternuson (Contributor) commented:

It seems like this really comes down to using SPI.transfer(buffer, size); in chunks instead of looping on SPI.transfer(val); byte-by-byte.

What prevents this from being implemented on AVR?

@eringerli (Contributor, Author) commented May 13, 2022

AVR doesn't have std::array<> and has a lot less RAM and CPU power. I can try it this evening to see if the performance improvement is in the same range. std::array<> is quite a simple sugar coating around an array on the stack (it provides, among other things, begin(), end() and size()), so if I find it has a performance impact on AVR, I'll write a template of my own.
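
For illustration only, a thin stand-in with roughly those semantics could look like this sketch (not the template the PR later adds):

```cpp
#include <stddef.h>

// Minimal fixed-size array wrapper, a stand-in for std::array on cores
// without STL support. Only the members needed for chunking are provided.
template <typename T, size_t N> struct ChunkArray {
  T storage[N];

  T *begin() { return storage; }
  T *end() { return storage + N; }
  T *data() { return storage; }
  T &operator[](size_t i) { return storage[i]; }
  static constexpr size_t size() { return N; }
};
```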

@caternuson (Contributor) commented:

Can you try implementing without using STL or a custom template class? The buffer can just be:

uint8_t chunkBuffer[maxBufferSizeForChunkedTransfer];

Yes, you lose the convenience methods of the array template class. But the buffer access requirements here are hopefully pretty simple. Like in the I2C code snippet above.

eringerli added 2 commits May 14, 2022 16:11
If the code doesn't run on an 8-bit AVR, the data is prepared into
chunks and the reads and writes are transmitted chunk by chunk instead
of byte by byte. A constant defines the chunk size.

This is a tradeoff between RAM/CPU and bus speed: on the 'bigger' Arduino
platforms the CPU is a lot faster than the SPI bus and there is plenty of
RAM available, so spending extra RAM and CPU cycles to prepare the data
and then letting the DMA do its work is the way to go.

The chunked transfers also combine the reads and writes, so the dead
time in between is removed, which is especially important for register
reads of SPI-attached chips.

Chunked transfers give an improvement of about 40% over bytewise
ones, plus an additional ~5% for small reads/writes as used in
BusIO_Register, by removing the dead time between writing and reading.
This applies to all supported platforms except 8-bit AVRs, which are
specifically #if'd out. The special case for the ESP32 is therefore
removed, and ARM M0/M4, ESP8266, Teensy etc. should profit from this
improvement too, without special-casing each platform by using
non-standard Arduino core extensions.
…r AVR

The AVR Arduino platform doesn't have std::array. To use the chunked
transfer mode, I added a template to replace it. It has roughly the same
semantics as std::array<> and is protected inside Adafruit_SPIDevice.

With this template, the chunked transfer is also possible for 8bit AVRs.
@eringerli (Contributor, Author) commented:

I added another commit to this PR to replace std::array<> with a custom template. The template is protected in Adafruit_SPIDevice, so it doesn't pollute the namespace.

@caternuson (Contributor) commented:

Just to be sure - it's relying on the use of the array template class that's preventing this from working on AVR? Are there potentially other architectures that also do not implement array? Is it 100% necessary to use array?

@eringerli (Contributor, Author) commented:

I'm refactoring the code now anyway; I found some more optimisation potential. For example, using memcpy() is about 3x as fast as going through the array byte by byte, but it requires some math to figure out what to copy. I think that instead of passing around raw pointers, a template that bundles them into a nice form is easier to understand. It also allows for a typedef with using, so methods can take a reference to the buffer and have all the information about it.

Of all the architectures tested in the CI, only AVR failed on std::array; the others use a more modern compiler with better STL support.
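
To illustrate the memcpy() point with hypothetical helpers (not the PR's code), the two ways of filling a chunk before calling transfer() would look like this; the second variant moves the inner loop into the libc routines:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Byte-by-byte fill: one loop iteration (and bounds check) per byte.
void fillChunkBytewise(uint8_t *chunk, const uint8_t *src, size_t src_len,
                       uint8_t filler, size_t fill_len) {
  for (size_t i = 0; i < src_len; i++)
    chunk[i] = src[i];
  for (size_t i = 0; i < fill_len; i++)
    chunk[src_len + i] = filler;
}

// Block fill: memcpy()/memset() copy the write data and the read filler.
void fillChunkBlockwise(uint8_t *chunk, const uint8_t *src, size_t src_len,
                        uint8_t filler, size_t fill_len) {
  memcpy(chunk, src, src_len);               // block copy of the write data
  memset(chunk + src_len, filler, fill_len); // fill the read part with sendvalue
}
```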

@eringerli eringerli marked this pull request as draft May 14, 2022 16:06
@eringerli (Contributor, Author) commented May 14, 2022

I'm setting the state of this PR to Draft until I have committed the new changes.

@eringerli eringerli force-pushed the chunked-transfer branch 6 times, most recently from 303521d to 04ab25a Compare May 14, 2022 18:48
@eringerli (Contributor, Author) commented May 14, 2022

The last commit ee48463 uses memcpy() and memset(), which has the following effect:

Before

Screenshot_20220514_210431
Screenshot_20220514_210534

After

Screenshot_20220514_210641
Screenshot_20220514_210701

It removes 3.4us per chunk, which results in a 6% performance improvement when writing/reading buffers of 80/80 bytes in write_then_read().

I think this is close to what's physically possible; I don't see any further optimisations.

@eringerli eringerli marked this pull request as ready for review May 14, 2022 19:22
@eringerli eringerli force-pushed the chunked-transfer branch 2 times, most recently from 298a80c to b486929 Compare May 14, 2022 19:53
@eringerli eringerli marked this pull request as draft May 14, 2022 19:55
@eringerli eringerli force-pushed the chunked-transfer branch 2 times, most recently from 5c988e1 to e9df8ca Compare May 15, 2022 19:58
@eringerli (Contributor, Author) commented:

I finally had the time to test the chunked transfers on an M4 and an 8-bit AVR. I added the logic analyser traces with some comments to the PR description.

@eringerli eringerli marked this pull request as ready for review May 15, 2022 20:12
@eringerli eringerli force-pushed the chunked-transfer branch 3 times, most recently from 0cc8028 to 66e17ce Compare May 15, 2022 22:28
@eringerli (Contributor, Author) commented May 15, 2022

Using a template is about managing complexity: I personally like keeping top-level code simple and moving the complexity farther down into easier-to-test (and easier-to-read) methods with the least amount of boilerplate and code duplication. That's what I have done here too: the template keeps track of the memory, and the methods can access the data and do their job of managing the chunks. I usually like to keep things simpler than this, but pointer math is almost never easy on the eyes. I hope I've at least done a reasonably good job of naming the variables to ease understanding.

I also don't like reinventing the wheel, so I use as much of the STL as possible. That code is tremendously optimised, thoroughly tested and well documented.

@caternuson (Contributor) commented:

Your arguments for templates and STL usage are all fine. But for these Arduino libraries, it's generally preferred to keep things simple and avoid using these unless really necessary.

This PR likely won't be merged if it continues to rely on template usage. Not that it does not work - you've very clearly documented that it does. Just that for this, and other Arduino libraries, we also agree with your comment here:

I usually like to keep it more simple than this, but pointer math is almost never easy on the eyes.

The I2C chunk code takes the pointer math approach and works fine. And the readability of that should be OK for Arduino library maintainers.

@eringerli (Contributor, Author) commented:

I removed the template for Array and the usage of std::min.

@eringerli (Contributor, Author) commented:

@caternuson What do you think of the last two commits? Is it good as it is now?

@caternuson (Contributor) commented:

There still appear to be two templates? printChunk and minimum?

@eringerli (Contributor, Author) commented May 24, 2022

I changed printChunk to be non-templated. But I draw the line at minimum: I surely don't have to explain why preprocessor function-like macros are a really bad idea. Why Arduino still uses them is beyond me, especially since templates have been around since C++ was invented.

@caternuson (Contributor) commented:

@eringerli (Contributor, Author) commented May 25, 2022

Because it is a macro. Macros used like this are really evil: they evaluate their arguments multiple times, and the biggest advantages of C++, strong typing and scoping (e.g. namespaces), are not used. Defines are for conditionals like platform-dependent stuff and for including headers. For everything else we have better alternatives: for macros, templated or inlined functions (since forever); for values, constexpr (since C++11) and/or enum class (also since C++11; you can even specify the underlying type, e.g. uint8_t). All these alternatives follow the rules of C++, especially about not polluting the (global) namespace.

Ever tried using stdlib's <algorithm> when there are macros with generic names like min, max, etc. defined? The headers of Espressif's ESP are even more extreme: they define all the values of binary flags up to 32 bits (!) with generic names like B2. You get a nightmare of hard-to-decipher compiler errors. After I #include Arduino headers, I usually have to clean up with #undef in my code, usually by wrapping the headers inside a header of my own. This is not only wasted time; over time it becomes an unmaintainable mess of custom includes and old, no-longer-used dependencies.

So no, I will not use Arduino's min, as it doesn't result in the same code. minimum evaluates its arguments only once, is type-safe (it only compares two values of exactly the same type), has no side effects, compiles down to a couple of assembler instructions and, most of all, is easy to read. It is also local to Adafruit_SPIDevice.cpp, so it can't be used elsewhere. I would rather use std::min, but AVR doesn't have it.
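
For reference, a local, type-safe helper with the properties described above could be as small as this sketch (the PR's exact definition may differ):

```cpp
// Internal-linkage, type-safe minimum: each argument is evaluated exactly
// once, and both must have the same type.
template <typename T> static inline T minimum(T a, T b) {
  return a < b ? a : b;
}
```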

Here is another explanation of why to avoid macros: https://luckyresistor.me/knowledge/avoid-preprocessor-macros/
