chunked transfers for a big improvement in performance #97

Open · wants to merge 9 commits into master from chunked-transfer

Conversation

@eringerli (Contributor) commented May 7, 2022

The data is prepared into chunks, and the reads and writes are transmitted chunk by chunk instead of byte by byte. A constant defines the chunk size.

This is a tradeoff between RAM/CPU and bus speed: on the 'bigger' Arduino platforms the CPU is a lot faster than the SPI bus and there is plenty of RAM available, so spending extra RAM and CPU cycles to prepare the data and then letting the DMA do its work is the way to go.

The chunked transfers also combine the reads and writes, so the dead time in between is removed, which is especially important for register reads of SPI-attached chips.

Chunked transfers give an improvement of about 40% over bytewise ones, plus an additional ~5% for small reads/writes as used in BusIO_Register, by removing the dead time between writing and reading. The ESP32 can transfer buffers without an inter-byte delay; the M4 and AVR can't, as their Arduino cores do the transmission byte by byte. However, since the inner loop sits farther down the stack when using the core's buffer transfer, this delay can still be shortened.

The AVR uses a chunk size of 32 bytes, all other platforms 64 bytes. The ESP32 in particular does some chunking internally and also uses 64 bytes, so this shouldn't burden the hardware with additional overhead. The AVR and M4 don't chunk internally, so the size is fairly arbitrary, but since the AVR has limited resources and most reads/writes are in the range of a typical sensor register read, 32 bytes were chosen.
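
To make the chunking concrete, here is a minimal sketch of a chunked write (hypothetical names, not the PR's actual code), assuming the multi-byte SPIClass::transfer(buf, count) overload that the common Arduino cores provide and a chunk-size constant as described above:

```cpp
#include <Arduino.h>
#include <SPI.h>

// Hypothetical chunk-size constant; the PR uses 32 on AVR and 64 elsewhere.
static const size_t SPI_CHUNK_SIZE = 64;

// Write a const buffer in chunks. transfer(buf, count) overwrites buf with
// the received bytes, so each chunk is first copied into a scratch buffer
// to keep the caller's buffer untouched.
bool writeChunked(SPIClass &spi, const uint8_t *buffer, size_t len) {
  uint8_t chunk[SPI_CHUNK_SIZE];
  size_t pos = 0;
  while (pos < len) {
    size_t n = (len - pos) > SPI_CHUNK_SIZE ? SPI_CHUNK_SIZE : (len - pos);
    memcpy(chunk, buffer + pos, n); // preserve constness of the source
    spi.transfer(chunk, n);         // one multi-byte call per chunk
    pos += n;
  }
  return true;
}
```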

Logic Analyser Traces

The signal "DIO 11" is used as the actually controlled CS of Adafruit_SPIDevice, "DIO 12" is set/cleared right before/after calling the member. I used "DIO 12" to measure the performance to get the overhead.

ESP32

Large Combined Transfer

Here is a transfer of write_and_read() with 70 bytes to write and 70 to read. Without this PR it takes 216us to transmit; with it, only 135us. This is an improvement of 37%. As the ESP32 has a special case in the code, the writing half looks solid black, with a small interruption caused by the internal chunking.

Before

wr-70+70-master

After

wr-70+70-chunk

Small Combined Transfer

Here is a transfer of write_and_read() with 2 bytes to write and 12 to read. Without this PR it takes 41us to transmit; with it, only 26us. This is an improvement of 35%.

The effect of a bytewise transfer in a for() loop on performance is plainly visible. This is the case on all platforms except for buffer writes on the ESP32.

Before

wr-2+12-master

After

wr-2+12-chunked

Writing a single Large Buffer

Here is a transfer of write() with 70 bytes. There is a small performance degradation, as the chunking is now done a layer up instead of in the Arduino core.

Before

70-master

After

70-chunk

Writing a single Small Buffer

Here is a transfer of write() with 12 bytes. There is a small performance degradation, at least on an ESP32.

Before

w-12-master

After

w-12-chunked

Writing two Small Buffers (2+12 bytes)

Here is a transfer of write() with 2+12 bytes. The performance is roughly the same, at least on an ESP32.

Before

2+12-master

After

2+12-chunk

Writing two Small Buffers (1+1 bytes)

Here is a transfer of write() with 1+1 bytes. The performance is roughly the same, at least on an ESP32.

Before

1+1-master

After

1+1-chunked

Feather M4 Express

Info

Here is a call to write() followed by write_then_read(), with a 2-byte and a 9-byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. The bus speed is 10MHz. The unchunked transfer takes ~25us, the chunked one ~20us; this is an improvement of 20%.

Without chunking the inter-byte time is 1.17us; with the data prepared in chunks it is 0.5us. This is a direct result of having the Arduino core run the inner loop instead of calling through pointers and APIs. The SERCOM implementation only has a method to transfer single bytes, so the Arduino core calls that method for each byte individually. I tested adding a method for buffer transfers; this lowers the inter-byte time to 350ns but doesn't remove it. Removing it entirely can be done, but requires DMA.

Before

Screenshot_20220515_214551-m4-master

After

Screenshot_20220515_214506-m4-chunk

Feather 32u4

Info

Here is a call to write() followed by write_then_read(), with a 2-byte and a 9-byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. As the AVR is only clocked at 8MHz, I reduced the bus speed to 2MHz. The unchunked transfer takes ~168us, the chunked one ~140us; this is an improvement of 17%.

Without chunking the inter-byte time is 8.5us; with chunking it is 1.25us. This is a direct result of having the Arduino core run the inner loop instead of calling through pointers and APIs.

Before

Screenshot_20220515_215521-avr-master

After

Screenshot_20220515_215428-avr-chunk

@eringerli eringerli force-pushed the chunked-transfer branch from a670b04 to fc5c987 Compare May 7, 2022 17:41
@ladyada ladyada requested a review from caternuson May 8, 2022 14:44
@ladyada (Member) commented May 8, 2022

@caternuson wanna take a look?

@caternuson (Contributor) commented:

Could this be done more generically, similar to how I2C does chunks?

while (pos < len) {
  size_t read_len =
      ((len - pos) > maxBufferSize()) ? maxBufferSize() : (len - pos);
  bool read_stop = (pos < (len - read_len)) ? false : stop;
  if (!_read(buffer + pos, read_len, read_stop))
    return false;
  pos += read_len;
}

@eringerli (Contributor, Author) commented May 9, 2022

No, Arduino's I2C implementation caches the data internally; the SPI one doesn't. SPI is all about throughput, so there is only a thin layer around the hardware, and transfer() is the only method for multi-byte transfers in 'standard' Arduino code. If you want to do write-only or read-only transfers, you have to buffer the data somehow. That's where my code comes in.

@caternuson (Contributor) commented:

Right. But couldn't you work directly on the buffer passed in:

  bool read(uint8_t *buffer, size_t len, uint8_t sendvalue = 0xFF);
  bool write(const uint8_t *buffer, size_t len,
             const uint8_t *prefix_buffer = nullptr, size_t prefix_len = 0);
  bool write_then_read(const uint8_t *write_buffer, size_t write_len,
                       uint8_t *read_buffer, size_t read_len,
                       uint8_t sendvalue = 0xFF);
  bool write_and_read(uint8_t *buffer, size_t len);

instead of making a new buffer:

   std::array<uint8_t, maxBufferSizeForChunkedTransfer> chunkBuffer;

@eringerli (Contributor, Author) commented May 10, 2022

It's the principle of least surprise: if I overwrite the buffer given for writing with the data read from SPI, I break code out there that expects the buffer not to change between transmits. The buffers are also passed as const, so editing them isn't possible without a rather inelegant const_cast<>. Using a new buffer is backwards compatible and fast, as it is allocated on the stack.

Then the optimisation of combining the reads and writes comes into play: the latency for either reading or writing is about 1us, and the same gap appears between the read and write in write_then_read() and between the two writes in write(). In 1us you can transmit 10 bits at 10MHz. In the old code, reading is done with the single-byte overload of transfer(). That is also why the reads were so interrupted: the cost of calling into the underlying Arduino core is about 1us per byte.

@eringerli (Contributor, Author) commented:

This is clearly visible in "Small Combined Transfer" - "Before": to preserve the data in the write buffer (and its constness), transfer() is called for each byte individually. As the overhead of calling into the underlying Arduino core is about 1us, this results in the interrupted transfer. If you copy the data into a chunk buffer, you can call transfer() with the whole buffer at once. This is the only method in the Arduino core that can do multi-byte transfers on all platforms, as it is the most "natural" mode of SPI: for each byte transferred, one is received. This reduces the overhead dramatically; the statistics are in the PR description.
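
As an illustration of the combined transfer described here (hypothetical names and signature, not the PR's actual implementation), a small write_then_read that fits in one chunk could look roughly like this:

```cpp
#include <Arduino.h>
#include <SPI.h>

static const size_t SPI_CHUNK_SIZE = 64; // hypothetical constant

// Combine a small write and read into a single transfer() call: the write
// bytes and the filler bytes for the read phase share one chunk buffer, so
// there is no dead time between writing and reading.
bool writeThenReadSmall(SPIClass &spi, const uint8_t *write_buffer,
                        size_t write_len, uint8_t *read_buffer,
                        size_t read_len, uint8_t sendvalue = 0xFF) {
  uint8_t chunk[SPI_CHUNK_SIZE];
  if (write_len + read_len > sizeof(chunk))
    return false; // larger transfers would loop over several chunks instead
  memcpy(chunk, write_buffer, write_len);           // command/register bytes
  memset(chunk + write_len, sendvalue, read_len);   // filler clocks out the read
  spi.transfer(chunk, write_len + read_len);        // one multi-byte transfer
  memcpy(read_buffer, chunk + write_len, read_len); // received bytes land here
  return true;
}
```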

@caternuson (Contributor) commented:

It seems like this really comes down to using SPI.transfer(buffer, size); in chunks instead of looping on SPI.transfer(val); byte-by-byte.

What prevents this from being implemented on AVR?

@eringerli (Contributor, Author) commented May 13, 2022

AVR doesn't have std::array<> and has a lot less RAM and CPU power. I can try it this evening to see if the performance improvement is in the same range. std::array<> is quite a simple sugar coating around an array on the stack (it provides, among other things, begin(), end() and size()), so if I find it has a performance impact on AVR, I'll write a template of my own.
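
For illustration only, a thin stand-in with roughly those semantics could look like this sketch (not the template the PR later adds):

```cpp
#include <stddef.h>

// Minimal fixed-size array wrapper, a stand-in for std::array on cores
// without STL support. Only the members needed for chunking are provided.
template <typename T, size_t N> struct ChunkArray {
  T storage[N];

  T *begin() { return storage; }
  T *end() { return storage + N; }
  T *data() { return storage; }
  T &operator[](size_t i) { return storage[i]; }
  static constexpr size_t size() { return N; }
};
```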

@caternuson (Contributor) commented:

Can you try implementing without using STL or a custom template class? The buffer can just be:

uint8_t chunkBuffer[maxBufferSizeForChunkedTransfer];

Yes, you lose the convenience methods of the array template class. But the buffer access requirements here are hopefully pretty simple. Like in the I2C code snippet above.

eringerli added 2 commits May 14, 2022 16:11
If the code doesn't run on an 8-bit AVR, the data is prepared into
chunks and the reads and writes are transmitted chunk by chunk instead
of byte by byte. A constant defines the chunk size.

This is a tradeoff between RAM/CPU and bus speed: on the 'bigger' Arduino
platforms the CPU is a lot faster than the SPI bus and there is plenty of
RAM available, so spending extra RAM and CPU cycles to prepare the data
and then letting the DMA do its work is the way to go.

The chunked transfers also combine the reads and writes, so the dead
time in between is removed, which is especially important for register
reads of SPI-attached chips.

Chunked transfers give an improvement of about 40% over bytewise
ones, plus an additional ~5% for small reads/writes as used in
BusIO_Register, by removing the dead time between writing and reading.
This applies to all supported platforms except 8-bit AVRs, which are
specifically #if'd out. The special case for the ESP32 is therefore
removed, and ARM M0/M4, ESP8266, Teensy etc. should profit from this
improvement too, without special-casing each platform by using
non-standard Arduino core extensions.
…r AVR

The AVR Arduino platform doesn't have std::array. To use the chunked
transfer mode, I added a template to replace it. It has roughly the same
semantics as std::array<> and is protected inside Adafruit_SPIDevice.

With this template, the chunked transfer is also possible for 8bit AVRs.
@eringerli (Contributor, Author) commented:

I added another commit to this PR to replace std::array<> with a custom template. The template is protected in Adafruit_SPIDevice, so it doesn't pollute the namespace.

@caternuson (Contributor) commented:

Just to be sure - it's relying on the use of the array template class that's preventing this from working on AVR? Are there potentially other architectures that also do not implement array? Is it 100% necessary to use array?

@eringerli (Contributor, Author) commented:

I'm refactoring the code now anyway; I found some more optimisation potential. For example, using memcpy() is about 3x as fast as going through the array byte by byte, but it requires some math to figure out what to copy. I think that instead of passing around raw pointers, a template that bundles them into a nice form is easier to understand. It also allows for a typedef with using, so methods can take a reference to the buffer and have all the information about it.

Of all the architectures tested in the CI, only AVR failed on std::array; the others use a more modern compiler with better STL support.
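
To illustrate the memcpy() point with hypothetical helpers (not the PR's code), the two ways of filling a chunk before calling transfer() would look like this; the second variant moves the inner loop into the libc routines:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Byte-by-byte fill: one loop iteration (and bounds check) per byte.
void fillChunkBytewise(uint8_t *chunk, const uint8_t *src, size_t src_len,
                       uint8_t filler, size_t fill_len) {
  for (size_t i = 0; i < src_len; i++)
    chunk[i] = src[i];
  for (size_t i = 0; i < fill_len; i++)
    chunk[src_len + i] = filler;
}

// Block fill: memcpy()/memset() copy the write data and the read filler.
void fillChunkBlockwise(uint8_t *chunk, const uint8_t *src, size_t src_len,
                        uint8_t filler, size_t fill_len) {
  memcpy(chunk, src, src_len);               // block copy of the write data
  memset(chunk + src_len, filler, fill_len); // fill the read part with sendvalue
}
```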

@eringerli eringerli marked this pull request as draft May 14, 2022 16:06
@eringerli (Contributor, Author) commented May 14, 2022

I'm setting the state of this PR to Draft until I have committed the new changes.

@eringerli eringerli force-pushed the chunked-transfer branch 6 times, most recently from 303521d to 04ab25a Compare May 14, 2022 18:48
@eringerli (Contributor, Author) commented May 14, 2022

The last commit ee48463 uses memcpy() and memset(), which has the following effect:

Before

Screenshot_20220514_210431
Screenshot_20220514_210534

After

Screenshot_20220514_210641
Screenshot_20220514_210701

It removes 3.4us per chunk, which results in a 6% performance improvement when writing/reading buffers of 80/80 bytes in write_then_read().

I think this is close to what's physically possible; I don't see any further optimisations.

@eringerli eringerli marked this pull request as ready for review May 14, 2022 19:22
@eringerli eringerli force-pushed the chunked-transfer branch 2 times, most recently from 298a80c to b486929 Compare May 14, 2022 19:53
@eringerli eringerli marked this pull request as draft May 14, 2022 19:55
@eringerli eringerli force-pushed the chunked-transfer branch 2 times, most recently from 5c988e1 to e9df8ca Compare May 15, 2022 19:58
@eringerli (Contributor, Author) commented:

I finally had the time to test the chunked transfers on an M4 and an 8-bit AVR. I added the logic analyser traces with some comments to the PR description.

@eringerli eringerli marked this pull request as ready for review May 15, 2022 20:12
@eringerli eringerli force-pushed the chunked-transfer branch 3 times, most recently from 0cc8028 to 66e17ce Compare May 15, 2022 22:28
@eringerli (Contributor, Author) commented May 15, 2022

Using a template is about managing complexity: I personally like keeping top-level code simple and moving the complexity farther down into easier-to-test (and easier-to-read) methods with the least amount of boilerplate and code duplication. That's what I have done here too: the template keeps track of the memory, and the methods can access the data and do their job of managing the chunks. I usually like to keep things simpler than this, but pointer math is almost never easy on the eyes. I hope I've at least done a reasonably good job of naming the variables to ease understanding.

I also don't like reinventing the wheel, so I use as much of the STL as possible. That code is tremendously optimised, thoroughly tested and well documented.

@caternuson (Contributor) commented:

Your arguments for templates and STL usage are all fine. But for these Arduino libraries, it's generally preferred to keep things simple and avoid using these unless really necessary.

This PR likely won't be merged if it continues to rely on template usage. Not that it does not work - you've very clearly documented that it does. Just that for this, and other Arduino libraries, we also agree with your comment here:

I usually like to keep it more simple than this, but pointer math is almost never easy on the eyes.

The I2C chunk code takes the pointer math approach and works fine. And the readability of that should be OK for Arduino library maintainers.

@eringerli (Contributor, Author) commented:

I removed the template for Array and the usage of std::min.

@eringerli (Contributor, Author) commented:

@caternuson What do you think of the last two commits? Is it good as it is now?

@caternuson (Contributor) commented:

There still appear to be two templates? printChunk and minimum?

@eringerli (Contributor, Author) commented May 24, 2022

I changed printChunk to be non-templated. But I draw the line at minimum: I surely don't have to explain why preprocessor function-like macros are a really bad idea. Why Arduino still uses them is beyond me, especially since templates have been around since C++ was invented.

@caternuson (Contributor) commented:

@eringerli (Contributor, Author) commented May 25, 2022

Because it is a macro. Macros used like this are really evil: they evaluate their arguments multiple times, and the biggest advantages of C++, strong typing and scoping (e.g. namespaces), are not used. Defines are for conditionals like platform-dependent stuff and for including headers. For everything else we have better alternatives: for macros, templated or inlined functions (since forever); for values, constexpr (since C++11) and/or enum class (also since C++11; you can even specify the underlying type, e.g. uint8_t). All these alternatives follow the rules of C++, especially about not polluting the (global) namespace.

Ever tried using stdlib's <algorithm> when there are macros with generic names like min, max, etc. defined? The headers of Espressif's ESP are even more extreme: they define all the values of binary flags up to 32 bits (!) with generic names like B2. You get a nightmare of hard-to-decipher compiler errors. After I #include Arduino headers, I usually have to clean up with #undef in my code, usually by wrapping the headers inside a header of my own. This is not only wasted time; over time it becomes an unmaintainable mess of custom includes and old, no-longer-used dependencies.

So no, I will not use Arduino's min, as it doesn't result in the same code. minimum evaluates its arguments only once, is type-safe (it only compares two values of exactly the same type), has no side effects, compiles down to a couple of assembler instructions and, most of all, is easy to read. It is also local to Adafruit_SPIDevice.cpp, so it can't be used elsewhere. I would rather use std::min, but AVR doesn't have it.
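
For reference, a local, type-safe helper with the properties described above could be as small as this sketch (the PR's exact definition may differ):

```cpp
// Internal-linkage, type-safe minimum: each argument is evaluated exactly
// once, and both must have the same type.
template <typename T> static inline T minimum(T a, T b) {
  return a < b ? a : b;
}
```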

Here is another explanation of why to avoid macros: https://luckyresistor.me/knowledge/avoid-preprocessor-macros/
