chunked transfers for a big improvement in performance #97
a670b04
to
fc5c987
Compare
@caternuson wanna take a look? |
Could this be done more generically, similar to how I2C does chunks? Adafruit_BusIO/Adafruit_I2CDevice.cpp Lines 176 to 183 in bb7c77a
|
No, the I2C-implementation of arduino caches the data internally, the one of SPI doesn't. SPI is all about throughput, so there is only a thin layer around the hardware, |
Right. But couldn't you work directly on the buffer passed in:

```cpp
bool read(uint8_t *buffer, size_t len, uint8_t sendvalue = 0xFF);
bool write(const uint8_t *buffer, size_t len,
           const uint8_t *prefix_buffer = nullptr, size_t prefix_len = 0);
bool write_then_read(const uint8_t *write_buffer, size_t write_len,
                     uint8_t *read_buffer, size_t read_len,
                     uint8_t sendvalue = 0xFF);
bool write_and_read(uint8_t *buffer, size_t len);
```

instead of making a new buffer:

```cpp
std::array<uint8_t, maxBufferSizeForChunkedTransfer> chunkBuffer;
```
|
It's the principle of least surprise: if I overwrite the buffer given for writing with the data read from SPI, I break existing code that expects the buffer not to change between transmits. Also, they are given as Then the optimization of combining the reads and writes comes into play: the latency for either reading or writing is about 1us, this is also between read/write in |
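The least-surprise argument above can be illustrated with a minimal host-side sketch. It assumes an in-place transfer that overwrites whatever it sends; `mockSpiInvertTransfer`, `writePreservingCaller` and the scratch size of 32 are hypothetical names and values chosen for illustration, not code from this PR.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for an in-place SPI transfer: every byte sent is
// overwritten by the byte "received" (here simply the bitwise inverse).
void mockSpiInvertTransfer(uint8_t *data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    data[i] = static_cast<uint8_t>(~data[i]);
  }
}

constexpr size_t kScratchSize = 32; // assumed scratch size for this sketch

// Sends `len` bytes from a const buffer without modifying it: each chunk is
// copied into a scratch buffer first, and only the scratch is overwritten.
// Returns the number of chunks that were transferred.
size_t writePreservingCaller(const uint8_t *buffer, size_t len) {
  uint8_t scratch[kScratchSize];
  size_t chunks = 0;
  while (len > 0) {
    const size_t n = (len < kScratchSize) ? len : kScratchSize;
    memcpy(scratch, buffer, n);
    mockSpiInvertTransfer(scratch, n);
    buffer += n;
    len -= n;
    ++chunks;
  }
  return chunks;
}
```

Because only the scratch copy is handed to the transfer, the caller's buffer can stay `const` and is identical before and after the call.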
fc5c987
to
35e6c29
Compare
This is clearly visible in "Small Combined Transfer" - "Before": to conserve the data in the write buffer (and |
It seems like this really comes down to using What prevents this from being implemented on AVR? |
AVR doesn't have |
Can you try implementing without using STL or a custom template class? The buffer can just be: uint8_t chunkBuffer[maxBufferSizeForChunkedTransfer]; Yes, you lose the convenience methods of the |
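As a rough illustration of the request above, the following sketch does a chunked read with a plain array and pointer math, without STL containers or templates. The mock transfer function and the value 64 are assumptions for the sake of a runnable host-side example; only the array declaration and the `read()` signature come from the conversation.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

constexpr size_t maxBufferSizeForChunkedTransfer = 64; // assumed value

// Hypothetical stand-in for an in-place SPI transfer: the "received" bytes
// are a running counter, so the order of the data can be checked.
static uint8_t mockNextByte = 0;
void mockSpiCounterTransfer(uint8_t *data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    data[i] = mockNextByte++;
  }
}

// Chunked read using a plain array: fill the chunk with the dummy send
// value, transfer it in place, then copy the result to the caller's buffer.
bool readChunked(uint8_t *buffer, size_t len, uint8_t sendvalue = 0xFF) {
  uint8_t chunkBuffer[maxBufferSizeForChunkedTransfer];
  uint8_t *dest = buffer;
  uint8_t *const end = buffer + len;
  while (dest < end) {
    const size_t remaining = static_cast<size_t>(end - dest);
    const size_t n =
        (remaining < sizeof(chunkBuffer)) ? remaining : sizeof(chunkBuffer);
    memset(chunkBuffer, sendvalue, n);
    mockSpiCounterTransfer(chunkBuffer, n);
    memcpy(dest, chunkBuffer, n);
    dest += n;
  }
  return true;
}
```

`sizeof(chunkBuffer)` replaces the `size()` convenience method here, which works because the element type is one byte wide.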
If the code doesn't run on an 8bit-AVR, the data is prepared by transmitting the reads and writes in chunks instead of byte for byte. A constant is used for the chunk size. This is a trade-off between RAM/CPU and bus speed: on the 'bigger' arduino platforms the CPU is a lot faster than the SPI and there is a lot of RAM available, so using more RAM/CPU cycles and then letting the DMA do its work is the way to go. The chunked transfers also combine the reads and writes, so the dead time in between is removed, which is especially important for register reads of SPI-attached chips. Chunked transfers give an improvement of about 40% over bytewise ones, plus an additional 5% in the case of small reads/writes as used in BusIO_Register, by removing the dead time between writing and reading. This applies to all supported platforms except 8bit AVRs, which are specifically #if'd out. The special case for the ESP32 is therefore removed, and ARM M0/M4, ESP8266, teensy etc. should benefit from this improvement too, without special-casing each platform by using non-standard arduino core extensions.
…r AVR The AVR arduino platform doesn't have std::array. To use the chunked transfer mode, I added a template to replace it. It has roughly the same semantics as std::array<> and is protected to Adafruit_SPIDevice. With this template, the chunked transfer is also possible for 8bit AVRs.
35e6c29
to
f08f376
Compare
I added another commit to this PR to replace |
Just to be sure - it's relying on the use of the |
I'm refactoring the code now anyway, I found some more optimisation potential. Pe using From all tested architectures in the CI, only AVR failed on |
I'm setting the state of this PR to Draft until I have committed the new changes |
303521d
to
04ab25a
Compare
The last commit ee48463 uses
Before
After
It removes 3.4us/chunk, which results in a 6% performance improvement when writing/reading buffers of 80/80 bytes in I think this is what's physically possible; I don't see any further optimizations. |
298a80c
to
b486929
Compare
5c988e1
to
e9df8ca
Compare
I finally had the time to test the chunked transfers on an M4 and an 8-bit AVR. I added the logic analyser traces with some comments to the PR description. |
0cc8028
to
66e17ce
Compare
66e17ce
to
177e504
Compare
Using a template is about managing complexity: I personally like keeping top-level code simple and moving the complexity farther down into methods that are easier to test (and read), with the least amount of boilerplate and code duplication. This is what I have done here too: the template keeps track of the memory, and the methods can access the data and do their job: managing the chunks. I usually like to keep it simpler than this, but pointer math is almost never easy on the eyes. I hope I have at least done a reasonably good job of naming the variables to ease understanding. I also don't like reinventing the wheel, so I use as much of the STL as possible. That code is tremendously optimised, thoroughly tested and well documented. |
Your arguments for templates and STL usage are all fine. But for these Arduino libraries, it's generally preferred to keep things simple and avoid using these unless really necessary. This PR likely won't be merged if it continues to rely on template usage. Not that it does not work - you've very clearly documented that it does. Just that for this, and other Arduino libraries, we also agree with your comment here:
The I2C chunk code takes the pointer math approach and works fine. And the readability of that should be OK for Arduino library maintainers. |
I removed the template for |
@caternuson What do you think of the last two commits? Is it good as it is now? |
There still appears to be two templates? |
I changed |
Because it is a macro. Macros used like this are really evil: they evaluate their arguments multiple times, and the biggest advantages of C++, strong typing and scoping (e.g. namespaces), are not used. Defines are for conditionals like platform-dependent stuff and for including headers. For everything else, we have better alternatives. For macros we have templated or inlined functions (since forever), for values we have So no, I will not use arduino's Here is another explanation, why to avoid macros: https://luckyresistor.me/knowledge/avoid-preprocessor-macros/ |
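The multiple-evaluation problem mentioned above can be shown with a small sketch. `MIN_MACRO`, `minOf` and the counter helpers are hypothetical names written for this illustration; they are not the macro or the replacement discussed in this PR.

```cpp
// Hypothetical macro in the style the comment objects to: each argument is
// pasted textually, so an argument with side effects may run twice.
#define MIN_MACRO(a, b) ((a) < (b) ? (a) : (b))

// Typed alternative: both arguments are evaluated exactly once.
template <typename T>
constexpr T minOf(T a, T b) {
  return (a < b) ? a : b;
}

// Helper with a visible side effect: counts how often it is called.
int callCount = 0;
int nextValue() { return ++callCount; }

// Returns how many times nextValue() ran when passed through the macro.
int callsViaMacro() {
  callCount = 0;
  const int result = MIN_MACRO(nextValue(), 100);
  (void)result;
  return callCount;
}

// Returns how many times nextValue() ran when passed through the function.
int callsViaFunction() {
  callCount = 0;
  const int result = minOf(nextValue(), 100);
  (void)result;
  return callCount;
}
```

With the macro, `nextValue()` appears once in the comparison and once more in the selected branch, so the counter advances twice; the function version calls it a single time.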
The data is prepared by transmitting the reads and writes in chunks instead of byte for byte. A constant is used for the chunk size.
This is a trade-off between RAM/CPU and bus speed: on the 'bigger' arduino platforms the CPU is a lot faster than the SPI and there is a lot of RAM available, so using more RAM/CPU cycles and then letting the DMA do its work is the way to go.
The chunked transfers also combine the reads and writes, so the dead time in between is removed, which is especially important for register reads of SPI-attached chips.
Chunked transfers give an improvement of about 40% over bytewise ones, additionally +5% in the case of small reads/writes as used in BusIO_Register by removing the dead time between writing and reading. ESP32 can transfer buffers without an inter-byte delay, M4 and AVR can't, as their arduino cores do the transmission byte for byte. However, as the inner loop is farther down the stack when using the buffer transfer of the core, this delay can be shortened.
The AVR has a chunk size of 32 bytes, all other platforms 64. The ESP32 in particular does some chunking internally and also uses 64 bytes, so this shouldn't impede the hardware with additional overhead. The AVR and M4 don't do chunking, so the size is quite arbitrary, but as the AVR has limited resources and most reads/writes are in the range of a typical register read of a sensor, 32 bytes are chosen.
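The combined transfer described above can be sketched roughly as follows. This is a host-side illustration under stated assumptions, not the PR's implementation: `mockSpiTransfer`, `transferCalls`, `kChunkSize` and `writeThenReadChunked` are hypothetical names, and the mock simply inverts each byte. Only the chunk sizes (32 on AVR, 64 elsewhere) and the idea of merging write and read into one stream come from the description.

```cpp
#include <stddef.h>
#include <stdint.h>

// Chunk size per platform as stated in the description.
#if defined(__AVR__)
constexpr size_t kChunkSize = 32;
#else
constexpr size_t kChunkSize = 64;
#endif

// Hypothetical stand-in for a buffer transfer of the arduino core: counts
// the calls and "receives" the bitwise inverse of every byte sent.
size_t transferCalls = 0;
void mockSpiTransfer(uint8_t *data, size_t len) {
  ++transferCalls;
  for (size_t i = 0; i < len; ++i) {
    data[i] = static_cast<uint8_t>(~data[i]);
  }
}

// Treats the write bytes followed by the dummy bytes for reading as one
// stream and sends it chunk by chunk, so there is no gap between the write
// and the read phase. The caller's write buffer is never modified.
bool writeThenReadChunked(const uint8_t *writeBuf, size_t writeLen,
                          uint8_t *readBuf, size_t readLen,
                          uint8_t sendvalue = 0xFF) {
  uint8_t chunk[kChunkSize];
  const size_t total = writeLen + readLen;
  size_t pos = 0; // position in the combined stream
  while (pos < total) {
    const size_t left = total - pos;
    const size_t n = (left < kChunkSize) ? left : kChunkSize;
    for (size_t i = 0; i < n; ++i) {
      const size_t idx = pos + i;
      chunk[i] = (idx < writeLen) ? writeBuf[idx] : sendvalue;
    }
    mockSpiTransfer(chunk, n);
    for (size_t i = 0; i < n; ++i) {
      const size_t idx = pos + i;
      if (idx >= writeLen) {
        readBuf[idx - writeLen] = chunk[i];
      }
    }
    pos += n;
  }
  return true;
}
```

For a typical register read of 2 bytes written and 12 bytes read, the whole exchange fits into a single chunk, so the core is called once instead of once per byte.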
Logic Analyser Traces
The signal "DIO 11" is used as the actually controlled CS of `Adafruit_SPIDevice`, "DIO 12" is set/cleared right before/after calling the member. I used "DIO 12" to measure the performance, so that the overhead is included.
ESP32
Large Combined Transfer
Here is a transfer of `write_and_read()` with 70 bytes to write and 70 to read. Without this PR it takes 216us to transmit, with it only 135us. This is an improvement of 37%. As the ESP32 has a special case in the code, the writing half looks solid black, with a small interruption caused by the internal chunking.
Before
After
Small Combined Transfer
Here is a transfer of `write_and_read()` with 2 bytes to write and 12 to read. Without this PR it takes 41us to transmit, with it only 26us. This is an improvement of 35%. The effect of a bytewise transfer in a `for()` loop on performance is plainly visible. This is the case on all platforms except writing buffers on ESP32.
Before
After
Writing a single Large Buffer
Here is a transfer of `write()` with 70 bytes. There is a small performance degradation, as the chunking is done a layer up instead of in the arduino core.
Before
After
Writing a single Small Buffer
Here is a transfer of `write()` with 12 bytes. There is a small performance degradation, at least on an ESP32.
Before
After
Writing two Small Buffers (2+12 bytes)
Here is a transfer of `write()` with 2+12 bytes. The performance is roughly the same, at least on an ESP32.
Before
After
Writing two Small Buffers (1+1 bytes)
Here is a transfer of `write()` with 1+1 bytes. The performance is roughly the same, at least on an ESP32.
Before
After
Feather M4 Express
Infos
Here is a call to `write()` then `write_then_read()`, with a 2 byte and 9 byte buffer. This should be a typical call from `Adafruit_BusIO_Register` to read some sensor. The bus speed is 10MHz. The unchunked transfer takes ~25us, the chunked one ~20us, this is an improvement of 20%.
The inter-byte time is 1.17us without chunking and 0.5us with the data prepared in chunks. It is a direct result of having the arduino core doing the inner loop instead of calling through pointers and APIs. The SERCOM implementation has only a method to transfer single bytes, so the arduino core calls this method for each byte individually. I tested adding a method for buffer transfers; this lowers the inter-byte time to 350ns, but doesn't remove it. Removing it entirely can be done, but requires DMA.
Before
After
Feather 32u4
Infos
Here is a call to `write()` then `write_then_read()`, with a 2 byte and 9 byte buffer. This should be a typical call from `Adafruit_BusIO_Register` to read some sensor. As the AVR is only clocked at 8MHz, I reduced the bus speed to 2MHz. The unchunked transfer takes ~168us, the chunked one ~140us, this is an improvement of 17%.
The inter-byte time is 8.5us without chunking and 1.25us with it. It is a direct result of having the arduino core doing the inner loop instead of calling through pointers and APIs.
Before
After
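The improvement figures quoted with the traces follow from the measured durations. A small helper, named `improvementPercent` here purely for illustration, shows the arithmetic assumed: the time saved relative to the duration before the change. The percentages in the description appear to be rounded, so small deviations are expected.

```cpp
// Relative improvement in percent: time saved divided by the original time.
constexpr double improvementPercent(double beforeUs, double afterUs) {
  return (beforeUs - afterUs) / beforeUs * 100.0;
}

// Measured pairs from the traces (before, after) in microseconds:
//   ESP32, write_and_read() 70/70 bytes:  216 -> 135
//   ESP32, write_and_read() 2/12 bytes:    41 -> 26
//   Feather M4 Express, register read:     25 -> 20
//   Feather 32u4, register read:          168 -> 140
constexpr double kEsp32Large = improvementPercent(216.0, 135.0);
constexpr double kEsp32Small = improvementPercent(41.0, 26.0);
constexpr double kFeatherM4 = improvementPercent(25.0, 20.0);
constexpr double kFeather32u4 = improvementPercent(168.0, 140.0);
```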