
Avoid InternalFileSystem corruption caused by simultaneous BLE operation #838

Open · wants to merge 3 commits into base: master
Conversation

todd-herbert

Unexpected BLE disconnections (such as 0x8 BLE_HCI_CONNECTION_TIMEOUT) cause sd_flash_page_erase and sd_flash_write operations to fail. This failure is reported with NRF_EVT_FLASH_OPERATION_ERROR. Currently, the InternalFileSystem library doesn't detect these events.

This PR aims to detect NRF_EVT_FLASH_OPERATION_ERROR, allowing several reattempts of a failed write / erase operation.

I'm unsure whether this fix is overly crude, and am concerned that I may be missing some finer detail of the filesystem implementation. Of note is the change from a counting semaphore to a binary semaphore. Any input here would be greatly valued.
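
Roughly, the pattern looks like this (a minimal sketch; the SoftDevice event constants and FreeRTOS calls are real, but the function and variable names are illustrative rather than the exact identifiers in flash_nrf5x.c):

  // Sketch only: record the SOC event reported by the SoftDevice and wake the waiter.
  #include "FreeRTOS.h"
  #include "semphr.h"
  #include "nrf_soc.h" // NRF_EVT_FLASH_OPERATION_SUCCESS / NRF_EVT_FLASH_OPERATION_ERROR

  static SemaphoreHandle_t _sem;             // binary: one flash operation in flight at a time
                                             // (created elsewhere with xSemaphoreCreateBinary())
  static volatile uint32_t _flash_op_event;

  // Called when the SoftDevice reports the outcome of an async flash operation
  void flash_event_handler(uint32_t event)
  {
    _flash_op_event = event;                 // keep the event itself, not just a pass/fail bool
    xSemaphoreGive(_sem);
  }

  // Block until the outcome of the pending erase/write is known
  static uint32_t wait_for_flash_op_event(void)
  {
    xSemaphoreTake(_sem, portMAX_DELAY);
    return _flash_op_event;
  }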

Collaborator

@henrygab henrygab left a comment


Wow! This was a tricky bug that you tracked down. To be clear, I don't have final say in this ... my advice here is free, and maybe you get what you pay for.

I think your solution is based on a solid read of the SDK. My comments primarily revolve around keeping the code easy to read / function naming / etc. I hope you find them useful.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c (4 review comments; outdated, resolved)
@todd-herbert
Author

@henrygab Thanks for the feedback! I'll get onto those changes tomorrow.

@henrygab henrygab added the Bug label Jan 19, 2025
@todd-herbert
Author

630668a aims to respond to feedback from @henrygab. Hopefully I'm correctly interpreting what you're envisioning.

  • the event from the callback is now stored, rather than a success / fail bool
  • the wait_for_async_flash_op_completion method gathers the result only; delays / queuing are now handled in fal_prog and fal_erase (sketched below)

Please do let me know if you feel there's any room for further improvement here, or if some new logic flaw has appeared during the refactoring.
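
For illustration, the resulting structure is roughly as follows (a sketch only; MAX_RETRY and the delay value are placeholders, not the values chosen in the PR):

  #define MAX_RETRY 3

  static bool fal_erase(uint32_t addr)
  {
    for (uint8_t attempt = 0; attempt < MAX_RETRY; attempt++)
    {
      // Queue the erase; if the SoftDevice is busy, back off briefly and re-queue
      while (sd_flash_page_erase(addr / FLASH_NRF52_PAGE_SIZE) == NRF_ERROR_BUSY)
      {
        delay(1);
      }

      // The outcome arrives asynchronously as a SOC event
      if (wait_for_async_flash_op_completion() == NRF_EVT_FLASH_OPERATION_SUCCESS)
      {
        return true;
      }
      // else: e.g. a BLE disconnect aborted the operation; try again
    }
    return false; // retries exhausted
  }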

Collaborator

@henrygab henrygab left a comment


Seems cleaner. Logic remains apparently sound. Nicely done. 🎉

One code style comment repeated a few times.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c (4 review comments; outdated, resolved)
@esev

esev commented Jan 20, 2025

One thing I've been wondering about this PR: If it fails after retrying, what should the behavior be? From #325 (comment), I've seen this can lead to asserts in LittleFS. These don't necessarily happen right away and can happen some time later when LittleFS actually accesses one of the lost blocks.

I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

I've been debating this a bit myself, as it certainly isn't nice to crash if it can be avoided. But maybe it is useful in this case, to highlight the issue sooner, closer to the root cause of the failure?

Anyway, just adding this comment here in case there is a strong opinion one way or another. I'm happy with the current logic.

@henrygab
Collaborator

... I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

Those are good questions to ask. If the corruption was guaranteed at this point, then you are right ... earlier is better, and here would prevent later-discovered inconsistencies, maybe even leave the file system in a valid state ... if never written to again.

What I'm not 100% sure of is whether a failed write is guaranteed to cause LFS corruption. If I understand correctly, LFS is generally designed to not trust that data was actually written, just because the write reported success. I have not dived into LFS internals for a while...

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

  1. Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?
  2. Are there any situations where, with a similar configuration vs. physical flash, a write that fails would NOT cause LFS corruption?

Maybe, in addition to retries, since the flash is internal, changing the LFS configuration would be a worthwhile second PR for Adafruit's folks to consider? (of course, only if it would make LFS robust to failed writes)

henrygab
henrygab previously approved these changes Jan 21, 2025
@henrygab henrygab dismissed their stale review January 21, 2025 00:48

later comments... and I'm not an official reviewer.

@todd-herbert
Author

todd-herbert commented Jan 21, 2025

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

I have to be honest, @esev has looked into this in much more depth than me. I've only really narrowed in on this one particular BLE disconnection case.

I'm not actually sure which situations could trigger the loop to hit MAX_RETRY, but maybe that's the argument in favor of asserting in this situation: to uncover any elusive edge cases which could be better handled.

Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound as though the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

@geeksville
Contributor

geeksville commented Jan 22, 2025 via email

@henrygab
Collaborator

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound as though the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

I accept that reporting a failure, in at least some cases, will cause corruption in LFS.

If the assertion compiles to nothing in release builds, then sure ... this is a fine place to assert.

If intending release builds to lock up / crash...

Concern:

  • Cache layer does: Erase(physical == 32x logical sectors @ 128 bytes each)
  • Cache layer then re-writes the entire physical page
  • One of those writes fails

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

Plus, the user experience is that the device simply hangs ... some device running tucked away in a hard-to-reach place now needs someone to go and pull the battery & power, or manually press the reset button (if exposed). Maybe ok, maybe not.

Because it's become clear that LFS was never designed to work as currently configured, I have little hope that any additional changes will improve the situation in a meaningful way. Bump the retry count to 100, double the delay every 20 attempts, and then crash if the operation fails.
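
Taken literally, that stopgap would look something like the sketch below. This is purely illustrative, with a hypothetical try_flash_op() standing in for the queued erase/write:

  // Illustrative only: the escalating-retry stopgap floated above.
  static bool flash_op_with_escalating_retry(void)
  {
    uint32_t delay_ms = 1;
    for (int attempt = 1; attempt <= 100; attempt++)
    {
      if (try_flash_op() == NRF_EVT_FLASH_OPERATION_SUCCESS) return true;
      if (attempt % 20 == 0) delay_ms *= 2; // double the delay every 20 attempts
      delay(delay_ms);
    }
    assert(false); // crash on final failure, surfacing the problem near its root cause
    return false;
  }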

A correct fix...

A correct fix would be to update the LFS configuration to indicate the 4096-byte block size. LFS supports inlined files, so small files can end up stored inline within a sector (not taking 4k per file). This would require testing, but something along the lines of changing:

from:

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE        128

to:

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE        FLASH_NRF52_PAGE_SIZE // more correct

where the current config is:

static struct lfs_config _InternalFSConfig =
{
  .context = NULL,
  .read = _internal_flash_read,
  .prog = _internal_flash_prog,
  .erase = _internal_flash_erase,
  .sync = _internal_flash_sync,
  .read_size = LFS_BLOCK_SIZE,
  .prog_size = LFS_BLOCK_SIZE,
  .block_size = LFS_BLOCK_SIZE,
  .block_count = LFS_FLASH_TOTAL_SIZE / LFS_BLOCK_SIZE,
  .lookahead = 128,
  .read_buffer = NULL,
  .prog_buffer = NULL,
  .lookahead_buffer = NULL,
  .file_buffer = NULL
};

and then updating the relevant fields along these lines:

  .read_size = 128,                // ??? flash might support reading a single byte at a time...
  .prog_size = 128,                // ??? smallest write to the page ... see the flash's datasheet, might be 512?
  .block_size = FLASH_NRF52_PAGE_SIZE,                          // this is the physical page size
  .block_count = LFS_FLASH_TOTAL_SIZE / FLASH_NRF52_PAGE_SIZE,  // e.g., seven (7)

The above changes are an ESTIMATE / STRAWMAN and entirely UNTESTED, as they are intended for discussion only.

@todd-herbert
Author

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

That makes sense to me.


Personally, I'd be inclined to keep this PR's scope targeted at this one specific BLE disconnection issue. The fix seems fairly non-controversial, and could be rolled out without much fear of causing disruption.

The additional discussion here, aimed at further improving the stability of InternalFileSystem, is certainly very positive though, and not something I'd want to discourage!

todd-herbert added a commit to todd-herbert/meshtastic-nrf52-arduino that referenced this pull request Jan 24, 2025