
Avoid InternalFileSystem corruption caused by simultaneous BLE operation #838

Open · wants to merge 3 commits into base: master
Conversation

todd-herbert

Unexpected BLE disconnections (such as 0x8 BLE_HCI_CONNECTION_TIMEOUT) cause sd_flash_page_erase and sd_flash_write operations to fail. This failure is reported with NRF_EVT_FLASH_OPERATION_ERROR. Currently, the InternalFileSystem library doesn't detect these events.

This PR aims to detect NRF_EVT_FLASH_OPERATION_ERROR, allowing several reattempts of a failed write / erase operation.

I'm unsure whether this fix is overly crude, and am concerned that I may be missing some finer detail of the filesystem implementation. Of note is the change from a counting semaphore to a binary semaphore. Any input here would be greatly valued.
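
Roughly, the pattern looks like this (a minimal sketch; the SoftDevice event constants and FreeRTOS calls are real, but the function and variable names are illustrative rather than the exact identifiers in flash_nrf5x.c):

  // Sketch only: record the SOC event reported by the SoftDevice and wake the waiter.
  #include "FreeRTOS.h"
  #include "semphr.h"
  #include "nrf_soc.h" // NRF_EVT_FLASH_OPERATION_SUCCESS / NRF_EVT_FLASH_OPERATION_ERROR

  static SemaphoreHandle_t _sem;             // binary: one flash operation in flight at a time
                                             // (created elsewhere with xSemaphoreCreateBinary())
  static volatile uint32_t _flash_op_event;

  // Called when the SoftDevice reports the outcome of an async flash operation
  void flash_event_handler(uint32_t event)
  {
    _flash_op_event = event;                 // keep the event itself, not just a pass/fail bool
    xSemaphoreGive(_sem);
  }

  // Block until the outcome of the pending erase/write is known
  static uint32_t wait_for_flash_op_event(void)
  {
    xSemaphoreTake(_sem, portMAX_DELAY);
    return _flash_op_event;
  }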

Collaborator

@henrygab henrygab left a comment


Wow! This was a tricky bug that you tracked down. To be clear, I don't have final say in this ... my advice here is free, and maybe you get what you pay for.

I think your solution is based on a solid read of the SDK. My comments primarily revolve around keeping the code easy to read / function naming / etc. I hope you find them useful.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c (4 review comments; outdated, resolved)
@todd-herbert
Author

@henrygab Thanks for the feedback! I'll get onto those changes tomorrow.

@henrygab henrygab added the Bug label Jan 19, 2025
@todd-herbert
Author

630668a aims to respond to feedback from @henrygab. Hopefully I'm correctly interpreting what you're envisioning.

  • the event from the callback is now stored, rather than a success / fail bool
  • the wait_for_async_flash_op_completion method gathers the result only; delays / queuing are now handled in fal_prog and fal_erase (sketched below)

Please do let me know if you feel there's any room for further improvement here, or if some new logic flaw has appeared during the refactoring.
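
For illustration, the resulting structure is roughly as follows (a sketch only; MAX_RETRY and the delay value are placeholders, not the values chosen in the PR):

  #define MAX_RETRY 3

  static bool fal_erase(uint32_t addr)
  {
    for (uint8_t attempt = 0; attempt < MAX_RETRY; attempt++)
    {
      // Queue the erase; if the SoftDevice is busy, back off briefly and re-queue
      while (sd_flash_page_erase(addr / FLASH_NRF52_PAGE_SIZE) == NRF_ERROR_BUSY)
      {
        delay(1);
      }

      // The outcome arrives asynchronously as a SOC event
      if (wait_for_async_flash_op_completion() == NRF_EVT_FLASH_OPERATION_SUCCESS)
      {
        return true;
      }
      // else: e.g. a BLE disconnect aborted the operation; try again
    }
    return false; // retries exhausted
  }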

Collaborator

@henrygab henrygab left a comment


Seems cleaner. Logic remains apparently sound. Nicely done. 🎉

One code style comment repeated a few times.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c (4 review comments; outdated, resolved)
@esev

esev commented Jan 20, 2025

One thing I've been wondering about this PR: If it fails after retrying, what should the behavior be? From #325 (comment), I've seen this can lead to asserts in LittleFS. These don't necessarily happen right away and can happen some time later when LittleFS actually accesses one of the lost blocks.

I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

I've been debating this a bit myself, as it certainly isn't nice to crash if it can be avoided. But maybe it is useful in this case, to highlight the issue sooner, closer to the root cause of the failure?

Anyway, just adding this comment here in case there is a strong opinion one way or another. I'm happy with the current logic.

@henrygab
Collaborator

... I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

Those are good questions to ask. If the corruption was guaranteed at this point, then you are right ... earlier is better, and here would prevent later-discovered inconsistencies, maybe even leave the file system in a valid state ... if never written to again.

What I'm not 100% sure of is whether a failed write is guaranteed to cause LFS corruption. If I understand correctly, LFS is generally designed to not trust that data was actually written, just because the write reported success. I have not dived into LFS internals for a while...

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

  1. Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?
  2. Are there any situations where, with a similar configuration vs. physical flash, a write that fails would NOT cause LFS corruption?

Maybe, in addition to retries, since the flash is internal, changing the LFS configuration would be a worthwhile second PR for Adafruit's folks to consider? (of course, only if it would make LFS robust to failed writes)

henrygab
henrygab previously approved these changes Jan 21, 2025
@henrygab henrygab dismissed their stale review January 21, 2025 00:48

later comments... and I'm not an official reviewer.

@todd-herbert
Author

todd-herbert commented Jan 21, 2025

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

I have to be honest, @esev has looked into this in much more depth than me. I've only really narrowed in on this one particular BLE disconnection case.

I'm not actually sure which situations could trigger the loop to hit MAX_RETRY, but maybe that's the argument in favor of asserting in this situation: to uncover any elusive edge cases which could be better handled.

Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound as though the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

@geeksville
Contributor

geeksville commented Jan 22, 2025 via email

@henrygab
Collaborator

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound as though the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

I accept that reporting a failure, in at least some cases, will cause corruption in LFS.

If the assertion compiles to nothing in release builds, then sure ... this is a fine place to assert.

If intending release builds to lock up / crash...

Concern:

  • Cache layer does: Erase(physical == 32x logical sectors @ 128 bytes each)
  • Cache layer then re-writes the entire physical page
  • One of those writes fails

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

Plus, the user experience is that the device simply hangs ... some device running tucked away in a hard-to-reach place now needs someone to go and pull the battery & power, or manually press the reset button (if exposed). Maybe ok, maybe not.

Because it's become clear that LFS was never designed to work as currently configured, I have little hope that any additional changes will improve the situation in a meaningful way. Bump the retry count to 100, double the delay every 20 attempts, and then crash if the operation fails.
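
Taken literally, that stopgap would look something like the sketch below. This is purely illustrative, with a hypothetical try_flash_op() standing in for the queued erase/write:

  // Illustrative only: the escalating-retry stopgap floated above.
  static bool flash_op_with_escalating_retry(void)
  {
    uint32_t delay_ms = 1;
    for (int attempt = 1; attempt <= 100; attempt++)
    {
      if (try_flash_op() == NRF_EVT_FLASH_OPERATION_SUCCESS) return true;
      if (attempt % 20 == 0) delay_ms *= 2; // double the delay every 20 attempts
      delay(delay_ms);
    }
    assert(false); // crash on final failure, surfacing the problem near its root cause
    return false;
  }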

A correct fix...

A correct fix would be to update the LFS configuration to indicate the 4096-byte block size. LFS supports inlined files, so small files can end up stored inline within a sector (not taking 4k per file). This would require testing, but something along the lines of changing:

from:

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE        128

to:

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE        FLASH_NRF52_PAGE_SIZE // more correct

where the current config is:

static struct lfs_config _InternalFSConfig =
{
  .context = NULL,
  .read = _internal_flash_read,
  .prog = _internal_flash_prog,
  .erase = _internal_flash_erase,
  .sync = _internal_flash_sync,
  .read_size = LFS_BLOCK_SIZE,
  .prog_size = LFS_BLOCK_SIZE,
  .block_size = LFS_BLOCK_SIZE,
  .block_count = LFS_FLASH_TOTAL_SIZE / LFS_BLOCK_SIZE,
  .lookahead = 128,
  .read_buffer = NULL,
  .prog_buffer = NULL,
  .lookahead_buffer = NULL,
  .file_buffer = NULL
};

and then updating the relevant fields along these lines:

  .read_size = 128,                // ??? flash might support reading a single byte at a time...
  .prog_size = 128,                // ??? smallest write to the page ... see the flash's datasheet, might be 512?
  .block_size = FLASH_NRF52_PAGE_SIZE,                          // this is the physical page size
  .block_count = LFS_FLASH_TOTAL_SIZE / FLASH_NRF52_PAGE_SIZE,  // e.g., seven (7)

The above changes are an ESTIMATE / STRAWMAN and entirely UNTESTED, as they are intended for discussion only.

@todd-herbert
Author

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

That makes sense to me.


Personally, I'd be inclined to keep this PR's scope targeted at this one specific BLE disconnection issue. The fix seems fairly non-controversial, and could be rolled out without much fear of causing disruption.

The additional discussion here, aimed at further improving the stability of InternalFileSystem, is certainly very positive though, and not something I'd want to discourage!

todd-herbert added a commit to todd-herbert/meshtastic-nrf52-arduino that referenced this pull request Jan 24, 2025