Avoid InternalFileSystem corruption caused by simultaneous BLE operation #838
Wow! This was a tricky bug that you tracked down. To be clear, I don't have final say in this ... my advice here is free, and maybe you get what you pay for.
I think your solution is based on solid read of the SDK. My comments primarily revolve around keeping the code easy to read / function naming / etc. I hope you find it useful.
@henrygab Thanks for the feedback! I'll get onto those changes tomorrow.
630668a aims to respond to feedback from @henrygab. Hopefully I'm correctly interpreting what you're envisioning.
Please do let me know if you feel there's any room for further improvement here, or if some new logic flaw has appeared during the refactoring.
Seems cleaner. Logic remains apparently sound. Nicely done. 🎉
One code style comment repeated a few times.
One thing I've been wondering about this PR: If it fails after retrying, what should the behavior be? From #325 (comment), I've seen this can lead to asserts in LittleFS. These don't necessarily happen right away and can happen some time later when LittleFS actually accesses one of the lost blocks. I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure. I've been debating this a bit myself, as it certainly isn't nice to crash if it can be avoided. But maybe it is useful in this case, to highlight the issue sooner, closer to the root cause of the failure? Anyway, just adding this comment here in case there is a strong opinion one way or another. I'm happy with the current logic.
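The trade-off raised here, asserting once the retries are exhausted versus quietly reporting the error, might look something like the sketch below. It is only an illustration: `finish_flash_op` and `kFlashIoError` are hypothetical names, not part of the PR or the library. The one thing it relies on is standard behaviour: `assert` compiles to nothing when `NDEBUG` is defined.

```cpp
#include <cassert>

// Hypothetical error code handed back to the filesystem layer (assumption).
constexpr int kFlashIoError = -5;

// Called once the retry loop has finished.
// `ok` is the outcome of the last attempt.
int finish_flash_op(bool ok) {
  if (ok) {
    return 0;
  }
  // Debug builds stop here, close to the root cause of the failure.
  // With NDEBUG defined the assert vanishes and the error is returned instead.
  assert(false && "flash operation failed after all retries");
  return kFlashIoError;
}
```

Whether release builds should define `NDEBUG`, and so keep running after a failed operation, is exactly the policy question being debated in this thread.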
Those are good questions to ask. If the corruption was guaranteed at this point, then you are right ... earlier is better, and here would prevent later-discovered inconsistencies, maybe even leave the file system in a valid state ... if never written to again. What I'm not 100% sure of is whether a failed write is guaranteed to cause LFS corruption. If I understand correctly, LFS is generally designed to not trust that data was actually written, just because the write reported success. I have not dived into LFS internals for a while... @todd-herbert ... questions for you, as you're the one who most recently dived deep....
Maybe, in addition to retries, since the flash is internal, changing the LFS configuration would be a worthwhile second PR for Adafruit's folks to consider? (of course, only if it would make LFS robust to failed writes)
later comments... and I'm not an official reviewer.
I have to be honest, @esev has looked into this in much more depth than me. I've only really narrowed in on this one particular BLE disconnection case. I'm not actually sure which situations could trigger the loop to hit MAX_RETRY, but maybe that's the argument in favor of asserting in this situation: to uncover any elusive edge cases which could be better handled.
> Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?
I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound that the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.
Btw I'm kinda afk for another week but I just have to say: great find Todd! Great work!
(Sent from a phone - please ignore typos)
I accept that reporting a failure, in at least some cases, will cause corruption in LFS. If the assertion compiles to nothing in release builds, then sure ... this is a fine place to assert. If intending release builds to lock up / crash...
Concern:
I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result. (+) The user experience is that the device simply hangs ... some device running tucked away in a hard-to-reach place now needs someone to go and pull the battery & power, or manually press the reset button (if exposed). Maybe ok, maybe not. Because it's become clear that LFS was never designed to work as currently configured, I have little hope that any additional changes will improve the situation in a meaningful way. Bump the retry count to
A correct fix...
A correct fix would be to update the LFS configuration to indicate the 4096-byte block size. LFS supports inline'd files, so small files can end up stored inline within a sector (not taking 4k per file). This would require testing, but something along the lines of changing:
Adafruit_nRF52_Arduino/libraries/InternalFileSytem/src/InternalFileSystem.cpp Lines 34 to 35 in 4dcfa3b
Adafruit_nRF52_Arduino/libraries/InternalFileSytem/src/InternalFileSystem.cpp Lines 100 to 119 in 4dcfa3b
The above changes are an ESTIMATE / STRAWMAN and entirely UNTESTED, as they are intended for discussion only.
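To make the block-size arithmetic concrete, here is a small host-side sketch. The 4096-byte size and the "32 LittleFS blocks per page" figure come from this thread; the 128-byte block size is inferred from those two numbers, and the 7-page region size and every name below are assumptions for illustration only, not the library's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kFlashPageSize = 4096;  // physical erase unit, per the thread

// Illustrative geometry only; FsGeometry is not a LittleFS type.
struct FsGeometry {
  uint32_t block_size;   // what the filesystem treats as one erasable block
  uint32_t block_count;  // number of such blocks in the region
};

constexpr FsGeometry make_geometry(uint32_t region_bytes, uint32_t block_size) {
  return FsGeometry{block_size, region_bytes / block_size};
}

// How many filesystem blocks share one physical erase page.
constexpr uint32_t blocks_per_page(const FsGeometry& g) {
  return kFlashPageSize / g.block_size;
}

constexpr uint32_t kRegionBytes = 7 * kFlashPageSize;  // assumed region size

// Small blocks: many filesystem blocks per erase page.
constexpr FsGeometry kSmallBlocks = make_geometry(kRegionBytes, 128);
// Strawman from the comment above: one filesystem block per erase page.
constexpr FsGeometry kPageBlocks = make_geometry(kRegionBytes, kFlashPageSize);
```

With small blocks, each physical page holds 32 filesystem blocks, so an erase or write that fails at page granularity can touch blocks the filesystem did not ask about. With page-sized blocks that mismatch goes away, at the cost of far fewer blocks, which is presumably why the inline-file behaviour is mentioned.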
That makes sense to me. Personally, I'd be inclined to leave this PR's scope targeting this one specific BLE disconnection issue. It seems like the fix is fairly non-controversial, and could be rolled out without too much fear of causing disruption. The additional discussion going on here with further aims to improve the stability of InternalFileSystem is certainly very positive and not something I'd want to discourage though!
Unexpected BLE disconnections (such as 0x8 BLE_HCI_CONNECTION_TIMEOUT) cause sd_flash_page_erase and sd_flash_write operations to fail. This failure is reported with NRF_EVT_FLASH_OPERATION_ERROR. Currently, the InternalFileSystem library doesn't detect these events.
This PR aims to detect NRF_EVT_FLASH_OPERATION_ERROR, allowing several reattempts of a failed write / erase operation.
I'm unsure whether this fix is overly crude, and am concerned that I may be missing some finer detail of the filesystem implementation. Of note is the change from a counting semaphore to a binary semaphore. Any input here would be greatly valued.
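The retry idea described above could be sketched roughly as follows. This is a host-side simulation and not the PR's code: `FlashResult`, `retry_flash_op`, `FlakyFlash`, and `kMaxRetry` (including its value) are hypothetical. On the device, the operation would be sd_flash_page_erase or sd_flash_write, and the result would arrive asynchronously as a success or NRF_EVT_FLASH_OPERATION_ERROR event while the caller waits on the semaphore.

```cpp
#include <cassert>

// Stand-ins for the success / error flash events mentioned above.
enum class FlashResult { Success, Error };

// Hypothetical retry limit; the value used in the PR is not shown in this thread.
constexpr int kMaxRetry = 5;

// Runs `op` until it reports success or the retry budget is spent.
// `op` models "start the flash operation, then wait for its result event".
// Returns true on success; `attempts` (if non-null) receives the number of tries.
template <typename Op>
bool retry_flash_op(Op op, int* attempts = nullptr) {
  int tries = 0;
  bool ok = false;
  while (tries < kMaxRetry && !ok) {
    ++tries;
    ok = (op() == FlashResult::Success);
  }
  if (attempts != nullptr) {
    *attempts = tries;
  }
  return ok;
}

// Simulated flash that fails a fixed number of times before succeeding,
// as might happen while a BLE disconnection is being handled.
struct FlakyFlash {
  int failures_left;
  FlashResult operator()() {
    if (failures_left > 0) {
      --failures_left;
      return FlashResult::Error;
    }
    return FlashResult::Success;
  }
};
```

For example, `retry_flash_op(FlakyFlash{2})` succeeds on the third attempt, while `retry_flash_op(FlakyFlash{10})` gives up after `kMaxRetry` attempts and returns false, which is the case the thread discusses asserting on.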