Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nRF52832 frequently connect/disconnect occur assertion "head >= 2 && head <= lfs->cfg->block_count" #325

Closed
shaoyuan1943 opened this issue Aug 16, 2019 · 23 comments · Fixed by #397
Labels

Comments

@shaoyuan1943
Copy link

shaoyuan1943 commented Aug 16, 2019

Hi~
I'm using a nRF52832 feather to build keyboard. I found a strange bug when iPhone disconnecting and reconnecting keyboard frequently can cause error:

10:36:26.458 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find
10:36:33.668 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find

In my project, iPhone is peripheral and keyboard is cent, iPhone connect keyboard when it started. The connect/disconnect callback code like this:

void prphConnectCallback(uint16_t conn_handle) {
  // Get the reference to current connection
  BLEConnection* connection = Bluefruit.Connection(conn_handle);
  perph_conn_handle = conn_handle;

  // debug information
  char peer_name[32] = { 0 };
  connection->getPeerName(peer_name, sizeof(peer_name));

  LOG("[Prph] Connected handle: %d, name: %s\r\n", conn_handle, peer_name);
}

void prphDisconnectCallback(uint16_t conn_handle, uint8_t reason) {
  (void) conn_handle;
  (void) reason;

  Bluefruit.Advertising.stop();
  delay(10);

  //startAdv();
  Bluefruit.Advertising.start(0);
  // debug information
  LOG("[Prph] Disconnected\r\n");
}

The complete serial logs is(Level 0):
02:31:05.162 -> start...
02:31:05.435 -> in low power
02:31:05.468 -> LN-REDOX bluetooth keyboard RHS started
02:31:05.876 -> [SCAN] scanned, start connect...
02:31:06.186 -> [Cent] Connected handle: 0, name: LHS
02:31:10.046 -> [Prph] Connected handle: 1, name: iPhone
02:31:22.517 -> [Prph] Disconnected
02:31:26.375 -> [Prph] Connected handle: 1, name: iPhone (2
02:31:41.828 -> [Prph] Disconnected
02:31:47.139 -> [Prph] Connected handle: 1, name: iPhone (2
02:31:59.740 -> [Prph] Disconnected
02:32:03.654 -> [Prph] Connected handle: 1, name: iPhone (2
02:32:31.155 -> [Prph] Disconnected
02:32:35.173 -> [Prph] Connected handle: 1, name: iPhone (2
02:32:52.566 -> [Prph] Disconnected
02:32:56.388 -> [Prph] Connected handle: 1, name: iPhone (2
02:33:11.480 -> [Prph] Disconnected
02:33:14.658 -> [Prph] Connected handle: 1, name: iPhone (2
02:33:27.269 -> [Prph] Disconnected
02:33:31.146 -> [Prph] Connected handle: 1, name: iPhone (2
02:33:56.366 -> [Prph] Disconnected
02:34:03.249 -> [Prph] Connected handle: 1, name: iPhone (2
02:34:04.756 -> [Prph] Disconnected
02:34:16.321 -> [Prph] Connected handle: 1, name: iPhone (2
02:37:26.456 -> [Prph] Disconnected
02:42:45.820 -> [Prph] Connected handle: 1, name: iPhone (2
02:53:56.825 -> [Prph] Disconnected
08:45:48.984 -> [Prph] Connected handle: 1, name:
09:03:59.574 -> [Prph] Disconnected
09:33:22.403 -> [Prph] Connected handle: 1, name: iPhone (2
09:34:09.581 -> [Prph] Disconnected
10:03:07.080 -> [Prph] Connected handle: 1, name: iPhone (2
10:03:28.158 -> [Prph] Disconnected
10:23:34.210 -> [Prph] Connected handle: 1, name:
10:23:34.659 -> [Prph] Disconnected
10:23:42.526 -> [Prph] Connected handle: 1, name:
10:23:42.526 -> [Prph] Disconnected
10:23:43.323 -> [Prph] Connected handle: 1, name:
10:23:44.549 -> [Prph] Disconnected
10:23:49.709 -> [Prph] Connected handle: 1, name:
10:23:51.040 -> [Prph] Disconnected
10:23:53.630 -> [Prph] Connected handle: 1, name:
10:23:54.343 -> [Prph] Disconnected
10:27:51.664 -> [Prph] Connected handle: 1, name:
10:27:52.276 -> [Prph] Disconnected
10:27:55.238 -> [Prph] Connected handle: 1, name:
10:27:55.272 -> [Prph] Disconnected
10:27:56.631 -> [Prph] Connected handle: 1, name: iPhone (2
10:28:13.872 -> [Prph] Disconnected
10:28:26.004 -> [Prph] Connected handle: 1, name: iPhone (2
10:28:27.223 -> [Prph] Disconnected
10:28:32.137 -> [Prph] Connected handle: 1, name: iPhone (2
10:28:33.885 -> [Prph] Disconnected
10:36:19.230 -> [Prph] Connected handle: 1, name: iPhone (2
10:36:26.458 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find
10:36:33.668 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find

I want to know when this error will occur and how to solve it.

bootloader: feather_nrf52832_bootloader-0.2.11_s132_6.1.1
bsp: 0.11.1

@hathach @jeremypoulter
Plz~

@hathach
Copy link
Member

hathach commented Aug 16, 2019

Could you provide all the information as suggested by the issue template, your simplified sketch and exact steps to reproduce the issue. Nrf52 log when debug level is set to 1 and/or 2, Currently I don't have time but it may help other to troubleshoot this

@shaoyuan1943
Copy link
Author

shaoyuan1943 commented Aug 17, 2019

@hathach ok,I will provide more test information and level 1/2 logs later

@shaoyuan1943
Copy link
Author

shaoyuan1943 commented Aug 17, 2019

@hathach Hi~

bootloader: feather_nrf52832_bootloader-0.2.11_s132_6.1.1
bsp: 0.11.1

Today I got this error again, the level 2 log like this:
......
11:07:32.648 -> [BOND ] Loaded keys from file /adafruit/bond_prph/4424
11:07:32.820 -> [BLE ] BLE_GAP_EVT_CONN_SEC_UPDATE : Conn Handle = 1
11:07:32.820 -> [BOND ] Loaded CCCD from file /adafruit/bond_prph/4424 ( offset = 102, len = 50 bytes )
11:07:32.820 -> [CHR ] Notify 8 bytes
11:07:32.854 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:32.889 -> [BLE ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 1
11:07:32.889 -> [GAP ] Conn Interval= 30.000000
11:07:32.889 -> [CHR ] Notify 8 bytes
11:07:32.923 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:32.923 -> [CHR ] Notify 8 bytes
11:07:32.992 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:33.126 -> bool BLEAdvertising::stop(): 391: verify failed, error = NRF_ERROR_INVALID_STATE
11:07:33.126 -> bool BLEAdvertising::_start(uint16_t, uint16_t): 366: verify failed, error = NRF_ERROR_CONN_COUNT
11:07:33.126 -> [Prph] Disconnected
11:07:33.333 -> [BLE ] BLE_GAP_EVT_PHY_UPDATE_REQUEST : Conn Handle = 1
11:07:33.333 -> [GAP ] PHY request tx: 2 Mbps, rx: 2 Mbps
11:07:33.333 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.333 -> [CHR ] Notify 8 bytes
11:07:33.367 -> [BLE ] BLE_GATTC_EVT_CHAR_VAL_BY_UUID_READ_RSP : Conn Handle = 1
11:07:33.367 -> [Prph] Connected handle: 1, name: iPhone (2
11:07:33.433 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.433 -> [CHR ] Notify 8 bytes
11:07:33.502 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.502 -> [CHR ] Notify 8 bytes
11:07:33.570 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.570 -> [CHR ] Notify 8 bytes
11:07:33.605 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.605 -> [CHR ] Notify 8 bytes
11:07:33.673 -> [BLE ] BLE_GAP_EVT_PHY_UPDATE : Conn Handle = 1
11:07:33.673 -> [GAP ] PHY active tx: 2 Mbps, rx: 2 Mbps
11:07:33.673 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.673 -> [CHR ] Notify 8 bytes
11:07:33.774 -> [BLE ] BLE_GAP_EVT_DATA_LENGTH_UPDATE_REQUEST : Conn Handle = 1
11:07:33.774 -> [GAP ] Data Length Req is (tx, rx) octets = (251, 251), (tx, rx) time = (2120, 2120) us
11:07:33.808 -> [BLE ] BLE_GAP_EVT_DATA_LENGTH_UPDATE : Conn Handle = 1
11:07:33.808 -> [GAP ] Data Length is (tx, rx) octets = (27, 27), (tx, rx) time = (1364, 1364) us
11:07:33.843 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:33.843 -> [CHR ] Notify 8 bytes
11:07:33.980 -> [BLE ] BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST : Conn Handle = 1
11:07:33.980 -> [GAP ] ATT MTU is changed to 23
11:07:34.014 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:34.014 -> [CHR ] Notify 8 bytes
11:07:34.118 -> [BLE ] BLE_GATTS_EVT_WRITE : Conn Handle = 1
11:07:34.256 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:34.256 -> [CHR ] Notify 8 bytes
11:07:34.326 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:34.326 -> [BLE ] BLE_GATTS_EVT_WRITE : Conn Handle = 1
11:07:34.360 -> 000: 01 00 | ..
11:07:34.360 ->
11:07:34.360 -> [GATTS ] attr's cccd
11:07:34.360 -> [BOND ] Saved CCCD setting to file /adafruit/bond_prph/4424 ( offset = 102, len = 50 bytes )
11:07:34.360 -> [CHR ] Notify 8 bytes
11:07:34.430 -> [BLE ] BLE_GATTS_EVT_WRITE : Conn Handle = 1
11:07:34.430 -> 000: 01 00 | ..
11:07:34.430 ->
11:07:34.430 -> [GATTS ] attr's cccd
11:07:34.430 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:34.532 -> [BLE ] BLE_GATTS_EVT_HVN_TX_COMPLETE : Conn Handle = 1
11:07:34.567 -> [BLE ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 1
11:07:34.567 -> [GAP ] Conn Interval= 15.000000
11:07:34.567 -> [C[SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:34.636 -> OC [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:34.636 -> ] NRF_EVT_FLASH_O[SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:35.009 -> [BOND ] Saved CCCD setting to file /adafruit/bond_prph/4424 ( offset = 102, len = 50 bytes )
11:07:35.044 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:35.282 -> [BLE ] BLE_GATTS_EVT_WRITE : Conn Handle = 1
11:07:35.315 -> 000: 01 00 | ..
11:07:35.315 ->
11:07:35.315 -> [GATTS ] attr's cccd
11:07:35.692 -> [SOC ] NRF_EVT_FLASH_OPERATION_ERROR
11:07:36.308 -> [BLE ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 1
11:07:36.308 -> [GAP ] Disconnect Reason 0x08
11:07:36.308 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:36.475 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:36.509 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find
11:08:06.288 -> [BLE ] BLE_GAP_EVT_ADV_SET_TERMINATED : Conn Handle = 65535
11:17:06.163 -> [BLE ] BLE_GAP_EVT_CONNECTED : Conn Handle = 1
11:17:06.231 -> [BLE ] BLE_GAP_EVT_SEC_INFO_REQUEST : Conn Handle = 1
11:17:06.231 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find

The completed level 2 log in here: https://github.com/shaoyuan1943/nRF52_error_log/blob/master/error.txt

About 4 minutes from the beginning to the this error, and than connect nRF52832 failed before restart nRf52832.
My step:

  1. start nRF52832
  2. connect
  3. do nothing
  4. disconnect
  5. do nothing
  6. connect
  7. do nothing
  8. disconnect
    ......

My project code like this(my project doesn’t have any file operations ):

void setup() {
  ledOn( LED_BLUE );

  Serial.begin(115200);
  LOG("start...\r\n");

  for (auto &pin : row_pins) {
    pinMode(pin, OUTPUT);
    digitalWrite(pin, HIGH);
  }
  for (auto &pin : col_pins) {
    pinMode(pin, INPUT_PULLUP);
  }

  // peripheral and Central
  Bluefruit.begin(1, 1);
  //Bluefruit.clearBonds();
  Bluefruit.autoConnLed(false);

  nRF5x_lowPower.enableDCDC();
  nRF5x_lowPower.powerMode(POWER_MODE_LOW_POWER);
  LOG("in low power\r\n");

  Battery.begin();
  
  Bluefruit.setTxPower(4);
  Bluefruit.setName("LN-Redox Keyboard");  // bluetooth name

  // callbacks for Peripheral
  Bluefruit.Periph.setConnectCallback(prphConnectCallback);
  Bluefruit.Periph.setDisconnectCallback(prphDisconnectCallback);

  // callbacks for Central
  Bluefruit.Central.setConnectCallback(centConnectCallback);
  Bluefruit.Central.setDisconnectCallback(centDisconnectCallback);

  // set device information
  DeviceInfo.setManufacturer("Redox");
  DeviceInfo.setModel("Redox-Handwire1");
  DeviceInfo.setHardwareRev("0001");
  DeviceInfo.setSoftwareRev(__DATE__);
  DeviceInfo.begin();

  ClientUart.begin();

  /* Start Central Scanning
     - Enable auto scan if disconnected
     - Interval = 100 ms, window = 80 ms
     - Filter only accept bleuart service
     - Don't use active scan
     - Start(timeout) with timeout = 0 will scan forever (until connected)
  */
  Bluefruit.Scanner.setRxCallback(scanCallback);
  Bluefruit.Scanner.restartOnDisconnect(true);
  Bluefruit.Scanner.setInterval(160, 80); // in unit of 0.625 ms
  Bluefruit.Scanner.filterUuid(BLEUART_UUID_SERVICE);
  Bluefruit.Scanner.useActiveScan(false);
  Bluefruit.Scanner.start(0);                   // 0 = Don't stop scanning after n seconds
  
  Hid.begin();
  Battery.begin();

  delay(5);
  startAdv();

  readVBAT();
  LOG("LN-REDOX bluetooth keyboard RHS started\r\n");
}

void prphConnectCallback(uint16_t conn_handle) {
  // Get the reference to current connection
  BLEConnection* connection = Bluefruit.Connection(conn_handle);
  perph_conn_handle = conn_handle;

  // debug information
  char peer_name[32] = { 0 };
  connection->getPeerName(peer_name, sizeof(peer_name));

  LOG("[Prph] Connected handle: %d, name: %s\r\n", conn_handle, peer_name);
}

void prphDisconnectCallback(uint16_t conn_handle, uint8_t reason) {
  (void) conn_handle;
  (void) reason;

  Bluefruit.Advertising.stop();
  delay(10);

  //startAdv();
  Bluefruit.Advertising.start(0);
  // debug information
  LOG("[Prph] Disconnected\r\n");
}

void centConnectCallback(uint16_t conn_handle) {
  BLEConnection* conn = Bluefruit.Connection(conn_handle);
  if (conn) {
    char peer_name[32] = { 0 };
    conn->getPeerName(peer_name, sizeof(peer_name));
    LOG("[Cent] Connected handle: %d, name: %s\r\n", conn_handle, peer_name);
  }

  if (ClientUart.discover(conn_handle)) {
    // Enable TXD's notify
    ClientUart.enableTXD();
  } else {
    if (conn) {
      conn->disconnect();
      LOG("[Cent] ClientUart not discovered, try disconnect\r\n");
    }

    LOG("[Cent] ClientUart not discovered\r\n");
    //Bluefruit.Central.disconnect(conn_handle);
  }
}

void centDisconnectCallback(uint16_t conn_handle, uint8_t reason) {
  (void) conn_handle;
  (void) reason;

  LOG("[Cent] Disconnected\r\n");
}

void scanCallback(ble_gap_evt_adv_report_t* report) {
  // Check if advertising contain BleUart service
  if (Bluefruit.Scanner.checkReportForService(report, ClientUart)) {
    // Connect to device with bleuart service in advertising
    Bluefruit.Central.connect(report);
    LOG("[SCAN] scanned, start connect...\r\n");
  } else {      
    // For Softdevice v6: after received a report, scanner will be paused
    // We need to call Scanner resume() to continue scanning
    Bluefruit.Scanner.resume();
    LOG("[SCAN] scanner resume\r\n");
  }
}

void startAdv(void) {
  Bluefruit.Advertising.addFlags(BLE_GAP_ADV_FLAGS_LE_ONLY_GENERAL_DISC_MODE);
  Bluefruit.Advertising.addTxPower();
  Bluefruit.Advertising.addAppearance(BLE_APPEARANCE_HID_KEYBOARD);
  Bluefruit.Advertising.addService(Hid);

  Bluefruit.ScanResponse.addService(Battery);

  Bluefruit.Advertising.addName();
  Bluefruit.Advertising.restartOnDisconnect(true);
  // in unit of 0.625 ms
  Bluefruit.Advertising.setInterval(32, 244);  
  // number of seconds in fast mode  
  Bluefruit.Advertising.setFastTimeout(30);
  // 0 = Don't stop advertising after n seconds      
  Bluefruit.Advertising.start(0);                
}

void loop() {
// do something

  waitForEvent();
}

@shaoyuan1943
Copy link
Author

shaoyuan1943 commented Aug 20, 2019

Hi, @hathach
How is the progress of this problem? If u need more other information please let me know, thank you.

@hathach
Copy link
Member

hathach commented Aug 21, 2019

Look like it is LFS related issue #227 . I won't have time to run your sketch and reproduce this for probably several weeks or more. Please be patient, and try to investigate it. Keep posting your finding here, maybe other with similar issue can come in and help.

@shaoyuan1943
Copy link
Author

OK, thanks for your reply. I will continue to pay attention to this issue and keep posting new informations.

@henrygab
Copy link
Collaborator

henrygab commented Oct 5, 2019

Hi @shaoyuan1943 ,

The following output from your debug log is of interest:

11:07:35.692 -> [SOC ] NRF_EVT_FLASH_OPERATION_ERROR

I will provide my initial look in a series of responses. I've collapsed the details, to keep the thread somewhat readable, but simply click to expand sections for details:

The above log line indicates some flash operation (likely a write) failed. The code that outputs that line is in the Bluefruit52Lib library:

case NRF_EVT_FLASH_OPERATION_SUCCESS:
case NRF_EVT_FLASH_OPERATION_ERROR:
LOG_LV1("SOC", "NRF_EVT_FLASH_OPERATION_%s", soc_evt == NRF_EVT_FLASH_OPERATION_SUCCESS ? "SUCCESS" : "ERROR");
if ( flash_nrf5x_event_cb ) flash_nrf5x_event_cb(soc_evt);
break;

To investigate, let's look at the Nordic API for writing to flash, the function calls used by the AdaFruit InternalFileSystem library, and then evaluate the reliability of the code....

@henrygab
Copy link
Collaborator

henrygab commented Oct 5, 2019

Nordic API for writing to flash

Here are some rules for using the Nordic API, based on their own header file documentation.

Nordic provides the API that does the actual write to flash as `sd_flash_write()`

/**@brief Flash Write
*
* Commands to write a buffer to flash
*
* If the SoftDevice is enabled:
* This call initiates the flash access command, and its completion will be communicated to the
* application with exactly one of the following events:
* - @ref NRF_EVT_FLASH_OPERATION_SUCCESS - The command was successfully completed.
* - @ref NRF_EVT_FLASH_OPERATION_ERROR - The command could not be started.
*
* If the SoftDevice is not enabled no event will be generated, and this call will return @ref NRF_SUCCESS when the
* write has been completed
*
* @note
* - This call takes control over the radio and the CPU during flash erase and write to make sure that
* they will not interfere with the flash access. This means that all interrupts will be blocked
* for a predictable time (depending on the NVMC specification in the device's Product Specification
* and the command parameters).
* - The data in the p_src buffer should not be modified before the @ref NRF_EVT_FLASH_OPERATION_SUCCESS
* or the @ref NRF_EVT_FLASH_OPERATION_ERROR have been received if the SoftDevice is enabled.
* - This call will make the SoftDevice trigger a hardfault when the page is written, if it is
* protected.
*
*
* @param[in] p_dst Pointer to start of flash location to be written.
* @param[in] p_src Pointer to buffer with data to be written.
* @param[in] size Number of 32-bit words to write. Maximum size is the number of words in one
* flash page. See the device's Product Specification for details.
*
* @retval ::NRF_ERROR_INVALID_ADDR Tried to write to a non existing flash address, or p_dst or p_src was unaligned.
* @retval ::NRF_ERROR_BUSY The previous command has not yet completed.
* @retval ::NRF_ERROR_INVALID_LENGTH Size was 0, or higher than the maximum allowed size.
* @retval ::NRF_ERROR_FORBIDDEN Tried to write to an address outside the application flash area.
* @retval ::NRF_SUCCESS The command was accepted.
*/
SVCALL(SD_FLASH_WRITE, uint32_t, sd_flash_write(uint32_t * p_dst, uint32_t const * p_src, uint32_t size));

1. Exactly one write operation may be pending on the flash device at a time:

* @retval ::NRF_ERROR_BUSY The previous command has not yet completed.

2. Must handle NRF_ERROR_BUSY response by retrying the operation

OK, technically it's not required to retry, but the write would not be successful. Retry just makes sense, as this does not indicate an error writing to that area of flash.

* @retval ::NRF_ERROR_BUSY The previous command has not yet completed.

3. sd_flash_write() is asynchronous when softdevice is enabled, but synchronous otherwise

This adds some fun to the API. When the softdevice is enabled, a return value of success only means that the write was accepted / queued, not that the write succeeded. When the softdevice is disabled, then the write occurs synchronously.

* @retval ::NRF_SUCCESS The command was accepted.

4. On NRF_SUCCESS, when softdevice is enabled, the buffer cannot be modified until event NRF_EVT_FLASH_OPERATION_*

The buffer used for the write must remain unmodified until the write completes. Otherwise, the software is just asking for corrupt data.

* - The data in the p_src buffer should not be modified before the @ref NRF_EVT_FLASH_OPERATION_SUCCESS
* or the @ref NRF_EVT_FLASH_OPERATION_ERROR have been received if the SoftDevice is enabled.

@henrygab
Copy link
Collaborator

henrygab commented Oct 5, 2019

How the Adafruit InternalFileSystem library writes to flash

write function is defined to be _internal_flash_prog

See line 106. Prog in this case means 'program the flash memory', aka write.

static struct lfs_config _InternalFSConfig =
{
.context = NULL,
.read = _internal_flash_read,
.prog = _internal_flash_prog,
.erase = _internal_flash_erase,
.sync = _internal_flash_sync,
.read_size = LFS_BLOCK_SIZE,
.prog_size = LFS_BLOCK_SIZE,
.block_size = LFS_BLOCK_SIZE,
.block_count = LFS_FLASH_TOTAL_SIZE / LFS_BLOCK_SIZE,
.lookahead = 128,
.read_buffer = NULL,
.prog_buffer = NULL,
.lookahead_buffer = NULL,
.file_buffer = NULL
};

_internal_flash_prog() calls flash_nrf5x_write()

It just adds an offset so that callers "see" the space as starting at address zero, while supporting different chipsets with different start address for the flash memory.

// Program a region in a block. The block must have previously
// been erased. Negative error codes are propogated to the user.
// May return LFS_ERR_CORRUPT if the block should be considered bad.
static int _internal_flash_prog (const struct lfs_config *c, lfs_block_t block, lfs_off_t off, const void *buffer,
lfs_size_t size)
{
(void) c;
uint32_t addr = lba2addr(block) + off;
flash_nrf5x_write(addr, buffer, size);
return 0;
}

#ifdef NRF52840_XXAA
#define LFS_FLASH_ADDR 0xED000
#else
#define LFS_FLASH_ADDR 0x6D000
#endif

flash_nrf5x_write() calls flash_cache_write()

uint32_t flash_nrf5x_write (uint32_t dst, void const * src, uint32_t len)
{
// TODO prevent write SD + bootloader region
return flash_cache_write(&_cache, dst, src, len);
}

flash_cache_write() calls flash_cache_flush() to write previous cache page

A small aside. The cache layer caches all writes until the next write to a different page is requested. This is great for performance. However, it means that callers of this function cannot "see" if the write actually made it to the flash or not.

uint32_t flash_cache_write (flash_cache_t* fc, uint32_t dst, void const * src, uint32_t len)
{
uint8_t const * src8 = (uint8_t const *) src;
uint32_t remain = len;
// Program up to page boundary each loop
while ( remain )
{
uint32_t const page_addr = page_addr_of(dst);
uint32_t const offset = page_offset_of(dst);
uint32_t wr_bytes = FLASH_CACHE_SIZE - offset;
wr_bytes = min32(remain, wr_bytes);
// Page changes, flush old and update new cache
if ( page_addr != fc->cache_addr )
{
flash_cache_flush(fc);

flash_cache_flush() calls fc->erase() and fc->program()

First, it checks if the contents were actually altered, if there's a verification function. Else, it first calls erase() to ensure the page can be written, and then it calls program() to actually write the page with the (modified) data.

fc->program() is fal_program()

static flash_cache_t _cache = {
.erase = fal_erase,
.program = fal_program,
.read = fal_read,
.verify = fal_verify,
.cache_addr = FLASH_CACHE_INVALID_ADDR,
.cache_buf = _cache_buffer
};

fal_program() has much of the logic

static uint32_t fal_program (uint32_t dst, void const * src, uint32_t len)
{
// wait for async event if SD is enabled
uint8_t sd_en = 0;
(void) sd_softdevice_is_enabled(&sd_en);
uint32_t err;
// Somehow S140 v6.1.1 assert an error when writing a whole page
// https://devzone.nordicsemi.com/f/nordic-q-a/40088/sd_flash_write-cause-nrf_fault_id_sd_assert
// Workaround: write half page at a time.
#if NRF52832_XXAA
while ( NRF_ERROR_BUSY == (err = sd_flash_write((uint32_t*) dst, (uint32_t const *) src, len/4)) )
{
delay(1);
}
VERIFY_STATUS(err, 0);
if ( sd_en ) xSemaphoreTake(_sem, portMAX_DELAY);
#else
while ( NRF_ERROR_BUSY == (err = sd_flash_write((uint32_t*) dst, (uint32_t const *) src, len/8)) )
{
delay(1);
}
VERIFY_STATUS(err, 0);
if ( sd_en ) xSemaphoreTake(_sem, portMAX_DELAY);
while ( NRF_ERROR_BUSY == (err = sd_flash_write((uint32_t*) (dst+ len/2), (uint32_t const *) (src + len/2), len/8)) )
{
delay(1);
}
VERIFY_STATUS(err, 0);
if ( sd_en ) xSemaphoreTake(_sem, portMAX_DELAY);
#endif
return len;
}

@henrygab
Copy link
Collaborator

henrygab commented Oct 5, 2019

Does Adafruit InternalFileSystem library follow the Nordic rules for reliably writing?

The rules (listed above) were:

  1. Exactly one write operation may be pending on the flash device at a time:
  2. Must handle NRF_ERROR_BUSY response by retrying the operation
  3. sd_flash_write() is asynchronous when softdevice is enabled, but synchronous otherwise
  4. On NRF_SUCCESS, when softdevice is enabled, the buffer cannot be modified until event NRF_EVT_FLASH_OPERATION_*

First, it's clear that this code handles NRF_ERROR_BUSY, and retries the operation after a delay. Thus, at least rules 1 and 2 appear to be handled, which is good.

Next, let's dig a little deeper to see what happens on errors

When softdevice disabled, and writing to flash fails

When the softdevice is disabled, no event is generated and the call to sd_flash_write() occurs synchronously. Therefore, only the code in the function needs to be reviewed.

* If the SoftDevice is not enabled no event will be generated, and this call will return @ref NRF_SUCCESS when the
* write has been completed

BUG 1 -- Code does not report write failures

Here, if the call to sd_flash_write() fails with any status other than NRF_ERROR_BUSY, then the only thing that occurs is that a debug log message is generated via VERIFY_STATUS(). Critically, the parameter len is still returned to the caller, and the return value of the function is defined to represent the number of bytes written.

while ( NRF_ERROR_BUSY == (err = sd_flash_write((uint32_t*) dst, (uint32_t const *) src, len/4)) )
{
delay(1);
}
VERIFY_STATUS(err, 0);

When softdevice disabled, and writing to flash fails

When the softdevice is enabled, the call to sd_flash_write() is asynchronous. Therefore, if the call to sd_flash_write() returns NRF_SUCCESS, that only means the command was queued, not that it actually succeeded. The code must then wait for the event to be generated to know if the write succeeded or failed. This means that there are two failure paths to consider. First, when the call to sd_flash_write() fails, and second, when that succeeds, but the event generated is NRF_EVT_FLASH_OPERATION_ERROR. Both appear to have bugs.

BUG2 -- Code does not report failures from sd_flash_write()
Identical to bug 1, but with softdevice enabled. It's also unclear from the documentation if an event will be generated when the call to sd_flash_write() fails with an error. If not, then this could also cause a hard-lock when this function attempts to take the semaphore.

BUG3 -- Code does not detect or report failures from event
If the call to sd_flash_write() returned NRF_SUCCESS, then the command was queued, and the success or failure is determined by the event generated (NRF_EVT_FLASH_OPERATION_SUCCESS or NRF_EVT_FLASH_OPERATION_ERROR).

The overall event handler is adafruit_soc_task():

/**
* Handle SOC event such as FLASH operation
*/
void adafruit_soc_task(void* arg)
{
(void) arg;
while (1)
{
if ( xSemaphoreTake(Bluefruit._soc_event_sem, portMAX_DELAY) )
{
uint32_t soc_evt;
uint32_t err = ERROR_NONE;
// until no more pending events
while ( NRF_ERROR_NOT_FOUND != (err = sd_evt_get(&soc_evt)) )
{
if (ERROR_NONE == err)
{
switch (soc_evt)
{
// Flash
case NRF_EVT_FLASH_OPERATION_SUCCESS:
case NRF_EVT_FLASH_OPERATION_ERROR:
LOG_LV1("SOC", "NRF_EVT_FLASH_OPERATION_%s", soc_evt == NRF_EVT_FLASH_OPERATION_SUCCESS ? "SUCCESS" : "ERROR");
if ( flash_nrf5x_event_cb ) flash_nrf5x_event_cb(soc_evt);
break;

adafruit_soc_task() calls flash_nrf5x_event_cb() at line 711. That function is:

void flash_nrf5x_event_cb (uint32_t event)
{
// if (event != NRF_EVT_FLASH_OPERATION_SUCCESS) LOG_LV1("IFLASH", "Flash op Failed");
if ( _sem ) xSemaphoreGive(_sem);
}

Critically, neither of those functions in any way stores an indication that the write failed. This prevents fal_program() from propagating the failure, and is thus appears to be a bug.

Possible semaphore race condition?

OK, this is not as sure, as there may be other FreeRTOS restrictions that prevent this. However, it at least smells bad....

  1. fal_program() only takes the semaphore AFTER the write command has been queued.
  2. the soc event handler unconditionally calls xSemphoreGive() for either NRF_EVT_FLASH_OPERATION_SUCCESS or NRF_EVT_FLASH_OPERATION_ERROR events.
  3. a semaphore cannot be given unless it was first taken. From the FreeRTOS documentation:

An error can occur if there is no space on the queue to post a message – indicating that the
semaphore was not first obtained correctly.

Therefore, the soc event handler presumes that the semaphore is already owned. However, the semaphore is not taken until after the command is queued. Therefore, it appears that the following might be possible to occur:

  1. write is queued from fal_program()
  2. callback for soc event interrupts ...
  3. callback flash_nrf5x_event_cb() calls xSemaphoreGive(), which returns an error because there is no owner, but returns silently.
  4. callback for soc event completes ...
  5. fal_program() acquires semaphore
  6. deadlock?

In short, it appears that InternalFileSystem currently completely ignores flash errors.

Let me see if I can generate a branch with a patch file that can at least catch whether these errors are occurring, or if there are additional problems somewhere else.....

@henrygab
Copy link
Collaborator

henrygab commented Oct 5, 2019

@hathach -- I've written a fairly scathing analysis in the above few posts. That analysis suggests that the InternalFileSystem library current implementation doesn't handle any write errors … at all. I'd appreciate your review of the above analysis, as I'm hoping I missed something and am entirely wrong in my analysis....

UPDATE: After review of LFS, it appears LFS allows the implementation to return success even for failed writes. LFS re-reads information written (after sync()) and thus does not rely on flash to report an error on the write.

@shaoyuan1943
Copy link
Author

Hi @henrygab
Thanks for ur nice job!

@henrygab
Copy link
Collaborator

henrygab commented Oct 9, 2019

I've opened a new bug (#350), based on the above analysis, with a title that more accurately reflects the cause. While ARMmbed's LittleFS is robust to power failures and failed writes, it must be told when a write fails. Hopefully, having been provided this deeper analysis, @hathach will have the information needed to rework the InternalFS library to propagate errors.

@henrygab
Copy link
Collaborator

henrygab commented Oct 9, 2019

@shaoyuan1943 -- Do you have a reliably repro for this bug?

The issue is more subtle than I originally thought. But, I have additional possible thoughts, based on your log above, related to multiple tasks calling into LFS simultaneously. It's not clear to me if the InternalFileSystem code can be called safely in parallel. (See re-entrant on Wikipedia.)

If you have the ability to reliably repro this bug, please let me know. Then, I may be able to create a private branch that adds a mutex at the InternalFileSystem callbacks, to ensure that even if LFS attempts to write in parallel, that only one executes at a time.

@shaoyuan1943
Copy link
Author

@henrygab I am not sure if I can reproduce this bug, but I will continue to try it. I plan to spend this weekend trying to reproduce this issue :)

@henrygab
Copy link
Collaborator

@shaoyuan1943 -- Some questions to try to repro this myself:

  1. Can you provide a complete sketch? (what you provided does not compile for me)
  2. You report using BSP 0.11.1 ... is there a reason you are not updated to 0.14.0?
  3. The partial sketch you provided appears to suggest you are running on battery. Is that right?
  4. If you can reliably repro, and are running on battery, does it repro when not on battery?

These will help in attempts to create a repro of the issue.

( FYI, running on battery is OK .. LittleFS should still not end up with corruption .. but knowing it's on battery could help create a repro, such as by lowering the power supply voltage to "emulate" partially battery, or configuring power supply to have low max amperage, to force flash errors to be more common. )

BTW, a prototype that at least compiles now exists, where the change is to force serialization of all access to the flash API. Also, this prototype should* log to serial if this situation ever occurs.
(*) untested

See the main concept in the single file commit:
AdaFruit/hardware/nrf52/0.14.0/libraries/InternalFileSytem/src/flash/flash_cache.c

@shaoyuan1943
Copy link
Author

shaoyuan1943 commented Oct 10, 2019

Hi @henrygab

  1. I will provide complete sketch later.
  2. The BSP version is 0.11.1, I will test it on 0.14.0 later.
    3/4. I'm not sure whether to use USB or Li battery.

@henrygab
Copy link
Collaborator

@shaoyuan1943 -- a stable power connection repro would be best. If it only repro's when running on battery, that's OK, but knowing that it does not repro on stable power will be helpful.

@henrygab
Copy link
Collaborator

@shaoyuan1943 -- Have you been able to reproduce the problem? If so, have you re-compiled and tried to repro with the following single-file change (also at debug level 2)?

AdaFruit/hardware/nrf52/0.14.0/libraries/InternalFileSytem/src/flash/flash_cache.c

@shaoyuan1943
Copy link
Author

@henrygab I did not reproduce this problem, I will try your modified:)

@shaoyuan1943
Copy link
Author

Hi~ @hathach @henrygab can I close this issues now?

@hathach
Copy link
Member

hathach commented Jan 1, 2020

It is up to you, you can close it only if you couldn’t reproduce the issue

@esev
Copy link

esev commented Jan 19, 2025

Drive-by observation many years later.

11:07:35.692 -> [SOC ] NRF_EVT_FLASH_OPERATION_ERROR
11:07:36.308 -> [BLE ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 1
11:07:36.308 -> [GAP ] Disconnect Reason 0x08
11:07:36.308 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:36.475 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:36.509 -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find

These things seem to be related. Specifically a BLE timeout disconnect (Reason 0x08) during a flash operation can cause the flash operation to fail.

This causes corruption in LittleFS due to the block_size being configured smaller than the flash page size.

LittleFS is only able to recover & relocate 128 bytes (block_size) from the failing 4096/2048 physical flash page. The remaining bytes from the physical page are lost, corrupting LittleFS, and resulting in the assertion.

#838 will be able to help in this situation. But due to the above block_size configuration, I think LittleFS won't be able to recover from a non-retryable flash error (like a loss of power or a worn out flash block).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants