Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix endpoint busy flag race condition #383

Merged
merged 1 commit into from
Apr 29, 2020

Conversation

kasjer
Copy link
Collaborator

@kasjer kasjer commented Apr 27, 2020

Busy flag was set to true after call to dcd_edpt_xfer().
In some cased it was possible that transfer finished before function ended.
In this case busy flag could be set to false before it was set to true.
Then setting it to true after dcd_edpt_xfer() made edpoint busy forever.

This change marks endpoint as busy before transfer is started to
avoid race condition.

For my case PR #380 would hide the issue but I guess other cases could result in
same behavior.

Busy flag was set to true after call to dcd_edpt_xfer().
In some cased it was possible that transfer finished before function
ended.
In this case busy flag could be set to false before it was set to
true.
Then setting it to true after dcd_edpt_xfer() made edpoint busy forever.

This change marks endpoint as busy before transfer is started to
avoid race condition.
@pigrew
Copy link
Collaborator

pigrew commented Apr 27, 2020

In the case that the dcd_edpt_xfer fails, I think this will leave busy as true.

On the other hand, I'm not sure if dcd_edpt_xfer fails in any of the drivers.

@pigrew
Copy link
Collaborator

pigrew commented Apr 27, 2020

I also don't see how busy would be set to false before the function returns?

usbd_expt_xfer should only be called from within the usb task. I don't think that busy is ever set false within an interrupt handler.

@hathach
Copy link
Owner

hathach commented Apr 28, 2020

Busy flag was set to true after call to dcd_edpt_xfer().
In some cased it was possible that transfer finished before function ended.

It is physically possible, but it won't mark the endpoint busy forever. The transfer complete will push XFER_COMPLETE event into the queue, which is processed after this call is complete. It will then mark this endpoint as available again.

@hathach
Copy link
Owner

hathach commented Apr 28, 2020

I also don't see how busy would be set to false before the function returns?

usbd_expt_xfer should only be called from within the usb task. I don't think that busy is ever set false within an interrupt handler.

oh, I just realized that using preemptive RTOS, usbd task can actually preempt and mark this as
available before dcd_edpt_xfer return then busy forever is real.

@hathach
Copy link
Owner

hathach commented Apr 28, 2020

In the case that the dcd_edpt_xfer fails, I think this will leave busy as true.

On the other hand, I'm not sure if dcd_edpt_xfer fails in any of the drivers.

Yeah, I didn't encounter any dcd that couldn't queue an transfer. However the PR does keep it the same behavior as it is currently when edpt xfer failed --> busy is not marked. Since there is no error cleanup procedure when edpt_xfer failed, making it busy will cause it busy forever.

Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great finding, I don't expect a usb controller can run that fast. Also nice usage of TU_VERIFY_HDLR, however, I found it a bit non-instinctive therefore I don't use it anymore and may remove it. Maybe it is best to just use normal ifelse. This is optional, I could change it later on as well.

if ( !dcd_edpt_xfer() ) 
{
  TU_LOG2("failed\r\n")
  TU_BREAKPOINT();
  _usbd_dev.ep_status[epnum][dir].busy = false
}

PS: Out of curiosity, could I ask which MCU you are running on.

@kasjer
Copy link
Collaborator Author

kasjer commented Apr 28, 2020

PS: Out of curiosity, could I ask which MCU you are running on.

It was discovered on STM32F407 and STM32F411, before #380 was merged (which hides the problem).
It happened on mynewt, with CDC driver.
What is most curious that I was able to reproduce it constantly but only when data IN bulk endpoint had lower number then notification interrupt one (0x81, 0x82) but when interrupt endpoint had 0x81 and IN had 0x82 it did not happen at all.

Occasionally during debug session similar situation happened caused by other IN endpoint interrupts that are not handled at all (STM cube generated USB code handles all interrupts in ISR, even if it is just clearing then) that could be applied to synopsys driver. But for this I could not find any systematic reproduction procedure.

@hathach
Copy link
Owner

hathach commented Apr 28, 2020

PS: Out of curiosity, could I ask which MCU you are running on.

It was discovered on STM32F407, before #380 was merged (which hides the problem).
It happened on mynewt, with CDC driver.
What is most curious that I was able to reproduce it constantly but only when data IN bulk endpoint had lower number then notification interrupt one (0x81, 0x82) but when interrupt endpoint had 0x81 and IN had 0x82 it did not happen at all.

Occasionally during debug session similar situation happened caused by other IN endpoint interrupts that are not handled at all (STM cube generated USB code handles all interrupts in ISR, even if it is just clearing then) that could be applied to synopsys driver. But for this I could not find any systematic reproduction procedure.

Yeah, usbnet also observes a similar (but different) issue when a bug is resolved by moving notification EP from EP1 to EP3. It is kind of funny, and hard to explain, probably some a little bit of delay in hw is enough to eliminate the racing.

PS: ah mynewt, I couldn't able to run any tests with it. I think they break newt update/upgrade several months ago and haven't fixed it yet 😓

@pigrew
Copy link
Collaborator

pigrew commented Apr 29, 2020

oh, I just realized that using preemptive RTOS, usbd task can actually preempt and mark this as
available before dcd_edpt_xfer return then busy forever is real.

I had forgotten that we allow USBD functions to be called from any task (I wrongly remembered them only being callable from the USBD task).

This patch does help, and looks right in all but the most pathological cases (such as receiving a USB reset at exactly the wrong time?). So, I'm happy with merging it.

There are definitely many other thread-safety issues. It's heavy-handed, but I wonder if a usbd mutex should be used to prevent concurrency mistakes in usbd.

@hathach
Copy link
Owner

hathach commented Apr 29, 2020

oh, I just realized that using preemptive RTOS, usbd task can actually preempt and mark this as
available before dcd_edpt_xfer return then busy forever is real.

I had forgotten that we allow USBD functions to be called from any task (I wrongly remembered them only being callable from the USBD task).

This patch does help, and looks right in all but the most pathological cases (such as receiving a USB reset at exactly the wrong time?). So, I'm happy with merging it.

There are definitely many other thread-safety issues. It's heavy-handed, but I wonder if a usbd mutex should be used to prevent concurrency mistakes in usbd.

dcd_edpt_xfer() is sort of especial one, I don't remember there is any other dcd API that can have this effect. We are good by only proccess even in the usbd task so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants