Fix race condition (also a regression of the PR 19139) #19221
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19221      +/-   ##
==========================================
- Coverage   68.82%   68.81%   -0.02%
==========================================
  Files         420      420
  Lines       35649    35664      +15
==========================================
+ Hits        24536    24541       +5
- Misses       9692     9697       +5
- Partials     1421     1426       +5
Force-pushed from c76bbeb to b1e5ebc.
@fuweid @ivanvc @jmhbnz @serathius This PR fixes a regression caused by #19139, so let's get it merged and backported to 3.5 and probably 3.4. We need to get it included in 3.5.18.
/test pull-etcd-integration-1-cpu-arm64
Hard to review without loading a lot of context; it's not the first time we've had problems with shutdown. I think the problem is the lack of a high-level vision of the shutdown protocol for the server: what subroutines should do to follow it, and why everything works together. @ahrtr could you add a comment describing the shutdown protocol you have in mind? It should make this easier to review and be useful in the future.
Done. Please see the last commit. cc @fuweid @ivanvc @jmhbnz @serathius
…te before it returns Signed-off-by: Benjamin Wang <[email protected]>
… the errc Signed-off-by: Benjamin Wang <[email protected]>
Signed-off-by: Benjamin Wang <[email protected]>
LGTM
cc @serathius do you have any further comments? There are three commits in this PR. Note that the third commit only adds some comments. The first and second commits are straightforward; please see the description of this PR.
// after all these sub goroutines exit (checked via `wg`). Writers
// should avoid writing after `stopc` is closed by selecting on
// reading from `stopc`.
errc chan error
Using wg to drain errc is a new concept introduced in the previous PR. It looks like we previously depended on correctly predicting the needed capacity:
Line 257 in c9045d6
e.errc = make(chan error, len(e.Peers)+len(e.Clients)+2*len(e.sctxs))
Possibly the issues stem from the fact that this logic got outdated; I think we can now have as many as 4 writers to errc per sctx.
If the wg setup is correct, then we should be safe setting the capacity of the errc channel to 1.
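A rough sketch of that last claim, assuming writers select on a stop channel as described earlier in this PR: with a capacity of 1 and more writers than buffer slots, shutdown still completes because blocked writers are released by `stopc` rather than by buffer space. The function `runWriters` and its structure are hypothetical, not etcd code.

```go
package main

import (
	"fmt"
	"sync"
)

// runWriters starts n writers against an errc of capacity 1 and returns
// how many errors were delivered before shutdown completed.
func runWriters(n int) int {
	errc := make(chan error, 1) // deliberately smaller than the writer count
	stopc := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			select {
			case errc <- fmt.Errorf("writer %d failed", id):
			case <-stopc: // released here when the buffer is full
			}
		}(i)
	}

	delivered := 0
	if n > 0 {
		<-errc // react to the first error, as a caller would
		delivered++
	}

	close(stopc) // begin shutdown
	wg.Wait()    // every writer has either sent or seen stopc
	close(errc)  // safe: no writers remain
	for range errc {
		delivered++
	}
	return delivered
}

func main() {
	fmt.Println("errors delivered:", runWriters(4))
}
```

The exact count depends on scheduling, so the sketch only shows that no writer blocks forever; it does not settle what capacity etcd should use.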
I admit that the calculation of the errc capacity might not be accurate, but in practice it's already good enough. Also it's a separate topic.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahrtr, fuweid, serathius
@fuweid @serathius then let's merge this PR and backport it to 3.5. Afterwards, we can continue the refactoring in main only, in #19257, and have more discussion under that PR. Please let me know your thoughts before I merge this PR. Thanks.
#19139 was backported to v3.4, so we also need to backport there.
+1. Agree
OK, I was not aware of the partial backport or the investigation. Makes sense now. Can you provide a link to where we confirmed that v3.4 doesn't have a regression? #19172 doesn't mention it.
Fix #19172
Please review this PR commit by commit.
Two high-level thoughts,
sync.WaitGroup, we should always call wg.Add and wg.Wait in the same goroutine.
cc @serathius @fuweid @ivanvc @jmhbnz @joshuazh-x
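The wg.Add/wg.Wait guideline can be illustrated with a small, generic sketch. The function `sumSquares` is a hypothetical example, not code from this PR. Calling `wg.Add` in the goroutine that later calls `wg.Wait`, before each `go` statement, guarantees the counter is already incremented when `Wait` runs. Calling `Add` inside the spawned goroutine instead could let `Wait` return before the worker has registered.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares starts one worker per input and returns the sum of squares.
func sumSquares(nums []int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, n := range nums {
		wg.Add(1) // same goroutine as wg.Wait below, before the worker starts
		go func(n int) {
			defer wg.Done()
			mu.Lock()
			total += n * n
			mu.Unlock()
		}(n)
	}
	wg.Wait() // cannot observe a zero counter while workers are pending
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // prints 14
}
```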