-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce client circuit breaker #115
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,117 @@ | ||
# Title | ||
|
||
- RFC PR: https://github.com/tikv/rfcs/pull/115 | ||
- Tracking Issue: https://github.com/tikv/pd/issues/8678 | ||
|
||
## Summary | ||
|
||
Implement a [circuit breaker](https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker) pattern to protect the PD leader from overloading due to retry storms or similar feedback loops. | ||
|
||
## Motivation | ||
|
||
In large TiDB clusters with hundreds of tidb and tikv nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback loop scenarios. | ||
Once triggered, PD transitions into a metastable state and cannot recover autonomously, leaving the cluster in a degraded or unavailable state. | ||
This feature is especially critical for large clusters where a high number of nodes can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health. | ||
|
||
## Detailed design | ||
|
||
### Configuration | ||
|
||
The key part of the circuit breaker is its configuration. | ||
In order to limit number of system variables, we will use a single system variables per circuit breaker type to define error rate threshold. | ||
|
||
#### System variables | ||
|
||
`tidb_cb_pd_metadata_error_rate_threshold_pct`: | ||
- Scope: GLOBAL | ||
- Persists to cluster: Yes | ||
- Applies to hint SET_VAR: No | ||
- Type: Integer | ||
- Default value: 0 | ||
- Range: [0, 100] | ||
- This variable is used to set percent of errors to trip the circuit breaker.`0` means circuit breaker is disabled. | ||
|
||
#### Hardcoded configs | ||
|
||
* `error_rate_window` | ||
* Defines how long to track errors before evaluating error_rate_threshold. | ||
* Default: 10 | ||
* Unit: seconds | ||
--- | ||
* `min_qps_to_open` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need a config for it or is it necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If "had only 1 request within error_rate_window", you do not have to enable the cb? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Usually circuit breaker tracks the number of failures within error window. I found it problematic to set absolute value given that it could vary a lot depending on qps hence I chose to go with error rate instead. But in this case we do need a protection from low qps cases or time of the day hence I propose minQPS. If we want to limit number of easy configurations to one session variables, then maybe we should consider to set it in absolute number of errors within window. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lgtm |
||
* Defines the average qps over the `error_rate_window` that must be met before evaluating the error rate threshold. | ||
* Default: 10 | ||
* Unit: integer | ||
--- | ||
* `cooldown_interval` | ||
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. The actual wait time is always equally jittered in the `[value/2, value]` interval. | ||
* Default: 60 | ||
* Unit: seconds | ||
--- | ||
* `half_open_success_count` | ||
* Defines how many subsequent requests to test after cooldown period before fully close the circuit. All request in excess of this count will be errored till the circuit is fully closed pending results of the firsts `half_open_success_count` requests. | ||
* Default: 10 | ||
* Unit: integer | ||
|
||
### Circuit Breaker API | ||
|
||
#### Config | ||
|
||
All configs described above will be encapsulated in a `Settings` struct with ability to change error rate threshold dynamically: | ||
|
||
```go | ||
type Settings struct { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will this configuration be exported for the user? And what is the relationship between this and 'system variables'? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. error rate threshold will be a system variable, and everything else is hardcoded or exposed in not documented toml client section. |
||
// Name of circuit breaker for configuration and logging purposes | ||
Name string | ||
// Defines the error rate threshold to trip the circuit breaker. | ||
ErrorRateThresholdPct uint32 | ||
// Defines the average qps over the `error_rate_window` that must be met before evaluating the error rate threshold. | ||
MinQPSForOpen uint32 | ||
// Defines how long to track errors before evaluating error_rate_threshold. | ||
ErrorRateWindow time.Duration | ||
// Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. | ||
CoolDownInterval time.Duration | ||
// Defines how many subsequent requests to test after cooldown period before fully close the circuit. | ||
HalfOpenSuccessCount uint32 | ||
} | ||
|
||
// ChangeErrorRateThresholdPct changes the error rate threshold for the circuit breaker. | ||
func (s *Settings) ChangeErrorRateThresholdPct(threshold uint32) | ||
``` | ||
|
||
#### Call wrapper function | ||
|
||
The circuit breaker will be implemented as a state machine with Closed, Opened and HalfOpen states and expose the following wrapper function for calls which we want to protect: | ||
|
||
```go | ||
func (cb *CircuitBreaker[T]) Execute(run func() (T, error, boolean)) (T, error) | ||
``` | ||
|
||
There is a third boolean parameter in the function argument above, in case the provided function doesn't return an error, but the caller still wants to treat it as a failure for the circuit breaker counting purposes (e.g. empty or corrupted result). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is up to the place where we wire the circuit breaker in. Boolean param allows to configure it. It make sense to start with timeouts, but having ability to tweak it in the API let us tune it further. |
||
Some other implementation return a callback, so that a caller can provide an interpretation result later, but this pattern opens the door to miss the failure count, hence proposed implementation does not offer that. | ||
|
||
#### Errors | ||
|
||
Sometimes implementations provide different error types for open state and half open states, but in this proposal defines a single error type for simplicity. | ||
|
||
#### Integration | ||
|
||
The implementation will live in a new package `circuitbreaker` package under https://github.com/tikv/pd/tree/master/client so that it can be referred from both http and grpc path. | ||
|
||
## Drawbacks | ||
|
||
It is hard to configure the circuit breaker thresholds correctly. | ||
If the thresholds are too low, legitimate requests may be rejected. | ||
If they are too high, the circuit breaker may not trigger in time to prevent overload. | ||
The proposal addressed this concern byt allowing for dynamic adjustment of thresholds so that they can be tuned based on observed load patterns. | ||
|
||
## Alternatives | ||
|
||
While existing solutions like rate-limiting in [Issue #4480](https://github.com/tikv/pd/issues/4480) and [PR #6834](https://github.com/tikv/pd/pull/6834) provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process. | ||
|
||
There is an alternative approach of mitigating retry storm by limiting number of retries using different strategies (e.g. see https://brooker.co.za/blog/2022/02/28/retries.html), but this will not protect PD from external traffic spikes. | ||
|
||
## Unresolved questions | ||
|
||
- Do we need to apply same pattern for TiKV? | ||
- What is the good default configuration for the circuit breaker thresholds? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems
pd_metadata
is difficult to understanderror_rate
is redundant with pctHow about renaming it to
pd_client_cb_error_rate_threshold
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls.
pd_client_get_regions_cb_error_rate_threshold
?pct meant to inidicate that the value is integer representing percent and not float representing the ratio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestions about this? @rleungx @nolouch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a dedicated circuit breaker for get regions calls, do we need some variables to control other calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okJiang I believe on last meeting we discussed which variables needs to be configured in real time by customer and which one could be hardcoded or configured at startup with toml file. My perception was that we want to allow configure error rate dynamically and everything else hardcoded or configured statically.
@rleungx I believe we need a separate circuit breaker instance for different calls: e.g. circuit breaker for region metdata should be separate to tikv coprocessor. They also needs to be enabled/disabled individually hence we need a separate variables for each call.
How many different circuit breakers we need could be decided later and we can follow same naming and granularity for configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we are only developing the region metadata part this time?🤔 @Tema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to add more. Let's brainstorm which one we want and see if they would need any different default settings. I think I mentioned before that some of other APIs requiring circuit breakers could be:
heatrbeat_cb_error_rate_threshold
this prolly will be tikv.toml and not s sysvar)tikv_client_cb_error_rate_threshold
))Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it is better to manage an interface with a variable. For the circuit breaker itself, different interfaces can share the same implementation and only use a different setting. Right now we can focus on the GetRegion call. /cc @okJiang