introduce client circuit breaker #115

Merged 2 commits on Dec 12, 2024
text/0115-circuit-breaker.md (117 additions, 0 deletions)

# Client Circuit Breaker

- RFC PR: https://github.com/tikv/rfcs/pull/115
- Tracking Issue: https://github.com/tikv/pd/issues/8678

## Summary

Implement a [circuit breaker](https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker) pattern to protect the PD leader from overloading due to retry storms or similar feedback loops.

## Motivation

In large TiDB clusters with hundreds of TiDB and TiKV nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback loop scenarios.
Once triggered, PD transitions into a metastable state and cannot recover autonomously, leaving the cluster in a degraded or unavailable state.
This feature is especially critical for large clusters where a high number of nodes can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.

## Detailed design

### Configuration

The key part of the circuit breaker is its configuration.
To limit the number of system variables, we will use a single system variable per circuit breaker type to define the error rate threshold.

#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
Member: It seems that
  1. `pd_metadata` is difficult to understand
  2. `error_rate` is redundant with `pct`

How about renaming it to `pd_client_cb_error_rate_threshold`?

Contributor Author: Do we expect to have a single circuit breaker covering all PD calls? I envisioned a dedicated circuit breaker for get regions calls: `pd_client_get_regions_cb_error_rate_threshold`?

`pct` is meant to indicate that the value is an integer representing a percent, not a float representing a ratio.

Member: If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.

Member: Do you have any suggestions about this? @rleungx @nolouch

Member: If we have a dedicated circuit breaker for get regions calls, do we need some variables to control other calls?

Contributor Author: @okJiang I believe in the last meeting we discussed which variables need to be configurable in real time by the customer and which ones could be hardcoded or configured at startup with a toml file. My perception was that we want to allow configuring the error rate dynamically, with everything else hardcoded or configured statically.

@rleungx I believe we need a separate circuit breaker instance for different calls: e.g. the circuit breaker for region metadata should be separate from the one for the tikv coprocessor. They also need to be enabled/disabled individually, hence we need a separate variable for each call.

How many different circuit breakers we need can be decided later, and we can follow the same naming and granularity for configurations.

Member: Do you mean we are only developing the region metadata part this time? 🤔 @Tema

Contributor Author: I'm open to adding more. Let's brainstorm which ones we want and see if they would need any different default settings. I think I mentioned before that some other APIs requiring circuit breakers could be:

* tikv to PD heartbeat (`heartbeat_cb_error_rate_threshold`; this will probably live in tikv.toml and not a sysvar)
* tidb to each tikv request (`tikv_client_cb_error_rate_threshold`)

Thoughts?

Member: IMO, it is better to manage an interface with a variable. For the circuit breaker itself, different interfaces can share the same implementation and only use different settings. Right now we can focus on the GetRegion call. /cc @okJiang

- Scope: GLOBAL
- Persists to cluster: Yes
- Applies to hint SET_VAR: No
- Type: Integer
- Default value: 0
- Range: [0, 100]
- This variable sets the percentage of errors that trips the circuit breaker. `0` means the circuit breaker is disabled.

#### Hardcoded configs

* `error_rate_window`
* Defines how long to track errors before evaluating error_rate_threshold.
* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Member (@rleungx, Nov 5, 2024): Do we need a config for it or is it necessary?

Contributor Author: I think so. Imagine we had only 1 request within `error_rate_window` and it fails. At the end of the window it will be evaluated as a 100% error rate and open the circuit.

Member: If we "had only 1 request within error_rate_window", do we even need to enable the cb?

Contributor Author: Usually a circuit breaker tracks the number of failures within the error window. I found it problematic to set an absolute value, given that it could vary a lot depending on qps, hence I chose to go with an error rate instead. But in this case we do need protection from low-qps cases or times of day, hence I propose minQPS. If we want to limit the number of easy configurations to one session variable, then maybe we should consider setting it as an absolute number of errors within the window.

Member: lgtm

* Defines the average qps over the `error_rate_window` that must be met before evaluating the error rate threshold (see the sketch after this list).
* Default: 10
* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after the circuit breaker opens before going to the half-open state to send a probe request. The actual wait time is always equally jittered in the `[value/2, value]` interval.
* Default: 60
* Unit: seconds
---
* `half_open_success_count`
* Defines how many subsequent requests to test after the cooldown period before fully closing the circuit. All requests in excess of this count will be rejected with an error until the circuit is fully closed, pending the results of the first `half_open_success_count` requests.
* Default: 10
* Unit: integer
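
The configuration above implies a per-window evaluation. The following minimal sketch (assumed names and counters; not the actual PD client code) illustrates how the error rate threshold and `min_qps_to_open` could be combined at the end of an `error_rate_window`:

```go
package circuitbreaker

import "time"

// shouldTrip is an illustrative sketch of the end-of-window evaluation: the
// error rate threshold is only considered once the average qps over the window
// reaches minQPSToOpen, so that a single failed request in a quiet window does
// not register as a 100% error rate and trip the breaker.
func shouldTrip(totalCount, errorCount uint32, window time.Duration,
	errorRateThresholdPct, minQPSToOpen uint32) bool {
	// A threshold of 0 means the circuit breaker is disabled.
	if errorRateThresholdPct == 0 || totalCount == 0 {
		return false
	}
	observedQPS := float64(totalCount) / window.Seconds()
	if observedQPS < float64(minQPSToOpen) {
		return false
	}
	errorRatePct := errorCount * 100 / totalCount
	return errorRatePct >= errorRateThresholdPct
}
```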

### Circuit Breaker API

#### Config

All configs described above will be encapsulated in a `Settings` struct with the ability to change the error rate threshold dynamically:

```go
type Settings struct {
	// Name of the circuit breaker for configuration and logging purposes
	Name string
	// Defines the error rate threshold to trip the circuit breaker.
	ErrorRateThresholdPct uint32
	// Defines the average qps over the `error_rate_window` that must be met before evaluating the error rate threshold.
	MinQPSForOpen uint32
	// Defines how long to track errors before evaluating error_rate_threshold.
	ErrorRateWindow time.Duration
	// Defines how long to wait after the circuit breaker opens before going to the half-open state to send a probe request.
	CoolDownInterval time.Duration
	// Defines how many subsequent requests to test after the cooldown period before fully closing the circuit.
	HalfOpenSuccessCount uint32
}

// ChangeErrorRateThresholdPct changes the error rate threshold for the circuit breaker.
func (s *Settings) ChangeErrorRateThresholdPct(threshold uint32)
```

Contributor: Will this configuration be exported for the user? And what is the relationship between this and 'system variables'?

Contributor Author: The error rate threshold will be a system variable, and everything else is hardcoded or exposed in an undocumented toml client section.
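
For illustration, the hardcoded defaults and the system variable could come together roughly as follows; the import path, the breaker name, and the way the sysvar update is propagated are assumptions, not part of this proposal:

```go
package main

import (
	"time"

	"github.com/tikv/pd/client/circuitbreaker" // assumed import path, see the Integration section
)

func main() {
	// Sketch: Settings populated with the defaults from "Hardcoded configs".
	settings := circuitbreaker.Settings{
		Name:                  "pd_metadata", // hypothetical breaker name
		ErrorRateThresholdPct: 0,             // 0 = disabled until the sysvar enables it
		MinQPSForOpen:         10,
		ErrorRateWindow:       10 * time.Second,
		CoolDownInterval:      60 * time.Second,
		HalfOpenSuccessCount:  10,
	}

	// When tidb_cb_pd_metadata_error_rate_threshold_pct is changed (e.g. to 50),
	// the new value can be pushed to the running client without a restart.
	settings.ChangeErrorRateThresholdPct(50)
}
```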

#### Call wrapper function

The circuit breaker will be implemented as a state machine with Closed, Opened, and HalfOpen states, and it will expose the following wrapper function for calls that we want to protect:

```go
func (cb *CircuitBreaker[T]) Execute(run func() (T, error, bool)) (T, error)
```

There is a third boolean value returned by the function argument above for the case where the provided function doesn't return an error but the caller still wants to treat the call as a failure for circuit breaker counting purposes (e.g. an empty or corrupted result).
Contributor: Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient.

Contributor Author: This is up to the place where we wire the circuit breaker in; the boolean param allows configuring it. It makes sense to start with timeouts, but having the ability to tweak it in the API lets us tune it further.

Some other implementations return a callback so that the caller can provide the interpretation of the result later, but that pattern opens the door to missing the failure count, hence the proposed implementation does not offer it.
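
For example, a PD call could be wrapped roughly as follows; `regionBreaker`, `Region`, and `getRegionFromPD` are hypothetical placeholders, and classifying gRPC/timeout errors as failures follows the discussion above:

```go
// Sketch only: wrapping a PD call with the circuit breaker.
// regionBreaker, Region, and getRegionFromPD are hypothetical placeholders.
var regionBreaker *circuitbreaker.CircuitBreaker[*Region]

func getRegionGuarded(ctx context.Context, key []byte) (*Region, error) {
	return regionBreaker.Execute(func() (*Region, error, bool) {
		region, err := getRegionFromPD(ctx, key) // the real PD RPC goes here
		if err != nil {
			// gRPC/timeout errors count as failures for the breaker.
			return nil, err, true
		}
		if region == nil {
			// No error was returned, but the caller still treats an empty
			// result as a failure for circuit breaker counting purposes.
			return nil, nil, true
		}
		return region, nil, false
	})
}
```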

#### Errors

Some implementations provide different error types for the open and half-open states, but this proposal defines a single error type for simplicity.
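
As a sketch, that single error could be a sentinel value that callers test with `errors.Is`; the name and message below are assumptions:

```go
import "errors"

// ErrCircuitBreakerOpen is an illustrative sketch of the single error returned
// both when the circuit is open and when requests are rejected in the half-open
// state. The name and message are assumptions, not the final API.
var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")
```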

#### Integration

The implementation will live in a new `circuitbreaker` package under https://github.com/tikv/pd/tree/master/client so that it can be referenced from both the http and grpc paths.
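
As a rough sketch of that wiring (the import path, constructor name, and breaker variable are assumptions, not a final API), both paths would share one breaker instance through the new package:

```go
// Sketch: both the gRPC and HTTP client code paths under tikv/pd/client could
// import the shared package and reuse one breaker instance per protected call.
// The import path and the NewCircuitBreaker constructor are assumptions.
import "github.com/tikv/pd/client/circuitbreaker"

var regionMetadataBreaker = circuitbreaker.NewCircuitBreaker[*Region]("pd_metadata", settings)
```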

## Drawbacks

It is hard to configure the circuit breaker thresholds correctly.
If the thresholds are too low, legitimate requests may be rejected.
If they are too high, the circuit breaker may not trigger in time to prevent overload.
The proposal addresses this concern by allowing dynamic adjustment of thresholds so that they can be tuned based on observed load patterns.

## Alternatives

While existing solutions like rate-limiting in [Issue #4480](https://github.com/tikv/pd/issues/4480) and [PR #6834](https://github.com/tikv/pd/pull/6834) provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.

There is an alternative approach of mitigating retry storms by limiting the number of retries using different strategies (e.g. see https://brooker.co.za/blog/2022/02/28/retries.html), but this would not protect PD from external traffic spikes.

## Unresolved questions

- Do we need to apply the same pattern to TiKV?
- What is a good default configuration for the circuit breaker thresholds?