ThreadPool: Spend less time busy waiting. (2nd Attempt) #23278
NB: This is a re-submission of #21545 with #22315 applied on top. #21545 was reverted because it exposed an existing deadlock bug in the thread pool, which was fixed in #23098.
Tested with
./build.sh --config Release
on Linux/X86_64.
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).
Changes are twofold:
1. Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. In practice, if no new work has been added after ~10^4 spins, it is unlikely that any new work is imminent, so the thread should sleep to preserve power. This aligns more closely with upstream EigenV3.
2. Use exponential backoff when waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts (see the sketch below).
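For illustration, here is a minimal sketch of the spin-then-sleep pattern with exponential backoff. The function and constant names (SpinWaitForWork, kMaxSpins, kMaxBackoff) and the specific values are hypothetical, not the tuned constants in the actual ThreadPoolTempl code:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>

// Hypothetical sketch: poll a shared flag for a bounded number of spins,
// backing off exponentially between polls. Returns true if work arrived
// during the spin phase; if false, the caller should block (e.g. on a
// condition variable) instead of continuing to burn cycles.
bool SpinWaitForWork(std::atomic<bool>& work_available) {
  constexpr int kMaxSpins = 10'000;  // ~10^4 spins instead of ~10^6
  constexpr int kMaxBackoff = 64;    // cap on the backoff interval
  int backoff = 1;
  for (int spins = 0; spins < kMaxSpins; spins += backoff) {
    if (work_available.load(std::memory_order_acquire)) {
      return true;
    }
    // Exponential backoff: wait progressively longer between polls so the
    // core spends less time hammering the shared flag.
    for (int i = 0; i < backoff; ++i) {
      std::this_thread::yield();  // the real code would use a pause/spin hint
    }
    backoff = std::min(backoff * 2, kMaxBackoff);
  }
  return false;  // no work is imminent; sleep to preserve power
}
```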
Since the tuning of both the iteration counts and the backoff counts is dramatically different for hybrid and non-hybrid systems, this patch templates the affected functions and dynamically chooses based on CPUIDInfo::IsHybrid(). This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of ThreadPoolTempl.
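As a rough sketch of that dispatch pattern (names and values here are hypothetical; IsHybridCpu() stands in for the CPUIDInfo::IsHybrid() check):

```cpp
// Stand-in for the runtime CPUIDInfo::IsHybrid() query.
bool IsHybridCpu();

// The tuning constants become a compile-time property of the loop body.
template <bool kIsHybrid>
void WorkerLoopImpl() {
  constexpr int kSpinCount = kIsHybrid ? 4'000 : 10'000;  // illustrative values
  // ... spin up to kSpinCount iterations with exponential backoff, then sleep ...
}

void WorkerLoop() {
  // A single runtime branch at loop entry, rather than threading the template
  // parameter through all of ThreadPoolTempl.
  if (IsHybridCpu()) {
    WorkerLoopImpl<true>();
  } else {
    WorkerLoopImpl<false>();
  }
}
```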
Performance was measured on an Intel Meteor Lake CPU across a range of models.
Below are the results of 3 runs, with each metric reported as value-before-patch / value-after-patch (so for something like inference time, lower is better).
The net result is a 1.16x improvement in throughput and a 1.08x to 1.37x improvement in latency.