
ThreadPool: Spend less time busy waiting. (2nd Attempt) #23278

Open
wants to merge 3 commits into main

Conversation

goldsteinn
Contributor

NB: This is a re-submission of #21545 with #22315 applied on top. #21545 was reverted because it exposed an existing deadlock bug in the thread pool, which was fixed in #23098.

Tested with ./build.sh --config Release on Linux/X86_64.

The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).

Changes are twofold:

  1. Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4.
     In practice, if no new work has been added after ~10^4 spins, it is
     unlikely that any new work is imminent, so sleep to preserve power.
     This aligns more closely with upstream EigenV3.

  2. Use exponential backoff when waiting on memory. This saves a bit more
     power and, importantly, increases the time between iterations in
     WorkerLoop to help accommodate the dramatically lowered spin counts
     (a minimal sketch of the backoff loop follows this list).
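
The loop below is a minimal, self-contained sketch of that backoff idea, not the actual onnxruntime implementation; `SpinPause`, `SpinWaitWithBackoff`, and the budget/cap values are illustrative only.

```cpp
// Illustrative sketch only: spin on a flag with an exponentially growing
// number of pause instructions between checks, so the longer we wait the
// less often we hammer the shared cache line.
#include <algorithm>
#include <atomic>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
inline void SpinPause() { _mm_pause(); }
#else
inline void SpinPause() {}
#endif

// Returns true if `flag` was observed set within the spin budget,
// false if the caller should fall back to an OS-level sleep.
inline bool SpinWaitWithBackoff(const std::atomic<bool>& flag,
                                uint32_t max_total_pauses /* e.g. ~10^4 */) {
  uint32_t pauses_this_round = 1;
  uint32_t total_pauses = 0;
  while (total_pauses < max_total_pauses) {
    if (flag.load(std::memory_order_acquire)) return true;
    for (uint32_t i = 0; i < pauses_this_round; ++i) SpinPause();
    total_pauses += pauses_this_round;
    // Exponential backoff: double the pause count each round, capped so a
    // single round never grows unboundedly.
    pauses_this_round = std::min<uint32_t>(pauses_this_round * 2, 1024);
  }
  return flag.load(std::memory_order_acquire);
}
```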

Since the tuning of both the iteration counts and backoff counts is dramatically different for hybrid and non-hybrid systems, this patch templates the affected functions and chooses between them dynamically based on CPUIDInfo::IsHybrid(). This seemed like the "lightest weight" way of getting the change in, although it is likely we could incur less dynamic overhead if we added the template argument to the entirety of ThreadPoolTempl.
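
Roughly, the dispatch pattern being described looks like the following sketch. The struct names and tuning constants here are hypothetical (not the values used in the patch); in the PR the hybrid check comes from CPUIDInfo::IsHybrid().

```cpp
// Hypothetical sketch of "template the affected functions, choose at runtime".
#include <cstdint>

struct HybridSpinParams {
  static constexpr uint32_t kSpinCount = 4 * 1024;   // illustrative value
  static constexpr uint32_t kBackoffCap = 64;        // illustrative value
};

struct DefaultSpinParams {
  static constexpr uint32_t kSpinCount = 16 * 1024;  // illustrative value
  static constexpr uint32_t kBackoffCap = 1024;      // illustrative value
};

template <typename Params>
void WorkerLoopImpl() {
  // Spin for at most Params::kSpinCount iterations, growing the pause count
  // exponentially up to Params::kBackoffCap, then park on the OS primitive.
}

// Only the affected functions carry the template parameter; the choice is
// made once here instead of templating all of ThreadPoolTempl.
void WorkerLoop(bool is_hybrid /* e.g. CPUIDInfo::GetCPUIDInfo().IsHybrid() */) {
  if (is_hybrid) {
    WorkerLoopImpl<HybridSpinParams>();
  } else {
    WorkerLoopImpl<DefaultSpinParams>();
  }
}
```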

Measured performance on an Intel Meteor Lake CPU across a range of models.

Below are the results of 3 runs, with each metric reported as value-before-patch / value-after-patch (so for something like inference time, lower is better).

| Metric | Before / after |
|:--|:--:|
| Session creation time cost | 0.7179 |
| First inference time cost | 0.7156 |
| Total inference time cost | 1.0146 |
| Total inference requests | 0.8874 |
| Average inference time cost | 0.8800 |
| Total inference run time | 1.0146 |
| Number of inferences per second | 0.8955 |
| Avg CPU usage | 0.9462 |
| Peak working set size | 0.9922 |
| Runs | 1.1552 |
| Min Latency | 0.7283 |
| Max Latency | 0.9258 |
| P50 Latency | 0.9534 |
| P90 Latency | 0.9639 |
| P95 Latency | 0.9659 |
| P99 Latency | 0.9640 |

So the net result is a 1.16x improvement in throughput and between a 1.08x and 1.37x improvement in latency.

goldsteinn and others added 2 commits January 7, 2025 13:43
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better
distribute power to cores doing meaningful work).

Changes are twofold:
    1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to
       ~10^4. In practice, if no new work has been added after ~10^4
       spins, it is unlikely that any new work is imminent, so sleep
       to preserve power.

    2) Use exponential backoff when waiting on memory. This saves a
       bit more power and, importantly, increases the time between
       iterations in WorkerLoop to help accommodate the dramatically
       lowered spin counts.

Since the tuning of both the iteration counts and backoff counts is
dramatically different for hybrid and non-hybrid systems, this patch
templates the affected functions and chooses between them dynamically
based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest
weight" way of getting the change in, although it is likely we could
incur less dynamic overhead if we added the template argument to the
entirety of `ThreadPoolTempl`.

Measured performance on an [Intel Raptor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html)
across a range of models.

Below are the results of 3 runs, with each metric reported as
value-before-patch / value-after-patch (so for something like
inference time, lower is better).

| Metric | Before / after |
|:--|:--:|
| Session creation time cost | 0.9151 |
| First inference time cost | 0.8564 |
| Total inference time cost | 0.9995 |
| Total inference requests | 0.9450 |
| Average inference time cost | 0.9396 |
| Total inference run time | 0.9995 |
| Number of inferences per second | 0.9449 |
| Avg CPU usage | 0.9018 |
| Peak working set size | 0.9876 |
| Runs | 1.0650 |
| Min Latency | 0.9706 |
| Max Latency | 0.8538 |
| P50 Latency | 0.9453 |
| P90 Latency | 0.9051 |
| P95 Latency | 0.8683 |
| P99 Latency | 0.8547 |