ThreadPool: Spend less time busy waiting. (2nd Attempt) #23278
NB: This is a re-submission of #21545 with #22315 applied on top. #21545 was reverted because it exposed an existing deadlock bug in the thread pool, which was fixed in #23098.
Tested with
./build.sh --config Release
on Linux/X86_64.
The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work).
Changes are twofold:
1. Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4. In practice, if no new work has been added after ~10^4 spins, it is unlikely that any new work is imminent, so the thread should sleep to preserve power. This aligns more closely with upstream EigenV3.
2. Use exponential backoff when waiting on memory. This saves a bit more power and, importantly, increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts (see the sketch below).
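For illustration, here is a minimal sketch of the spin-then-sleep pattern with exponential backoff. The function and constant names (SpinWaitForWork, kMaxSpins, kMaxBackoff) and the specific values are hypothetical, not the tuned constants in the actual ThreadPoolTempl code:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>

// Hypothetical sketch: poll a shared flag for a bounded number of spins,
// backing off exponentially between polls. Returns true if work arrived
// during the spin phase; if false, the caller should block (e.g. on a
// condition variable) instead of continuing to burn cycles.
bool SpinWaitForWork(std::atomic<bool>& work_available) {
  constexpr int kMaxSpins = 10'000;  // ~10^4 spins instead of ~10^6
  constexpr int kMaxBackoff = 64;    // cap on the backoff interval
  int backoff = 1;
  for (int spins = 0; spins < kMaxSpins; spins += backoff) {
    if (work_available.load(std::memory_order_acquire)) {
      return true;
    }
    // Exponential backoff: wait progressively longer between polls so the
    // core spends less time hammering the shared flag.
    for (int i = 0; i < backoff; ++i) {
      std::this_thread::yield();  // the real code would use a pause/spin hint
    }
    backoff = std::min(backoff * 2, kMaxBackoff);
  }
  return false;  // no work is imminent; sleep to preserve power
}
```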
Since the tuning of both the iteration counts and the backoff counts is dramatically different for hybrid and non-hybrid systems, this patch templates the affected functions and dynamically chooses based on CPUIDInfo::IsHybrid(). This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of ThreadPoolTempl.
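As a rough sketch of that dispatch pattern (names and values here are hypothetical; IsHybridCpu() stands in for the CPUIDInfo::IsHybrid() check):

```cpp
// Stand-in for the runtime CPUIDInfo::IsHybrid() query.
bool IsHybridCpu();

// The tuning constants become a compile-time property of the loop body.
template <bool kIsHybrid>
void WorkerLoopImpl() {
  constexpr int kSpinCount = kIsHybrid ? 4'000 : 10'000;  // illustrative values
  // ... spin up to kSpinCount iterations with exponential backoff, then sleep ...
}

void WorkerLoop() {
  // A single runtime branch at loop entry, rather than threading the template
  // parameter through all of ThreadPoolTempl.
  if (IsHybridCpu()) {
    WorkerLoopImpl<true>();
  } else {
    WorkerLoopImpl<false>();
  }
}
```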
Performance was measured on an Intel Meteor Lake CPU across a range of models.
Below are the results of 3 runs, with each metric reported as value-before-patch / value-after-patch (so for something like inference time, lower is better).
The net result is a 1.16x improvement in throughput and a 1.08x to 1.37x improvement in latency.