Performance Regression in cargo package on 1.81.0 #14955
Comments
The dirty check specifically checks whether any files being packaged are dirty, which makes at least part of this a per-package operation. However, many parts could be skipped or moved earlier, depending on where the slowdown is.
#13960 is the culprit of this regression. We do |
Going to do a refactor on moving this out from the package loop. @rustbot claims |
### What does this PR try to resolve?

This helps debug <#14955>.

### How should we test and review this PR?

While `check_repo_state` is the culprit, let's add some traces for the future.

### Additional information
Did some profiling. Detailed report in #14962. To summarize, we were doing

Here is the profiling data: trace.tar.gz

On 59b2ddd (trace-offline.json)

#14962 (trace-offline-pathspec.json)
If we skipped the entire git status check (the |
@weihanglo if we assume most repos are clean, could we ask git if the repo is clean and bypass these checks, only doing them if the repo is dirty? |
Cargo does that today already (+ untracked and ignored). cargo/src/cargo/ops/cargo_package/vcs.rs Lines 170 to 176 in f73259d
As of the current HEAD f73259d, |
I got mixed up; I thought |
A complete fix of the regression might be Cargo offering a |
We would be fine with this fix, the vcs-info isn't really useful for us, so the ability to remove it (and along with it all |
imo providing
imo before we consider alternatives, we should see what the limit is for how much we can optimize what we are already doing and see if that is within an acceptable limit (determined by the Cargo team). If we do look for more alternatives, some additional ideas include:
### What does this PR try to resolve?

This revives #14962. See benchmark chart in <#14962 (comment)>.

#14962 was closed because we found more bugs in `cargo package`, and #14962 could potentially make them even harder to fix. Two of them have been fixed, so IMO this is good to ship on its own merits.

---

An improvement for #14955.

`check_repo_state` checks the status of the entire git repo. This is usually fine if you have only a few packages in a workspace. For huge monorepos, it may hit performance issues.

For example, on awslabs/aws-sdk-rust@2cbd34d the workspace has roughly 434 members to publish. `git ls-files` reported 204379 files in this Git repository. That means git may need to check the status of all files 434 times. That would be `204379 * 434 = 88,700,486` checks!

Moreover, the current algorithm finds the intersection of `PathSource::list_files` and `git status`. It is an `O(n^2)` check. Let's assume files are evenly distributed across packages, so roughly 470 files per package. If we're unlucky enough to have some dirty files, say 100 files, we will have to do `470 * 100 = 47,000` path comparisons.

Even worse, because we `git status` everything in the repo, we'll have to do it for all members, even when those dirty files are not part of the current package in question. So it becomes `470 * 100 * 434 = 20,398,000`!

#### Solution

Instead of comparing with the status of the entire repository, this patch uses the magic pathspec[1] to tell git to report only paths that match a certain path prefix.

This doesn't help the `O(n^2)` algorithm, but at least it won't check dirty files outside the current package. Also, we don't `git status` against the entire git worktree/index anymore.

[1]: https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec

### How should we test and review this PR?

Run this command against awslabs/aws-sdk-rust@2cbd34d, and see if it is getting better.

```
CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway
```

I've verified checksums of `.crate` files generated from master (d85d761) and this commit (3dabdcd). They are the same.

### Additional information

There are some other alternatives, like making `PathSource::list_files` additionally report dirty files. While we already have room to do that, this approach should be the most straightforward one at this moment.

Some other approaches:

* Switch to gitoxide (I tried and it didn't work as well as expected. Maybe I did something wrong).
* A flag `--no-vcs` to skip vcs entirely
* Improve the `O(n^2)` algorithm
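To illustrate the pathspec idea from the Solution section, here is a minimal sketch that limits a status query to one package directory. It is an assumption for illustration only: it shells out to the `git` CLI with a plain directory pathspec rather than reproducing the patch itself, and the function name `scoped_status_command` and the paths used with it are hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) a `git status` invocation limited to a single
/// package directory. `repo_root` and `pkg_dir` are hypothetical inputs;
/// `pkg_dir` is interpreted relative to `repo_root`.
fn scoped_status_command(repo_root: &Path, pkg_dir: &Path) -> Command {
    let mut cmd = Command::new("git");
    cmd.current_dir(repo_root)
        .arg("status")
        // Machine-readable output, one entry per line.
        .arg("--porcelain")
        // Everything after `--` is treated as a pathspec, so git only
        // reports entries under `pkg_dir` instead of the whole worktree.
        .arg("--")
        .arg(pkg_dir);
    cmd
}
```

Calling `scoped_status_command(Path::new("/repo"), Path::new("sdk/accessanalyzer")).output()` would then run the scoped query. Dirty files in other workspace members never show up in the result, which is the effect the patch is after.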
If anyone else happens to hit this issue, we did find a workaround. In our publishing step we now |
That is indeed a quick workaround. Just mind that Cargo packages stuff based on git index. If |
Any word on what the performance is like with #14997 merged? Might be a bit more difficult to test until it's in a nightly. It would be helpful to know what the performance was pre-1.81, post-1.81, and with this change, and whether the new performance is at an acceptable level.
@epage |
### Problem

We recently updated the aws-sdk-rust MSRV to `1.81.0`. This caused a 3-4x slowdown in our `cargo package` invocation. We traced the likely culprit of this slowdown to #13960, which removes an `if !opts.allow_dirty` check around the `check_repo_state` function, causing it to run on every packaging step.

This causes a problem with the aws-sdk-rust repo. Since it is very large and has a huge history, all invocations of `git` in that repository are slow. Since `cargo package` is now invoking `git` on every packaging step in the workspace, it slows the whole process down to a crawl.

### Steps

1. `git clone https://github.com/awslabs/aws-sdk-rust.git` to check out the `aws-sdk-rust` repo (this will be a bit slow, it is huge)
2. Use a Rust version `<1.81`
3. In the `aws-sdk-rust` repo run `cargo package --no-verify --allow-dirty --workspace` and observe the approximate pace at which crates are packaged (you can also use the `time` command, but this fails locally for me with `Too many open files (os error 24)` before it completes)
4. Switch to Rust `=1.81`
5. Run `time cargo package --no-verify --allow-dirty --workspace` (this goes too slowly for me to wait for it to fail, but it is very easy to observe the difference in speed)

### Possible Solution(s)

Potentially when packaging multiple crates it might be possible to only invoke `git` once at the beginning of the run? Or potentially add a new option to disable the behavior introduced in #13960 that always generates a `.cargo_vcs_info.json`.
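The "invoke `git` once" suggestion can be sketched roughly as follows. This assumes the set of dirty paths has already been collected a single time for the whole repository (how it is collected is not shown) and that each package knows the list of files it will ship. The function name `dirty_files_in_package` and the paths in the usage note are hypothetical, and this is not how Cargo implements the check.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Returns the files of one package that also appear in the repo-wide set of
/// dirty paths. `dirty` is assumed to be computed once, before the
/// per-package loop, and reused for every workspace member.
fn dirty_files_in_package(
    package_files: &[PathBuf],
    dirty: &HashSet<PathBuf>,
) -> Vec<PathBuf> {
    package_files
        .iter()
        // A hash lookup per packaged file replaces the nested comparison of
        // every packaged file against every dirty file.
        .filter(|f| dirty.contains(f.as_path()))
        .cloned()
        .collect()
}
```

With `dirty` containing `sdk/a/src/lib.rs`, a package whose file list includes that path gets it back as its only dirty file, while a package under `sdk/c` gets an empty result. The per-package cost then scales with the number of packaged files rather than with the product of packaged and dirty files.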
### Notes

No response

### Version