Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Regression in cargo package on 1.81.0 #14955

Open
landonxjames opened this issue Dec 18, 2024 · 16 comments
Open

Performance Regression in cargo package on 1.81.0 #14955

landonxjames opened this issue Dec 18, 2024 · 16 comments
Assignees
Labels
C-bug Category: bug Command-package Performance Gotta go fast! S-triage Status: This issue is waiting on initial triage.

Comments

@landonxjames
Copy link

landonxjames commented Dec 18, 2024

Problem

The aws-sdk-rust recently updated our MSRV to 1.81.0. This caused a 3-4x slowdown in our cargo package invocation. We traced the likely culprit of this slowdown to #13960 which removes an if !opts.allow_dirty check around the check_repo_state function causing it to run on ever packaging step.

This causes a problem with the aws-sdk-rust repo. Since it is very large and has a huge history all invocations to git in that repository are slow. Since cargo package is now invoking git on every packaging step in the workspace it slows the whole process down to a crawl.

Steps

  1. Run git clone https://github.com/awslabs/aws-sdk-rust.git to checkout the aws-sdk-rust repo (this will be a bit slow, it is huge)
  2. Ensure that you are using a cargo version <1.81
  3. In the aws-sdk-rust repo run cargo package --no-verify --allow-dirty --workspace observe the approximate pace at which crates are packaged (you can also use the time command, but this fails locally for me with Too many open files (os error 24) before it completes)
  4. Upgrade your cargo version to =1.81
  5. Again run time cargo package --no-verify --allow-dirty --workspace (this goes too slowly for me to wait for it to fail, but it is very easy to observe the difference in speed)

Possible Solution(s)

Potentially when packaging multiple crates it might be possible to only invoke git once at the beginning of the run? Or potentially add a new option to disable the behavior introduced in #13960 that always generates a .cargo_vcs_info.json

Notes

No response

Version

$ cargo version --verbose
cargo 1.81.0 (2dbb1af80 2024-08-20)
release: 1.81.0
commit-hash: 2dbb1af80a2914475ba76827a312e29cedfa6b2f
commit-date: 2024-08-20
host: aarch64-apple-darwin
libgit2: 1.8.1 (sys:0.19.0 vendored)
libcurl: 8.7.1 (sys:0.4.73+curl-8.8.0 system ssl:(SecureTransport) LibreSSL/3.3.6)
ssl: OpenSSL 1.1.1w  11 Sep 2023
os: Mac OS 14.7.1 [64-bit]
@epage
Copy link
Contributor

epage commented Dec 18, 2024

Potentially when packaging multiple crates it might be possible to only invoke git once at the beginning of the run?

The dirty check specifically checks for whether any files being packaged are dirty which makes at least part of this a per-package operation. However, may parts could be skipped or moved earlier, depending on where the slow down is.

@weihanglo
Copy link
Member

#13960 is the culprit of this regression. We do git ls-files for every package unconditionally after that PR. When -Zpackage-workspace is stabilized, the issue will be more prominent.

@weihanglo
Copy link
Member

Going to do a refactor on moving this out from the package loop.

@rustbot claims

@weihanglo weihanglo self-assigned this Dec 18, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
### What does this PR try to resolve?

This helps debug <#14955>.

### How should we test and review this PR?

While `check_repo_state` is the culprit, let's add some traces for
future.

### Additional information
@weihanglo
Copy link
Member

Did some profiling. Detailed report in #14962.

To summarize, we were doing git status and comparing dirty files against each package to be published. aws-rust-sdk has 400+ members so such dirty comparison has repeated 400+ times.

Here are profiling data: trace.tar.gz

On 59b2ddd (trace-offline.json)

Before

#14962 (trace-offline-pathspec.json)

After

@weihanglo
Copy link
Member

If we skipped the entire git status check (the --allow-dirty behavior prior to #13960), the profiling data would look like this:

Image

@weihanglo
Copy link
Member

#14962 was closed because we found more bugs in cargo package VCS dirtiness check, showing in #14967. If #14962 merged those bugs would be harder to fix.

@epage
Copy link
Contributor

epage commented Dec 31, 2024

@weihanglo if we assume most repos are clean, could we ask git if the repo is clean and bypass these checks, only doing them if the repo is dirty?

@weihanglo
Copy link
Member

weihanglo commented Jan 1, 2025

if we assume most repos are clean, could we ask git if the repo is clean and bypass these checks, only doing them if the repo is dirty?

Cargo does that today already (+ untracked and ignored).

// This is a collection of any dirty or untracked files. This covers:
// - new/modified/deleted/renamed/type change (index or worktree)
// - untracked files (which are "new" worktree files)
// - ignored (in case the user has an `include` directive that
// conflicts with .gitignore).
let mut dirty_files = Vec::new();
collect_statuses(repo, &mut dirty_files)?;

As of the current HEAD f73259d, PathSource::list_files performs a dirwalk and doesn't really compare status between Git index and worktree. I don't know how to combine one dirwalk + git status --untracked --ignored into one git status call in list_files(), and whether it is worthy.

@epage
Copy link
Contributor

epage commented Jan 2, 2025

I got mixed up; I thought git2 had an API for "is repo dirty" but what I was thinking of is "repo state" which is unrelated (though feels wrong to release from anything but a Clean state)

@weihanglo
Copy link
Member

A complete fix of the regression might be Cargo offering a --no-vcs-info, which skips every checks about VCS. I can't see any other approach could really restore the same performance as before.

@landonxjames
Copy link
Author

A complete fix of the regression might be Cargo offering a --no-vcs-info, which skips every checks about VCS. I can't see any other approach could really restore the same performance as before.

We would be fine with this fix, the vcs-info isn't really useful for us, so the ability to remove it (and along with it all git invocations) would suit our purposes.

@epage
Copy link
Contributor

epage commented Jan 6, 2025

imo providing --no-vcs-info should be a last resort

  • This extends the CLI for a fairly limited case where the performance matters and we couldn't help it in other ways. The larger the CLI, the harder it is for people to find things that are relevant
  • There is a lot of interest in packages published to crates.io to have VCS info to check to see if anything has changed between VCS and publish. Improving that situation is the reason we have this performance regression (Include vcs_info even if workspace is dirty #13960).

imo before we consider alternatives, we should see what the limit is for how much we can optimize what we are already doing and see if that is within an acceptable limit (determined by the Cargo team).

If we do look for more alternatives, some additional ideas include:

  • Generate VCS info but skip dirty info with --allow-dirty (this seems a bit magical)
  • Remove dirty info from VCS info: its advisory only anyways

github-merge-queue bot pushed a commit that referenced this issue Jan 13, 2025
### What does this PR try to resolve?

This revives #14962. See benchmark chart in
<#14962 (comment)>.
#14962 was closed because we found more bugs in `cargo package`, and
#14962 could potentially make them even harder to fix. Two of them have
been fixed so this is good to ship IMO with its own good.

---

An improvement #14955.

`check_repo_state` checks the entire git repo status. This is usually
fine if you have only a few packages in a workspace.

For huge monorepos, it may hit performance issues.

For example,
on awslabs/aws-sdk-rust@2cbd34d
the workspace has roughly 434 members to publish.
`git ls-files` reported us 204379 files in this Git repository. That
means git may need to check status of all files 434 times. That would be
`204379 * 434 = 88,700,486` checks!

Moreover, the current algorithm is finding the intersection of
`PathSource::list_files` and `git status`.
It is an `O(n^2)` check.
Let's assume files are evenly distributed into each package, so roughly
470 files per package.
If we're unlucky to have some dirty files, say 100 files. We will have
to do `470 * 100 = 47,000` times of path comparisons.

Even worse, because we `git status` everything in the repo, we'll have
to it for all members,
even when those dirty files are not part of the current package in
question. So it becomes `470 * 100 * 434 = 20,398,000`!

#### Solution

Instead of comparing with the status of the entire repository, this
patch use the magic pathspec[1] to tell git only reports paths that
match a certain path prefix.

This wouldn't help the `O(n^2)` algorithm,
but at least it won't check dirty files outside the current package.
Also, we don't `git status` against entire git worktree/index anymore.

[1]:
https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec

### How should we test and review this PR?

Run this command against
awslabs/aws-sdk-rust@2cbd34d,
and see if it is getting better.

```
CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway
```

I've verified checksums of `.crate` files generated from master
(d85d761) and this commit
(3dabdcd). They are the same.

### Additional information

There are some other alternatives, like making `PathSource::list_files`
additionally reports dirty files. While we already have rooms to do it,
this approach should be the most straightforward one at this moment.

Some other approaches like

* Switch to gitoxide (I tried and it didn't as good as expected. Maybe I
did something wrong).
* A flag `--no-vcs` to skip vcs at all
* Improve the `O(n^2)` algorithm
@landonxjames
Copy link
Author

If anyone else happens to hit this issue, we did find a workaround. In our publishing step we now rm -rf .git/ before running cargo package and that has taken a full workspace publish from ~3 hours to ~15 minutes. Would still be nice to have this fixed in the cargo package command itself, but we are back to around the performance we had previously.

@weihanglo
Copy link
Member

In our publishing step we now rm -rf .git/ before running cargo package and that has taken a full workspace publish from ~3 hours to ~15 minutes.

That is indeed a quick workaround. Just mind that Cargo packages stuff based on git index. If .git is gone, it falls back to walking the directory of a package. That could lead to a different packaging behavior if there is something previously ignored or untracked by git.

@epage
Copy link
Contributor

epage commented Jan 15, 2025

Any word on what the performance is like with #14997 merged? Might be a bit more difficult to test until its in a nightly.

It would be helpful to know what the performance was pre-1.81, post-1.81, and with this change and whether the new performance is at an acceptable level.

@weihanglo
Copy link
Member

@epage
See #14955 (comment) for a glance. I believe it would be at least 3x faster by just looking the proportion of each trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: bug Command-package Performance Gotta go fast! S-triage Status: This issue is waiting on initial triage.
Projects
None yet
Development

No branches or pull requests

3 participants