Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use os.sched_getaffinity instead of os.cpu_count where possible #2160

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

bryant1410
Copy link
Contributor

When automatically choosing the number of parallel workers, os.sched_getaffinity is a better choice than the currently used os.cpu_count. The former uses a process' assigned CPU count. See this Stack Overflow answer for an explanation.

I changed this codebase to first check os.sched_getaffinity and otherwise default to os.cpu_count (and then default to 1; as the latter could potentially be None). As some form of validation, this is something PyTorch uses as well.

In the (rare) case that os.sched_getaffinity isn't defined, I make it default to os.cpu_count. PyTorch's code behaves differently by using the value 0. I think using 0 doesn't make sense. Still, this shouldn't happen as I was reading that in Linux you have to assign at least one.

@savingoyal
Copy link
Collaborator

@bryant1410 thanks for the PR! Is there any before/after analysis for this change?

@bryant1410
Copy link
Contributor Author

@bryant1410 thanks for the PR! Is there any before/after analysis for this change?

What do you mean?

@romain-intel
Copy link
Contributor

AWESOME!! We should probably also change S3_WORKER_COUNT to something like this instead of using 64 all the time (that may be more debatable but we have had issues when it brings down the machine).

@savingoyal -- this change is very nice because basically, cpu_count returns the number of CPUs on the entire box and not the ones for just your container. I just tested this and the affinity one returns the correct value which is much more likely what you want.

I remember this had come up in the past but I never stopped to make the (clearly simple) change.

@npow
Copy link

npow commented Dec 5, 2024

Thanks for fixing this! Looks like in 3.13+ we can use process_cpu_count(): https://docs.python.org/3/library/os.html#os.process_cpu_count

@savingoyal
Copy link
Collaborator

@bryant1410 thanks for the PR! Is there any before/after analysis for this change?

What do you mean?

I am curious to see what is observed change in behavior after this patch

@bryant1410
Copy link
Contributor Author

@bryant1410 thanks for the PR! Is there any before/after analysis for this change?

What do you mean?

I am curious to see what is observed change in behavior after this patch

Oh. I didn't test it myself in this repo, but in other cases that I made a similar change what I observed is that it started considering cgroup or container assigned CPUs, as opposed to the system total, which is the desired behavior IMHO.

@savingoyal
Copy link
Collaborator

Yeah, at the moment, exclusive CPU ownership doesn't happen by default on Kubernetes (I am not sure if it's even an option with AWS Batch—maybe @npow knows), so os.sched_getaffinity and os.cpu_count will return the same value and should be safe to roll out for now.

@bryant1410
Copy link
Contributor Author

AWESOME!! We should probably also change S3_WORKER_COUNT to something like this instead of using 64 all the time (that may be more debatable but we have had issues when it brings down the machine).

Should I do this change?

@npow
Copy link

npow commented Jan 16, 2025

we have had issues when it brings down the machine

when does this happen? @romain-intel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants