Improve performance for large loads with dask #912

Closed
aulemahal opened this issue Nov 10, 2021 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed standards / conventions Suggestions on ways forward

Comments

@aulemahal
Collaborator

The transparent use of dask within xarray is really nice, but it has the side effect of creating dask tasks for every operation. In the case of xclim, with all its rolling and resample calls, the number of tasks created is ridiculously high.
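To make the task growth concrete, here is a minimal sketch that counts the keys in the dask graph before and after a rolling + resample chain. The array, its chunking, and the `n_tasks` helper are illustrative assumptions, not xclim code, and the exact counts depend on the installed xarray and dask versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic two-year daily series on a tiny grid (illustrative data only).
time = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.default_rng(0).random((730, 4, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
).chunk({"time": 73})


def n_tasks(arr):
    """Number of keys in the (unoptimized) dask graph backing `arr`."""
    return len(dict(arr.data.__dask_graph__()))


# A typical indicator-like chain: rolling mean followed by a monthly maximum.
out = da.rolling(time=5).mean().resample(time="MS").max()

print("tasks for the input array:", n_tasks(da))
print("tasks after rolling + resample:", n_tasks(out))
```

Nothing is computed here; the graph is only built, which is the stage where the scheduler overhead described above originates.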

With large datasets, the scheduler is sometimes so overloaded that it never even begins the computation. Sometimes it crashes, sometimes it only hangs. Most of the time we get several "WARNING - full garbage collections...." and other warnings.

Rechunking to larger chunks sometimes helps, but is sometimes insufficient.
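A rough way to see the effect of chunk size is to build the same lazy chain under two chunking schemes and compare graph sizes. This is only a sketch: the data, the chunk sizes, and the `count_tasks` helper are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=730, freq="D")
base = xr.DataArray(
    np.random.default_rng(0).random((730, 8, 8)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
)


def count_tasks(chunks):
    """Chunk the in-memory array, build the lazy chain, return the graph size."""
    da = base.chunk(chunks)
    out = da.rolling(time=5).mean().resample(time="MS").max()
    return len(dict(out.data.__dask_graph__()))


small = count_tasks({"time": 73, "lat": 2, "lon": 2})  # 160 input chunks
large = count_tasks({"time": -1, "lat": 4, "lon": 4})  # 4 input chunks

print("many small chunks:", small)
print("few large chunks:", large)
```

Larger chunks shrink the graph, but each task then needs more memory, which is one reason rechunking alone is not always enough.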

We had this problem in sdba and it was solved by wrapping every "unit" operation with map_blocks. This way, we combine many small operations into a single dask task. However, this solution has a lot of drawbacks. It's complicated to maintain, has many bugs with "auxiliary" coords, and is just hard to read.
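The map_blocks approach can be sketched with plain `xr.map_blocks`. This is not the sdba implementation; `unit_operation`, the data, and the chunking are assumptions. The time axis is kept in a single chunk because each block must see the full series, and a template is passed because the output time axis differs from the input one.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.default_rng(0).random((730, 4, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
).chunk({"time": -1, "lat": 2, "lon": 2})


def unit_operation(block):
    """Hypothetical 'unit' operation: rolling mean, then monthly maximum."""
    return block.rolling(time=5).mean().resample(time="MS").max()


# Lazy reference: every rolling/resample step adds its own tasks.
direct = unit_operation(da)

# The template only describes the output (dims, coords, chunks). Its chunks
# must line up with the input blocks, so time is forced into one chunk.
template = direct.chunk({"time": -1, "lat": 2, "lon": 2})

# Each spatial block now runs the whole chain inside one wrapped call.
fused = xr.map_blocks(unit_operation, da, template=template)

print("direct graph size:", len(dict(direct.data.__dask_graph__())))
print("fused graph size:", len(dict(fused.data.__dask_graph__())))
```

The need to build and keep a matching template is a small taste of the maintenance cost mentioned above; auxiliary coordinates make it worse because they must also appear, correctly chunked, in the template.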

The idea is sound, though. Could we implement something that wraps the computation into a single dask task from within the indicator's __call__? I guess it should be controlled by an option, and maybe apply only to indicators that perform resampling, since those are the main victims of task proliferation?
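One possible shape for such an option, purely as a sketch: `OPTIONS`, `fuse_dask_tasks`, and `SketchIndicator` are hypothetical names invented here and do not exist in xclim. The wrapper assumes the compute function only changes the length of the core dimension and leaves the other dimensions untouched.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical option registry; xclim's real options are not modelled here.
OPTIONS = {"fuse_dask_tasks": False}


class SketchIndicator:
    """Hypothetical stand-in for an indicator that resamples along `core_dim`."""

    def __init__(self, compute, core_dim="time"):
        self.compute = compute
        self.core_dim = core_dim

    def __call__(self, da):
        # Fall back to the usual lazy path for numpy input or when disabled.
        if not (OPTIONS["fuse_dask_tasks"] and da.chunks):
            return self.compute(da)
        # Each block must hold the full core dimension.
        da = da.chunk({self.core_dim: -1})
        # Lazy call used only to describe the output; rechunk it so that
        # it lines up block-for-block with the input.
        template = self.compute(da)
        chunks = {
            d: (-1 if d == self.core_dim else da.chunksizes[d])
            for d in template.dims
        }
        return xr.map_blocks(self.compute, da, template=template.chunk(chunks))


monthly_max = SketchIndicator(lambda x: x.resample(time="MS").max())

time = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.default_rng(0).random((730, 4, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
).chunk({"time": 73, "lat": 2})

plain = monthly_max(da)            # option off: regular xarray graph
OPTIONS["fuse_dask_tasks"] = True
fused = monthly_max(da)            # option on: one wrapped call per block
```

A real implementation would also have to handle Dataset inputs, multiple input variables, and auxiliary coordinates, which is where the sdba experience suggests most of the bugs would come from.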

@aulemahal
Collaborator Author

This issue is out of scope for xclim. flox has improved the "resample" case, and the "rolling" one might have to be improved by a similar process: within xarray or in a plugin like flox.
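For the resample case, recent xarray versions expose a `use_flox` option that can be toggled to compare the two code paths. The sketch below assumes such an xarray version; if flox is not installed, xarray falls back to its own implementation and both graphs come out the same. The data and chunking are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.default_rng(0).random((730, 4, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="tas",
).chunk({"time": 73})


def graph_size(arr):
    return len(dict(arr.data.__dask_graph__()))


# The graph is built when resample(...).mean() is called, so the option
# only needs to be active at that point.
with xr.set_options(use_flox=False):
    legacy = da.resample(time="MS").mean()

with xr.set_options(use_flox=True):
    with_flox = da.resample(time="MS").mean()

print("graph size without flox:", graph_size(legacy))
print("graph size with flox (if installed):", graph_size(with_flox))
```

There is no equivalent switch for rolling operations here, which matches the point that the rolling case would still need a separate fix.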

@dcherian

It'd be good to see a minimal example of the rolling problem. See also pydata/xarray#7344 (comment).
