Use lithops to parallelize open_mfdataset #9932
base: main
Conversation
Hmm, the test I added passes locally, but it looks like I need to install lithops via pip in the CI?
I think we should strongly consider just taking an Executor and calling …
That makes total sense, but what's the dask equivalent of the …
cc @phofl
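The Executor idea above could be sketched as follows. This is a minimal sketch, not the PR's implementation: the helper name, its signature, and the dict standing in for a real dataset are all assumptions. The appeal is that a thread pool, a lithops wrapper, or dask's `Client.get_executor()` would all satisfy the same `concurrent.futures.Executor` interface.

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def open_mfdataset_sketch(paths, executor: Executor, **open_kwargs):
    # Hypothetical helper: accept any concurrent.futures-style Executor
    # instead of hard-coding one parallel backend.
    # A dict stands in for the real open_dataset result in this sketch.
    def open_(path):
        return {"path": path, **open_kwargs}

    return list(executor.map(open_, paths))

with ThreadPoolExecutor(max_workers=2) as pool:
    datasets = open_mfdataset_sketch(["a.nc", "b.nc"], pool, engine="h5netcdf")
```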
```python
def generate_lazy_ds(path):
    # allows passing the open_dataset function to lithops without evaluating it
    ds = open_(path, **kwargs)
    return ds
```
this looks potentially like a `functools.partial` with `**kwargs`?
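That suggestion might look like the following sketch; `open_` here is a hypothetical dict-returning stand-in, not the real `open_dataset`:

```python
import functools

def open_(path, **kwargs):
    # stand-in for the real open_dataset call in this sketch
    return {"path": path, **kwargs}

# bind the kwargs up front with partial instead of defining a closure by hand
generate_lazy_ds = functools.partial(open_, engine="h5netcdf")

ds = generate_lazy_ds("file.nc")
```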
```python
futures = fn_exec.map(generate_lazy_ds, paths1d)

# wait for all the serverless workers to finish, and send their resulting lazy datasets back to the client
# TODO do we need download_results?
completed_futures, _ = fn_exec.wait(futures, download_results=True)
datasets = completed_futures.get_result()
```
I wonder if we can find an abstraction that works for both this (which is kinda like `concurrent.futures`' pool executors) and dask.

For example, maybe we can use `functools.partial` to mimic `dask.delayed`. The result would be a bunch of function objects without parameters, which would then be evaluated in `fn_exec.map` using `operator.call`.
(But I guess if we refactor the dask code as well we don't really need that idea)
Yep, that sounds sensible.
Experiment generalizing the `parallel` kwarg to `open_mfdataset` to accept `'lithops'`, instead of assuming that `dask` is the only option. Lithops can perform each `open_dataset` on a different serverless worker, though the test case here uses lithops' default configuration to just run on one machine.

As cubed can run the computations on its lithops executor, this could allow an entirely serverless user workflow.
Related to #7810, although the lithops API uses futures and the dask API uses delayed, which makes the case-handling logic in this PR a little convoluted.
Still has the same downside described in #8523, in that each lazy dataset will be sent back to the client (over the network).
Inspired by messing around with the same idea in virtualizarr zarr-developers/VirtualiZarr#349
- Closes #xxxx
- User visible changes are documented in `whats-new.rst`
- New functions/methods are listed in `api.rst`
cc @dcherian @keewis @tomwhite