Use lithops to parallelize open_mfdataset #9932

Draft · wants to merge 17 commits into main
Conversation

@TomNicholas (Member) commented Jan 8, 2025

Experiment generalizing the parallel kwarg to open_mfdataset to accept 'lithops', instead of assuming that dask is the only option. Lithops can perform each open_dataset call on a different serverless worker, though the test case here uses lithops' default configuration to just run on one machine.

As cubed can run the computations on its lithops executor, this could allow an entirely serverless user workflow like:

ds = open_mfdataset(
    's3://bucket/files*.nc',
    combine='by_coords',
    parallel='lithops',
    chunked_array_type='cubed',
)

Related to #7810, although the lithops API uses futures and the dask API uses delayed, which makes the case-handling logic in this PR a little convoluted.
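
Roughly, the dispatch ends up looking like this (a simplified sketch, not the actual diff; open_ is the backend-aware open function, kwargs and paths1d are the flattened open arguments and paths):

import dask
import lithops

# simplified sketch of the case handling, not the PR diff itself
if parallel == "dask":
    # dask: build delayed objects first, then compute them together
    open_delayed = dask.delayed(open_)
    datasets = dask.compute(*[open_delayed(p, **kwargs) for p in paths1d])
elif parallel == "lithops":
    # lithops: map eagerly over serverless workers and pull results back to the client
    fn_exec = lithops.FunctionExecutor()
    futures = fn_exec.map(lambda p: open_(p, **kwargs), paths1d)
    datasets = fn_exec.get_result(futures)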

Still has the same downside described in #8523, in that each lazy dataset will be sent back to the client (over the network).

Inspired by messing around with the same idea in virtualizarr zarr-developers/VirtualiZarr#349

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

cc @dcherian @keewis @tomwhite

@TomNicholas added the topic-combine (combine/concat/merge) and dependencies (Pull requests that update a dependency file) labels on Jan 8, 2025
@TomNicholas (Member, Author) commented:
Hmm, the test I added passes locally, but it looks like I need to install lithops via pip in the CI?

@dcherian (Contributor) commented Jan 8, 2025

I think we should strongly consider just taking an Executor and calling executor.submit or executor.map. The use of Delayed here to create a "lazy" future is pointless; everything is immediately "computed"/executed two lines later anyway.
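
(For reference, the kind of interface being suggested, as a minimal sketch using the stdlib executor API; open_, kwargs, and paths1d are the names from the snippets below:)

from concurrent.futures import ProcessPoolExecutor
from functools import partial

# minimal sketch of the Executor-style approach, not xarray code
open_with_kwargs = partial(open_, **kwargs)

with ProcessPoolExecutor() as executor:
    # submit returns futures immediately; results are gathered straight away
    futures = [executor.submit(open_with_kwargs, p) for p in paths1d]
    datasets = [f.result() for f in futures]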

@TomNicholas (Member, Author) commented:
That makes total sense, but what's the dask equivalent of the Executor?

@dcherian (Contributor) commented Jan 8, 2025

distributed.Client has map and submit. For the rest we might need to write a wrapper class that implements .submit as a .get on whichever scheduler is active.

cc @phofl
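
(A rough sketch of such a wrapper, with the hypothetical name ActiveSchedulerExecutor, built on dask's get_scheduler helper:)

from dask.base import get_scheduler
from dask.threaded import get as threaded_get

class ActiveSchedulerExecutor:
    # hypothetical wrapper: expose an Executor-style .map on top of
    # whichever dask scheduler is currently active
    def map(self, func, iterable):
        items = list(iterable)
        # build a tiny task graph and run it with the active scheduler's get,
        # falling back to the threaded scheduler if none is configured
        dsk = {f"task-{i}": (func, item) for i, item in enumerate(items)}
        scheduler_get = get_scheduler() or threaded_get
        return scheduler_get(dsk, list(dsk))

# e.g. datasets = ActiveSchedulerExecutor().map(open_with_kwargs, paths1d)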

Comment on lines +1648 to +1651
def generate_lazy_ds(path):
# allows passing the open_dataset function to lithops without evaluating it
ds = open_(path, **kwargs)
return ds
Review comment (Collaborator):

this looks potentially like a functools.partial with **kwargs?
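
(If so, the helper reduces to something like this sketch:)

from functools import partial

# sketch: equivalent of generate_lazy_ds above
generate_lazy_ds = partial(open_, **kwargs)
futures = fn_exec.map(generate_lazy_ds, paths1d)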

Comment on lines +1653 to +1658
futures = fn_exec.map(generate_lazy_ds, paths1d)

# wait for all the serverless workers to finish, and send their resulting lazy datasets back to the client
# TODO do we need download_results?
completed_futures, _ = fn_exec.wait(futures, download_results=True)
datasets = completed_futures.get_result()
@keewis (Collaborator) commented Jan 8, 2025:

I wonder if we can find an abstraction that works for both this (which is kinda like concurrent.futures' pool executors) and dask.

For example, maybe we can use functools.partial to mimic dask.delayed. The result would be a bunch of function objects without parameters, which would then be evaluated in fn_exec.map using operator.call.

(But I guess if we refactor the dask code as well we don't really need that idea)
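
(A sketch of that idea, assuming Python 3.11+ for operator.call; on older versions a small lambda task: task() would play the same role:)

import operator
from functools import partial

# sketch: parameterless callables built up front, then evaluated uniformly on the executor
tasks = [partial(open_, path, **kwargs) for path in paths1d]
futures = fn_exec.map(operator.call, tasks)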

@phofl (Contributor) commented Jan 8, 2025

> distributed.Client has map and submit. For the rest we might need to write a wrapper class that implements .submit as a .get on whichever scheduler is active.

Yep, that sounds sensible.

@TomNicholas marked this pull request as draft on January 8, 2025, 21:20