Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache for HF file system is broken #661

Open
shcheklein opened this issue Dec 4, 2024 · 5 comments
Open

Cache for HF file system is broken #661

shcheklein opened this issue Dec 4, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@shcheklein
Copy link
Member

Image

Query to reproduce:

from datachain import DataChain, C

from datachain.lib.tar import process_tar

(
  DataChain.from_storage("hf://datasets/mozilla-foundation/common_voice_17_0")
    .settings(parallel=4, cache=True, prefetch=4)
  	.filter(C("file.path").glob("*/en/train/*.tar"))
  	.gen(file=process_tar)
    .save("common-voice-index")
)
@shcheklein shcheklein added the bug Something isn't working label Dec 4, 2024
@shcheklein
Copy link
Member Author

#747

Related #746

@shcheklein
Copy link
Member Author

@skshetry can this be closed?

@skshetry
Copy link
Member

Cache is still broken. The above script might work because #730 stopped prefetching for HF.

@shcheklein
Copy link
Member Author

@skshetry thanks! does require an upstream fix? could you please summarize (or copy from the previous tickets / discussion).

@skshetry
Copy link
Member

The base Client implementation expects an async filesystem. PTAL at scandir and other async methods in Client.

HF filesystem does not offer an asynchronous API, so there are two issues in our implementation:

  1. some fsspec methods that Client expects does not exist (eg: _info, etc).

  2. We have some async APIs implemented in HFClient that calls synchronous function.

    async def ls_dir(self, path):
    return self.fs.ls(path, detail=True)

    But this is a bad thing to do, as it blocks the event loop. (Similar stuff is done in FileClient, and I am surprised that this thing even passed reviews).

Luckily for caching, it seems we are using client.download and client.put_in_cache which are both synchronous function, so we could reimplement them in HFClient. And that would fix usages from File API.

For usages in AsyncMapper and other places where we need to concurrently perform operations, unfortunately, there are no good solutions. Either we'll be blocking event-loop by running a synchronous function, or we'll need to run the operation in another thread.

fsspec can do the latter automatically, it can generate AsyncFileSystem from a synchronous filesystem, where the operations are run in another thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants