-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve http url glob error #9930
base: main
Are you sure you want to change the base?
Conversation
Unreliable, but still possible; in the specific case of FTP-like HTML pages, which some archives adhere to, the glob does do what you expect. In what cases do users pass HTTP URLs containing "*" characters? |
else: | ||
fsspec = attempt_import("fsspec") | ||
|
||
fs, _, _ = fsspec.core.get_fs_token_paths( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be url_to_fs these days, as it's more user-facing.
expand=False, | ||
) | ||
tmp_paths = fs.glob(fs._strip_protocol(paths)) # finds directories | ||
return [fs.get_mapper(path) for path in tmp_paths] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought zarr, in particular, didn't wan't mappers any more, but Store objects, or URLs that will be immediately converted to Stores within zarr anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhamman has the recommendation changed here recently?
), | ||
expand=False, | ||
) | ||
tmp_paths = fs.glob(fs._strip_protocol(paths)) # finds directories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could pass detail=True, and then filter for directories
Oh interesting, so I probably have forbidden something that users might actually want to do. Though we apparently have no test to cover this...
I don't know. This error came up for me in the context of trying to use glob syntax with an s3 URL without the zarr backend and getting a confusing error telling me I need to use zarr, when I don't. Though I must admit this call was coming from inside VirtualiZarr, where I was trying to do something funky re-using internals of |
That definitely doesn't sound great. We should ideally allow passing globs to S3 URLs of netCDF files or similar, since we have all the pieces. In the meantime, of course a clear error message is very important! |
Fixes some behaviour which confused me inside
open_mfdataset
's handling of remote urls.Previously if you passed a glob-like http url as a single string, it would use fsspec to glob iff the chosen backend was zarr, inevitably find nothing, then return an unhelpful error saying it found no files.
Instead it now assumes that it's not possible to glob a http url, and raises as soon as it is asked to do that. The error message also now makes no explicit reference to the zarr backend, because as far as I can tell that is unrelated (another backend that understands http urls could encounter the same code).
This is directly related to this previous conversation in #4823, in which @martindurant says
However, I'm unsure if this PR disables something that someone might be relying on... It's a bit of a pain to test because I need both http urls and s3 bucket urls that are real and contain valid netcdf files in order to test all the code paths.
Closes #xxxxwhats-new.rst
New functions/methods are listed inapi.rst