Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve http url glob error #9930

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jan 8, 2025

Fixes some behaviour which confused me inside open_mfdataset's handling of remote urls.

Previously if you passed a glob-like http url as a single string, it would use fsspec to glob iff the chosen backend was zarr, inevitably find nothing, then return an unhelpful error saying it found no files.

Instead it now assumes that it's not possible to glob a http url, and raises as soon as it is asked to do that. The error message also now makes no explicit reference to the zarr backend, because as far as I can tell that is unrelated (another backend that understands http urls could encounter the same code).

This is directly related to this previous conversation in #4823, in which @martindurant says

I don't suppose anyone is using globbing with http, since it's generally unreliable.

However, I'm unsure if this PR disables something that someone might be relying on... It's a bit of a pain to test because I need both http urls and s3 bucket urls that are real and contain valid netcdf files in order to test all the code paths.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@martindurant
Copy link
Contributor

I don't suppose anyone is using globbing with http, since it's generally unreliable.

Unreliable, but still possible; in the specific case of FTP-like HTML pages, which some archives adhere to, the glob does do what you expect. In what cases do users pass HTTP URLs containing "*" characters?

else:
fsspec = attempt_import("fsspec")

fs, _, _ = fsspec.core.get_fs_token_paths(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be url_to_fs these days, as it's more user-facing.

expand=False,
)
tmp_paths = fs.glob(fs._strip_protocol(paths)) # finds directories
return [fs.get_mapper(path) for path in tmp_paths]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought zarr, in particular, didn't wan't mappers any more, but Store objects, or URLs that will be immediately converted to Stores within zarr anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhamman has the recommendation changed here recently?

),
expand=False,
)
tmp_paths = fs.glob(fs._strip_protocol(paths)) # finds directories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could pass detail=True, and then filter for directories

@TomNicholas
Copy link
Member Author

Unreliable, but still possible; in the specific case of FTP-like HTML pages, which some archives adhere to, the glob does do what you expect.

Oh interesting, so I probably have forbidden something that users might actually want to do. Though we apparently have no test to cover this...

In what cases do users pass HTTP URLs containing "*" characters?

I don't know. This error came up for me in the context of trying to use glob syntax with an s3 URL without the zarr backend and getting a confusing error telling me I need to use zarr, when I don't. Though I must admit this call was coming from inside VirtualiZarr, where I was trying to do something funky re-using internals of xr.open_mfdataset (see zarr-developers/VirtualiZarr#349), specifically xarray.backends.common._find_absolute_paths.

@martindurant
Copy link
Contributor

This error came up for me in the context of trying to use glob syntax with an s3 URL without the zarr backend and getting a confusing error telling me I need to use zarr, when I don't

That definitely doesn't sound great. We should ideally allow passing globs to S3 URLs of netCDF files or similar, since we have all the pieces. In the meantime, of course a clear error message is very important!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants