[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87
Conversation
This is looking great so far @sharkinsspatial !

> kerchunk backend's specialized encoding translation logic

This part I would really like to either factor out, or at least really understand what it's doing. See #68
virtualizarr/readers/hdf.py
@@ -0,0 +1,206 @@
from typing import List, Mapping, Optional

import fsspec
Does one need fsspec if reading a local file? Is there any other way to read from S3 without fsspec at all?
Not with a filesystem-like API. You would have to use boto3 or aiobotocore directly.
This is one of the great virtues of fsspec and is not to be under-valued.
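A minimal sketch of the contrast being described here (bucket and key names are placeholders, and this is illustrative rather than anything in the PR): fsspec gives the same file-like API for local paths and S3 URLs, while going without it means dropping down to S3-specific calls such as boto3.

```python
import fsspec
import boto3

# With fsspec, an S3 URL is opened through the same file-like API as a local path.
with fsspec.open("s3://some-bucket/data.nc", mode="rb") as f:
    header = f.read(8)

# Without fsspec, you would use S3-specific calls directly, e.g. via boto3.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="some-bucket", Key="data.nc", Range="bytes=0-7")
header = obj["Body"].read()
```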
virtualizarr/readers/hdf.py
def virtual_vars_from_hdf(
    path: str,
    drop_variables: Optional[List[str]] = None,
) -> Mapping[str, xr.Variable]:
I like this as a way to interface with the code in `open_virtual_dataset`.
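For illustration only, the kind of delegation being described might look roughly like this (the wrapper shown here is a sketch, not the actual `open_virtual_dataset` implementation):

```python
import xarray as xr

def open_virtual_dataset_sketch(path: str, drop_variables=None) -> xr.Dataset:
    # The reader module handles all HDF5 parsing and returns virtual variables;
    # the caller just assembles them into a dataset.
    virtual_vars = virtual_vars_from_hdf(path, drop_variables=drop_variables)
    return xr.Dataset(virtual_vars)
```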
This looks cool @sharkinsspatial! My opinion is that it doesn't make sense to just forklift the kerchunk code into virtualizarr. What I would love to see is an extremely tight, strictly typed, unit-tested total refactor of the parsing logic. I think you're headed down the right path, but I encourage you to push as far as you can in that direction.
@rabernat Fully agree with your take above 👆 👍 . I'm trying to work through this incrementally whenever I can find some spare time. In the spirit of thorough test coverage 🎊, looking through your issue pydata/xarray#7388 and the corresponding PR, I'm not sure what the proper incantation of variable encoding configuration is to use.
@TomNicholas We're still investigating some failing tests associated with inconsistencies around coordinate variable round tripping and a compatibility issue with nodata and CF time encoding. But given that this backend is still considered experimental and requires an explicit opt-in, I think this PR is ok to review and merge. These are the tests which are currently xfailed and need further investigation: https://github.com/zarr-developers/VirtualiZarr/blob/hdf5_reader/virtualizarr/tests/test_integration.py#L199 This test fails with both the kerchunk hdf reader and the experimental reader.
Amazing! I'll take a more detailed look later / tomorrow / next week 😅
pyproject.toml
"h5py", | ||
"hdf5plugin", | ||
"numcodecs", | ||
"imagecodecs", | ||
"imagecodecs-numcodecs==2024.6.1", |
I think these should be optional, no? (`numcodecs` is still required.) That also means they can be removed from the `min-deps.yml` in CI.
"hdf5": HDF5VirtualBackend, | ||
"netcdf4": HDF5VirtualBackend, # note this is the same as for hdf5 | ||
"netcdf3": NetCDF3VirtualBackend, |
Shouldn't your reader be added here too? Sounds like it could go under `hdf` rather than `hdf5`? Or maybe that's too subtle...
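If it were registered, that might look something like the following (a hypothetical sketch; the discussion below settles on not registering the new reader for now):

```python
VIRTUAL_BACKENDS = {
    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,  # same reader as for hdf5
    "netcdf3": NetCDF3VirtualBackend,
    # hypothetical: the new h5py-based reader under its own key
    "hdf": HDFVirtualBackend,
}
```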
Since we discussed treating this new reader as experimental while we do some further battle testing, I'm forcing users to explicitly pass the new backend class `HDFVirtualBackend` to the optional `backend` argument of `open_virtual_dataset`, to distinguish it from using the kerchunk reader with filetype auto-detection.
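In other words, usage looks roughly like this (the file path is a placeholder, and the import location of `HDFVirtualBackend` is a guess based on the files touched in this PR):

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.readers.hdf import HDFVirtualBackend  # assumed import path

# Opt in to the experimental reader explicitly instead of relying on
# filetype auto-detection, which would select the kerchunk-based reader.
vds = open_virtual_dataset("file.nc", backend=HDFVirtualBackend)
```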
This looks great @sharkinsspatial ! Thank you for your patience. My comments are mostly just nits.

EDIT: The typing errors should be resolved by merging `main`.
"hdf5plugin", | ||
"imagecodecs", | ||
"imagecodecs-numcodecs==2024.6.1", |
Couldn't you just define the `test` dependencies as this list + the `hdf_reader` list? Instead of manually repeating entries.
In fact, because your tests are behind `@requires_X` decorators, couldn't we just consider these hdf5 dependencies as not required for the basic test suite (in the same way that icechunk is not required, for example)? But we still make sure to install all of them and run those tests in at least one CI job.
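The `@requires_X` pattern referred to here is roughly the following (a generic sketch, not necessarily the exact helper in this repo):

```python
import importlib.util

import pytest

def _importable(name: str) -> bool:
    # True if the optional dependency can be imported in this environment.
    return importlib.util.find_spec(name) is not None

# Tests marked with this decorator are skipped when hdf5plugin is not installed.
requires_hdf5plugin = pytest.mark.skipif(
    not _importable("hdf5plugin"), reason="requires hdf5plugin"
)
```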
if backend:
    backend_cls = backend
else:
    backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())  # type: ignore
So if both `backend` and `filetype` are explicitly specified, this will silently ignore the `filetype` argument. I think it should maybe warn or raise in that case, otherwise someone could do `open_virtual_dataset(file.grib, filetype='dmr++', backend=HDFVirtualBackend)` and it could become very unclear what just happened.
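A sketch of the suggested guard (illustrative only, not the merged implementation):

```python
# Fail loudly instead of silently preferring `backend` over an
# explicitly requested `filetype`.
if backend is not None and filetype is not None:
    raise ValueError(
        "Pass either `filetype` or `backend` to open_virtual_dataset, not both."
    )
```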
    path=filepath, reader_options=reader_options, group=group
)

coord_names = attrs.pop("coordinates", [])
Isn't this going to return a string like `"lat lon"` rather than a list like you want it to?
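If so, splitting on whitespace would give the expected list (a sketch, assuming the attribute is the usual space-separated CF `coordinates` string already decoded to `str`):

```python
# CF stores coordinate names as a single space-separated string, e.g. "lat lon".
coord_names = attrs.pop("coordinates", "").split()
```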
    loadable_variables,
)

virtual_vars = HDFVirtualBackend._virtual_vars_from_hdf(
The fact that my `@staticmethod` prevents you from just doing `virtual_vars = self._virtual_vars_from_hdf(` instead of `virtual_vars = HDFVirtualBackend._virtual_vars_from_hdf(` is maybe a code smell...
        return variables

    @staticmethod
    def _attrs_from_root_group(
I think this merely defaults to the root group? Suggestion: rename `_attrs_from_root_group` to `_attrs_from_group`.
Or maybe even just `._get_group_attrs`.
@@ -0,0 +1,146 @@
import dataclasses
Should we put `hdf_filters.py` behind a module, i.e. `virtualizarr.readers.hdf.filters.py`?
    return codec


def cfcodec_from_dataset(dataset: h5py.Dataset) -> Codec | None:
I think this deserves a docstring for context as to what it does
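For example, something along these lines (the wording is only a guess at the function's purpose, based on its name and the surrounding filter-translation code; the body is elided):

```python
def cfcodec_from_dataset(dataset: h5py.Dataset) -> Codec | None:
    """
    Inspect the CF packing attributes on an h5py dataset (e.g. scale_factor
    and add_offset) and build the matching numcodecs codec, or return None
    when the dataset carries no CF-style encoding to translate.
    """
    ...
```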
@@ -82,14 +83,15 @@ def test_FileType():


@requires_kerchunk
@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
In order to not rely on default behaviour? Suggestion: parametrize with `[HDF5VirtualBackend, HDFVirtualBackend]` instead of `[None, HDFVirtualBackend]`.
        open_virtual_dataset(hdf5_groups_file, group="doesnt_exist")

    @pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
    def test_group_kwarg(self, hdf5_groups_file, hdf_backend):
        if hdf_backend:
Yeah, I find this "`None` means kerchunk" convention confusing.
roundtrip = xr.open_dataset(kerchunk_file, engine="kerchunk", decode_times=True)
xrt.assert_allclose(ds, roundtrip)

@pytest.mark.xfail(reason="Coordinate issue affecting kerchunk and HDF reader.")
Can you point to the relevant issue number here?
This is a rudimentary initial implementation for #78. The core code is ported directly from kerchunk's hdf backend. I have not ported the bulk of the kerchunk backend's specialized encoding translation logic but I'll try to do so incrementally so that we can build complete test coverage for the many edge cases it currently covers.