Switch to custom netcdf4/hdf5 backend #395
* Switches autodetected backend selection
* updates tests to require kerchunk less often
* only test kerchunk hdf reader if kerchunk is available
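A minimal sketch of how "only test if kerchunk is available" can be decided, assuming a standard-library availability check. The helper name `has_package` is hypothetical and is not taken from this PR; the actual test suite may use a different mechanism.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package can be found without importing it.

    Hypothetical helper for illustration only. Intended for top-level names
    such as "kerchunk"; dotted names with a missing parent package raise.
    """
    return importlib.util.find_spec(name) is not None


# With pytest, the result could back a skip marker, for example:
#   requires_kerchunk = pytest.mark.skipif(
#       not has_package("kerchunk"), reason="kerchunk is not installed"
#   )
print(has_package("kerchunk"))
```

Tests decorated with such a marker are reported as skipped rather than failed in a kerchunk-free environment.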
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
+ Coverage   77.75%   84.23%   +6.47%
==========================================
  Files          31       31
  Lines        1821     1821
==========================================
+ Hits         1416     1534     +118
+ Misses        405      287     -118
These mostly still do not run in a kerchunk-free env. See #376 for that piece
vds = open_virtual_dataset(simple_netcdf4, loadable_variables=["foo"])
assert vds.virtualize.nbytes == 48
assert vds.virtualize.nbytes == 104
I guess I am not concerned that the nbytes is different now?
That seems weird... I would have expected them to be the same.
I think this has to do with coordinates being populated for dimensions that are supposed to be "without coordinates"; see https://github.com/zarr-developers/VirtualiZarr/pull/395/files#r1934273183.
    netcdf4_file_with_data_in_multiple_groups,
    group="subgroup",
    indexes={},
    backend=HDF5VirtualBackend,
)
This is what the output looks like for the kerchunk-based reader:
<xarray.Dataset> Size: 16B
Dimensions: (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
bar (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...
vs the custom reader:
<xarray.Dataset> Size: 32B
Dimensions: (dim_0: 2)
Coordinates:
dim_0 (dim_0) float64 16B 0.0 0.0
Data variables:
bar (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...
for reference here is what it looks like if I just naively open it as a regular dataset:
(Pdb) xr.open_dataset(netcdf4_file_with_data_in_multiple_groups, group="subgroup")
<xarray.Dataset> Size: 16B
Dimensions: (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
bar (dim_0) int64 16B ...
I think this is probably the source of the nbytes difference too
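The sizes in the two reprs above can be checked by hand. This is a rough sketch, assuming the reported size is simply the sum of element count times item size over all variables; it uses only the standard library and does not reproduce how xarray or VirtualiZarr compute nbytes internally. The dictionary layout and the name `total_nbytes` are illustrative.

```python
import struct

INT64_SIZE = struct.calcsize("=q")    # 8 bytes per int64 element
FLOAT64_SIZE = struct.calcsize("=d")  # 8 bytes per float64 element


def total_nbytes(variables: dict) -> int:
    """Sum length * itemsize over a {name: (length, itemsize)} mapping."""
    return sum(length * itemsize for length, itemsize in variables.values())


# kerchunk-based reader: only the data variable "bar" (2 x int64)
kerchunk_like = {"bar": (2, INT64_SIZE)}

# custom reader: "bar" plus an unexpected "dim_0" coordinate (2 x float64)
custom_like = {"bar": (2, INT64_SIZE), "dim_0": (2, FLOAT64_SIZE)}

print(total_nbytes(kerchunk_like))  # 16
print(total_nbytes(custom_like))    # 32
```

The extra 16 bytes correspond to the dim_0 coordinate that appears only in the custom reader's output.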
That seems like a bug in the HDF reader. If xarray doesn't think this dimension has a coordinate, then virtualizarr's HDF reader shouldn't create one either.
Opened #401 to track this (FYI @sharkinsspatial )
@TomNicholas @sharkinsspatial I pushed a commit (e18e647) to encode #401 in tests. I think this is good to merge now.
Amazing @jsignell! Only one minor suggestion.
Also very good call on not being overzealous and removing kerchunk-dependent tests entirely.
virtualizarr/tests/test_backend.py
    assert isinstance(vds["bar"].data, ManifestArray)
    assert vds["bar"].shape == (2,)

def test_open_root_group_manually(self, netcdf4_file_with_data_in_multiple_groups):
def test_open_root_group_manually(
Couldn't this test be combined with the one below by parameterizing group over ("", None)?
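A sketch of what that suggestion could look like, assuming pytest's `parametrize` marker. The `open_group` function and `FAKE_FILE` mapping are hypothetical stand-ins so the example runs without VirtualiZarr or a netCDF file; they are not the project's reader or fixtures, and the actual change in the PR may differ.

```python
import pytest

# Hypothetical in-memory stand-in for a file with a root group and a subgroup.
FAKE_FILE = {"/": {"foo": [1, 2, 3]}, "subgroup": {"bar": [4, 5]}}


def open_group(groups: dict, group):
    """Hypothetical reader: both "" and None are treated as the root group."""
    key = "/" if group in ("", None) else group
    return groups[key]


@pytest.mark.parametrize("group", ["", None])
def test_open_root_group(group):
    # One test body covers both spellings of "root group".
    ds = open_group(FAKE_FILE, group)
    assert "foo" in ds
    assert "bar" not in ds
```

Each value in the list produces a separate test case in the pytest report, so a failure identifies which spelling of the root group broke.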
Done in 8706af5
* Use open_dataset_kerchunk in roundtrip tests that don't otherwise require kerchunk
* Make it clear that integration tests require zarr-python
* Add in-memory icechunk tests to existing roundtrip tests
* Playing around with icechunk / zarr / xarray upgrade
* Passing icechunk tests
* Update tests to latest kerchunk
* Remove icechunk roundtripping
* Fixed some warnings
* Fixed codec test
* Fix warnings in test_backend.py
* Tests passing
* Remove obsolete comment
* Add fill value to fixture
* Remove obsolete conditional to ds.close()
* Reset workflows with --cov
* Reset conftest.py fixtures (air encoding)
* Reset contributing (--cov) removed
* Remove context manager from readers/common.py
* Reset test_backend with ds.dims
* Reset test_icechunk (air encoding)
* Fix change that snuck in on #395
---------
Co-authored-by: Aimee Barciauskas <[email protected]>
docs/releases.rst