
Switch to custom netcdf4/hdf5 backend #395

Merged
12 commits merged into main from hdfbackend on Jan 30, 2025

Conversation

@jsignell (Contributor) commented Jan 28, 2025

  • Switches autodetected backend selection (sketched below)
  • Updates tests to require kerchunk less often
  • Only tests the kerchunk HDF reader if kerchunk is available
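
For a sense of what "autodetected backend selection" means here, a minimal sketch of magic-byte dispatch (the function name and return values are hypothetical; the real selection logic, which this PR points at the custom HDF backend, lives in virtualizarr/backend.py):

def autodetect_backend(path: str) -> str:
    # Hypothetical sketch: read the file signature and pick a reader.
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic.startswith(b"\x89HDF\r\n\x1a\n"):
        # netCDF4 files are HDF5 containers, so both formats route to
        # the custom HDF reader rather than the kerchunk-based one.
        return "hdf"
    if magic.startswith(b"CDF"):
        return "netcdf3"
    raise NotImplementedError(f"unrecognized file signature: {magic!r}")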

codecov bot commented Jan 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.23%. Comparing base (443928f) to head (529a5b5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
+ Coverage   77.75%   84.23%   +6.47%     
==========================================
  Files          31       31              
  Lines        1821     1821              
==========================================
+ Hits         1416     1534     +118     
+ Misses        405      287     -118     
Files with missing lines | Coverage | Δ
virtualizarr/backend.py  | 95.65% <ø> | -1.45% ⬇️

... and 16 files with indirect coverage changes

jsignell (Contributor Author):

These mostly still do not run in a kerchunk-free env; see #376 for that piece.
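
The usual pytest pattern for gating tests on an optional dependency looks something like the sketch below (names are illustrative; see #376 for the actual kerchunk-free test work):

import importlib.util

import pytest

# Skip marker for tests that genuinely need kerchunk installed.
requires_kerchunk = pytest.mark.skipif(
    importlib.util.find_spec("kerchunk") is None,
    reason="kerchunk is not installed",
)

@requires_kerchunk
def test_kerchunk_hdf_reader():
    ...  # would exercise the kerchunk-based HDF5 reader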


vds = open_virtual_dataset(simple_netcdf4, loadable_variables=["foo"])
assert vds.virtualize.nbytes == 48
assert vds.virtualize.nbytes == 104
jsignell (Contributor Author):

I guess I am not concerned that the nbytes is different now?

Member:

That seems weird... I would have expected them to be the same.

jsignell (Contributor Author):

I think this has to do with coordinates being populated for dimensions that are supposed to be "without coordinates"; see https://github.com/zarr-developers/VirtualiZarr/pull/395/files#r1934273183.

@TomNicholas added the labels Kerchunk (Relating to the kerchunk library / specification itself), testing, and dependencies (Updates a dependency) on Jan 29, 2025
@maxrjones added the label v3-migration (Required for migration to Zarr-Python 3.0) on Jan 29, 2025
vds = open_virtual_dataset(
    netcdf4_file_with_data_in_multiple_groups,
    group="subgroup",
    indexes={},
    backend=HDF5VirtualBackend,
)
jsignell (Contributor Author):

This is what the output looks like for the kerchunk-based reader:

<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

vs the custom reader:

<xarray.Dataset> Size: 32B
Dimensions:  (dim_0: 2)
Coordinates:
    dim_0    (dim_0) float64 16B 0.0 0.0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

For reference, here is what it looks like if I just naively open it as a regular dataset:

(Pdb) xr.open_dataset(netcdf4_file_with_data_in_multiple_groups, group="subgroup")
<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ...

I think this is probably the source of the nbytes difference too (the spurious dim_0 coordinate is two float64 values, exactly the 16B gap between the two reprs above).
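
A hedged sketch reproducing the comparison above (the backend class and fixture names come from this PR, while the import path for HDF5VirtualBackend is an assumption):

import xarray as xr
from virtualizarr import open_virtual_dataset
from virtualizarr.readers.hdf5 import HDF5VirtualBackend  # import path assumed

# kerchunk-based reader: dim_0 stays a "dimension without coordinates"
vds_kerchunk = open_virtual_dataset(
    netcdf4_file_with_data_in_multiple_groups,  # path from the test fixture
    group="subgroup",
    indexes={},
    backend=HDF5VirtualBackend,
)
assert "dim_0" not in vds_kerchunk.coords

# custom reader (the new default): materializes a spurious dim_0 coordinate
vds_custom = open_virtual_dataset(
    netcdf4_file_with_data_in_multiple_groups,
    group="subgroup",
    indexes={},
)
assert "dim_0" in vds_custom.coords  # the behaviour tracked in #401

# plain xarray agrees with the kerchunk-based reader
ds = xr.open_dataset(netcdf4_file_with_data_in_multiple_groups, group="subgroup")
assert "dim_0" not in ds.coords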

Member:

That seems like a bug in the HDF reader. If xarray doesn't think this dimension has a coordinate, then virtualizarr's HDF reader shouldn't create one either.

Opened #401 to track this (FYI @sharkinsspatial )

@jsignell (Contributor Author):

@TomNicholas @sharkinsspatial I pushed a commit (e18e647) to encode #401 in tests. I think this is good to merge now.

@TomNicholas (Member) left a review:

Amazing @jsignell! Only one minor suggestion.

Also, very good call not to be overzealous and remove the kerchunk-dependent tests entirely.

assert isinstance(vds["bar"].data, ManifestArray)
assert vds["bar"].shape == (2,)

- def test_open_root_group_manually(self, netcdf4_file_with_data_in_multiple_groups):
+ def test_open_root_group_manually(
Member:

Couldn't this test be combined with the one below by parameterizing group over ("", None)?
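
A sketch of the suggested parametrization (the fixture and call come from the surrounding diff; the root-group variable name is assumed):

import pytest

@pytest.mark.parametrize("group", ["", None])
def test_open_root_group(self, netcdf4_file_with_data_in_multiple_groups, group):
    vds = open_virtual_dataset(
        netcdf4_file_with_data_in_multiple_groups,
        group=group,
        indexes={},
    )
    # "foo" is assumed to be a variable in the root group
    assert isinstance(vds["foo"].data, ManifestArray)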

Member:

Done in 8706af5

@TomNicholas TomNicholas merged commit 81a76f0 into zarr-developers:main Jan 30, 2025
11 checks passed
@jsignell jsignell deleted the hdfbackend branch January 30, 2025 13:53
jsignell added a commit to jsignell/VirtualiZarr that referenced this pull request Jan 31, 2025
sharkinsspatial added a commit that referenced this pull request Jan 31, 2025
Do not create variables for non coordinate dimension hdf datasets. (#410)

* Do not create variables for non coordinate dimension hdf datasets.

* Revert test changes to avoid HDFVirtualBackend errors from #395.

* Re-enable xfailed roundtrip integration test.

* Fix HDF5 type usage.

* Fix indent error for scanning HDF5 items.
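
Regarding the first bullet above: netCDF4 marks coordinate-less dimensions with a well-known HDF5 dimension-scale attribute, so a fix can skip them along these lines (a sketch; the function name is hypothetical and not necessarily what #410 does):

import h5py

def is_dimension_without_coordinate(dset: h5py.Dataset) -> bool:
    # netCDF4 writes dimension scales for coordinate-less dimensions
    # with this NAME attribute; skipping them avoids materializing
    # spurious coordinate variables like dim_0 above (issue #401).
    name = dset.attrs.get("NAME")
    return isinstance(name, bytes) and name.startswith(
        b"This is a netCDF dimension but not a netCDF variable"
    )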
TomNicholas pushed a commit that referenced this pull request Jan 31, 2025
* Use open_dataset_kerchunk in roundtrip tests that don't otherwise require kerchunk

* Make it clear that integration tests require zarr-python

* Add in-memory icechunk tests to existing roundtrip tests

* Playing around with icechunk / zarr / xarray upgrade

* Passing icechunk tests

* Update tests to latest kerchunk

* Remove icechunk roundtripping

* Fixed some warnings

* Fixed codec test

* Fix warnings in test_backend.py

* Tests passing

* Remove obsolete comment

* Add fill value to fixture

* Remove obsolete conditional to ds.close()

* Reset workflows with --cov

* Reset conftest.py fixtures (air encoding)

* Reset contributing (--cov) removed

* Remove context manager from readers/common.py

* Reset test_backend with ds.dims

* Reset test_icechunk (air encoding)

* Fix change that snuck in on #395

---------

Co-authored-by: Aimee Barciauskas <[email protected]>
Labels
dependencies (Updates a dependency), Kerchunk (Relating to the kerchunk library / specification itself), testing, v3-migration (Required for migration to Zarr-Python 3.0)
Development

Successfully merging this pull request may close these issues.

Switch tests to use HDF reader instead of kerchunk-based HDF5 reader
3 participants