WIP: Refactor TIFF backend to use async_tiff and obstore #488

Closed
maxrjones wants to merge 42 commits into develop from tiff_with_virtualobjectstore

Conversation

maxrjones (Member)

This PR does a few things, which may need to get split up but are convenient to prototype together:

There's still a lot to do here, but I'm opening the PR early to make it easier to ask people questions.

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas (Member)

This is awesome!

I would ideally like to see each of your bullets be a separate PR, but I can see it might be easiest to develop them all here and then split and merge them separately.

@TomNicholas added the enhancement and readers labels on Mar 16, 2025
Member

Let's not commit an 80MB file to the repo. The 30KB one is okay, but ideally we would only commit artificially-created data that is literally as small as possible.

Member Author

Sounds good, I'll remove it before finishing the PR. The 30 KB file is stripped rather than tiled; it was easier to implement tiled TIFFs first since I find them more intuitive to work with, which is why I added the larger file for now.

Implementing #347 for TIFF along with a tiny in-repo test file would be the best approach, but I don't want to get too distracted by other tempting issues 😅
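For what it's worth, a tiny tiled fixture could probably be generated with something like the sketch below (the filename, array shape, and use of tifffile are illustrative assumptions, not anything already in the repo):

```python
# Hypothetical sketch: write a tiny tiled TIFF to use as an in-repo test fixture.
import numpy as np
import tifffile  # assumed to be acceptable as a test dependency

data = np.arange(32 * 32, dtype="uint16").reshape(32, 32)
# tile=(16, 16) produces a tiled (rather than stripped) TIFF, matching the
# layout the refactored reader handles first.
tifffile.imwrite("tiny_tiled.tif", data, tile=(16, 16))
```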

Contributor

FWIW I try to avoid ever committing a large file to the repo, because if you add it and then remove it a few commits later, the entire file will still be in the Git history, so the repo will still be 80MB larger. I think if you squash-merge you'll be fine, though?

Member Author

Good point. I could use git-filter-repo if we're really concerned about whether the squashing / branch deletion will actually remove the file from history. I guess this is one reason to use forks, even though they sometimes make the collaboration/review process a bit less seamless.
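For reference, a single git-filter-repo invocation along these lines would rewrite the history to drop it (the path below is a placeholder, not the file's actual location):

```bash
# Remove the large fixture from every commit; run this on a fresh clone.
git filter-repo --invert-paths --path virtualizarr/tests/data/large_example.tif
```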

Contributor

I think even if you used a fork but then merged via squash or with a merge commit, it would still include all the intermediate commits, and so the large file would still be in the history.

Comment on lines +194 to +196

```python
# Convert to a Zarr store
return xr.open_dataset(
    manifest_store, engine="zarr", consolidated=False, zarr_format=3
```
Member

I'm confused - isn't this going to create a dataset with all loadable variables???

Member Author

All the variables are "loadable", but not loaded unless you call something like .data on them.

Member Author

GeoTIFFs don't really have coordinates, so I think I'd need to work with HDF5 files to make sure there's still an option to have 'virtual' coordinates and that those aren't loaded eagerly in open_dataset (they probably are).

Member

> All the variables are "loadable", but not loaded unless you call something like .data on them.

That's what I mean - isn't this going to create a Dataset full of lazy numpy arrays instead of ManifestArrays?

@maxrjones (Member Author), Mar 19, 2025

True, the ManifestArrays are stored in ManifestStore._manifest_group._manifest_dict rather than in each DataArray. I'll need to give this some thought, and probably have a discussion about how to approach it. We could have a final step that replaces the lazy numpy arrays with ManifestArrays for the non-loadable variables (roughly the sketch below), but that might require a lot more spaghetti in the ManifestStore internals.
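As a very rough sketch of what that final step could look like (it reaches into the private _manifest_group._manifest_dict attribute and takes loadable_variables explicitly; none of these names are stable API):

```python
import xarray as xr

def reattach_manifest_arrays(ds: xr.Dataset, manifest_store, loadable_variables=()):
    # Hypothetical post-processing step: swap the lazy numpy-backed arrays back
    # to the ManifestArrays held by the store for variables we don't want loaded.
    manifest_dict = manifest_store._manifest_group._manifest_dict
    for name, var in list(ds.variables.items()):
        if name not in loadable_variables and name in manifest_dict:
            ds[name] = xr.Variable(var.dims, manifest_dict[name], attrs=var.attrs)
    return ds
```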

Member Author

Putting aside how majorly disruptive this would be, why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time? I'm thinking about whether we could use this approach, call .load() on the loadable variables, and then use isinstance(var, numpy.ndarray) and isinstance(var, xarray.core.indexing.MemoryCachedArray) after the concatenation, merging, etc. to determine whether the data or the VirtualChunkSpec should be stored in Icechunk (roughly the sketch below).

If that's possible, we could provide all the functionality of both the Zarr and Xarray APIs, e.g., to_zarr could now be used to create a "native" Zarr store from any archival file format, or virtualize.to_icechunk could create a virtual Zarr store with some inlined data. If we're really concerned about accidental data loading, we could break the Zarr API a bit to prevent loading any data not included in loadable_variables.
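A minimal sketch of that write-time dispatch, assuming we can look the ManifestArrays up by name from the store (the helper names here are made up for illustration, not real VirtualiZarr API):

```python
import numpy as np
import xarray.core.indexing as indexing

def is_loaded(data) -> bool:
    # Loadable variables end up backed by plain numpy arrays or xarray's
    # in-memory cached wrappers after .load().
    return isinstance(data, (np.ndarray, indexing.MemoryCachedArray))

def choose_write_path(name: str, data, manifest_arrays: dict):
    # Decide per variable whether to inline the bytes or write chunk references.
    if is_loaded(data):
        return ("inline", np.asarray(data))
    return ("virtual", manifest_arrays[name])  # e.g. ManifestArray -> VirtualChunkSpec
```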

@TomNicholas (Member), Mar 20, 2025

> why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time?

Because you can't do arbitrary transformations on ManifestArrays like you can on numpy arrays. IIUC your approach (which I think can be summarized as "lazily fetching the ManifestArray corresponding to this numpy array") would very often leave people unable to write virtual variables to the store at write time because of some step they did earlier.

Member Author

I'm going to experiment a bit: if we have a loadable_arrays attribute on the ManifestStore, we can raise an error whenever Xarray calls .get() on an array that isn't marked as loadable, preventing any eager computations that would block serialization (a rough version of the guard is sketched below). So it wouldn't be a fully feature-complete dataset, but you'd get better error messages than the ones relating to ChunkManagers. It may still be a bad idea, but I think there's no harm in spending an afternoon trying it out.
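A rough sketch of that guard (this is not the real ManifestStore; the attribute names, key parsing, and fetch method are all assumptions):

```python
class GuardedStoreSketch:
    def __init__(self, manifest_group, loadable_arrays=()):
        self._manifest_group = manifest_group
        self.loadable_arrays = set(loadable_arrays)

    async def get(self, key, prototype=None, byte_range=None):
        # Metadata requests are always allowed; chunk requests are only allowed
        # for arrays explicitly marked as loadable.
        array_name = key.split("/")[0]
        if not key.endswith("zarr.json") and array_name not in self.loadable_arrays:
            raise RuntimeError(
                f"{array_name!r} is not in loadable_arrays; refusing to load chunk "
                "data that could silently prevent writing virtual references later."
            )
        return await self._fetch_bytes(key, byte_range)

    async def _fetch_bytes(self, key, byte_range):
        raise NotImplementedError("placeholder; the real store would go through obstore")
```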

Member

My intuition is that this is just going to un-separate the concerns we want to separate, but I support messing around 😀

codecov bot commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 76.51246% with 66 lines in your changes missing coverage. Please review.

Project coverage is 87.73%. Comparing base (81a760d) to head (71543cd).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| virtualizarr/storage/obstore.py | 61.32% | 41 Missing ⚠️ |
| virtualizarr/readers/tiff.py | 85.36% | 12 Missing ⚠️ |
| virtualizarr/vendor/zarr/metadata.py | 73.68% | 5 Missing ⚠️ |
| virtualizarr/storage/common.py | 88.57% | 4 Missing ⚠️ |
| virtualizarr/readers/common.py | 88.88% | 2 Missing ⚠️ |
| virtualizarr/manifests/group.py | 91.66% | 1 Missing ⚠️ |
| virtualizarr/types/general.py | 75.00% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop     #488      +/-   ##
===========================================
- Coverage    89.04%   87.73%   -1.32%
===========================================
  Files           27       31       +4
  Lines         1689     1932     +243
===========================================
+ Hits          1504     1695     +191
- Misses         185      237      +52
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| virtualizarr/__init__.py | 75.00% <ø> (ø) |
| virtualizarr/accessor.py | 94.33% <100.00%> (+0.10%) ⬆️ |
| virtualizarr/manifests/__init__.py | 100.00% <100.00%> (ø) |
| virtualizarr/manifests/utils.py | 89.23% <ø> (ø) |
| virtualizarr/readers/hdf/filters.py | 95.78% <100.00%> (-0.44%) ⬇️ |
| virtualizarr/manifests/group.py | 91.66% <91.66%> (ø) |
| virtualizarr/types/general.py | 80.00% <75.00%> (-20.00%) ⬇️ |
| virtualizarr/readers/common.py | 91.78% <88.88%> (-0.95%) ⬇️ |
| virtualizarr/storage/common.py | 88.57% <88.57%> (ø) |
| virtualizarr/vendor/zarr/metadata.py | 73.68% <73.68%> (ø) |

... and 2 more

@maxrjones (Member Author)

I'm going to close this and rewrite the implementation after #477 and #490 are merged, which should make the PR much simpler. I'm hoping the new PR will be ready for review by the middle of next week.

@maxrjones maxrjones closed this Mar 21, 2025
@maxrjones maxrjones deleted the tiff_with_virtualobjectstore branch March 21, 2025 13:40
Labels: enhancement, readers

Successfully merging this pull request may close these issues: open_virtual_dataset fails to open tiffs