WIP: Refactor TIFF backend to use async_tiff and obstore #488

Closed
maxrjones wants to merge 42 commits into develop from tiff_with_virtualobjectstore

Conversation

maxrjones (Member)

This PR does a few things, which may need to get split up but are convenient to prototype together:

There's still a lot to do here, but I'm opening the PR early to make it easier to ask people questions.

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas (Member)

This is awesome!

I would ideally like to see each of your bullets be a separate PR, but I can see it might be easiest to develop them all here and then split and merge them separately.

@TomNicholas added the enhancement and readers labels on Mar 16, 2025
Member

Let's not commit an 80MB file to the repo. The 30KB one is okay, but ideally we would only commit artificially-created data that is literally as small as possible.

Member Author

Sounds good, I'll remove it before finishing the PR. The 30 KB file is stripped rather than tiled; it was easier to implement tiled TIFFs first since I find them more intuitive to work with, which is why I added the larger file for now.

Implementing #347 for TIFF along with a tiny in-repo test file would be the best approach, but I don't want to get too distracted by other tempting issues 😅
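For what it's worth, a tiny tiled fixture could probably be generated with something like the sketch below (the filename, array shape, and use of tifffile are illustrative assumptions, not anything already in the repo):

```python
# Hypothetical sketch: write a tiny tiled TIFF to use as an in-repo test fixture.
import numpy as np
import tifffile  # assumed to be acceptable as a test dependency

data = np.arange(32 * 32, dtype="uint16").reshape(32, 32)
# tile=(16, 16) produces a tiled (rather than stripped) TIFF, matching the
# layout the refactored reader handles first.
tifffile.imwrite("tiny_tiled.tif", data, tile=(16, 16))
```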

Contributor

FWIW I try to avoid ever committing a large file to the repo, because if you add it and then remove it a few commits later, the entire file will still be in the Git history, so the repo will still be 80MB larger. I think if you squash-merge you'll be fine, though?

Member Author

Good point. I could use git-filter-repo if we're really concerned about whether the squashing / branch deletion will actually remove the file from history. I guess this is one reason to use forks, even though they sometimes make the collaboration/review process a bit less seamless.
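For reference, a single git-filter-repo invocation along these lines would rewrite the history to drop it (the path below is a placeholder, not the file's actual location):

```bash
# Remove the large fixture from every commit; run this on a fresh clone.
git filter-repo --invert-paths --path virtualizarr/tests/data/large_example.tif
```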

Contributor

I think even if you used a fork but then merged via squash or with a merge commit, it would still include all the intermediate commits, and so the large file would still be in the history.

Comment on lines +194 to +196

```python
# Convert to a Zarr store
return xr.open_dataset(
    manifest_store, engine="zarr", consolidated=False, zarr_format=3
```
Member

I'm confused - isn't this going to create a dataset with all loadable variables???

Member Author

All the variables are "loadable", but not loaded unless you call something like .data on them.

Member Author

GeoTIFFs don't really have coordinates, so I think I'd need to work with HDF5 files to make sure there's still an option to have 'virtual' coordinates and that those aren't loaded eagerly in open_dataset (they probably are).

Member

> All the variables are "loadable", but not loaded unless you call something like .data on them.

That's what I mean - isn't this going to create a Dataset full of lazy numpy arrays instead of ManifestArrays?

@maxrjones (Member Author), Mar 19, 2025

True, the ManifestArrays are stored in ManifestStore._manifest_group._manifest_dict rather than in each DataArray. I'll need to give this some thought, and probably have a discussion about how to approach it. We could have a final step that replaces the lazy numpy arrays with ManifestArrays for the non-loadable variables (roughly the sketch below), but that might require a lot more spaghetti in the ManifestStore internals.
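As a very rough sketch of what that final step could look like (it reaches into the private _manifest_group._manifest_dict attribute and takes loadable_variables explicitly; none of these names are stable API):

```python
import xarray as xr

def reattach_manifest_arrays(ds: xr.Dataset, manifest_store, loadable_variables=()):
    # Hypothetical post-processing step: swap the lazy numpy-backed arrays back
    # to the ManifestArrays held by the store for variables we don't want loaded.
    manifest_dict = manifest_store._manifest_group._manifest_dict
    for name, var in list(ds.variables.items()):
        if name not in loadable_variables and name in manifest_dict:
            ds[name] = xr.Variable(var.dims, manifest_dict[name], attrs=var.attrs)
    return ds
```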

Member Author

Putting aside how majorly disruptive this would be, why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time? I'm thinking about whether we could use this approach, call .load() on the loadable variables, and then use isinstance(var, numpy.ndarray) and isinstance(var, xarray.core.indexing.MemoryCachedArray) after the concatenation, merging, etc. to determine whether the data or the VirtualChunkSpec should be stored in Icechunk (roughly the sketch below).

If that's possible, we could provide all the functionality of both the Zarr and Xarray APIs, e.g., to_zarr could now be used to create a "native" Zarr store from any archival file format, or virtualize.to_icechunk could create a virtual Zarr store with some inlined data. If we're really concerned about accidental data loading, we could break the Zarr API a bit to prevent loading any data not included in loadable_variables.
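A minimal sketch of that write-time dispatch, assuming we can look the ManifestArrays up by name from the store (the helper names here are made up for illustration, not real VirtualiZarr API):

```python
import numpy as np
import xarray.core.indexing as indexing

def is_loaded(data) -> bool:
    # Loadable variables end up backed by plain numpy arrays or xarray's
    # in-memory cached wrappers after .load().
    return isinstance(data, (np.ndarray, indexing.MemoryCachedArray))

def choose_write_path(name: str, data, manifest_arrays: dict):
    # Decide per variable whether to inline the bytes or write chunk references.
    if is_loaded(data):
        return ("inline", np.asarray(data))
    return ("virtual", manifest_arrays[name])  # e.g. ManifestArray -> VirtualChunkSpec
```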

@TomNicholas (Member), Mar 20, 2025

> why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time?

Because you can't do arbitrary transformations on ManifestArrays like you can on numpy arrays. IIUC your approach (which I think can be summarized as "lazily fetching the ManifestArray corresponding to this numpy array") would very often leave people unable to write virtual variables to the store at write time because of some step they did earlier.

Member Author

I'm going to experiment a bit: if we have a loadable_arrays attribute on the ManifestStore, we can raise an error whenever Xarray calls .get() on an array that isn't marked as loadable, preventing any eager computations that would block serialization (a rough version of the guard is sketched below). So it wouldn't be a fully feature-complete dataset, but you'd get better error messages than the ones relating to ChunkManagers. It may still be a bad idea, but I think there's no harm in spending an afternoon trying it out.
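A rough sketch of that guard (this is not the real ManifestStore; the attribute names, key parsing, and fetch method are all assumptions):

```python
class GuardedStoreSketch:
    def __init__(self, manifest_group, loadable_arrays=()):
        self._manifest_group = manifest_group
        self.loadable_arrays = set(loadable_arrays)

    async def get(self, key, prototype=None, byte_range=None):
        # Metadata requests are always allowed; chunk requests are only allowed
        # for arrays explicitly marked as loadable.
        array_name = key.split("/")[0]
        if not key.endswith("zarr.json") and array_name not in self.loadable_arrays:
            raise RuntimeError(
                f"{array_name!r} is not in loadable_arrays; refusing to load chunk "
                "data that could silently prevent writing virtual references later."
            )
        return await self._fetch_bytes(key, byte_range)

    async def _fetch_bytes(self, key, byte_range):
        raise NotImplementedError("placeholder; the real store would go through obstore")
```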

Member

My intuition is that this is just going to un-separate the concerns we want to separate, but I support messing around 😀

codecov bot commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 76.51246% with 66 lines in your changes missing coverage. Please review.

Project coverage is 87.73%. Comparing base (81a760d) to head (71543cd).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| virtualizarr/storage/obstore.py | 61.32% | 41 Missing ⚠️ |
| virtualizarr/readers/tiff.py | 85.36% | 12 Missing ⚠️ |
| virtualizarr/vendor/zarr/metadata.py | 73.68% | 5 Missing ⚠️ |
| virtualizarr/storage/common.py | 88.57% | 4 Missing ⚠️ |
| virtualizarr/readers/common.py | 88.88% | 2 Missing ⚠️ |
| virtualizarr/manifests/group.py | 91.66% | 1 Missing ⚠️ |
| virtualizarr/types/general.py | 75.00% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop     #488      +/-   ##
===========================================
- Coverage    89.04%   87.73%   -1.32%
===========================================
  Files           27       31       +4
  Lines         1689     1932     +243
===========================================
+ Hits          1504     1695     +191
- Misses         185      237      +52
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| virtualizarr/__init__.py | 75.00% <ø> (ø) |
| virtualizarr/accessor.py | 94.33% <100.00%> (+0.10%) ⬆️ |
| virtualizarr/manifests/__init__.py | 100.00% <100.00%> (ø) |
| virtualizarr/manifests/utils.py | 89.23% <ø> (ø) |
| virtualizarr/readers/hdf/filters.py | 95.78% <100.00%> (-0.44%) ⬇️ |
| virtualizarr/manifests/group.py | 91.66% <91.66%> (ø) |
| virtualizarr/types/general.py | 80.00% <75.00%> (-20.00%) ⬇️ |
| virtualizarr/readers/common.py | 91.78% <88.88%> (-0.95%) ⬇️ |
| virtualizarr/storage/common.py | 88.57% <88.57%> (ø) |
| virtualizarr/vendor/zarr/metadata.py | 73.68% <73.68%> (ø) |

... and 2 more

@maxrjones (Member Author)

I'm going to close this and rewrite the implementation after #477 and #490 are merged, which should make the PR much simpler. I'm hoping the new PR will be ready for review by the middle of next week.

@maxrjones maxrjones closed this Mar 21, 2025
@maxrjones maxrjones deleted the tiff_with_virtualobjectstore branch March 21, 2025 13:40
Labels: enhancement, readers

Successfully merging this pull request may close these issues: open_virtual_dataset fails to open tiffs