WIP: Refactor TIFF backend to use async_tiff and obstore #488
Conversation
This is awesome! I would ideally like to see each of your bullets be a separate PR, but I can see it might be easiest to develop them all here and then split and merge separately.
Let's not commit an 80MB file to the repo. The 30KB one is okay, but ideally we would only commit artificially-created data that is literally as small as possible.
Sounds good, I'll remove it before finishing the PR. The 30 KB one is stripped rather than tiled, and it was easier to implement tiled TIFFs first since I find them more intuitive to work with, which is why I added it for now.
Implementing #347 for TIFF along with a tiny in-repo test would be the best approach, but I don't want to get too distracted by other tempting issues 😅
FWIW I try to avoid ever committing a large file to the repo, because if you add it and remove it a few commits later, the entire file will still be in the Git history, so the repo will still be 80MB larger. I think if you squash-merge you'll be fine though?
Good point, I could use git-filter-repo if we're super concerned about whether the squashing / branch deletion will actually remove the file. I guess this is one reason to use forks, even though they make the collaboration/review process a bit less seamless sometimes.
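(For reference, the invocation would be something like `git filter-repo --invert-paths --path <path-to-large-file>`, where the path placeholder is whatever file was committed; it rewrites the entire history to drop the file and has to be followed by a force-push.)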
I think even if you used a fork but then merged via squash or with a merge commit, it would include all the intermediate commits, which would include the large file in the history.
Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
# Convert to a Zarr store
return xr.open_dataset(
    manifest_store, engine="zarr", consolidated=False, zarr_format=3
)
I'm confused - isn't this going to create a dataset with all loadable variables???
All the variables are "loadable", but not loaded unless you call something like .data on them.
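A minimal sketch of that distinction, reusing the `manifest_store` from the snippet above (the variable name `"band_1"` is made up for illustration):

import xarray as xr

# Opening via the Zarr engine wraps each variable in a lazy indexing adapter;
# no chunk bytes are fetched at open time.
ds = xr.open_dataset(
    manifest_store, engine="zarr", consolidated=False, zarr_format=3
)

lazy = ds["band_1"]         # still lazy: only metadata has been read
values = ds["band_1"].data  # accessing .data triggers reads through the store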
GeoTIFFs don't really have coordinates, so I think I'd need to work with HDF5 files in order to make sure there's still an option to have 'virtual' coordinates and those aren't loaded eagerly in open_dataset (they probably are).
All the variables are "loadable", but not loaded unless you call something like .data on them.
That's what I mean - isn't this going to create a Dataset full of lazy numpy arrays instead of ManifestArrays?
True, the ManifestArrays are stored in ManifestStore._manifest_group._manifest_dict rather than in each DataArray. I'll need to give this some thought, and probably have a discussion about how to approach this. We could have a final step that replaces the lazy numpy arrays with ManifestArrays for non-loadable variables, but that might require a lot more spaghetti in the ManifestStore internals.
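For concreteness, a sketch of where the references live under this design (these are private attributes, and the `"band_1"` key plus the `.dict()` call are assumptions about the current internals, not a stable API):

# Hypothetical access path, mirroring the private attributes named above:
manifest_dict = manifest_store._manifest_group._manifest_dict
marr = manifest_dict["band_1"]   # a ManifestArray keyed by variable name
print(marr.manifest.dict())      # chunk key -> {path, offset, length} mapping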
Putting aside how majorly disruptive this would be, why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time? I'm thinking about whether we could use this approach, call .load() on loadable variables, then use isinstance(var, numpy.ndarray) and isinstance(var, xarray.core.indexing.MemoryCachedArray) after the concatenation, merging, etc. to determine whether the data or the VirtualChunkSpec should be stored in Icechunk.
If that's possible, we could provide all the functionality of both the Zarr and Xarray APIs. E.g., to_zarr could now be used to create a "native" Zarr store from any archival file format, or virtualize.to_icechunk could create a virtual Zarr store with some inlined data. If we are really concerned about accidental data loading, we could break the Zarr API a bit to prevent loading any data not included in loadable_variables.
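A rough sketch of that write-time dispatch (`partition_variables` is a hypothetical helper; the `isinstance` checks are the ones named above, and `Variable._data` is a private xarray attribute):

import numpy as np
import xarray as xr
from xarray.core.indexing import MemoryCachedArray

def partition_variables(ds: xr.Dataset):
    """Hypothetical: decide per-variable whether to serialize loaded data
    or look up its ManifestArray in the store and write chunk references."""
    for name, var in ds.variables.items():
        data = var._data  # underlying array wrapper (private attribute)
        if isinstance(data, (np.ndarray, MemoryCachedArray)):
            yield name, "native"   # values are in memory -> write real chunks
        else:
            yield name, "virtual"  # still lazy -> write the VirtualChunkSpec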
why must the data variables contain ManifestArrays, rather than indexing the ManifestArrays from the Zarr Store instance at write time?
Because you can't do arbitrary transformations on ManifestArrays like you can on numpy arrays. IIUC your approach (which I think can be summarized as "lazily fetching the ManifestArray corresponding to this numpy array") would very often leave people unable to write virtual variables to the store at write time, because of some step they did earlier.
I'm going to experiment a bit, because if we have a loadable_arrays attribute on the ManifestStore, we can raise an error if Xarray calls .get() on any of the arrays that aren't set as loadable, preventing any eager computations that would prevent serialization. So it wouldn't be a fully feature-complete dataset, but you'd get better error messages than those relating to ChunkManagers. It still may be a bad idea, but I think there's no harm in spending an afternoon trying it out.
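A sketch of that guard, assuming the `loadable_arrays` attribute described above; the method signature approximates Zarr v3's async Store.get, and the key parsing is a simplification:

class GuardedManifestStore:
    """Hypothetical wrapper: refuse chunk reads for non-loadable variables."""

    def __init__(self, store, loadable_arrays: set[str]):
        self._store = store
        self.loadable_arrays = loadable_arrays

    async def get(self, key, prototype, byte_range=None):
        # Zarr v3 chunk keys look like "<array_name>/c/0/0"; metadata keys
        # end in "zarr.json" and must stay readable for open_dataset to work.
        array_name = key.split("/")[0]
        if not key.endswith("zarr.json") and array_name not in self.loadable_arrays:
            raise ValueError(
                f"{array_name!r} is virtual-only; add it to loadable_variables "
                "to read its data eagerly"
            )
        return await self._store.get(key, prototype, byte_range)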
My intuition is that this is just going to un-separate the concerns we want to separate, but I support messing around 😀
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #488 +/- ##
===========================================
- Coverage 89.04% 87.73% -1.32%
===========================================
Files 27 31 +4
Lines 1689 1932 +243
===========================================
+ Hits 1504 1695 +191
- Misses 185 237 +52
This PR does a few things, which may need to get split up but are convenient to prototype together:
- open_virtual_dataarray for single IFD TIFF files (related to Support for groups #84)
- VirtualObjectStore for testing functionality without needing to serialize to kerchunk or icechunk (relates to Loading data from ManifestArrays without saving references to disk first #124)

There's still a lot to do here, but I'm opening the PR early to make it easier to ask people questions.
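For context, a hypothetical call pattern for the new entry point (the import path, signature, and URL are my guesses at what this WIP adds, not its final API):

from virtualizarr import open_virtual_dataarray  # import path assumed

# Open a single-IFD GeoTIFF as a virtual DataArray whose chunks reference
# byte ranges in the original file rather than copied data.
vda = open_virtual_dataarray("s3://bucket/example.tif")  # made-up URL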