Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ZArray with zarr.core.array.ArrayMetadata #411

Open
abarciauskas-bgse opened this issue Jan 31, 2025 · 8 comments
Open

Replace ZArray with zarr.core.array.ArrayMetadata #411

abarciauskas-bgse opened this issue Jan 31, 2025 · 8 comments

Comments

@abarciauskas-bgse
Copy link
Collaborator

I don't believe there is an issue for this: We want to replace (or dramatically simplify) the custom ZArray class that exists in virtualizarr/zarr.py with zarr-python classes and methods.

Based on discussion in #175, I believe we want to replace ZArray with zarr.core.array.ArrayV2Metadata and zarr.core.array.ArrayV3Metadata and refactor every place that uses ZArray (and probably refactor codecs.py.

I looked at #175 and pulled some of them into a branch I have started working on (https://github.com/zarr-developers/VirtualiZarr/tree/ab/use-zarr-python-metadata, which I have only tested and made one test pass so far (virtualizarr/tests/test_manifests/test_array.py::TestStack::test_stack_empty), but wanted to an open an issue to track the approach to this change.

I think I will next look at the codecs tests to see what refactoring may be necessary there.

@TomNicholas
Copy link
Member

Why would we even need the V2 version of the class?

@TomNicholas
Copy link
Member

Can't we just make all ManifestArrays use V3 metadata? The only place I can think of where we might need the V2 version is to deal with actual Zarr V2 data in @norlandrhagen 's Zarr reader PR, and even then we just immediately convert the metadata to V3 anyway.

@TomNicholas
Copy link
Member

@d-v-b might have advice on this

@abarciauskas-bgse
Copy link
Collaborator Author

🤔 I can understand conceptually how in-memory the zarr array metadata could always be in the V3 format, but I could see the following instances where we might want to use a V2 Metadata representation:

Readers:

  1. @norlandrhagen's Zarr reader will need to read Zarr V2 data, as you mentioned
  2. Kerchunk reader: If opening from an existing Kerchunk store it is most likely, at this time, using V2 metadata

Writing:

  1. Do we want to support writing to kerchunk and Zarr in zarr_format version 2?

@TomNicholas
Copy link
Member

TomNicholas commented Feb 1, 2025 via email

@d-v-b
Copy link

d-v-b commented Feb 1, 2025

+1 to using the metadata classes from zarr-python, this is exactly what they are designed for, and please let us know if there's anything we can improve about their design.

in zarr-python we made the AsyncArray class (i.e., the class that actually reads and writes chunks) generic w.r.t to v2 or v3 metadata. We should probably eventually do this for the group metadata too, because it lets us express the fact that a v2 group will only contain v2 arrays, and so on, without needing to define separate V2Array and V3Array classes. The semantics of v2 and v3 are different enough that you can't losslessly convert one to the other, so it might make sense to support both.

@abarciauskas-bgse
Copy link
Collaborator Author

abarciauskas-bgse commented Feb 1, 2025

Thanks @TomNicholas and @d-v-b

So if we decide all ManifestArrays will just use ArrayV3Metadata internally, then we just need to modify and test all readers are translating their format's metadata to v3.

For writers, I see why I was confused. We have 3 writer methods right now:

  1. to_icechunk - this is clear, we only need to write to zarr format version 3
  2. to_kerchunk - this was unclear, I didn't look closely at zarr-python v3 compatibility fsspec/kerchunk#516, and I think kerchunk still only supports write to zarr format version 2 (which is fine), so I think we will need a way to go from ArrayV3Metadata to zarr v2 kerchunk metadata when writing to kerchunk
  3. to_zarr - We also have this to_zarr accessor based on the ChunkManifest ZEP. Should we remove the to_zarr method in light of icechunk's open source release?

I'm still parsing the code changes that may be required, but in terms of breaking down the work then, it seems like we could break this down into separate PRs to a zarr-python 3 branch:

  • ManifestArray uses ArrayV3Metadata
  • all readers that rely on kerchunk (I believe this would be kerchunk, fits, netcdf3, and tiff) create ManifestArrays with ArrayV3Metadata
  • the HDF reader ⬆
  • the Zarr V3 reader ⬆
  • the DMRPP reader ⬆
  • ManifestArray with ArrayV3Metadata can write to icechunk
  • ManifestArray with ArrayV3Metadata can write to kerchunk
  • Remove references to non-existent Zarr chunk manifest format #359

@TomNicholas
Copy link
Member

Should we remove the to_zarr method in light of icechunk's open source release?

Yes, see #359.

Everything else you've written is exactly right.

We might want to change the interface for getting metadata from the ManifestArray. I think we should ape what zarr.Array does, which is have a .metadata property. So instead of .zarray, we would have

class ManifestArray:
    _manifest: ChunkManifest
    _metadata: ArrayV3Metadata
    
    def __init__(self, chunkmanifest, metadata: ArrayV3Metadata) -> None:
        ...

    @property
    def metadata(self) -> ArrayV3Metadata:
        ...

    def to_kerchunk_v2(self) -> KerchunkArrRefs:
        ...

To create the ManifestArrays we also want a function/method like

def extract_metadata_from_kerchunk(refs: KerchunkArrRefs) -> ArrayV3Metadata:
    ...

Once we have those swapping this in every reader should be straightforwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

3 participants