Replace ZArray with zarr.core.array.ArrayMetadata #411

abarciauskas-bgse · 2025-01-31T23:07:23Z

I don't believe there is an issue for this: We want to replace (or dramatically simplify) the custom ZArray class that exists in virtualizarr/zarr.py with zarr-python classes and methods.

Based on discussion in #175, I believe we want to replace ZArray with zarr.core.array.ArrayV2Metadata and zarr.core.array.ArrayV3Metadata and refactor every place that uses ZArray (and probably refactor codecs.py.

I looked at #175 and pulled some of them into a branch I have started working on (https://github.com/zarr-developers/VirtualiZarr/tree/ab/use-zarr-python-metadata, which I have only tested and made one test pass so far (virtualizarr/tests/test_manifests/test_array.py::TestStack::test_stack_empty), but wanted to an open an issue to track the approach to this change.

I think I will next look at the codecs tests to see what refactoring may be necessary there.

The text was updated successfully, but these errors were encountered:

TomNicholas · 2025-01-31T23:14:31Z

Why would we even need the V2 version of the class?

TomNicholas · 2025-01-31T23:16:47Z

Can't we just make all ManifestArrays use V3 metadata? The only place I can think of where we might need the V2 version is to deal with actual Zarr V2 data in @norlandrhagen 's Zarr reader PR, and even then we just immediately convert the metadata to V3 anyway.

TomNicholas · 2025-01-31T23:17:20Z

@d-v-b might have advice on this

abarciauskas-bgse · 2025-01-31T23:55:32Z

🤔 I can understand conceptually how in-memory the zarr array metadata could always be in the V3 format, but I could see the following instances where we might want to use a V2 Metadata representation:

Readers:

@norlandrhagen's Zarr reader will need to read Zarr V2 data, as you mentioned
Kerchunk reader: If opening from an existing Kerchunk store it is most likely, at this time, using V2 metadata

Writing:

Do we want to support writing to kerchunk and Zarr in zarr_format version 2?

TomNicholas · 2025-02-01T00:04:17Z

All those cases happen during creation or serialisation of ManifestArrays though. So we need functions to make ManifestArrays from V2 metadata, but we don't need the ManifestArray class itself to have flexibility, all ManifestArrays can just represent their metadata in the V3 format. I don't think kerchunk's format even makes a distinction right now, and we cannot write to Zarr, only Icechunk, which is only for V3.

…

On Fri, Jan 31, 2025, 6:55 PM Aimee Barciauskas ***@***.***> wrote: 🤔 I can understand conceptually how in-memory the zarr array metadata could always be in the V3 format, but I could see the following instances where we might want to use a V2 Metadata representation: Readers: 1. @norlandrhagen <https://github.com/norlandrhagen>'s Zarr reader will need to read Zarr V2 data, as you mentioned 2. Kerchunk reader: If opening from an existing Kerchunk store it is most likely, at this time, using V2 metadata Writing: 1. Do we want to support writing to kerchunk and Zarr in zarr_format version 2? — Reply to this email directly, view it on GitHub <#411 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AISNPIYFOMTSUBYG3UDDVIL2NQEQVAVCNFSM6AAAAABWISEMPSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMRYGU4DMNJUGQ> . You are receiving this because you commented.Message ID: ***@***.***>

d-v-b · 2025-02-01T11:23:11Z

+1 to using the metadata classes from zarr-python, this is exactly what they are designed for, and please let us know if there's anything we can improve about their design.

in zarr-python we made the AsyncArray class (i.e., the class that actually reads and writes chunks) generic w.r.t to v2 or v3 metadata. We should probably eventually do this for the group metadata too, because it lets us express the fact that a v2 group will only contain v2 arrays, and so on, without needing to define separate V2Array and V3Array classes. The semantics of v2 and v3 are different enough that you can't losslessly convert one to the other, so it might make sense to support both.

abarciauskas-bgse · 2025-02-01T17:31:53Z

Thanks @TomNicholas and @d-v-b

So if we decide all ManifestArrays will just use ArrayV3Metadata internally, then we just need to modify and test all readers are translating their format's metadata to v3.

For writers, I see why I was confused. We have 3 writer methods right now:

to_icechunk - this is clear, we only need to write to zarr format version 3
to_kerchunk - this was unclear, I didn't look closely at zarr-python v3 compatibility fsspec/kerchunk#516, and I think kerchunk still only supports write to zarr format version 2 (which is fine), so I think we will need a way to go from ArrayV3Metadata to zarr v2 kerchunk metadata when writing to kerchunk
to_zarr - We also have this to_zarr accessor based on the ChunkManifest ZEP. Should we remove the to_zarr method in light of icechunk's open source release?

I'm still parsing the code changes that may be required, but in terms of breaking down the work then, it seems like we could break this down into separate PRs to a zarr-python 3 branch:

ManifestArray uses ArrayV3Metadata
all readers that rely on kerchunk (I believe this would be kerchunk, fits, netcdf3, and tiff) create ManifestArrays with ArrayV3Metadata
the HDF reader ⬆
the Zarr V3 reader ⬆
the DMRPP reader ⬆
ManifestArray with ArrayV3Metadata can write to icechunk
ManifestArray with ArrayV3Metadata can write to kerchunk
Remove references to non-existent Zarr chunk manifest format #359

TomNicholas · 2025-02-01T17:58:23Z

Should we remove the to_zarr method in light of icechunk's open source release?

Yes, see #359.

Everything else you've written is exactly right.

We might want to change the interface for getting metadata from the ManifestArray. I think we should ape what zarr.Array does, which is have a .metadata property. So instead of .zarray, we would have

class ManifestArray:
    _manifest: ChunkManifest
    _metadata: ArrayV3Metadata
    
    def __init__(self, chunkmanifest, metadata: ArrayV3Metadata) -> None:
        ...

    @property
    def metadata(self) -> ArrayV3Metadata:
        ...

    def to_kerchunk_v2(self) -> KerchunkArrRefs:
        ...

To create the ManifestArrays we also want a function/method like

def extract_metadata_from_kerchunk(refs: KerchunkArrRefs) -> ArrayV3Metadata:
    ...

Once we have those swapping this in every reader should be straightforwards.

abarciauskas-bgse mentioned this issue Jan 31, 2025

Planning explicit dependency on Zarr v3 #392

Open

4 tasks

maxrjones moved this to Todo in VirtualiZarr Zarr-Python 3.0 migration Jan 31, 2025

maxrjones added this to VirtualiZarr Zarr-Python 3.0 migration Jan 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace ZArray with zarr.core.array.ArrayMetadata #411

Replace ZArray with zarr.core.array.ArrayMetadata #411

abarciauskas-bgse commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

abarciauskas-bgse commented Jan 31, 2025

TomNicholas commented Feb 1, 2025 via email

d-v-b commented Feb 1, 2025

abarciauskas-bgse commented Feb 1, 2025 •

edited

Loading

TomNicholas commented Feb 1, 2025

Replace ZArray with zarr.core.array.ArrayMetadata #411

Replace ZArray with zarr.core.array.ArrayMetadata #411

Comments

abarciauskas-bgse commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

TomNicholas commented Jan 31, 2025

abarciauskas-bgse commented Jan 31, 2025

TomNicholas commented Feb 1, 2025 via email

d-v-b commented Feb 1, 2025

abarciauskas-bgse commented Feb 1, 2025 • edited Loading

TomNicholas commented Feb 1, 2025

abarciauskas-bgse commented Feb 1, 2025 •

edited

Loading