-
A few points of feedback:
This method will put all of the chunks into memory. Wouldn't it potentially be more efficient to stream them to the store? An async iterator or futures might be a better return type.
This only works if the selection aligns exactly with chunk boundaries. For chunks that are being partially filled, you have to read them first, then update the chunk in memory with the new data, then write it again. 🤮 In the existing Zarr code, this happens here. It's not clear to me where that logic will live with this design.
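As a rough sketch of what a streaming write path could look like, assuming the `ChunkProjection` fields, `codec_pipeline`, `chunk_key_encoding`, and store `get`/`set` calls from the sketches further down the thread (the `encode_chunks` and `stream_to_store` helpers are made-up names, and fill-value handling for missing chunks is omitted):

```python
from typing import AsyncIterator, Tuple

async def encode_chunks(chunk_projections, value, store, codec_pipeline,
                        chunk_key_encoding) -> AsyncIterator[Tuple[str, bytes]]:
    """Yield (chunk_key, encoded_bytes) pairs one at a time instead of
    materializing every encoded chunk in memory."""
    for proj in chunk_projections:
        chunk_key = chunk_key_encoding.encode_key(proj.chunk_coords)
        if proj.is_full_chunk:
            chunk_array = value[proj.out_selection]
        else:
            # partial chunk: read-modify-write (assumes the decoded chunk is
            # writable; missing chunks / fill values are ignored here)
            chunk_bytes = await store.get(chunk_key)
            chunk_array = await codec_pipeline.decode(chunk_bytes)
            chunk_array[proj.chunk_selection] = value[proj.out_selection]
        yield chunk_key, await codec_pipeline.encode(chunk_array)

async def stream_to_store(chunk_projections, value, store, codec_pipeline,
                          chunk_key_encoding) -> None:
    # chunks are written as soon as they are encoded, never all held at once
    async for chunk_key, chunk_bytes in encode_chunks(
        chunk_projections, value, store, codec_pipeline, chunk_key_encoding
    ):
        await store.set(chunk_key, chunk_bytes)
```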
-
What is the motivation for adding the loading/storing logic into the chunk grid? I could imagine that the chunk grid abstraction is just turning array-level selections into chunk-level selections. Basically, what the Indexer currently does.

```python
class ChunkProjection:
    chunk_coords: ChunkCoords  # for turning chunk_coords into chunk keys with the chunk key encoding
    chunk_selection: SliceSelection
    out_selection: SliceSelection
    is_full_chunk: bool

class BaseChunkGrid:
    @abstractmethod
    def slice(self, selection) -> Iterator[ChunkProjection]:
        """Slice a selection into one selection per chunk"""
        ...

class AsyncArray:
    def __init__(self, metadata, store, ...):
        ...
        if metadata["chunk_grid"]["name"] == "regular":
            self._chunk_grid = RegularChunkGrid(array_metadata.chunk_grid, array_metadata.shape)
        ...

    async def getitem(self, selection) -> ArrayLike:
        out = ...
        for chunk_projection in self._chunk_grid.slice(selection):
            chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
            chunk_bytes = await self.store.get(chunk_key)
            chunk_array = await self.codec_pipeline.decode(chunk_bytes)
            out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]
        return out

    async def setitem(self, selection, value) -> None:
        for chunk_projection in self._chunk_grid.slice(selection):
            chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
            if chunk_projection.is_full_chunk:
                chunk_array = value[chunk_projection.out_selection]
                chunk_bytes = await self.codec_pipeline.encode(chunk_array)
                await self.store.set(chunk_key, chunk_bytes)
            else:
                ...
```

It could also handle chunk key resolution, but would need to coordinate with the chunk key encoding.
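To make the abstraction concrete, here is a rough, untested sketch of how a variable-size grid could implement the same `slice()` interface in one dimension. The `VariableChunkGrid1D` name and the boundary bookkeeping are just placeholders, not existing Zarr-Python code:

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate
from typing import Iterator

@dataclass
class ChunkProjection:  # mirrors the sketch above
    chunk_coords: tuple
    chunk_selection: slice
    out_selection: slice
    is_full_chunk: bool

class VariableChunkGrid1D:
    """Hypothetical 1-D grid where every chunk can have a different length."""

    def __init__(self, chunk_lengths: list[int]):
        self.chunk_lengths = chunk_lengths
        # cumulative start offset of each chunk, e.g. [0, 10, 25, 40]
        self.offsets = [0] + list(accumulate(chunk_lengths))

    def slice(self, selection: slice) -> Iterator[ChunkProjection]:
        # assumes an explicit, in-bounds slice with step 1
        start, stop = selection.start or 0, selection.stop
        i = bisect_right(self.offsets, start) - 1
        while i < len(self.chunk_lengths) and self.offsets[i] < stop:
            lo = max(start, self.offsets[i])
            hi = min(stop, self.offsets[i + 1])
            yield ChunkProjection(
                chunk_coords=(i,),
                chunk_selection=slice(lo - self.offsets[i], hi - self.offsets[i]),
                out_selection=slice(lo - start, hi - start),
                is_full_chunk=(hi - lo) == self.chunk_lengths[i],
            )
            i += 1
```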
-
@normanrz, I was thinking along the same lines. There is the matter of async-all-the-way for a more stream-like interface (below), but for the part->chunk mapping (or whatever we call it), the existing indexers, and indeed the more complete ones in xarray, already have most of what we need, which is good! In fact, it's a fast, trivial operation, so generating the mapping (e.g., a dict) doesn't even need to be async at all.
You cannot call this via a sync API, though, because then you have latency again. Elsewhere we talked about transactions or other ways to batch calls into async operations, so that's where we are again. Are we assuming in all these places that storage-chunks don't overlap multiple logical-chunks? That's probably a requirement, at least for writing, else you need locks.

```python
async def getitem(self, selection) -> ArrayLike:
    out = ...
    for chunk_projection in self._chunk_grid.slice(selection):
        chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
        chunk_bytes = await self.store.get(chunk_key)
        chunk_array = await self.codec_pipeline.decode(chunk_bytes)
        out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]
    return out
```

^ this does not run anything concurrently unless the calling code creates many coroutines and uses asyncio.gather.
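For example, the same pseudocode could fan the per-chunk work out into one coroutine per chunk and gather them, so the store latencies overlap (still a sketch, reusing the hypothetical names from the proposal above):

```python
import asyncio

async def getitem(self, selection) -> ArrayLike:
    out = ...

    async def _read_one(chunk_projection):
        chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
        chunk_bytes = await self.store.get(chunk_key)
        chunk_array = await self.codec_pipeline.decode(chunk_bytes)
        out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]

    # one coroutine per chunk, run concurrently so store round-trips overlap
    await asyncio.gather(
        *(_read_one(p) for p in self._chunk_grid.slice(selection))
    )
    return out
```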
-
One point here: since that proposal allows a mix of variable and fixed chunking for the dimensions, the functionality is effectively a superset of the fixed case. E.g., an implementation of this chunk grid would likely also work for a fixed chunk grid, meaning there may not be a ton of value in having separate implementations for the regular and variable cases.
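A small illustration of why the fixed case falls out of the variable case: the per-dimension boundary computation is the same either way, with the regular grid being the degenerate variable grid. The `chunk_boundaries` helper below is hypothetical, not Zarr-Python API:

```python
import numpy as np

def chunk_boundaries(dim_len: int, chunks) -> np.ndarray:
    """Cumulative chunk offsets for one dimension.

    `chunks` is either an int (regular grid) or a sequence of ints
    (variable grid); the regular case reduces to the variable case.
    """
    if isinstance(chunks, int):
        n = -(-dim_len // chunks)          # ceil division
        sizes = np.full(n, chunks)
        sizes[-1] = dim_len - chunks * (n - 1)
    else:
        sizes = np.asarray(chunks)
    return np.concatenate([[0], np.cumsum(sizes)])

# mapping an array coordinate to a chunk index is identical in both cases:
#   np.searchsorted(boundaries, coord, side="right") - 1
```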
-
This discussion concerns the design for Zarr-Python's internal ChunkGrid API. This is a follow-on to #1583.
Background
Zarr v2 only supported a "regular chunk grid". The v3 spec introduces the notion of an extensible chunk grid. Because Zarr-Python was written with only a regular chunk grid in mind, there is not much of an abstraction around how the chunk grid is handled. The closest thing we have is the Indexer API, but those implementations all target the regular chunk grid case.
The most obvious chunk grid extension to consider is the "variable chunking" case (zep, prototype). Others could include:
Toward supporting these new chunk grids, I've been trying to think of the right level of abstraction we can offer as an internal API. I've come up with the idea described below.
Design thoughts
What is the job of the ChunkGrid? In short, it is responsible for mapping Indexers to keys in a store, then either loading from or storing to those keys.
Sparing many of the actual details, this is what I envision:
There is a lot of potential for shared logic between chunk grids (e.g. encoding/decoding, interacting with the store). If I had to summarize the perspective this brings: I've started to convince myself that we need to be dispatching out to standalone chunk grid implementations for the logic of assembling/disassembling arrays.
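As a very loose, hypothetical illustration of what that dispatch point could look like (the `assemble`/`disassemble` names and keyword parameters below are placeholders, not anything proposed concretely here):

```python
from abc import ABC, abstractmethod
from typing import Any

class ChunkGrid(ABC):
    """Hypothetical base class: each grid flavor owns assembling chunks
    into an output array and disassembling an array back into chunks."""

    @abstractmethod
    async def assemble(self, selection, *, store, codec_pipeline,
                       chunk_key_encoding) -> Any:
        """Read every chunk touched by `selection` and assemble the output."""

    @abstractmethod
    async def disassemble(self, selection, value, *, store, codec_pipeline,
                          chunk_key_encoding) -> None:
        """Split `value` into chunks and write them, handling partial
        chunks (read-modify-write) internally."""

class RegularChunkGrid(ChunkGrid):
    ...  # fixed chunk shape per dimension

class VariableChunkGrid(ChunkGrid):
    ...  # explicit list of chunk sizes per dimension
```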
@d-v-b has been digging in on this for the past few weeks, so he's likely out in front of me already, but this was in my head and I wanted to get it out.
cc @normanrz, @martindurant