-
A few points of feedback:
This method will put all of the chunks into memory. Wouldn't it potentially be more efficient to stream them to the store? An async iterator or futures might be a better return type.
This only works if the selection aligns exactly with chunk boundaries. For chunks that are being partially filled, you have to read them first, then update the chunk in memory with the new data, then write it again. 🤮 In the existing Zarr code, this happens here. It's not clear to me where that logic will live with this design.
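As a rough sketch of what a streaming write path could look like, assuming the `ChunkProjection` fields, `codec_pipeline`, `chunk_key_encoding`, and store `get`/`set` calls from the sketches further down the thread (the `encode_chunks` and `stream_to_store` helpers are made-up names, and fill-value handling for missing chunks is omitted):

```python
from typing import AsyncIterator, Tuple

async def encode_chunks(chunk_projections, value, store, codec_pipeline,
                        chunk_key_encoding) -> AsyncIterator[Tuple[str, bytes]]:
    """Yield (chunk_key, encoded_bytes) pairs one at a time instead of
    materializing every encoded chunk in memory."""
    for proj in chunk_projections:
        chunk_key = chunk_key_encoding.encode_key(proj.chunk_coords)
        if proj.is_full_chunk:
            chunk_array = value[proj.out_selection]
        else:
            # partial chunk: read-modify-write (assumes the decoded chunk is
            # writable; missing chunks / fill values are ignored here)
            chunk_bytes = await store.get(chunk_key)
            chunk_array = await codec_pipeline.decode(chunk_bytes)
            chunk_array[proj.chunk_selection] = value[proj.out_selection]
        yield chunk_key, await codec_pipeline.encode(chunk_array)

async def stream_to_store(chunk_projections, value, store, codec_pipeline,
                          chunk_key_encoding) -> None:
    # chunks are written as soon as they are encoded, never all held at once
    async for chunk_key, chunk_bytes in encode_chunks(
        chunk_projections, value, store, codec_pipeline, chunk_key_encoding
    ):
        await store.set(chunk_key, chunk_bytes)
```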
-
What is the motivation for adding the loading/storing logic into the chunk grid? I could imagine that the chunk grid abstraction is just turning array-level selections into chunk-level selections. Basically, what the Indexer currently does.

```python
class ChunkProjection:
    chunk_coords: ChunkCoords  # for turning chunk_coords into chunk keys with the chunk key encoding
    chunk_selection: SliceSelection
    out_selection: SliceSelection
    is_full_chunk: bool

class BaseChunkGrid:
    @abstractmethod
    def slice(self, selection) -> Iterator[ChunkProjection]:
        """Slice a selection into one selection per chunk"""
        ...

class AsyncArray:
    def __init__(self, metadata, store, ...):
        ...
        if metadata["chunk_grid"]["name"] == "regular":
            self._chunk_grid = RegularChunkGrid(array_metadata.chunk_grid, array_metadata.shape)
        ...

    async def getitem(self, selection) -> ArrayLike:
        out = ...
        for chunk_projection in self._chunk_grid.slice(selection):
            chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
            chunk_bytes = await self.store.get(chunk_key)
            chunk_array = await self.codec_pipeline.decode(chunk_bytes)
            out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]
        return out

    async def setitem(self, selection, value) -> None:
        for chunk_projection in self._chunk_grid.slice(selection):
            chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
            if chunk_projection.is_full_chunk:
                chunk_array = value[chunk_projection.out_selection]
                chunk_bytes = await self.codec_pipeline.encode(chunk_array)
                await self.store.set(chunk_key, chunk_bytes)
            else:
                ...
```

It could also handle chunk key resolution, but would need to coordinate with the chunk key encoding.
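To make the abstraction concrete, here is a rough, untested sketch of how a variable-size grid could implement the same `slice()` interface in one dimension. The `VariableChunkGrid1D` name and the boundary bookkeeping are just placeholders, not existing Zarr-Python code:

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate
from typing import Iterator

@dataclass
class ChunkProjection:  # mirrors the sketch above
    chunk_coords: tuple
    chunk_selection: slice
    out_selection: slice
    is_full_chunk: bool

class VariableChunkGrid1D:
    """Hypothetical 1-D grid where every chunk can have a different length."""

    def __init__(self, chunk_lengths: list[int]):
        self.chunk_lengths = chunk_lengths
        # cumulative start offset of each chunk, e.g. [0, 10, 25, 40]
        self.offsets = [0] + list(accumulate(chunk_lengths))

    def slice(self, selection: slice) -> Iterator[ChunkProjection]:
        # assumes an explicit, in-bounds slice with step 1
        start, stop = selection.start or 0, selection.stop
        i = bisect_right(self.offsets, start) - 1
        while i < len(self.chunk_lengths) and self.offsets[i] < stop:
            lo = max(start, self.offsets[i])
            hi = min(stop, self.offsets[i + 1])
            yield ChunkProjection(
                chunk_coords=(i,),
                chunk_selection=slice(lo - self.offsets[i], hi - self.offsets[i]),
                out_selection=slice(lo - start, hi - start),
                is_full_chunk=(hi - lo) == self.chunk_lengths[i],
            )
            i += 1
```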
-
@normanrz, I was thinking along the same lines. There is the matter of async-all-the-way for a more stream-like interface (below), but for the part->chunk mapping (or whatever we call it), the existing indexers, and indeed the more complete ones in xarray, already have most of what we need, which is good! In fact, it's a fast, trivial operation, so generating the mapping (e.g., a dict) doesn't even need to be async at all.
You cannot call this via a sync API, though, because then you have latency again. Elsewhere we talked about transactions or other ways to batch calls into async operations, so that's where we are again. Are we assuming in all these places that storage-chunks don't overlap multiple logical-chunks? That's probably a requirement, at least for writing, else you need locks.

```python
async def getitem(self, selection) -> ArrayLike:
    out = ...
    for chunk_projection in self._chunk_grid.slice(selection):
        chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
        chunk_bytes = await self.store.get(chunk_key)
        chunk_array = await self.codec_pipeline.decode(chunk_bytes)
        out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]
    return out
```

^ this does not run anything concurrently unless the calling code creates many coroutines and uses asyncio.gather.
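For example, the same pseudocode could fan the per-chunk work out into one coroutine per chunk and gather them, so the store latencies overlap (still a sketch, reusing the hypothetical names from the proposal above):

```python
import asyncio

async def getitem(self, selection) -> ArrayLike:
    out = ...

    async def _read_one(chunk_projection):
        chunk_key = self.chunk_key_encoding.encode_key(chunk_projection.chunk_coords)
        chunk_bytes = await self.store.get(chunk_key)
        chunk_array = await self.codec_pipeline.decode(chunk_bytes)
        out[chunk_projection.out_selection] = chunk_array[chunk_projection.chunk_selection]

    # one coroutine per chunk, run concurrently so store round-trips overlap
    await asyncio.gather(
        *(_read_one(p) for p in self._chunk_grid.slice(selection))
    )
    return out
```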
-
One point here: since that proposal allows a mix of variable and fixed chunking for the dimensions, the functionality is effectively a superset of the fixed case. E.g., an implementation of this chunk grid would likely also work for a fixed chunk grid, meaning there may not be a ton of value in having separate implementations for the regular and variable cases.
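A small illustration of why the fixed case falls out of the variable case: the per-dimension boundary computation is the same either way, with the regular grid being the degenerate variable grid. The `chunk_boundaries` helper below is hypothetical, not Zarr-Python API:

```python
import numpy as np

def chunk_boundaries(dim_len: int, chunks) -> np.ndarray:
    """Cumulative chunk offsets for one dimension.

    `chunks` is either an int (regular grid) or a sequence of ints
    (variable grid); the regular case reduces to the variable case.
    """
    if isinstance(chunks, int):
        n = -(-dim_len // chunks)          # ceil division
        sizes = np.full(n, chunks)
        sizes[-1] = dim_len - chunks * (n - 1)
    else:
        sizes = np.asarray(chunks)
    return np.concatenate([[0], np.cumsum(sizes)])

# mapping an array coordinate to a chunk index is identical in both cases:
#   np.searchsorted(boundaries, coord, side="right") - 1
```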
-
This discussion concerns the design for Zarr-Python's internal ChunkGrid API. This is a follow-on to #1583.
Background
Zarr v2 only supported a "regular chunk grid". The v3 spec introduces the notion of an extensible chunk grid. Because Zarr-Python was written with only a regular chunk grid in mind, there is not much of an abstraction around how the chunk grid is handled. The closest thing we have is the Indexer API, but those implementations all target the regular chunk grid case.
The most obvious chunk grid extension to consider is the "variable chunking" case (zep, prototype). Others could include:
Toward supporting these new chunk grids, I've been trying to think of the right level of abstraction we can offer as an internal API. I've come up with the idea described below.
Design thoughts
What is the job of the ChunkGrid? In short, it is responsible for mapping Indexers to keys in a store, then either loading from or storing to those keys.
Sparing many of the actual details, this is what I envision:
There is a lot of potential for shared logic between chunk grids (e.g. encoding/decoding, interacting with the store). If I had to summarize the perspective this brings: I've started to convince myself that we need to be dispatching out to standalone chunk grid implementations for the logic of assembling/disassembling arrays.
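As a very loose, hypothetical illustration of what that dispatch point could look like (the `assemble`/`disassemble` names and keyword parameters below are placeholders, not anything proposed concretely here):

```python
from abc import ABC, abstractmethod
from typing import Any

class ChunkGrid(ABC):
    """Hypothetical base class: each grid flavor owns assembling chunks
    into an output array and disassembling an array back into chunks."""

    @abstractmethod
    async def assemble(self, selection, *, store, codec_pipeline,
                       chunk_key_encoding) -> Any:
        """Read every chunk touched by `selection` and assemble the output."""

    @abstractmethod
    async def disassemble(self, selection, value, *, store, codec_pipeline,
                          chunk_key_encoding) -> None:
        """Split `value` into chunks and write them, handling partial
        chunks (read-modify-write) internally."""

class RegularChunkGrid(ChunkGrid):
    ...  # fixed chunk shape per dimension

class VariableChunkGrid(ChunkGrid):
    ...  # explicit list of chunk sizes per dimension
```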
@d-v-b has been digging in on this for the past few weeks, so he's likely out in front of me already, but this was in my head and I wanted to get it out.
cc @normanrz, @martindurant