Performance Using partial_decompress=True
#1138 · Unanswered · ryanhausen asked this question in Q&A · 0 replies
Hi, I am experimenting with zarr to see if it would be a good fit for storing a dataset of approximately 48 TB. We currently store the data in shards of (512, 512, 512) containing (8, 8, 8) chunks, using custom software for quick read access. zarr seems to read a whole chunk in before accessing the data, which for our larger chunks is time-consuming. I tried the `partial_decompress=True` flag, but didn't see any noticeable improvement in access times. I think I am using it right, but could be wrong. Below is a toy example of how I am using zarr. Am I using it correctly? Should I expect a speed improvement using `partial_decompress`?

Using smaller chunks like (64, 64, 64) is much faster, but would create a large number of files. I saw that sharding is currently being worked on in #877, #1111, and zarr-developers/zarr-specs#152, which looks really promising for our use case.
Thanks for your help!