Replies: 2 comments 5 replies
It's important to distinguish between zarr the format and zarr-python, one implementation of it. We are trying to make indexing faster in zarr-python.
I think the tensorstore tutorial is a good place to start. You probably wouldn't want to use the format in that example (n5); instead, you can use tensorstore to read and write zarr v2 and v3 arrays (to write zarr groups you can just use zarr-python).
I'm looking for ways to improve the performance of my data-loading pipeline and I found Zarr. To get an idea about throughput, I wrote a small benchmark script in Python. For a baseline, I also ran tests using Numpy memory-mapped arrays.
I'm working with 4D arrays which are quite large. One of my requirements is that I need to access them as a key-value store. Within each value, I read random indices along the first axis.
I created some dummy arrays to test throughput.
Here is my complete benchmarking code that compares Zarr to accessing raw Numpy arrays on disk:
It turns out that the memory-mapped Numpy arrays outperform Zarr by a factor of ~6-7.
My disk's maximum speed is 500 MB/s, and I reach roughly 400 MB/s using numpy. With Zarr I see a throughput of ~50-60 MB/s.
This difference is so big that I feel like I must be missing something. I tried different chunk sizes and disabled compression completely. Still, Zarr never reaches a throughput that comes even close to Numpy's memory mapped arrays.
Does anyone have a hint on what I'm missing? Is Zarr generally slow for my use case of randomly accessing large 4D arrays?
I'd appreciate any help.