Description
Blosc and Blosc2 segfault when given variable-width strings, both the legacy object strings and the new NpyStrings, a.k.a. StringDType.
This is caused by an upstream bug. PyTables is also affected.
#363 introduces unit tests for string dtypes, which have been temporarily skipped for blosc and blosc2.
Reproducer
| compression | i8 | S3 | object | T | 
|---|---|---|---|---|
| "gzip" | ✔️ | ✔️ | ✔️ | ✔️ | 
| "lzf" | ✔️ | ✔️ | ✔️ | ✔️ | 
| hdf5plugin.BZip2() | ✔️ | ✔️ | ✔️ | ✔️ | 
| hdf5plugin.LZ4() | ✔️ | ✔️ | ✔️ | ✔️ | 
| hdf5plugin.Blosc() | ✔️ | ✔️ | segfault | segfault | 
| hdf5plugin.Blosc2() | ✔️ | ✔️ | segfault | segfault | 
Full reproducer:
```python
import os
import h5py
import hdf5plugin
import numpy as np
fname = "/tmp/ds.h5"
for compression in (
    None,
    "gzip",
    "lzf",
    hdf5plugin.BZip2(),
    hdf5plugin.LZ4(),
    hdf5plugin.Blosc(),
    hdf5plugin.Blosc2(),
):
    for data in (
        np.asarray([1]),
        np.asarray(["foo"], dtype="S"),
        np.asarray([b"foo"], dtype="O"),
        np.asarray(["foo"], dtype="T"),
    ):
        print("desired compression =", compression)
        print("dtype =", data.dtype)
        # Optional: produce meaningful differences in file size
        data = np.tile(data, 1_000_000)
        with h5py.File(fname, "w") as f:
            f.create_dataset("mydataset", data=data, compression=compression)
        print("file size =", os.path.getsize(fname))
        with h5py.File(fname, "r+") as f:
            ds = f["mydataset"]
            print("actual compression =", ds.compression)
            print("compression_opts =", ds.compression_opts)
            actual = (ds.astype("T") if data.dtype.kind == "T" else ds)[:]
        np.testing.assert_array_equal(actual, data)
        print("=" * 80, flush=True)
```
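Because a segfault kills the interpreter, the loop above stops at the first failing combination. A possible workaround, sketched here under assumptions, is to run each case in a fresh Python process and classify its exit status. The helper `run_isolated` and the `TEMPLATE` string are hypothetical names introduced for illustration and are not part of h5py or hdf5plugin. The signal detection relies on POSIX semantics (negative return code); on Windows a crash would show up as a nonzero exit code instead.

```python
import subprocess
import sys
import textwrap


def run_isolated(snippet: str, timeout: float = 120.0) -> str:
    """Run `snippet` in a fresh interpreter and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", textwrap.dedent(snippet)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # POSIX only: a negative return code means the child was killed
        # by a signal, e.g. -11 for SIGSEGV.
        import signal

        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = str(-proc.returncode)
        return f"killed by {name}"
    return f"error (exit code {proc.returncode})"


# Hypothetical per-case template: write one dataset, then read it back.
TEMPLATE = """
import h5py, hdf5plugin, numpy as np
data = np.tile(np.asarray({data}), 1000)
with h5py.File({fname!r}, "w") as f:
    f.create_dataset("mydataset", data=data, compression={compression})
with h5py.File({fname!r}, "r") as f:
    f["mydataset"][:]
"""

if __name__ == "__main__":
    import os
    import tempfile

    fname = os.path.join(tempfile.mkdtemp(), "ds.h5")
    for compression in ('"gzip"', "hdf5plugin.Blosc()", "hdf5plugin.Blosc2()"):
        for data in ('[1]', '[b"foo"], dtype="O"', '["foo"], dtype="T"'):
            code = TEMPLATE.format(data=data, fname=fname, compression=compression)
            print(compression, "|", data, "->", run_isolated(code))
```

With this approach each combination gets its own line of output, so a table like the one above could be regenerated in a single run even while the upstream bug is present.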