Buffer uses signed bytes with v2 compressors #2735
Comments
Oops, I did not go far back enough in the blame. It actually came in with |
This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed. But it seems like in numpy dtype parlance, a byte is an 8-bit integer? I can see how those could be signed, but I'm not aware of anything in zarr-python that depends on a particular signing. Assuming nothing breaks if we switch to the signing that works for imagecodecs, then we should consider that change. @madsbk any insight here? |
This is an oversight, the buffers shouldn't care about signedness. |
To allow |
Also I just noticed that concatenating a "B" and "b" array in numpy returns
def __add__(self, other: core.Buffer) -> Self:
    """Concatenate two buffers"""
    other_array = other.as_array_like()
    assert other_array.dtype == np.dtype("b")
    return self.__class__(
        np.concatenate((np.asanyarray(self._data), np.asanyarray(other_array)))
    ) |
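For reference, here is a minimal sketch of what NumPy's type promotion does when signed and unsigned byte arrays are concatenated. It uses plain NumPy arrays rather than zarr's Buffer class, and the variable names are only illustrative.

```python
import numpy as np

signed = np.array([-1, 0, 1], dtype="b")    # int8
unsigned = np.array([254, 255], dtype="B")  # uint8

# Mixing the two promotes to a wider integer type that can hold both ranges,
# so the result is no longer one byte per element.
mixed = np.concatenate((signed, unsigned))
print(mixed.dtype, mixed.itemsize)

# Reinterpreting one side first keeps the result at one byte per element.
same = np.concatenate((signed.view("B"), unsigned))
print(same.dtype, same.itemsize)
```

So a dtype check (or a view) before concatenating is what keeps the result a plain byte array.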
All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.
Signed bytes are -128 to 127, unsigned bytes are 0 to 255. The only way to represent that whole range -128 to 255 is casting upward to 16 bits (which is -32768 to 32767). I don't know that it makes sense to allow both, but if you want to, then I would suggest allowing both as input, but changing internals to always be unsigned (using a view should be copy-free). |
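A rough sketch of that suggestion, assuming plain NumPy arrays: accept either signedness as input and reinterpret to unsigned with a view. The helper name `as_unsigned_bytes` is hypothetical and not part of zarr-python.

```python
import numpy as np


def as_unsigned_bytes(arr: np.ndarray) -> np.ndarray:
    """Hypothetical helper: accept int8 or uint8 and return a uint8 view."""
    if arr.dtype not in (np.dtype("b"), np.dtype("B")):
        raise TypeError(f"expected a byte array, got dtype {arr.dtype}")
    # view() reinterprets the same memory; no data is copied.
    return arr.view("B")


signed = np.array([-128, -1, 0, 127], dtype="b")
unsigned = as_unsigned_bytes(signed)

print(unsigned.tolist())                   # [128, 255, 0, 127]
print(np.shares_memory(signed, unsigned))  # True
```

The bit patterns are untouched; only the interpretation of the most-significant bit changes.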
This makes compressors consistent with v2, and seems more correct than signed bytes. Fixes zarr-developers#2735
But as far as I understood it, the buffer API shouldn't store bytes with integer semantics -- do I have this right @madsbk ? |
Yes, a Buffer is just a contiguous blob of memory. |
it sounds like |
Maybe just reinterpret the data, like buf.view(dtype="uint8")? |
yes, I think the expectation for consumers of |
Zarr version
3.0.1
Numcodecs version
0.15.0
Python Version
3.13.1
Operating System
Fedora Rawhide
Installation
Fedora package
Description
I'm looking at cgohlke/imagecodecs#123 and after fixing some imports and setting zarr_format=2, I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.
I have traced this to Buffer requiring dtype='b', along with casts in cpu.Buffer.from_bytes.
If I modify those checks/casts to use the unsigned dtype='B', then I can get imagecodecs tests to pass.
I see this came in with GPU support in #1967. Was this actually intentional, or was it more that no-one noticed that the NumPy b dtype is a signed byte? It would seem odd to me that bytes would be treated as signed as in regular Python they are treated as unsigned (e.g., b'\xff'[0] == 255, not -1).
If this is intentional, then it seems like something that should be documented in the migration guide that would break compressors for zarr_format=2.
Steps to reproduce
Install imagecodecs, modify tests to use zarr.storage.MemoryStorage instead of zarr.MemoryStorage, and set zarr_format=2, then run its tests.
Additional output
No response
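The signedness difference described in the issue can be seen with NumPy alone, without zarr or imagecodecs. This is only an illustrative sketch of how the same bytes read under the two dtypes, not a reproduction of the failing tests.

```python
import numpy as np

raw = b"\xff\x80\x01"

# Python's bytes type indexes as unsigned integers.
print(raw[0])  # 255

# The same memory read through NumPy's signed and unsigned byte dtypes.
as_signed = np.frombuffer(raw, dtype="b")    # int8, C "signed char"
as_unsigned = np.frombuffer(raw, dtype="B")  # uint8, C "uint8_t"

print(as_signed.tolist())    # [-1, -128, 1]
print(as_unsigned.tolist())  # [255, 128, 1]
```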