Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer uses signed bytes with v2 compressors #2735

Open
QuLogic opened this issue Jan 20, 2025 · 11 comments · May be fixed by #2738
Open

Buffer uses signed bytes with v2 compressors #2735

QuLogic opened this issue Jan 20, 2025 · 11 comments · May be fixed by #2738
Labels
bug Potential issues with the zarr-python library

Comments

@QuLogic
Copy link
Contributor

QuLogic commented Jan 20, 2025

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.13.1

Operating System

Fedora Rawhide

Installation

Fedora package

Description

I'm looking at cgohlke/imagecodecs#123 and after fixing some imports and setting zarr_format=2, I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

I have traced this to Buffer requiring dtype='b', along with casts in cpu.Buffer.from_bytes.

If I modify those checks/casts to use the unsigned dtype='B', then I can get imagecodecs tests to pass.

I see this came in with GPU support in #1967. Was this actually intentional, or was it more that no-one noticed that the NumPy b dtype is a signed byte? It would seem odd to me that bytes would be treated as signed as in regular Python they are treated as unsigned (e.g., b'\xff'[0] == 255, not -1).

If this is intentional, then it seems like something that should be documented in the migration guide that would break compressors for zarr_format=2.

Steps to reproduce

Install imagecodecs, modify tests to use zarr.storage.MemoryStorage instead of zarr.MemoryStorage, and set zarr_format=2, then run its tests.

Additional output

No response

@QuLogic QuLogic added the bug Potential issues with the zarr-python library label Jan 20, 2025
@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 20, 2025

I see this came in with GPU support in #1967.

Oops, I did not go far back enough in the blame. It actual came in with Buffer's first implementation in #1826. I don't see any specific discussion on signedness other than pointing out Buffer is a specialization of NDBuffer: #1826 (comment)

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed. But it seems like in numpy dtype parlance, a byte is an 8-bit integer? I can see how those could be signed, but I'm not aware of anything in zarr-python that depends on a particular signing. Assuming nothing breaks if we switch to the signing that works for imagecodecs, then we should consider that change.

@madsbk any insight here?

@madsbk
Copy link
Contributor

madsbk commented Jan 20, 2025

Was this actually intentional, or was it more that no-one noticed that the NumPy "b" dtype is a signed byte?

This is an oversight, the buffers shouldn't care about signedness.
I am fine moving to "B" exclusively, but we could also accept both "B" and "b"?

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

To allow "B" and "b", we would need some way of passing that value for the dtype used in np.frombuffer. The simplest solution would be to give Buffer a dtype attribute, at which point we are halfway to admitting that the Buffer class is really just NDBuffer with 1 dimension and dtype constrained to "B" or "b".

@madsbk
Copy link
Contributor

madsbk commented Jan 20, 2025

Also I just noticed that concatenating a "B" and "b" array in numpy returns int16. Thus, if we want to support both, we would need to change the CPU and GPU implementation of Buffer.__add__():

    def __add__(self, other: core.Buffer) -> Self:
        """Concatenate two buffers"""

        other_array = other.as_array_like()
        assert other_array.dtype == np.dtype("b")
        return self.__class__(
            np.concatenate((np.asanyarray(self._data), np.asanyarray(other_array)))
        )

@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 21, 2025

This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed.

All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

Also I just noticed that concatenating a "B" and "b" array in numpy returns int16.

Signed bytes are -128 to 127, unsigned bytes are 0 to 255. The only way to represent that whole range -128 to 255 is casting upward to 16 bits (which is -32768 to 32767).

I don't know that it makes sense to allow both, but if you want to, then I would suggest allowing both as input, but changing internals to always be unsigned (using a view should be copy-free).

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025
This makes compressors consistent with v2, and seems more correct than
signed bytes.

Fixes zarr-developers#2735
@QuLogic QuLogic linked a pull request Jan 21, 2025 that will close this issue
6 tasks
QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025
This makes compressors consistent with v2, and seems more correct than
signed bytes.

Fixes zarr-developers#2735
@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

But as far as I understood it, the buffer API shouldn't store bytes with integer semantics -- Buffer contains stuff like JSON documents, so the contents of Buffer are supposed to just piles of bits, with no meaning attached. We happen to be using numpy arrays to back Buffer instances, but the fact that those arrays have a dtype is an implementation detail that should not affect the public API of Buffer.

do I have this right @madsbk ?

@madsbk
Copy link
Contributor

madsbk commented Jan 21, 2025

Yes, a Buffer just a contiguous blob of memory.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

it sounds like imagecodecs compressors are assuming a numpy array. but Buffer is not a numpy array (despite using one under the hood). So while I think the fix proposed here makes sense, there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

@madsbk
Copy link
Contributor

madsbk commented Jan 21, 2025

Maybe just reinterpret the data like?

buf.view(dtype="uint8")

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

yes, I think the expectation for consumers of Buffer should be that they are responsible for imposing dtype semantics on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants