Fix UTF generation for numpy in property-based tests #2801

moradology · 2025-02-05T16:56:55Z

This PR fixes the numpy_arrays() hypothesis strategy for Unicode string dtypes (e.g. <U#) by:

Ensuring correct string length enforcement using dtype.itemsize // 4 (instead of parsing str(dtype)[2:]).
Removing invalid surrogate Unicode characters (U+D800–U+DFFF) to prevent encoding errors.
Introducing safe_unicode_for_dtype(dtype), which:

Fixes #2732
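As an illustration of the length rule in the list above, here is a minimal sketch of deriving the character capacity of a fixed-width Unicode dtype from its byte size. The helper name `max_chars` is hypothetical and is not taken from this PR; it only assumes that NumPy stores each character of a `U` dtype in 4 bytes.

```python
import numpy as np


def max_chars(dtype: np.dtype) -> int:
    """Hypothetical helper: how many characters a fixed-width 'U' dtype can hold."""
    if dtype.kind != "U":
        raise ValueError(f"expected a Unicode string dtype, got {dtype!r}")
    # Each character occupies 4 bytes, so the capacity is the byte size
    # divided by 4, independent of how the dtype is spelled or its byte order.
    return dtype.itemsize // 4


for spec in ("<U1", "<U6", ">U10"):
    dt = np.dtype(spec)
    print(spec, dt.itemsize, max_chars(dt))
```

Working from `itemsize` avoids depending on the exact text that `str(dtype)` produces.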

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Why modify the default string generation from hypothesis?

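NumPy's fixed-width Unicode dtypes (<U#) store each character as a 4-byte UTF-32 code unit, so an array can hold any code point, including surrogates (U+D800–U+DFFF).
NumPy does not validate these values. Surrogates are reserved for UTF-16 surrogate pairs and are not valid on their own in well-formed UTF-8 or UTF-32, but NumPy accepts a lone surrogate anyway.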

This means NumPy can store invalid Unicode sequences without immediate errors:

```
import numpy as np
arr = np.array(["\ud910"], dtype="<U6")  # Allowed by NumPy
print(arr)  # ['\ud910']
```

Surrogate characters are not permitted in UTF-8.
If we store surrogates in NumPy and later export the text as UTF-8, it results in a UnicodeEncodeError:
```
arr[0].encode("utf-8")  # UnicodeEncodeError
```
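To make the link between surrogates and the encoding failure concrete, the following standard-library sketch checks a few sample strings. The helper names `has_surrogate` and `utf8_safe` are illustrative assumptions, not functions from this PR or from hypothesis.

```python
SURROGATES = range(0xD800, 0xE000)  # U+D800 through U+DFFF inclusive


def has_surrogate(text: str) -> bool:
    """Return True if any code point in text is a surrogate."""
    return any(ord(ch) in SURROGATES for ch in text)


def utf8_safe(text: str) -> bool:
    """Return True if text can be encoded as UTF-8 without error."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


samples = ["abc", "caf\u00e9", "\U0001F600", "\ud910", "a\udfffb"]
for s in samples:
    # A Python str encodes to UTF-8 exactly when it contains no surrogates.
    assert utf8_safe(s) == (not has_surrogate(s))
```

Excluding this code point range when generating test strings is therefore enough to keep the generated data encodable as UTF-8.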

dcherian · 2025-02-05T17:09:45Z

Fix UTF generation for numpy in property-based tests

632a252

github-actions bot added the needs release notes label Feb 5, 2025

Add changelog entry

ee52b02

github-actions bot removed the needs release notes label Feb 5, 2025

Merge branch 'main' into fix/nparray-property-gen

61987a0

dcherian approved these changes Feb 5, 2025

d-v-b approved these changes Feb 5, 2025

d-v-b merged commit a52048d into zarr-developers:main Feb 5, 2025
30 checks passed

moradology deleted the fix/nparray-property-gen branch February 5, 2025 17:33