Fix UTF generation for numpy in property-based tests #2801

Merged: 3 commits, Feb 5, 2025

moradology commented Feb 5, 2025

This PR fixes the numpy_arrays() hypothesis strategy for Unicode string dtypes (e.g. <U6) by:

  1. Enforcing the correct string length via dtype.itemsize // 4 (instead of parsing str(dtype)[2:]).
  2. Excluding surrogate code points (U+D800–U+DFFF), which cannot be encoded as UTF-8, to prevent encoding errors.
  3. Introducing safe_unicode_for_dtype(dtype), which:
  • Computes the correct string length limit for NumPy's UTF-32-based storage (4 bytes per character).
  • Generates only text that can be encoded as UTF-8 by filtering out surrogate characters.
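
As a rough illustration of the two ideas above (length from dtype.itemsize // 4, and no surrogates), here is a minimal sketch. It is not the code merged in this PR: the helper name unicode_text_for and the check function are hypothetical, and it assumes numpy and hypothesis are installed.

```python
import numpy as np
from hypothesis import given, settings, strategies as st


def unicode_text_for(dtype: np.dtype) -> st.SearchStrategy:
    """Hypothetical helper: text that fits `dtype` and can be encoded as UTF-8."""
    # NumPy stores each character of a U dtype in 4 bytes (UTF-32),
    # so itemsize // 4 is the capacity in characters.
    max_len = dtype.itemsize // 4
    # "Cs" is the Unicode general category for surrogates (U+D800-U+DFFF).
    alphabet = st.characters(blacklist_categories=("Cs",))
    return st.text(alphabet=alphabet, max_size=max_len)


@settings(max_examples=50, deadline=None)
@given(unicode_text_for(np.dtype("<U6")))
def check_fits_and_encodes(s: str) -> None:
    # The generated text never exceeds the dtype's character capacity.
    assert len(s) <= 6
    # Encoding would raise UnicodeEncodeError if a lone surrogate slipped in.
    s.encode("utf-8")
```

Calling check_fits_and_encodes() runs the property; using itemsize avoids depending on how the dtype happens to be printed.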

Fixes #2732

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Why modify the default string generation from hypothesis?

  • NumPy's '<U#' dtypes use UTF-32 storage, which can hold any Unicode code point, including surrogates (U+D800–U+DFFF).

  • NumPy does not validate the code points it stores, so a lone surrogate (which is only meaningful as half of a UTF-16 surrogate pair) is accepted.

  • This means NumPy can store invalid Unicode sequences without immediate errors:

    import numpy as np
    arr = np.array(["\ud910"], dtype="<U6")  # Allowed by NumPy
    print(arr)  # ['\ud910']

  • Surrogate characters are not permitted in UTF-8.

  • If we store surrogates in NumPy and later export the text as UTF-8, it results in a UnicodeEncodeError:

    arr[0].encode("utf-8")  # UnicodeEncodeError
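
The same failure can be reproduced without NumPy, and surrogates can be detected by their Unicode category. The sketch below uses only the standard library; has_surrogate is a hypothetical helper name, not something from this PR.

```python
import unicodedata


def has_surrogate(text: str) -> bool:
    """Return True if any character is a surrogate (general category "Cs")."""
    return any(unicodedata.category(ch) == "Cs" for ch in text)


lone = "\ud910"  # a lone high surrogate, legal inside a Python str

try:
    lone.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    # Strict UTF-8 encoding rejects surrogate code points.
    encoded_ok = False
```

A check like has_surrogate is one way to confirm that generated test data will survive a UTF-8 export.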


dcherian commented Feb 5, 2025

Thanks @moradology !

d-v-b merged commit a52048d into zarr-developers:main on Feb 5, 2025
30 checks passed
moradology deleted the fix/nparray-property-gen branch on February 5, 2025 17:33
Development

Successfully merging this pull request may close these issues.

test_vindex can sometimes fail with UnicodeDecodeError