Fix UTF generation for numpy in property-based tests #2801
Merged
This PR fixes the `numpy_arrays()` hypothesis strategy for Unicode string dtypes (e.g. `<U#>`) via `safe_unicode_for_dtype(dtype)`.

Fixes #2732
TODO:
docs/user-guide/*.rst
changes/
Why modify the default string generation from hypothesis?
NumPy's `dtype='<U#>'` stores each character as a fixed-width 4-byte (UTF-32/UCS-4) unit, which can hold any Unicode code point, including surrogates (U+D800–U+DFFF). Surrogates are only meaningful as halves of a UTF-16 surrogate pair, but NumPy does not validate the stored values, so a lone surrogate is accepted as-is.
This means NumPy can store invalid Unicode sequences without immediate errors:
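A minimal sketch of this behaviour, assuming a reasonably recent NumPy; the variable names are illustrative and not taken from the PR:

```python
import numpy as np

# A lone high surrogate: a legal Python str, but not a valid Unicode scalar value.
lone_surrogate = "\ud800"

# Storing it in a fixed-width Unicode array raises no error.
arr = np.array([lone_surrogate], dtype="<U1")

# Each character occupies 4 bytes, so "<U1" has an item size of 4.
item_size = arr.itemsize

# The surrogate code point is read back unchanged.
stored_code_point = ord(str(arr[0]))
print(item_size, hex(stored_code_point))
```

Nothing fails at construction time; the problem only surfaces later, when the text is encoded.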
Surrogate characters are not permitted in UTF-8.
If we store surrogates in NumPy and later export the text as UTF-8, the result is a `UnicodeEncodeError`:
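The failure comes from the string itself, so it can be reproduced in plain Python. The sketch below is illustrative rather than code from the PR; a string read back from a `<U#>` array containing a surrogate would fail in the same way:

```python
# A string containing a lone surrogate, as it might come out of a "<U#>" array.
text = "abc\ud800"

try:
    text.encode("utf-8")
    error_name = None
except UnicodeEncodeError as exc:
    # The strict UTF-8 codec rejects surrogate code points.
    error_name = type(exc).__name__

print(error_name)
```

Generating test strings without surrogates avoids this class of failure at the source.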