Skip to content

Re-enable VLenBytes round-trip None test #736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 11, 2025

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Apr 9, 2025

Re-enable the VLenBytes's test_encode_none, which tests how a None value is handled in the VLenBytes codec. This was previously disabled in PR ( #690 ). This is due to a crash occurring when running this test (more details in the issues below).

This appears to be due to a bug in the VLenBytes codec that traces pretty far back. Namely VLenByte.encode does some normalizing of values (like None) where it turns them into b"". However it was not keeping track of this normalization. As a result later steps assumed they had a bytes object and would hand it to low-level C Python APIs that either did not check the type causing undefined-behavior (hence crashing) or checked with an assert, which gives a low-level backtrace.

In either case this is not behavior we would want users to encounter. At a minimum we would want to raise a Python error that could be caught and handled. Though in this specific case of normalization the behavior has been defined in tests and other VLen* codecs. Only VLenBytes had the issue.

This PR fixes the issue by storing all the normalized values for reuse. Additionally this moves to using Cython types with these objects. This improves the overall handling and type checking of these objects. Also making the code a bit more similar to equivalent Python (while keeping the underlying fast C performance).

Fixes #683
Fixes #735

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

Copy link

codecov bot commented Apr 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.96%. Comparing base (85eeed3) to head (8cc95fe).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   99.96%   99.96%   -0.01%     
==========================================
  Files          63       63              
  Lines        2738     2736       -2     
==========================================
- Hits         2737     2735       -2     
  Misses          1        1              
Files with missing lines Coverage Δ
numcodecs/tests/test_vlen_bytes.py 100.00% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

During `VLenBytes.encode`, it normalizes some values. However it was not
actually keeping the normalized values. So when it comes time to get
properties from these items, it assumes they are of the right type.
However it is still grabbing the original unnormalized values. So fix
this by storing the normalized values for process.
Drop our own type checking in favor of assigning to Cython
typed-variables. This should do the same kind of type checking, but may
be more robust than what we are doing.
Simplify normalization code using the ternary expression.
Instead of making so many calls to C Python APIs, rely on the Cython
types of variables and Cython's tight binding to pick the right function
to apply in each instance.

This makes the code more readable to the typical Python developer. Also
it makes it less likely some issue would sneak in like the one
encountered in this bug report.
Copy link
Member Author

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added some notes below to explain the cause of the bug and what was done here to fix it

@@ -268,7 +271,7 @@ class VLenBytes(Codec):
l = lengths[i]
store_le32(<uint8_t*>data, l)
data += HEADER_LENGTH
encv = PyBytes_AS_STRING(values[i])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this line was the actual one that failed ( a bit further in than the aforementioned error would suggest: #683 (comment) )

The issue is that value[i] does not contain the normalized value. So if None shows up here, it errors. Also the error is a nasty segfault (due to attempting to access a field that does not exist) or an assertion error if Python's debug mode is used

@@ -240,6 +241,7 @@ class VLenBytes(Codec):
n_items = values.shape[0]

# setup intermediates
normed_values = np.empty(n_items, dtype=object)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the error mentioned above, we need to create a place to store the normalized values

So we allocate an empty array to start. object type as we are storing Python objects in it

@@ -250,6 +252,7 @@ class VLenBytes(Codec):
b = b''
elif not PyBytes_Check(b):
raise TypeError('expected byte string, found %r' % b)
normed_values[i] = b
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now when we loop through and normalize an input, we can store each normalized value for processing later

@@ -268,7 +271,7 @@ class VLenBytes(Codec):
l = lengths[i]
store_le32(<uint8_t*>data, l)
data += HEADER_LENGTH
encv = PyBytes_AS_STRING(values[i])
encv = PyBytes_AS_STRING(normed_values[i])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thus we can now access the normalized value here and know it is of the expected type

That said, this code can be improved further by introducing Cython typed variables and using them consistently to get better and more consistent type checking throughout

When encoding values, there is no need to modify the encoded data when
writing it out. So mark the pointers used to reference the encoded data
as `const`. While there is nothing happening here that should cause
issues, this will help further safeguard developers making changes here
and clarify the intent.
@jakirkham jakirkham marked this pull request as ready for review April 10, 2025 00:52
These variables are nearly identical and only the total length is used.
As `data` is used elsewhere, change `data_length` to capture the value
of `total_length` and just use `data_length` throughout.
@jakirkham
Copy link
Member Author

Antonio confirmed the bug fix: #735 (comment)

Also have confirmed the fixed test on CI

Given that, will go ahead and put this fix in

@jakirkham jakirkham merged commit 3438e16 into zarr-developers:main Apr 11, 2025
30 checks passed
@jakirkham jakirkham deleted the fix_vlen_enc branch April 11, 2025 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault in numcodecs/tests/test_vlen_bytes.py::test_encode_none macOS CI failing
1 participant