Re-enable `VLenBytes` round-trip `None` test #736

jakirkham · 2025-04-09T22:42:14Z

Re-enable the VLenBytes test_encode_none test, which checks how a None value is handled in the VLenBytes codec. The test was disabled in PR ( #690 ) because running it caused a crash (more details in the issues below).

This appears to be due to a bug in the VLenBytes codec that traces pretty far back. Namely, VLenBytes.encode normalizes some values (like None) by turning them into b"". However, it was not keeping track of the normalized values. As a result, later steps assumed they had a bytes object and handed the original value to low-level C Python APIs that either did not check the type, causing undefined behavior (hence the crash), or checked it with an assert, which gives a low-level backtrace.

In either case this is not behavior we would want users to encounter. At a minimum, we would want to raise a Python error that could be caught and handled. In this specific case, though, the normalization behavior is already defined in tests and in the other VLen* codecs. Only VLenBytes had the issue.
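A minimal pure-Python sketch of the failure mode, not the actual Cython code. The names `normalize` and `encode_buggy` are hypothetical, and the 4-byte little-endian count and length prefixes are an assumption about the layout. In plain Python the second pass raises a catchable `TypeError`; the point is that the C-level equivalent has no such safety net.

```python
import struct


def normalize(value):
    # Hypothetical helper mirroring the normalization described above: None -> b"".
    if value is None:
        return b""
    if not isinstance(value, bytes):
        raise TypeError("expected byte string, found %r" % (value,))
    return value


def encode_buggy(values):
    # Pass 1: lengths are computed from normalized values, which are then discarded.
    lengths = [len(normalize(v)) for v in values]
    out = bytearray(struct.pack("<I", len(values)))
    # Pass 2: the ORIGINAL values are read again, so None reaches code expecting bytes.
    for v, n in zip(values, lengths):
        out += struct.pack("<I", n)
        out += v  # raises TypeError for None in Python
    return bytes(out)


try:
    encode_buggy([b"foo", None])
except TypeError as exc:
    error = exc  # catchable here, unlike a crash inside an unchecked C API call
```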

This PR fixes the issue by storing all the normalized values for reuse. Additionally, it moves to using Cython types with these objects, which improves their overall handling and type checking. It also makes the code a bit more similar to the equivalent Python (while keeping the underlying fast C performance).

Fixes #683
Fixes #735

TODO:

Unit tests and/or doctests in docstrings
Tests pass locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Docs build locally
GitHub Actions CI passes
Test coverage to 100% (Codecov passes)

codecov · 2025-04-09T22:49:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.96%. Comparing base (85eeed3) to head (8cc95fe).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   99.96%   99.96%   -0.01%     
==========================================
  Files          63       63              
  Lines        2738     2736       -2     
==========================================
- Hits         2737     2735       -2     
  Misses          1        1

Files with missing lines	Coverage Δ
numcodecs/tests/test_vlen_bytes.py	`100.00% <ø> (ø)`

During `VLenBytes.encode`, some values are normalized. However, the normalized values were not actually kept. So when it came time to get properties from these items, the code assumed they were of the right type while still grabbing the original unnormalized values. Fix this by storing the normalized values for later processing.

Drop our own type checking in favor of assigning to Cython typed-variables. This should do the same kind of type checking, but may be more robust than what we are doing.

Simplify normalization code using the ternary expression.
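As a rough illustration in plain Python (the helper name `normalize` is hypothetical and this is not the Cython source), the ternary form of the normalization could look like this. The type check is deliberately left out, since per the notes above it is delegated to the assignment into a typed variable.

```python
def normalize(value):
    # None is treated as an empty byte string; anything else passes through unchanged.
    return b"" if value is None else value


normalized = [normalize(v) for v in [b"foo", None, b""]]
```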

Instead of making so many calls to C Python APIs, rely on the Cython types of variables and Cython's tight binding to pick the right function to apply in each instance. This makes the code more readable to the typical Python developer. It also makes it less likely that an issue like the one encountered in this bug report would sneak in.

jakirkham

Have added some notes below to explain the cause of the bug and what was done here to fix it

jakirkham · 2025-04-10T00:21:19Z

numcodecs/vlen.pyx

@@ -268,7 +271,7 @@ class VLenBytes(Codec):
            l = lengths[i]
            store_le32(<uint8_t*>data, l)
            data += HEADER_LENGTH
-            encv = PyBytes_AS_STRING(values[i])


Looks like this line was the actual one that failed (a bit further in than the aforementioned error would suggest: #683 (comment))

The issue is that values[i] does not contain the normalized value. So if None shows up here, it errors. The error is also a nasty one: a segfault (due to attempting to access a field that does not exist) or, if Python's debug mode is used, an assertion error

jakirkham · 2025-04-10T00:21:33Z

numcodecs/vlen.pyx

@@ -240,6 +241,7 @@ class VLenBytes(Codec):
        n_items = values.shape[0]

        # setup intermediates
+        normed_values = np.empty(n_items, dtype=object)


To fix the error mentioned above, we need to create a place to store the normalized values

So we allocate an empty array to start. It uses the object dtype as we are storing Python objects in it

jakirkham · 2025-04-10T00:21:45Z

numcodecs/vlen.pyx

@@ -250,6 +252,7 @@ class VLenBytes(Codec):
                b = b''
            elif not PyBytes_Check(b):
                raise TypeError('expected byte string, found %r' % b)
+            normed_values[i] = b


Now when we loop through and normalize an input, we can store each normalized value for processing later

jakirkham · 2025-04-10T00:23:01Z

numcodecs/vlen.pyx

@@ -268,7 +271,7 @@ class VLenBytes(Codec):
            l = lengths[i]
            store_le32(<uint8_t*>data, l)
            data += HEADER_LENGTH
-            encv = PyBytes_AS_STRING(values[i])
+            encv = PyBytes_AS_STRING(normed_values[i])


Thus we can now access the normalized value here and know it is of the expected type

That said, this code can be improved further by introducing Cython typed variables and using them consistently to get better and more consistent type checking throughout
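The three steps annotated above (allocate storage, keep each normalized value, read it back in the write loop) can be sketched in plain Python. This is only an illustration under assumptions: a list stands in for the object array, and the layout (a 4-byte little-endian item count followed by a 4-byte length prefix per item) and the `decode` helper are assumed for the sake of a round trip, not taken from the diff.

```python
import struct

HEADER_LENGTH = 4  # assumed: counts and lengths are 4-byte little-endian integers


def encode(values):
    n_items = len(values)
    # Allocate a place to keep the normalized values (the PR uses an object array).
    normed_values = [None] * n_items
    lengths = [0] * n_items
    for i, v in enumerate(values):
        b = b"" if v is None else v
        if not isinstance(b, bytes):
            raise TypeError("expected byte string, found %r" % (b,))
        normed_values[i] = b
        lengths[i] = len(b)
    # Write loop: read the normalized values back, never the raw input.
    out = bytearray(struct.pack("<I", n_items))
    for i in range(n_items):
        out += struct.pack("<I", lengths[i])
        out += normed_values[i]
    return bytes(out)


def decode(buf):
    # Hypothetical inverse of encode, used only to check the round trip.
    (n_items,) = struct.unpack_from("<I", buf, 0)
    offset = HEADER_LENGTH
    items = []
    for _ in range(n_items):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += HEADER_LENGTH
        items.append(bytes(buf[offset:offset + length]))
        offset += length
    return items
```

With this sketch, `decode(encode([b"foo", None]))` gives back `[b"foo", b""]`, which is the kind of round trip the re-enabled test exercises.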

When encoding values, there is no need to modify the encoded data when writing it out. So mark the pointers used to reference the encoded data as `const`. While there is nothing happening here that should cause issues, this will help further safeguard developers making changes here and clarify the intent.

These variables are nearly identical and only the total length is used. As `data` is used elsewhere, change `data_length` to capture the value of `total_length` and just use `data_length` throughout.
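A hedged sketch of the single-size-variable idea in plain Python. The function name `encode_preallocated` and the 4-byte little-endian layout are assumptions, and the input is taken to be already normalized. One running `data_length` is computed up front, used to size the buffer, and checked against the final write offset.

```python
import struct

HEADER_LENGTH = 4  # assumed 4-byte little-endian item count and per-item length prefix


def encode_preallocated(normed_values):
    # One size variable: the leading item count plus, per item, a prefix and its payload.
    data_length = HEADER_LENGTH
    for b in normed_values:
        data_length += HEADER_LENGTH + len(b)

    out = bytearray(data_length)
    struct.pack_into("<I", out, 0, len(normed_values))
    offset = HEADER_LENGTH
    for b in normed_values:
        struct.pack_into("<I", out, offset, len(b))
        offset += HEADER_LENGTH
        out[offset:offset + len(b)] = b
        offset += len(b)
    assert offset == data_length  # the buffer is filled exactly
    return bytes(out)
```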

jakirkham · 2025-04-11T04:40:35Z

Antonio confirmed the bug fix: #735 (comment)

Also have confirmed the fixed test on CI

Given that, will go ahead and put this fix in

jakirkham force-pushed the fix_vlen_enc branch from 1fb7f87 to 494cc34 on April 9, 2025 22:43

Re-enable VLenBytes round-trip None test

07571e5

jakirkham force-pushed the fix_vlen_enc branch from 494cc34 to 07571e5 on April 9, 2025 22:45

jakirkham added 6 commits April 9, 2025 16:52

Use double quotes in encoding normalization steps

e7b6b4b

Use ensure_contiguous_memoryview more in vlen

9bcd15c

Assign normalized values to encode to typed vars

8e24cce

Use ternary expression to normalize values

d6085a8

jakirkham commented Apr 10, 2025

jakirkham added 2 commits April 9, 2025 17:28

Inline pointer usage in memcpy calls

739dd88

jakirkham mentioned this pull request Apr 10, 2025

Segfault in numcodecs/tests/test_vlen_bytes.py::test_encode_none #735

Closed

jakirkham marked this pull request as ready for review April 10, 2025 00:52

jakirkham added 3 commits April 9, 2025 20:28

Consolidate total_length into data_length

a37b8f9

Use upper case L to avoid confusion

644f147

Merge branch 'main' into fix_vlen_enc

8cc95fe

jakirkham merged commit 3438e16 into zarr-developers:main Apr 11, 2025
30 checks passed

jakirkham deleted the fix_vlen_enc branch April 11, 2025 04:41
