Re-enable VLenBytes round-trip None test #736
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main     #736    +/-  ##
========================================
- Coverage   99.96%   99.96%   -0.01%
========================================
  Files          63       63
  Lines        2738     2736       -2
========================================
- Hits         2737     2735       -2
  Misses          1        1
```
During `VLenBytes.encode`, some values are normalized. However, the normalized values were not actually being kept. So when it comes time to get properties from these items, the code assumes they are of the right type, yet it is still grabbing the original unnormalized values. Fix this by storing the normalized values for processing.
Drop our own type checking in favor of assigning to Cython typed-variables. This should do the same kind of type checking, but may be more robust than what we are doing.
Simplify normalization code using the ternary expression.
Instead of making so many calls to C Python APIs, rely on the Cython types of variables and Cython's tight binding to pick the right function to apply in each instance. This makes the code more readable to the typical Python developer. Also it makes it less likely some issue would sneak in like the one encountered in this bug report.
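The normalization these commits describe can be sketched in plain Python (a minimal, hypothetical `normalize` helper for illustration; the real code is Cython and runs element-wise inside `encode`):

```python
def normalize(b):
    # ternary form of the normalization: None becomes the empty byte string
    b = b'' if b is None else b
    # anything else that is not bytes is rejected with a Python-level error,
    # rather than being handed to low-level C APIs unchecked
    if not isinstance(b, bytes):
        raise TypeError('expected byte string, found %r' % b)
    return b

print(normalize(None))    # b''
print(normalize(b'abc'))  # b'abc'
```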
Have added some notes below to explain the cause of the bug and what was done here to fix it
```
@@ -268,7 +271,7 @@ class VLenBytes(Codec):
                l = lengths[i]
                store_le32(<uint8_t*>data, l)
                data += HEADER_LENGTH
                encv = PyBytes_AS_STRING(values[i])
```
Looks like this line was the actual one that failed (a bit further in than the aforementioned error would suggest: #683 (comment)). The issue is that `values[i]` does not contain the normalized value. So if `None` shows up here, it errors. Also the error is a nasty segfault (due to attempting to access a field that does not exist) or an assertion error if Python's debug mode is used.
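The root cause can be illustrated with a small Python sketch (hypothetical data; the real loop is in Cython): normalizing into a loop-local variable leaves the source array unchanged, so a later pass over `values` still sees the raw `None`:

```python
import numpy as np

values = np.array([b'a', None], dtype=object)

# first pass: normalize, but only into a loop-local variable
for i in range(values.shape[0]):
    b = values[i]
    if b is None:
        b = b''  # normalized copy; values[i] itself is unchanged

# a second pass (where the C-level string access happens) still sees None
print(values[1])  # None
```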
```
@@ -240,6 +241,7 @@ class VLenBytes(Codec):
        n_items = values.shape[0]

        # setup intermediates
        normed_values = np.empty(n_items, dtype=object)
```
To fix the error mentioned above, we need to create a place to store the normalized values. So we allocate an empty array to start, with the `object` dtype as we are storing Python objects in it.
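For illustration, here is what that allocation gives us (sizes here are hypothetical): `np.empty` with `dtype=object` creates an array of Python object references, initially `None`, whose slots can later hold the normalized `bytes` objects:

```python
import numpy as np

n_items = 3
normed_values = np.empty(n_items, dtype=object)

# freshly allocated object arrays hold None references, not garbage memory
print(normed_values)  # [None None None]

# each slot can then store an arbitrary Python object, e.g. bytes
for i, b in enumerate([b'a', b'', b'xyz']):
    normed_values[i] = b
print(normed_values.dtype)  # object
```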
```
@@ -250,6 +252,7 @@ class VLenBytes(Codec):
            b = b''
        elif not PyBytes_Check(b):
            raise TypeError('expected byte string, found %r' % b)
        normed_values[i] = b
```
Now when we loop through and normalize an input, we can store each normalized value for processing later.
numcodecs/vlen.pyx
```diff
@@ -268,7 +271,7 @@ class VLenBytes(Codec):
                l = lengths[i]
                store_le32(<uint8_t*>data, l)
                data += HEADER_LENGTH
-               encv = PyBytes_AS_STRING(values[i])
+               encv = PyBytes_AS_STRING(normed_values[i])
```
Thus we can now access the normalized value here and know it is of the expected type
That said, this code can be improved further by introducing Cython typed variables and using them consistently to get better and more consistent type checking throughout
When encoding values, there is no need to modify the encoded data when writing it out. So mark the pointers used to reference the encoded data as `const`. While there is nothing happening here that should cause issues, this will help further safeguard developers making changes here and clarify the intent.
These variables are nearly identical and only the total length is used. As `data` is used elsewhere, change `data_length` to capture the value of `total_length` and just use `data_length` throughout.
Antonio confirmed the bug fix: #735 (comment). Also have confirmed the fixed test on CI. Given that, will go ahead and put this fix in.
Re-enable `VLenBytes`'s `test_encode_none`, which tests how a `None` value is handled in the `VLenBytes` codec. This was previously disabled in PR ( #690 ) due to a crash occurring when running this test (more details in the issues below).

This appears to be due to a bug in the `VLenBytes` codec that traces pretty far back. Namely, `VLenBytes.encode` does some normalizing of values (like `None`), where it turns them into `b""`. However it was not keeping track of this normalization. As a result, later steps assumed they had a `bytes` object and would hand it to low-level CPython APIs that either did not check the type, causing undefined behavior (hence crashing), or checked with an `assert`, which gives a low-level backtrace.

In either case this is not behavior we would want users to encounter. At a minimum we would want to raise a Python error that could be caught and handled. Though in this specific case of normalization, the behavior has already been defined in tests and in the other `VLen*` codecs; only `VLenBytes` had the issue.

This PR fixes the issue by storing all the normalized values for reuse. Additionally, it moves to using Cython types with these objects. This improves the overall handling and type checking of these objects, while also making the code a bit more similar to equivalent Python (and keeping the underlying fast C performance).
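To make the intended round-trip behavior concrete, here is a pure-Python sketch of a length-prefixed encoding in the spirit of `VLenBytes`. The wire layout and helper names are assumptions for illustration, not the codec's exact format; the point is that `None` is normalized to `b''` on encode, so it round-trips as the empty byte string rather than crashing:

```python
import struct

HEADER_LENGTH = 4  # bytes per little-endian uint32 prefix

def encode_vlen_bytes(values):
    # normalize first, keeping the normalized values for later use
    normed = [b'' if v is None else v for v in values]
    for v in normed:
        if not isinstance(v, bytes):
            raise TypeError('expected byte string, found %r' % v)
    # assumed layout: 4-byte item count, then a 4-byte length per item
    out = struct.pack('<I', len(normed))
    for v in normed:
        out += struct.pack('<I', len(v)) + v
    return out

def decode_vlen_bytes(buf):
    n, = struct.unpack_from('<I', buf, 0)
    offset = HEADER_LENGTH
    items = []
    for _ in range(n):
        length, = struct.unpack_from('<I', buf, offset)
        offset += HEADER_LENGTH
        items.append(bytes(buf[offset:offset + length]))
        offset += length
    return items

# None survives the round trip as the empty byte string
print(decode_vlen_bytes(encode_vlen_bytes([b'a', None])))  # [b'a', b'']
```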
Fixes #683
Fixes #735
TODO: