Fix logical conflict in Variant write API PR#1
Merged
friendlymatthew merged 8 commits intoJun 25, 2025
Conversation
# Which issue does this PR close? Part of * apache#6736 # Rationale for this change The variant spec makes extensive use of 4-byte offsets. On a 32-bit system, this can lead to integer overflow panics when doing offset arithmetic on malicious or malformed variant data, because `usize` is 32 bits. That is bad. # What changes are included in this PR? Use checked arithmetic in code that operates on offsets extracted from (untrusted) variant byte buffers. Of particular interest, we define and use a new `slice_from_slice_at_offset` helper, which safely adds an offset to a range. # Are there any user-facing changes? No.
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - closes apache#7746 # Rationale for this change The parquet-variant tests fail when run as part of `verify-release-candidate.sh` due to the `parquet-testing` directory being checked out in a different location # What changes are included in this PR? Update the test to look at the `PARQUET_TEST_DATA` environment variable as well # How are these changes tested? I tested this manually: ```shell # note this is a different name than the submodule: git clone https://github.com/apache/parquet-testing.git parquet-testing-data export PARQUET_TEST_DATA=$PWD/parquet-testing-data/data # checkout my fork git clone https://github.com/alamb/arrow-rs.git cd arrow-rs # This fails on main git checkout main cargo test -p parquet-variant # PASSES on branch with fix git checkout alamb/fix_variant_tests cargo test -p parquet-variant ``` # Are there any user-facing changes? No this is a test only change
…inst empty strings (apache#7767) # Which issue does this PR close? This avoids a call to memcmp for the relatively common case of comparing against an empty string. Closes apache#7766. # Rationale for this change This speeds up some of the queries in the `arrow_reader_clickbench` benchmark, some of them significantly. The biggest benefits are for Q10, Q11 and Q12, I did not observe any slowdowns on any other query. Benchmark results are for an uncompressed parquet file. ``` arrow_reader_clickbench/sync/Q10 time: [8.3934 ms 8.4411 ms 8.5212 ms] change: [-36.714% -36.040% -35.243%] (p = 0.00 < 0.05) Performance has improved. arrow_reader_clickbench/sync/Q11 time: [10.180 ms 10.315 ms 10.476 ms] change: [-33.571% -32.145% -30.661%] (p = 0.00 < 0.05) Performance has improved. arrow_reader_clickbench/sync/Q12 time: [17.262 ms 17.419 ms 17.616 ms] change: [-21.201% -19.289% -17.409%] (p = 0.00 < 0.05) Performance has improved. ``` # Are there any user-facing changes? No
# Which issue does this PR close? Closes [Variant: Write Variant Values as JSON](apache#7426) apache#7426 Part of [[EPIC] [Parquet] Implement Variant type support in Parquet](apache#6736) apache#6736 # Rationale for this change This is an initial version, serving as a simple interface between the Variant implementation and the Serde JSON library. A huge thank you to @PinkCrow007, @mprammer, @alamb, the rest of the CMU variant team, and everyone else we've interacted with who has helped me get started with contributing to this project. This is my first Arrow-related PR, and I thank you all for your insight and support. # What changes are included in this PR? This PR implements a comprehensive JSON conversion API for Variant types with three main functions (`variant_to_json`, `variant_to_json_string`, and `variant_to_json_value`) that convert different Variant types to JSON format, including primitives, decimals, dates, timestamps, and binary data with proper escaping and base64 encoding. The implementation adds missing methods to `VariantObject` and `VariantArray` for field/element access, includes two new dependencies (`serde_json` and `base64`), and provides comprehensive test coverage with unit, integration, and documentation test suites. Open to input for improving any part of this implementation. # Are there any user-facing changes? The new API's added in parquet-variant will be user-facing. --------- Co-authored-by: Andrew Lamb <[email protected]>
…e#7719) # Which issue does this PR close? As suggested by @Dandandan in apache#7650 (comment): > We probably should set this as constant somewhere and use it # Rationale for this change Using a symbolic constant in the code rather than a hard coded constant makes it easier to: 1. Understand what the value means 2. Link / attach documentation to the constant to provide context # What changes are included in this PR? 1. Introduce `MAX_INLINE_VIEW_LEN` constant for string/byte views 2. Update code to use that instead of `12` # Are there any user-facing changes? A new constant
c82c0c7
into
pydantic:friendlymatthew/duplicate
25 of 30 checks passed
adriangb
pushed a commit
that referenced
this pull request
Jan 31, 2026
# Which issue does this PR close? small optimization # Rationale for this change key insight is the byte clone is cheap just a ref count compare to vec clone is a alloc + memcopy. before ``` let mut result = Vec::new(); // alloc #1 result.extend_from_slice(prefix); result.extend_from_slice(suffix); let data = Bytes::from(result.clone()); // alloc #2 + memcpy item.set_from_bytes(data); self.previous_value = result; // keep Vec ``` after ``` let mut result = Vec::with_capacity(prefix_len + suffix.len()); // alloc #1 result.extend_from_slice(&self.previous_value[..prefix_len]); result.extend_from_slice(suffix); let data = Bytes::from(result); // no alloc, takes Vec buffer item.set_from_bytes(data.clone()); // cheap refcount bump self.previous_value = data; // move, no alloc ``` # What changes are included in this PR? previous_value type changed to Bytes preallocate result vec capacity. # Are these changes tested? the existing test should pass # Are there any user-facing changes? no
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a PR that targets this PR in arrow-rs:
Variant::Objectapache/arrow-rs#7741Merging this PR wil update apache#7741
It has
git merge apache/maincc @friendlymatthew