
Fix logical conflict in Variant write API PR#1

Merged
friendlymatthew merged 8 commits into
pydantic:friendlymatthew/duplicatefrom
alamb:alamb/fix_merge
Jun 25, 2025

Conversation


@alamb alamb commented Jun 25, 2025

This is a PR that targets this PR in arrow-rs:

Merging this PR will update apache#7741

It has

  1. the results of git merge apache/main
  2. Fixing a logical conflict (8ab36bc)

cc @friendlymatthew

scovich and others added 8 commits June 24, 2025 17:12
# Which issue does this PR close?

Part of
* apache#6736

# Rationale for this change

The variant spec makes extensive use of 4-byte offsets. On a 32-bit
system, this can lead to integer overflow panics when doing offset
arithmetic on malicious or malformed variant data, because `usize` is 32
bits. That is bad.

# What changes are included in this PR?

Use checked arithmetic in code that operates on offsets extracted from
(untrusted) variant byte buffers.

Of particular interest, we define and use a new
`slice_from_slice_at_offset` helper, which safely adds an offset to a
range.

# Are there any user-facing changes?

No.
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- closes apache#7746

# Rationale for this change

The parquet-variant tests fail when run as part of
`verify-release-candidate.sh` due to the `parquet-testing` directory
being checked out in a different location

# What changes are included in this PR?

Update the test to look at the `PARQUET_TEST_DATA` environment variable
as well

# How are these changes tested?

I tested this manually:
```shell
# note this is a different name than the submodule:
git clone https://github.com/apache/parquet-testing.git parquet-testing-data
export PARQUET_TEST_DATA=$PWD/parquet-testing-data/data
# checkout my fork
git clone https://github.com/alamb/arrow-rs.git
cd arrow-rs
# This fails on main
git checkout main
cargo test -p parquet-variant
# PASSES on branch with fix
git checkout alamb/fix_variant_tests
cargo test -p parquet-variant
```

# Are there any user-facing changes?
No, this is a test-only change.
…inst empty strings (apache#7767)

# Which issue does this PR close?

This avoids a call to memcmp for the relatively common case of comparing
against an empty string.
Closes apache#7766.

# Rationale for this change

This speeds up some of the queries in the `arrow_reader_clickbench`
benchmark, some of them significantly. The biggest benefits are for Q10,
Q11, and Q12; I did not observe any slowdowns on any other query.
Benchmark results are for an uncompressed Parquet file.

```
arrow_reader_clickbench/sync/Q10
                        time:   [8.3934 ms 8.4411 ms 8.5212 ms]
                        change: [-36.714% -36.040% -35.243%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_reader_clickbench/sync/Q11
                        time:   [10.180 ms 10.315 ms 10.476 ms]
                        change: [-33.571% -32.145% -30.661%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_reader_clickbench/sync/Q12
                        time:   [17.262 ms 17.419 ms 17.616 ms]
                        change: [-21.201% -19.289% -17.409%] (p = 0.00 < 0.05)
                        Performance has improved.
```

# Are there any user-facing changes?

No
# Which issue does this PR close?

Closes apache#7426 (Variant: Write Variant Values as JSON)
Part of apache#6736 ([EPIC] [Parquet] Implement Variant type support in Parquet)

# Rationale for this change
This is an initial version, serving as a simple interface between the
Variant implementation and the Serde JSON library.

A huge thank you to @PinkCrow007, @mprammer, @alamb, the rest of the CMU
variant team, and everyone else we've interacted with who has helped me
get started with contributing to this project. This is my first
Arrow-related PR, and I thank you all for your insight and support.

# What changes are included in this PR?

This PR implements a comprehensive JSON conversion API for Variant types
with three main functions (`variant_to_json`, `variant_to_json_string`,
and `variant_to_json_value`) that convert different Variant types to
JSON format, including primitives, decimals, dates, timestamps, and
binary data with proper escaping and base64 encoding. The implementation
adds missing methods to `VariantObject` and `VariantArray` for
field/element access, includes two new dependencies (`serde_json` and
`base64`), and provides comprehensive test coverage with unit,
integration, and documentation test suites.

Open to input for improving any part of this implementation.

# Are there any user-facing changes?

The new APIs added in parquet-variant will be user-facing.

---------

Co-authored-by: Andrew Lamb <[email protected]>
…e#7719)

# Which issue does this PR close?

As suggested by @Dandandan in
apache#7650 (comment):

> We probably should set this as constant somewhere and use it

# Rationale for this change

Using a symbolic constant in the code rather than a hard-coded value
makes it easier to:
1. Understand what the value means
2. Link / attach documentation to the constant to provide context

# What changes are included in this PR?

1. Introduce `MAX_INLINE_VIEW_LEN` constant for string/byte views
2. Update code to use that instead of `12`

# Are there any user-facing changes?

A new constant
@friendlymatthew friendlymatthew merged commit c82c0c7 into pydantic:friendlymatthew/duplicate Jun 25, 2025
25 of 30 checks passed
@alamb alamb deleted the alamb/fix_merge branch June 25, 2025 20:33
adriangb pushed a commit that referenced this pull request Jan 31, 2026
# Which issue does this PR close?

small optimization

# Rationale for this change
The key insight is that cloning `Bytes` is cheap (just a refcount bump),
whereas cloning a `Vec` requires an allocation plus a memcpy.

before
```rust
let mut result = Vec::new();          // alloc #1
result.extend_from_slice(prefix);
result.extend_from_slice(suffix);

let data = Bytes::from(result.clone()); // alloc #2 + memcpy
item.set_from_bytes(data);
self.previous_value = result;          // keep Vec
```

after
```rust
let mut result = Vec::with_capacity(prefix_len + suffix.len()); // alloc #1
result.extend_from_slice(&self.previous_value[..prefix_len]);
result.extend_from_slice(suffix);

let data = Bytes::from(result);       // no alloc, takes Vec buffer
item.set_from_bytes(data.clone());    // cheap refcount bump
self.previous_value = data;           // move, no alloc
```

# What changes are included in this PR?
1. Change the type of `previous_value` from `Vec<u8>` to `Bytes`
2. Preallocate the result `Vec`'s capacity

# Are these changes tested?

The existing tests should pass.

# Are there any user-facing changes?

no