-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Support Shredded Objects in variant_get (take 2) #8280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Support Shredded Objects in variant_get (take 2) #8280
Conversation
Resolves conflicts between PR 8166 (shredding support) and PR 8179 (multi-type support): - Preserves PR 8179's comprehensive multi-type support for all numeric primitives - Keeps PR 8166's superior row builder architecture and shredding support - Integrates both test suites for complete coverage - Maintains enhanced path parsing from PR 8166 The merge successfully combines: - Multi-type variant_get support (Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float16, Float32, Float64) - Advanced shredding capabilities with row builder approach - Comprehensive test coverage from both PRs
attn @carpecodeum @alamb |
Thank you @scovich for this |
Perhaps you would like to merge this PR @carpecodeum ? |
I'm also totally fine with people plundering this PR so the one with all the comment history can merge. And if somebody knows how to make it a useful/correct PR against the other PR (@alamb seems to be expert in this), that's another possibility. Either way, I'd also hope we can catalog all the follow-up items before merging so they don't get lost. I think some (like null mask unioning) were already addressed, and I just didn't notice until yesterday. |
I will double check / catalog before I merge anything |
Perhaps saw this a bit too late, I am fine with both the approaches honestly, I will review any other PR that stems from #8166 |
@scovich could you take this branch and create a new PR against arrow-rs (rather than the cmu fork) that we can merge? |
This PR is against apache:main, I think it can just merge as-is? |
Oh, but the PR description still talks as if it were a merge to the other branch. Let me clean that up. |
I will review this in a few hours. THank you so much @scovich and @carpecodeum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich @carpecodeum and @liamzwbao -- I think this is a big step forward. I realize it still needs a bunch of work, but I think it is good enough to merge at this point as it is significantly better than what is on main.
One thing I don't feel I have a good handle on yet is where we are in terms of what is supported and what is left to go. I think I will spend some more time working on tests to figure that out next.
Onwards!
/// Partially shredded: | ||
/// * value is an object | ||
/// * typed_value is a shredded object. | ||
/// Imperfectly shredded: Shredded values reside in `typed_value` while those that failed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you -- this made it much clearer to me
} | ||
} | ||
|
||
/// Helper trait for converting `Variant` values to arrow primitive values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a follow on, this trait might be more discoverable if we put it somewhere in parquet-variant-compute/src/type_conversion.rs
} | ||
|
||
/// Create from &str | ||
/// Create from &str with support for dot notation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably eventually needs support for escaping, etc but is probably fine for now
I pushed a commit to remove an unnecessary |
Note to reviewers: This PR includes 1600+ LoC of new unit tests. The actual changes are half that big. # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Closes #8310 # Rationale for this change The original `cast_to_variant` code did columnar conversions of types to variant. For primitive types this worked ok, but for deeply nested types it means repeatedly creating new variants (and variant metadata), only to re-code them by copying the variant values to new arrays (with new metadata and field ids). Very expensive. # What changes are included in this PR? Follow the example of #8280, and introduce a row builder concept that takes individual array values and writes them to an `impl VariantBuilderExt`. Row builders for complex types instantiate the appropriate list or object builder to pass to their children. # Are these changes tested? Existing unit tests continue to pass. Extensive new unit tests added as well. # Are there any user-facing changes? * `VariantBuilderExt` has a new `append_null` method. * `ObjectFieldBuilder` moved to `builder.rs` and made public
# Which issue does this PR close? * Quick-follow to #8280 # Rationale for this change The change from output builders to row builders left some no-longer-used files behind. # What changes are included in this PR? Delete the unused files. # Are these changes tested? N/A # Are there any user-facing changes? No
# Which issue does this PR close? * Follow-up to #8280 # Rationale for this change See #8280 (comment) # What changes are included in this PR? See description. # Are these changes tested? Code movement. Compilation suffices. # Are there any user-facing changes? No. --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
variant_get
: typed path access (STEP 1) #8150Rationale for this change
Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches.
NOTE: This is a second attempt at
which suffered strong logical conflicts with
and which itself grew out of
See the other two PR for the vast majority of review commentary relating to this change.
I started from the original PR commits (first three), performed the merge, and fixed up a bunch of issues.
Manually diffing the before (76b75eebc..882aa4d69) and after (0ba91aed9..f6fd91583) diffs gives the following non-trivial differences vs. the original PR
cargo fmt
typed_value_to_variant
now supports all primitive numeric types (previously only int16)shredded_get_path
-- the original code was wrongly unioning in the null buffer fromtyped_value
column:get_variant_perfectly_shredded_int32_as_variant
test case, because [Variant] Support typed access for numeric types in variant_get #8179 introduced a battery of unit tests that cover the same functionality..unwrap()
calls from object builderfinish
calls in unit testscreate_depth_1_shredded_test_data_working
, which captured the return value of a nested builder'sfinish
(()
) instead of the return value of the top-level builder. I'm not quite sure what this code was trying to do, but I changed it to just create a nested builder instead of a second top-level builder:a_variant
andb_variant
) forcreate_depth_2_shredded_test_data_working
make_shredding_row_builder
now supports signed int and float types (unsigned int not supported yet)get_type_name
helper in row_builder.rs that gives human-readable data type names. I'm not convinced it's necessary (and the code is in the wrong spot, jammed in the middle ofVariantAsPrimitive
code.impl VariantAsPrimitive
for all signed int and float typesPrimitiveVariantShreddingRowBuilder
now has a lifetime param because it takes a reference to cast options (it now respects unsafe vs. safe casting)What changes are included in this PR?
Everything in the original PR, plus merge in the main branch, fix logical conflicts and fix various broken tests.
Are these changes tested?
All unit tests now pass.
Are there any user-facing changes?
No (variant is not public yet)