scovich
Contributor

@scovich scovich commented Sep 4, 2025

Which issue does this PR close?

Rationale for this change

Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches.
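As a rough illustration of the behavior described above, here is a self-contained sketch that walks a dot-separated path and casts the result to a 32-bit integer. `ToyVariant`, `get_path`, and `get_i32` are hypothetical names invented for this example; they are not the parquet-variant API, and the `safe` flag only mimics the safe vs. unsafe casting distinction mentioned later in this description.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

/// Toy stand-in for a variant value (hypothetical, not the real `Variant`).
#[derive(Debug, Clone, PartialEq)]
enum ToyVariant {
    Null,
    Int64(i64),
    Str(String),
    Object(BTreeMap<String, ToyVariant>),
}

/// Walk a dot-separated path such as "a.b.x". Returns `None` if a field is
/// missing or if a step lands on something that is not an object.
fn get_path<'a>(root: &'a ToyVariant, path: &str) -> Option<&'a ToyVariant> {
    path.split('.').try_fold(root, |current, field| match current {
        ToyVariant::Object(fields) => fields.get(field),
        _ => None,
    })
}

/// Extract `path` from every row and cast it to i32.
/// With `safe == true`, a type mismatch or overflow yields NULL (`None`);
/// with `safe == false`, it yields an error instead.
fn get_i32(rows: &[ToyVariant], path: &str, safe: bool) -> Result<Vec<Option<i32>>, String> {
    rows.iter()
        .map(|row| match get_path(row, path) {
            // Missing paths and variant nulls both become NULL.
            None | Some(ToyVariant::Null) => Ok(None),
            Some(ToyVariant::Int64(v)) => match i32::try_from(*v) {
                Ok(narrowed) => Ok(Some(narrowed)),
                Err(_) if safe => Ok(None),
                Err(_) => Err(format!("value {} at {} does not fit in Int32", v, path)),
            },
            Some(_) if safe => Ok(None),
            Some(other) => Err(format!("cannot cast {:?} at {} to Int32", other, path)),
        })
        .collect()
}
```

In this toy model, a row whose "a.x" field holds a string produces NULL under safe casting and an error otherwise, while a row that has no "a" field at all produces NULL either way.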

NOTE: This is a second attempt at

See the other two PRs for the vast majority of review commentary relating to this change.

I started from the original PR commits (first three), performed the merge, and fixed up a bunch of issues.

Manually diffing the before (76b75eebc..882aa4d69) and after (0ba91aed9..f6fd91583) diffs gives the following non-trivial differences vs. the original PR:

  • Ran cargo fmt
  • typed_value_to_variant now supports all primitive numeric types (previously only int16)
  • cast options plumbed through and respected
  • Fix a null buffer bug in shredded_get_path -- the original code was wrongly unioning in the null buffer from typed_value column:
      // Path exhausted! Create a new `VariantArray` for the location we landed on.
    - // Also union nulls from the final typed_value field we landed on
    - if let Some(typed_value) = shredding_state.typed_value_field() {
    -     accumulated_nulls = arrow::buffer::NullBuffer::union(
    -         accumulated_nulls.as_ref(),
    -         typed_value.nulls(),
    -     );
    - }
      let target = make_target_variant(
          shredding_state.value_field().cloned(),
          shredding_state.typed_value_field().cloned(),
          accumulated_nulls,
      );
  • Remove the get_variant_perfectly_shredded_int32_as_variant test case, because #8179 ([Variant] Support typed access for numeric types in variant_get) introduced a battery of unit tests that cover the same functionality.
  • Remove now-unnecessary .unwrap() calls from object builder finish calls in unit tests
  • Fixed broken test code in create_depth_1_shredded_test_data_working, which captured the return value of a nested builder's finish (the unit value, ()) instead of the return value of the top-level builder. I'm not quite sure what this code was trying to do, but I changed it to just create a nested builder instead of a second top-level builder:
      fn create_depth_1_shredded_test_data_working() -> ArrayRef {
          // Create metadata following the working pattern from shredded_object_with_x_field_variant_array
          let (metadata, _) = { 
    -         let a_variant = {
    -             let mut a_builder = parquet_variant::VariantBuilder::new();
    -             let mut a_obj = a_builder.new_object();
    -             a_obj.insert("x", Variant::Int32(55));  // "a.x" field (shredded when possible)
    -             a_obj.finish().unwrap()
    -         }; 
              let mut builder = parquet_variant::VariantBuilder::new();
              let mut obj = builder.new_object();
    -         obj.insert("a", a_variant);
    +
    +         // Create the nested "a" object 
    +         let mut a_obj = obj.new_object("a");
    +         a_obj.insert("x", Variant::Int32(55)); 
    +         a_obj.finish();
    +
              obj.finish().unwrap();
              builder.finish()
          };
  • Similar fix (twice, a_variant and b_variant) for create_depth_2_shredded_test_data_working
  • make_shredding_row_builder now supports signed int and float types (unsigned int not supported yet)
  • A new get_type_name helper in row_builder.rs that gives human-readable data type names. I'm not convinced it's necessary (and the code is in the wrong spot, jammed in the middle of VariantAsPrimitive code).
  • impl VariantAsPrimitive for all signed int and float types
  • PrimitiveVariantShreddingRowBuilder now has a lifetime param because it takes a reference to cast options (it now respects unsafe vs. safe casting)
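To make the null buffer fix in the list above more concrete, here is a toy model that uses plain bool slices (true = valid) instead of arrow's `NullBuffer`. The function names are hypothetical. The underlying assumption, as I understand the shredding layout, is that a null `typed_value` at some row only means the value did not shred into that type; the row's data may still live in the `value` column, so it should not make the whole row null.

```rust
/// Combine two validity masks (true = valid). A row stays valid only if it is
/// valid in both masks, which is what "unioning the nulls" amounts to.
fn union_nulls(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}

/// Toy model of the removed code: the nulls of the final `typed_value`
/// column are folded into the nulls accumulated along the path.
fn result_validity_buggy(accumulated: &[bool], typed_value: &[bool]) -> Vec<bool> {
    union_nulls(accumulated, typed_value)
}

/// Toy model of the fix: only the nulls accumulated along the path count.
fn result_validity_fixed(accumulated: &[bool]) -> Vec<bool> {
    accumulated.to_vec()
}
```

Take three rows: row 0 shredded into `typed_value`, row 1 fell back to the `value` column, and row 2 is null at the struct level. With accumulated validity `[true, true, false]` and `typed_value` validity `[true, false, false]`, the buggy variant reports row 1 as null even though it holds a value, while the fixed variant keeps it valid.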

What changes are included in this PR?

Everything in the original PR, plus merging in the main branch, fixing logical conflicts, and fixing various broken tests.

Are these changes tested?

All unit tests now pass.

Are there any user-facing changes?

No (variant is not public yet)

scovich and others added 11 commits August 22, 2025 14:24
Resolves conflicts between PR 8166 (shredding support) and PR 8179 (multi-type support):

- Preserves PR 8179's comprehensive multi-type support for all numeric primitives
- Keeps PR 8166's superior row builder architecture and shredding support
- Integrates both test suites for complete coverage
- Maintains enhanced path parsing from PR 8166

The merge successfully combines:
- Multi-type variant_get support (Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float16, Float32, Float64)
- Advanced shredding capabilities with row builder approach
- Comprehensive test coverage from both PRs
@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 4, 2025
@scovich
Contributor Author

scovich commented Sep 4, 2025

attn @carpecodeum @alamb

@carpecodeum
Contributor

> attn @carpecodeum @alamb

Thank you @scovich for this

@alamb
Contributor

alamb commented Sep 5, 2025

>> attn @carpecodeum @alamb

> Thank you @scovich for this

Perhaps you would like to merge this PR @carpecodeum ?

@scovich
Contributor Author

scovich commented Sep 5, 2025

I'm also totally fine with people plundering this PR so the one with all the comment history can merge. And if somebody knows how to make it a useful/correct PR against the other PR (@alamb seems to be expert in this), that's another possibility.

Either way, I'd also hope we can catalog all the follow-up items before merging so they don't get lost. I think some (like null mask unioning) were already addressed, and I just didn't notice until yesterday.

@alamb
Contributor

alamb commented Sep 5, 2025

> I'm also totally fine with people plundering this PR so the one with all the comment history can merge. And if somebody knows how to make it a useful/correct PR against the other PR (@alamb seems to be expert in this), that's another possibility.

> Either way, I'd also hope we can catalog all the follow-up items before merging so they don't get lost. I think some (like null mask unioning) were already addressed, and I just didn't notice until yesterday.

I will double check / catalog before I merge anything

@carpecodeum
Contributor

>>> attn @carpecodeum @alamb

>> Thank you @scovich for this

> Perhaps you would like to merge this PR @carpecodeum ?

Perhaps I saw this a bit too late. I am fine with either approach, honestly; I will review any other PR that stems from #8166

@alamb
Contributor

alamb commented Sep 8, 2025

@scovich could you take this branch and create a new PR against arrow-rs (rather than the cmu fork) that we can merge?

@scovich
Contributor Author

scovich commented Sep 8, 2025

> @scovich could you take this branch and create a new PR against arrow-rs (rather than the cmu fork) that we can merge?

This PR is against apache:main, I think it can just merge as-is?

@scovich scovich marked this pull request as ready for review September 8, 2025 11:42
@scovich
Contributor Author

scovich commented Sep 8, 2025

Oh, but the PR description still talks as if it were a merge to the other branch. Let me clean that up.

@scovich scovich changed the title from "Merge arrow-rs/main into cmu-db/shredding-variant-part1" to "[Variant] Support Shredded Objects in variant_get (take 2)" Sep 8, 2025
@alamb
Contributor

alamb commented Sep 8, 2025

I will review this in a few hours. Thank you so much @scovich and @carpecodeum

Contributor

@alamb alamb left a comment

Thank you @scovich @carpecodeum and @liamzwbao -- I think this is a big step forward. I realize it still needs a bunch of work, but I think it is good enough to merge at this point as it is significantly better than what is on main.

One thing I don't feel I have a good handle on yet is where we are in terms of what is supported and what is left to go. I think I will spend some more time working on tests to figure that out next.

Onwards!

/// Partially shredded:
/// * value is an object
/// * typed_value is a shredded object.
/// Imperfectly shredded: Shredded values reside in `typed_value` while those that failed to
Contributor

thank you -- this made it much clearer to me

}
}

/// Helper trait for converting `Variant` values to arrow primitive values.
Contributor

as a follow on, this trait might be more discoverable if we put it somewhere in parquet-variant-compute/src/type_conversion.rs

}

/// Create from &str
/// Create from &str with support for dot notation
Contributor

this probably eventually needs support for escaping, etc., but is probably fine for now
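For illustration only, one possible shape for escaping in a dot-notation path is sketched below. The `parse_path` function and the backslash escape syntax are assumptions made for this example; the PR itself does not define any escaping rules.

```rust
/// Split a dot-separated path into field names. A backslash makes the next
/// character literal, so `\.` is a dot inside a field name.
/// Hypothetical syntax, not what the crate currently implements.
fn parse_path(path: &str) -> Result<Vec<String>, String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut chars = path.chars();
    while let Some(c) = chars.next() {
        match c {
            // Escape: take the following character verbatim.
            '\\' => match chars.next() {
                Some(escaped) => current.push(escaped),
                None => return Err("path ends with a dangling backslash".to_string()),
            },
            // Unescaped dot: close the current field and start a new one.
            '.' => fields.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    fields.push(current);
    Ok(fields)
}
```

Under these assumed rules, `a.b.x` splits into three fields, while `a\.b.x` splits into the two fields `a.b` and `x`.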

@alamb
Contributor

alamb commented Sep 8, 2025

I pushed a commit to remove an unnecessary #[allow(unused)] annotation and merge up from main.

@alamb alamb merged commit fb7d02e into apache:main Sep 8, 2025
12 checks passed
alamb pushed a commit that referenced this pull request Sep 10, 2025
Note to reviewers: This PR includes 1600+ LoC of new unit tests. The
actual changes are half that big.

# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Closes #8310

# Rationale for this change

The original `cast_to_variant` code did columnar conversions of types to
variant. For primitive types this worked ok, but for deeply nested types
it means repeatedly creating new variants (and variant metadata), only
to re-code them by copying the variant values to new arrays (with new
metadata and field ids). Very expensive.

# What changes are included in this PR?

Follow the example of #8280, and
introduce a row builder concept that takes individual array values and
writes them to an `impl VariantBuilderExt`. Row builders for complex
types instantiate the appropriate list or object builder to pass to
their children.
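
The row builder idea can be sketched with toy types. Everything below (`ToyValue`, `ToyRowBuilder`, `IntColumn`, `ListColumn`) is hypothetical and only mirrors the shape of the approach; the real code writes into an `impl VariantBuilderExt` rather than a `Vec`.

```rust
/// Toy output value (hypothetical stand-in for an encoded variant).
#[derive(Debug, PartialEq)]
enum ToyValue {
    Null,
    Int(i64),
    List(Vec<ToyValue>),
}

/// A row builder appends the value found at one row of its column to `out`.
trait ToyRowBuilder {
    fn append_row(&self, row: usize, out: &mut Vec<ToyValue>);
}

/// Primitive column: one optional integer per row.
struct IntColumn {
    values: Vec<Option<i64>>,
}

impl ToyRowBuilder for IntColumn {
    fn append_row(&self, row: usize, out: &mut Vec<ToyValue>) {
        out.push(match self.values[row] {
            Some(v) => ToyValue::Int(v),
            None => ToyValue::Null,
        });
    }
}

/// List column: row `i` covers child rows `offsets[i]..offsets[i + 1]`.
struct ListColumn {
    offsets: Vec<usize>,
    child: Box<dyn ToyRowBuilder>,
}

impl ToyRowBuilder for ListColumn {
    fn append_row(&self, row: usize, out: &mut Vec<ToyValue>) {
        // Open a nested "list builder" and hand it to the child, so child
        // values are written once, directly into their final position.
        let mut items = Vec::new();
        for child_row in self.offsets[row]..self.offsets[row + 1] {
            self.child.append_row(child_row, &mut items);
        }
        out.push(ToyValue::List(items));
    }
}
```

Calling `append_row` once per row on the top-level builder produces each nested value in a single pass, which is the point of the design: no intermediate per-column variants need to be created and then copied again.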

# Are these changes tested?

Existing unit tests continue to pass. Extensive new unit tests added as
well.

# Are there any user-facing changes?

* `VariantBuilderExt` has a new `append_null` method.
* `ObjectFieldBuilder` moved to `builder.rs` and made public
alamb pushed a commit that referenced this pull request Sep 11, 2025
# Which issue does this PR close?

* Quick-follow to #8280

# Rationale for this change

The change from output builders to row builders left some no-longer-used
files behind.

# What changes are included in this PR?

Delete the unused files.

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
alamb added a commit that referenced this pull request Sep 11, 2025
# Which issue does this PR close?

* Follow-up to #8280

# Rationale for this change

See #8280 (comment)

# What changes are included in this PR?

See description.

# Are these changes tested?

Code movement. Compilation suffices.

# Are there any user-facing changes?

No.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels
parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Support Shredded Objects in variant_get: typed path access (STEP 1)
3 participants