refactor(variant): self-align log-block variant rows, drop buffer-level projection hook by voonhous · Pull Request #18923 · apache/hudi

voonhous · 2026-06-05T13:15:48Z

Describe the issue this Pull Request addresses

Addresses #18739.

PR #18674 aligned Spark 4.1 MOR variant reads via HoodieReaderContext.getLogBlockRecordProjection, a per-row projector run inside the engine-neutral FileGroupRecordBuffer. That leaked Spark-4.1 / PushVariantIntoScan concerns into format-agnostic merge code. This PR pushes variant projection down into the log readers so the buffer no longer knows about variants.

Summary and Changelog

Each log reader now emits rows already aligned to the projected read schema; the buffer-level projection hook is removed.

- Remove FileGroupRecordBuffer.getProjectedTransformer and HoodieReaderContext.getLogBlockRecordProjection. The buffers (FileGroupRecordBuffer, PositionBasedFileGroupRecordBuffer) now call getSchemaTransformerWithEvolvedSchema directly and keep returning the unprojected evolved schema, because the merger reads metadata columns by ordinal.
- Add a no-op HoodieReaderContext.projectLogBlockRecords(iter, dataBlockSchema). HoodieAvroDataBlock.deserializeRecords invokes it; the Spark reader context overrides it to apply the VariantGet rewrite (relocated from the old hook).
- Parquet log blocks: a new HoodieSparkParquetReader.getUnsafeRowIterator(HoodieSchema, StructType, filters) overload threads the projected struct through, so SPARK_ROW_REQUESTED_SCHEMA carries VariantMetadata and parquet-mr projects natively (mirroring the base-file path).
- A single shouldProjectVariants gate (variant projection present AND merger not PAYLOAD_BASED) drives both paths, preserving the previous custom-payload skip.

No code copied. buildVariantProjector / isVariantProjectionStruct are unchanged (caller moved).
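
The hook-plus-gate shape described in the changelog can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class names `SketchReaderContext`, `SketchSparkReaderContext`, and `LogBlockProjectionSketch` are hypothetical, rows are modelled as plain strings, the schema is a plain `String`, and the projector is an arbitrary function standing in for the VariantGet rewrite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical, simplified stand-in for the engine-neutral reader context.
// The real HoodieReaderContext has many more members and takes a schema object.
class SketchReaderContext<T> {
  // No-op default: engines without variant projection hand rows back untouched.
  Iterator<T> projectLogBlockRecords(Iterator<T> records, String dataBlockSchema) {
    return records;
  }
}

// Hypothetical Spark-side override; rows are modelled as plain strings.
class SketchSparkReaderContext extends SketchReaderContext<String> {
  private final boolean variantProjectionPresent;
  private final boolean payloadBasedMerger;
  private final UnaryOperator<String> variantProjector;

  SketchSparkReaderContext(boolean variantProjectionPresent,
                           boolean payloadBasedMerger,
                           UnaryOperator<String> variantProjector) {
    this.variantProjectionPresent = variantProjectionPresent;
    this.payloadBasedMerger = payloadBasedMerger;
    this.variantProjector = variantProjector;
  }

  // Single gate: a variant projection exists AND the merger is not payload-based.
  boolean shouldProjectVariants() {
    return variantProjectionPresent && !payloadBasedMerger;
  }

  @Override
  Iterator<String> projectLogBlockRecords(Iterator<String> records, String dataBlockSchema) {
    if (!shouldProjectVariants()) {
      return records; // gate closed: rows pass through unchanged
    }
    // Lazily rewrite each row as it is pulled from the underlying iterator.
    return new Iterator<String>() {
      @Override public boolean hasNext() { return records.hasNext(); }
      @Override public String next() { return variantProjector.apply(records.next()); }
    };
  }
}

class LogBlockProjectionSketch {
  // Stand-in for a log block's deserialize step: decode, then call the hook.
  static <T> List<T> deserializeRecords(SketchReaderContext<T> ctx, List<T> decoded) {
    List<T> out = new ArrayList<>();
    ctx.projectLogBlockRecords(decoded.iterator(), "dataBlockSchema").forEachRemaining(out::add);
    return out;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("r1", "r2");
    UnaryOperator<String> projector = r -> r + ":projected";
    System.out.println(deserializeRecords(new SketchReaderContext<String>(), rows));
    System.out.println(deserializeRecords(new SketchSparkReaderContext(true, false, projector), rows));
    System.out.println(deserializeRecords(new SketchSparkReaderContext(true, true, projector), rows));
  }
}
```

In this sketch the caller of `deserializeRecords` never learns whether a rewrite happened, which is the property the PR relies on to keep the merge buffer unaware of variants.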

Impact

Internal change to the HoodieReaderContext extension point (getLogBlockRecordProjection -> projectLogBlockRecords); the new hook is a no-op by default. No new configs. No user-facing behavior change for non-Spark engines or Spark < 4.1; Spark 4.1 variant MOR reads are unchanged in result, only the alignment moves out of the buffer.

Risk Level

Medium. Touches the engine-neutral merge buffer and the reader-context API. Mitigations: the new hook is a no-op default (non-Spark engines and Spark < 4.1 unaffected); projection is gated by shouldProjectVariants (custom-payload tables skip it); the buffer still returns the unprojected evolved schema, so the merger's ordinal-based metadata access is unchanged. Verified with TestVariantDataType on Spark 4.0 and 4.1 (cow + mor: insert / update / delete / merge-into, cast(v as string) selects) across key-based, position-based, and unmerged paths, plus a custom-payload variant MOR case.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR pushes variant projection down into the log readers (parquet native projection and avro deserialization hook) and removes the buffer-level projection step, which cleanly separates Spark-4.1 variant concerns from format-agnostic merge code. The inline comment raises one edge case worth checking: how the new ordering interacts with the schema-evolution transformer when both are active. There are also a couple of small readability nits below; otherwise the refactor is clean. Please take a look at the inline comments, after which this should be ready for a Hudi committer or PMC member to take it from here.

…er hook

Remove the engine-neutral FileGroupRecordBuffer variant-projection composition (apache#18674's getLogBlockRecordProjection hook) so the merge buffer stays format-agnostic; each log reader now emits rows already aligned to the projected read schema (apache#18739).

- Parquet log blocks: thread the variant-overlaid StructType into a new HoodieSparkParquetReader.getUnsafeRowIterator(HoodieSchema, StructType, filters) overload so SPARK_ROW_REQUESTED_SCHEMA carries VariantMetadata and parquet-mr decodes variants into the projected struct shape natively (mirrors the base-file path). Wired in SparkFileFormatInternalRowReaderContext.getFileRecordIterator.
- Avro log blocks: new no-op HoodieReaderContext.projectLogBlockRecords hook, invoked from HoodieAvroDataBlock.deserializeRecords; Spark overrides it to apply the VariantGet rewrite (relocated from the deleted buffer hook).
- Both paths gated by a single shouldProjectVariants predicate (variant projection present AND merger not PAYLOAD_BASED), preserving the buffer's custom-payload skip.
- FileGroupRecordBuffer/PositionBased now call getSchemaTransformerWithEvolvedSchema directly; getProjectedTransformer and getLogBlockRecordProjection deleted.
- Sub-task 4: documented why the sparkRequiredSchema overlay must stay (HoodieSchema can't carry VariantMetadata); kept Spark-side, no schema-model change.

buildVariantProjector / isVariantProjectionStruct unchanged (caller moved).

Addresses apache#18739.

- HoodieSparkParquetReader: rename variant-overload parameter to structSchema and drop the no-op alias.
- SparkFileFormatInternalRowReaderContext: extract isPayloadBased and drop the double negation in shouldProjectVariants.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR cleanly pushes the Spark 4.1 variant projection out of the engine-neutral merge buffer and into the engine-specific reader/log-block paths, which is a nice architectural improvement. No new issues flagged from this automated pass — the ordering-flip concern (schema-evolve then project → project then schema-evolve, which only matters when schema-on-read is active) was already raised and acknowledged in round 1. A Hudi committer or PMC member can take it from here for a final review. A couple of minor naming and readability suggestions below.
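
Why the order of the two steps can matter at all is easy to see with a toy example. The sketch below is purely illustrative and makes no claim about how Hudi's transformers actually behave: the class `TransformOrderSketch`, the column names `payload_old` and `payload`, and both transforms are hypothetical. It only shows that a rename-style schema evolution and a name-based projection do not commute for rows written under the old schema, while rows already in the current schema come out the same either way.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class TransformOrderSketch {
  // Hypothetical schema-evolution step: rename column "payload_old" to "payload".
  static final UnaryOperator<Map<String, Object>> EVOLVE = row -> {
    Map<String, Object> out = new LinkedHashMap<>(row);
    if (out.containsKey("payload_old")) {
      out.put("payload", out.remove("payload_old"));
    }
    return out;
  };

  // Hypothetical projection step: keep only "key" and "payload" (null if absent).
  static final UnaryOperator<Map<String, Object>> PROJECT = row -> {
    Map<String, Object> out = new LinkedHashMap<>();
    out.put("key", row.get("key"));
    out.put("payload", row.get("payload"));
    return out;
  };

  static Map<String, Object> evolveThenProject(Map<String, Object> row) {
    return PROJECT.apply(EVOLVE.apply(row));
  }

  static Map<String, Object> projectThenEvolve(Map<String, Object> row) {
    return EVOLVE.apply(PROJECT.apply(row));
  }

  // Builds a two-column row whose payload lives under the given column name.
  static Map<String, Object> row(String payloadColumn) {
    Map<String, Object> r = new LinkedHashMap<>();
    r.put("key", "k1");
    r.put(payloadColumn, "v");
    return r;
  }

  public static void main(String[] args) {
    // Row written under the old schema: the two orders disagree.
    System.out.println(evolveThenProject(row("payload_old"))); // {key=k1, payload=v}
    System.out.println(projectThenEvolve(row("payload_old"))); // {key=k1, payload=null}
    // Row already in the current schema: the two orders agree.
    System.out.println(evolveThenProject(row("payload")).equals(projectThenEvolve(row("payload"))));
  }
}
```

In the PR itself the projection hook receives the data block's schema, so a real projector has the information it needs to account for an un-evolved block; the toy projector above deliberately ignores that.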

cc @yihua

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! These follow-ups rename the parquet reader's variant overload parameter to projectedStructSchema (matching Javadoc and call sites) and document why the merger != null guard in the Spark reader context is real rather than defensive. Both of my prior nits and the corresponding hudi-agent suggestions are resolved, and no new issues surfaced from this incremental pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

…rger guard

- Rename getUnsafeRowIterator param structSchema to projectedStructSchema
- Add inline comment explaining the merger != null guard

hudi-bot · 2026-06-09T05:14:53Z

CI report:

71cabc3 UNKNOWN
3092939 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! This round's changes (renaming structSchema → projectedStructSchema, extracting isPayloadBased, and the null-guard rationale) all look good. The variant-projection refactor cleanly separates engine-specific concerns from the buffer code — Avro log blocks project in deserializeRecords via projectLogBlockRecords, parquet log blocks project natively in the reader, and the buffers go back to the plain evolved-schema transformer. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor naming suggestion; the rest of the changes are clean and well-commented.

cc @yihua

hudi-agent · 2026-06-09T16:52:39Z

+    if (!shouldProjectVariants) {
+      return recordIterator
    }
    val req = sparkRequiredSchema.get


🤖 nit: req is a bit terse here — the adjacent variables (dataStruct, targetStruct) are more descriptive. Could you rename it to requiredStruct to stay consistent?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

codecov-commenter · 2026-06-10T12:36:59Z

Codecov Report

❌ Patch coverage is 83.87097% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.24%. Comparing base (aac975c) to head (3092939).
⚠️ Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...hudi/SparkFileFormatInternalRowReaderContext.scala | 75.00% | 1 Missing and 3 partials ⚠️ |
| ...ache/hudi/io/storage/HoodieSparkParquetReader.java | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18923      +/-   ##
============================================
+ Coverage     68.20%   68.24%   +0.04%     
- Complexity    29458    29476      +18     
============================================
  Files          2542     2542              
  Lines        142545   142553       +8     
  Branches      17778    17778              
============================================
+ Hits          97218    97289      +71     
+ Misses        37316    37247      -69     
- Partials       8011     8017       +6

| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | `44.73% <16.12%> (+0.05%)` | ⬆️ |
| hadoop-mr-java-client | `44.71% <100.00%> (-0.04%)` | ⬇️ |
| spark-client-hadoop-common | `48.03% <32.25%> (-0.01%)` | ⬇️ |
| spark-java-tests | `48.76% <64.51%> (-0.01%)` | ⬇️ |
| spark-scala-tests | `44.85% <83.87%> (+<0.01%)` | ⬆️ |
| utilities | `37.28% <64.51%> (+<0.01%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...apache/hudi/common/engine/HoodieReaderContext.java | `95.00% <100.00%> (ø)` | |
| ...di/common/table/log/block/HoodieAvroDataBlock.java | `55.15% <100.00%> (+0.23%)` | ⬆️ |
| ...ommon/table/read/buffer/FileGroupRecordBuffer.java | `94.62% <100.00%> (-0.38%)` | ⬇️ |
| ...ead/buffer/PositionBasedFileGroupRecordBuffer.java | `75.00% <100.00%> (ø)` | |
| ...ache/hudi/io/storage/HoodieSparkParquetReader.java | `78.72% <80.00%> (+0.22%)` | ⬆️ |
| ...hudi/SparkFileFormatInternalRowReaderContext.scala | `78.08% <75.00%> (+1.03%)` | ⬆️ |

... and 11 files with indirect coverage changes


voonhous linked an issue Jun 5, 2026 that may be closed by this pull request

Revisit projection for variant type columns on Spark #18739

Open

voonhous requested a review from yihua June 5, 2026 13:16

voonhous force-pushed the fix-18739 branch from d60fbdc to ce316f9 Compare June 5, 2026 13:31

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 5, 2026

hudi-agent reviewed Jun 6, 2026


voonhous added this to the release-1.3.0 milestone Jun 7, 2026

voonhous mentioned this pull request Jun 8, 2026

Flesh out schema evolution support for variant #18285

Open

voonhous added 2 commits June 8, 2026 13:58

voonhous force-pushed the fix-18739 branch from 71cabc3 to 78a81a9 Compare June 8, 2026 05:58

hudi-agent reviewed Jun 8, 2026


Comment thread ...ent/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkParquetReader.java Outdated

Comment thread ...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala

hudi-agent reviewed Jun 8, 2026


refactor(variant): address review nits on parquet reader param and me…

3092939

…rger guard

- Rename getUnsafeRowIterator param structSchema to projectedStructSchema
- Add inline comment explaining the merger != null guard

voonhous force-pushed the fix-18739 branch from 3c5d4e9 to 3092939 Compare June 9, 2026 02:49

hudi-agent reviewed Jun 9, 2026


voonhous wants to merge 3 commits into apache:master from voonhous:fix-18739


wombatu-kun Jun 14, 2026


5 participants
