Skip to content

refactor(variant): self-align log-block variant rows, drop buffer-level projection hook#18923

Open
voonhous wants to merge 3 commits into
apache:masterfrom
voonhous:fix-18739
Open

refactor(variant): self-align log-block variant rows, drop buffer-level projection hook#18923
voonhous wants to merge 3 commits into
apache:masterfrom
voonhous:fix-18739

Conversation

@voonhous

@voonhous voonhous commented Jun 5, 2026

Copy link
Copy Markdown
Member

Describe the issue this Pull Request addresses

Addresses #18739.

PR #18674 aligned Spark 4.1 MOR variant reads via HoodieReaderContext.getLogBlockRecordProjection, a per-row projector run inside the engine-neutral FileGroupRecordBuffer. That leaked Spark-4.1 / PushVariantIntoScan concerns into format-agnostic merge code. This PR pushes variant projection down into the log readers so the buffer no longer knows about variants.

Summary and Changelog

Each log reader now emits rows already aligned to the projected read schema; the buffer-level projection hook is removed.

  • Remove FileGroupRecordBuffer.getProjectedTransformer and HoodieReaderContext.getLogBlockRecordProjection. The buffers (FileGroupRecordBuffer, PositionBasedFileGroupRecordBuffer) now call getSchemaTransformerWithEvolvedSchema directly and keep returning the unprojected evolved schema (merger reads metadata cols by ordinal).
  • Add a no-op HoodieReaderContext.projectLogBlockRecords(iter, dataBlockSchema). HoodieAvroDataBlock.deserializeRecords invokes it; the Spark reader context overrides it to apply the VariantGet rewrite (relocated from the old hook).
  • Parquet log blocks: new HoodieSparkParquetReader.getUnsafeRowIterator(HoodieSchema, StructType, filters) overload threads the projected struct so SPARK_ROW_REQUESTED_SCHEMA carries VariantMetadata and parquet-mr projects natively (mirrors the base-file path).
  • Single shouldProjectVariants gate (variant projection present AND merger not PAYLOAD_BASED) drives both paths, preserving the previous custom-payload skip.

No code copied. buildVariantProjector / isVariantProjectionStruct are unchanged (caller moved).

Impact

Internal change to the HoodieReaderContext extension point (getLogBlockRecordProjection -> projectLogBlockRecords); the new hook is a no-op by default. No new configs. No user-facing behavior change for non-Spark engines or Spark < 4.1; Spark 4.1 variant MOR reads are unchanged in result, only the alignment moves out of the buffer.

Risk Level

Medium. Touches the engine-neutral merge buffer and the reader-context API. Mitigations: the new hook is a no-op default (non-Spark engines and Spark < 4.1 unaffected); projection is gated by shouldProjectVariants (custom-payload tables skip it); the buffer still returns the unprojected evolved schema, so the merger's ordinal-based metadata access is unchanged. Verified with TestVariantDataType on Spark 4.0 and 4.1 (cow + mor: insert / update / delete / merge-into, cast(v as string) selects) across key-based, position-based, and unmerged paths, plus a custom-payload variant MOR case.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous linked an issue Jun 5, 2026 that may be closed by this pull request
@voonhous voonhous requested a review from yihua June 5, 2026 13:16
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 5, 2026

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR pushes variant projection down into the log readers (parquet native projection and avro deserialization hook) and removes the buffer-level projection step, which cleanly separates Spark-4.1 variant concerns from format-agnostic merge code. One worth-checking edge case in the inline comment around how the new ordering interacts with the schema-evolution transformer when both are active. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small readability nits below, otherwise the refactor is clean.

voonhous added 2 commits June 8, 2026 13:58
…er hook

Remove the engine-neutral FileGroupRecordBuffer variant-projection composition
(apache#18674's getLogBlockRecordProjection hook) so the merge buffer stays
format-agnostic; each log reader now emits rows already aligned to the projected
read schema (apache#18739).

- Parquet log blocks: thread the variant-overlaid StructType into a new
  HoodieSparkParquetReader.getUnsafeRowIterator(HoodieSchema, StructType, filters)
  overload so SPARK_ROW_REQUESTED_SCHEMA carries VariantMetadata and parquet-mr
  decodes variants into the projected struct shape natively (mirrors the base-file
  path). Wired in SparkFileFormatInternalRowReaderContext.getFileRecordIterator.
- Avro log blocks: new no-op HoodieReaderContext.projectLogBlockRecords hook,
  invoked from HoodieAvroDataBlock.deserializeRecords; Spark overrides it to apply
  the VariantGet rewrite (relocated from the deleted buffer hook).
- Both paths gated by a single shouldProjectVariants predicate (variant projection
  present AND merger not PAYLOAD_BASED), preserving the buffer's custom-payload skip.
- FileGroupRecordBuffer/PositionBased now call getSchemaTransformerWithEvolvedSchema
  directly; getProjectedTransformer and getLogBlockRecordProjection deleted.
- Sub-task 4: documented why the sparkRequiredSchema overlay must stay (HoodieSchema
  can't carry VariantMetadata); kept Spark-side, no schema-model change.

buildVariantProjector / isVariantProjectionStruct unchanged (caller moved).

Addresses apache#18739.
- HoodieSparkParquetReader: rename variant-overload parameter to
  structSchema and drop the no-op alias.
- SparkFileFormatInternalRowReaderContext: extract isPayloadBased and
  drop the double negation in shouldProjectVariants.

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR cleanly pushes the Spark 4.1 variant projection out of the engine-neutral merge buffer and into the engine-specific reader/log-block paths, which is a nice architectural improvement. No new issues flagged from this automated pass — the ordering-flip concern (schema-evolve then project → project then schema-evolve, which only matters when schema-on-read is active) was already raised and acknowledged in round 1. A Hudi committer or PMC member can take it from here for a final review. A couple of minor naming and readability suggestions below.

cc @yihua

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! These follow-ups rename the parquet reader's variant overload parameter to projectedStructSchema (matching Javadoc and call sites) and document why the merger != null guard in the Spark reader context is real rather than defensive. Both of my prior nits and the corresponding hudi-agent suggestions are resolved, and no new issues surfaced from this incremental pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

…rger guard

- Rename getUnsafeRowIterator param structSchema to projectedStructSchema
- Add inline comment explaining the merger != null guard
@hudi-bot

hudi-bot commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! This round's changes (renaming structSchemaprojectedStructSchema, extracting isPayloadBased, and the null-guard rationale) all look good. The variant-projection refactor cleanly separates engine-specific concerns from the buffer code — Avro log blocks project in deserializeRecords via projectLogBlockRecords, parquet log blocks project natively in the reader, and the buffers go back to the plain evolved-schema transformer. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor naming suggestion; the rest of the changes are clean and well-commented.

cc @yihua

if (!shouldProjectVariants) {
return recordIterator
}
val req = sparkRequiredSchema.get

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: req is a bit terse here — the adjacent variables (dataStruct, targetStruct) are more descriptive. Could you rename it to requiredStruct to stay consistent?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.87097% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.24%. Comparing base (aac975c) to head (3092939).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...hudi/SparkFileFormatInternalRowReaderContext.scala 75.00% 1 Missing and 3 partials ⚠️
...ache/hudi/io/storage/HoodieSparkParquetReader.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18923      +/-   ##
============================================
+ Coverage     68.20%   68.24%   +0.04%     
- Complexity    29458    29476      +18     
============================================
  Files          2542     2542              
  Lines        142545   142553       +8     
  Branches      17778    17778              
============================================
+ Hits          97218    97289      +71     
+ Misses        37316    37247      -69     
- Partials       8011     8017       +6     
Flag Coverage Δ
common-and-other-modules 44.73% <16.12%> (+0.05%) ⬆️
hadoop-mr-java-client 44.71% <100.00%> (-0.04%) ⬇️
spark-client-hadoop-common 48.03% <32.25%> (-0.01%) ⬇️
spark-java-tests 48.76% <64.51%> (-0.01%) ⬇️
spark-scala-tests 44.85% <83.87%> (+<0.01%) ⬆️
utilities 37.28% <64.51%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...apache/hudi/common/engine/HoodieReaderContext.java 95.00% <100.00%> (ø)
...di/common/table/log/block/HoodieAvroDataBlock.java 55.15% <100.00%> (+0.23%) ⬆️
...ommon/table/read/buffer/FileGroupRecordBuffer.java 94.62% <100.00%> (-0.38%) ⬇️
...ead/buffer/PositionBasedFileGroupRecordBuffer.java 75.00% <100.00%> (ø)
...ache/hudi/io/storage/HoodieSparkParquetReader.java 78.72% <80.00%> (+0.22%) ⬆️
...hudi/SparkFileFormatInternalRowReaderContext.scala 78.08% <75.00%> (+1.03%) ⬆️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisit projection for variant type columns on Spark

5 participants