perf(spark): Resolve drop-partition-columns projection once per write… by voonhous · Pull Request #18972 · apache/hudi

voonhous · 2026-06-11T06:56:16Z

…r instead of per row

Describe the issue this Pull Request addresses

BulkInsertDataInternalWriterHelper#write(InternalRow) redoes constant work for every row when hoodie.datasource.write.drop.partition.columns=true: it resolves the config flag, instantiates a key generator via constructor reflection through HoodieDatasetBulkInsertHelper.getPartitionPathCols (ReflectionUtils caches only the Class object, not instances), recomputes the partition-column ordinals into a fresh HashSet, and round-trips the whole row through toSeq/fromSeq, boxing every column. None of this depends on the row. The path runs per record in row-writer bulk insert and clustering rewrites whenever drop-partition-columns is enabled.
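
To make the cost concrete, here is a minimal sketch of the per-row shape described above, written in plain Java with no Spark or Hudi dependencies. Every name in it (`PerRowProjectionSketch`, `resolvePartitionOrdinals`, `writePerRow`) is a hypothetical stand-in, not Hudi code: a `List<Object>` plays the role of the boxed `toSeq` output, and a counter records how often the row-independent resolution runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the pre-change per-row pattern; not Hudi code. */
final class PerRowProjectionSketch {
    /** Counts how often the row-independent resolution runs. */
    static int resolutions = 0;

    /** Stand-in for resolving partition-column ordinals; the result never depends on the row. */
    static Set<Integer> resolvePartitionOrdinals(List<String> schema, List<String> partitionCols) {
        resolutions++;
        Set<Integer> ordinals = new HashSet<>(); // a fresh set on every call
        for (String col : partitionCols) {
            ordinals.add(schema.indexOf(col));
        }
        return ordinals;
    }

    /** Old shape: redo the resolution for each row, then filter the boxed values. */
    static List<Object> writePerRow(List<Object> row, List<String> schema,
                                    List<String> partitionCols) {
        Set<Integer> drop = resolvePartitionOrdinals(schema, partitionCols);
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (!drop.contains(i)) {
                out.add(row.get(i));
            }
        }
        return out;
    }
}
```

In this sketch, calling `writePerRow` for N rows increments `resolutions` N times even though the set of ordinals is identical each time, which is the redundancy this PR targets.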

Summary and Changelog

The projection is now resolved once per writer instead of once per row.

The flag is resolved once in the constructor. The retained (non-partition) field ordinals and types are computed once on the first write() call.
The initialization is deliberately lazy rather than constructor-eager. The bucket-index subclasses (BucketBulkInsertDataInternalWriterHelper, ConsistentBucketBulkInsertDataInternalWriterHelper) override write() and never drop partition columns, so the partition-column resolution was previously unreachable for them and must stay that way. An eager precompute would turn configs that complete today into deterministic construction failures, for example bucket index with drop enabled and a CustomKeyGenerator, whose partition fields keep the field:type suffix that structType.fieldIndex rejects. Lazy init also keeps zero-row tasks free of the resolution, matching the previous reachability exactly.
write() copies the retained fields into a fresh GenericInternalRow via row.get(ordinal, type), which is value-identical to the previous toSeq/filter/fromSeq round trip (toSeq itself reads each field with row.get(i, dataType)), and a fresh row per record keeps the same aliasing behavior.
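
The three points above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code in this PR: `LazyProjectionWriterSketch` and `OverridingWriterSketch` are hypothetical names, `Object[]` stands in for `InternalRow`/`GenericInternalRow`, and the unknown-field check only imitates the way `structType.fieldIndex` rejects a name it cannot find.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a once-per-writer lazy projection; not the Hudi implementation. */
class LazyProjectionWriterSketch {
    private final boolean dropPartitionColumns; // flag resolved once, in the constructor
    private final List<String> schema;
    private final List<String> partitionCols;
    private int[] retainedOrdinals;             // null until the first write()
    int resolutions = 0;                        // how often the resolution actually ran
    final List<Object[]> written = new ArrayList<>();

    LazyProjectionWriterSketch(boolean dropPartitionColumns, List<String> schema,
                               List<String> partitionCols) {
        this.dropPartitionColumns = dropPartitionColumns;
        this.schema = schema;
        this.partitionCols = partitionCols;
    }

    void write(Object[] row) {
        if (!dropPartitionColumns) {
            written.add(row); // flag-false path: row passes through untouched
            return;
        }
        if (retainedOrdinals == null) {
            retainedOrdinals = resolveRetainedOrdinals(); // first write() only
        }
        Object[] projected = new Object[retainedOrdinals.length]; // fresh row per record
        for (int i = 0; i < retainedOrdinals.length; i++) {
            projected[i] = row[retainedOrdinals[i]];
        }
        written.add(projected);
    }

    /** Row-independent work: fails on unknown partition fields, then lists retained ordinals. */
    private int[] resolveRetainedOrdinals() {
        resolutions++;
        for (String col : partitionCols) {
            if (!schema.contains(col)) {
                throw new IllegalArgumentException("Unknown field: " + col);
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < schema.size(); i++) {
            if (!partitionCols.contains(schema.get(i))) {
                kept.add(i);
            }
        }
        int[] ordinals = new int[kept.size()];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = kept.get(i);
        }
        return ordinals;
    }
}

/** Stand-in for a subclass that overrides write() and never projects. */
final class OverridingWriterSketch extends LazyProjectionWriterSketch {
    OverridingWriterSketch(boolean drop, List<String> schema, List<String> partitionCols) {
        super(drop, schema, partitionCols);
    }

    @Override
    void write(Object[] row) {
        written.add(row); // the resolution in the base class is never reached
    }
}
```

In the sketch, an `OverridingWriterSketch` built with a partition field such as `dt:simple` constructs and writes without error because the resolution never runs for it, whereas moving `resolveRetainedOrdinals()` into the constructor would make that same construction throw. A writer that receives no rows likewise never triggers the resolution.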

Covered by the existing functional tests that exercise this branch: TestGetPartitionValuesFromPath (bulk_insert with drop enabled) and TestHoodieSparkSqlWriter.

Impact

Performance: removes per-row key-generator constructor reflection, config resolution, ordinal recomputation, and the full-row boxing round trip from row-writer bulk insert when drop-partition-columns is enabled. Output rows are value-identical. No public API change; the default flag-false path is unchanged.

Risk Level

Low. The transformation is a local row projection with no timeline or concurrency interaction; lazy initialization preserves the exact previous reachability for subclasses and zero-row tasks.

Documentation Update

None.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…r instead of per row

BulkInsertDataInternalWriterHelper#write redid constant work for every row when hoodie.datasource.write.drop.partition.columns is enabled: resolving the config flag, instantiating a key generator via constructor reflection through getPartitionPathCols, recomputing the partition-column ordinals into a fresh HashSet, and round-tripping the whole row through toSeq/fromSeq (boxing every column).

The flag is now resolved once in the constructor, and the retained (non-partition) field ordinals and types are computed once on the first write(). The lazy initialization keeps the partition-column resolution unreachable for the bucket-index subclasses, which override write() and never drop columns, and for tasks that write no rows, matching the previous reachability exactly. write() copies the retained fields into a fresh GenericInternalRow, which is value-identical to the previous toSeq/filter/fromSeq output.

codecov-commenter · 2026-06-11T08:28:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.26%. Comparing base (86d1650) to head (48b25e9).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18972      +/-   ##
============================================
+ Coverage     68.25%   68.26%   +0.01%     
- Complexity    29509    29531      +22     
============================================
  Files          2542     2542              
  Lines        142632   142697      +65     
  Branches      17789    17800      +11     
============================================
+ Hits          97354    97415      +61     
- Misses        37271    37275       +4     
  Partials       8007     8007

Flag	Coverage Δ
common-and-other-modules	`44.81% <0.00%> (+0.03%)`	⬆️
hadoop-mr-java-client	`44.77% <ø> (+0.10%)`	⬆️
spark-client-hadoop-common	`48.07% <0.00%> (+0.01%)`	⬆️
spark-java-tests	`48.80% <100.00%> (+0.01%)`	⬆️
spark-scala-tests	`44.85% <96.00%> (+0.02%)`	⬆️
utilities	`37.27% <8.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...ion/commit/BulkInsertDataInternalWriterHelper.java	`90.62% <100.00%> (+0.96%)`	⬆️

... and 28 files with indirect coverage changes


hudi-bot · 2026-06-11T10:23:13Z

CI report:

48b25e9 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

voonhous added the area:performance Performance optimizations label Jun 11, 2026

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 11, 2026

wombatu-kun approved these changes Jun 13, 2026

danny0405 merged commit 6428cb2 into apache:master Jun 13, 2026
63 of 67 checks passed

voonhous deleted the perf-bulk-insert-drop-partition-cols branch June 13, 2026 04:29
