perf(spark): Resolve drop-partition-columns projection once per write…#18972

Merged: danny0405 merged 1 commit into apache:master from voonhous:perf-bulk-insert-drop-partition-cols on Jun 13, 2026

@voonhous voonhous commented Jun 11, 2026

…r instead of per row

Describe the issue this Pull Request addresses

Closes #18969

BulkInsertDataInternalWriterHelper#write(InternalRow) redoes constant work for every row when hoodie.datasource.write.drop.partition.columns=true: it resolves the config flag, instantiates a key generator via constructor reflection through HoodieDatasetBulkInsertHelper.getPartitionPathCols (ReflectionUtils caches only the Class object, not instances), recomputes the partition-column ordinals into a fresh HashSet, and round-trips the whole row through toSeq/fromSeq, boxing every column. None of this depends on the row. The path runs per record in row-writer bulk insert and clustering rewrites whenever drop-partition-columns is enabled.
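To make the cost pattern concrete, here is a minimal sketch of the per-row shape described above. It is not Hudi's code: `PerRowDropSketch` is a hypothetical class, `Object[]` stands in for `InternalRow`, a list of column names stands in for the `StructType`, and the `resolutions` counter exists only to show how often the row-independent work runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the pre-change write() path; not Hudi's actual code.
final class PerRowDropSketch {
    // Counts how often the row-independent resolution runs (illustration only).
    int resolutions = 0;

    private final List<String> schema;         // stand-in for the StructType
    private final List<String> partitionCols;  // stand-in for the key generator's partition fields

    PerRowDropSketch(List<String> schema, List<String> partitionCols) {
        this.schema = schema;
        this.partitionCols = partitionCols;
    }

    // Row-independent work: maps partition column names to ordinals in a fresh set.
    private Set<Integer> resolvePartitionOrdinals() {
        resolutions++;
        Set<Integer> ordinals = new HashSet<>();
        for (String col : partitionCols) {
            ordinals.add(schema.indexOf(col));
        }
        return ordinals;
    }

    // Mirrors the old shape: resolve, then filter a boxed copy of the whole row.
    Object[] write(Object[] row) {
        Set<Integer> drop = resolvePartitionOrdinals(); // repeated for every row
        List<Object> kept = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            if (!drop.contains(i)) {
                kept.add(row[i]);
            }
        }
        return kept.toArray();
    }
}
```

In this sketch, writing N rows increments `resolutions` N times even though the result of the resolution never changes, which is the redundancy this PR targets.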

Summary and Changelog

The projection is now resolved once per writer instead of once per row.

  • The flag is resolved once in the constructor. The retained (non-partition) field ordinals and types are computed once on the first write() call.
  • The initialization is deliberately lazy rather than constructor-eager. The bucket-index subclasses (BucketBulkInsertDataInternalWriterHelper, ConsistentBucketBulkInsertDataInternalWriterHelper) override write() and never drop partition columns, so the partition-column resolution was previously unreachable for them and must stay that way. An eager precompute would turn configs that complete today into deterministic construction failures; one example is bucket index with drop enabled and a CustomKeyGenerator, whose partition fields keep the field:type suffix that structType.fieldIndex rejects. Lazy init also keeps zero-row tasks free of the resolution, matching the previous reachability exactly.
  • write() copies the retained fields into a fresh GenericInternalRow via row.get(ordinal, type), which is value-identical to the previous toSeq/filter/fromSeq round trip (toSeq itself reads each field with row.get(i, dataType)), and a fresh row per record keeps the same aliasing behavior.
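The three points above can be sketched as follows. This is an illustrative model under stated assumptions, not the merged implementation: `LazyProjectionWriter` and `BucketStyleWriter` are hypothetical names, `Object[]` stands in for `InternalRow`, a `Supplier` stands in for the key-generator lookup, and the sketch assumes each writer instance is used by a single thread.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.IntStream;

// Hypothetical stand-in for the writer helper; not Hudi's actual class.
class LazyProjectionWriter {
    // Counts how often the partition-column resolution runs (illustration only).
    int resolutions = 0;

    private final boolean dropPartitionColumns;  // resolved once, in the constructor
    private final List<String> schema;
    private final Supplier<List<String>> partitionColsResolver;
    private int[] retainedOrdinals;              // stays null until the first write()

    LazyProjectionWriter(boolean dropPartitionColumns, List<String> schema,
                         Supplier<List<String>> partitionColsResolver) {
        this.dropPartitionColumns = dropPartitionColumns;
        this.schema = schema;
        this.partitionColsResolver = partitionColsResolver;
    }

    // Computes the ordinals of the non-partition fields; fails on unknown names.
    private int[] resolveRetainedOrdinals() {
        resolutions++;
        List<String> partitionCols = partitionColsResolver.get();
        for (String col : partitionCols) {
            if (!schema.contains(col)) {
                throw new IllegalArgumentException("Unknown field: " + col);
            }
        }
        return IntStream.range(0, schema.size())
            .filter(i -> !partitionCols.contains(schema.get(i)))
            .toArray();
    }

    Object[] write(Object[] row) {
        if (!dropPartitionColumns) {
            return row;
        }
        if (retainedOrdinals == null) {
            retainedOrdinals = resolveRetainedOrdinals(); // first write() only
        }
        // A fresh array per record, so callers never share a mutable row.
        Object[] projected = new Object[retainedOrdinals.length];
        for (int i = 0; i < projected.length; i++) {
            projected[i] = row[retainedOrdinals[i]];
        }
        return projected;
    }
}

// Stand-in for a bucket-index subclass: overrides write() and never drops columns,
// so the resolution above is never reached for it.
class BucketStyleWriter extends LazyProjectionWriter {
    BucketStyleWriter(boolean dropPartitionColumns, List<String> schema,
                      Supplier<List<String>> partitionColsResolver) {
        super(dropPartitionColumns, schema, partitionColsResolver);
    }

    @Override
    Object[] write(Object[] row) {
        return row;
    }
}
```

In this model, a resolver that returns a name the schema does not contain (for example one carrying a `field:type` suffix) only fails when the base `write()` is actually called. A `BucketStyleWriter` built with the same resolver never triggers it, and neither does a writer that receives no rows. Moving the resolution into the constructor would make both cases fail at construction time.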

Covered by the existing functional tests that exercise this branch: TestGetPartitionValuesFromPath (bulk_insert with drop enabled) and TestHoodieSparkSqlWriter.

Impact

Performance: removes per-row key-generator constructor reflection, config resolution, ordinal recomputation, and the full-row boxing round trip from row-writer bulk insert when drop-partition-columns is enabled. Output rows are value-identical. No public API change; the default flag-false path is unchanged.
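The value-identity claim can be checked in miniature with a small sketch that compares a filter-by-ordinal-set path against a precomputed-ordinal projection. All names here are hypothetical and `Object[]` again stands in for `InternalRow`; the existing functional tests, not this sketch, are what cover the real code path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.IntStream;

// Hypothetical comparison of the two projection shapes; not Hudi's actual code.
final class ProjectionEquivalenceSketch {
    // Old shape: walk every field and keep those whose ordinal is not a partition ordinal.
    static Object[] filterPath(Object[] row, Set<Integer> partitionOrdinals) {
        List<Object> kept = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            if (!partitionOrdinals.contains(i)) {
                kept.add(row[i]);
            }
        }
        return kept.toArray();
    }

    // Computed once per writer in the new shape: ordinals of the retained fields, in order.
    static int[] retainedOrdinals(int fieldCount, Set<Integer> partitionOrdinals) {
        return IntStream.range(0, fieldCount)
            .filter(i -> !partitionOrdinals.contains(i))
            .toArray();
    }

    // New shape: copy only the retained fields into a fresh array.
    static Object[] project(Object[] row, int[] retained) {
        Object[] out = new Object[retained.length];
        for (int i = 0; i < retained.length; i++) {
            out[i] = row[retained[i]];
        }
        return out;
    }
}
```

Both paths visit the retained fields in ascending ordinal order, so for any row they yield the same values in the same positions, including nulls.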

Risk Level

Low. The transformation is a local row projection with no timeline or concurrency interaction; lazy initialization preserves the exact previous reachability for subclasses and zero-row tasks.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…r instead of per row

BulkInsertDataInternalWriterHelper#write redid constant work for every
row when hoodie.datasource.write.drop.partition.columns is enabled:
resolving the config flag, instantiating a key generator via
constructor reflection through getPartitionPathCols, recomputing the
partition-column ordinals into a fresh HashSet, and round-tripping the
whole row through toSeq/fromSeq (boxing every column).

The flag is now resolved once in the constructor, and the retained
(non-partition) field ordinals and types are computed once on the
first write(). The lazy initialization keeps the partition-column
resolution unreachable for the bucket-index subclasses, which override
write() and never drop columns, and for tasks that write no rows,
matching the previous reachability exactly. write() copies the
retained fields into a fresh GenericInternalRow, which is
value-identical to the previous toSeq/filter/fromSeq output.
@voonhous voonhous added the area:performance Performance optimizations label Jun 11, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.26%. Comparing base (86d1650) to head (48b25e9).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18972      +/-   ##
============================================
+ Coverage     68.25%   68.26%   +0.01%     
- Complexity    29509    29531      +22     
============================================
  Files          2542     2542              
  Lines        142632   142697      +65     
  Branches      17789    17800      +11     
============================================
+ Hits          97354    97415      +61     
- Misses        37271    37275       +4     
  Partials       8007     8007              
Flag Coverage Δ
common-and-other-modules 44.81% <0.00%> (+0.03%) ⬆️
hadoop-mr-java-client 44.77% <ø> (+0.10%) ⬆️
spark-client-hadoop-common 48.07% <0.00%> (+0.01%) ⬆️
spark-java-tests 48.80% <100.00%> (+0.01%) ⬆️
spark-scala-tests 44.85% <96.00%> (+0.02%) ⬆️
utilities 37.27% <8.00%> (+<0.01%) ⬆️

Files with missing lines Coverage Δ
...ion/commit/BulkInsertDataInternalWriterHelper.java 90.62% <100.00%> (+0.96%) ⬆️

... and 28 files with indirect coverage changes

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 11, 2026
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 6428cb2 into apache:master Jun 13, 2026
63 of 67 checks passed
@voonhous voonhous deleted the perf-bulk-insert-drop-partition-cols branch June 13, 2026 04:29
Labels

area:performance Performance optimizations size:S PR with lines of changes in (10, 100]

Development

Successfully merging this pull request may close these issues.

perf: Row-writer bulk insert re-instantiates the key generator and boxes the full row per record when dropping partition columns

5 participants