perf: Bucket index partitioning re-parses the hash-field config string per record #18978

@voonhous

Describe the problem

SparkBucketIndexPartitioner#getPartition calls the BucketIdentifier.getBucketId overload that takes the raw comma-separated hash-field config String. That overload re-parses the immutable config value on every call (KeyGenUtils.getIndexKeyFields: split, per-token trim, empty-token filter, new list), so it allocates per record in the shuffle of every upsert or insert into simple-bucket-index tables.

The same pattern repeats on the row paths: the BucketPartitionUtils.createDataFrame keyBy closure and BucketBulkInsertDataInternalWriterHelper#write re-parse the same config string per row in bucket-index row-writer bulk inserts.

Proposed fix

Precompute the parsed field list once per partitioner or writer with KeyGenUtils.getIndexKeyFields (the exact parser the String overload uses today, including trim, empty-token filtering and null handling) and call the existing List-taking getBucketId overload. Bucket ids are bit-identical since the downstream chain is unchanged; HoodieBucketIndex already follows this pattern for the tagging path.
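A minimal, self-contained sketch of the pattern, not the actual Hudi code: `BucketIdSketch`, `parseIndexKeyFields`, `bucketId` and `Partitioner` are hypothetical stand-ins for the real classes and overloads, records are modelled as a plain `Map`, the hashing is illustrative only, and the null handling of the real parser is omitted. It only shows that parsing once in the constructor and delegating to the List-taking overload yields the same bucket id as parsing per call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketIdSketch {

    // Hypothetical stand-in for the config parser: split on commas,
    // trim each token, drop empty tokens. Null handling is omitted here.
    static List<String> parseIndexKeyFields(String config) {
        return Arrays.stream(config.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Stand-in for the List-taking overload: no config parsing at all.
    // The hashing scheme is an assumption made for illustration.
    static int bucketId(Map<String, String> record, List<String> fields, int numBuckets) {
        List<String> values = new ArrayList<>(fields.size());
        for (String field : fields) {
            values.add(record.get(field));
        }
        return (values.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Stand-in for the String-taking overload: re-parses the config on every call.
    static int bucketId(Map<String, String> record, String config, int numBuckets) {
        return bucketId(record, parseIndexKeyFields(config), numBuckets);
    }

    // Partitioner that parses the immutable config once and reuses the list.
    static final class Partitioner {
        private final List<String> indexKeyFields;
        private final int numBuckets;

        Partitioner(String config, int numBuckets) {
            this.indexKeyFields = parseIndexKeyFields(config); // parsed once
            this.numBuckets = numBuckets;
        }

        int getPartition(Map<String, String> record) {
            return bucketId(record, indexKeyFields, numBuckets); // no per-record parse
        }
    }

    public static void main(String[] args) {
        String config = " id , ,region,";
        Map<String, String> record = Map.of("id", "42", "region", "us");
        Partitioner partitioner = new Partitioner(config, 8);
        // Both paths share the same downstream chain, so the ids match.
        System.out.println(partitioner.getPartition(record) == bucketId(record, config, 8));
    }
}
```

Because the precomputed list comes from the same parser the String overload calls, the two paths can only differ in when the parse happens, not in its result.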

Expected improvement

A simple micro-measurement (JDK 17, single-field hash key, 10M iterations after warmup, single thread) of the two existing getBucketId overloads, which differ exactly by the per-call config parse:

  • String overload (re-parses config per call): ~62 ns/op
  • List overload (precomputed field list): ~4 ns/op

So the config re-parse accounts for the large majority of the per-record bucket-id cost, and the fix also removes a handful of transient allocations per record (split array, trimmed tokens, list) from the shuffle hot path. End-to-end job impact depends on the workload, since record serialization dominates shuffles. The partitioning step itself becomes roughly an order of magnitude cheaper, which adds up at hundreds of millions of records per commit.
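A measurement along these lines could be sketched as below. This is not the harness behind the numbers above: `ParseOverheadBench`, `parse`, `bucketId` and `nsPerOp` are hypothetical names, the bucket-id computation is a simplified stand-in, and a plain `System.nanoTime` loop is far less rigorous than a proper benchmark harness such as JMH. It only isolates the one difference the two overloads have, the per-call config parse.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseOverheadBench {

    // Hypothetical stand-in for the config parser: split, trim, drop empties.
    static List<String> parse(String config) {
        List<String> out = new ArrayList<>();
        for (String token : config.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Simplified stand-in for the bucket-id computation (illustrative hashing).
    static int bucketId(String key, List<String> fields, int numBuckets) {
        return ((key.hashCode() + fields.size()) & Integer.MAX_VALUE) % numBuckets;
    }

    // Average nanoseconds per call, with or without a per-call config parse.
    static double nsPerOp(boolean reparse, String config, String[] keys,
                          int iterations, int numBuckets) {
        List<String> precomputed = parse(config);
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            List<String> fields = reparse ? parse(config) : precomputed;
            sink += bucketId(keys[i % keys.length], fields, numBuckets);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // keeps the loop result live for the JIT
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        String[] keys = new String[1024];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key-" + i;
        }
        int iterations = 10_000_000;
        // Warm up both paths before timing.
        nsPerOp(true, "id", keys, iterations, 256);
        nsPerOp(false, "id", keys, iterations, 256);
        System.out.printf("re-parse per call: %.1f ns/op%n",
                nsPerOp(true, "id", keys, iterations, 256));
        System.out.printf("precomputed list:  %.1f ns/op%n",
                nsPerOp(false, "id", keys, iterations, 256));
    }
}
```

Absolute numbers from a loop like this vary with JVM version, hardware and JIT decisions, so only the relative gap between the two paths is meaningful.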

Will raise a PR for this.
