perf: Bucket index partitioning re-parses the hash-field config string per record #18978

### Describe the problem

`SparkBucketIndexPartitioner#getPartition` calls the `BucketIdentifier.getBucketId` overload that takes the raw comma-separated hash-field config String. That overload re-parses the immutable config value on every call (`KeyGenUtils.getIndexKeyFields`: split, per-token trim, empty filter, new list), so it allocates per record in the shuffle of every upsert or insert into simple-bucket-index tables.

The same pattern repeats on the row paths: the `BucketPartitionUtils.createDataFrame` keyBy closure and `BucketBulkInsertDataInternalWriterHelper#write` re-parse the same config string per row in bucket-index row-writer bulk inserts.

### Proposed fix

Precompute the parsed field list once per partitioner or writer with `KeyGenUtils.getIndexKeyFields` (the exact parser the String overload uses today, including trim, empty-token filtering and null handling) and call the existing List-taking `getBucketId` overload. Bucket ids are bit-identical since the downstream chain is unchanged; `HoodieBucketIndex` already follows this pattern for the tagging path.
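The shape of the change can be sketched as below. This is a minimal illustration, not Hudi code: `BucketIdSketch`, `parseFields`, `bucketId` and the nested `Partitioner` are hypothetical stand-ins for `KeyGenUtils.getIndexKeyFields`, the two `getBucketId` overloads and the partitioner, and the hash is a toy one. It only shows that parsing once in the constructor yields the same ids as parsing per call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration only; these are NOT the Hudi classes.
public class BucketIdSketch {

    // Stand-in parser: split on commas, trim each token, drop empty tokens.
    public static List<String> parseFields(String config) {
        return Arrays.stream(config.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Stand-in for the List-taking overload. The hash here is a toy one.
    public static int bucketId(String recordKey, List<String> fields, int numBuckets) {
        int h = 31 * recordKey.hashCode() + fields.hashCode();
        return (h & Integer.MAX_VALUE) % numBuckets;
    }

    // Stand-in for the String-taking overload: re-parses the config on every call.
    public static int bucketId(String recordKey, String fieldsConfig, int numBuckets) {
        return bucketId(recordKey, parseFields(fieldsConfig), numBuckets);
    }

    // Partitioner-like holder that parses the config once, at construction time.
    public static final class Partitioner {
        private final List<String> fields;
        private final int numBuckets;

        public Partitioner(String fieldsConfig, int numBuckets) {
            this.fields = parseFields(fieldsConfig); // parsed once, reused per record
            this.numBuckets = numBuckets;
        }

        public int getPartition(String recordKey) {
            return bucketId(recordKey, fields, numBuckets);
        }
    }

    public static void main(String[] args) {
        String config = " id , ,region ";
        Partitioner partitioner = new Partitioner(config, 8);
        for (String key : new String[] {"k1", "k2", "k3"}) {
            // Both paths share the same downstream chain, so the ids must agree.
            if (partitioner.getPartition(key) != bucketId(key, config, 8)) {
                throw new IllegalStateException("bucket id mismatch for " + key);
            }
        }
        System.out.println("bucket ids match");
    }
}
```

In a real Spark partitioner the cached list would presumably also need to survive serialization to executors, either by being a serializable list or by being rebuilt lazily.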

### Expected improvement

A simple micro-measurement (JDK 17, single-field hash key, 10M iterations after warmup, single thread) of the two existing `getBucketId` overloads, which differ exactly by the per-call config parse:

- String overload (re-parses config per call): ~62 ns/op
- List overload (precomputed field list): ~4 ns/op

So the config re-parse accounts for the large majority of the per-record bucket-id cost (roughly 58 of the ~62 ns in this measurement). The fix also removes a handful of transient allocations per record (split array, trimmed tokens, list) from the shuffle hot path. End-to-end job impact depends on the workload, since record serialization dominates shuffles, but the partitioning step itself becomes roughly an order of magnitude cheaper, which adds up at hundreds of millions of records per commit.
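For anyone who wants to reproduce the comparison, a harness along these lines is one way to do it. It is a rough sketch and not the code used for the numbers above: `BucketIdBench`, `parseFields`, `bucketId` and `run` are hypothetical stand-ins with a toy hash, so absolute timings will differ from the real overloads, and a JMH benchmark against the actual `getBucketId` methods would be more rigorous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness with stand-in methods; NOT the Hudi classes.
public class BucketIdBench {

    // Stand-in parser: split on commas, trim each token, drop empty tokens.
    public static List<String> parseFields(String config) {
        List<String> out = new ArrayList<>();
        for (String token : config.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Stand-in bucket-id computation with a toy hash.
    public static int bucketId(String recordKey, List<String> fields, int numBuckets) {
        int h = 31 * recordKey.hashCode() + fields.hashCode();
        return (h & Integer.MAX_VALUE) % numBuckets;
    }

    // Runs `iterations` bucket-id calls and returns {elapsedNanos, checksum}.
    // With reparse=true the config is parsed on every call; otherwise once up front.
    public static long[] run(String[] keys, String config, int numBuckets,
                             int iterations, boolean reparse) {
        List<String> cached = parseFields(config);
        long checksum = 0; // consumed by the caller so the loop is not dead code
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            List<String> fields = reparse ? parseFields(config) : cached;
            checksum += bucketId(keys[i % keys.length], fields, numBuckets);
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {elapsed, checksum};
    }

    public static void main(String[] args) {
        String[] keys = new String[1024];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key-" + i;
        }
        String config = "id";       // single-field hash key, as in the measurement
        int iterations = 10_000_000;

        // Warm up both paths before timing.
        run(keys, config, 256, iterations, true);
        run(keys, config, 256, iterations, false);

        long[] reparsed = run(keys, config, 256, iterations, true);
        long[] cached = run(keys, config, 256, iterations, false);
        System.out.printf("re-parse per call: %.1f ns/op (checksum %d)%n",
                (double) reparsed[0] / iterations, reparsed[1]);
        System.out.printf("precomputed list:  %.1f ns/op (checksum %d)%n",
                (double) cached[0] / iterations, cached[1]);
    }
}
```

The two checksums should be equal, which doubles as a check that both paths produce the same bucket ids.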

Will raise a PR for this.
