Describe the problem
SparkBucketIndexPartitioner#getPartition calls the BucketIdentifier.getBucketId overload that takes the raw comma-separated hash-field config String. That overload re-parses the immutable config value on every call (KeyGenUtils.getIndexKeyFields: split, per-token trim, empty-token filter, new list), so it allocates per record in the shuffle of every upsert or insert into a simple-bucket-index table.
The same pattern repeats on the row paths: both the keyBy closure in BucketPartitionUtils.createDataFrame and BucketBulkInsertDataInternalWriterHelper#write re-parse the same config string per row in bucket-index row-writer bulk inserts.
Proposed fix
Precompute the parsed field list once per partitioner or writer with KeyGenUtils.getIndexKeyFields (the exact parser the String overload uses today, including trim, empty-token filtering and null handling) and call the existing List-taking getBucketId overload. Bucket ids stay bit-identical because everything downstream of the parse is unchanged; HoodieBucketIndex already follows this pattern for the tagging path.
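The shape of the change can be sketched with stand-in code. Everything below is hypothetical and self-contained: parseHashFields only imitates the split/trim/empty-filter steps described above, the two bucketId overloads are simplified stand-ins for the String-taking and List-taking getBucketId overloads, and the hash is not Hudi's. It is meant to show the parse moving from the per-record path into the constructor, not the actual patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class BucketIdSketch {

    // Stand-in parser (assumption): split on commas, trim each token, drop empty tokens.
    public static List<String> parseHashFields(String config) {
        List<String> fields = new ArrayList<>();
        if (config == null) {
            return fields;
        }
        for (String token : config.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    // Stand-in for the List-taking overload: hashes the selected field values.
    public static int bucketId(Map<String, String> record, List<String> fields, int numBuckets) {
        List<String> values = new ArrayList<>(fields.size());
        for (String field : fields) {
            values.add(record.get(field));
        }
        return (values.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Stand-in for the String-taking overload: re-parses the config on every call.
    public static int bucketId(Map<String, String> record, String config, int numBuckets) {
        return bucketId(record, parseHashFields(config), numBuckets);
    }

    // Hypothetical partitioner: the config is parsed once, in the constructor.
    public static final class Partitioner {
        private final List<String> hashFields;
        private final int numBuckets;

        public Partitioner(String hashFieldConfig, int numBuckets) {
            this.hashFields = Collections.unmodifiableList(parseHashFields(hashFieldConfig));
            this.numBuckets = numBuckets;
        }

        // Per-record path: no split, no trim, no new field list.
        public int getPartition(Map<String, String> record) {
            return bucketId(record, hashFields, numBuckets);
        }
    }

    public static void main(String[] args) {
        String config = " id , ,region";
        Map<String, String> record = Map.of("id", "42", "region", "eu");
        Partitioner partitioner = new Partitioner(config, 8);
        boolean same = partitioner.getPartition(record) == bucketId(record, config, 8);
        System.out.println("precomputed and per-call paths agree: " + same);
    }
}
```

Because both paths go through the same parser and the same List-taking function, the result cannot differ; only the number of times the config is parsed changes.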
Expected improvement
A simple micro-measurement (JDK 17, single-field hash key, 10M iterations after warmup, single thread) of the two existing getBucketId overloads, which differ only in the per-call config parse:
- String overload (re-parses config per call): ~62 ns/op
- List overload (precomputed field list): ~4 ns/op
So the config re-parse accounts for the large majority of the per-record bucket-id cost (roughly 58 of the ~62 ns in this measurement), and the fix also removes a handful of transient allocations per record (split array, trimmed tokens, list) from the shuffle hot path. End-to-end job impact depends on the workload, since record serialization dominates shuffles. The partitioning step itself becomes roughly an order of magnitude cheaper (about 15x here), which adds up at hundreds of millions of records per commit.
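For anyone who wants to reproduce a comparison of this kind, a rough timing loop could look like the sketch below. It is not the harness behind the figures above: the parser, the bucketId overloads and the hash are hypothetical stand-ins rather than Hudi code, and a hand-rolled System.nanoTime loop is sensitive to JIT effects, so a JMH benchmark against the real overloads would be more trustworthy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

public class ParseCostSketch {

    // Stand-in parser (assumption): split on commas, trim each token, drop empty tokens.
    public static List<String> parseHashFields(String config) {
        List<String> fields = new ArrayList<>();
        for (String token : config.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    // Stand-in for the List-taking overload (the hash is illustrative only).
    public static int bucketId(String key, List<String> fields, int numBuckets) {
        return ((31 * fields.hashCode() + key.hashCode()) & Integer.MAX_VALUE) % numBuckets;
    }

    // Stand-in for the String-taking overload: parses the config on every call.
    public static int bucketId(String key, String config, int numBuckets) {
        return bucketId(key, parseHashFields(config), numBuckets);
    }

    // Runs op the given number of times; returns {elapsedNanos, checksum}.
    // The checksum keeps the results live so the loop is not optimized away.
    public static long[] run(int iterations, IntSupplier op) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            checksum += op.getAsInt();
        }
        return new long[] {System.nanoTime() - start, checksum};
    }

    public static void main(String[] args) {
        final String config = "id";
        final List<String> fields = parseHashFields(config);
        final int iterations = 10_000_000;

        IntSupplier perCallParse = () -> bucketId("key", config, 256);
        IntSupplier precomputed = () -> bucketId("key", fields, 256);

        // Warm up both paths before timing them.
        run(iterations, perCallParse);
        run(iterations, precomputed);

        long[] a = run(iterations, perCallParse);
        long[] b = run(iterations, precomputed);
        System.out.printf("per-call parse:   %.1f ns/op%n", a[0] / (double) iterations);
        System.out.printf("precomputed list: %.1f ns/op%n", b[0] / (double) iterations);
        System.out.println("same checksum: " + (a[1] == b[1]));
    }
}
```

The checksum comparison doubles as a sanity check that both variants compute the same bucket ids.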
Will raise a PR for this.