train_pgo: Add json and avro training #29487
base: dev
Conversation
Pull request overview
This PR enhances PGO (Profile-Guided Optimization) training by adding Avro and JSON schema support alongside the existing Protobuf training. This helps optimize Redpanda's performance when Iceberg is used with these different serialization formats.
Changes:
- Added Avro and JSON schema definitions with corresponding sample payloads
- Refactored the Protobuf-specific setup function into a generic schema setup function that supports multiple formats
- Added a new function to send messages via rpk for testing Avro and JSON schemas
- Updated topic names to be format-specific and fixed hardcoded references
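Roughly, the generic setup described above could look something like the sketch below. This is only an illustration of the approach: the identifiers (`AVRO_TOPIC_NAME`, `AVRO_SCHEMA`, `setup_schema_and_topic`) and the topic config used to enable Iceberg translation are assumptions, not copied from the actual diff.

```python
# Sketch only: names and the Iceberg topic property are illustrative; the real
# train_pgo.py may structure and name things differently.
import json
import subprocess
import tempfile

AVRO_TOPIC_NAME = "iceberg-avro-topic"
AVRO_SCHEMA = json.dumps({
    "type": "record",
    "name": "training_record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "id", "type": "long"},
        {"name": "ts", "type": "long"},
    ],
})


def setup_schema_and_topic(topic: str, schema: str, schema_type: str) -> None:
    """Register `schema` for the topic's value subject and create the topic."""
    # Write the schema to a file with an extension rpk recognizes for the type.
    suffix = {"avro": ".avsc", "json": ".json", "protobuf": ".proto"}[schema_type]
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(schema)
        schema_path = f.name
    # Register under the TopicNameStrategy subject "<topic>-value".
    subprocess.run(
        ["rpk", "registry", "schema", "create", f"{topic}-value",
         "--schema", schema_path],
        check=True)
    # The topic config key below is an assumption about how Iceberg translation
    # is switched on for the training topic.
    subprocess.run(
        ["rpk", "topic", "create", topic,
         "-c", "redpanda.iceberg.mode=value_schema_id_prefix"],
        check=True)
```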
Iceberg is generally still a significant performance concern. Adding Avro training gives another perf bump to help with that. We do a very minimal rpk-based training run, which seems to result in good enough training coverage.
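For context, "minimal rpk-based training" can be read as something along these lines: produce a batch of identical sample records through rpk, letting it encode them against the registered schema. The helper name and flag usage below are assumptions about the flow, not quoted from the patch.

```python
import subprocess


def send_sample_messages(topic: str, payload: str, count: int = 1000) -> None:
    """Produce `count` copies of `payload` to `topic`, letting rpk encode each
    record against the schema registered for the "<topic>-value" subject."""
    records = "\n".join(payload for _ in range(count)) + "\n"
    # --schema-id=topic asks rpk to resolve the value schema via the topic name;
    # exact flag usage here is an assumption about the training flow.
    subprocess.run(
        ["rpk", "topic", "produce", topic, "--schema-id=topic"],
        input=records, text=True, check=True)
```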
Force-pushed from c5c32b6 to 92eb520.
tools/pgo_bolt/train_pgo.py (outdated)
| }""" | ||
| JSON_TOPIC_NAME = "iceberg-json-topic" | ||
| JSON_SAMPLE_PAYLOAD = json.dumps( | ||
| {"name": "hello my name is json shady", "id": 13579, "ts": 1625079045123456} |
put an array and null in there so we hit those paths?
I could add it, but it seems to make very little difference either way. E.g.: in our datalake omb test we use a message with only string fields, and just training on a single integer id field results in the same perf as what is shown above. So it seems to not make much difference. Whatever you prefer.
"So it seems to not make much difference."
The effort to do it is close to zero, so yes, I think we should. We always suffer from the problem of our tests being much narrower than real-world scenarios, so we need to augment our tests (which unfortunately are very similar between training and validation) with our judgment and guesses: imagine someone has a schema that is mostly one giant array. Then it may matter.
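For example, widening the sample payload along the lines suggested above could look like this (field names are hypothetical; the matching JSON schema would need the same additions):

```python
import json

# Hypothetical widened payload: the extra array and null fields exist only to
# give the corresponding translation code paths some training coverage.
JSON_SAMPLE_PAYLOAD = json.dumps({
    "name": "hello my name is json shady",
    "id": 13579,
    "ts": 1625079045123456,
    "tags": ["pgo", "iceberg", "training"],  # array path
    "note": None,                            # null path
})
```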
Similar to the avro training, also add a json equivalent.
Force-pushed from 92eb520 to af44bed.
Add Avro and JSON training to further help with Iceberg perf when those formats are in use.
Backports Required
Release Notes