@StephanDollberg
Member

Add Avro and JSON PGO training to further improve Iceberg performance when those formats are in use.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Contributor

Copilot AI left a comment

Pull request overview

This PR enhances PGO (Profile-Guided Optimization) training by adding Avro and JSON schema support alongside the existing Protobuf training. This helps optimize Redpanda's performance when Iceberg is used with these different serialization formats.

Changes:

  • Added Avro and JSON schema definitions with corresponding sample payloads
  • Refactored the Protobuf-specific setup function into a generic schema setup function that supports multiple formats
  • Added a new function to send messages via rpk for testing Avro and JSON schemas
  • Updated topic names to be format-specific and fixed hardcoded references
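The generic schema setup described above could look roughly like the following sketch. It is not the PR's code: the function name `build_schema_registration` and the `SCHEMA_TYPES` table are hypothetical, and it assumes schemas are registered through the Schema Registry REST endpoint `POST /subjects/{subject}/versions` under the default `<topic>-value` subject naming. The function only builds the request path and body; it does not send anything.

```python
import json

# Schema Registry "schemaType" values for the three formats discussed here.
SCHEMA_TYPES = {"avro": "AVRO", "json": "JSON", "protobuf": "PROTOBUF"}


def build_schema_registration(topic: str, fmt: str, schema_text: str) -> tuple[str, str]:
    """Return (path, body) for registering a value schema for `topic`.

    Hypothetical helper: assumes the `<topic>-value` subject naming strategy.
    """
    try:
        schema_type = SCHEMA_TYPES[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported schema format: {fmt!r}") from None
    path = f"/subjects/{topic}-value/versions"
    # The schema itself travels as a string inside the JSON request body.
    body = json.dumps({"schemaType": schema_type, "schema": schema_text})
    return path, body


if __name__ == "__main__":
    # Illustrative JSON schema, not the one used in the PR.
    example_schema = json.dumps({"type": "object", "properties": {"id": {"type": "integer"}}})
    path, body = build_schema_registration("iceberg-json-topic", "json", example_schema)
    print(path)
    print(body)
```

Keeping the format-to-`schemaType` mapping in one table means a further format would only need one new entry, which is presumably the point of making the setup function generic.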

Iceberg is generally still a massive performance concern. Adding Avro
training gives another perf bump to help with that.

We do a very minimal rpk-based training, which seems to result in good
enough training coverage.
@StephanDollberg StephanDollberg force-pushed the stephan/avro-json-training branch from c5c32b6 to 92eb520 Compare February 2, 2026 16:33
}"""
JSON_TOPIC_NAME = "iceberg-json-topic"
JSON_SAMPLE_PAYLOAD = json.dumps(
{"name": "hello my name is json shady", "id": 13579, "ts": 1625079045123456}
Member

@travisdowns travisdowns Feb 2, 2026

put an array and null in there so we hit those paths?

Member Author

I could add it, but it seems to make very little difference either way. For example, our datalake OMB test uses a message with only string fields, and training on just a single integer id field results in the same perf as shown above. So it seems to not make much difference. Whatever you prefer.

Member

"So it seems to not make much difference."

The effort to do it is close to zero, so yes I think we should. We always suffer from the problem of our tests being much narrower than real world scenarios so we need to augment our tests (which unfortunately are very similar between training and validation) with our judgment and guesses: imagine someone has a schema which is mostly one giant array. Then it may matter.
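As a rough illustration of the suggestion, the sample payload could be widened with one array and one null field, as in the sketch below. The names `JSON_SAMPLE_PAYLOAD_WIDE`, `tags`, and `note` are hypothetical and not taken from the PR; the registered JSON schema would also need matching array and nullable field definitions for the payload to validate.

```python
import json

# Hypothetical variant of the sample payload; "tags" and "note" are
# illustrative field names, not part of the original PR.
JSON_SAMPLE_PAYLOAD_WIDE = json.dumps(
    {
        "name": "hello my name is json shady",
        "id": 13579,
        "ts": 1625079045123456,
        "tags": ["a", "b", "c"],  # exercises the array path
        "note": None,  # serialized as JSON null
    }
)

if __name__ == "__main__":
    print(JSON_SAMPLE_PAYLOAD_WIDE)
```

Whether the extra fields change the measured result is exactly the open question in this thread; the sketch only shows that covering those paths costs a couple of lines.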

Similar to the Avro training, also add a JSON equivalent.
@StephanDollberg StephanDollberg force-pushed the stephan/avro-json-training branch from 92eb520 to af44bed Compare February 3, 2026 16:42