@IoannisRP
Contributor

@IoannisRP IoannisRP commented Feb 3, 2026

Fixes: CORE-15273

Introduces a tally for each rule present in the client quota store. Only rules that have active entries are considered when resolving an entity key.
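
A minimal sketch of the per-rule tally idea, not the PR's actual code. The reduced `rule` enum, the `rule_tally` class and its method names are hypothetical and chosen only for illustration; the sketch assumes the store bumps a counter when an entry is added and drops it when the entry is removed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, reduced rule set; the enum in the PR has more values.
enum class rule : std::uint8_t {
    kafka_client_default,
    kafka_client_prefix,
    kafka_client_id,
    kafka_user,
};
inline constexpr std::size_t rule_count = 4;

// Keeps one counter per rule. A rule is "active" while its counter is > 0.
class rule_tally {
public:
    void on_insert(rule r) { ++_counts[index(r)]; }

    void on_remove(rule r) {
        auto& c = _counts[index(r)];
        assert(c > 0 && "removing an entry that was never counted");
        --c;
    }

    // Mirrors a store-wide clear: every rule becomes inactive.
    void clear() { _counts.fill(0); }

    bool is_active(rule r) const { return _counts[index(r)] > 0; }

private:
    static std::size_t index(rule r) { return static_cast<std::size_t>(r); }

    std::array<std::size_t, rule_count> _counts{};
};
```

With a tally like this, checking whether a rule has any entries is a single array read, independent of how many quotas are stored.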

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings February 3, 2026 18:14
@IoannisRP IoannisRP changed the title from "[CORE-15273] kafka/server: client quota perf optmization" to "[DNM][CORE-15273] kafka/server: client quota perf optmization" Feb 3, 2026
@IoannisRP IoannisRP requested review from a team and pgellert February 3, 2026 18:15
@IoannisRP IoannisRP self-assigned this Feb 3, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR implements performance optimizations for client quota operations by adding a rule counter mechanism and refactoring the quota lookup logic.

Changes:

  • Moved client_quota_rule enum to a shared location (cluster::client_quota::rule) to enable reuse across components
  • Added rule counters to the quota store to track active quota rules and skip checking inactive ones
  • Simplified benchmark tests by removing unused quota configuration parameters
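
To illustrate how rule counters can let a lookup skip inactive rules, here is a hedged sketch. The `quota_store` class, its three-rule enum, the priority order and the `probes()` counter are all assumptions made for this example and do not reflect the PR's real types or precedence.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical three-rule subset; the real enum has more values.
enum class rule : std::uint8_t { client_default, client_id, user };

// Highest-priority rule first. The ordering is an assumption of this sketch.
inline constexpr std::array<rule, 3> rules_by_priority{
  rule::user, rule::client_id, rule::client_default};

class quota_store {
public:
    using key = std::pair<rule, std::string>;

    void set(rule r, std::string name, std::int64_t limit) {
        const bool inserted
          = _limits.insert_or_assign(key{r, std::move(name)}, limit).second;
        if (inserted) {
            ++_counts[idx(r)]; // count new entries only, not overwrites
        }
    }

    void remove(rule r, const std::string& name) {
        if (_limits.erase(key{r, name}) > 0) {
            --_counts[idx(r)];
        }
    }

    // Returns the limit of the highest-priority matching entry, if any.
    std::optional<std::int64_t>
    resolve(const std::string& user, const std::string& client_id) {
        for (rule r : rules_by_priority) {
            if (_counts[idx(r)] == 0) {
                continue; // no entries for this rule: skip the map lookup
            }
            ++_probes;
            auto it = _limits.find(key{r, name_for(r, user, client_id)});
            if (it != _limits.end()) {
                return it->second;
            }
        }
        return std::nullopt;
    }

    // Number of map lookups performed so far (for demonstration only).
    std::size_t probes() const { return _probes; }

private:
    static std::size_t idx(rule r) { return static_cast<std::size_t>(r); }

    static std::string name_for(
      rule r, const std::string& user, const std::string& client_id) {
        switch (r) {
        case rule::user:
            return user;
        case rule::client_id:
            return client_id;
        case rule::client_default:
            return ""; // default entries are stored under an empty name
        }
        return "";
    }

    std::map<key, std::int64_t> _limits;
    std::array<std::size_t, 3> _counts{};
    std::size_t _probes = 0;
};
```

In this sketch a store that only holds client-id quotas performs one map lookup per resolution instead of three, because the user and default rules have a zero count.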

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/v/kafka/server/tests/quota_manager_bench.cc | Removed unused fetch_tp field and associated benchmark test cases |
| src/v/kafka/server/quota_manager.cc | Updated references to use cluster::client_quota::all_client_quota_rules |
| src/v/kafka/server/client_quota_translator.h | Replaced local client_quota_rule definition with alias to cluster::client_quota::rule |
| src/v/kafka/server/client_quota_translator.cc | Refactored quota lookup to use rule counters for optimization and added helper functions |
| src/v/cluster/client_quota_types.h | New file defining shared rule enum and supporting constants |
| src/v/cluster/client_quota_types.cc | New file implementing stream operator for rule enum |
| src/v/cluster/client_quota_store.h | Added rule counters array and prefix group filter overloads |
| src/v/cluster/client_quota_store.cc | Implemented rule counter updates on quota set/remove/clear operations |
| src/v/cluster/BUILD | Added new client_quota_types source and header files |

        return os << "kafka_user_client_prefix";
    case rule::kafka_user_client_id:
        return os << "kafka_user_client_id";
    }

Copilot AI Feb 3, 2026

Switch statement is missing a default case or return statement at the end, which can lead to undefined behavior if an invalid enum value is passed.

Suggested change
-    }
+    }
+    return os << "unknown";
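
As a self-contained illustration of the pattern suggested here, the sketch below uses a hypothetical two-value `rule` enum and a hypothetical `render` helper. It keeps the switch free of a `default:` label, so GCC and Clang can still warn (via `-Wswitch`) when an enumerator is missing, and adds a trailing return so that control never flows off the end of the function.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical subset of the rule enum, for illustration only.
enum class rule { kafka_user_client_prefix, kafka_user_client_id };

std::ostream& operator<<(std::ostream& os, rule r) {
    switch (r) {
    case rule::kafka_user_client_prefix:
        return os << "kafka_user_client_prefix";
    case rule::kafka_user_client_id:
        return os << "kafka_user_client_id";
    }
    // Reached only for values that match no enumerator, e.g. after a bad cast.
    return os << "unknown";
}

// Small helper so the result can be inspected as a string.
std::string render(rule r) {
    std::ostringstream ss;
    ss << r;
    return ss.str();
}
```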

Comment on lines +26 to +73
rule get_rule(const entity_key& key) {
    const bool has_user = has_part<entity_key::part::user_match>(key);
    const bool has_user_default
      = has_part<entity_key::part::user_default_match>(key);
    const bool has_client_id = has_part<entity_key::part::client_id_match>(key);
    const bool has_client_prefix
      = has_part<entity_key::part::client_id_prefix_match>(key);
    const bool has_client_default
      = has_part<entity_key::part::client_id_default_match>(key);

    if (has_user) {
        if (has_client_id) {
            return rule::kafka_user_client_id;
        }
        if (has_client_prefix) {
            return rule::kafka_user_client_prefix;
        }
        if (has_client_default) {
            return rule::kafka_user_client_default;
        }
        return rule::kafka_user;
    }

    if (has_user_default) {
        if (has_client_id) {
            return rule::kafka_user_default_client_id;
        }
        if (has_client_prefix) {
            return rule::kafka_user_default_client_prefix;
        }
        if (has_client_default) {
            return rule::kafka_user_default_client_default;
        }
        return rule::kafka_user_default;
    }

    if (has_client_id) {
        return rule::kafka_client_id;
    }
    if (has_client_prefix) {
        return rule::kafka_client_prefix;
    }
    if (has_client_default) {
        return rule::kafka_client_default;
    }

    return rule::not_applicable;
}

Copilot AI Feb 3, 2026

The logic does not handle the case where both has_user and has_user_default are true, which could occur with malformed keys. Consider adding validation or asserting mutual exclusivity of these flags.
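
One possible shape for such a validation step is sketched below. The `part` bit flags and the `is_well_formed` function are hypothetical stand-ins for the real `entity_key` parts; the sketch only assumes that a key may carry at most one user-side part and at most one client-side part.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit flags standing in for the entity_key parts.
enum part : std::uint8_t {
    user = 1,
    user_default = 2,
    client_id = 4,
    client_prefix = 8,
    client_default = 16,
};

// True when zero or one bit is set.
constexpr bool at_most_one(unsigned bits) { return (bits & (bits - 1)) == 0; }

// A key is well formed when the user-side parts are mutually exclusive
// and the client-side parts are mutually exclusive.
constexpr bool is_well_formed(std::uint8_t parts) {
    const unsigned user_bits = parts & (user | user_default);
    const unsigned client_bits
      = parts & (client_id | client_prefix | client_default);
    return at_most_one(user_bits) && at_most_one(client_bits);
}

static_assert(is_well_formed(user | client_prefix));
static_assert(!is_well_formed(user | user_default));
```

A check of this kind could run as an assertion before classification, so that a malformed key is reported instead of being silently mapped to the first matching branch.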

    case client_quota_rule::kafka_user_default_client_prefix:
    case client_quota_rule::kafka_user_client_prefix:
        vassert(false, "make_entity_key should not be called on these rules");
    }

Copilot AI Feb 3, 2026

Switch statement is missing a return or vassert at the end. If an unexpected enum value is passed, the function may return with undefined behavior.

Suggested change
-    }
+    }
+    vassert(
+      false,
+      "Unknown client_quota_rule in make_entity_key: {}",
+      static_cast<int>(r));

    case client_quota_rule::kafka_user_client_default:
    case client_quota_rule::kafka_user_client_id:
        vassert(false, "make_entity_key should not be called on these rules");
    }

Copilot AI Feb 3, 2026

Switch statement is missing a return or vassert at the end to handle potential undefined behavior if an unexpected enum value is passed.

Suggested change
-    }
+    }
+    vassert(false, "Unexpected client_quota_rule in make_group_quotas: {}",
+            r);

Comment on lines 81 to 124
 tracker_key make_tracker_key(
-  const std::string_view k1_name, const std::string_view k2_name) {
-    return tracker_key{
-      std::in_place_type<std::pair<K1, K2>>, std::make_pair(k1_name, k2_name)};
+  const client_quota_rule r,
+  std::string_view user,
+  // This is either the client name or the group name
+  std::string_view client_id) {
+    switch (r) {
+    case client_quota_rule::not_applicable:
+        return tracker_key{std::in_place_type<k_not_applicable>};
+    case client_quota_rule::kafka_client_default:
+        return tracker_key{std::in_place_type<k_client_id>, client_id};
+    case client_quota_rule::kafka_client_prefix:
+        return tracker_key{std::in_place_type<k_group_name>, client_id};
+    case client_quota_rule::kafka_client_id:
+        return tracker_key{std::in_place_type<k_client_id>, client_id};
+    case client_quota_rule::kafka_user_default:
+        return tracker_key{std::in_place_type<k_user>, user};
+    case client_quota_rule::kafka_user_default_client_default:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_client_id>>,
+          std::make_pair(user, client_id)};
+    case client_quota_rule::kafka_user_default_client_prefix:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_group_name>>,
+          std::make_pair(user, client_id)};
+    case client_quota_rule::kafka_user_default_client_id:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_client_id>>,
+          std::make_pair(user, client_id)};
+    case client_quota_rule::kafka_user:
+        return tracker_key{std::in_place_type<k_user>, user};
+    case client_quota_rule::kafka_user_client_default:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_client_id>>,
+          std::make_pair(user, client_id)};
+    case client_quota_rule::kafka_user_client_prefix:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_group_name>>,
+          std::make_pair(user, client_id)};
+    case client_quota_rule::kafka_user_client_id:
+        return tracker_key{
+          std::in_place_type<std::pair<k_user, k_client_id>>,
+          std::make_pair(user, client_id)};
+    }
 }

Copilot AI Feb 3, 2026

Switch statement is missing a return or default case at the end. If an unexpected enum value is passed, the function may return with undefined behavior.
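
A reduced, standalone sketch of one way to address this for a variant-returning function. The tagged types `k_user`, `k_client_id` and `k_not_applicable` are re-declared here as simplified, hypothetical stand-ins, the enum is cut down to three values, and `std::abort()` takes the place of `vassert`, which is specific to the codebase.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Simplified stand-ins for the tagged name types used by tracker_key.
struct k_not_applicable {};
struct k_user {
    explicit k_user(std::string n) : name(std::move(n)) {}
    std::string name;
};
struct k_client_id {
    explicit k_client_id(std::string n) : name(std::move(n)) {}
    std::string name;
};
using tracker_key = std::variant<k_not_applicable, k_user, k_client_id>;

// Hypothetical three-value subset of client_quota_rule.
enum class rule { not_applicable, kafka_client_id, kafka_user };

tracker_key
make_tracker_key(rule r, std::string_view user, std::string_view client_id) {
    switch (r) {
    case rule::not_applicable:
        return tracker_key{std::in_place_type<k_not_applicable>};
    case rule::kafka_client_id:
        return tracker_key{
          std::in_place_type<k_client_id>, std::string(client_id)};
    case rule::kafka_user:
        return tracker_key{std::in_place_type<k_user>, std::string(user)};
    }
    // Every enumerator returns above. An out-of-range value terminates the
    // process here instead of flowing off the end of the function.
    std::abort();
}
```

Because `std::abort` is declared `[[noreturn]]`, compilers accept the function without a trailing return value, and the switch can still omit `default:` so that missing enumerators are flagged at compile time.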

@IoannisRP
Contributor Author

IoannisRP commented Feb 3, 2026

Results of bazel run --config=release //src/v/kafka/server/tests:quota_manager_rpbench with these changes.
These are to be compared with the baseline presented in CORE-15273.

| test | iters | runtime | allocs | tasks | inst | cycles |
| --- | --- | --- | --- | --- | --- | --- |
| throughput_group.test_quota_manager_off_shared | 35100000 | 21.28ns ± 0.00% | 1.500 | 0.000 | 342.61 | 84.4 |
| throughput_group.test_quota_manager_off_unique | 37600000 | 19.75ns ± 0.00% | 1.000 | 0.000 | 314.44 | 75.1 |
| latency_group.existing_client_produce_100_others | 125200 | 256.90ns ± 0.00% | 13.000 | 0.000 | 4126.94 | 974.5 |
| latency_group.existing_user_produce_100_others | 126400 | 231.44ns ± 0.00% | 9.000 | 0.000 | 3663.90 | 876.1 |
| latency_group.existing_user_client_produce_100_others | 106900 | 604.77ns ± 0.00% | 17.000 | 0.000 | 8636.81 | 2309.0 |
| latency_group.existing_client_fetch_100_others | 121400 | 511.83ns ± 0.00% | 26.000 | 0.000 | 8146.94 | 1964.2 |
| latency_group.existing_user_fetch_100_others | 123100 | 451.19ns ± 0.00% | 18.000 | 0.000 | 7220.90 | 1728.1 |
| latency_group.existing_user_client_fetch_100_others | 101200 | 1.15µs ± 0.00% | 34.000 | 0.000 | 17165.80 | 4434.9 |
| latency_group.new_client_produce_100_others | 118200 | 661.77ns ± 0.00% | 19.040 | 0.000 | 9682.12 | 2548.1 |
| latency_group.new_user_produce_100_others | 118900 | 650.38ns ± 0.00% | 19.040 | 0.000 | 9557.60 | 2502.2 |
| latency_group.new_user_client_produce_100_others | 96000 | 1.65µs ± 0.00% | 27.040 | 0.000 | 24085.98 | 6364.3 |
| latency_group.new_client_fetch_100_others | 115400 | 926.13ns ± 0.00% | 28.040 | 0.000 | 13248.28 | 3572.9 |
| latency_group.new_user_fetch_100_others | 116200 | 903.74ns ± 0.00% | 28.040 | 0.000 | 13118.26 | 3485.8 |
| latency_group.new_user_client_fetch_100_others | 91400 | 2.20µs ± 0.00% | 38.040 | 0.000 | 32063.80 | 8495.1 |
| latency_group.existing_client_produce_1000_others | 67500 | 255.23ns ± 0.00% | 13.000 | 0.000 | 4031.87 | 965.0 |
| latency_group.existing_client_fetch_1000_others | 66200 | 502.00ns ± 0.00% | 26.000 | 0.000 | 7956.87 | 1927.0 |
| latency_group.new_client_produce_1000_others | 65100 | 727.01ns ± 0.00% | 19.020 | 0.000 | 10030.65 | 2797.0 |
| latency_group.new_client_fetch_1000_others | 63900 | 990.72ns ± 0.00% | 28.020 | 0.000 | 13591.00 | 3819.7 |
| latency_group.existing_client_produce_10000_others | 12000 | 251.54ns ± 0.00% | 13.000 | 0.001 | 4047.34 | 965.3 |
| latency_group.existing_client_fetch_10000_others | 12000 | 497.74ns ± 0.00% | 26.000 | 0.001 | 7986.32 | 1942.9 |
| latency_group.new_client_produce_10000_others | 11900 | 615.74ns ± 0.00% | 19.000 | 0.002 | 9022.41 | 2406.3 |
| latency_group.new_client_fetch_10000_others | 11900 | 875.21ns ± 0.00% | 28.000 | 0.001 | 12584.54 | 3446.9 |
| latency_group.existing_client_produce_100_others_not_shard_0 | 99900 | 423.35ns ± 0.00% | 11.000 | 0.000 | 3913.25 | 1581.4 |
| latency_group.existing_client_fetch_100_others_not_shard_0 | 96000 | 842.40ns ± 0.00% | 22.000 | 0.000 | 7928.63 | 3209.1 |
| latency_group.new_client_produce_100_others_not_shard_0 | 82200 | 2.47µs ± 0.00% | 8.020 | 3.000 | 20426.52 | 9471.7 |
| latency_group.new_client_fetch_100_others_not_shard_0 | 79000 | 2.96µs ± 0.00% | 15.020 | 3.000 | 24923.59 | 11376.7 |
| latency_group.existing_client_produce_1000_others_not_shard_0 | 29400 | 408.37ns ± 0.00% | 11.000 | 0.000 | 3767.49 | 1521.4 |
| latency_group.existing_client_fetch_1000_others_not_shard_0 | 28900 | 825.12ns ± 0.00% | 22.000 | 0.000 | 7784.65 | 3142.5 |
| latency_group.new_client_produce_1000_others_not_shard_0 | 27500 | 2.55µs ± 0.00% | 8.010 | 3.000 | 21370.08 | 9830.9 |
| latency_group.new_client_fetch_1000_others_not_shard_0 | 27100 | 3.03µs ± 0.00% | 15.010 | 3.004 | 25754.44 | 11688.6 |
| latency_group.existing_client_produce_10000_others_not_shard_0 | 3900 | 416.92ns ± 0.00% | 11.000 | 0.005 | 3808.92 | 1549.8 |
| latency_group.existing_client_fetch_10000_others_not_shard_0 | 3900 | 774.98ns ± 0.00% | 22.000 | 0.008 | 7315.94 | 2951.5 |
| latency_group.new_client_produce_10000_others_not_shard_0 | 3900 | 2.33µs ± 0.00% | 8.000 | 3.000 | 19266.26 | 8954.3 |
| latency_group.new_client_fetch_10000_others_not_shard_0 | 3800 | 2.84µs ± 0.00% | 15.000 | 3.001 | 23654.02 | 10928.3 |
| latency_group.default_configs_produce_worst | 143600 | 53.85ns ± 0.00% | 1.000 | 0.000 | 350.70 | 144.1 |
| latency_group.default_configs_fetch_worst | 143100 | 73.71ns ± 0.00% | 2.000 | 0.000 | 539.68 | 219.0 |

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 3, 2026

Retry command for Build#80033

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_group_consume_rate_throttle_mechanism
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_throttling_ms_enforcement_is_per_connection
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_response_throttle_mechanism_applies_to_next_request
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_response_and_produce_throttle_mechanism
tests/rptest/tests/cluster_quota_test.py::ClusterQuotaPartitionMutationTest.test_partition_throttle_mechanism
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_quota_metrics
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_group_produce_rate_throttle_mechanism
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest.test_client_response_throttle_mechanism

@vbotbuildovich
Collaborator

CI test results

test results on build#80033
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClusterQuotaPartitionMutationTest | test_partition_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b1b-4909-8487-908e36a3e5dd | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0050, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterQuotaPartitionMutationTest&test_method=test_partition_throttle_mechanism |
| ClusterQuotaPartitionMutationTest | test_partition_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe7-4a85-a6fe-381e61031866 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0050, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterQuotaPartitionMutationTest&test_method=test_partition_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b13-480e-a823-d3b9da28ffec | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fde-4db0-9729-d220d22c1bae | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_produce_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b14-4bcc-9066-a6775baf6f39 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0026, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_produce_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fdf-462c-871e-491f2f9485c3 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0026, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_quota_metrics | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b15-425c-a473-9dc3a0758cea | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_quota_metrics |
| ClusterRateQuotaTest | test_client_quota_metrics | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe0-4a8b-ac8a-f031d62595d0 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_quota_metrics |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b17-4919-b08e-9e99ff71197d | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe1-4510-9c29-e5d211e62006 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b17-42c5-9860-8d988222d0a5 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe2-4aba-9e2b-e56017e12b4b | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b18-4034-a489-187f12732e50 | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe3-4933-b2b4-e943f6c9510b | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| ClusterRateQuotaTest | test_throttling_ms_enforcement_is_per_connection | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cb-1b19-443b-a656-05043e8e96bb | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_throttling_ms_enforcement_is_per_connection |
| ClusterRateQuotaTest | test_throttling_ms_enforcement_is_per_connection | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe4-4598-a11c-c1cf5c369aab | FAIL | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_throttling_ms_enforcement_is_per_connection |
| NodePostRestartProbeTest | post_restart_probe_test | null | integration | https://buildkite.com/redpanda/redpanda/builds/80033#019c24cf-7fe3-4933-b2b4-e943f6c9510b | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0086, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodePostRestartProbeTest&test_method=post_restart_probe_test |

@IoannisRP IoannisRP force-pushed the CORE-15273/user-quotas-perf branch from 2e1403b to 1dc3751 Compare February 4, 2026 13:41
@IoannisRP
Contributor Author

Given the complexity of the optimization, it was decided to accept the performance hit instead.

@IoannisRP IoannisRP closed this Feb 4, 2026