
coordinated compaction improvements #29486

Merged
bashtanov merged 3 commits into redpanda-data:dev from bashtanov:coco-improvements on Feb 4, 2026

Conversation

@bashtanov (Contributor) commented Feb 2, 2026

https://redpandadata.atlassian.net/browse/CORE-15445

An update to group offsets may be lost due to leadership transfers.

We previously tolerated this: partitions are typically written to and compacted, so coordination offsets keep moving and a new update eventually gets through.

However, that makes the behaviour difficult to test. Fix the functionality and the test.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings February 2, 2026 15:00
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes a bug where group offset updates in coordinated compaction could be lost during leadership transfers. The core issue was that the compaction coordinator did not properly handle leadership changes, potentially losing MTRO/MXRO updates that were in flight when leadership transferred.

Changes:

  • Modified leadership change handling to track the leadership term and force replication of group offsets when leadership changes
  • Updated test to handle transient leadership transfer failures and retry replication/leadership operations
  • Removed the flaky = True flag from the test build configuration after fixing the underlying race condition
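
For orientation, here is a minimal, self-contained sketch of the term-tracking approach described in the first two bullets above. The member names _leader_term_id and _need_force_update appear in the diff hunks quoted later in this conversation, but the class, method names, and types here are simplified stand-ins, not the actual redpanda implementation.

#include <cstdint>
#include <optional>

using term_id = int64_t; // stand-in for model::term_id

class coordinator_sketch {
public:
    // Called when this replica gains leadership in `new_term`, or loses it
    // (empty optional).
    void on_leadership_change(std::optional<term_id> new_term) {
        if (new_term) {
            // New term: force one distribution of group offsets even if they
            // have not moved, so an update lost in flight during the transfer
            // is always re-sent by the new leader.
            _need_force_update = true;
            _leader_term_id = new_term;
        } else {
            _leader_term_id.reset(); // no longer leader, stop distributing
        }
    }

    // Called when recalculated group offsets are available for distribution.
    void maybe_distribute_group_offsets(bool mtro_updated, bool mxro_updated) {
        if (
          _leader_term_id
          && (mtro_updated || mxro_updated || _need_force_update)) {
            // replicate group offsets to followers (omitted in this sketch)
            _need_force_update = false;
        }
    }

private:
    std::optional<term_id> _leader_term_id; // set only while leader
    bool _need_force_update{false};         // true until first send in a term
};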

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/v/raft/compaction_coordinator.h: Added term tracking and force update flag for leadership changes
  • src/v/raft/compaction_coordinator.cc: Implemented term-based leadership tracking and forced offset distribution on leadership change
  • src/v/raft/consensus.cc: Updated leadership notification to pass term ID to compaction coordinator
  • src/v/raft/tests/coordinated_compaction_test.cc: Enhanced test robustness with retry logic for leadership transfers and replication
  • src/v/raft/tests/BUILD: Removed flaky test flag

// we will retry later anyway. Retries may be cancelled
// a) on leadership change, where the flag will be reset back to true;
// or
// b) on a further call to `on_group_offsets_update` due to
Copilot AI commented Feb 2, 2026

Corrected the reference to 'on_group_offsets_update' (which has been removed) to 'update_group_offsets' in the comment.

Suggested change
// b) on a further call to `on_group_offsets_update` due to
// b) on a further call to `update_group_offsets` due to

"compaction coordinator planning to distribute group offsets in {}",
group_offsets_send_delay);
Copilot AI commented Feb 2, 2026

The log message is missing a unit for the delay. Consider changing to 'planning to distribute group offsets in {} seconds' or using appropriate formatting for the duration.

Suggested change
"compaction coordinator planning to distribute group offsets in {}",
group_offsets_send_delay);
"compaction coordinator planning to distribute group offsets in {} ms",
std::chrono::duration_cast<std::chrono::milliseconds>(
group_offsets_send_delay)
.count());

Contributor Author

fmt prints (or rather should print) durations in human-readable form
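
For reference, a small standalone example of what this reply relies on: with fmt's chrono support, a std::chrono duration is printed with its unit suffix, so the original '{}' placeholder already yields a readable value. This is plain {fmt} usage, not the redpanda logging macro.

#include <chrono>
#include <fmt/chrono.h> // enables formatting of std::chrono durations
#include <fmt/core.h>

int main() {
    using namespace std::chrono_literals;
    // Durations are rendered with a unit suffix, e.g. "30s" and "1500ms",
    // so no manual duration_cast is needed for the log line.
    fmt::print("compaction coordinator planning to distribute group offsets in {}\n", 30s);
    fmt::print("compaction coordinator planning to distribute group offsets in {}\n", 1500ms);
}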

@vbotbuildovich (Collaborator) commented Feb 2, 2026

CI test results

test results on build #79952

  WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration)
    status: FLAKY (passed 17/21). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.0920, p0=0.2777, reject_threshold=0.0100; adj_baseline=0.2514, p1=0.2210, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/79952#019c1efa-b4dd-438f-9021-44504ffa0e92
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

test results on build #79978

  QuotaManagementUpgradeTest.test_upgrade (integration)
    status: FLAKY (passed 10/11). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.0632, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1778, p1=0.1413, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/79978#019c2071-3851-442e-8d2a-b214023a8c78
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=QuotaManagementUpgradeTest&test_method=test_upgrade

test results on build #80040

  ConsumerGroupBalancingTest.test_coordinator_nodes_balance (integration)
    status: FLAKY (passed 10/11). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.0049, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/80040#019c2576-fee5-4d6e-982f-144b23009242
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ConsumerGroupBalancingTest&test_method=test_coordinator_nodes_balance

  RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": true} (integration)
    status: FLAKY (passed 10/11). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.0057, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/80040#019c252d-3f4a-423e-9fee-0b11ae2e87b0
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test

  ScalingUpTest.test_scaling_up_with_recovered_topic (integration)
    status: FLAKY (passed 10/11). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.0157, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/80040#019c2576-fee5-4d6e-982f-144b23009242
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_scaling_up_with_recovered_topic

  WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration)
    status: FLAKY (passed 9/11). Test PASSES after retries. No significant increase in flaky rate
      (baseline=0.1005, p0=0.6532, reject_threshold=0.0100; adj_baseline=0.2722, p1=0.1977, trust_threshold=0.5000)
    job: https://buildkite.com/redpanda/redpanda/builds/80040#019c2576-fee5-4d6e-982f-144b23009242
    history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

void compaction_coordinator::on_group_offsets_update() {
if (_is_leader) {
if (
_leader_term_id && (mtro_updated || mxro_updated || _need_force_update)) {
Contributor

How could this get called with the right branch of the 'and' false? I.e., can this be called with mtro_updated false, mxro_updated false, and _need_force_update false?

Contributor Author

E.g. in the following situation. We are the leader and have been for a while in the current term. The timer triggered collect_all_replica_offsets, which called update_local_replica_offsets. Local offsets have increased, so it called recalculate_group_offsets. However, the local offsets were not the lowest, as some replica is holding back compaction, so new_mtro and new_mxro remained the same. update_group_offsets is then called with the same values as at some point before: mtro_updated and mxro_updated are both false, and _need_force_update is false as well, as this is not the first update in the current term.
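
Condensed into code, the scenario described in this reply looks roughly like the sketch below. It assumes, as the reply implies, that the group offsets are taken as the minimum across per-replica offsets; the names and structure are illustrative stand-ins, not the real compaction_coordinator code.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct offsets {
    int64_t mtro;
    int64_t mxro;
};

struct coordinator_scenario {
    std::vector<offsets> replica_offsets; // one entry per replica
    offsets last_sent{};                  // distributed earlier in this term
    bool need_force_update{false};        // false: not the first update this term

    void on_timer() {
        // Recompute the group offsets as the minimum across replicas. The
        // local replica has advanced, but a replica that is holding back
        // compaction still provides the minimum, so the result equals what
        // was already distributed.
        offsets recalculated{
          std::numeric_limits<int64_t>::max(),
          std::numeric_limits<int64_t>::max()};
        for (const auto& r : replica_offsets) {
            recalculated.mtro = std::min(recalculated.mtro, r.mtro);
            recalculated.mxro = std::min(recalculated.mxro, r.mxro);
        }
        bool mtro_updated = recalculated.mtro != last_sent.mtro;
        bool mxro_updated = recalculated.mxro != last_sent.mxro;
        // All three conditions can be false here, which is exactly the case
        // the review question asks about; the update is then skipped.
        if (mtro_updated || mxro_updated || need_force_update) {
            last_sent = recalculated; // replicate to followers (omitted)
        }
    }
};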

auto last_data_offset = r.value().last_offset;
last_data_offsets.push_back(last_data_offset);
model::offset last_data_offset;
while (true) {
Contributor

nit: while true without a loop-global timeout makes me nervous

Contributor Author

I'll change it into a for (int attempt = 0; attempt < 5; ++attempt)
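
As a standalone illustration of the bounded-retry shape proposed here (five attempts instead of while (true)): the real test runs on seastar futures and retries on raft::errc::not_leader, so this plain-C++ helper is only a sketch.

#include <chrono>
#include <thread>

// Retry an operation a fixed number of times with a small backoff, returning
// an empty result if every attempt fails so the caller can assert on it.
// `attempt_once` is expected to return a std::optional-like value.
template <typename Func>
auto retry_bounded(
  Func attempt_once,
  int max_attempts = 5,
  std::chrono::milliseconds backoff = std::chrono::milliseconds{500})
  -> decltype(attempt_once()) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (auto r = attempt_once()) {
            return r; // e.g. replication succeeded on the current leader
        }
        // e.g. the request hit a node that just lost leadership; back off
        // and let a new leader settle before retrying.
        std::this_thread::sleep_for(backoff);
    }
    return {}; // all attempts failed; the caller decides how to fail the test
}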

node_id,
committed_offset);
return committed_offset == last_data_offset;
return committed_offset >= last_data_offset;
Contributor

What is the reason behind this one?

Contributor Author

Due to a leadership transfer, a configuration batch may have been replicated after the data.
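
In other words, the test's wait condition has to be the relaxed comparison shown in the diff above. A tiny standalone illustration, with int64_t standing in for model::offset:

#include <cstdint>

// After a leadership transfer the new leader appends a configuration batch,
// so the committed offset can legitimately land past the last data offset;
// waiting for strict equality could therefore wait forever.
bool caught_up(int64_t committed_offset, int64_t last_data_offset) {
    return committed_offset >= last_data_offset;
}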

joe-redpanda previously approved these changes Feb 2, 2026

joe-redpanda (Contributor) left a comment

Some questions, some nits; overall it looks good.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

ASSERT_FALSE_CORO(r.has_error());
auto last_data_offset = r.value().last_offset;
last_data_offsets.push_back(last_data_offset);
model::offset last_data_offset;
Copilot AI commented Feb 2, 2026

The variable last_data_offset is uninitialized and will remain uninitialized if all 5 retry attempts fail (line 55-70). After the loop completes, it's used unconditionally at lines 71-75 and 99, which will result in undefined behavior. Initialize it to a sentinel value or add an assertion after the retry loop to ensure it was set.
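
A self-contained sketch of the defensive pattern this comment suggests, with int64_t standing in for model::offset and try_replicate as a hypothetical stand-in for a single replication attempt; it is not the test's actual code.

#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Keep the offset in an optional so a retry loop that never succeeds is
// caught by an assertion instead of silently reading an uninitialized value.
int64_t replicate_and_get_last_offset(
  const std::function<std::optional<int64_t>()>& try_replicate) {
    std::optional<int64_t> last_data_offset; // empty until an attempt succeeds
    for (int attempt = 0; attempt < 5 && !last_data_offset; ++attempt) {
        last_data_offset = try_replicate();
    }
    assert(last_data_offset.has_value() && "replication never succeeded");
    return *last_data_offset;
}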

_need_force_update = true;
}
_leader_term_id = {new_term};
arm_timer_if_needed(true);
Contributor

Is there a risk of arming the timer twice if the same replica is elected leader back to back?

An update to group offsets may be lost due to leadership transfers. We
would previously tolerate it: partitions are typically written to and
compacted, so coordination offsets often move and a new update will get
through. Change it so that such updates are never lost:
- Store the current leadership term.
- In each term, have the leader distribute group offsets to followers
regardless of whether they changed.
- When replica offsets are updated together with group updates, do not
recalculate group offsets. This avoids pointless updates.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

void compaction_coordinator::recalculate_group_offsets() {
vassert(
_is_leader, "only leader can recalculate max tombstone remove offset");
vassert(_leader_term_id, "only leader can recalculate group offsets");
Copilot AI commented Feb 3, 2026

Using _leader_term_id (an std::optional<model::term_id>) directly in a boolean context is incorrect. The assertion should check _leader_term_id.has_value() to properly verify whether there is a valid leader term ID.

Suggested change
vassert(_leader_term_id, "only leader can recalculate group offsets");
vassert(
_leader_term_id.has_value(),
"only leader can recalculate group offsets");

Contributor Author

No, it's the same.
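
The reply is right as far as the language is concerned: std::optional's operator bool is specified to return has_value(), so the two spellings check the same thing. A minimal standalone demonstration:

#include <optional>

int main() {
    std::optional<int> leader_term_id; // stand-in for _leader_term_id
    bool in_boolean_context = static_cast<bool>(leader_term_id); // what the vassert condition evaluates
    bool with_has_value = leader_term_id.has_value();
    // Both are false here, and they are equal for any state of the optional.
    return in_boolean_context == with_has_value ? 0 : 1;
}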

joe-redpanda previously approved these changes Feb 3, 2026
bharathv previously approved these changes Feb 3, 2026
Make the test resilient to leader transfers, which may happen under load:
- increase timeouts;
- check for the current leader;
- retry on errc::not_leader;
- allow for configuration batch offsets.

Also add a test for no lost updates after a leadership transfer.
bashtanov merged commit 774fa2c into redpanda-data:dev on Feb 4, 2026
19 checks passed
@vbotbuildovich (Collaborator)

/backport v25.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v25.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29486-v25.3.x-725 remotes/upstream/v25.3.x
git cherry-pick -x 386a4885f4 e6fc8df2af 00b2afb3e3

Workflow run logs.
