
tests/dt: deflake WriteCachingFailureInjectionE2ETest.test_crash_all #29452

Closed
bharathv wants to merge 1 commit into redpanda-data:dev from bharathv:fix_wc_crash_test

Conversation

@bharathv
Contributor

@bharathv bharathv commented Jan 29, 2026

The problem is that sometimes the consumer doesn’t realize there was
data loss: another batch can get produced before it notices that the
offsets rolled back.

The typical happy path is:

loop:
  produce(1500)
  total_records_produced += 1500
  wait_for_consumption(total_records_produced)
  crash_restart_all_brokers()

In the flaky failures, the test doesn’t wait for the consumer to notice
that the offsets rolled back; it immediately produces another batch of
1500 messages. The consumer sees the epoch bump, assumes its offsets are
still valid, and tries to resume from offset 1500, but there’s no data
there, so it just hangs and the total consumed count never increases.
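
The fix (summarized in the Copilot review below) adds a synchronization step after each crash/restart: the next batch is only produced once the consumer validator has registered new lost offsets. A minimal sketch of the resulting loop, assuming ducktape’s wait_until; get_lost_offsets(), consumer.validator.lost_offsets, consumer.total_consumed() and cluster.crash_restart_all_brokers() are illustrative names, not necessarily the exact ones used in the test:

from ducktape.utils.util import wait_until


def get_lost_offsets(consumer):
    # Hypothetical accessor: snapshot of the lost-offset information the
    # consumer validator reports (real attribute name may differ).
    return dict(consumer.validator.lost_offsets)


def run_crash_cycles(producer, consumer, cluster, cycles, batch=1500):
    total_produced = 0
    prev_lost_offsets = get_lost_offsets(consumer)
    for _ in range(cycles):
        producer.produce(batch)                        # produce(1500)
        total_produced += batch
        wait_until(lambda: consumer.total_consumed() >= total_produced,
                   timeout_sec=120, backoff_sec=1,
                   err_msg="consumer did not catch up")  # wait_for_consumption(...)
        cluster.crash_restart_all_brokers()            # existing crash/restart step (assumed)
        # New step: block until the consumer has registered the data loss
        # from this crash before producing the next batch.
        wait_until(lambda: get_lost_offsets(consumer) != prev_lost_offsets,
                   timeout_sec=60, backoff_sec=1,
                   err_msg="consumer never observed the offset rollback")
        prev_lost_offsets = get_lost_offsets(consumer)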

Fixes: https://redpandadata.atlassian.net/browse/CORE-13458

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@bharathv
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-units
dt-repeat=100
rp_repo=nightly
rp_version=latest
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=100
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv bharathv marked this pull request as ready for review January 29, 2026 01:47
Copilot AI review requested due to automatic review settings January 29, 2026 01:47
@bharathv
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-units
dt-repeat=200
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv bharathv requested a review from nvartolomei January 29, 2026 01:48
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a flaky test (WriteCachingFailureInjectionE2ETest.test_crash_all) by addressing a race between producing new messages and the consumer detecting the offset rollback after broker crashes.

Changes:

  • Adds synchronization to wait for the consumer to observe lost offsets after each crash/restart cycle before producing new messages
  • Introduces a helper function get_lost_offsets() to retrieve lost offset information from the consumer validator
  • Adds prev_lost_offsets tracking to detect when the consumer has registered new data loss

@nvartolomei
Contributor

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

@bharathv
Contributor Author

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

thanks for the pointer, let me 👀

@bharathv
Contributor Author

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

I think I understand what’s going on. I agree that the change in the returned epoch is what caused this to regress in the first place.

The current test fix masks the issue, because the new wait condition in the PR waits until the client actually detects a truncation before sending another round of messages.

In practice, this forces the broker to eventually return an offset_out_of_range error, since there’s no new data (the fetch tries to read from 1500 while the local start offset is 0). That error then triggers an offset_for_leader_epoch request, which resets both the epoch and the offset. Before the regression, this flow worked because fenced_leader_epoch kicked off the truncation detection logic.

I think this race can still technically happen even after Andrew’s fix. It’s possible for data to be produced before the client detects the truncation, and by the time it does (via fenced_leader_epoch), the offsets already line up. At that point, it’s effectively the same as a normal leadership change, and the test could still hit the same timeout IIUC.
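
For illustration, a heavily simplified, hypothetical sketch of the client-side handling described above (not the actual consumer implementation; offset_for_leader_epoch(), report_truncation() and the other names are made up):

def handle_fetch_error(state, error):
    # Hypothetical sketch of KIP-320 style truncation detection.
    if error == "offset_out_of_range":
        # No data at the cached position (e.g. fetch at 1500 while the
        # local start offset is 0). Ask the broker for the end offset of
        # the last leader epoch this consumer saw.
        end_offset = offset_for_leader_epoch(state.partition, state.last_epoch)
        if end_offset < state.position:
            # The log was truncated behind us: record the lost range and
            # rewind; this is what the test's new wait condition keys off.
            report_truncation(state.partition, end_offset, state.position)
        state.position = end_offset
    elif error == "fenced_leader_epoch":
        # Before the regression, this path ran the same validation after
        # every epoch bump, so truncation was detected without first
        # waiting for an offset_out_of_range round trip.
        validate_position_for_new_epoch(state)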

@bharathv
Contributor Author

bharathv commented Feb 4, 2026

I think this race can still technically happen even after Andrew’s fix. It’s possible for data to be produced before the client detects the truncation, and by the time it does (via fenced_leader_epoch), the offsets already line up

This seems unlikely with KIP-320; let’s follow up on #28618.

@bharathv bharathv closed this Feb 4, 2026