
tests/dt: deflake WriteCachingFailureInjectionE2ETest.test_crash_all #29452

Closed
bharathv wants to merge 1 commit into redpanda-data:dev from bharathv:fix_wc_crash_test

Conversation

@bharathv
Contributor

@bharathv bharathv commented Jan 29, 2026

The problem is that sometimes the consumer doesn’t realize there was
data loss: another batch can get produced before it notices that the
offsets rolled back.

The typical happy path is:

loop:
  produce(1500)
  total_records_produced += 1500
  wait_for_consumption(total_records_produced)
  crash_restart_all_brokers()

In the flaky failures, the test doesn’t wait for the consumer to notice
that the offsets rolled back; it immediately produces another batch of
1500 messages. The consumer sees the epoch bump, assumes its offsets are
still valid, and tries to resume from offset 1500, but there’s no data
there, so it just hangs and the total consumed count never increases.
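
The fix (summarized in the Copilot review below) adds a synchronization step after each crash/restart: the next batch is only produced once the consumer validator has registered new lost offsets. A minimal sketch of the resulting loop, assuming ducktape’s wait_until; get_lost_offsets(), consumer.validator.lost_offsets, consumer.total_consumed() and cluster.crash_restart_all_brokers() are illustrative names, not necessarily the exact ones used in the test:

from ducktape.utils.util import wait_until


def get_lost_offsets(consumer):
    # Hypothetical accessor: snapshot of the lost-offset information the
    # consumer validator reports (real attribute name may differ).
    return dict(consumer.validator.lost_offsets)


def run_crash_cycles(producer, consumer, cluster, cycles, batch=1500):
    total_produced = 0
    prev_lost_offsets = get_lost_offsets(consumer)
    for _ in range(cycles):
        producer.produce(batch)                        # produce(1500)
        total_produced += batch
        wait_until(lambda: consumer.total_consumed() >= total_produced,
                   timeout_sec=120, backoff_sec=1,
                   err_msg="consumer did not catch up")  # wait_for_consumption(...)
        cluster.crash_restart_all_brokers()            # existing crash/restart step (assumed)
        # New step: block until the consumer has registered the data loss
        # from this crash before producing the next batch.
        wait_until(lambda: get_lost_offsets(consumer) != prev_lost_offsets,
                   timeout_sec=60, backoff_sec=1,
                   err_msg="consumer never observed the offset rollback")
        prev_lost_offsets = get_lost_offsets(consumer)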

Fixes: https://redpandadata.atlassian.net/browse/CORE-13458

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@bharathv
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-units
dt-repeat=100
rp_repo=nightly
rp_version=latest
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=100
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv bharathv marked this pull request as ready for review January 29, 2026 01:47
Copilot AI review requested due to automatic review settings January 29, 2026 01:47
@bharathv
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-units
dt-repeat=200
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all

@bharathv bharathv requested a review from nvartolomei January 29, 2026 01:48
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a flaky test (WriteCachingFailureInjectionE2ETest.test_crash_all) by addressing a race between producing new messages and the consumer detecting the offset rollback after broker crashes.

Changes:

  • Adds synchronization to wait for the consumer to observe lost offsets after each crash/restart cycle before producing new messages
  • Introduces a helper function get_lost_offsets() to retrieve lost offset information from the consumer validator
  • Adds prev_lost_offsets tracking to detect when the consumer has registered new data loss

@nvartolomei
Contributor

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

@bharathv
Contributor Author

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

thanks for the pointer, let me 👀

@bharathv
Contributor Author

@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test is correct AFAIK.

I think I understand what’s going on. I agree that the change in the returned epoch is what caused this to regress in the first place.

The current test fix masks the issue, because the new wait condition in the PR waits until the client actually detects a truncation before sending another round of messages.

In practice, this forces the broker to eventually return an offset_out_of_range error, since there’s no new data (the fetch tries to read from 1500 while the local start offset is 0). That error then triggers an offset_for_leader_epoch request, which resets both the epoch and the offset. Before the regression, this flow worked because fenced_leader_epoch kicked off the truncation detection logic.

I think this race can still technically happen even after Andrew’s fix. It’s possible for data to be produced before the client detects the truncation, and by the time it does (via fenced_leader_epoch), the offsets already line up. At that point, it’s effectively the same as a normal leadership change, and the test could still hit the same timeout IIUC.
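
For illustration, a heavily simplified, hypothetical sketch of the client-side handling described above (not the actual consumer implementation; offset_for_leader_epoch(), report_truncation() and the other names are made up):

def handle_fetch_error(state, error):
    # Hypothetical sketch of KIP-320 style truncation detection.
    if error == "offset_out_of_range":
        # No data at the cached position (e.g. fetch at 1500 while the
        # local start offset is 0). Ask the broker for the end offset of
        # the last leader epoch this consumer saw.
        end_offset = offset_for_leader_epoch(state.partition, state.last_epoch)
        if end_offset < state.position:
            # The log was truncated behind us: record the lost range and
            # rewind; this is what the test's new wait condition keys off.
            report_truncation(state.partition, end_offset, state.position)
        state.position = end_offset
    elif error == "fenced_leader_epoch":
        # Before the regression, this path ran the same validation after
        # every epoch bump, so truncation was detected without first
        # waiting for an offset_out_of_range round trip.
        validate_position_for_new_epoch(state)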

@bharathv
Contributor Author

bharathv commented Feb 4, 2026

I think this race can still technically happen even after Andrew’s fix. It’s possible for data to be produced before the client detects the truncation, and by the time it does (via fenced_leader_epoch), the offsets already line up

This seems unlikely with KIP-320; let’s follow up on #28618.

@bharathv bharathv closed this Feb 4, 2026