tests/dt: deflake WriteCachingFailureInjectionE2ETest.test_crash_all#29452
bharathv wants to merge 1 commit into redpanda-data:dev from …
Conversation
/ci-repeat 1
/ci-repeat 1
The problem is that sometimes the consumer doesn't realize there was data loss. Data might get produced before it notices that the offsets rolled back.

The typical happy path is a loop (see the Python sketch below):
- produce(1500)
- total_records_produced += 1500
- wait_for_consumption(total_records_produced)
- crash_restart_all_brokers()

In the flaky failures, the test doesn't wait for the consumer to notice that the offsets rolled back. It immediately produces another batch of 1500 messages. The consumer sees the epoch bump, assumes the offsets are still valid, and tries to resume from offset 1500, but there's no data there, so it just hangs and the total consumed count never increases.
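A minimal Python sketch of that happy-path loop. The helpers `produce`, `wait_for_consumption`, and `crash_restart_all_brokers` are taken from the description above, not from the actual test code, so they are injected as parameters here:

```python
from typing import Callable

def run_crash_loop(produce: Callable[[int], None],
                   wait_for_consumption: Callable[[int], None],
                   crash_restart_all_brokers: Callable[[], None],
                   iterations: int,
                   batch_size: int = 1500) -> int:
    """Happy-path loop from the description: produce a batch, wait until the
    consumer has caught up, then hard-crash and restart every broker (which,
    with write caching enabled, may lose the unflushed tail of the log)."""
    total_records_produced = 0
    for _ in range(iterations):
        produce(batch_size)
        total_records_produced += batch_size
        wait_for_consumption(total_records_produced)
        crash_restart_all_brokers()
    return total_records_produced
```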
force-pushed from 6643d6c to dbbeb1a
/ci-repeat 1
Pull request overview
This PR fixes a flaky test (WriteCachingFailureInjectionE2ETest.test_crash_all) by addressing a race condition between producing messages and consumer offset awareness after broker crashes.
Changes:
- Adds synchronization to wait for the consumer to observe lost offsets after each crash/restart cycle before producing new messages
- Introduces a helper function `get_lost_offsets()` to retrieve lost offset information from the consumer validator
- Adds `prev_lost_offsets` tracking to detect when the consumer has registered new data loss (see the sketch below)
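A rough sketch of what that synchronization could look like, assuming ducktape's `wait_until` helper; `get_lost_offsets()` and `prev_lost_offsets` are the names from the summary above, treated here as an opaque callable and snapshot, so the PR's actual implementation may differ:

```python
from ducktape.utils.util import wait_until

def wait_for_new_lost_offsets(get_lost_offsets, prev_lost_offsets, timeout_sec=60):
    """Block until the consumer validator reports a lost-offset set that
    differs from the previous snapshot, i.e. until the consumer has actually
    noticed that offsets rolled back after the crash/restart."""
    wait_until(
        lambda: get_lost_offsets() != prev_lost_offsets,
        timeout_sec=timeout_sec,
        backoff_sec=1,
        err_msg="consumer never observed the offsets rolling back",
    )
    return get_lost_offsets()
```

The loop would call this after each `crash_restart_all_brokers()` and only produce the next batch once it returns, updating `prev_lost_offsets` with the returned snapshot.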
@bharathv I know this test is flaky because of a bug introduced earlier that is still not fixed: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1759413141157599. The test itself is correct afaik.
thanks for the pointer, let me 👀
I think I understand what's going on. I agree that the change in the returned epoch is what caused this to regress in the first place. The current test fix masks the issue, because the new wait condition in the PR waits until the client actually detects a truncation before sending another round of messages. In practice, this forces the broker to eventually return an …

I think this race can still technically happen even after Andrew's fix. It's possible for data to be produced before the client detects the truncation, and by the time it does (via …)
This seems unlikely with KIP-320; let's follow up in #28618.
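For context, a conceptual Python sketch of the KIP-320-style truncation check being discussed. This is not Redpanda or client-library code; `end_offset_for_epoch` stands in for whatever answers the OffsetsForLeaderEpoch-style query on the broker:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FetchPosition:
    offset: int                   # next offset the consumer wants to fetch
    last_seen_leader_epoch: int   # leader epoch of the last record it consumed

def detect_truncation(position: FetchPosition,
                      end_offset_for_epoch: Callable[[int], int]) -> bool:
    """Ask the broker for the end offset of the epoch we last fetched from.
    If that end offset is behind our position, the log was truncated
    underneath us and the consumer must reset rather than resume blindly."""
    epoch_end = end_offset_for_epoch(position.last_seen_leader_epoch)
    return epoch_end < position.offset
```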
The problem is that sometimes the consumer doesn’t realize there was
a data loss. Data might get produced before it notices that the offsets
rolled back.
Typical happy path is

    loop:
      produce(1500)
      total_records_produced += 1500
      wait_for_consumption(total_records_produced)
      crash_restart_all_brokers()
In the flaky failures, the test doesn’t wait for the consumer to notice
that the offsets rolled back. It immediately produces another batch of
1500 messages. Consumer sees the epoch bump, assumes the offsets are
still valid, and tries to resume from offset 1500, but there’s no data
there, so it just hangs and the total consumed count never increases.
Fixes: https://redpandadata.atlassian.net/browse/CORE-13458
Backports Required
Release Notes