storage: in remap, when since is empty, suspend instead of panic'ing #31226
Conversation
What would be a good way to test this? Should I try just creating sources, using them, and then deleting them?
The case we observed in the incident is the source being dropped right after creating it; I think that's the most promising path. It also helps when the cluster is really busy, so that it doesn't get around to rendering the source in time but is delayed until after the source is already dropped.
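A reproduction along those lines might look like the following testdrive-style sketch. This is hypothetical, not the actual test from #31243; the connection and source names are made up:

```
# Hypothetical sketch of the reproduction idea: create a source and drop it
# immediately, so that a busy cluster only gets around to rendering the
# ingestion after the source (and its shards) are already gone.

> CREATE SOURCE race_source
  FROM KAFKA CONNECTION kafka_conn (TOPIC 'testdrive-race-${testdrive.seed}')
  FORMAT BYTES

> DROP SOURCE race_source
```

To widen the race window, the cluster can be kept busy with other dataflows so the render happens late.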
As the comment describes, this is a race condition that is expected to happen and it's better to suspend rather than bring down the whole cluster, which causes pain for customers/the oncall.
force-pushed from df33a47 to 9442ff2
I have a simple testdrive test with which I'm trying to reproduce this, but I ran into another panic instead:
I'll experiment around a bit with making it reproduce more reliably and open a separate PR.
Inspired by MaterializeInc#31226, but found another panic instead:

```
thread 'timely:work-0' panicked at src/storage/src/render/sources.rs:223:18:
resuming an already finished ingestion
   5: core::panicking::panic_fmt
   6: core::option::expect_failed
   7: mz_storage::render::sources::render_source::<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_storage_types::sources::kafka::KafkaSourceConnection>
   8: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>::{closure#0}::{closure#0}
   9: <timely::worker::Worker<timely_communication::allocator::generic::Generic>>::dataflow_core::<(), (), mz_storage::render::build_ingestion_dataflow<timely_communication::allocator::generic::Generic>::{closure#0}, alloc::boxed::Box<()>>
  10: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>
  11: <mz_storage::storage_state::Worker<timely_communication::allocator::generic::Generic>>::run_client
  12: <mz_storage::server::Config as mz_cluster::types::AsRunnableWorker<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::build_and_run::<timely_communication::allocator::generic::Generic>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
Good news: my test in #31243 can also reproduce this panic, as seen in this nightly run: https://buildkite.com/materialize/nightly/builds/11014#_
```rust
// This can happen when, say, a source is being dropped but we on
// the cluster are busy and notice that only later. In those cases
// it can happen that we still try to render an ingestion that is
// not valid anymore and where the shards it uses are not valid to
// use anymore.
//
// This is a rare race condition and something that is expected to
// happen every now and then. It's not a bug in the current way of
// how things work.
```
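The behavior this comment describes can be illustrated with a minimal, self-contained sketch. All names here are hypothetical (the real change lives in the storage remap/source rendering code and works with persist and timely types); the point is only the shape of the fix: an empty `since` frontier means the remap shard was finalized, so rendering should suspend rather than panic.

```rust
/// Outcome of attempting to render an ingestion (hypothetical type for
/// illustration; not a real mz_storage type).
#[derive(Debug, PartialEq)]
enum RenderOutcome {
    /// The ingestion dataflow was rendered normally.
    Rendered,
    /// The shard's `since` frontier was empty, meaning the shard was
    /// finalized (e.g. the source was dropped before a busy cluster got
    /// around to rendering it). We suspend instead of panicking, because
    /// a panic would bring down the whole cluster.
    Suspended,
}

/// Sketch of the guard: `since` stands in for the remap shard's since
/// frontier, represented here as a plain slice of timestamps.
fn render_remap(since: &[u64]) -> RenderOutcome {
    if since.is_empty() {
        // Rare but expected race: the source was dropped already and its
        // shards are no longer valid to use. Suspend quietly.
        return RenderOutcome::Suspended;
    }
    // In the real code, the dataflow would be built here.
    RenderOutcome::Rendered
}

fn main() {
    assert_eq!(render_remap(&[0]), RenderOutcome::Rendered);
    assert_eq!(render_remap(&[]), RenderOutcome::Suspended);
}
```

The design choice is the one argued in the thread: since this race is expected to happen every now and then, treating it as a graceful shutdown condition is preferable to an assertion that takes down the cluster.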
Storage has the `DroppedIds` protocol response, which I assumed exists to let the storage controller wait until all replicas have acknowledged the dropping of an object. Is that not true? Or does this race only come up in the context of 0dt upgrades?
As the comment describes, this is a race condition that is expected to happen and it's better to suspend rather than bring down the whole cluster, which causes pain for customers/the oncall.
@def- This fixes the panic of incident-360
Motivation
Tips for reviewer
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.