storage: in remap, when since is empty, suspend instead of panic'ing #31226
Conversation
What would be a good way to test this? Should I try just creating sources, using them, and then deleting them?
The case we observed in the incident is the source being dropped right after creating it; I think that's the most promising path. It also helps when the cluster is really busy, so that it doesn't get around to rendering the source in time but is delayed until after the source is already dropped.
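A reproduction along those lines might look like the following testdrive-style sketch. This is hypothetical, not the actual test from #31243; the connection and source names are made up:

```
# Hypothetical sketch of the reproduction idea: create a source and drop it
# immediately, so that a busy cluster only gets around to rendering the
# ingestion after the source (and its shards) are already gone.

> CREATE SOURCE race_source
  FROM KAFKA CONNECTION kafka_conn (TOPIC 'testdrive-race-${testdrive.seed}')
  FORMAT BYTES

> DROP SOURCE race_source
```

To widen the race window, the cluster can be kept busy with other dataflows so the render happens late.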
As the comment describes, this is a race condition that is expected to happen and it's better to suspend rather than bring down the whole cluster, which causes pain for customers/the oncall.
force-pushed from df33a47 to 9442ff2
I have a simple testdrive test with which I'm trying to reproduce this, but I ran into another panic instead:
I'll experiment around a bit with making it reproduce more reliably and open a separate PR.
Inspired by MaterializeInc#31226, but found another panic instead:

```
thread 'timely:work-0' panicked at src/storage/src/render/sources.rs:223:18:
resuming an already finished ingestion
   5: core::panicking::panic_fmt
   6: core::option::expect_failed
   7: mz_storage::render::sources::render_source::<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_storage_types::sources::kafka::KafkaSourceConnection>
   8: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>::{closure#0}::{closure#0}
   9: <timely::worker::Worker<timely_communication::allocator::generic::Generic>>::dataflow_core::<(), (), mz_storage::render::build_ingestion_dataflow<timely_communication::allocator::generic::Generic>::{closure#0}, alloc::boxed::Box<()>>
  10: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>
  11: <mz_storage::storage_state::Worker<timely_communication::allocator::generic::Generic>>::run_client
  12: <mz_storage::server::Config as mz_cluster::types::AsRunnableWorker<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::build_and_run::<timely_communication::allocator::generic::Generic>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
Good news: my test in #31243 can also reproduce this panic, as seen in this nightly run: https://buildkite.com/materialize/nightly/builds/11014#_
```rust
// This can happen when, say, a source is being dropped but we on
// the cluster are busy and notice that only later. In those cases
// it can happen that we still try to render an ingestion that is
// not valid anymore and where the shards it uses are not valid to
// use anymore.
//
// This is a rare race condition and something that is expected to
// happen every now and then. It's not a bug in the current way of
// how things work.
```
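The behavior this comment describes can be illustrated with a minimal, self-contained sketch. All names here are hypothetical (the real change lives in the storage remap/source rendering code and works with persist and timely types); the point is only the shape of the fix: an empty `since` frontier means the remap shard was finalized, so rendering should suspend rather than panic.

```rust
/// Outcome of attempting to render an ingestion (hypothetical type for
/// illustration; not a real mz_storage type).
#[derive(Debug, PartialEq)]
enum RenderOutcome {
    /// The ingestion dataflow was rendered normally.
    Rendered,
    /// The shard's `since` frontier was empty, meaning the shard was
    /// finalized (e.g. the source was dropped before a busy cluster got
    /// around to rendering it). We suspend instead of panicking, because
    /// a panic would bring down the whole cluster.
    Suspended,
}

/// Sketch of the guard: `since` stands in for the remap shard's since
/// frontier, represented here as a plain slice of timestamps.
fn render_remap(since: &[u64]) -> RenderOutcome {
    if since.is_empty() {
        // Rare but expected race: the source was dropped already and its
        // shards are no longer valid to use. Suspend quietly.
        return RenderOutcome::Suspended;
    }
    // In the real code, the dataflow would be built here.
    RenderOutcome::Rendered
}

fn main() {
    assert_eq!(render_remap(&[0]), RenderOutcome::Rendered);
    assert_eq!(render_remap(&[]), RenderOutcome::Suspended);
}
```

The design choice is the one argued in the thread: since this race is expected to happen every now and then, treating it as a graceful shutdown condition is preferable to an assertion that takes down the cluster.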
Storage has the `DroppedIds` protocol response, which I assumed exists to let the storage controller wait until all replicas have acknowledged the dropping of an object. Is that not true? Or does this race only come up in the context of 0dt upgrades?
As the comment describes, this is a race condition that is expected to happen and it's better to suspend rather than bring down the whole cluster, which causes pain for customers/the oncall.
@def- This fixes the panic of incident-360
Motivation
Tips for reviewer
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.