[copy_from] Proper cancelation via CancelOneshotIngestion message #31136

base: main

Conversation
Looks good overall. The thing I'm not clear about is what happens to oneshot ingestions when their target cluster is dropped. There are two places that have to deal with the possibility of a missing replica, and they handle it differently (one returns an error, the other gracefully ignores it). I think we should handle this consistently, but the right thing to do depends on whether we drop the controller state for pending oneshot ingestions when we drop an instance.
```rust
/// [`RunOneshotIngestion`]: crate::client::StorageCommand::RunOneshotIngestion
CancelOneshotIngestion {
    ingestions: Vec<Uuid>,
},
```
Is there a reason to make these batched commands? In compute we at one point transformed all commands into unbatched ones, because the batching made various things more cumbersome (mainly keeping statistics about the number of commands in the history) and it didn't provide any benefits wrt. protobuf encoding size. I think there are plans to also move to unbatched commands for storage (either @aljoscha or @petrosagg mentioned that), so if that's still the case it'd make sense to introduce new commands as unbatched immediately.
I mentioned it, yeah. If possible we should use a flattened field here.
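For illustration, a minimal sketch of what an unbatched variant with a flattened field could look like (the shape here is an assumption, not code from this PR):

```rust
use uuid::Uuid;

pub enum StorageCommand {
    // ... other variants elided ...

    /// Cancels a single in-flight oneshot ingestion.
    ///
    /// Unbatched: one command per ingestion, with the id as a flattened
    /// field instead of a `Vec<Uuid>` batch.
    CancelOneshotIngestion {
        ingestion: Uuid,
    },
}
```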
The reason I made these batched commands is that the loop in `fn reconcile(...)` doesn't remove commands; instead it mutates the batches and drains the relevant entries, so I decided to stick with this existing pattern. I chatted with @petrosagg about this today though, and I'll first try to refactor the loop to actually remove commands instead of just draining batched ones (see the sketch below).
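A rough sketch of the two styles, using hypothetical stand-ins for the command shapes (the real `StorageCommand` is more involved):

```rust
use std::collections::BTreeSet;
use uuid::Uuid;

// Hypothetical stand-ins for the batched and unbatched command shapes.
struct BatchedCancel {
    ingestions: Vec<Uuid>,
}
struct UnbatchedCancel {
    ingestion: Uuid,
}

/// Existing pattern: keep every command in the history, but drain the ids
/// that are no longer relevant out of each batch.
fn drain_batched(commands: &mut Vec<BatchedCancel>, handled: &BTreeSet<Uuid>) {
    for command in commands.iter_mut() {
        command.ingestions.retain(|id| !handled.contains(id));
    }
    // Note: fully drained batches linger as empty commands unless they are
    // cleaned up separately.
}

/// Proposed refactor: with unbatched commands, irrelevant commands can be
/// removed from the history wholesale.
fn remove_unbatched(commands: &mut Vec<UnbatchedCancel>, handled: &BTreeSet<Uuid>) {
    commands.retain(|command| !handled.contains(&command.ingestion));
}
```

The unbatched style also keeps bookkeeping simple: the number of live commands in the history is just the vector's length.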
```rust
let instance = self.instances.get_mut(&pending.cluster_id).ok_or_else(|| {
    // TODO(cf2): Refine this error.
    StorageError::Generic(anyhow::anyhow!("missing cluster {}", pending.cluster_id))
})?;
```
If we get here that would be because of a bug in the storage controller, not because of a usage error, right? I wouldn't return an error here, but do a (soft) panic instead.
We can add a soft_panic, but I think we'd also need to return an error anyways? Because we need an instance to send a `Command`, and I wouldn't want to return `Ok(())` if we didn't have an instance.
I think we need to agree on what should happen when a user drops a cluster that still has pending oneshot ingestions. Plausible options would be:

1. Cancel all the pending oneshot ingestions as well, by removing them from `pending_oneshot_ingestions`.
2. Return an error from the cluster drop command, requiring that all oneshot ingestions are canceled before one can drop a cluster.
3. Just drop the cluster regardless of what's currently pending.

This PR implements (3), which is consistent with what the storage controller does for other kinds of cluster objects, afaict. But it forces us to answer the question of what happens to these orphaned pending ingestions now. We probably don't want to leak the state, so there should be a way to still remove them. At the moment, the only way to remove them returns an error, which doesn't feel great. I think returning `Ok(())` would actually make sense: you wanted to cancel the ingestion, it wasn't physically running anymore, but it was still pending in the storage controller, and that pending state is now canceled.

In any case, my initial assumption that we can only get here due to a bug in the storage controller seems wrong, so we shouldn't put a soft panic here! A sketch of that cancelation path follows.
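Something like this, where `self.instances` and `pending_oneshot_ingestions` are the controller fields quoted above, but the method shape and `instance.send` are assumptions for illustration:

```rust
impl StorageController {
    /// Cancel a pending oneshot ingestion. Succeeds even if the target
    /// cluster has already been dropped: the pending state is removed and
    /// there is nothing left to physically cancel.
    fn cancel_oneshot_ingestion(&mut self, ingestion_id: Uuid) -> Result<(), StorageError> {
        let Some(pending) = self.pending_oneshot_ingestions.remove(&ingestion_id) else {
            // Nothing pending under this id; cancelation is a no-op.
            return Ok(());
        };

        if let Some(instance) = self.instances.get_mut(&pending.cluster_id) {
            // Normal case: tell the replica to stop the ingestion.
            instance.send(StorageCommand::CancelOneshotIngestion { ingestion: ingestion_id });
        }
        // If the instance is gone, the cluster was dropped while the
        // ingestion was still pending; removing the pending state above is
        // all the cancelation there is to do.
        Ok(())
    }
}
```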
```rust
.filter(|ingestion_id| {
    let created = create_oneshot_ingestions.contains(ingestion_id);
    let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
    !created && !dropped
})
```
The check seems unnecessary. We shouldn't have any drop commands in the command stream for ingestions we didn't previously create, right? So `!created && dropped` shouldn't be possible.
I wasn't sure! This is definitely a bit defensive, but it covers the case of an upstream error where we send a `CancelOneshotIngestion` command before sending a create command. I can remove it if you'd prefer.
I think it would be good to have some indication (like a soft panic or an error log) if the protocol assumptions are broken, to give us a chance to discover these cases. But I also don't want to make the reconciliation logic more complicated than it already is. Fine to leave it as it is!
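For example, the filter could flag the impossible case without otherwise changing the reconciliation behavior (a sketch; I believe `soft_panic_or_log!` from `mz_ore` fits here, but treat the exact macro as an assumption):

```rust
.filter(|ingestion_id| {
    let created = create_oneshot_ingestions.contains(ingestion_id);
    let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
    if !created && dropped {
        // A cancel command for an ingestion that was never created breaks
        // our protocol assumptions; make that visible instead of silently
        // filtering it out.
        mz_ore::soft_panic_or_log!(
            "cancel command for unknown oneshot ingestion: {ingestion_id}"
        );
    }
    !created && !dropped
})
```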
* add CancelOneshotIngestion message to the storage controller
* handle new message in 'reduce' and 'reconcile' in the storage-controller
* emit a CancelOneshotIngestion whenever an ingestion completes
This PR fixes the `TODO(cf1)` related to canceling oneshot ingestions. It adds a `StorageCommand::CancelOneshotIngestion` that `reduce`s/compacts away a corresponding `StorageCommand::RunOneshotIngestion`, much like `ComputeCommand::Peek` and `ComputeCommand::CancelPeek`.

We send a `StorageCommand::CancelOneshotIngestion` whenever a user cancels a `COPY FROM` statement, and the storage controller also sends one whenever a `RunOneshotIngestion` command completes. A sketch of that reduction follows.
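Roughly, the reduction pairs each cancel with its run and drops both from the command history (a hypothetical, simplified sketch, not the PR's actual implementation):

```rust
use std::collections::BTreeSet;
use uuid::Uuid;

// Hypothetical, simplified command shape; the real StorageCommand has many
// more variants.
enum Command {
    RunOneshotIngestion { id: Uuid },
    CancelOneshotIngestion { id: Uuid },
    Other,
}

/// Compact the command history: a CancelOneshotIngestion "cancels out" the
/// matching RunOneshotIngestion, so neither has to be replayed to a replica
/// that reconnects later.
fn reduce(history: &mut Vec<Command>) {
    let canceled: BTreeSet<Uuid> = history
        .iter()
        .filter_map(|command| match command {
            Command::CancelOneshotIngestion { id } => Some(*id),
            _ => None,
        })
        .collect();

    history.retain(|command| match command {
        Command::RunOneshotIngestion { id } | Command::CancelOneshotIngestion { id } => {
            !canceled.contains(id)
        }
        Command::Other => true,
    });
}
```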
Motivation

Fix `TODO(cf1)` related to cancelation.

Checklist
- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.