[copy_from] Proper cancelation via CancelOneshotIngestion message #31136

base: main

Conversation
Looks good overall. The thing I'm not clear about is what happens to oneshot ingestions when their target cluster is dropped. There are two places that have to deal with the possibility of a missing replica, and they handle it differently (one returns an error, the other gracefully ignores it). I think we should handle this consistently, but the right thing to do depends on whether we drop the controller state for pending oneshot ingestions when we drop an instance.
```rust
/// [`RunOneshotIngestion`]: crate::client::StorageCommand::RunOneshotIngestion
CancelOneshotIngestion {
    ingestions: Vec<Uuid>,
},
```
Is there a reason to make these batched commands? In compute we at one point transformed all commands into unbatched ones, because the batching made various things more cumbersome (mainly keeping statistics about the number of commands in the history) and it didn't provide any benefits wrt. protobuf encoding size. I think there are plans to also move to unbatched commands for storage (either @aljoscha or @petrosagg mentioned that), so if that's still the case it'd make sense to introduce new commands as unbatched immediately.
I mentioned it, yeah. If possible we should use a flattened field here.
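For illustration, a minimal sketch of what an unbatched variant with a flattened field could look like (the shape here is an assumption, not code from this PR):

```rust
use uuid::Uuid;

pub enum StorageCommand {
    // ... other variants elided ...

    /// Cancels a single in-flight oneshot ingestion.
    ///
    /// Unbatched: one command per ingestion, with the id as a flattened
    /// field instead of a `Vec<Uuid>` batch.
    CancelOneshotIngestion {
        ingestion: Uuid,
    },
}
```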
The reason I made these batched commands is that the loop in `fn reconcile(...)` doesn't remove commands; instead it mutates the batches and drains the relevant entries, so I decided to stick with this existing pattern. I chatted with @petrosagg about this today though, and I'll first try to refactor the loop to actually remove commands instead of just draining batched ones (see the sketch below).
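A rough sketch of the two styles, using hypothetical stand-ins for the command shapes (the real `StorageCommand` is more involved):

```rust
use std::collections::BTreeSet;
use uuid::Uuid;

// Hypothetical stand-ins for the batched and unbatched command shapes.
struct BatchedCancel {
    ingestions: Vec<Uuid>,
}
struct UnbatchedCancel {
    ingestion: Uuid,
}

/// Existing pattern: keep every command in the history, but drain the ids
/// that are no longer relevant out of each batch.
fn drain_batched(commands: &mut Vec<BatchedCancel>, handled: &BTreeSet<Uuid>) {
    for command in commands.iter_mut() {
        command.ingestions.retain(|id| !handled.contains(id));
    }
    // Note: fully drained batches linger as empty commands unless they are
    // cleaned up separately.
}

/// Proposed refactor: with unbatched commands, irrelevant commands can be
/// removed from the history wholesale.
fn remove_unbatched(commands: &mut Vec<UnbatchedCancel>, handled: &BTreeSet<Uuid>) {
    commands.retain(|command| !handled.contains(&command.ingestion));
}
```

The unbatched style also keeps bookkeeping simple: the number of live commands in the history is just the vector's length.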
```rust
let instance = self.instances.get_mut(&pending.cluster_id).ok_or_else(|| {
    // TODO(cf2): Refine this error.
    StorageError::Generic(anyhow::anyhow!("missing cluster {}", pending.cluster_id))
})?;
```
If we get here that would be because of a bug in the storage controller, not because of a usage error, right? I wouldn't return an error here, but do a (soft) panic instead.
We can add a soft_panic, but I think we'd also need to return an error anyways? Because we need an instance to send a `Command`, and I wouldn't want to return `Ok(())` if we didn't have an instance.
I think we need to agree on what should happen when a user drops a cluster that still has pending oneshot ingestions. Plausible options would be:

1. Cancel all the pending oneshot ingestions as well, by removing them from `pending_oneshot_ingestions`.
2. Return an error from the cluster drop command, requiring that all oneshot ingestions are canceled before one can drop a cluster.
3. Just drop the cluster regardless of what's currently pending.

This PR implements (3), which is consistent with what the storage controller does for other kinds of cluster objects, afaict. But it forces us to answer the question of what happens to these orphaned pending ingestions now. We probably don't want to leak the state, so there should be a way to still remove them. At the moment, the only way to remove them returns an error, which doesn't feel great. I think returning `Ok(())` would actually make sense: you wanted to cancel the ingestion, it wasn't physically running anymore, but it was still pending in the storage controller, and that pending state is now canceled.

In any case, my initial assumption that we can only get here due to a bug in the storage controller seems wrong, so we shouldn't put a soft panic here! A sketch of that cancelation path follows.
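Something like this, where `self.instances` and `pending_oneshot_ingestions` are the controller fields quoted above, but the method shape and `instance.send` are assumptions for illustration:

```rust
impl StorageController {
    /// Cancel a pending oneshot ingestion. Succeeds even if the target
    /// cluster has already been dropped: the pending state is removed and
    /// there is nothing left to physically cancel.
    fn cancel_oneshot_ingestion(&mut self, ingestion_id: Uuid) -> Result<(), StorageError> {
        let Some(pending) = self.pending_oneshot_ingestions.remove(&ingestion_id) else {
            // Nothing pending under this id; cancelation is a no-op.
            return Ok(());
        };

        if let Some(instance) = self.instances.get_mut(&pending.cluster_id) {
            // Normal case: tell the replica to stop the ingestion.
            instance.send(StorageCommand::CancelOneshotIngestion { ingestion: ingestion_id });
        }
        // If the instance is gone, the cluster was dropped while the
        // ingestion was still pending; removing the pending state above is
        // all the cancelation there is to do.
        Ok(())
    }
}
```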
```rust
.filter(|ingestion_id| {
    let created = create_oneshot_ingestions.contains(ingestion_id);
    let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
    !created && !dropped
})
```
The check seems unnecessary. We shouldn't have any drop commands in the command stream for ingestions we didn't previously create, right? So `!created && dropped` shouldn't be possible.
I wasn't sure! This is definitely a bit defensive, but it covers the case of an upstream error where we send a `CancelOneshotIngestion` command before sending a create command. I can remove it if you'd prefer.
I think it would be good to have some indication (like a soft panic or an error log) if the protocol assumptions are broken, to give us a chance to discover these cases. But I also don't want to make the reconciliation logic more complicated than it already is. Fine to leave it as it is!
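For example, the filter could flag the impossible case without otherwise changing the reconciliation behavior (a sketch; I believe `soft_panic_or_log!` from `mz_ore` fits here, but treat the exact macro as an assumption):

```rust
.filter(|ingestion_id| {
    let created = create_oneshot_ingestions.contains(ingestion_id);
    let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
    if !created && dropped {
        // A cancel command for an ingestion that was never created breaks
        // our protocol assumptions; make that visible instead of silently
        // filtering it out.
        mz_ore::soft_panic_or_log!(
            "cancel command for unknown oneshot ingestion: {ingestion_id}"
        );
    }
    !created && !dropped
})
```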
* add CancelOneshotIngestion message to the storage controller
* handle new message in 'reduce' and 'reconcile' in the storage-controller
* emit a CancelOneshotIngestion whenever an ingestion completes
This PR fixes the `TODO(cf1)` related to canceling oneshot ingestions. It adds a `StorageCommand::CancelOneshotIngestion` that `reduce`s/compacts away a corresponding `StorageCommand::RunOneshotIngestion`, much like `ComputeCommand::Peek` and `ComputeCommand::CancelPeek`.

We send a `StorageCommand::CancelOneshotIngestion` whenever a user cancels a `COPY FROM` statement, and the storage controller also sends one whenever a `RunOneshotIngestion` command completes. A sketch of that reduction follows.
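Roughly, the reduction pairs each cancel with its run and drops both from the command history (a hypothetical, simplified sketch, not the PR's actual implementation):

```rust
use std::collections::BTreeSet;
use uuid::Uuid;

// Hypothetical, simplified command shape; the real StorageCommand has many
// more variants.
enum Command {
    RunOneshotIngestion { id: Uuid },
    CancelOneshotIngestion { id: Uuid },
    Other,
}

/// Compact the command history: a CancelOneshotIngestion "cancels out" the
/// matching RunOneshotIngestion, so neither has to be replayed to a replica
/// that reconnects later.
fn reduce(history: &mut Vec<Command>) {
    let canceled: BTreeSet<Uuid> = history
        .iter()
        .filter_map(|command| match command {
            Command::CancelOneshotIngestion { id } => Some(*id),
            _ => None,
        })
        .collect();

    history.retain(|command| match command {
        Command::RunOneshotIngestion { id } | Command::CancelOneshotIngestion { id } => {
            !canceled.contains(id)
        }
        Command::Other => true,
    });
}
```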
Motivation

Fix `TODO(cf1)` related to cancelation.

Checklist
- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.