Summary
Currently, cancelling or killing multiple invocations requires sending individual Command::TerminateInvocation WAL commands per invocation. The existing batch endpoints (/internal/invocations_batch_operations/kill and /cancel) fan out individual RPC calls via future::join_all(), and the CLI resolves matching invocations via DataFusion SQL queries before issuing one-by-one cancel/kill calls.
We should introduce a new Command variant that allows batch cancellation/killing of invocations based on filter criteria, rather than requiring pre-resolution of individual invocation IDs.
Motivation
When a user wants to cancel/kill a large number of invocations (e.g., all invocations of a particular service, or all invocations stuck in a certain state), the current approach has several downsides:
- N individual WAL entries: Each cancellation results in a separate WAL command. Because the partition processor state machine eagerly schedules invocations when applying a kill command, it can schedule an invocation that the very next Bifrost message then kills.
- Race conditions: Between querying for matching invocations and issuing the cancellation commands, new invocations matching the criteria may have been created.
- Client-side resolution: The filter logic lives in the CLI/admin API layer, not in the partition processor where the authoritative state lives.
Proposed Filter Criteria
The new command could support filtering invocations by combinations of:
- Time-based: created before/after a timestamp, last modified before/after a timestamp
- Service-based: belonging to a specific service (by name)
- Handler-based: targeting a specific handler
- State-based: being in a certain status (e.g., suspended, backing-off, ready, running, scheduled)
- Deployment-based: pinned to a specific deployment
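To make the criteria above concrete, here is a minimal sketch of what a filter payload and its matching logic could look like. Every name in it (InvocationFilter, InvocationMeta, Status, the field names) is hypothetical and not taken from the codebase; it assumes that all set criteria are combined with AND and that an unset criterion matches everything.

```rust
use std::time::SystemTime;

/// Hypothetical status set mirroring the statuses named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Suspended,
    BackingOff,
    Ready,
    Running,
    Scheduled,
}

/// Hypothetical view of the invocation metadata a filter would inspect.
#[derive(Debug, Clone)]
pub struct InvocationMeta {
    pub service: String,
    pub handler: String,
    pub status: Status,
    pub pinned_deployment: Option<String>,
    pub created_at: SystemTime,
    pub modified_at: SystemTime,
}

/// Hypothetical filter: `None` (or an empty status list) means "no constraint".
#[derive(Debug, Clone, Default)]
pub struct InvocationFilter {
    pub created_before: Option<SystemTime>,
    pub created_after: Option<SystemTime>,
    pub modified_before: Option<SystemTime>,
    pub modified_after: Option<SystemTime>,
    pub service: Option<String>,
    pub handler: Option<String>,
    pub statuses: Vec<Status>,
    pub deployment: Option<String>,
}

impl InvocationFilter {
    /// Returns true only if the invocation satisfies every criterion that is set.
    pub fn matches(&self, inv: &InvocationMeta) -> bool {
        self.created_before.map_or(true, |t| inv.created_at < t)
            && self.created_after.map_or(true, |t| inv.created_at > t)
            && self.modified_before.map_or(true, |t| inv.modified_at < t)
            && self.modified_after.map_or(true, |t| inv.modified_at > t)
            && self.service.as_ref().map_or(true, |s| *s == inv.service)
            && self.handler.as_ref().map_or(true, |h| *h == inv.handler)
            && (self.statuses.is_empty() || self.statuses.contains(&inv.status))
            && self
                .deployment
                .as_ref()
                .map_or(true, |d| inv.pinned_deployment.as_ref() == Some(d))
    }
}
```

Under these assumptions, "all suspended invocations of service Greeter" would be a filter with only service and statuses set, and a default filter would match every invocation, which a real command would probably want to reject or require explicit confirmation for.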
Open Questions
How to avoid blocking the partition processor loop
The partition processor's main event loop (PartitionProcessor::run_inner() in crates/worker/src/partition/mod.rs) processes commands in batches from the bifrost log within a single partition store transaction. A batch cancel/kill that scans and terminates thousands of invocations in one go could block this loop for too long, starving other work (RPCs, timers, invoker effects, etc.).
Some alternatives to consider:
- Chunked processing: The state machine processes the filter incrementally, e.g., cancel N invocations per tick and re-enqueue the command if there are more matches. This keeps the loop responsive but adds complexity around tracking progress.
- Pre-resolved invocation IDs: Instead of sending filter criteria in the WAL command, resolve the IDs at the admin API layer and send them as a batch Vec<InvocationId> in the WAL command. This is simpler but loses the atomicity/consistency benefit and still has the race condition issue.
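The chunked-processing alternative could be sketched roughly as follows. This is an illustration under stated assumptions, not the partition processor's actual API: the store is modelled as an in-memory BTreeMap, InvocationId is a plain u64, "terminating" is modelled as removing the entry, and terminate_chunk and ChunkOutcome are hypothetical names. The only progress that has to be tracked between ticks is a resume cursor.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical stand-in for an invocation ID; ordered so a scan can resume.
pub type InvocationId = u64;

/// Result of one tick: what was terminated, and where to resume (if anywhere).
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkOutcome {
    pub terminated: Vec<InvocationId>,
    /// `Some(id)` means: re-enqueue the command and resume after `id`.
    pub resume_after: Option<InvocationId>,
}

/// Terminates at most `max_per_tick` matching invocations, starting after the cursor.
pub fn terminate_chunk<V>(
    store: &mut BTreeMap<InvocationId, V>,
    resume_after: Option<InvocationId>,
    max_per_tick: usize,
    matches: impl Fn(&V) -> bool,
) -> ChunkOutcome {
    assert!(max_per_tick > 0, "a chunk must be allowed to make progress");

    let bounds: (Bound<InvocationId>, Bound<InvocationId>) = match resume_after {
        Some(id) => (Bound::Excluded(id), Bound::Unbounded),
        None => (Bound::Unbounded, Bound::Unbounded),
    };

    // Collect one match more than the budget to learn whether more work remains.
    let mut hits: Vec<InvocationId> = store
        .range(bounds)
        .filter(|&(_, v)| matches(v))
        .map(|(id, _)| *id)
        .take(max_per_tick + 1)
        .collect();
    let more = hits.len() > max_per_tick;
    hits.truncate(max_per_tick);

    // Stand-in for applying the actual termination to each invocation.
    for id in &hits {
        store.remove(id);
    }

    let resume_after = if more { hits.last().copied() } else { None };
    ChunkOutcome { terminated: hits, resume_after }
}
```

A caller would invoke terminate_chunk once per tick and re-enqueue the command with the returned cursor until resume_after is None. Note that this sketch bounds only the number of terminations per tick; a sparse filter could still scan many non-matching entries in one call, so a real implementation might also want a budget on scanned entries.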
Interaction with multi-partition setups
If invocations are spread across multiple partitions, a filter-based batch command would need to be sent to each partition. The coordination and progress tracking for this needs consideration.
Progress tracking for filter-based termination commands
When sending the filter condition together with the termination WAL command, we might need a different way to signal progress to the user (how many invocations are being killed, how many have already been killed).
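One possible shape for such progress reporting, sketched purely as a starting point for discussion: each partition reports how many invocations it has terminated so far and whether its scan has finished, and the admin layer sums these reports. All names here (PartitionProgress, BatchProgress, PartitionId as u16) are hypothetical, and how the reports would actually travel from the partitions to the admin API is left open.

```rust
use std::collections::BTreeMap;

/// Hypothetical partition identifier.
pub type PartitionId = u16;

/// Latest progress snapshot reported by one partition.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct PartitionProgress {
    /// Invocations terminated so far on this partition.
    pub terminated: u64,
    /// True once the partition has no further matches to process.
    pub scan_complete: bool,
}

/// Aggregated view over all partitions the command was sent to.
#[derive(Debug, Default)]
pub struct BatchProgress {
    per_partition: BTreeMap<PartitionId, PartitionProgress>,
}

impl BatchProgress {
    /// Registers every targeted partition with an empty snapshot.
    pub fn new(partitions: impl IntoIterator<Item = PartitionId>) -> Self {
        let per_partition = partitions
            .into_iter()
            .map(|p| (p, PartitionProgress::default()))
            .collect();
        Self { per_partition }
    }

    /// Replaces the snapshot for one partition with its latest report.
    pub fn report(&mut self, partition: PartitionId, progress: PartitionProgress) {
        self.per_partition.insert(partition, progress);
    }

    /// Total number of invocations terminated across all partitions so far.
    pub fn terminated_total(&self) -> u64 {
        self.per_partition.values().map(|p| p.terminated).sum()
    }

    /// True once every targeted partition has finished its scan.
    pub fn is_done(&self) -> bool {
        self.per_partition.values().all(|p| p.scan_complete)
    }
}
```

With filter-based commands the total number of matches is not known up front, so this sketch can only answer "how many have been killed so far" and "is every partition finished"; reporting "how many are being killed" would need an additional counting pass or an estimate.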