Introduce a WalCommand variant for batch cancel/kill invocations based on filter criteria #4455

@tillrohrmann

Summary

Currently, cancelling or killing multiple invocations requires sending one Command::TerminateInvocation WAL command per invocation. The existing batch endpoints (/internal/invocations_batch_operations/kill and /cancel) fan out individual RPC calls via future::join_all(), and the CLI resolves matching invocations via DataFusion SQL queries before issuing one-by-one cancel/kill calls.

We should introduce a new Command variant that allows batch cancellation/killing of invocations based on filter criteria, rather than requiring pre-resolution of individual invocation IDs.

Motivation

When a user wants to cancel/kill a large number of invocations (e.g., all invocations of a particular service, or all invocations stuck in a certain state), the current approach has several downsides:

  1. N individual WAL entries: Each cancellation results in a separate WAL command. Since the partition processor state machine eagerly schedules invocations when applying a kill command, it can happen that we schedule invocations that will be killed with the next Bifrost message.
  2. Race conditions: New invocations matching the criteria may be created between querying for matching invocations and issuing the termination commands, so they are not covered by the batch.
  3. Client-side resolution: The filter logic lives in the CLI/admin API layer, not in the partition processor where the authoritative state lives.

Proposed Filter Criteria

The new command could support filtering invocations by combinations of:

  • Time-based: created before/after a timestamp, last modified before/after a timestamp
  • Service-based: belonging to a specific service (by name)
  • Handler-based: targeting a specific handler
  • State-based: being in a certain status (e.g., suspended, backing-off, ready, running, scheduled)
  • Deployment-based: pinned to a specific deployment
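
To make the criteria above concrete, here is a minimal sketch of what a filter and the command payload carrying it could look like. All names (InvocationFilter, InvocationMeta, Status, BatchTerminate) are hypothetical and do not exist in the codebase; the real types would presumably build on the existing invocation status and ID types. Unset fields are assumed to mean "no constraint", and set fields are combined with AND.

```rust
use std::time::SystemTime;

/// Hypothetical status enum mirroring the states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Scheduled,
    Ready,
    Running,
    Suspended,
    BackingOff,
}

/// Hypothetical view of the invocation fields a filter would inspect.
#[derive(Debug, Clone)]
pub struct InvocationMeta {
    pub service: String,
    pub handler: String,
    pub status: Status,
    pub deployment_id: Option<String>,
    pub created_at: SystemTime,
    pub modified_at: SystemTime,
}

/// Hypothetical filter. `None` (or an empty `statuses` list) means
/// "no constraint"; all set constraints must hold (logical AND).
#[derive(Debug, Clone, Default)]
pub struct InvocationFilter {
    pub service: Option<String>,
    pub handler: Option<String>,
    pub statuses: Vec<Status>,
    pub deployment_id: Option<String>,
    pub created_before: Option<SystemTime>,
    pub created_after: Option<SystemTime>,
    pub modified_before: Option<SystemTime>,
    pub modified_after: Option<SystemTime>,
}

impl InvocationFilter {
    /// Returns true if the invocation satisfies every set constraint.
    pub fn matches(&self, inv: &InvocationMeta) -> bool {
        self.service.as_ref().map_or(true, |s| *s == inv.service)
            && self.handler.as_ref().map_or(true, |h| *h == inv.handler)
            && (self.statuses.is_empty() || self.statuses.contains(&inv.status))
            && self
                .deployment_id
                .as_ref()
                .map_or(true, |d| inv.deployment_id.as_ref() == Some(d))
            && self.created_before.map_or(true, |t| inv.created_at < t)
            && self.created_after.map_or(true, |t| inv.created_at > t)
            && self.modified_before.map_or(true, |t| inv.modified_at < t)
            && self.modified_after.map_or(true, |t| inv.modified_at > t)
    }
}

/// Whether matching invocations are cancelled or killed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TerminationKind {
    Cancel,
    Kill,
}

/// Hypothetical payload of the new WAL command variant.
#[derive(Debug, Clone)]
pub struct BatchTerminate {
    pub kind: TerminationKind,
    pub filter: InvocationFilter,
}
```

With this shape, an empty filter matches every invocation, so the real command would probably want to reject or explicitly confirm an unconstrained filter.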

Open Questions

How to avoid blocking the partition processor loop

The partition processor's main event loop (PartitionProcessor::run_inner() in crates/worker/src/partition/mod.rs) processes commands in batches from the bifrost log within a single partition store transaction. A batch cancel/kill that scans and terminates thousands of invocations in one go could block this loop for too long, starving other work (RPCs, timers, invoker effects, etc.).

Some alternatives to consider:

  1. Chunked processing: The state machine processes the filter incrementally — e.g., cancel N invocations per tick, re-enqueue the command if there are more matches. This keeps the loop responsive but adds complexity around tracking progress.

  2. Pre-resolved invocation IDs: Instead of sending filter criteria in the WAL command, resolve the IDs at the admin API layer and send them as a batch Vec<InvocationId> in the WAL command. This is simpler but loses the atomicity/consistency benefit and still has the race condition issue.
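
The chunked processing alternative could be sketched roughly as follows. This is an illustration under simplifying assumptions, not a proposal for the actual state machine code: the invocation table is modelled as an in-memory BTreeMap, InvocationId is a u64 stand-in, and terminate_chunk and ChunkOutcome are hypothetical names. The sketch bounds the number of scanned entries per call (rather than the number of terminated ones) so that a sparse filter cannot cause an unbounded scan, and it returns a cursor that a re-enqueued command would carry.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Stand-in for the real invocation ID type.
pub type InvocationId = u64;

/// Result of processing one chunk of a batch termination.
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkOutcome {
    /// IDs terminated in this chunk.
    pub terminated: Vec<InvocationId>,
    /// Cursor to resume from; `None` means the scan is complete.
    pub resume_after: Option<InvocationId>,
}

/// Scans at most `max_scanned` entries after the cursor, removes the
/// ones accepted by `matches`, and reports where to resume.
pub fn terminate_chunk<V>(
    store: &mut BTreeMap<InvocationId, V>,
    matches: impl Fn(&V) -> bool,
    resume_after: Option<InvocationId>,
    max_scanned: usize,
) -> ChunkOutcome {
    assert!(max_scanned > 0, "a chunk must scan at least one entry");

    let lower = match resume_after {
        Some(id) => Bound::Excluded(id),
        None => Bound::Unbounded,
    };

    let mut terminated = Vec::new();
    let mut last_scanned = None;
    let mut scanned = 0;
    let mut exhausted = true;

    for (id, inv) in store.range((lower, Bound::Unbounded)) {
        if scanned == max_scanned {
            // More entries remain; stop here and resume next tick.
            exhausted = false;
            break;
        }
        scanned += 1;
        last_scanned = Some(*id);
        if matches(inv) {
            terminated.push(*id);
        }
    }

    // Removal stands in for applying the actual cancel/kill effects.
    for id in &terminated {
        store.remove(id);
    }

    ChunkOutcome {
        terminated,
        resume_after: if exhausted { None } else { last_scanned },
    }
}
```

A caller would invoke terminate_chunk once per tick and re-enqueue the command with the returned cursor until resume_after is None. Note that invocations created behind the cursor while the scan is in progress are not visited, so the consistency guarantee of this approach would need to be defined explicitly.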

Interaction with multi-partition setups

If invocations are spread across multiple partitions, a filter-based batch command would need to be sent to each partition. The coordination and progress tracking for this needs consideration.

Progress tracking for filter-based termination commands

When sending the filter condition together with the termination WAL command, we might need a different way to signal progress to the user (how many invocations are being killed, how many have already been killed).
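
One possible shape for such progress reporting, also covering the multi-partition case, is sketched below. It assumes each partition reports how many invocations it has terminated so far and whether its scan has finished; the total number of matches is not known up front with a filter-based command, so only "terminated so far" and "partitions still pending" are tracked. BatchProgress, PartitionProgress, and the u16 PartitionId stand-in are hypothetical names for illustration.

```rust
use std::collections::BTreeMap;

/// Stand-in for the real partition ID type.
pub type PartitionId = u16;

/// Progress reported by a single partition.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct PartitionProgress {
    pub terminated: u64,
    pub done: bool,
}

/// Aggregated progress of one batch termination across partitions.
#[derive(Debug, Default)]
pub struct BatchProgress {
    partitions: BTreeMap<PartitionId, PartitionProgress>,
}

impl BatchProgress {
    /// Starts tracking the partitions the command was sent to.
    pub fn new(partitions: impl IntoIterator<Item = PartitionId>) -> Self {
        Self {
            partitions: partitions
                .into_iter()
                .map(|p| (p, PartitionProgress::default()))
                .collect(),
        }
    }

    /// Records a report from one partition: the number of invocations
    /// terminated since its last report and whether it has finished.
    pub fn record(&mut self, partition: PartitionId, newly_terminated: u64, done: bool) {
        let entry = self.partitions.entry(partition).or_default();
        entry.terminated += newly_terminated;
        entry.done = entry.done || done;
    }

    /// Total invocations terminated so far across all partitions.
    pub fn total_terminated(&self) -> u64 {
        self.partitions.values().map(|p| p.terminated).sum()
    }

    /// Partitions that have not yet finished their scan.
    pub fn pending_partitions(&self) -> Vec<PartitionId> {
        self.partitions
            .iter()
            .filter(|(_, p)| !p.done)
            .map(|(id, _)| *id)
            .collect()
    }

    /// True once every tracked partition has reported completion.
    pub fn is_complete(&self) -> bool {
        self.partitions.values().all(|p| p.done)
    }
}
```

Where this aggregate would live (admin API, CLI, or a dedicated coordinator) is left open here, as it is in the question above.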
