tracking: unify process-wide shutdown via root CancellationToken #1051

@zancas

Goal

Replace the heterogeneous shutdown patterns across zaino-state, zaino-serve, and zainod with a single primitive: a root CancellationToken owned by the indexer, with a child_token() handed to each long-running task. SIGINT/SIGTERM cancels the root; every leaf is woken by the cancellation itself rather than discovering it on a later poll tick.

Motivation

After #1033 / #1049 landed, tokio_util::sync::CancellationToken is in use at one site: DbLifecycle::shutdown. The remaining shutdown sites still use a status-polling pattern with 100 ms intervals plus JoinHandle::abort(). The polling is correct, but it adds shutdown latency (typically 50–500 ms), and wherever abort() is the only effective interrupt the task is killed at its next await point and gets no chance to drain gracefully.

Target architecture

zainod root CancellationToken
├── child: jsonrpc server token   (Site 2 sub-issue)
├── child: grpc server token      (Site 2 sub-issue)
├── child: mempool token          (Site 1 sub-issue)
├── child: state service token    (no-op — task is foreign, only abort()able)
├── child: fetch service token    (no-op — shutdown delegates downward)
└── child: chain_index / finalised_state token  (already CT, via #1049)

Cancelling the root propagates to every leaf. Cancelling a child token does NOT propagate up to the root or across to sibling tokens, which contains the blast radius of a worker bug to its own service.

Sub-issues

  • mempool: replace status-poll + abort with CT (Site 1 — separate sub-issue)
  • zaino-serve jsonrpc + grpc: replace shutdown-signal polling with CT (Site 2 — separate sub-issue)
  • After both land: zainod/indexer.rs holds the root token; SIGINT/SIGTERM handlers call root.cancel(); supervisor loop selects on root.cancelled()

Explicit non-goals

  • backends/state.rs — the long-running task is owned by zebra_state upstream. We can only abort(); no CT plumbing is possible without upstream changes.
  • backends/fetch.rs — shutdown delegates to the underlying gRPC client; no local task to cancel.

Acceptance criteria

  • Supervisor loop in zainod/indexer.rs selects on root.cancelled() instead of polling.
  • Every long-running task we own accepts a CancellationToken (directly or via trait getter) and selects against .cancelled().
  • Notify / status-polling shutdown patterns are absent from sources after this lands. JoinHandle::abort() survives only as the Drop-time backstop.
  • cargo nextest run passes, and zainod shutdown latency under SIGINT is measurably lower (target: <50 ms from signal to last task exit).
