Skip to content

server/cluster: RaftCluster.Start leaks goroutines on startup failure #10309

@lance6716

Description

@lance6716

Bug report

RaftCluster.Start() can leak startup goroutines when it returns an error before running is set to true.

The failure mode is generic and not tied to any one feature: Start() creates components that launch background loops using c.ctx, but if a later startup step fails, Start() returns without canceling that context. Since running is still false, a later RaftCluster.Stop() also returns early and never calls c.cancel().

Current startup path that reproduces it

On current master, this sequence is enough:

  1. Start() creates RegionLabeler, which starts RegionLabeler.doGC().
  2. Start() creates affinity.Manager, which starts its availability check loop.
  3. keyspaceGroupManager.Bootstrap(c.ctx) returns an error.
  4. Start() returns early while c.ctx is still live.
  5. The package-level goleak check reports leaked goroutines.

Reproduction

I reproduced this locally with a regression test under server/cluster that makes keyspaceGroupManager.Bootstrap fail after the earlier startup components are initialized.

Run:

go test ./server/cluster -run TestStartCancelsContextOnBootstrapFailure -count=1

Observed result: the test body itself passes the expected startup error, but the package fails in goleak with leaked goroutines including:

  • labeler.(*RegionLabeler).doGC
  • affinity.(*Manager).startAvailabilityCheckLoop.func2

I also saw related partially-initialized background goroutines such as hot cache workers still alive in the same run.

Expected result

If RaftCluster.Start() fails at any point after InitCluster() succeeds, all background work started off c.ctx should be torn down before returning.

Likely root cause

Start() currently relies on Stop() for normal shutdown, but Stop() exits early when running == false. That means any startup failure after background goroutines are created but before running = true has no unified rollback path.

Suggested fix

Add a startup-failure rollback path inside RaftCluster.Start() itself. For example, after InitCluster() succeeds, defer a cleanup that cancels c.ctx when err != nil.

That would make startup failures self-contained and avoid depending on Stop() for a state where running was never set.


  • LLM model: gpt-5.4
  • Thinking level: xhigh
  • Client: Codex Desktop 26.305.950 (build 863); codex-cli 0.105.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions