server/cluster: cancel ctx on Start failure#10310

Open
lance6716 wants to merge 1 commit into tikv:master from lance6716:codex/issue-10309-start-cancel

Conversation

lance6716 (Contributor) commented Mar 7, 2026

What problem does this PR solve?

Issue Number: Close #10309

What is changed and how does it work?

Fix RaftCluster.Start leaking goroutines on startup failure by canceling the cluster context when Start fails.

Check List

Tests

  • Unit test

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch

Release note

None.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved cluster initialization robustness by ensuring all background processes are properly terminated if startup fails, preventing resource leaks and enhancing system stability.

Copilot AI review requested due to automatic review settings March 7, 2026 05:19
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 7, 2026

ti-chi-bot bot commented Mar 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cabinfeverb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 7, 2026

coderabbitai bot commented Mar 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f47d3ad-d81b-4d6e-a994-f63114e833af

📥 Commits

Reviewing files that changed from the base of the PR and between 95cde21 and 98a46a7.

📒 Files selected for processing (2)
  • server/cluster/cluster.go
  • server/cluster/cluster_test.go

📝 Walkthrough

The changes add error-triggered cleanup to RaftCluster.Start() by deferring context cancellation if an error occurs after InitCluster succeeds, preventing goroutine leaks on startup failure. A comprehensive test validates this cleanup behavior under bootstrap failure scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Context Cleanup on Startup Failure (server/cluster/cluster.go) | Defers cancellation of cluster context if an error occurs after InitCluster, ensuring background goroutines are cleaned up on startup failure before returning. |
| Bootstrap Failure Test Suite (server/cluster/cluster_test.go) | Introduces a failing keyspace bootstrap error sentinel, storage wrapper, test server variant with internal component getters, and TestStartCancelsContextOnBootstrapFailure to verify context cancellation and RaftCluster state on bootstrap failure. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A context that once did leak,
Now finds its rest, so crisp and sleek.
When startup stumbles, fails, or falls,
A deferred defer now handles calls.
Background bunnies rest in peace,
The goroutine leaks finally cease! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'server/cluster: cancel ctx on Start failure' clearly and concisely describes the main change: adding context cancellation logic when Start fails, which matches the core fix in the changeset. |
| Description check | ✅ Passed | The PR description includes all required sections from the template: a linked issue (Close #10309), a clear commit message describing the fix, unit test confirmation, and release note status. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #10309 by implementing deferred context cancellation in Start() to prevent goroutine leaks when startup fails after InitCluster, including a regression test to verify the fix. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: cluster.go adds startup-failure cleanup logic, and cluster_test.go adds the regression test to verify goroutine cleanup on bootstrap failure. |

lance6716 (Contributor, Author) commented

/check-issue-triage-complete

Copilot AI left a comment

Pull request overview

Fixes a RaftCluster startup goroutine leak by ensuring the cluster context is canceled when RaftCluster.Start() fails before running is set to true (so Stop() would otherwise no-op).

Changes:

  • Add a deferred cleanup in RaftCluster.Start() to cancel c.ctx on startup error.
  • Add a regression test that forces keyspaceGroupManager.Bootstrap() to fail and asserts c.ctx is canceled (preventing background goroutine leaks).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| server/cluster/cluster.go | Adds deferred context cancellation on Start() error to prevent leaked background goroutines during partial startup. |
| server/cluster/cluster_test.go | Adds a regression test that simulates a startup failure after goroutine-starting components are created. |

	// If a later step fails before `running` is set to true, `Stop` will return
	// early and cannot reliably clean them up.
	defer func() {
		if err != nil {

Copilot AI Mar 7, 2026

The deferred cleanup only cancels c.ctx when err != nil, but Start can return early with err == nil while c.running is still false (e.g. when LoadClusterInfo() returns cluster == nil / not bootstrapped). Because InitCluster() already creates components that spawn goroutines off c.ctx (via newSchedulingController -> statistics.NewHotStat -> NewHotCache), this can still leave background goroutines running with no way for Stop() to cancel them (since Stop() returns early when running is false). Consider cancelling on any early exit where running was not set to true (e.g. defer cleanup conditioned on !c.running, or explicitly cancel before returning in the cluster == nil path).

Suggested change
- if err != nil {
+ // Cancel the context on any failed start, or if we return without ever
+ // marking the cluster as running, to avoid leaking background goroutines.
+ if err != nil || !c.running {

lance6716 (Contributor, Author) commented

/retest

ti-chi-bot bot commented Mar 7, 2026

@lance6716: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-2 | 98a46a7 | link | true | /test pull-unit-test-next-gen-2 |

codecov bot commented Mar 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.85%. Comparing base (c1f3166) to head (98a46a7).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10310      +/-   ##
==========================================
+ Coverage   78.78%   78.85%   +0.06%     
==========================================
  Files         527      527              
  Lines       70916    70925       +9     
==========================================
+ Hits        55870    55926      +56     
+ Misses      11026    11001      -25     
+ Partials     4020     3998      -22     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 78.85% <100.00%> (+0.06%) ⬆️ |

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

server/cluster: RaftCluster.Start leaks goroutines on startup failure

2 participants