
fix: auto-cleanup orphaned gateway container on re-onboard #1567

Open
cr7258 wants to merge 3 commits into NVIDIA:main from cr7258:fix/cleanup-orphaned-gateway-container

Conversation

Contributor

@cr7258 cr7258 commented Apr 7, 2026

Summary

When Ctrl+C interrupts nemoclaw onboard during gateway startup, the Docker container (openshell-cluster-nemoclaw) keeps running, but OpenShell has no metadata for it. On re-onboard, preflight reports the gateway state as "missing" and skips cleanup, so the still-running container holds port 8080 and the port check fails.

Add orphaned container detection to preflight: when the gateway state is "missing", check whether the Docker container exists via docker inspect. If it does, stop and remove it, along with its associated volumes, before proceeding to the port checks.
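
The detection step can be sketched roughly as follows. This is a minimal illustration, not the code in bin/lib/onboard.js: the helper names containerExists and shouldCleanOrphan and the injectable spawn parameter are assumptions made for the example. It relies only on docker inspect exiting with a non-zero status when the named container does not exist.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: true when `docker inspect` finds a container with this name.
// `spawn` is injectable so the logic can be exercised without a Docker daemon.
function containerExists(name, spawn = spawnSync) {
  const result = spawn("docker", ["inspect", "--type", "container", name], {
    stdio: "ignore",
  });
  // Exit status 0 means the container exists; any other status (including a
  // missing docker binary, where status is null) is treated as "not found".
  return result.status === 0;
}

// Hypothetical helper: decides whether preflight should attempt orphan cleanup.
// Only the "missing" state is taken from the PR description.
function shouldCleanOrphan(gatewayReuseState, gatewayName, spawn = spawnSync) {
  if (gatewayReuseState !== "missing") return false;
  return containerExists(`openshell-cluster-${gatewayName}`, spawn);
}
```

Calling shouldCleanOrphan("missing", "nemoclaw") would then return true only while an openshell-cluster-nemoclaw container is present, whether running or stopped.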

Before fix (re-onboard fails):

    [1/8] Preflight checks
    ✓ Docker is running
    ✓ Container runtime: docker
    ✓ openshell CLI: openshell 0.0.23

    !! Port 8080 is not available.
       OpenShell gateway needs this port.

       Blocked by: docker-pr (PID 667159)

       To fix, stop the conflicting process:

         sudo kill 667159

After fix (re-onboard auto-cleans):

    [1/8] Preflight checks
    ✓ Docker is running
    ✓ Container runtime: docker
    ✓ openshell CLI: openshell 0.0.23
    Cleaning up orphaned gateway container...
    ✓ Orphaned gateway container removed
    ✓ Port 8080 available (OpenShell gateway)
    ✓ Port 18789 available (NemoClaw dashboard)
    ✓ NVIDIA GPU detected: 1 GPU(s), 122543 MB VRAM
    ✓ Memory OK: 122543 MB RAM + 16383 MB swap

    [2/8] Starting OpenShell gateway
    ...

Related Issue

Changes

  • bin/lib/onboard.js: detect and remove orphaned openshell-cluster-* Docker container when gatewayReuseState === "missing" in preflight

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: Seven Cheng [email protected]

Summary by CodeRabbit

  • Bug Fixes
    • Onboarding now performs a best-effort cleanup of orphaned local containers and matching volumes, clears stale in-memory state, and logs the outcome. This cleanup runs before port availability checks so those checks reflect the current system state and reduce setup interruptions.
    • Emits a warning when automatic cleanup cannot remove lingering resources.

When Ctrl+C interrupts `nemoclaw onboard` during gateway startup, the
Docker container (openshell-cluster-nemoclaw) keeps running but OpenShell
has no metadata for it. On re-onboard, preflight returns "missing" gateway
state and skips cleanup, causing port 8080 conflict.

Add orphaned container detection in preflight: when gateway state is
"missing", check if the Docker container exists via `docker inspect`.
If found, stop and remove it along with associated volumes before
proceeding to port checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

coderabbitai bot commented Apr 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9cc46950-d0e6-4b2e-acd1-60b58537125a

📥 Commits

Reviewing files that changed from the base of the PR and between 4bac5ef and aea1112.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Adds a best-effort preflight cleanup when gatewayReuseState === "missing": detects an orphaned openshell-cluster-${GATEWAY_NAME} container, attempts stop/remove, re-checks and prunes matching Docker volumes, clears in-memory registry, logs outcomes, then continues to port availability checks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Onboarding Cleanup Logic (bin/lib/onboard.js) | Adds a preflight() branch for gatewayReuseState === "missing" that inspects for openshell-cluster-${GATEWAY_NAME}, attempts docker stop and docker rm, re-checks container presence, prunes Docker volumes matching the cluster name, calls registry.clearAll(), and logs warnings or success before proceeding to port checks. |
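
The volume-pruning step in this walkthrough could look something like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: listClusterVolumes, pruneClusterVolumes, and the injectable spawn parameter are hypothetical names. Because Docker's name filter matches substrings, the sketch also keeps only names that start with the given prefix.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: names of Docker volumes that start with `prefix`.
// A failed listing is treated as "no volumes", in keeping with best-effort cleanup.
function listClusterVolumes(prefix, spawn = spawnSync) {
  const result = spawn(
    "docker",
    ["volume", "ls", "-q", "--filter", `name=${prefix}`],
    { encoding: "utf8" },
  );
  if (result.status !== 0 || !result.stdout) return [];
  return result.stdout
    .split("\n")
    .map((name) => name.trim())
    .filter(Boolean)
    // The name filter is a substring match, so narrow it to a true prefix.
    .filter((name) => name.startsWith(prefix));
}

// Hypothetical helper: tries to remove each matching volume, then returns
// whatever is still listed so the caller can warn about leftovers.
function pruneClusterVolumes(prefix, spawn = spawnSync) {
  for (const volume of listClusterVolumes(prefix, spawn)) {
    spawn("docker", ["volume", "rm", volume], { stdio: "ignore" });
  }
  return listClusterVolumes(prefix, spawn);
}
```

An empty array from pruneClusterVolumes means no matching volumes are listed any more; a non-empty one names the volumes that could not be removed.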

Sequence Diagram

sequenceDiagram
    participant Onboard as Onboard Script
    participant Docker as Docker CLI
    participant Registry as Registry State
    participant Logger as Logger

    Onboard->>Onboard: detect gatewayReuseState === "missing"
    Onboard->>Docker: docker inspect openshell-cluster-{GATEWAY_NAME}
    Docker-->>Onboard: inspect result
    alt container exists
        Onboard->>Docker: docker stop openshell-cluster-{GATEWAY_NAME}
        Docker-->>Onboard: stopped
        Onboard->>Docker: docker rm openshell-cluster-{GATEWAY_NAME}
        Docker-->>Onboard: removed
        Onboard->>Docker: docker inspect (re-check)
        Docker-->>Onboard: no container
        Onboard->>Docker: docker volume rm (filter openshell-cluster-{GATEWAY_NAME})
        Docker-->>Onboard: volumes removed
        Onboard->>Registry: registry.clearAll()
        Registry-->>Onboard: cleared
        Onboard->>Logger: log cleanup success
    else still exists / cleanup failed
        Onboard->>Logger: log warning about failed cleanup
    end
    Onboard->>Onboard: proceed to port availability checks

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the orphaned box and gave a hop,
I coaxed it down and watched the engines stop,
I swept the volumes, cleared the registry shelf,
Now ports may greet the cluster — tidy self! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately describes the main change: automatic cleanup of orphaned gateway containers during re-onboarding when gateway state is 'missing'. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1609-1617: The cleanup currently ignores docker stop/rm failures
and calls registry.clearAll() and prints success unconditionally; modify the
logic around run(..., {ignoreError:true}) for the gateway container to
capture/inspect the docker commands' success (for containerName/GATEWAY_NAME)
and only call registry.clearAll() and log the "Orphaned gateway container
removed" message if docker stop and docker rm both succeeded (or if volumes were
removed as intended); if removal fails, preserve the registry and surface/log
the error so the preflight does not erase persisted sandboxes (references:
run(), containerName, GATEWAY_NAME, registry.clearAll()).
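
One way to read this finding is sketched below: the registry is cleared and the success line printed only after a follow-up docker inspect confirms the container is gone. The function name cleanupOrphanedGateway and the deps object (exec, clearRegistry, log, warn) are hypothetical stand-ins for the run() and registry.clearAll() helpers referenced above, whose real signatures are not shown in this thread.

```javascript
const { spawnSync } = require("node:child_process");

// Illustrative defaults only; clearRegistry stands in for registry.clearAll().
const defaultDeps = {
  exec: (cmd, args) => spawnSync(cmd, args, { stdio: "ignore" }),
  clearRegistry: () => {},
  log: console.log,
  warn: console.warn,
};

// Hypothetical sketch: returns true only when removal was verified.
function cleanupOrphanedGateway(containerName, deps = defaultDeps) {
  const { exec, clearRegistry, log, warn } = deps;

  // Best effort: either command may fail (for example, if already stopped).
  exec("docker", ["stop", containerName]);
  exec("docker", ["rm", containerName]);

  // Verify the outcome instead of trusting the commands above.
  const stillPresent =
    exec("docker", ["inspect", "--type", "container", containerName]).status === 0;

  if (stillPresent) {
    // Leave the registry untouched so persisted sandboxes are not erased.
    warn(`  ! Could not remove ${containerName}; registry left unchanged.`);
    return false;
  }

  clearRegistry();
  log("  ✓ Orphaned gateway container removed");
  return true;
}
```

Note that this sketch treats any non-zero inspect status as "container gone", so an unreachable Docker daemon would also pass the check; a stricter version might distinguish the two cases.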

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df4547ae-c800-47a8-9bb5-ce52b80ab843

📥 Commits

Reviewing files that changed from the base of the PR and between 6d5cf80 and 0f8752c.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

1623-1628: Surface partial cleanup when the volume prune doesn't stick.

destroyGateway() already treats leftover openshell-cluster-${GATEWAY_NAME} volumes as start-breaking state, but this path still masks every docker volume rm failure with || true and then prints a full-success message. A quick post-prune check would make the log accurate when the container is gone but the volumes are not.

♻️ Possible follow-up
       if (postInspectResult.status !== 0) {
         run(
           `docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | grep . && docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | xargs docker volume rm 2>/dev/null || true`,
           { ignoreError: true, suppressOutput: true },
         );
+        const remainingVolumes = runCapture(
+          `docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}"`,
+          { ignoreError: true },
+        )
+          .split("\n")
+          .map((name) => name.trim())
+          .filter(Boolean);
         registry.clearAll();
-        console.log("  ✓ Orphaned gateway container removed");
+        if (remainingVolumes.length === 0) {
+          console.log("  ✓ Orphaned gateway container removed");
+        } else {
+          console.warn(
+            "  ! Gateway container was removed, but stale Docker volumes remain and may break the next gateway start.",
+          );
+        }
       } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1623 - 1628, The current cleanup always
prints success even if the volume prune failed; change the logic around the
run(...) call that removes openshell-cluster-${GATEWAY_NAME} volumes so it
verifies removal afterwards: after calling run(...) (or capturing its result)
run a follow-up check using the same docker volume ls -q --filter
"name=openshell-cluster-${GATEWAY_NAME}" command and if any volumes remain call
registry.clearAll() as needed but log a warning/error (instead of the success
message) and surface the failure (do not mask it with || true); update the
messages around registry.clearAll() and console.log("  ✓ Orphaned gateway
container removed") to reflect the actual state when volumes are not deleted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 648b0ad9-15e0-415c-9705-6b71787ee372

📥 Commits

Reviewing files that changed from the base of the PR and between 0f8752c and 4bac5ef.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
cr7258 force-pushed the fix/cleanup-orphaned-gateway-container branch from 4bac5ef to aea1112 on April 7, 2026 13:30