fix: auto-cleanup orphaned gateway container on re-onboard#1567
fix: auto-cleanup orphaned gateway container on re-onboard#1567cr7258 wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
When Ctrl+C interrupts `nemoclaw onboard` during gateway startup, the Docker container (openshell-cluster-nemoclaw) keeps running but OpenShell has no metadata for it. On re-onboard, preflight returns "missing" gateway state and skips cleanup, causing port 8080 conflict. Add orphaned container detection in preflight: when gateway state is "missing", check if the Docker container exists via `docker inspect`. If found, stop and remove it along with associated volumes before proceeding to port checks. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a best-effort preflight cleanup when Changes
Sequence DiagramsequenceDiagram
participant Onboard as Onboard Script
participant Docker as Docker CLI
participant Registry as Registry State
participant Logger as Logger
Onboard->>Onboard: detect gatewayReuseState === "missing"
Onboard->>Docker: docker inspect openshell-cluster-{GATEWAY_NAME}
Docker-->>Onboard: inspect result
alt container exists
Onboard->>Docker: docker stop openshell-cluster-{GATEWAY_NAME}
Docker-->>Onboard: stopped
Onboard->>Docker: docker rm openshell-cluster-{GATEWAY_NAME}
Docker-->>Onboard: removed
Onboard->>Docker: docker inspect (re-check)
Docker-->>Onboard: no container
Onboard->>Docker: docker volume rm (filter openshell-cluster-{GATEWAY_NAME})
Docker-->>Onboard: volumes removed
Onboard->>Registry: registry.clearAll()
Registry-->>Onboard: cleared
Onboard->>Logger: log cleanup success
else still exists / cleanup failed
Onboard->>Logger: log warning about failed cleanup
end
Onboard->>Onboard: proceed to port availability checks
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1609-1617: The cleanup currently ignores docker stop/rm failures
and calls registry.clearAll() and prints success unconditionally; modify the
logic around run(..., {ignoreError:true}) for the gateway container to
capture/inspect the docker commands' success (for containerName/GATEWAY_NAME)
and only call registry.clearAll() and log the "Orphaned gateway container
removed" message if docker stop and docker rm both succeeded (or if volumes were
removed as intended); if removal fails, preserve the registry and surface/log
the error so the preflight does not erase persisted sandboxes (references:
run(), containerName, GATEWAY_NAME, registry.clearAll()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: df4547ae-c800-47a8-9bb5-ce52b80ab843
📒 Files selected for processing (1)
bin/lib/onboard.js
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
1623-1628: Surface partial cleanup when the volume prune doesn't stick.
destroyGateway()already treats leftoveropenshell-cluster-${GATEWAY_NAME}volumes as start-breaking state, but this path still masks everydocker volume rmfailure with|| trueand then prints a full-success message. A quick post-prune check would make the log accurate when the container is gone but the volumes are not.♻️ Possible follow-up
if (postInspectResult.status !== 0) { run( `docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | grep . && docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | xargs docker volume rm 2>/dev/null || true`, { ignoreError: true, suppressOutput: true }, ); + const remainingVolumes = runCapture( + `docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}"`, + { ignoreError: true }, + ) + .split("\n") + .map((name) => name.trim()) + .filter(Boolean); registry.clearAll(); - console.log(" ✓ Orphaned gateway container removed"); + if (remainingVolumes.length === 0) { + console.log(" ✓ Orphaned gateway container removed"); + } else { + console.warn( + " ! Gateway container was removed, but stale Docker volumes remain and may break the next gateway start.", + ); + } } else {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1623 - 1628, The current cleanup always prints success even if the volume prune failed; change the logic around the run(...) call that removes openshell-cluster-${GATEWAY_NAME} volumes so it verifies removal afterwards: after calling run(...) (or capturing its result) run a follow-up check using the same docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" command and if any volumes remain call registry.clearAll() as needed but log a warning/error (instead of the success message) and surface the failure (do not mask it with || true); update the messages around registry.clearAll() and console.log(" ✓ Orphaned gateway container removed") to reflect the actual state when volumes are not deleted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1623-1628: The current cleanup always prints success even if the
volume prune failed; change the logic around the run(...) call that removes
openshell-cluster-${GATEWAY_NAME} volumes so it verifies removal afterwards:
after calling run(...) (or capturing its result) run a follow-up check using the
same docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}"
command and if any volumes remain call registry.clearAll() as needed but log a
warning/error (instead of the success message) and surface the failure (do not
mask it with || true); update the messages around registry.clearAll() and
console.log(" ✓ Orphaned gateway container removed") to reflect the actual
state when volumes are not deleted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 648b0ad9-15e0-415c-9705-6b71787ee372
📒 Files selected for processing (1)
bin/lib/onboard.js
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
4bac5ef to
aea1112
Compare
Summary
When Ctrl+C interrupts
nemoclaw onboardduring gateway startup, the Docker container (openshell-cluster-nemoclaw) keeps running but OpenShell has no metadata for it. On re-onboard, preflight returns "missing" gateway state and skips cleanup, causing port 8080 conflict.Add orphaned container detection in preflight: when gateway state is "missing", check if the Docker container exists via
docker inspect. If found, stop and remove it along with associated volumes before proceeding to port checks.Before fix (re-onboard fails):
After fix (re-onboard auto-cleans):
Related Issue
Changes
bin/lib/onboard.js: detect and remove orphanedopenshell-cluster-*Docker container whengatewayReuseState === "missing"in preflightType of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Checklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Doc Changes
update-docsagent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docscatch up the docs for the new changes I made in this PR."Signed-off-by: Seven Cheng [email protected]
Summary by CodeRabbit