test(e2e): add sandbox operations recovery and multi-sandbox isolatio… by ShevaNguyen · Pull Request #1865 · NVIDIA/NemoClaw

ShevaNguyen · 2026-04-14T07:27:54Z

Summary

Add test/e2e/test-sandbox-operations.sh covering 10 sandbox operation features that currently lack E2E coverage. The script runs 14 assertions across basic operations, recovery paths, and multi-sandbox isolation.

The integration of those to pipeline will be delivered in another PR
Test cases:

TC-SBX-01: nemoclaw list shows onboarded sandbox
TC-SBX-02: SSH into sandbox + openclaw agent -m returns expected response
TC-SBX-03: nemoclaw status output contains Sandbox, Model, Provider, OpenClaw fields
TC-SBX-04: nemoclaw logs produces output; --follow exits cleanly after kill
TC-SBX-05: nemoclaw destroy removes sandbox from registry and Docker
TC-SBX-06: docker kill openshell-cluster-nemoclaw (hard crash); nemoclaw status handles gracefully; re-onboard succeeds
TC-SBX-07: Delete sandboxes.json; nemoclaw list rebuilds registry from live OpenShell state
TC-SBX-08: Kill OpenClaw gateway process inside sandbox; nemoclaw status detects and recovers
TC-SBX-10: Two sandboxes onboarded; nemoclaw list shows both with correct metadata
TC-SBX-11: Cross-sandbox node HTTP probe returns HTTP 403 (proxy blocks); both directions verified

Related Issue

Link to issue 1863

Changes

Type of Change

Code change for a new feature, bug fix, or refactor.
Code change with doc updates.
Doc only. Prose changes without code sample modifications.
Doc only. Includes code sample changes.

Testing

npx prek run --all-files passes (or equivalently make check).
npm test passes.
make docs builds without warnings. (for doc-only changes)
Validated locally: 14/14 PASS on macOS Apple Silicon, Colima, OpenShell 0.0.26

Checklist

General

I have read and followed the contributing guide.
I have read and followed the style guide. (for doc-only changes)

Code Changes

Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
Tests added or updated for new or changed behavior.
No secrets, API keys, or credentials committed.
Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

Follows the style guide. Try running the nemoclaw-contributor-update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/nemoclaw-contributor-update-docs catch up the docs for the new changes I made in this PR."
New pages include SPDX license header and frontmatter, if creating a new page.
Cross-references and links verified.

Signed-off-by: Truong Nguyen [email protected]

Summary by CodeRabbit

Tests
- Added a comprehensive end-to-end sandbox operations test runner covering TC-SBX-01 through TC-SBX-11.
- Validates onboarding, listing, SSH commands, status fields, log streaming (follow/timeout), recovery from process/container/gateway failures, multi-sandbox behavior, network isolation probes, and destroy/cleanup flows.
- Implements CI runtime guard, timestamped logging, pass/fail/skip reporting, conditional test skipping, and best-effort teardown.

coderabbitai · 2026-04-14T07:31:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a new Bash end-to-end test script test/e2e/test-sandbox-operations.sh implementing TC-SBX-01–TC-SBX-11 to exercise sandbox onboarding, SSH/CLI checks, log streaming, recovery scenarios, multi-sandbox/network isolation, teardown, logging, timeout guard, and summary reporting.

Changes

Cohort / File(s)	Summary
E2E Test Suite `test/e2e/test-sandbox-operations.sh`	New executable Bash E2E script (~+594 lines). Adds strict mode and timeout re-exec, timestamped logging, PASS/FAIL/SKIP/TOTAL counters, helper functions for sandbox onboarding/SSH/exec, phased tests (preflight, setup, phases 1–5 covering listing, SSH one-shot, status checks, log streaming, registry rebuild, process/gateway recovery, destructive gateway kill, multi-sandbox onboarding, network isolation probes), trap-registered teardown, and final summary/exit behavior.

Sequence Diagram(s)

sequenceDiagram
    participant CI as CI Runner
    participant Script as test-sbx.sh
    participant Nemoclaw as Nemoclaw CLI
    participant Docker as Docker Engine
    participant Sandbox as Sandbox (VM/Container)
    participant Gateway as OpenClaw Gateway

    CI->>Script: start (with timeout)
    Script->>Nemoclaw: preflight checks (cli present, creds)
    Script->>Docker: verify Docker responsive
    Script->>Nemoclaw: destroy stale sandboxes
    alt onboarding needed
        Script->>Nemoclaw: onboard sandbox A (--non-interactive)
        Nemoclaw->>Sandbox: provision sandbox A
    end
    Script->>Sandbox: run SSH one-shot command (check token)
    Script->>Nemoclaw: `nemoclaw <sandbox> status` (validate fields)
    Script->>Sandbox: stream logs (--follow), then stop follower
    Script->>Sandbox: kill gateway process (in-sandbox)
    Sandbox->>Gateway: gateway restarts/recovers
    Script->>Nemoclaw: verify recovery via status and SSH
    Script->>Nemoclaw: onboard sandbox B
    Script->>Sandbox: SSH probe HTTP endpoints (network isolation)
    Script->>Nemoclaw: destroy sandbox B
    Script->>Nemoclaw: final cleanup (destroy A, remove locks)
    Script->>CI: summary (PASS/FAIL/SKIP/TOTAL)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through logs beneath the midnight test,
I nudged a gateway, then watched it rest,
I counted PASS and FAIL with whiskers bright,
I probed two sandboxes through soft moonlight,
Then swept the shells and tucked the network tight.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'test(e2e): add sandbox operations recovery and multi-sandbox isolatio…' is directly related to the main change, which adds a comprehensive E2E test script covering sandbox operations, recovery scenarios, and multi-sandbox isolation.
Docstring Coverage	✅ Passed	Docstring coverage is 82.61% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

test/e2e/test-sandbox-operations.sh (4)

391-399: Document the purpose of port 18789.

The hardcoded port 18789 is used without explanation. Consider adding a comment explaining what service is expected on this port to improve maintainability.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 391 - 399, The probe using
sandbox_exec_for to curl node's http.get targets hardcoded port 18789 (variable
names: probe_a, SANDBOX_B) without context; add a short inline comment above
this block explaining what service listens on port 18789 (e.g., "internal test
HTTP probe for sandbox service X"), and preferably replace the literal with a
clearly named shell variable (e.g., SERVICE_PORT or SANDBOX_HTTP_PORT) defined
near the other SANDBOX_* variables so future readers know the intent and can
update the port in one place.

60-72: Consider handling ssh-config failures explicitly.

If openshell sandbox ssh-config fails (non-zero exit), the script may exit due to set -e. Adding || true or checking the exit status would make the function more robust.

Proposed fix

 sandbox_exec_for() {
     local name="$1" cmd="$2"
     local ssh_cfg
     ssh_cfg="$(mktemp)"
-    openshell sandbox ssh-config "$name" >"$ssh_cfg" 2>/dev/null
+    if ! openshell sandbox ssh-config "$name" >"$ssh_cfg" 2>/dev/null; then
+        rm -f "$ssh_cfg"
+        echo "ERROR: Could not get SSH config for $name"
+        return
+    fi
     local result

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 60 - 72, The
sandbox_exec_for function currently assumes openshell sandbox ssh-config
succeeds; capture and check its exit status after running openshell sandbox
ssh-config "$name" into ssh_cfg (or run it with || true and then test if the
file is non-empty) and if it failed, remove ssh_cfg and return a non-zero result
or informative error string instead of proceeding to ssh; update logic around
ssh_cfg creation in sandbox_exec_for to explicitly handle failure (clean up tmp
file, log or echo an error, and avoid calling ssh with an invalid config).

241-243: Log message precedes actual deletion.

The log states "Backed up and deleted" but deletion occurs on the next line. Consider reordering or splitting the message.

Proposed fix

     cp "$registry" "${registry}.bak"
-    log "  Backed up and deleted sandboxes.json"
+    log "  Backed up sandboxes.json"
     rm -f "$registry"
+    log "  Deleted sandboxes.json"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 241 - 243, The log message
currently says "Backed up and deleted sandboxes.json" before the deletion
actually happens; update the sequence so the operation and message match: either
perform rm -f "$registry" immediately after cp "$registry" "${registry}.bak" and
then call log "  Backed up and deleted sandboxes.json", or change the log call
to "  Backed up sandboxes.json" before removal and add a second log after rm -f
"$registry" (e.g., log "  Deleted sandboxes.json"); modify the lines using the
"$registry" variable and the log function to keep messages consistent with cp
and rm commands.

494-503: Teardown runs twice: explicitly in main() and via trap EXIT.

The teardown function is called at line 497 and again when trap teardown EXIT fires on script exit (line 503). While teardown is idempotent (sandboxes are already destroyed), this results in redundant nemoclaw list calls. Consider removing the explicit call or unsetting the trap before summary.

Option A: Remove explicit teardown call

     # Phase 5: Cleanup verification (destroys sandbox B)
     test_sbx_05_destroy_cleanup "$SANDBOX_B"

-    # Teardown everything
-    teardown
-
     # Report
     summary
 }

Option B: Clear trap before summary

     # Teardown everything
     teardown
+    trap - EXIT

     # Report
     summary

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 494 - 503, The script calls
teardown twice: once explicitly in main() after test_sbx_05_destroy_cleanup and
again via trap teardown EXIT; remove the redundancy by either deleting the
explicit teardown call in main() (leave trap teardown EXIT to run cleanup on
exit) or by unsetting the trap (e.g., using trap - EXIT) before calling summary
so teardown only runs once; update references around the teardown invocation in
main(), the trap teardown EXIT declaration, and the summary call to implement
the chosen option.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 46-50: The increment helpers (pass, fail, skip) use post-increment
arithmetic ((PASS++), ((FAIL++), ((SKIP++), ((TOTAL++))) which can return 0 and
cause the script to exit under set -e; update those increments in the functions
pass, fail, skip (and any other helper using ((VAR++))) to use
addition-assignment ((PASS+=1), ((FAIL+=1), ((SKIP+=1), ((TOTAL+=1))) so the
expression always evaluates truthy and the script won’t abort on the first
increment.
- Around line 1-17: Add an SPDX license identifier immediately after the
existing shebang line (#!/usr/bin/env bash) so the script includes a one-line
SPDX header (e.g., # SPDX-License-Identifier: <Your-License-Identifier>),
ensuring the identifier matches the project's license policy and appears before
any other comments or code in test-sandbox-operations.sh.
- Around line 175-197: Add the missing SPDX header to the top of
test/e2e/test-sandbox-operations.sh (add the two lines shown in the review) and
update the TC-SBX-03 test (function test_sbx_03_status_fields) to match the
actual nemoclaw status output: change the expected field "Sandbox" to
"Sandboxes:" (or the exact label emitted by showStatusCommand), remove or
replace "Provider" and "OpenClaw" with the actual service/status labels that
nemoclaw prints (or assert against model info and service names produced by
showStatusCommand), and keep the rest of the loop/validation logic unchanged so
the test reflects real output from the nemoclaw "<name> status" command.

---

Nitpick comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 391-399: The probe using sandbox_exec_for to curl node's http.get
targets hardcoded port 18789 (variable names: probe_a, SANDBOX_B) without
context; add a short inline comment above this block explaining what service
listens on port 18789 (e.g., "internal test HTTP probe for sandbox service X"),
and preferably replace the literal with a clearly named shell variable (e.g.,
SERVICE_PORT or SANDBOX_HTTP_PORT) defined near the other SANDBOX_* variables so
future readers know the intent and can update the port in one place.
- Around line 60-72: The sandbox_exec_for function currently assumes openshell
sandbox ssh-config succeeds; capture and check its exit status after running
openshell sandbox ssh-config "$name" into ssh_cfg (or run it with || true and
then test if the file is non-empty) and if it failed, remove ssh_cfg and return
a non-zero result or informative error string instead of proceeding to ssh;
update logic around ssh_cfg creation in sandbox_exec_for to explicitly handle
failure (clean up tmp file, log or echo an error, and avoid calling ssh with an
invalid config).
- Around line 241-243: The log message currently says "Backed up and deleted
sandboxes.json" before the deletion actually happens; update the sequence so the
operation and message match: either perform rm -f "$registry" immediately after
cp "$registry" "${registry}.bak" and then call log "  Backed up and deleted
sandboxes.json", or change the log call to "  Backed up sandboxes.json" before
removal and add a second log after rm -f "$registry" (e.g., log "  Deleted
sandboxes.json"); modify the lines using the "$registry" variable and the log
function to keep messages consistent with cp and rm commands.
- Around line 494-503: The script calls teardown twice: once explicitly in
main() after test_sbx_05_destroy_cleanup and again via trap teardown EXIT;
remove the redundancy by either deleting the explicit teardown call in main()
(leave trap teardown EXIT to run cleanup on exit) or by unsetting the trap
(e.g., using trap - EXIT) before calling summary so teardown only runs once;
update references around the teardown invocation in main(), the trap teardown
EXIT declaration, and the summary call to implement the chosen option.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8053da33-7bce-4dfa-8547-38c89f76b172

📥 Commits

Reviewing files that changed from the base of the PR and between cd36c58 and 2619f28.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

test/e2e/test-sandbox-operations.sh

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

test/e2e/test-sandbox-operations.sh (1)

62-64: ⚠️ Potential issue | 🔴 Critical

set -e + ((VAR++)) will exit on the first PASS/FAIL/SKIP update.

Lines 62-64 use post-increment arithmetic in command context. With set -e, the first increment evaluates to 0 and returns exit code 1, which can terminate the script immediately.

#!/usr/bin/env bash
set -euo pipefail
PASS=0
((PASS++))
echo "UNREACHABLE: this should not print"

Proposed fix

 pass() { ((PASS++)); ((TOTAL++)); echo -e "${GREEN}  PASS${NC} $1" | tee -a "$LOG_FILE"; }
 fail() { ((FAIL++)); ((TOTAL++)); echo -e "${RED}  FAIL${NC} $1 — $2" | tee -a "$LOG_FILE"; }
 skip() { ((SKIP++)); ((TOTAL++)); echo -e "${YELLOW}  SKIP${NC} $1 — $2" | tee -a "$LOG_FILE"; }
+pass() { ((PASS+=1)); ((TOTAL+=1)); echo -e "${GREEN}  PASS${NC} $1" | tee -a "$LOG_FILE"; }
+fail() { ((FAIL+=1)); ((TOTAL+=1)); echo -e "${RED}  FAIL${NC} $1 — $2" | tee -a "$LOG_FILE"; }
+skip() { ((SKIP+=1)); ((TOTAL+=1)); echo -e "${YELLOW}  SKIP${NC} $1 — $2" | tee -a "$LOG_FILE"; }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 62 - 64, The post-increment
arithmetic inside the pass(), fail(), and skip() functions (symbols: pass, fail,
skip, and variables PASS, FAIL, SKIP, TOTAL) can trigger a non-zero exit under
set -e; replace the in-command ((VAR++)) usage with safe arithmetic assignments
such as PASS=$((PASS+1)) and TOTAL=$((TOTAL+1)) (and respectively for FAIL and
SKIP) so increments never produce a failing exit code, keeping the existing
echo/tee behavior unchanged.

🧹 Nitpick comments (1)

test/e2e/test-sandbox-operations.sh (1)
559-565: Teardown runs twice (explicit call + EXIT trap).

Line 559 calls teardown, and Line 565 also registers trap teardown EXIT. This is safe but redundant; keeping only one path will reduce duplicate cleanup logs and extra CLI calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 559 - 565, The script calls
teardown explicitly and also registers trap teardown EXIT, causing duplicate
cleanup; remove the explicit teardown invocation and rely on the EXIT trap
(leave the trap teardown EXIT and the teardown function intact) so cleanup runs
only once. Reference: the teardown function and the trap teardown EXIT
registration.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 345-347: The destroy pipeline can cause the script to exit under
set -euo pipefail and skip test result recording; wrap the nemoclaw "$target"
destroy --yes | tee -a "$LOG_FILE" invocation in a temporary errexit-safe block:
disable exiting on failure (or run in a subshell), capture the command's exit
status (use PIPESTATUS[0] for the left side of the pipe) into a variable, then
re-enable strict mode and call fail() when that exit status is non-zero or
pass() when zero; reference the nemoclaw command, the LOG_FILE, and the
fail()/pass() result handlers when making the change.

---

Duplicate comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 62-64: The post-increment arithmetic inside the pass(), fail(),
and skip() functions (symbols: pass, fail, skip, and variables PASS, FAIL, SKIP,
TOTAL) can trigger a non-zero exit under set -e; replace the in-command
((VAR++)) usage with safe arithmetic assignments such as PASS=$((PASS+1)) and
TOTAL=$((TOTAL+1)) (and respectively for FAIL and SKIP) so increments never
produce a failing exit code, keeping the existing echo/tee behavior unchanged.

---

Nitpick comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 559-565: The script calls teardown explicitly and also registers
trap teardown EXIT, causing duplicate cleanup; remove the explicit teardown
invocation and rely on the EXIT trap (leave the trap teardown EXIT and the
teardown function intact) so cleanup runs only once. Reference: the teardown
function and the trap teardown EXIT registration.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9501b01d-b1c4-4fbe-b468-47df15ba7054

📥 Commits

Reviewing files that changed from the base of the PR and between 2619f28 and 3c52774.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

test/e2e/test-sandbox-operations.sh

coderabbitai

♻️ Duplicate comments (1)

test/e2e/test-sandbox-operations.sh (1)
252-259: ⚠️ Potential issue | 🟡 Minor

Test expects "OpenClaw" field but status command doesn't output it.

The nemoclaw status command outputs Sandbox:, Model:, Provider:, Inference:, GPU:, and Policies: fields. The string "OpenClaw" only appears in error/recovery messages (e.g., "Recovered NemoClaw gateway runtime") and is not present in normal status output, causing this test to fail.

Replace "OpenClaw" with "GPU", which is always present:
Proposed fix
-  for field in "Sandbox" "Model" "Provider" "OpenClaw"; do
+  for field in "Sandbox" "Model" "Provider" "GPU"; do
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 252 - 259, The test loop in
test-sandbox-operations.sh checks for fields "Sandbox", "Model", "Provider", and
"OpenClaw" but the status output never includes "OpenClaw"; change the for-list
in the for field in "Sandbox" "Model" "Provider" "OpenClaw"; do ... to use "GPU"
instead of "OpenClaw" so it checks against a field that is always present (i.e.,
update the string literal "OpenClaw" to "GPU" in that for-loop).

🧹 Nitpick comments (1)

test/e2e/test-sandbox-operations.sh (1)
586-593: Teardown runs twice: once explicitly and once via EXIT trap.

Line 586 calls teardown explicitly, then summary calls exit, which triggers the EXIT trap at line 592, running teardown again. While the second run is idempotent (sandboxes already destroyed), it produces duplicate log output and unnecessary nemoclaw list calls.

Consider adding a guard variable or removing the explicit teardown call since the trap handles it.
Option 1: Remove explicit call (preferred)
   # Phase 5: Cleanup verification (destroys sandbox B)
   test_sbx_05_destroy_cleanup "$SANDBOX_B"

-  # Teardown everything
-  teardown
-
   # Report
   summary
 }
Option 2: Add guard variable
+TEARDOWN_DONE=0
+
 teardown() {
+  if [[ $TEARDOWN_DONE -eq 1 ]]; then
+    return
+  fi
+  TEARDOWN_DONE=1
   # Disable errexit during teardown — cleanup must be best-effort
   set +e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 586 - 593, The teardown
function is being invoked twice (explicit call before summary and again via the
trap on EXIT); remove the explicit call to teardown that appears just before
calling summary in main so the EXIT trap (trap teardown EXIT) is the single
place that runs cleanup, or alternatively implement a guard (e.g., a
TEARDOWN_DONE flag checked/set inside teardown and respected by the EXIT trap)
to make teardown idempotent; locate and update the teardown invocation and the
trap/summary/main flow (symbols: teardown, summary, trap EXIT, main)
accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 252-259: The test loop in test-sandbox-operations.sh checks for
fields "Sandbox", "Model", "Provider", and "OpenClaw" but the status output
never includes "OpenClaw"; change the for-list in the for field in "Sandbox"
"Model" "Provider" "OpenClaw"; do ... to use "GPU" instead of "OpenClaw" so it
checks against a field that is always present (i.e., update the string literal
"OpenClaw" to "GPU" in that for-loop).

---

Nitpick comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 586-593: The teardown function is being invoked twice (explicit
call before summary and again via the trap on EXIT); remove the explicit call to
teardown that appears just before calling summary in main so the EXIT trap (trap
teardown EXIT) is the single place that runs cleanup, or alternatively implement
a guard (e.g., a TEARDOWN_DONE flag checked/set inside teardown and respected by
the EXIT trap) to make teardown idempotent; locate and update the teardown
invocation and the trap/summary/main flow (symbols: teardown, summary, trap
EXIT, main) accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ce77f38f-b934-473e-bcc3-bfa3f68d9694

📥 Commits

Reviewing files that changed from the base of the PR and between 3c52774 and 05d263d.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 339-345: The test currently swallows the exit code of the
`nemoclaw "$SANDBOX_A" status` call by appending `|| true` and then treats any
non-empty output as success (affecting TC-SBX-06 and TC-SBX-08); change the
`status_output=$($TIMEOUT_CMD 120 nemoclaw "$SANDBOX_A" status 2>&1) || true`
pattern so the command's exit status is preserved (capture both the output and
the exit code, e.g., assign `status_output` and then store `$?` into a
variable), then only treat the test as passing when the output matches an
explicit recovery/health token (the grep for "recover|running|healthy|OpenClaw")
and the exit code indicates success; apply the same fix to the similar block
around the other occurrence (lines 410-423) so fatal CLI errors no longer count
as passes.
- Around line 134-149: The onboard_sandbox flow currently ignores a non-zero
exit from the nemoclaw onboard command and proceeds to succeed if the sandbox
shows up later; update the function to fail fast when onboard_exit is non-zero
by returning that exit code (or a non-zero value) instead of continuing: after
running the nemoclaw onboard block that sets onboard_exit, if [[ $onboard_exit
-ne 0 ]]; then log the error (use the existing log message) and return
$onboard_exit (or 1) immediately; keep the subsequent check that verifies
nemoclaw list but only run it when onboard_exit is zero so functions like
onboard_sandbox and the test suite cannot proceed on a partial/failed onboard.
- Around line 484-490: The test currently treats STATUS_404 as equivalent to
blocked, which is too permissive; update the conditional that inspects the probe
output (the grep on "$probe_a" used in the TC-SBX-11 assertion) to only consider
STATUS_403, ERROR, or TIMEOUT as a pass (remove STATUS_404 from the regex), and
make the analogous change for the symmetric check that examines "$probe_b" (the
identical block around lines 504-510) so both isolation checks only accept
STATUS_403/ERROR/TIMEOUT as evidence of blocking.
- Around line 367-379: The test TC-SBX-05 currently only verifies "nemoclaw
list" after running "nemoclaw $target destroy"; update the test to also assert
that the underlying runtime artifacts are gone by checking the container runtime
(e.g., run "docker ps -a" / "podman ps -a" or use "docker container ls --filter
name=$target") for containers/images/networks matching $target and fail if any
are found; add this runtime check after the existing nemoclaw list check (use
the same $target variable and destroy_exit handling) and report a clear failure
message like "Still running container for $target" so the test covers both
registry and Docker cleanup.
- Around line 273-291: The test treats any non-empty stderr as success and
doesn't verify the exit status or that the follower was running; update the
first logs invocation to capture and assert the exit code (e.g., run
output=$($TIMEOUT_CMD 10 nemoclaw "$SANDBOX_A" logs 2>&1); rc=$?; require rc==0
and non-empty output) and for the --follow branch check that the background pid
($pid) is alive before killing it (e.g., test ps -p "$pid" &>/dev/null and fail
if not), then kill and wait as currently done; reference the variables/commands
output, $TIMEOUT_CMD, nemoclaw "$SANDBOX_A" logs, --follow, pid to locate and
patch the checks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d2cfa6a-6387-46e7-91b6-8256e9df4821

📥 Commits

Reviewing files that changed from the base of the PR and between 05d263d and 22738ac.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

test/e2e/test-sandbox-operations.sh

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (3)

test/e2e/test-sandbox-operations.sh (3)
274-292: ⚠️ Potential issue | 🟠 Major

TC-SBX-04 can still pass when logs never works.

The first branch ignores the exit code from nemoclaw ... logs, so any error text counts as success. The --follow branch only proves the PID is gone after kill; if the follower has already exited before the sleep, this still goes green. Require a zero exit code for the one-shot logs call and fail if the background follower is already dead before the kill step.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 274 - 292, The test
currently treats any non-empty stderr/stdout from the one-shot logs call as
success and doesn't check its exit code; modify the block using
output=$($TIMEOUT_CMD 10 nemoclaw "$SANDBOX_A" logs 2>&1) || true to instead
capture the exit code and require it to be zero (use the actual command exit
status, e.g., run without forcing success and assert $? == 0) so TC-SBX-04 fails
on errors, and for the follow branch ensure the background follower is alive
before the kill by checking the PID (ps -p "$pid") immediately after sleep and
failing if that check shows the process already exited, only then proceed to
kill/wait and assert the process is gone.
332-345: ⚠️ Potential issue | 🟠 Major

TC-SBX-08 does not prove the recovery path was exercised.

The fault-injection step is entirely best-effort, and the subsequent status probe also drops its exit code. If no gateway PID was actually terminated, this case is just observing a healthy sandbox and can still pass. Capture whether a process was killed, then only accept recovery output from a zero-exit nemoclaw status.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 332 - 345, The test
currently runs a best-effort kill via sandbox_exec (the "pkill -9 -f 'openclaw
gateway' || kill -9 $(pgrep -f 'openclaw gateway') || ..." pipeline) but
discards whether any process was actually terminated and likewise ignores the
exit code of the nemoclaw status probe; change the logic so you first capture
the exit status of the sandbox_exec kill pipeline (e.g., store its exit code or
a boolean like "killed_any") and only proceed to validate recovery if a kill
actually occurred, then run the status probe using $TIMEOUT_CMD 120 nemoclaw
"$SANDBOX_A" status and require its exit code to be zero before accepting the
recovery messages in status_output (still inspect status_output for
"recover|running|healthy|OpenClaw"); update the TC-SBX-08 pass/fail decision to
fail when no process was killed or when status exits non-zero even if output
matches.
403-425: ⚠️ Potential issue | 🟠 Major

TC-SBX-06 can pass without an actual gateway outage.

docker kill is best-effort, and a still-running gateway only emits a warning before the test matches recovery text from status. If the container survives or restarts immediately, this block is exercising normal behavior, not recovery. Gate the assertion on an observed down/removed state and require nemoclaw status to exit 0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` around lines 403 - 425, The test
currently treats any matching text in status_output as a recovery even if the
gateway never actually went down; modify the logic so you first gate on the
observed container_state (from container_state=$(docker inspect -f
'{{.State.Running}}' "$container" ...)) and only perform the recovery assertion
when container_state is "false" or "removed"; if the container_state is still
"true" then log a warning and skip the recovery assertion. Also require that the
nemoclaw status invocation (status_output=$($TIMEOUT_CMD 300 nemoclaw
"$SANDBOX_A" status 2>&1) || true) actually exits 0 before checking for recovery
strings — use the command exit code rather than accepting any output — and only
pass "TC-SBX-06" when both the container is down/removed and the nemoclaw status
command exited 0 and contains the expected recovery/health text.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 451-462: TC-SBX-10 currently only checks that SANDBOX_A and
SANDBOX_B names appear in the output of `nemoclaw list` (using variables
found_a/found_b) but does not validate sandbox metadata; update the test so
after confirming presence it calls the command that returns sandbox details
(e.g., `nemoclaw show` or equivalent inspect command) for each sandbox and
parses/validates the expected metadata fields (model, provider, version or other
required keys) for SANDBOX_A and SANDBOX_B, and only call pass "TC-SBX-10: Both
sandboxes visible in nemoclaw list" if both presence and metadata checks succeed
(otherwise call fail with clear details which sandbox and which metadata field
mismatched).
- Around line 476-492: The check treating empty probe output as a successful
isolation is wrong because sandbox_exec_for is wrapped with "|| true" and can
return an empty string on SSH/infrastructure failure; remove the "|| true"
wrapper or capture sandbox_exec_for's exit status and the probe output (probe_a)
and explicitly fail when exit != 0 or probe_a is empty to surface infra errors,
then only treat outputs that contain STATUS_403, ERROR, or TIMEOUT as PASS for
isolation (leave the existing grep checks for STATUS_403/ERROR/TIMEOUT and the
STATUS_[0-9]+ failure branch unchanged, but add an initial guard that fails with
an infrastructure/SSH error if probe_a is empty or the command exit code
indicates failure).
- Around line 368-386: The test currently logs non-zero exits from the nemoclaw
destroy command but does not mark the test as failed; update the block that
checks destroy_exit to call fail(...) when destroy_exit != 0 (use the same test
id/context "TC-SBX-05: Destroy ($target)" and include the exit code in the
failure message) instead of only logging, and keep the subsequent nemoclaw list
and openshell sandbox list checks for diagnostic validation; reference the
destroy_exit variable, the nemoclaw destroy invocation, and the fail()/pass()
functions to locate where to add the explicit fail call.

---

Duplicate comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Around line 274-292: The test currently treats any non-empty stderr/stdout
from the one-shot logs call as success and doesn't check its exit code; modify
the block using output=$($TIMEOUT_CMD 10 nemoclaw "$SANDBOX_A" logs 2>&1) ||
true to instead capture the exit code and require it to be zero (use the actual
command exit status, e.g., run without forcing success and assert $? == 0) so
TC-SBX-04 fails on errors, and for the follow branch ensure the background
follower is alive before the kill by checking the PID (ps -p "$pid") immediately
after sleep and failing if that check shows the process already exited, only
then proceed to kill/wait and assert the process is gone.
- Around line 332-345: The test currently runs a best-effort kill via
sandbox_exec (the "pkill -9 -f 'openclaw gateway' || kill -9 $(pgrep -f
'openclaw gateway') || ..." pipeline) but discards whether any process was
actually terminated and likewise ignores the exit code of the nemoclaw status
probe; change the logic so you first capture the exit status of the sandbox_exec
kill pipeline (e.g., store its exit code or a boolean like "killed_any") and
only proceed to validate recovery if a kill actually occurred, then run the
status probe using $TIMEOUT_CMD 120 nemoclaw "$SANDBOX_A" status and require its
exit code to be zero before accepting the recovery messages in status_output
(still inspect status_output for "recover|running|healthy|OpenClaw"); update the
TC-SBX-08 pass/fail decision to fail when no process was killed or when status
exits non-zero even if output matches.
- Around line 403-425: The test currently treats any matching text in
status_output as a recovery even if the gateway never actually went down; modify
the logic so you first gate on the observed container_state (from
container_state=$(docker inspect -f '{{.State.Running}}' "$container" ...)) and
only perform the recovery assertion when container_state is "false" or
"removed"; if the container_state is still "true" then log a warning and skip
the recovery assertion. Also require that the nemoclaw status invocation
(status_output=$($TIMEOUT_CMD 300 nemoclaw "$SANDBOX_A" status 2>&1) || true)
actually exits 0 before checking for recovery strings — use the command exit
code rather than accepting any output — and only pass "TC-SBX-06" when both the
container is down/removed and the nemoclaw status command exited 0 and contains
the expected recovery/health text.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9dd4d120-4362-4325-8b2f-e6cfd9064d10

📥 Commits

Reviewing files that changed from the base of the PR and between 22738ac and 83135c7.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

test/e2e/test-sandbox-operations.sh

thaihuynhxyz · 2026-04-14T10:40:55Z

test/e2e/test-sandbox-operations.sh

Could we add at least a basic CI job in this PR that runs the script?

Thanks, I have updated with the integration

…n suite Add test/e2e/test-sandbox-operations.sh covering 10 test cases (14 assertions) for sandbox operation features that currently lack E2E coverage: - TC-SBX-01: nemoclaw list shows onboarded sandbox - TC-SBX-02: SSH + openclaw agent chat returns expected response - TC-SBX-03: nemoclaw status output contains expected fields - TC-SBX-04: nemoclaw logs produces output; --follow exits cleanly - TC-SBX-05: nemoclaw destroy removes sandbox from registry and Docker - TC-SBX-06: docker kill gateway container; nemoclaw status handles gracefully; re-onboard succeeds - TC-SBX-07: delete sandboxes.json; nemoclaw list rebuilds from live OpenShell state - TC-SBX-08: kill OpenClaw gateway process inside sandbox; nemoclaw status detects and recovers - TC-SBX-10: two sandboxes onboarded; nemoclaw list shows both - TC-SBX-11: cross-sandbox HTTP probe returns 403 (proxy blocks) Follows conventions from test-sandbox-survival.sh: SSH via openshell sandbox ssh-config, node for in-sandbox probes, gtimeout/timeout portability. Validated locally: 14/14 PASS on macOS Apple Silicon, Colima, OpenShell 0.0.26. Signed-off-by: Truong Nguyen <[email protected]> Made-with: Cursor

ericksoa

Approving — this is a well-structured E2E test suite with good phase ordering, robust helpers, and sensible assertions. The earlier CodeRabbit issues have all been addressed (safe arithmetic, exit code checks, metadata validation, tightened isolation probes).

A few minor items for a follow-up PR:

TC-SBX-06: gate on status_exit — Unlike TC-SBX-08 which checks status_exit before asserting on output, TC-SBX-06 skips that check. If the status command fails with an error message containing "gateway", it passes. At minimum, log a warning when status_exit != 0.
TC-SBX-08: comment the kill fallback chain — The three-strategy kill command is correct but hard to read. A brief comment explaining why the fallbacks exist (different sandbox environments) would help.
TC-SBX-11: document port 18789 — No comment explaining what service is expected on this port.
TC-SBX-10: note the metadata parsing assumption — grep -A1 "$sb" | tail -1 assumes metadata is on the next line. A comment would help if nemoclaw list format ever changes.

None of these block merge. Nice work on the coverage. 👍

ericksoa · 2026-04-14T13:52:37Z

Note for future PRs: please make sure your commits are signed with verified signatures (GPG or SSH signing). Not blocking this one, but we'll need that going forward.

Will merge once CI passes. 👍

ShevaNguyen · 2026-04-14T13:54:29Z

Note for future PRs: please make sure your commits are signed with verified signatures (GPG or SSH signing). Not blocking this one, but we'll need that going forward.

Will merge once CI passes. 👍

Thank you, @ericksoa
I will pay attention to this next time.