E2E test cases - coverage for existing features with no runtime verification #1611

### Problem Statement

NemoClaw has several implemented features — gateway auto-recovery, onboard resume after hard kill, stale state cleanup, sandbox process recovery, inference error classification, and sandbox data preservation on reuse — that have no E2E test proving they work in a real environment.

Users rely on these paths during the most stressful moments: their bot stopped responding, their laptop died mid-setup, or they are trying to recover a broken state. If a regression lands in any of these code paths, there is no automated signal — it only surfaces when a real user hits it.

TC-01 — connect/status auto-restarts dead gateway
- Real-life situation: A user's Telegram/Slack bot stops responding. They open a terminal and run `nemoclaw status` or `nemoclaw connect` to diagnose. The CLI is expected to detect the gateway is down, automatically restart it, and return a healthy status — all without manual intervention.
- Why we need this: The `status` path has partial coverage; the `connect` path has zero. Without this test, a regression on the auto-restart logic would silently break the most common recovery workflow, leaving users with a dead agent and no path forward.

TC-02 — SIGKILL mid-onboard → stale lock → resume
- Real-life situation: A user is setting up NemoClaw for the first time. Their laptop battery dies or they accidentally close the terminal halfway through onboarding. The next time they run `nemoclaw onboard`, it is expected to detect the stale lock file from the dead process, clean it up, and resume from the last completed step — no data loss, no manual cleanup needed.
- Why we need this: The resume feature is tested only with a clean, controlled failure. A real hard kill leaves a different kind of stale state. Without this test, users who hit this scenario may get stuck with a corrupted lock file and no clear path to recover.

TC-04 — Stale gateway cleanup in preflight
- Real-life situation: Docker updated overnight and wiped the gateway container. The user opens their laptop and sends a Telegram message — no reply. They run `nemoclaw status`, which triggers auto-recovery. If auto-recovery succeeds, they are back to normal with no further action needed. If it fails (container and volumes too corrupted for `openshell gateway start` to recover), the CLI prints guidance directing them to run `nemoclaw onboard` as a last resort. Only at that point does the user re-run onboard — not as a first instinct, but as a deliberate recovery step after auto-recovery could not fix things.
- Why we need this: TC-04 is the severity-2 companion to TC-01. TC-01 covers a stopped gateway (container exists but not running — auto-recovery usually works). TC-04 covers a deleted gateway container (no container, stale metadata — auto-recovery fails, onboard is the fallback). The preflight cleanup logic that detects the ghost state and resets it cleanly has no test. Without coverage, a regression here would leave users stranded at the exact moment they are following the CLI's own guidance to recover.


TC-05 — Sandbox process recovery on connect/status
- Real-life situation: The user's machine had a memory spike or high load. The Kubernetes pod inside the sandbox restarted and the OpenClaw gateway process inside it died. When the user runs `nemoclaw connect` to resume work, the CLI is expected to detect the dead process, restart it automatically, and drop the user into a working shell.
- Why we need this: This is newly written code with zero E2E coverage. New recovery paths with no tests are the most likely place for regressions to appear undetected, especially on a flow users depend on to resume work after an unexpected interruption.

TC-06 — Inference validation error → classified error message
- Real-life situation: A user pastes a wrong API key during onboard, or they are on a corporate network where the NVIDIA endpoint is blocked. The wizard is expected to detect the failure, classify it (wrong key vs. unreachable endpoint vs. quota exceeded), and show a clear actionable error message — not a raw stack trace.
- Why we need this: Without classified error handling, users cannot tell whether the problem is their key, their network, or a NemoClaw bug. This test also catches the silent `ANTHROPIC_API_KEY` env var override that currently reroutes traffic to Anthropic without any warning.

TC-07 — Double onboard "reuse" preserves sandbox data

- Real-life situation: A user re-runs `nemoclaw onboard` to change their inference provider or rotate their API key. The wizard asks "Reuse existing sandbox?" They confirm yes. Their agent's memory, workspace files, and scheduled tasks are expected to remain completely untouched — only the inference config changes.
- Why we need this: If the reuse path silently wipes sandbox data, the user loses all accumulated agent context — conversations, saved files, tasks — without any warning. No test currently verifies that data actually survives this path, making it an invisible regression risk.

### Proposed Design

Add six E2E test scripts under test/e2e/, following the existing baseline → disrupt → verify
skeleton from test-sandbox-survival.sh. No new test infrastructure required.
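
The baseline → disrupt → verify skeleton can be sketched as a small phase runner. This is a minimal illustration, not the contents of test-sandbox-survival.sh: `run_phases` and the three phase function names are hypothetical, and each test script would supply its own phase bodies.

```shell
# Hypothetical phase runner. Each test script is assumed to define three
# functions named baseline, disrupt, and verify before calling run_phases.
run_phases() {
  local phase
  for phase in baseline disrupt verify; do
    echo "=== phase: ${phase} ==="
    # Stop at the first failing phase so later phases never run against
    # a state the earlier phase failed to set up.
    if ! "$phase"; then
      echo "FAIL: ${phase}" >&2
      return 1
    fi
  done
  echo "PASS"
}
```

A script would define `baseline`, `disrupt`, and `verify`, then end with `run_phases || exit 1` so CI sees a non-zero exit on any failed phase.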

─────────────────────────────────────────────────────────────────

TC-01 — connect/status auto-restarts dead gateway
File: test/e2e/test-gateway-auto-restart.sh

Steps:
  1. Onboard normally. Verify inference works.
  2. Kill the gateway process directly from the host:
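       kill -9 $(pgrep -f openshell-gateway)
     (`-f` matches the full command line. Without it, pgrep compares only the process
     name, which Linux truncates to 15 characters, so this 17-character pattern
     would never match.)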
     (Container still exists, process is dead — simulates a crashed gateway.)
  3. Run `nemoclaw <name> status`.
  4. Run `nemoclaw <name> connect`.

Pass criteria:
  - Both commands detect the dead gateway and restart it automatically.
  - No manual intervention required.
  - `nemoclaw <name> status` returns healthy after recovery.
  - `nemoclaw <name> connect` drops into a working shell.
  - Inference returns a response after recovery.
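
If the restart takes a moment to settle, the verify phase may need to poll rather than check once. The sketch below is a generic polling helper. `wait_until` is a hypothetical name, and the usage comment assumes that `nemoclaw <name> status` exits non-zero while the gateway is unhealthy, which this proposal does not state.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the timeout
# elapses. Returns 0 on success and 1 on timeout.
wait_until() {
  local timeout="$1"
  shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Assumed usage (SANDBOX_NAME and the 120-second budget are placeholders):
#   wait_until 120 nemoclaw "$SANDBOX_NAME" status || { echo "gateway did not recover"; exit 1; }
```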

─────────────────────────────────────────────────────────────────

TC-02 — SIGKILL mid-onboard → stale lock → resume
File: test/e2e/test-onboard-sigkill-resume.sh

Steps:
  1. Start `nemoclaw onboard --non-interactive` in the background, capturing its
     output (e.g. to onboard.log) and its PID:
       nemoclaw onboard --non-interactive > onboard.log 2>&1 &
       ONBOARD_PID=$!
  2. Watch the captured output and wait until the gateway setup step completes
     (look for "gateway select nemoclaw" in output).
  3. Hard-kill the onboard process:
       kill -9 "$ONBOARD_PID"
  4. Run `nemoclaw onboard --non-interactive` again.
     Let the auto-detection handle it.

Pass criteria:
  - Output contains "Found an interrupted onboarding session — resuming it."
  - Output shows the gateway step was skipped (not re-executed).
  - Exit code is 0.
  - No orphaned Docker containers
    (docker ps -a shows no dangling openclaw-* containers).
  - ~/.nemoclaw/onboard-session.json is valid JSON with status: "complete".
  - Inference returns a response after resume completes.
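
Steps 2 and 3 hinge on timing: the kill has to land after the gateway step but before onboarding finishes. Below is a sketch of two helpers for that, assuming onboard output is redirected to a log file and that the marker string from step 2 appears verbatim in it. `wait_for_line` and `kill_hard` are hypothetical names, and the `jq` check in the usage comment assumes `jq` is installed on the runner.

```shell
# wait_for_line FILE TEXT TIMEOUT_SECONDS
# Polls FILE until it contains the fixed string TEXT, or gives up on timeout.
wait_for_line() {
  local file="$1" text="$2" timeout="$3" waited=0
  until grep -qF -- "$text" "$file" 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# kill_hard PID
# Sends SIGKILL to a child of this shell, reaps it, and confirms it is gone.
kill_hard() {
  kill -9 "$1" 2>/dev/null
  wait "$1" 2>/dev/null
  ! kill -0 "$1" 2>/dev/null
}

# Assumed usage (LOG and the 600-second budget are placeholders):
#   nemoclaw onboard --non-interactive > "$LOG" 2>&1 &
#   ONBOARD_PID=$!
#   wait_for_line "$LOG" "gateway select nemoclaw" 600 && kill_hard "$ONBOARD_PID"
#   ...re-run onboard, then check the session file:
#   jq -e '.status == "complete"' ~/.nemoclaw/onboard-session.json
```

Because SIGKILL cannot be trapped, the onboard process gets no chance to remove its lock, which is the stale state this test wants to reproduce.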

─────────────────────────────────────────────────────────────────

TC-04 — Stale gateway cleanup in preflight
File: test/e2e/test-stale-gateway-cleanup.sh

This test simulates the severity-2 disruption: the gateway container was deleted entirely
(not just stopped), leaving stale metadata behind. It follows the real user sequence —
status first, onboard only as a last resort after auto-recovery fails.

Steps:
  1. Onboard normally. Verify inference works.
  2. Force-delete the gateway container from the host:
       docker rm -f openshell-cluster-nemoclaw
     (Leave ~/.nemoclaw metadata intact — this creates the ghost state.)
  3. Run `nemoclaw <name> status`.
     - If status succeeds (auto-recovery handled it) → log the run as a TC-01 variant and skip steps 4 and 5.
     - If status fails with guidance to re-run onboard → proceed to step 4.
  4. Run `nemoclaw onboard` following the CLI's own recovery guidance.
  5. Verify preflight cleaned up the stale state and onboard completes.

Pass criteria:
  - No manual cleanup required by the user at any step.
  - If auto-recovery in step 3 fails, onboard in step 4 completes cleanly
    with no conflicting state errors or duplicate container entries.
  - Inference returns a response after recovery.

Relationship to TC-01:
  TC-01 covers a stopped gateway (container exists, not running — auto-recovery path).
  TC-04 covers a deleted gateway container (no container, stale metadata — onboard fallback).
  Kept as separate tests to preserve distinct CI failure signals for each code path.
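
The branch in step 3 can be made explicit so the test log records which path was taken. A sketch, where `choose_recovery_path` is a hypothetical helper and matching on the literal text "nemoclaw onboard" is an assumption, since the exact wording of the CLI guidance is not given here.

```shell
# choose_recovery_path STATUS_EXIT_CODE STATUS_OUTPUT
# Prints which branch of step 3 applies:
#   auto-recovered    status succeeded, no onboard needed
#   onboard-fallback  status failed and its output points the user at onboard
#   unexpected        status failed without any recovery guidance
choose_recovery_path() {
  local code="$1" output="$2"
  if [ "$code" -eq 0 ]; then
    echo "auto-recovered"
  elif printf '%s\n' "$output" | grep -qF "nemoclaw onboard"; then
    echo "onboard-fallback"
  else
    echo "unexpected"
  fi
}

# Assumed usage (SANDBOX_NAME is a placeholder):
#   output=$(nemoclaw "$SANDBOX_NAME" status 2>&1); code=$?
#   path=$(choose_recovery_path "$code" "$output")
```

Treating `unexpected` as a test failure would also catch the case where status fails without telling the user what to do next.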

─────────────────────────────────────────────────────────────────

TC-05 — Sandbox process recovery on connect/status
File: test/e2e/test-sandbox-process-recovery.sh

Steps:
  1. Onboard normally. Verify inference works.
  2. Connect into the sandbox and kill the OpenClaw gateway process inside the pod:
       nemoclaw <name> connect
       kill -9 $(pgrep -f "openclaw gateway")
       exit
  3. From the host, run `nemoclaw <name> status`.
  4. From the host, run `nemoclaw <name> connect`.

Pass criteria:
  - CLI detects the dead process inside the pod.
  - Process is restarted automatically — no manual intervention.
  - `nemoclaw <name> status` returns healthy.
  - `nemoclaw <name> connect` drops into a working shell.
  - Inference returns a response after recovery.
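
Since this test has several independent pass criteria, it may help to evaluate all of them and report each one rather than abort at the first failure. A minimal sketch of such an accumulator follows. `check` and `FAILURES` are hypothetical names, and the usage comment assumes the exit code of `nemoclaw <name> status` reflects health.

```shell
FAILURES=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs COMMAND, prints PASS or FAIL with the description, and counts failures
# instead of aborting, so one failed criterion does not hide the others.
check() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: ${description}"
  else
    echo "FAIL: ${description}"
    FAILURES=$((FAILURES + 1))
  fi
}

# Assumed usage (SANDBOX_NAME is a placeholder):
#   check "status returns healthy" nemoclaw "$SANDBOX_NAME" status
#   [ "$FAILURES" -eq 0 ] || exit 1
```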

─────────────────────────────────────────────────────────────────

TC-06 — Inference validation error → classified error message
File: test/e2e/test-inference-error-classification.sh
Runner: PR-safe (no real API key needed)

Steps:
  1. Run `nemoclaw onboard` in non-interactive mode with an intentionally invalid API key:
       NVIDIA_API_KEY=invalid-key-for-testing nemoclaw onboard --non-interactive
  2. Capture stdout and stderr.
  3. Repeat with the NVIDIA endpoint replaced by an unreachable URL
     to simulate a blocked corporate network.

Pass criteria:
  - Exit code is non-zero in both cases.
  - Output contains a human-readable classified error message
    (e.g. "Invalid API key" or "Endpoint unreachable") — not a raw stack trace.
  - If ANTHROPIC_API_KEY is set in the shell environment, output contains a warning
    that it was detected — traffic is not silently rerouted to Anthropic.
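
The "classified message, not a raw stack trace" criterion can be checked mechanically. The sketch below is one possible assertion. `looks_classified` is a hypothetical helper, the message fragments come from the examples in the pass criteria rather than from the real CLI output, and the stack-trace patterns (Node.js-style `at ... file:line:col` frames and a Python traceback header) are assumptions about what a raw trace would look like.

```shell
# looks_classified OUTPUT
# Succeeds only if OUTPUT names a known error class AND shows no stack frames.
looks_classified() {
  local output="$1"
  # Assumed message fragments, taken from the examples in the pass criteria.
  printf '%s\n' "$output" \
    | grep -qiE 'invalid api key|endpoint unreachable' || return 1
  # Assumed raw-trace shapes: indented "at fn (file:line:col)" frames or a
  # Python "Traceback (most recent call last)" header.
  if printf '%s\n' "$output" \
    | grep -qE '^[[:space:]]+at .+:[0-9]+:[0-9]+\)?$|^Traceback \(most recent call last\)'; then
    return 1
  fi
  return 0
}

# Assumed usage:
#   output=$(NVIDIA_API_KEY=invalid-key-for-testing nemoclaw onboard --non-interactive 2>&1); code=$?
#   [ "$code" -ne 0 ] && looks_classified "$output"
```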

─────────────────────────────────────────────────────────────────

TC-07 — Double onboard "reuse" preserves sandbox data
File: test/e2e/test-double-onboard-reuse.sh

Steps:
  1. Onboard normally.
  2. Write a marker file inside the sandbox:
       nemoclaw <name> connect
       echo "marker" > /sandbox/marker.txt
       exit
  3. Re-run `nemoclaw onboard`. When prompted "Reuse existing sandbox?", select yes.
     Change only the inference provider (e.g. switch from NVIDIA Cloud to Ollama).
  4. Connect and verify:
       nemoclaw <name> connect
       cat /sandbox/marker.txt

Pass criteria:
  - marker.txt is present with its original content.
  - Agent memory directory (~/.openclaw/memory/) is untouched.
  - Only the inference config has changed (openshell inference show reflects new provider).
  - Inference returns a response from the new provider.
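
"Untouched" is easier to assert with a before/after fingerprint than by inspecting single files. A sketch using POSIX `cksum`, where `dir_fingerprint` is a hypothetical helper. How to run it non-interactively inside the sandbox is left open, since the proposal only shows the interactive `connect` flow.

```shell
# dir_fingerprint DIR
# Prints one "checksum size path" line per regular file under DIR, in a
# stable order, so two snapshots can be compared as plain strings.
dir_fingerprint() {
  ( cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort )
}

# Assumed usage, run inside the sandbox before and after the second onboard:
#   before=$(dir_fingerprint ~/.openclaw/memory)
#   ...re-run onboard with "reuse"...
#   after=$(dir_fingerprint ~/.openclaw/memory)
#   [ "$before" = "$after" ] || echo "FAIL: agent memory changed"
```

The fingerprint covers file contents and paths only. If timestamps or permissions also matter, they would need a separate check.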

─────────────────────────────────────────────────────────────────

Suggested CI workflow: .github/workflows/e2e-existing-features.yml

  - TC-06 runs on every PR (no secrets needed).
  - TC-01, TC-02, TC-04, TC-05, TC-07 run nightly with real API keys.
  - Failures auto-create a GitHub issue labeled bug + CI/CD,
    consistent with the existing nightly failure convention.
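
The PR versus nightly split could be driven by one selector that the workflow calls with a mode argument. A sketch, where `tests_for` and the mode names `pr` and `nightly` are hypothetical. The script paths are the ones proposed above.

```shell
# tests_for MODE
# Prints the E2E scripts to run for the given mode, one path per line.
tests_for() {
  case "$1" in
    pr)
      # No secrets needed, so safe to run on every pull request.
      echo "test/e2e/test-inference-error-classification.sh"
      ;;
    nightly)
      # These need real API keys.
      printf '%s\n' \
        "test/e2e/test-gateway-auto-restart.sh" \
        "test/e2e/test-onboard-sigkill-resume.sh" \
        "test/e2e/test-stale-gateway-cleanup.sh" \
        "test/e2e/test-sandbox-process-recovery.sh" \
        "test/e2e/test-double-onboard-reuse.sh"
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

# Assumed usage in a workflow step:
#   for t in $(tests_for nightly); do bash "$t" || exit 1; done
```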

### Alternatives Considered

_No response_

### Category

enhancement: feature

### Checklist

- [x] I searched existing issues and this is not a duplicate
- [x] This is a design proposal, not a "please build this" request
