Fix/jetson gpu detection #1106

Closed
Junior00619 wants to merge 1 commit into NVIDIA:main from Junior00619:fix/jetson-gpu-detection

Conversation


Junior00619 (Contributor) commented Mar 30, 2026

Problem

detectGpu() fails to recognize NVIDIA Jetson AGX Thor and Jetson AGX Orin as valid GPU devices, returning null and blocking sandbox creation on supported hardware.

Root Cause

On unified-memory architectures, nvidia-smi --query-gpu=memory.total returns [N/A] rather than a numeric value. The VRAM query path parses this as NaN, filters it out, and falls through to a name-based fallback. That fallback only matched "GB10" (DGX Spark), so Thor and Orin were silently dropped.

Fix

Broaden the unified-memory fallback to match ["GB10", "Thor", "Orin"] via a declarative tag list. No changes to detection semantics — the same fallback path executes, reading system RAM as the VRAM estimate.

The runCapture call site was aligned with the runner.runCapture() indirection already established in sandbox-add-gpu-agent.js, enabling direct unit test coverage of the fallback without introducing a new mocking pattern.
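The broadened fallback can be sketched roughly as follows. This is an illustrative model, not the actual code in `bin/lib/nim.js`; `UNIFIED_MEMORY_TAGS`, `isUnifiedMemoryGpu`, and `estimateVram` are hypothetical names:

```javascript
// Hypothetical sketch of the unified-memory fallback described above.
const UNIFIED_MEMORY_TAGS = ["GB10", "Thor", "Orin"];

function isUnifiedMemoryGpu(gpuName) {
  return UNIFIED_MEMORY_TAGS.some((tag) => gpuName.includes(tag));
}

function estimateVram(gpuName, queriedVramMb, systemRamMb) {
  // Normal path: nvidia-smi returned a numeric memory.total value.
  if (Number.isFinite(queriedVramMb)) return queriedVramMb;
  // Unified-memory path: memory.total was "[N/A]" and parsed as NaN,
  // so fall back to system RAM for recognized device families.
  if (isUnifiedMemoryGpu(gpuName)) return systemRamMb;
  return null; // Not a recognized GPU.
}

console.log(estimateVram("NVIDIA Jetson AGX Thor", NaN, 65536));  // 65536
console.log(estimateVram("NVIDIA GeForce RTX 4090", 24564, 65536)); // 24564
console.log(estimateVram("Unknown Device", NaN, 65536));          // null
```

Under this shape, adding a device family is a one-line change to the tag list, which is what the PR means by "declarative".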

Validation

  • 5 new tests covering all three device tags, a desktop GPU negative case (RTX 4090), and the standard VRAM-queryable early return
  • Full CLI test suite passes with no regressions (pre-existing failures unchanged)
  • All pre-commit hooks pass (ESLint, gitleaks, commitlint, SPDX)

Fixes #300

Summary by CodeRabbit

  • New Features

    • Expanded GPU detection to recognize multiple unified‑memory NVIDIA device families and use system memory as a fallback for better hardware compatibility.
  • Bug Fixes

    • Improved reliability of GPU, memory and runtime health checks so device and container status are detected more consistently.
  • Tests

    • Added comprehensive tests covering unified‑memory GPU detection, fallback behavior, and related edge cases.


coderabbitai bot commented Mar 30, 2026

📝 Walkthrough


The PR updates bin/lib/nim.js to import the runner module namespace and route runCapture calls through runner.runCapture(...). It expands unified-memory GPU detection to match ["GB10","Thor","Orin"] and adds tests in test/nim.test.js (CommonJS-style require, mocks, and teardown) to cover the new fallback behavior and negative cases.

Changes

GPU detection implementation (bin/lib/nim.js)
  Switches to namespace require("./runner") and calls runner.runCapture(...) at probe sites. Generalizes the unified-memory tag check from "GB10" to ["GB10","Thor","Orin"] and preserves VRAM fallback sourcing from system RAM.
Tests and mocks (test/nim.test.js)
  Reworked to use CommonJS createRequire; adds afterEach teardown restoring runner.runCapture and vi.restoreAllMocks(). Adds tests exercising the unified-memory fallback, negative cases, and numeric-VRAM bypass scenarios.
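The save-and-restore discipline the teardown relies on can be sketched without a test framework; `vi.spyOn` plus `vi.restoreAllMocks()` automates what this hypothetical helper does by hand (`withStubbedRunCapture` and the inline `runner` object are illustrative, not the real module):

```javascript
// Illustrative stand-in for the real runner module (bin/lib/runner.js).
const runner = {
  runCapture: (cmd) => { throw new Error(`would really execute: ${cmd}`); },
};

// Stub runner.runCapture for the duration of fn, then restore it,
// mirroring the afterEach teardown described above.
function withStubbedRunCapture(fakeOutput, fn) {
  const original = runner.runCapture;
  runner.runCapture = () => fakeOutput;
  try {
    return fn();
  } finally {
    runner.runCapture = original; // always restore, even if fn throws
  }
}

// Example: simulate nvidia-smi reporting "[N/A]" on a unified-memory device.
const out = withStubbedRunCapture("[N/A]", () => runner.runCapture("nvidia-smi"));
console.log(out); // [N/A]
```

Because all probe sites go through the `runner.runCapture` indirection, one stub covers every `nvidia-smi` invocation in the code under test.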

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I heard a Thor and Orin hum,
Their VRAM hidden, silence dumb.
I tag their names, I call the run,
Now unified memory meets the sun.
Hooray — GPUs found, the rabbit's done! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 29.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title 'Fix/jetson gpu detection' directly refers to the main change: fixing GPU detection for Jetson devices by extending the unified-memory fallback logic.
  • Linked Issues Check: ✅ Passed. The changes fully implement the requirement from issue #300: extending GPU detection to include 'Thor' and 'Orin' tags alongside 'GB10' for unified-memory architectures.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly related to issue #300: updating nim.js call sites to use runner.runCapture indirection and extending the unified-memory GPU name tag matching. No unrelated modifications detected.



coderabbitai bot left a comment


Actionable comments posted: 12

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (11)
nemoclaw-blueprint/policies/presets/docker-proxy.yaml-36-36 (1)

36-36: ⚠️ Potential issue | 🟡 Minor

Duplicate rule: POST /v*/containers/** appears twice.

Line 36 and line 44 both specify the same allow rule. This is redundant and should be removed.

🔧 Proposed fix
           - allow: { method: GET,    path: "/v*/networks/**" }
           - allow: { method: GET,    path: "/v*/volumes/**" }
           - allow: { method: GET,    path: "/v*/events" }
-          - allow: { method: POST,   path: "/v*/containers/**" }
     binaries:

Also applies to: 44-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw-blueprint/policies/presets/docker-proxy.yaml` at line 36, Remove the
duplicated policy rule entry '- allow: { method: POST,   path:
"/v*/containers/**" }' so the POST /v*/containers/** allow rule appears only
once in the docker-proxy preset; locate both identical allow entries and delete
the redundant occurrence, leaving the single canonical rule intact.
bin/nemoclaw.js-602-609 (1)

602-609: ⚠️ Potential issue | 🟡 Minor

Missing input validation for --port argument.

The --port value is passed directly to the shell without validating it's a valid port number. A malicious or accidental non-numeric value could cause unexpected behavior.

🛡️ Proposed fix to add port validation
   if (args.includes("--port")) {
     const idx = args.indexOf("--port");
-    if (args[idx + 1]) envParts.push(`DASHBOARD_PORT=${shellQuote(args[idx + 1])}`);
+    const portVal = args[idx + 1];
+    if (portVal && /^\d+$/.test(portVal) && +portVal > 0 && +portVal <= 65535) {
+      envParts.push(`DASHBOARD_PORT=${shellQuote(portVal)}`);
+    } else if (portVal) {
+      console.error(`  Invalid port: ${portVal} (must be 1-65535)`);
+      process.exit(1);
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 602 - 609, Validate the value passed for --port
before adding it to envParts: in the block that reads args/indexOf("--port") and
pushes DASHBOARD_PORT using shellQuote, parse and verify the next token is a
numeric integer within the valid TCP port range (1–65535); if it’s invalid,
either throw/exit with a clear error or skip adding the env var and log an
error; keep the same shellQuote usage for the safe value when valid. Use the
same symbols: args, idx, envParts, shellQuote and the DASHBOARD_PORT handling to
locate where to add validation.
Dockerfile.base-65-68 (1)

65-68: ⚠️ Potential issue | 🟡 Minor

Pin xvfb, scrot, and x11-utils to exact Debian bookworm versions.

The file header (lines 17–18) states "apt packages — pinned to exact Debian bookworm versions," but xvfb, scrot, and x11-utils at lines 66–68 are installed without version pins. This breaks the stated policy and risks reproducibility issues. Add version pins matching the pattern of other packages in this RUN command:

xvfb=2:21.1.7-3+deb12u11 \
scrot=1.8.1-1 \
x11-utils=7.7+5 \

Also pin docker-ce-cli at line 83 (currently unpinned from the Docker official repository).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.base` around lines 65 - 68, Update the RUN apt-get install line to
pin the virtual framebuffer and X utilities by appending exact Debian bookworm
versions for xvfb, scrot, and x11-utils (use xvfb=2:21.1.7-3+deb12u11,
scrot=1.8.1-1, x11-utils=7.7+5) so they match the existing package pinning
pattern, and likewise pin docker-ce-cli to its exact version in the Docker
official repo; modify the package tokens xvfb, scrot, x11-utils, and
docker-ce-cli to include the specified =version strings where they are installed
so the build is reproducible.
.agents/skills/nemoclaw-gpu-sandbox/SKILL.md-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor

Missing SPDX license header.

As per coding guidelines, Markdown files require an HTML comment SPDX header at the top.

📝 Proposed fix
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 ---
 name: nemoclaw-gpu-sandbox
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/nemoclaw-gpu-sandbox/SKILL.md around lines 1 - 4, The
SKILL.md file is missing the required SPDX license HTML comment header; add an
SPDX identifier comment as the first line of
.agents/skills/nemoclaw-gpu-sandbox/SKILL.md (above the YAML frontmatter
starting with ---) using the project's standard header (for example <!--
SPDX-License-Identifier: Apache-2.0 -->) so the file begins with the SPDX HTML
comment followed by the existing frontmatter/name/description block.
Dockerfile.sandbox-ai-1-17 (1)

1-17: ⚠️ Potential issue | 🟡 Minor

Missing SPDX license header.

As per coding guidelines, all source files require an SPDX header. For Dockerfiles, use # comments.

📝 Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
 # NemoClaw AI sandbox image — CUDA-enabled for Python AI/ML workloads.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.sandbox-ai` around lines 1 - 17, Add a Dockerfile SPDX header as a
comment at the very top of Dockerfile.sandbox-ai (before any other comments)
using the format "# SPDX-License-Identifier: Apache-2.0" (or replace Apache-2.0
with the repo's chosen identifier), ensuring the header is a single-line comment
so the Dockerfile passes the project's licensing check.
.agents/skills/nemoclaw-docker-proxy/SKILL.md-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor

Missing SPDX license header.

As per coding guidelines, Markdown files require an HTML comment SPDX header at the top.

📝 Proposed fix
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 ---
 name: nemoclaw-docker-proxy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/nemoclaw-docker-proxy/SKILL.md around lines 1 - 4, Add an
HTML comment SPDX license header to the very top of the SKILL.md document (the
one whose frontmatter has name: nemoclaw-docker-proxy) so the file begins with a
valid SPDX identifier in an HTML comment (for example include
"SPDX-License-Identifier: MIT" inside the comment); ensure the comment is the
first line before the YAML frontmatter and matches the project's chosen SPDX
identifier.
scripts/docker-proxy.js-14-18 (1)

14-18: ⚠️ Potential issue | 🟡 Minor

Comment contradicts the allowlist.

Line 16 states "No exec, no attach" but line 106 allows POST /containers/{id}/attach. The attach endpoint is necessary for docker run -it functionality, so either the comment should be updated or the allowlist reconsidered.

📝 Suggested comment fix
 // SECURITY MODEL:
 //   - Only a hard-coded allowlist of Docker API operations is forwarded.
 //   - Container-create requests are inspected; Privileged mode, host network,
 //     CAP_ADD, and mounts of sensitive host paths are rejected.
-//   - No exec, no attach, no build — agents can run pre-built images only.
+//   - No exec, no build — agents can run pre-built images only.
+//   - Attach is allowed for interactive container sessions (docker run -it).
 //   - No token auth here; OpenShell network policy is gate 1 (sandbox traffic
 //     only), this allowlist is gate 2.

Also applies to: 106-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/docker-proxy.js` around lines 14 - 18, The top-of-file comment in
docker-proxy.js contradicts the allowlist: it says "No exec, no attach, no
build" but the proxy allows POST /containers/{id}/attach; update the comment to
reflect the actual allowlist by removing or revising the "no attach" statement
(or explicitly note that /containers/{id}/attach is permitted for interactive
runs) so the comment matches the implemented allowlist (reference the "No exec,
no attach, no build" comment block and the POST /containers/{id}/attach
endpoint).
scripts/resume.sh-54-57 (1)

54-57: ⚠️ Potential issue | 🟡 Minor

Duplicate example line in usage text.

Line 55 and 56 show identical examples.

📝 Suggested fix
   echo "Examples:"
   echo "  $0 my-assistant"
-  echo "  $0 my-assistant"
+  echo "  DASHBOARD_PORT=8080 $0 my-assistant"
   echo ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/resume.sh` around lines 54 - 57, The usage text prints two identical
example lines; remove the duplicate echo or replace the second line with a
different example so the output shows only unique examples (update the two echo
"  $0 my-assistant" lines in the usage section to a single example or a distinct
second example that uses "$0" appropriately).
scripts/setup-autostart.sh-79-80 (1)

79-80: ⚠️ Potential issue | 🟡 Minor

Non-portable grep -oP may fail on macOS.

The -P (Perl regex) flag is not available in BSD grep (macOS default). Consider using a more portable alternative.

🐛 Portable alternative using sed
-GATEWAY_NAME=$(openshell gateway info 2>&1 | grep -oP 'Gateway:\s+\K\S+' || echo "nemoclaw")
+GATEWAY_NAME=$(openshell gateway info 2>&1 | sed -n 's/.*Gateway:[[:space:]]*\([^[:space:]]*\).*/\1/p' | head -1)
+GATEWAY_NAME="${GATEWAY_NAME:-nemoclaw}"

Or using awk:

GATEWAY_NAME=$(openshell gateway info 2>&1 | awk '/Gateway:/ {print $2; exit}')
GATEWAY_NAME="${GATEWAY_NAME:-nemoclaw}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-autostart.sh` around lines 79 - 80, The current extraction of
GATEWAY_NAME uses non-portable grep -oP which can fail on macOS; replace the
grep -oP usage in the GATEWAY_NAME assignment with a portable command (e.g., awk
or sed) that parses the "Gateway:" line and defaults to "nemoclaw" if empty,
leaving GATEWAY_CTR="openshell-cluster-${GATEWAY_NAME}" unchanged; update the
GATEWAY_NAME assignment (the line that currently calls openshell gateway info |
grep -oP) to use the portable parser and preserve the existing fallback
behavior.
bin/lib/sandbox-init.js-276-292 (1)

276-292: ⚠️ Potential issue | 🟡 Minor

Policy preset failures are silently marked as successful.

Based on context from bin/lib/policies.js, applyPreset returns false when a preset is not found (instead of throwing). The current code only catches exceptions but doesn't check the return value, so a missing preset will show a green checkmark.

🐛 Suggested fix to check return value
-  const policiestoApply = skipGithub
+  const policiesToApply = skipGithub
     ? [...allPolicies]
     : ["github", ...allPolicies];
 
-  if (policiestoApply.length === 0) {
+  if (policiesToApply.length === 0) {
     console.log("    • No policies to apply (--no-github and no --policy flags)");
   } else {
-    for (const preset of [...new Set(policiestoApply)]) {
+    for (const preset of [...new Set(policiesToApply)]) {
       process.stdout.write(`    • ${preset} … `);
       try {
-        policies.applyPreset(sandboxName, preset);
-        console.log("\x1b[32m✓\x1b[0m");
+        const success = policies.applyPreset(sandboxName, preset);
+        if (success) {
+          console.log("\x1b[32m✓\x1b[0m");
+        } else {
+          console.log("\x1b[33m⚠\x1b[0m (skipped)");
+          console.warn(`      Preset '${preset}' not found or invalid`);
+        }
       } catch (err) {
         console.log("\x1b[33m⚠\x1b[0m (skipped)");
         console.warn(`      ${err.message}`);
       }
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/sandbox-init.js` around lines 276 - 292, The loop treating
policies.applyPreset(sandboxName, preset) as successful on non-throwing failures
must be fixed: check the return value of policies.applyPreset (called with
sandboxName and preset) and if it returns false treat it as skipped (print the
yellow "⚠ (skipped)" and console.warn with the error/diagnostic), otherwise
print the green check; keep the existing try/catch to handle thrown errors but
ensure the success branch only runs when applyPreset returns a truthy success
value.
scripts/docker-proxy.js-163-181 (1)

163-181: ⚠️ Potential issue | 🟡 Minor

Mount validation may miss top-level Mounts array.

The Docker API allows mounts to be specified at spec.Mounts (top-level) in addition to spec.HostConfig.Mounts. The current validation only checks HostConfig.Mounts.

🐛 Add check for top-level Mounts
   // Also check Mounts (new-style)
   const mounts = spec.HostConfig?.Mounts || [];
   for (const mount of mounts) {
     if (mount.Type === "bind") {
       const source = mount.Source || "";
       if (source.startsWith("/") && isBlockedMount(source)) {
         return `Mount of host path '${source}' is not allowed`;
       }
     }
   }
 
+  // Also check top-level Mounts (used by some API versions)
+  const topLevelMounts = spec.Mounts || [];
+  for (const mount of topLevelMounts) {
+    if (mount.Type === "bind") {
+      const source = mount.Source || "";
+      if (source.startsWith("/") && isBlockedMount(source)) {
+        return `Mount of host path '${source}' is not allowed`;
+      }
+    }
+  }
+
   return null; // safe
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/docker-proxy.js` around lines 163 - 181, Validation currently only
inspects spec.HostConfig.Mounts and can miss a top-level spec.Mounts array;
update the mount validation in scripts/docker-proxy.js to also iterate
spec.Mounts (in addition to HostConfig.Mounts) and apply the same checks (Type
=== "bind", inspect mount.Source, call isBlockedMount) so bind mounts declared
at the top level are blocked just like HostConfig.Mounts; reuse the existing
variables/naming (binds, mounts, mount.Source, isBlockedMount) and return the
same error strings when a blocked path is detected.
🧹 Nitpick comments (30)
bin/nemoclaw.js (2)

516-520: Consider extracting the inline Python command.

The kubectl exec with embedded Python for reading the auth token is functional but hard to read. Consider extracting this to a helper function or using a simpler approach like jq if available in the sandbox.

🔧 Suggested refactor using jq (if available)
-  const token = runCapture(
-    `openshell doctor exec -- kubectl exec ${sandboxName} -n openshell -- ` +
-      `bash -c "HOME=/sandbox python3 -c \\"import json; cfg=json.load(open('/sandbox/.openclaw/openclaw.json')); print(cfg.get('gateway',{}).get('auth',{}).get('token',''))\\""`,
-    { ignoreError: true }
-  ) || null;
+  const token = runCapture(
+    `openshell doctor exec -- kubectl exec ${sandboxName} -n openshell -- ` +
+      `bash -c "HOME=/sandbox jq -r '.gateway.auth.token // empty' /sandbox/.openclaw/openclaw.json 2>/dev/null"`,
+    { ignoreError: true }
+  ) || null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 516 - 520, Extract the inline Python kubectl
command used to populate token into a small helper function (e.g.,
getSandboxAuthToken) that takes sandboxName and returns the token, or switch to
a simpler jq-based command if jq is available in the sandbox; replace the long
inline command passed to runCapture with a call to that helper, keep the
runCapture invocation and ignoreError option, and ensure the helper handles
missing file/field cases and returns null on failure so the token assignment
(const token = ...) remains unchanged.

652-664: Help text formatting issue on line 653.

Line 653 combines multiple commands on one line, making it harder to read. The alignment appears inconsistent with other help entries.

📝 Suggested formatting fix
   ${G}Sandbox Management:${R}
-    ${B}nemoclaw list${R}                    List all sandboxes           ${B}nemoclaw add-gpu-agent <name>${R}   Create a GPU-enabled agent + register in parent dashboard    nemoclaw <name> connect          Shell into a running sandbox
+    ${B}nemoclaw list${R}                    List all sandboxes
+    ${B}nemoclaw add-gpu-agent <name>${R}    Create a GPU-enabled agent + register in parent
+    nemoclaw <name> connect          Shell into a running sandbox
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 652 - 664, The help text line containing
multiple commands ("nemoclaw list", "nemoclaw add-gpu-agent <name>", "nemoclaw
<name> connect") is compressed onto one line and misaligned with other entries;
split these into separate lines so each command has its own entry, match the
spacing/column alignment used elsewhere, and preserve the color tokens (${G},
${B}, ${D}, ${R}) and surrounding padding; update the block that renders
"Sandbox Management" so entries like "nemoclaw list", "nemoclaw add-gpu-agent
<name>", and "nemoclaw <name> connect" are each on their own aligned line
consistent with "Data & Filesystem" formatting.
docs/deployment/run-multiple-gpu-agents.md (2)

122-123: Unusual relative path in troubleshooting section.

The path ./../../../.agents/skills/nemoclaw-gpu-sandbox/SKILL.md uses an unusual upward traversal pattern. Per docs guidelines, documentation source files should be edited under docs/ directory. Consider linking to a docs page instead of an internal skill file, or use a simpler relative path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/deployment/run-multiple-gpu-agents.md` around lines 122 - 123, In
docs/deployment/run-multiple-gpu-agents.md replace the unusual upward-traversal
link target "./../../../.agents/skills/nemoclaw-gpu-sandbox/SKILL.md" with a
docs-friendly target (e.g., a new or existing docs page under docs/ such as
"docs/agents/gpu-sandbox.md" or a simpler relative path from this file), update
the link text if needed, and ensure the referenced page exists and is added to
the docs index/TOC so the troubleshooting link resolves correctly.

23-24: Multiple sentences on single line.

Per the docs style guide, each sentence should be on its own line for readable diffs.

📝 Suggested fix

Split the paragraph so each sentence is on its own line in the source:

NemoClaw supports running multiple GPU-accelerated agent sandboxes on a single physical GPU using NVIDIA device plugin **time-slicing**.
Each sandbox gets its own `nvidia.com/gpu` allocation and sees the full GPU (with shared access at the hardware scheduler level).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/deployment/run-multiple-gpu-agents.md` around lines 23 - 24, The two
sentences in the paragraph ("NemoClaw supports running multiple GPU-accelerated
agent sandboxes on a single physical GPU using NVIDIA device plugin
**time-slicing**." and "Each sandbox gets its own `nvidia.com/gpu` allocation
and sees the full GPU (with shared access at the hardware scheduler level).")
must be split onto separate lines in the markdown source so each sentence
occupies its own line for cleaner diffs; update the text block containing those
sentences so the first sentence is on one line and the second sentence starts on
the next line.
Dockerfile.base (1)

83-83: docker-ce-cli is not version-pinned.

For reproducibility, consider pinning docker-ce-cli to a specific version (e.g., docker-ce-cli=5:27.5.1-1~debian.12~bookworm). The comment mentions 29.x compatibility but the package could auto-upgrade to a newer major version on rebuild.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.base` at line 83, The Dockerfile installs docker-ce-cli without a
pinned version (see the apt-get install token docker-ce-cli); update the apt-get
install command to pin docker-ce-cli to a specific package version (e.g.,
docker-ce-cli=5:27.5.1-1~debian.12~bookworm) and update the accompanying comment
to document the chosen major-version compatibility (e.g., "pin to 27.x for
reproducible builds"); ensure the pinned version exists in the configured APT
repos (or add the appropriate repository) and include the version pin in any
subsequent rebuild instructions so the image remains reproducible.
extensions/docker-proxy/index.ts (2)

136-140: Empty catch block silently swallows errors.

The container removal is best-effort, but logging the error would aid debugging when cleanup fails unexpectedly.

♻️ Proposed improvement
         // Auto-remove after collecting output
         try {
           await docker.containerRemove(containerId);
-        } catch {
-          // best-effort cleanup
+        } catch (err) {
+          // best-effort cleanup — log but don't fail
+          api.logger.warn(`[docker_run] Failed to remove container ${shortId}: ${err}`);
         }

Note: This requires capturing api in scope or passing logger to the execute function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/docker-proxy/index.ts` around lines 136 - 140, The empty catch
around docker.containerRemove(containerId) swallows errors; change it to log the
failure instead: catch the error and call the available logger (or capture/pass
api to get api.logger) with a clear message including containerId and the error;
update the execute function signature or closure to accept a logger/api if
necessary so execute can call logger.error when docker.containerRemove fails for
best-effort cleanup.

64-68: Consider using a more specific type for PluginApi.

The PluginApi type uses unknown and Record<string, unknown> extensively. If OpenClaw provides type definitions, importing them would improve type safety and IDE support.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/docker-proxy/index.ts` around lines 64 - 68, The PluginApi type is
too generic; replace the widespread use of unknown/Record<string, unknown> by
importing or defining concrete types from OpenClaw (or your app) and applying
them to PluginApi's fields: use the library's PluginConfig type for
pluginConfig, a Logger interface for logger with proper method signatures and
optional metadata, and a typed RegisterTool signature for registerTool (e.g.,
ToolDefinition / RegisterOptions) instead of Record<string, unknown>; update the
type alias name PluginApi to reference these imported/defined interfaces so IDEs
and TS checks can validate usages of pluginConfig, logger, and registerTool.
scripts/mount-sandbox.sh (2)

160-170: find -printf is GNU-specific.

The -printf option is not available in BSD find (macOS). If cross-platform support is needed, use find ... -exec basename {} \; or ls -1 instead.

♻️ Proposed portable alternative
   if mountpoint -q "$mountpoint" 2>/dev/null; then
     info "Mounted at ${mountpoint}"
     echo "  Contents:"
-    find "$mountpoint" -maxdepth 1 -mindepth 1 -printf '    %f\n' 2>/dev/null
+    ls -1 "$mountpoint" 2>/dev/null | sed 's/^/    /'
   else
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/mount-sandbox.sh` around lines 160 - 170, The do_status function uses
GNU-only find -printf which breaks on BSD/macOS; update the listing command
inside do_status (the find "$mountpoint" ... -printf call) to a portable
alternative such as using find "$mountpoint" -maxdepth 1 -mindepth 1 -exec
basename {} \; or piping find to xargs -n1 basename or simply using ls -1
"$mountpoint" to produce the same indented "    name" output so the function
works cross-platform.

51-70: Potential race condition when appending to SSH config.

Multiple concurrent invocations could produce duplicate or corrupted config entries. The grep check and append are not atomic. Consider using a lock file for the SSH config modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/mount-sandbox.sh` around lines 51 - 70, The ensure_ssh_config
function has a race between the grep check and appending to ${HOME}/.ssh/config
which can produce duplicates or corruption under concurrent runs; to fix it,
serialize access with a filesystem lock (e.g., use flock or a dedicated
lockfile) around the entire check-and-append section in ensure_ssh_config so the
host existence test and the append are performed atomically, perform the write
to a temporary file and move it into place (preserving chmod 0600) while still
holding the lock, and ensure the lock is released on all exit paths; reference
ensure_ssh_config, the host variable (${SSH_HOST_PREFIX}${sandbox}), and config
(${HOME}/.ssh/config) when making the change.
Dockerfile.sandbox-ai (2)

73-75: Missing --no-install-recommends for Node.js installation.

The NodeSource setup script may pull in recommended packages that increase image size unnecessarily.

♻️ Proposed fix
 RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
-    && apt-get install -y nodejs \
+    && apt-get install -y --no-install-recommends nodejs \
     && rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.sandbox-ai` around lines 73 - 75, The RUN instruction installs
Node.js with apt-get without suppressing recommended packages; update the RUN
command that pipes NodeSource setup and calls apt-get (the line containing
"apt-get install -y nodejs") to use apt-get install -y --no-install-recommends
nodejs so recommended packages are not pulled in, keeping the rest of the chain
(curl setup and rm -rf /var/lib/apt/lists/*) unchanged.

164-179: NEMOCLAW_PROVIDER_KEY in ARG/ENV may expose sensitive data in image layers.

Trivy flagged this as a potential secret exposure. While the value appears to be "nvidia" (a provider identifier, not an API key), the variable name suggests it could be confused with actual credentials. Consider renaming to NEMOCLAW_PROVIDER_ID to clarify it's not a secret.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.sandbox-ai` around lines 164 - 179, The ARG/ENV name
NEMOCLAW_PROVIDER_KEY likely triggers secret scanners; rename it to
NEMOCLAW_PROVIDER_ID to clarify it's an identifier, not a credential: update the
ARG declaration (replace NEMOCLAW_PROVIDER_KEY), the corresponding ENV entry,
and any references elsewhere (e.g., build-time uses or scripts reading
process.env.NEMOCLAW_PROVIDER_KEY) to use NEMOCLAW_PROVIDER_ID so the value
"nvidia" is treated as a non-secret provider id.
scripts/resume-all-sandboxes.sh (1)

68-76: Python JSON parser silently fails on errors.

The || true at line 76 suppresses all Python errors, including malformed JSON or missing keys. If sandboxes.json is corrupted, the script proceeds with zero sandboxes and reports success. Consider logging a warning when Python parsing fails.

♻️ Proposed fix to log parsing failures
   done < <(python3 -c "
 import json, sys
-d = json.load(open('${SANDBOXES_FILE}'))
-sbs = d.get('sandboxes', {})
-if isinstance(sbs, dict):
-    for sb in sbs.values(): print(sb.get('name',''))
-else:
-    for sb in sbs: print(sb.get('name',''))
-" 2>/dev/null || true)
+try:
+    d = json.load(open('${SANDBOXES_FILE}'))
+    sbs = d.get('sandboxes', {})
+    if isinstance(sbs, dict):
+        for sb in sbs.values(): print(sb.get('name',''))
+    else:
+        for sb in sbs: print(sb.get('name',''))
+except Exception as e:
+    print(f'[warning] Failed to parse sandboxes.json: {e}', file=sys.stderr)
+")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/resume-all-sandboxes.sh` around lines 68 - 76, The Python JSON
parsing invocation using python3 -c (which reads ${SANDBOXES_FILE}) is currently
silenced by "2>/dev/null || true", so parsing errors are ignored; remove the "||
true", capture the python command's exit status and any stderr, and if it fails
log a warning that includes the error and file name (reference SANDBOXES_FILE
and the python3 -c invocation) and exit or skip with a non-zero status as
appropriate; ensure the script prints the parsed sandbox names only on
successful parse and propagate failures instead of silently continuing.
scripts/start-services.sh (1)

193-209: Consider a readiness check instead of fixed sleep.

The sleep 1 after starting kubectl port-forward is a timing assumption. If the port-forward takes longer to establish, the relay may fail to connect. Consider polling for port availability.

♻️ Proposed improvement
     docker exec -d "$_gw_container" \
       kubectl port-forward pod/"${SANDBOX_NAME}" "${DASHBOARD_PORT}:${DASHBOARD_PORT}" \
       -n openshell --address 0.0.0.0 >/dev/null 2>&1 || true
-    sleep 1
+    # Wait for port-forward to become ready (up to 5 seconds)
+    for _ in $(seq 1 10); do
+      if docker exec "$_gw_container" nc -z 127.0.0.1 "${DASHBOARD_PORT}" 2>/dev/null; then
+        break
+      fi
+      sleep 0.5
+    done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/start-services.sh` around lines 193 - 209, Replace the fixed sleep 1
with a readiness poll that waits for the forwarded dashboard port to be
accepting connections before launching start_service gateway-relay;
specifically, after the docker exec ... kubectl port-forward command, implement
a loop that tests the port (e.g., using nc -z, timeout + nc, curl --max-time, or
bash /dev/tcp/"$_gw_container_ip"/"$DASHBOARD_PORT") with a short sleep and an
overall timeout, and only proceed to call start_service gateway-relay when the
port is reachable (otherwise log an error and abort); update the block around
the kubectl port-forward and the start_service gateway-relay invocation to
perform this readiness check instead of relying on sleep 1.
scripts/nemoclaw-start.sh (1)

314-325: Minor: PID file ownership may be overly restrictive.

The PID file is set to root:root with mode 600, which prevents the sandbox user from reading it. If any cleanup logic needs to read the Xvfb PID as the sandbox user, this would fail. Consider whether 644 would be more appropriate for a read-only PID file, or if the current restriction is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nemoclaw-start.sh` around lines 314 - 325, The PID file /tmp/xvfb.pid
is currently owned root:root with mode 600 which prevents the sandbox user from
reading it; change the post-start permissions so the sandbox user can read the
PID (either make it world-readable or make sandbox the owner). Concretely,
replace the chown/chmod pair for /tmp/xvfb.pid (the lines using chown root:root
/tmp/xvfb.pid and chmod 600 /tmp/xvfb.pid) with either chown root:root
/tmp/xvfb.pid && chmod 644 /tmp/xvfb.pid or chown sandbox:sandbox /tmp/xvfb.pid
&& chmod 600 /tmp/xvfb.pid depending on whether you want root to remain owner;
update any cleanup logic that assumes the file is unreadable accordingly.
test/sandbox-add-gpu-agent.test.js (3)

142-154: Dead code: first runCapture assignment is immediately overwritten.

Line 144 assigns runner.runCapture = makeDefaultCaptureMock() but line 146 immediately overwrites it with a new mock, making the first assignment unused.

♻️ Remove the dead assignment
   describe("GPU time-slicing", () => {
     it("skips time-slicing when GPU count is already > 1", async () => {
-      // Default mock returns "4" for kubectl get nodes
-      runner.runCapture = makeDefaultCaptureMock();
-      // Sandbox list: cortana ready, jarvis NOT in list yet (so we can create it)
+      // Sandbox list: cortana ready, jarvis NOT in list yet (so we can create it)
+      // Mock returns "4" for kubectl get nodes (triggers skip path)
       runner.runCapture = vi.fn().mockImplementation((cmd) => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/sandbox-add-gpu-agent.test.js` around lines 142 - 154, The test contains
a dead assignment where runner.runCapture is first set to
makeDefaultCaptureMock() and then immediately overwritten; remove the initial
assignment to avoid confusion and ensure the test only defines runner.runCapture
once (keep the vi.fn() mock block). Update the test in
test/sandbox-add-gpu-agent.test.js by deleting the line that calls
makeDefaultCaptureMock() so only the intended mock implementation for
runner.runCapture remains.

10-12: Setting process.env.HOME at module scope may affect other test files.

This global mutation happens before any test setup and persists across the module's lifetime. Consider documenting this or using Vitest's beforeAll/afterAll to restore the original HOME after tests complete.

♻️ Suggested improvement
+const originalHome = process.env.HOME;
 const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-add-gpu-agent-test-"));
 process.env.HOME = tmpDir;
+
+// Restore HOME after all tests (optional cleanup)
+afterAll(() => {
+  process.env.HOME = originalHome;
+  fs.rmSync(tmpDir, { recursive: true, force: true });
+});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/sandbox-add-gpu-agent.test.js` around lines 10 - 12, The test sets
process.env.HOME at module scope using fs.mkdtempSync into tmpDir which mutates
global state for other tests; change this to set and restore HOME inside test
lifecycle hooks by creating tmpDir in a beforeAll (or beforeEach) and assigning
process.env.HOME there, and restore the original HOME in an afterAll (or
afterEach); reference the tmpDir variable, the fs.mkdtempSync call, and
process.env.HOME so you move creation into beforeAll and restore originalHome in
afterAll to avoid leaking the environment change.

235-247: Dead code: first runCapture assignment is immediately overwritten.

Same issue as above — line 236 assigns a mock that is immediately overwritten at line 239.

♻️ Remove the dead assignment
     it("skips import when image is already in k3s", async () => {
-      runner.runCapture = makeDefaultCaptureMock(); // returns "1" for images ls
-
-      // Sandbox list: only cortana (so jarvis does not yet exist)
+      // Sandbox list: only cortana (so jarvis does not yet exist)
+      // Returns "1" for images ls (image already present)
       runner.runCapture = vi.fn().mockImplementation((cmd) => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/sandbox-add-gpu-agent.test.js` around lines 235 - 247, Remove the dead
assignment to runner.runCapture by deleting the initial call to
makeDefaultCaptureMock() so the test only sets runner.runCapture once via
vi.fn().mockImplementation; update the test "skips import when image is already
in k3s" to rely solely on the mockImplementation block (referencing
runner.runCapture and makeDefaultCaptureMock and vi.fn().mockImplementation) to
avoid the overwritten, unnecessary assignment.
scripts/resume.sh (1)

100-107: Consider using grep -F for fixed-string matching.

Although the RFC 1123 validation restricts sandbox names to [a-z0-9-], using -F (fixed string) with grep is more robust and explicit about intent.

♻️ Example fix at line 104
-    if echo "$sandbox_list" | grep -q "$SANDBOX_NAME" && echo "$sandbox_list" | grep "$SANDBOX_NAME" | grep -q "Ready"; then
+    if echo "$sandbox_list" | grep -qF "$SANDBOX_NAME" && echo "$sandbox_list" | grep -F "$SANDBOX_NAME" | grep -q "Ready"; then

Also applies to: 169-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/resume.sh` around lines 100 - 107, The grep checks that search the
sandbox name in resume.sh should use fixed-string matching to avoid accidental
regex interpretation: replace the two occurrences where sandbox_list is piped
into grep "$SANDBOX_NAME" (the pair used to detect presence and to check for
"Ready") with grep -F "$SANDBOX_NAME" (and the second pipeline where you also
check for "Ready" keep the grep -F for the sandbox name and leave the "Ready"
grep unchanged); update both the block around the current_status check
(variables current_status, sandbox_list, sandbox_reachable) and the similar
checks at lines ~169-177 to use grep -F for the sandbox name.
scripts/docker-proxy.js (1)

380-384: Binding to 0.0.0.0 exposes proxy to all network interfaces.

While this is needed for sandbox access via host.openshell.internal, it also exposes the proxy to any other network interface on the host (e.g., public IP). Consider documenting this risk or binding more restrictively.

Consider adding a comment explaining why 0.0.0.0 is necessary, or provide an environment variable to restrict the bind address:

const BIND_ADDR = process.env.NEMOCLAW_DOCKER_PROXY_BIND || "0.0.0.0";
server.listen(PROXY_PORT, BIND_ADDR, () => { ... });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/docker-proxy.js` around lines 380 - 384, The current server.listen
call binds to 0.0.0.0 (server.listen(PROXY_PORT, "0.0.0.0", ...)), which exposes
the proxy on all host interfaces; update the binding to use a configurable
address and document why 0.0.0.0 is used for sandbox access: introduce an
environment-driven BIND_ADDR (e.g., NEMOCLAW_DOCKER_PROXY_BIND) defaulting to
"0.0.0.0", replace the literal in server.listen(PROXY_PORT, BIND_ADDR, ...), and
add an inline comment explaining that 0.0.0.0 is required for
host.openshell.internal sandbox access but can be restricted in production via
the env var to limit exposure.
extensions/claw3d/index.ts (1)

36-44: Inconsistent error handling compared to PUT/POST helpers.

claw3dGet throws an error without including the response body, while claw3dPut and claw3dPost both include the response text in errors. Additionally, res.json() will throw if the response is empty or not valid JSON.

♻️ Suggested improvement for consistency and robustness
 async function claw3dGet(baseUrl: string, path: string, params?: Record<string, string>): Promise<unknown> {
   const url = new URL(`${baseUrl}${path}`);
   if (params) {
     for (const [k, v] of Object.entries(params)) url.searchParams.set(k, v);
   }
   const res = await fetch(url.toString(), { headers: { Accept: "application/json" } });
-  if (!res.ok) throw new Error(`Claw3D GET ${path} failed: ${res.status} ${res.statusText}`);
-  return res.json();
+  if (!res.ok) {
+    const text = await res.text().catch(() => "");
+    throw new Error(`Claw3D GET ${path} failed: ${res.status} ${res.statusText}${text ? " — " + text : ""}`);
+  }
+  const text = await res.text();
+  return text ? JSON.parse(text) : {};
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/claw3d/index.ts` around lines 36 - 44, claw3dGet currently throws
errors without the response body and directly calls res.json() which can throw
on empty/invalid JSON; update claw3dGet to mirror claw3dPut/claw3dPost by
reading the response text when !res.ok and including that text in the thrown
Error, and parse JSON defensively (try JSON.parse on the response text and fall
back to returning the raw text or undefined if empty) so callers don't crash on
empty or non-JSON responses; locate the logic in function claw3dGet to implement
these changes.
scripts/setup-autostart.sh (1)

131-140: Path handling may break if NEMOCLAW_ROOT contains spaces.

When NEMOCLAW_BIN is set to node ${NEMOCLAW_ROOT}/bin/nemoclaw.js (line 134), and later used unquoted in the generated script (line 160: ${NEMOCLAW_CMD} "$@"), paths with spaces will cause issues. The # shellcheck disable=SC2139 acknowledges this but doesn't fix it.

♻️ Suggested fix using an array or proper quoting
   # Final fallback: run directly from the repo via node
   if [[ -f "${NEMOCLAW_ROOT}/bin/nemoclaw.js" ]]; then
-    NEMOCLAW_BIN="node ${NEMOCLAW_ROOT}/bin/nemoclaw.js"
+    NEMOCLAW_BIN="node '${NEMOCLAW_ROOT}/bin/nemoclaw.js'"
     info "nemoclaw not linked globally — using repo binary directly"
   else

Then in the generated script, use eval:

nemoclaw_run() {
  eval ${NEMOCLAW_CMD} "\$@"
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-autostart.sh` around lines 131 - 140, The generated fallback
sets NEMOCLAW_BIN to an unquoted string "node ${NEMOCLAW_ROOT}/bin/nemoclaw.js"
which will break if NEMOCLAW_ROOT contains spaces; update the assignment and
invocation so the command is safely preserved—either store the command as a
properly quoted string or an array and change the runtime invocation (where
NEMOCLAW_CMD is used) to evaluate it safely (e.g., use eval with quoted
positional parameters or invoke the array form), ensuring references to
NEMOCLAW_BIN / NEMOCLAW_CMD and the later call `${NEMOCLAW_CMD} "$@"` are
modified accordingly to handle spaces in paths.
bin/lib/sandbox-add-gpu-agent.js (2)

22-29: Duplicate DASHBOARD_PORT constant creates drift risk.

DASHBOARD_PORT is defined here and also in sandbox-resume.js (line 16). Per context snippet 2, bin/nemoclaw.js imports it only from sandbox-resume.js. Consider importing from a shared location to avoid potential drift.

♻️ Suggested approach

Either import from sandbox-resume.js:

+const { DASHBOARD_PORT } = require("./sandbox-resume");
+
 const IMAGE_NAME = "nemoclaw-sandbox-ai";
 const IMAGE_TAG  = "v3";
 const IMAGE_REF  = `${IMAGE_NAME}:${IMAGE_TAG}`;
 
-const DASHBOARD_PORT    = 18789;
 const MAX_READY_WAIT_S  = 60;

Or create a shared constants file that both modules import from.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/sandbox-add-gpu-agent.js` around lines 22 - 29, DASHBOARD_PORT is
duplicated in sandbox-add-gpu-agent.js and sandbox-resume.js; remove the local
constant from sandbox-add-gpu-agent.js and centralize the value by exporting
DASHBOARD_PORT from a single source (either export it from sandbox-resume.js and
import it where needed, or create a new shared constants module and import from
that) so both sandbox-add-gpu-agent.js and any other consumers (e.g.,
bin/nemoclaw.js) read the same exported DASHBOARD_PORT; update the imports and
remove the redundant declaration in the file that previously declared it.

31-34: Platform-specific sleep implementation.

Using spawnSync("sleep", [...]) will not work on Windows. Consider using a cross-platform approach.

♻️ Cross-platform alternative
 /** `@internal` — exposed for testing */
 function _sleep(ms) {
-  spawnSync("sleep", [String(ms / 1000)]);
+  const end = Date.now() + ms;
+  while (Date.now() < end) {
+    // busy-wait: pegs a CPU core, acceptable only for very short CLI pauses
+  }
 }

Or use Atomics.wait if available:

function _sleep(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/sandbox-add-gpu-agent.js` around lines 31 - 34, The _sleep function
uses spawnSync("sleep", ...) which fails on Windows; replace it with a
cross-platform implementation: detect if Atomics.wait is available and use it
for a blocking sleep, otherwise branch on process.platform—use the POSIX sleep
command on non-win32 and use the Windows timeout command (e.g. "timeout /T ...
/NOBREAK") on win32—so update the _sleep function to pick the correct mechanism
at runtime.
test/sandbox-init.test.js (1)

10-22: Consider cleaning up the temp directory after tests complete.

The temp directory created at line 12 is never removed. While OS temp directories are eventually cleaned, explicitly removing it in an afterAll hook would be cleaner and prevent accumulation during repeated test runs.

Proposed addition
+import { afterAll } from "vitest";
 ...
+afterAll(() => {
+  fs.rmSync(tmpDir, { recursive: true, force: true });
+});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/sandbox-init.test.js` around lines 10 - 22, Add cleanup for the
temporary HOME created by tmpDir (from fs.mkdtempSync) by adding an afterAll
test hook that removes the directory and restores environment state;
specifically, in the test file add an afterAll that calls fs.rmSync(tmpDir, {
recursive: true, force: true }) (or fs.rmdirSync for older Node) and optionally
resets process.env.HOME to its original value saved before assignment,
referencing the tmpDir variable and the code block that sets process.env.HOME.
extensions/docker-proxy/docker-client.ts (2)

173-187: Pull errors may be silently ignored.

Docker image pull errors appear as {"error": "..."} in the stream. The current implementation only returns last?.status, potentially masking pull failures. Consider checking for an error field.

Suggested improvement
     const lines = raw.trim().split("\n");
     const statuses = lines.map((l) => {
       try { return JSON.parse(l); } catch { return null; }
     }).filter(Boolean);
     const last = statuses[statuses.length - 1];
+    if (last?.error) {
+      throw new Error(`Image pull failed: ${last.error}`);
+    }
     return last?.status ?? "Pull complete";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/docker-proxy/docker-client.ts` around lines 173 - 187, The
imagePull function currently parses the stream into statuses but only returns
last?.status, which can hide errors because Docker streams error objects like
{"error":"..."}; update imagePull (and its parsing of raw/lines/statuses/last)
to inspect parsed status objects for an error or Error field (check any status
with error before taking last), and if found either throw a descriptive Error or
return the error message instead of a success status; ensure JSON.parse failures
are still handled and include the error text from the status object in the
thrown/returned value so pull failures are not silently ignored.
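
A standalone sketch of this error-surfacing approach (Docker's pull progress stream is newline-delimited JSON, and failures appear as `{"error": ...}` entries; the function name here is illustrative, not the extension's API):

```javascript
// Parse a Docker pull progress stream (NDJSON) and surface errors.
function summarizePull(raw) {
  const statuses = raw
    .trim()
    .split("\n")
    .map((line) => {
      try { return JSON.parse(line); } catch { return null; }
    })
    .filter(Boolean);
  // Check every entry, not just the last one, since the error
  // object is not guaranteed to be the final line of the stream.
  const failed = statuses.find((s) => s.error);
  if (failed) throw new Error(`Image pull failed: ${failed.error}`);
  const last = statuses[statuses.length - 1];
  return (last && last.status) || "Pull complete";
}
```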

98-103: Inconsistent status code handling.

The noContent method only rejects status codes >= 400, thereby accepting 3xx redirects, while json and text reject anything >= 300. This inconsistency could mask unexpected redirect responses.

Suggested fix for consistency
   private async noContent(opts: RequestOptions): Promise<void> {
     const { status, body } = await this.request(opts);
-    if (status < 200 || status >= 400) {
+    if (status < 200 || status >= 300) {
       throw new Error(`Docker API ${opts.method} ${opts.path} returned ${status}: ${body.slice(0, 200)}`);
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/docker-proxy/docker-client.ts` around lines 98 - 103, The
noContent method currently allows 3xx redirects while json and text reject any
status >=300; update noContent (method noContent and its use of RequestOptions)
to treat any status <200 or >=300 as error to match json/text behavior by
changing the condition from "status < 200 || status >= 400" to "status < 200 ||
status >= 300" and keep the same error construction (`Docker API ${opts.method}
${opts.path} returned ${status}: ${body.slice(0, 200)}`) so redirect responses
are not silently accepted.
bin/lib/sandbox-resume.js (1)

39-48: Substring match may cause false positives.

sbListOut.includes(sandboxName) could incorrectly match if one sandbox name is a substring of another (e.g., "foo" matches "foobar"). Consider using a regex with word boundaries or parsing the structured output.

Suggested improvement
-  if (!sbListOut.includes(sandboxName)) {
+  // Use word boundary to avoid substring false positives
+  const namePattern = new RegExp(`\\b${sandboxName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\b`);
+  if (!namePattern.test(sbListOut)) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/sandbox-resume.js` around lines 39 - 48, The current check uses
sbListOut.includes(sandboxName) which can false-positive on substrings; update
the validation after runner.runCapture("openshell sandbox list...") to detect
exact sandbox names by splitting sbListOut into lines (e.g.,
sbListOut.split(/\r?\n/).map(l => l.trim())) and checking for an exact equality
match against sandboxName, or use a regex with anchors/word-boundaries (e.g.,
new RegExp(`^${escapeRegExp(sandboxName)}$`, 'm')) to ensure whole-name matches;
adjust the error branch to use the new boolean (e.g., foundSandbox) and keep the
existing messaging and return shape.
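
A minimal sketch of the whole-token approach (this assumes the sandbox name appears as the first whitespace-delimited column of each output line; `sandboxListed` is a hypothetical helper, not existing code):

```javascript
// Match a sandbox name as a whole token, not a substring.
// Word-boundary regexes still false-positive on hyphenated names
// ("sandbox" would match inside "my-sandbox"), so compare tokens.
function sandboxListed(listOutput, name) {
  return listOutput
    .split(/\r?\n/)
    .some((line) => line.trim().split(/\s+/)[0] === name);
}
```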
scripts/add-gpu-agent.sh (3)

330-365: Shell variable injection into Python code is fragile.

The pattern '"${AGENT_NAME}"' embeds shell variables directly into Python source. If AGENT_NAME contains quotes or special characters, the Python syntax breaks. Earlier validation (line 87) helps, but consider passing values via environment variables for safety.

Safer approach using environment variables
-docker exec -i "${GATEWAY_CONTAINER}" kubectl exec -n openshell "${PARENT_NAME}" -- python3 -c '
+docker exec -i "${GATEWAY_CONTAINER}" kubectl exec -n openshell "${PARENT_NAME}" -- \
+  env AGENT_NAME="${AGENT_NAME}" PARENT_NAME="${PARENT_NAME}" python3 -c '
 import json, os, stat
 
 config_path = "/sandbox/.openclaw/openclaw.json"
 config = json.load(open(config_path))
+agent_name = os.environ["AGENT_NAME"]
+parent_name = os.environ["PARENT_NAME"]
 
 if "agents" not in config:
     config["agents"] = {"defaults": {}, "list": []}
 ...
-existing = [a for a in agent_list if a.get("id") == "'"${AGENT_NAME}"'"]
+existing = [a for a in agent_list if a.get("id") == agent_name]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/add-gpu-agent.sh` around lines 330 - 365, The Python snippet in
scripts/add-gpu-agent.sh injects shell variables directly into source (e.g.,
'"${AGENT_NAME}"' and '"${PARENT_NAME}"'), which breaks on special chars;
instead export/pass AGENT_NAME and PARENT_NAME as environment variables into the
docker/kubectl exec call and read them inside the Python block via os.environ
(use os.environ.get("AGENT_NAME") / os.environ.get("PARENT_NAME")), then use
those values when building/ searching agent_list and parent; keep using
config_path, parent, and agent_list identifiers but remove all inline '"${...}"'
substitutions so the Python source is static and robust.

107-108: GNU grep -P flag limits portability.

grep -oP uses Perl-compatible regex which isn't available on BSD/macOS grep. Since this script targets GPU setups (Linux), this is likely acceptable, but consider using sed or awk for broader compatibility.

Portable alternative using sed
-GATEWAY_NAME=$(openshell gateway info 2>&1 | grep -oP 'Gateway:\s+\K\S+' || echo "nemoclaw")
+GATEWAY_NAME=$(openshell gateway info 2>&1 | sed -n 's/.*Gateway:[[:space:]]*\([^[:space:]]*\).*/\1/p' | head -1)
+[[ -n "$GATEWAY_NAME" ]] || GATEWAY_NAME="nemoclaw"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/add-gpu-agent.sh` around lines 107 - 108, The use of grep -oP to
extract the gateway name is not portable; replace the grep -oP invocation used
to set GATEWAY_NAME (the command invoking "openshell gateway info") with a
POSIX-friendly text extraction such as awk or sed that matches the "Gateway:"
label and returns the following non-space token (so GATEWAY_NAME remains the
gateway identifier). Update the line that defines GATEWAY_NAME to pipe the
openshell output into the chosen awk/sed expression and keep the fallback echo
"nemoclaw" intact.

372-392: datetime.utcnow() is deprecated in Python 3.12+.

Line 382 uses datetime.datetime.utcnow() which is deprecated. For forward compatibility, use datetime.datetime.now(datetime.UTC).

Suggested fix
-    'createdAt': datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.000Z'),
+    'createdAt': datetime.datetime.now(datetime.UTC).strftime('%Y-%m-%dT%H:%M:%S.000Z'),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/add-gpu-agent.sh` around lines 372 - 392, The Python snippet in
scripts/add-gpu-agent.sh uses datetime.datetime.utcnow(), which is deprecated;
update the imports and call to use an explicit UTC-aware timestamp (e.g., use
datetime.datetime.now(datetime.timezone.utc) or from datetime import datetime,
timezone and call datetime.now(timezone.utc)) and keep the existing
.strftime('%Y-%m-%dT%H:%M:%S.000Z') formatting; specifically modify the import
line and replace datetime.datetime.utcnow() in the block that constructs
'createdAt' (the dict assigned to d['sandboxes']['${AGENT_NAME}']) so the
timestamp is timezone-aware and compatible with Python 3.12+.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f9deee4-3e2f-44cf-bd59-53684922ab8f

📥 Commits

Reviewing files that changed from the base of the PR and between dae494f and f72f57f.

📒 Files selected for processing (36)
  • .agents/skills/docs/nemoclaw-mount-filesystem/SKILL.md
  • .agents/skills/nemoclaw-docker-proxy/SKILL.md
  • .agents/skills/nemoclaw-gpu-sandbox/SKILL.md
  • Dockerfile
  • Dockerfile.base
  • Dockerfile.sandbox-ai
  • bin/lib/nim.js
  • bin/lib/onboard.js
  • bin/lib/sandbox-add-gpu-agent.js
  • bin/lib/sandbox-init.js
  • bin/lib/sandbox-resume.js
  • bin/nemoclaw.js
  • docs/deployment/run-multiple-gpu-agents.md
  • docs/index.md
  • extensions/claw3d/index.ts
  • extensions/claw3d/openclaw.plugin.json
  • extensions/docker-proxy/docker-client.ts
  • extensions/docker-proxy/index.ts
  • extensions/docker-proxy/openclaw.plugin.json
  • nemoclaw-blueprint/policies/presets/claw3d.yaml
  • nemoclaw-blueprint/policies/presets/docker-proxy.yaml
  • scripts/add-gpu-agent.sh
  • scripts/backup-workspace.sh
  • scripts/docker-proxy.js
  • scripts/mount-sandbox.sh
  • scripts/nemoclaw-start.sh
  • scripts/post-onboard-gpu.sh
  • scripts/resume-all-sandboxes.sh
  • scripts/resume.sh
  • scripts/setup-autostart.sh
  • scripts/start-services.sh
  • test/cli.test.js
  • test/nim.test.js
  • test/sandbox-add-gpu-agent.test.js
  • test/sandbox-init.test.js
  • test/sandbox-resume.test.js

@Junior00619 force-pushed the fix/jetson-gpu-detection branch from f72f57f to 07a8307 on March 30, 2026 at 17:59
@wscurran added the labels "Platform: AGX Thor/Orin" (Support for Jetson AGX Thor and Orin) and "fix" on Mar 30, 2026
@wscurran
Contributor

✨ Thanks for submitting this PR with a detailed summary. It addresses a bug with GPU detection on specific hardware platforms and proposes a fix to improve NemoClaw's compatibility with these devices.

detectGpu() queries nvidia-smi for VRAM, which returns [N/A] on
unified-memory architectures. The existing fallback matched GPU names
containing "GB10" (DGX Spark) but missed Jetson AGX Thor and Orin,
leaving those devices undetected.

Broaden the name check to ["GB10", "Thor", "Orin"]. Align nim.js with
the runner.runCapture() indirection already used in
sandbox-add-gpu-agent.js to enable mocked test coverage of the fallback
path.

Five new tests exercise each device tag, a desktop GPU negative case,
and the standard VRAM-queryable early return.

Fixes NVIDIA#300
@Junior00619 force-pushed the fix/jetson-gpu-detection branch from d3f80af to f49bd0c on March 30, 2026 at 19:31
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/nim.test.js`:
- Around line 138-150: The test currently uses an if (gpu) guard which can hide
failures; instead assert the GPU was detected and that unified-memory fallback
was not taken: first assert gpu is defined (or truthy) after calling
nim.detectGpu(), then assert runner.runCapture was never called with a command
containing "free -m" (to ensure no fallback probe occurred), and finally assert
gpu.spark is undefined to confirm no unified-memory tagging; reference
nim.detectGpu and runner.runCapture in your changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e709151a-2e91-451a-981e-c650e5617499

📥 Commits

Reviewing files that changed from the base of the PR and between 07a8307 and f49bd0c.

📒 Files selected for processing (2)
  • bin/lib/nim.js
  • test/nim.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/nim.js

Comment on lines +138 to +150
it("does not trigger fallback for desktop GPU names", () => {
// Desktop GPUs return valid VRAM, but even if VRAM were [N/A],
// the name must not match any unified-memory tag.
runner.runCapture = vi.fn().mockImplementation((cmd) => {
if (cmd.includes("--query-gpu=memory.total")) return "[N/A]";
if (cmd.includes("--query-gpu=name")) return "NVIDIA GeForce RTX 4090";
return "";
});
const gpu = nim.detectGpu();
// Should NOT match as a unified-memory device
if (gpu) {
expect(gpu.spark).toBeUndefined();
}
Contributor


⚠️ Potential issue | 🟡 Minor

Tighten the negative-case assertion to avoid false positives.

The if (gpu) guard allows this test to pass even when it validates nothing. Assert that unified-memory fallback was not taken by checking command behavior (no free -m call), and keep the spark assertion.

Suggested test hardening
       const gpu = nim.detectGpu();
-      // Should NOT match as a unified-memory device
-      if (gpu) {
-        expect(gpu.spark).toBeUndefined();
-      }
+      const commands = runner.runCapture.mock.calls.map(([cmd]) => cmd);
+      expect(commands.some((cmd) => cmd.includes("--query-gpu=name"))).toBe(true);
+      expect(commands.some((cmd) => cmd.includes("free -m"))).toBe(false);
+      expect(gpu?.spark).toBeUndefined();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
it("does not trigger fallback for desktop GPU names", () => {
  // Desktop GPUs return valid VRAM, but even if VRAM were [N/A],
  // the name must not match any unified-memory tag.
  runner.runCapture = vi.fn().mockImplementation((cmd) => {
    if (cmd.includes("--query-gpu=memory.total")) return "[N/A]";
    if (cmd.includes("--query-gpu=name")) return "NVIDIA GeForce RTX 4090";
    return "";
  });
  const gpu = nim.detectGpu();
  // Should NOT match as a unified-memory device
  if (gpu) {
    expect(gpu.spark).toBeUndefined();
  }

it("does not trigger fallback for desktop GPU names", () => {
  // Desktop GPUs return valid VRAM, but even if VRAM were [N/A],
  // the name must not match any unified-memory tag.
  runner.runCapture = vi.fn().mockImplementation((cmd) => {
    if (cmd.includes("--query-gpu=memory.total")) return "[N/A]";
    if (cmd.includes("--query-gpu=name")) return "NVIDIA GeForce RTX 4090";
    return "";
  });
  const gpu = nim.detectGpu();
  const commands = runner.runCapture.mock.calls.map(([cmd]) => cmd);
  expect(commands.some((cmd) => cmd.includes("--query-gpu=name"))).toBe(true);
  expect(commands.some((cmd) => cmd.includes("free -m"))).toBe(false);
  expect(gpu?.spark).toBeUndefined();
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/nim.test.js` around lines 138 - 150, The test currently uses an if (gpu)
guard which can hide failures; instead assert the GPU was detected and that
unified-memory fallback was not taken: first assert gpu is defined (or truthy)
after calling nim.detectGpu(), then assert runner.runCapture was never called
with a command containing "free -m" (to ensure no fallback probe occurred), and
finally assert gpu.spark is undefined to confirm no unified-memory tagging;
reference nim.detectGpu and runner.runCapture in your changes.
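The "no `free -m` call" signal in the prompt above follows from the order of probes in the fallback path. A minimal sketch, assuming an injectable runner — the function name, tag list, and parsing here are illustrative, not the actual `nim.js` internals:

```javascript
// Hypothetical sketch of the detection flow this test exercises.
// The real nim.js code differs; only the command ordering matters here.
const UNIFIED_MEMORY_TAGS = ["GB10", "Thor", "Orin"];

function detectGpuSketch(runCapture) {
  const vram = runCapture("nvidia-smi --query-gpu=memory.total --format=csv,noheader");
  const vramMb = parseInt(vram, 10);
  if (!Number.isNaN(vramMb)) {
    // Discrete GPU: VRAM query succeeded, early return, no fallback.
    return { vramMb, spark: false };
  }
  // Unified memory reports "[N/A]"; fall back to a name match.
  const name = runCapture("nvidia-smi --query-gpu=name --format=csv,noheader");
  const tag = UNIFIED_MEMORY_TAGS.find((t) => name.includes(t));
  if (!tag) return null; // desktop GPU with odd VRAM output: never probes free -m
  // Only matched devices read system RAM as the VRAM estimate.
  const memLine = runCapture("free -m").split("\n")[1] || "";
  const vramEstimate = parseInt(memLine.trim().split(/\s+/)[1], 10);
  return { vramMb: vramEstimate, spark: tag === "GB10" };
}
```

Because `free -m` is only reached after a tag match, asserting that no recorded command contains `free -m` directly proves the fallback was not taken.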

kjw3 added a commit that referenced this pull request Mar 31, 2026
## Summary
- stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and
only gate the Telegram bridge when that bridge actually needs the key
- clean up the dashboard forward, `nemoclaw` gateway, and
`openshell-cluster-nemoclaw` Docker volumes when the last sandbox is
destroyed
- broaden unified-memory NVIDIA GPU detection beyond `GB10` while
keeping `spark: true` specific to GB10
- harden policy merge/retry behavior so truncated or error-like
current-policy reads rebuild from a clean `version: 1` scaffold instead
of producing malformed YAML

## Issue Mapping
Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991

## Notes
- `#1188` was investigated but is not included in this PR.
- The current evidence still points to a deeper runtime / proxy
reachability problem on macOS + Colima rather than a bounded
NemoClaw-only fix.
- Keeping it out of this branch avoids speculative networking changes
without strong reproduction and cross-platform coverage.

## Validation
```bash
npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit
```

## References Reviewed
- #1106
- #308
- #95
- #770

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  - Core services can start without an NVIDIA API key.
- Enhanced unified‑memory GPU detection with more accurate capability
reporting.

* **Bug Fixes**
- Gateway and forwarded‑port cleanup only runs when the last sandbox is
removed and no live sandboxes remain.
- Telegram bridge now starts only when both required tokens are present;
clearer startup warnings.
- Policy parsing/merge more robust for metadata‑only or malformed
inputs; consistent version header formatting.

* **Tests**
- Added tests covering GPU detection, policy parsing/merge, CLI
sandbox/gateway flows, and service startup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
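The policy-hardening bullet in the commit summary above can be illustrated with a minimal sketch; the predicate and scaffold shape are assumptions, not the actual `policies.js` logic:

```javascript
// Illustrative only: treats truncated or error-like current-policy reads
// as unusable and rebuilds from a clean scaffold instead of merging into them.
function policyBaseForMerge(raw) {
  const looksUsable =
    typeof raw === "string" &&
    raw.trimStart().startsWith("version:") &&
    !/^error[:\s]/im.test(raw);
  return looksUsable ? raw : "version: 1\n";
}
```

Merging onto the scaffold rather than the damaged read is what prevents the malformed-YAML output the summary describes.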
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
@ericksoa
Contributor

ericksoa commented Apr 8, 2026

Thanks for working on this! This was superseded by #1271, which migrated nim.js to TypeScript and included the Jetson Thor/Orin/Xavier unified-memory fallback (shipped in v0.0.3). Closing in favor of that PR.

@ericksoa ericksoa closed this Apr 8, 2026

Labels

fix · Platform: AGX Thor/Orin (Support for Jetson AGX Thor and Orin)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nemoclaw fails to detect GPU on NVIDIA Jetson Thor and Orin

3 participants