feat(sandbox): run boot hook on sandbox startup#775

Draft
drew wants to merge 2 commits into main from feat/sandbox-boot-hook

drew (Collaborator) commented Apr 7, 2026

Summary

Run /etc/openshell/boot.sh as a supervisor-managed startup hook on every sandbox startup, before the long-lived child process is launched.

Add the shared boot script path constant, cover the hook with startup/failure/regression tests, and document the new sandbox image contract.

Related Issue

None.

Changes

  • add the canonical /etc/openshell/boot.sh path in openshell-policy
  • run the boot hook from the sandbox supervisor startup path before the normal child process
  • fail sandbox startup on non-zero boot hook exit and skip cleanly when the hook is missing
  • add boot hook tests and update architecture and user docs
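
The behavior listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: the names `BOOT_HOOK_PATH`, `BootHookOutcome`, `run_boot_hook`, and `start_sandbox` are hypothetical, and invoking the script through `/bin/sh` is an assumption. Only the path `/etc/openshell/boot.sh` and the skip/fail rules come from the description.

```rust
use std::io;
use std::path::Path;
use std::process::{Child, Command};

/// Boot script path named in this PR; the constant name itself is hypothetical.
pub const BOOT_HOOK_PATH: &str = "/etc/openshell/boot.sh";

/// What happened when the supervisor looked for the boot hook.
#[derive(Debug, PartialEq, Eq)]
pub enum BootHookOutcome {
    /// No script at the given path, so startup continues untouched.
    Skipped,
    /// The script ran and exited with status 0.
    Succeeded,
}

/// Runs the boot hook if present. A missing script is skipped cleanly;
/// a non-zero exit is reported as an error so the caller can abort startup.
pub fn run_boot_hook(path: &Path) -> io::Result<BootHookOutcome> {
    if !path.exists() {
        return Ok(BootHookOutcome::Skipped);
    }
    // Assumption: run via /bin/sh so the script needs no executable bit.
    let status = Command::new("/bin/sh").arg(path).status()?;
    if status.success() {
        Ok(BootHookOutcome::Succeeded)
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("boot hook {} failed: {status}", path.display()),
        ))
    }
}

/// Runs the hook first, then spawns the long-lived child only if the hook
/// did not fail.
pub fn start_sandbox(hook: &Path, child: &mut Command) -> io::Result<Child> {
    run_boot_hook(hook)?;
    child.spawn()
}
```

In this sketch a supervisor would call `start_sandbox(Path::new(BOOT_HOOK_PATH), &mut child_command)` and treat any error as a failed sandbox startup. Because the sketch uses plain `std::process::Command`, it applies no sandbox enforcement.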

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner April 7, 2026 00:01
@drew drew self-assigned this Apr 7, 2026
github-actions bot commented Apr 7, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/OpenShell/pr-preview/pr-775/

Built to branch gh-pages at 2026-04-07 06:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@drew drew marked this pull request as draft April 7, 2026 00:08
The boot hook tests were failing on CI because they used
ProcessHandle::spawn which applies Linux sandbox enforcement
(seccomp, landlock, privilege dropping) in a pre_exec hook. On CI
containers running as root, drop_privileges tried to switch to the
non-existent sandbox user, causing EINVAL.

Replace run_test_boot_hook and spawn_test_process with a test-specific
implementation using plain tokio::process::Command that exercises boot
hook logic without sandbox enforcement.
mjamiv commented Apr 11, 2026

Production datapoint from an OpenShell v0.0.25 deployment that just hit the failure mode this PR addresses.

After an unexpected host power cycle, the sandbox pods respawned cleanly and reached the Ready phase, but the long-lived child process inside each sandbox (an OpenClaw gateway) did NOT auto-start. The pod was up, but the gateway process was missing from the process table inside it. The fleet had to be recovered manually by re-running each sandbox's startup script via openshell sandbox exec.

That is exactly the gap this PR fixes — having the supervisor run /etc/openshell/boot.sh before the long-lived child gives the sandbox image a documented contract for "what needs to run on every startup" and removes the manual recovery step.

On v0.0.25 the workaround is a host-side watchdog: a systemd user timer that curls each forwarded /health endpoint every 60s and re-launches the in-sandbox process + re-creates the forward on failure. It works but is clearly a v0.0.25-era substitute — once this PR ships in a release, the right pattern is an in-sandbox boot hook that the operator doesn't have to maintain out-of-band.

Happy to test this PR on a proxy-only sandbox deployment if the branch is ready for outside verification.
