
feat(container): richen ManagedContainer — isImageCurrent + logs + sibling-exec #968

Merged
Dani Akash (DaniAkash) merged 2 commits into dev from feat/managed-container-richen
May 8, 2026

@DaniAkash
Contributor

Summary

Adds four primitives to the ManagedContainer abstract base ahead of the OpenClaw runtime migration. Without these, the upcoming OpenClawContainerRuntime would need to re-implement them locally; with them, the migration consumes shared base behaviour and Hermes silently inherits the new surface (unused but free).

  • isImageCurrent(): Promise<boolean> — pure predicate comparing the existing container's image ref to descriptor.defaultImage. Treats SHA-pinned variants (<ref>@sha256:…) as matches. start() is unchanged — today's "always recreate from a fresh buildContainerSpec()" semantics stays in place; subclasses and service layers compose the predicate where they want a short-circuit.
  • getLogs(tail = 50): Promise<string[]> and tailLogs(onLine: LogFn): () => void — generic log primitives, thin pass-throughs to the underlying ContainerCli.
  • runOneShot(argv, opts?): Promise<ExecResult> — sibling-container helper. Spawns a <container-name>-setup container with the same image / mounts / add-hosts / base env (no ports / health / restart), runs argv, force-removes after. Includes the retry-on-name-collision behaviour previously bespoke to OpenClaw's gateway-setup path.
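The SHA-pinned matching rule in isImageCurrent() can be illustrated as a pure string comparison. This is a minimal sketch, not the PR's code: `imageRefMatches` is a hypothetical helper that assumes the CLI reports the container's image either as the exact ref or as `<ref>@sha256:<digest>`, with `null` standing for "no such container".

```typescript
// Hypothetical helper: models only the string comparison behind isImageCurrent(),
// not the CLI call that fetches the container's current image ref.
function imageRefMatches(actual: string | null, expected: string): boolean {
  // null stands for "no such container", which can never be current.
  if (actual === null) return false;
  // Exact match, e.g. "repo/app:1.2" === "repo/app:1.2".
  if (actual === expected) return true;
  // SHA-pinned variant of the same ref, e.g. "repo/app:1.2@sha256:<digest>".
  return actual.startsWith(`${expected}@sha256:`);
}
```

Comparing against the full `<expected>@sha256:` prefix, rather than a bare `startsWith(expected)`, keeps a ref like `app:1.20` from matching an expected `app:1.2`.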

Net diff: +160 LOC source, +165 LOC tests.

Why

OpenClaw's pre-ManagedContainer ContainerRuntime class carries five auxiliary methods (image-ref comparison, logs, sibling-container exec) that are container-generic rather than OpenClaw-specific. Lifting them onto the base makes the upcoming Phase 4a runtime migration a smaller diff and gives Hermes (plus future container adapters) a richer surface for free. The discussion that motivated this lives in the team plan files; the short version is to scope the migration onto a cleaner foundation rather than drag the auxiliary surface onto a single subclass.

Test plan

  • bun run typecheck clean across the server package
  • biome check clean on touched sources + tests
  • 10 new test cases in managed-container.test.ts:
    • isImageCurrent: ref match, SHA-pinned match, ref differs, container missing
    • getLogs: collects lines from cli.runCommand(['logs', '-n', N, name])
    • tailLogs: returns the unsubscribe handle from cli.tailLogs
    • runOneShot: creates the -setup sibling with no ports/health, runs argv, force-removes; cleans up even on inner throw; retries on ContainerNameInUseError
  • Full server test sweep — 1037 pass, 0 fail (one pre-existing ContainerCli flake also reproduces on plain origin/dev)
  • No Hermes-side changes needed — the new methods are unused on the Hermes runtime; existing tests stay green

… to ManagedContainer

Four base-class additions ahead of the OpenClaw runtime migration so
the upcoming subclass doesn't have to re-implement them:

- isImageCurrent() — pure predicate comparing the existing container's
  image ref to descriptor.defaultImage. Treats SHA-pinned variants as
  matches. start() is unchanged; subclasses + service layers compose
  the predicate where they want short-circuit behaviour.
- getLogs(tail) and tailLogs(onLine) — generic log primitives, thin
  pass-throughs to ContainerCli.
- runOneShot(argv, opts) — sibling-container helper that spawns a
  <name>-setup container with the same image+mounts+env (no ports/
  health/restart), runs argv, force-removes after. Includes the
  retry-on-name-collision behaviour previously bespoke to OpenClaw.

Hermes inherits unused surface only — no behavioural change. The
in-flight base-class tests cover all four primitives.
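A rough sketch of how the `<name>-setup` sibling spec could be derived from the main container's spec. The `ContainerSpec` shape and the `toOneShotSpec` helper are hypothetical stand-ins for whatever buildContainerSpec() actually returns; the sketch only shows which fields carry over (image, mounts, add-hosts, base env) and which are dropped (ports, health, restart).

```typescript
// Hypothetical spec shape; the real ContainerSpec in the server package may differ.
interface ContainerSpec {
  name: string;
  image: string;
  mounts: string[];
  addHosts: string[];
  env: Record<string, string>;
  ports?: string[];
  healthCheck?: { cmd: string[]; intervalSec: number };
  restart?: "no" | "always" | "unless-stopped";
  command?: string[];
}

// Build the "<name>-setup" sibling: same image, mounts, add-hosts and base env,
// with argv as the command. Ports, health check and restart policy are
// deliberately left out so the sibling runs once and exits.
function toOneShotSpec(base: ContainerSpec, argv: string[]): ContainerSpec {
  return {
    name: `${base.name}-setup`,
    image: base.image,
    mounts: [...base.mounts],
    addHosts: [...base.addHosts],
    env: { ...base.env },
    command: [...argv],
  };
}
```

Copying the arrays and the env map keeps later changes to the sibling spec from leaking back into the main container's spec.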

@github-actions
Contributor

github-actions Bot commented May 8, 2026

✅ Tests passed — 1219/1223

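| Suite | Passed | Failed | Skipped |
|---|---|---|---|
| agent | 76/76 | 0 | 0 |
| build | 9/9 | 0 | 0 |
| eval | 93/93 | 0 | 0 |
| server-agent | 261/261 | 0 | 0 |
| server-api | 203/203 | 0 | 0 |
| server-browser | 4/4 | 0 | 0 |
| server-integration | 9/10 | 0 | 1 |
| server-lib | 242/242 | 0 | 0 |
| server-root | 60/63 | 0 | 3 |
| server-skills | 31/31 | 0 | 0 |
| server-tools | 231/231 | 0 | 0 |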


@greptile-apps
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR enriches ManagedContainer with four new primitives — isImageCurrent, getLogs, tailLogs, and runOneShot — that are container-generic and will be consumed by the upcoming OpenClawContainerRuntime migration rather than being re-implemented per subclass.

  • isImageCurrent compares the running container's image ref to descriptor.defaultImage, treating SHA-pinned variants as matches; it is a pure predicate and is never called by the base machinery.
  • getLogs / tailLogs are thin pass-throughs to ContainerCli.runCommand and ContainerCli.tailLogs respectively, with getLogs collecting interleaved stdout+stderr lines.
  • runOneShot spawns a -setup sibling container with no ports/health/restart, runs the supplied argv, and force-removes the sibling in a finally block; includes a retry-on-ContainerNameInUseError loop in createOneShotContainer and an optional process timeout via Promise.race.
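The retry-on-name-collision step can be sketched as a bounded loop that clears the name before every attempt. The method names follow the sequence diagram in this review, but their signatures, the `SetupCli` interface, and the local `ContainerNameInUseError` class are assumptions made so the example is self-contained; it is not the PR's createOneShotContainer.

```typescript
// Local stand-in for the error class named in the PR.
class ContainerNameInUseError extends Error {}

// Assumed minimal CLI surface; names follow the diagram, signatures are guesses.
interface SetupCli {
  removeContainer(name: string, opts: { force: boolean }): Promise<void>;
  waitForContainerNameRelease(name: string): Promise<void>;
  createContainer(name: string): Promise<void>;
}

// Clear any stale sibling, then create. On a name collision, clear and try again,
// allowing up to `maxRetries` retries after the first attempt. Any other error
// is rethrown immediately.
async function createWithNameRetry(cli: SetupCli, name: string, maxRetries = 3): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    await cli.removeContainer(name, { force: true });
    await cli.waitForContainerNameRelease(name);
    try {
      await cli.createContainer(name);
      return;
    } catch (err) {
      if (!(err instanceof ContainerNameInUseError) || attempt >= maxRetries) throw err;
    }
  }
}
```

With `maxRetries = 3` this makes at most four create attempts, matching the "up to 3 retries" note in the diagram.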

Confidence Score: 4/5

Safe to merge with awareness of two minor behavioural gaps in the new surface area.

The implementation is well-structured and the retry/cleanup logic in runOneShot is robust. getLogs does not check the CLI exit code so nerdctl error messages are silently collected into the returned log lines rather than triggering a distinct error path; and runWithOptionalTimeout abandons the underlying runCommand promise when the timeout fires, leaving a brief window where opts.onLog can still be called after the caller has received the timeout error. Neither affects the live container or existing Hermes code, but callers building on these new primitives should be aware before the OpenClaw migration consumes them.

The getLogs and runWithOptionalTimeout helpers in managed-container.ts are the spots most worth a second look before this surface is consumed by OpenClawContainerRuntime.

Important Files Changed

| Filename | Overview |
|---|---|
| packages/browseros-agent/apps/server/src/lib/container/managed/managed-container.ts | Adds isImageCurrent, getLogs, tailLogs, and runOneShot to ManagedContainer; getLogs doesn't check the CLI exit code so error text can leak into the returned lines array, and runWithOptionalTimeout leaves the underlying runCommand promise floating after a timeout fires. |
| packages/browseros-agent/apps/server/tests/lib/container/managed/managed-container.test.ts | Adds 10 new test cases covering isImageCurrent, getLogs, tailLogs, and runOneShot (including retry and cleanup-on-throw paths); all tests mock cli methods directly, so the getLogs exit-code gap and the timeout callback-leak scenario are not exercised. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ManagedContainer
    participant ContainerCli

    Note over Caller,ContainerCli: isImageCurrent
    Caller->>ManagedContainer: isImageCurrent()
    ManagedContainer->>ContainerCli: containerImageRef(containerName)
    ContainerCli-->>ManagedContainer: "actual ref | null"
    ManagedContainer-->>Caller: boolean (exact match or SHA-pinned variant)

    Note over Caller,ContainerCli: getLogs
    Caller->>ManagedContainer: getLogs(tail?)
    ManagedContainer->>ContainerCli: runCommand(['logs','-n',N,name], onLog)
    ContainerCli-->>ManagedContainer: lines via onLog callbacks
    ManagedContainer-->>Caller: string[]

    Note over Caller,ContainerCli: runOneShot
    Caller->>ManagedContainer: runOneShot(argv, opts?)
    ManagedContainer->>ManagedContainer: withLifecycleLock('run-one-shot')
    ManagedContainer->>ManagedContainer: buildContainerSpec()
    ManagedContainer->>ContainerCli: removeContainer(setupName, force)
    ManagedContainer->>ContainerCli: waitForContainerNameRelease(setupName)
    ManagedContainer->>ContainerCli: createContainer(setupSpec)
    alt ContainerNameInUseError (up to 3 retries)
        ManagedContainer->>ContainerCli: removeContainer + waitForRelease + createContainer
    end
    ManagedContainer->>ContainerCli: runCommand(['start','-a',setupName])
    alt processTimeoutMs set
        Note over ManagedContainer: Promise.race vs timeout
    end
    ContainerCli-->>ManagedContainer: ExecResult
    ManagedContainer->>ContainerCli: removeContainer(setupName, force) [finally]
    ManagedContainer-->>Caller: ExecResult
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/browseros-agent/apps/server/src/lib/container/managed/managed-container.ts:293-300
**`getLogs` silently surfaces error text as log lines**

`runCommand` passes both stdout and stderr through the `onLog` callback, and `getLogs` collects everything indiscriminately without checking the returned `exitCode`. When the container doesn't exist, `nerdctl logs` exits with code 1 and writes a message like `Error: no such container: …` to stderr, which ends up in the returned `lines[]`. Callers get back an array that looks like valid log output but is actually a nerdctl error string, with no way to distinguish the two cases. Consider checking `result.exitCode` and returning `[]` (or throwing) on failure.
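One way to act on this, sketched under assumptions: check `exitCode` after the command finishes, treat a missing container as "no logs", and throw for any other failure. `LogsCli`, `ExecResult`, `getLogsChecked`, and the message-matching `looksLikeNoSuchContainer` are hypothetical; the follow-up commit on this PR exports its own `isNoSuchContainer` predicate from container-cli, whose exact behaviour is not shown here.

```typescript
// Assumed shapes, modelled on the review text rather than the real ContainerCli.
interface ExecResult {
  exitCode: number;
}
type LogFn = (line: string) => void;
interface LogsCli {
  // Streams stdout and stderr lines through onLog, resolves with the exit code.
  runCommand(args: string[], onLog: LogFn): Promise<ExecResult>;
}

// Hypothetical message match; the real predicate may inspect output differently.
function looksLikeNoSuchContainer(output: string): boolean {
  return /no such container/i.test(output);
}

// Hypothetical free-function variant of getLogs that checks the exit code.
async function getLogsChecked(cli: LogsCli, name: string, tail = 50): Promise<string[]> {
  const lines: string[] = [];
  const result = await cli.runCommand(["logs", "-n", String(tail), name], (line) => {
    lines.push(line);
  });
  if (result.exitCode === 0) return lines;
  // Non-zero exit: whatever was collected is CLI error text, not container logs.
  const output = lines.join("\n");
  if (looksLikeNoSuchContainer(output)) return [];
  throw new Error(`logs for ${name} failed (exit ${result.exitCode}): ${output}`);
}
```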

### Issue 2 of 2
packages/browseros-agent/apps/server/src/lib/container/managed/managed-container.ts:395-419
**Abandoned `runCommand` promise may deliver stale `onLog` callbacks after timeout**

When the timeout fires, `Promise.race` rejects and `runWithOptionalTimeout` returns, but the `cli.runCommand(['start', '-a', setupName], opts.onLog)` promise is abandoned — it continues executing in the background. The `opts.onLog` callback will still be called by that floating promise until `removeContainer(force: true)` in the `finally` block kills the container (which causes `nerdctl start -a` to exit). During that brief window, the caller's `onLog` handler can fire after `runOneShot` has already thrown the timeout error. If the handler closes over any state that the caller tears down on the timeout error (e.g. a streaming buffer or a WebSocket), those stale calls could be problematic. Capturing the `runCommand` promise and awaiting it in the `finally` path would close the gap.
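Greptile's suggestion (keep a handle on the inner promise and await it during cleanup) could look like the generic helper below. `runWithTimeoutThenDrain` and its parameters are hypothetical; it assumes `cleanup` (for example a force-remove of the setup container) reliably makes the inner promise settle, otherwise the `finally` block would wait indefinitely.

```typescript
// Generic sketch: race `run` against a timeout, but never return before the
// inner promise has settled, so no callbacks it drives can fire afterwards.
async function runWithTimeoutThenDrain<T>(
  run: () => Promise<T>,
  cleanup: () => Promise<void>,
  timeoutMs: number,
): Promise<T> {
  const inner = run();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([inner, timeout]);
  } finally {
    clearTimeout(timer);
    // Cleanup (e.g. force-removing the container) is expected to end the inner run.
    await cleanup();
    // Wait for it to settle; its outcome was already reported via the race.
    await inner.catch(() => undefined);
  }
}
```

The cost of this approach is that the caller sees the timeout error only after cleanup finishes, not at the moment the timer fires.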

Reviews (1): Last reviewed commit: "feat(container): add isImageCurrent + ge..."

…-onLog leak; trim docstrings

- getLogs now distinguishes a missing container (returns []) from
  other CLI failures (throws). Previously nerdctl's stderr ("Error:
  no such container: …") leaked into the lines array as if it were
  log output. isNoSuchContainer is exported from container-cli to
  share the predicate.
- runWithOptionalTimeout wraps the caller's onLog so post-timeout
  lines from the abandoned runCommand promise become no-ops; before
  this, callers could see onLog fire after runOneShot had already
  rejected, hitting state the caller may have torn down on the
  timeout error.
- Tightens the new docstrings to one short line per the project
  convention; drops a restating comment in the test file.
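The onLog guard described in this commit can be sketched as follows. `runWithGuardedTimeout` is a hypothetical free function, not the PR's runWithOptionalTimeout, and its signature (a `run` callback that receives the possibly-wrapped `onLog`) is an assumption. Unlike the await-in-cleanup approach Greptile suggested, this lets the caller see the timeout error immediately; the abandoned promise keeps running until force-removal ends it, but it can no longer reach the caller's handler.

```typescript
type LogFn = (line: string) => void;

// Race `run` against an optional timeout. Once the timeout fires, the wrapped
// onLog turns into a no-op, so late lines from the abandoned run are dropped.
async function runWithGuardedTimeout<T>(
  run: (onLog: LogFn | undefined) => Promise<T>,
  onLog: LogFn | undefined,
  timeoutMs?: number,
): Promise<T> {
  if (timeoutMs === undefined) return run(onLog);
  let timedOut = false;
  const sink = onLog;
  const guarded: LogFn | undefined = sink
    ? (line) => {
        if (!timedOut) sink(line);
      }
    : undefined;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      timedOut = true; // flip the guard before the caller sees the rejection
      reject(new Error(`one-shot timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(guarded), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```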
Dani Akash (DaniAkash) merged commit 4e40568 into dev on May 8, 2026
20 checks passed
Dani Akash (DaniAkash) deleted the feat/managed-container-richen branch on May 8, 2026 at 10:28
