feat(container): richen ManagedContainer — isImageCurrent + logs + sibling-exec#968
Conversation
… to ManagedContainer Four base-class additions ahead of the OpenClaw runtime migration so the upcoming subclass doesn't have to re-implement them: - isImageCurrent() — pure predicate comparing the existing container's image ref to descriptor.defaultImage. Treats SHA-pinned variants as matches. start() is unchanged; subclasses + service layers compose the predicate where they want short-circuit behaviour. - getLogs(tail) and tailLogs(onLine) — generic log primitives, thin pass-throughs to ContainerCli. - runOneShot(argv, opts) — sibling-container helper that spawns a <name>-setup container with the same image+mounts+env (no ports/ health/restart), runs argv, force-removes after. Includes the retry-on-name-collision behaviour previously bespoke to OpenClaw. Hermes inherits unused surface only — no behavioural change. The in-flight base-class tests cover all four primitives.
✅ Tests passed — 1219/1223
|
Greptile SummaryThis PR enriches
Confidence Score: 4/5Safe to merge with awareness of two minor behavioural gaps in the new surface area. The implementation is well-structured and the retry/cleanup logic in runOneShot is robust. getLogs does not check the CLI exit code so nerdctl error messages are silently collected into the returned log lines rather than triggering a distinct error path; and runWithOptionalTimeout abandons the underlying runCommand promise when the timeout fires, leaving a brief window where opts.onLog can still be called after the caller has received the timeout error. Neither affects the live container or existing Hermes code, but callers building on these new primitives should be aware before the OpenClaw migration consumes them. The getLogs and runWithOptionalTimeout helpers in managed-container.ts are the spots most worth a second look before this surface is consumed by OpenClawContainerRuntime. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant ManagedContainer
participant ContainerCli
Note over Caller,ContainerCli: isImageCurrent
Caller->>ManagedContainer: isImageCurrent()
ManagedContainer->>ContainerCli: containerImageRef(containerName)
ContainerCli-->>ManagedContainer: "actual ref | null"
ManagedContainer-->>Caller: boolean (exact match or SHA-pinned variant)
Note over Caller,ContainerCli: getLogs
Caller->>ManagedContainer: getLogs(tail?)
ManagedContainer->>ContainerCli: runCommand(['logs','-n',N,name], onLog)
ContainerCli-->>ManagedContainer: lines via onLog callbacks
ManagedContainer-->>Caller: string[]
Note over Caller,ContainerCli: runOneShot
Caller->>ManagedContainer: runOneShot(argv, opts?)
ManagedContainer->>ManagedContainer: withLifecycleLock('run-one-shot')
ManagedContainer->>ManagedContainer: buildContainerSpec()
ManagedContainer->>ContainerCli: removeContainer(setupName, force)
ManagedContainer->>ContainerCli: waitForContainerNameRelease(setupName)
ManagedContainer->>ContainerCli: createContainer(setupSpec)
alt ContainerNameInUseError (up to 3 retries)
ManagedContainer->>ContainerCli: removeContainer + waitForRelease + createContainer
end
ManagedContainer->>ContainerCli: runCommand(['start','-a',setupName])
alt processTimeoutMs set
Note over ManagedContainer: Promise.race vs timeout
end
ContainerCli-->>ManagedContainer: ExecResult
ManagedContainer->>ContainerCli: removeContainer(setupName, force) [finally]
ManagedContainer-->>Caller: ExecResult
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/browseros-agent/apps/server/src/lib/container/managed/managed-container.ts:293-300
**`getLogs` silently surfaces error text as log lines**
`runCommand` passes both stdout and stderr through the `onLog` callback, and `getLogs` collects everything indiscriminately without checking the returned `exitCode`. When the container doesn't exist, `nerdctl logs` exits with code 1 and writes a message like `Error: no such container: …` to stderr — which ends up in the returned `lines[]`. Callers get back an array that looks like valid log output but is actually an nerdctl error string, with no way to distinguish the two cases. Consider checking `result.exitCode` and returning `[]` (or throwing) on failure.
### Issue 2 of 2
packages/browseros-agent/apps/server/src/lib/container/managed/managed-container.ts:395-419
**Abandoned `runCommand` promise may deliver stale `onLog` callbacks after timeout**
When the timeout fires, `Promise.race` rejects and `runWithOptionalTimeout` returns, but the `cli.runCommand(['start', '-a', setupName], opts.onLog)` promise is abandoned — it continues executing in the background. The `opts.onLog` callback will still be called by that floating promise until `removeContainer(force: true)` in the `finally` block kills the container (which causes `nerdctl start -a` to exit). During that brief window, the caller's `onLog` handler can fire after `runOneShot` has already thrown the timeout error. If the handler closes over any state that the caller tears down on the timeout error (e.g. a streaming buffer or a WebSocket), those stale calls could be problematic. Capturing the `runCommand` promise and awaiting it in the `finally` path would close the gap.
Reviews (1): Last reviewed commit: "feat(container): add isImageCurrent + ge..." | Re-trigger Greptile |
…-onLog leak; trim docstrings
- getLogs now distinguishes a missing container (returns []) from
other CLI failures (throws). Previously nerdctl's stderr ("Error:
no such container: …") leaked into the lines array as if it were
log output. isNoSuchContainer is exported from container-cli to
share the predicate.
- runWithOptionalTimeout wraps the caller's onLog so post-timeout
lines from the abandoned runCommand promise become no-ops; before
this, callers could see onLog fire after runOneShot had already
rejected, hitting state the caller may have torn down on the
timeout error.
- Tightens the new docstrings to one short line per the project
convention; drops a restating comment in the test file.
Summary
Adds four primitives to the
ManagedContainerabstract base ahead of the OpenClaw runtime migration. Without these, the upcomingOpenClawContainerRuntimewould need to re-implement them locally; with them, the migration consumes shared base behaviour and Hermes silently inherits the new surface (unused but free).isImageCurrent(): Promise<boolean>— pure predicate comparing the existing container's image ref todescriptor.defaultImage. Treats SHA-pinned variants (<ref>@sha256:…) as matches.start()is unchanged — today's "always recreate from a freshbuildContainerSpec()" semantics stays in place; subclasses and service layers compose the predicate where they want a short-circuit.getLogs(tail = 50): Promise<string[]>andtailLogs(onLine: LogFn): () => void— generic log primitives, thin pass-throughs to the underlyingContainerCli.runOneShot(argv, opts?): Promise<ExecResult>— sibling-container helper. Spawns a<container-name>-setupcontainer with the same image / mounts / add-hosts / base env (no ports / health / restart), runsargv, force-removes after. Includes the retry-on-name-collision behaviour previously bespoke to OpenClaw's gateway-setup path.Net diff: +160 LOC source, +165 LOC tests.
Why
OpenClaw's pre-
ManagedContainerContainerRuntimeclass carries five auxiliary methods that aren't OpenClaw-specific — image-ref comparison, logs, sibling-container exec — they're container-generic. Lifting them onto the base means the upcoming Phase 4a runtime migration is a smaller diff and Hermes (plus future container adapters) get a richer surface for free. The discussion that motivated this lives in the team plan files; the short version: scope the migration onto a cleaner foundation rather than dragging the auxiliary surface onto a single subclass.Test plan
bun run typecheckclean across the server packagebiome checkclean on touched sources + testsmanaged-container.test.ts:isImageCurrent: ref match, SHA-pinned match, ref differs, container missinggetLogs: collects lines fromcli.runCommand(['logs', '-n', N, name])tailLogs: returns the unsubscribe handle fromcli.tailLogsrunOneShot: creates the-setupsibling with no ports/health, runs argv, force-removes; cleans up even on inner throw; retries onContainerNameInUseErrorContainerCliflake also reproduces on plainorigin/dev)