feat(openclaw): migrate chat to WS chat.send + add chat abort #845
shivammittal274 wants to merge 1 commit into dev
The OpenClaw observer was failing handshakes in production with
CONTROL_UI_DEVICE_IDENTITY_REQUIRED — the 'openclaw-tui' client.id
forces control-ui clients to present a paired device identity, which
we don't have. Switch to 'gateway-client' + mode:'backend' (the
documented "trusted backend self-connect" identity) and drop the
explicit operator.read scope so the gateway grants default operator
scopes for shared-secret auth.
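A minimal sketch of the identity change, for reference. Only the `'gateway-client'` and `'backend'` values come from the description above; the shape of the connect payload (`ConnectParams`, the `auth` field, where `mode` lives) is assumed for illustration and may not match the observer's real handshake frame.

```typescript
// Hypothetical connect payload; the field layout is an assumption.
interface ConnectParams {
  client: { id: string; mode: string };
  auth: { token: string };
  scopes?: string[];
}

function buildObserverConnectParams(token: string): ConnectParams {
  return {
    // Was 'openclaw-tui' / 'ui', which required a paired device identity.
    client: { id: 'gateway-client', mode: 'backend' },
    auth: { token },
    // `scopes` is left out on purpose so the gateway can grant its
    // default operator scopes for shared-secret auth.
  };
}

const connectParams = buildObserverConnectParams('example-shared-secret');
```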
Move chat send from POST /v1/chat/completions to WS chat.send so that
runs are registered in OpenClaw's chatAbortControllers registry —
HTTP-initiated runs are NOT in that registry, which is why closing
the SSE fetch never aborted the agent (verified empirically: agent
kept generating for 70+ seconds after fetch.abort()).
chat.send is invoked by execing OpenClaw's CLI inside the gateway
container so the connection appears as direct_local and gets full
operator scope (operator.write is required for chat.send/chat.abort
and gets cleared for non-direct_local connections).
Adds POST /claw/agents/:id/chat/stop {sessionKey, runId?} which calls
chat.abort the same way. Live-verified: returns aborted:true and the
in-flight stream closes with lifecycle:aborted.
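The request body for the stop endpoint could be validated roughly as below. This is a sketch: `parseStopChatBody` and its error messages are hypothetical, and the real route may validate differently. What the gateway does when `runId` is omitted is not stated here, so the sketch only forwards what it was given.

```typescript
// Body of POST /claw/agents/:id/chat/stop as described above.
interface StopChatBody {
  sessionKey: string;
  runId?: string;
}

// Hypothetical validator; not taken from the PR.
function parseStopChatBody(input: unknown): StopChatBody {
  if (typeof input !== 'object' || input === null) {
    throw new Error('body must be a JSON object');
  }
  const { sessionKey, runId } = input as Record<string, unknown>;
  if (typeof sessionKey !== 'string' || sessionKey.length === 0) {
    throw new Error('sessionKey is required');
  }
  if (runId !== undefined && typeof runId !== 'string') {
    throw new Error('runId must be a string when present');
  }
  // Keep runId out of the result entirely when the caller did not send it.
  return runId === undefined ? { sessionKey } : { sessionKey, runId };
}
```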
The new stream emits richer events than the old HTTP path: text-delta,
reasoning-delta, tool-start (with arguments + label), tool-end (with
duration + isError), lifecycle (start/end/aborted/error), usage
(tokensIn/tokensOut/costUsd), done, and error (with typed errorKind).
Existing handlers gracefully ignore unknown types.
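To illustrate "gracefully ignore unknown types", here is a small consumer sketch. The event type names are the ones listed above; the payload field `text` and the `ChatViewState` shape are assumptions made for the example, not the actual `OpenClawStreamEvent` definition.

```typescript
// Hypothetical UI-side state; not part of the PR.
interface ChatViewState {
  text: string;
  reasoning: string;
  finished: boolean;
}

// Deliberately loose: the consumer must survive event types it has never seen.
type IncomingEvent = { type: string; [key: string]: unknown };

const initialState: ChatViewState = { text: '', reasoning: '', finished: false };

function applyEvent(state: ChatViewState, event: IncomingEvent): ChatViewState {
  switch (event.type) {
    case 'text-delta':
      return { ...state, text: state.text + String(event.text ?? '') };
    case 'reasoning-delta':
      return { ...state, reasoning: state.reasoning + String(event.text ?? '') };
    case 'done':
    case 'error':
      return { ...state, finished: true };
    default:
      // tool-start, tool-end, lifecycle, usage, or anything added later:
      // return the state unchanged instead of throwing.
      return state;
  }
}
```

A handler written this way keeps working when the server starts emitting `usage` or `lifecycle` before the UI knows how to render them.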
Removes the legacy streamChat() in OpenClawHttpClient and ~290 lines
of associated SSE/chunk parsing — chat goes through WS now, no
fallback. isAuthenticated() and the session history reads stay on
HTTP.
Idempotency: chat.send requires idempotencyKey; we generate a UUID
per send, so retries are safe.
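A sketch of how a per-send key makes retries safe: the key is generated once per logical send and reused on every attempt. This assumes the gateway deduplicates on `idempotencyKey`, which the requirement implies but this text does not spell out. The `message` field name, `newChatSend`, and `sendWithRetry` are hypothetical.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical subset of the chat.send params.
interface ChatSendParams {
  sessionKey: string;
  message: string;
  idempotencyKey: string;
}

// One fresh key per logical send.
function newChatSend(sessionKey: string, message: string): ChatSendParams {
  return { sessionKey, message, idempotencyKey: randomUUID() };
}

// Retries reuse the same params object, so every attempt carries the same key.
async function sendWithRetry<T>(
  params: ChatSendParams,
  send: (p: ChatSendParams) => Promise<T>,
  attempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await send(params);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Generating the key inside the retry loop instead would defeat the purpose, since each attempt would look like a new send.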
Multimodal: messageParts (OpenAI shape) translate to OpenClaw
chat.send attachments shape inline.
❌ Tests failed — 6/574 failed
Greptile Summary
This PR ships three tightly coupled changes: fixes the observer handshake identity (
Confidence Score: 3/5
Functionally correct for the happy path but has a stream-leak risk on WS disconnect that can cause hanging SSE connections in production.
One P1 (stream hang on observer disconnect) pulls the score to 3. The `_history` silent drop and token-in-CLI-args are P2s. The happy-path logic is well-structured and the handshake fix is clearly correct.
Files needing attention: openclaw-service.ts (stream lifecycle on disconnect) and container-runtime.ts (token arg exposure)
| Filename | Overview |
|---|---|
| packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts | Core migration: chatStream rewritten to WS chat.send via CLI; new abortChat method added; _history silently dropped; stream-hang risk on observer disconnect |
| packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts | New callGatewayRpc method execs OpenClaw CLI inside container; bearer token passed as a CLI arg (visible in /proc); fallback JSON parser is well-handled |
| packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-observer.ts | Handshake fix (gateway-client/backend identity); RPC request/response layer and event listener system added cleanly; pending requests correctly drained on disconnect |
| packages/browseros-agent/apps/server/src/api/routes/openclaw.ts | New /agents/:id/chat/stop endpoint added; AbortSignal threaded into chatStream; error handling consistent with existing routes |
| packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-http-client.ts | ~290 lines of SSE/chat streaming code removed; parseChunk retained and still used by streamSessionHistory; remaining code clean |
| packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-types.ts | Added reasoning-delta and usage to OpenClawStreamEvent union; additive change, no breakage |
| packages/browseros-agent/apps/server/tests/api/services/openclaw/openclaw-service.test.ts | Test updated to verify chat.send is called with correct sessionKey and idempotencyKey; observer and runtime mocked appropriately |
| packages/browseros-agent/apps/server/tests/api/services/openclaw/openclaw-http-client.test.ts | Legacy streamChat tests removed; remaining isAuthenticated and getSessionHistory tests unaffected |
Sequence Diagram
sequenceDiagram
participant C as Client (Extension)
participant R as Route /agents/:id/chat
participant S as OpenClawService
participant RT as ContainerRuntime
participant OC as OpenClaw Gateway (container)
participant OBS as OpenClawObserver (WS)
C->>R: POST /chat (message + sessionKey)
R->>S: chatStream(agentId, sessionKey, message, signal)
S->>S: runControlPlaneCall (ensure observer connected)
S->>RT: callGatewayRpc(chat.send, params, --token)
RT->>OC: nerdctl exec node dist/index.js gateway call chat.send
OC-->>RT: {runId}
RT-->>S: {runId}
S->>OBS: on('chat', onChat) + on('agent', onAgent)
S-->>R: ReadableStream
R-->>C: SSE stream (text/event-stream)
loop WS broadcast events
OC->>OBS: chat event {runId, state, message.content}
OBS->>S: onChat(payload)
S-->>C: text-delta / reasoning-delta / done / lifecycle / usage
OC->>OBS: agent event {runId, stream:'tool', ...}
OBS->>S: onAgent(payload)
S-->>C: tool-start / tool-end / lifecycle
end
alt Stop requested
C->>R: POST /agents/:id/chat/stop {sessionKey, runId}
R->>S: abortChat(agentId, sessionKey, runId)
S->>RT: callGatewayRpc(chat.abort, params, --token)
RT->>OC: nerdctl exec ... gateway call chat.abort
OC-->>RT: {aborted:true, runIds:[...]}
OC->>OBS: chat event {state:'aborted'}
OBS->>S: onChat → close(lifecycle:aborted)
S-->>C: lifecycle:aborted, stream closes
end
Comments Outside Diff (4)
- packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 975-993: **Stream can hang indefinitely on observer WS disconnect**

  When the WS observer disconnects mid-stream, `failAllPendingRequests` rejects any pending RPCs but the event listeners (`onChat`, `onAgent`) are deliberately kept alive across reconnects. If the OpenClaw run completes or errors during the disconnect window, the `final`/`aborted` event is never delivered, and the stream never closes: the SSE response to the browser stays open forever.

  There is no stream-level timeout or disconnect-triggered close path. The only way out is if the caller's HTTP request is cancelled (which fires `abortChat`, but that itself uses the observer RPC path that's also broken while disconnected), or if the WS reconnects and a new `final` event fires.

  Consider adding a timeout fallback on the stream side, or hooking into the observer's reconnect/disconnect lifecycle to close open streams when the disconnect outlasts a grace period.
- packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts, line 64-98: **Bearer token exposed in process argument list**

  The gateway bearer token is passed via `--token input.token` as a CLI argument. Inside the container this is readable from `/proc/<pid>/cmdline` and visible in `ps aux` output. If any other process in the container has read access to `/proc`, it can scrape the token.

  Consider passing the token via stdin, a temporary file with restricted permissions, or an environment variable instead of a CLI argument to avoid this exposure.
- packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 748: **`_history` parameter silently dropped**

  The `_history` parameter was previously forwarded to the HTTP `chat/completions` endpoint. It's now prefixed with `_` and ignored; `sendChatViaCli` has no `history` param. Callers (e.g., the route handler at `openclaw.ts:581`) still pass populated `history` arrays and will get no error or warning that the history is being discarded.

  If OpenClaw's session-based history via `sessionKey` is always sufficient, consider removing the parameter from the public signature entirely to avoid the misleading contract. If explicit history injection is still needed for any flow (bootstrap, context override), that case is now silently broken.
- packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 986-993: **`signalListener` does not close the stream directly on abort**

  When the request `AbortSignal` fires, `abortChat(runId)` is called asynchronously and its error is silently swallowed. The stream only closes when OpenClaw replies with `state: 'aborted'` over the WS. If `abortChat` fails (e.g., the observer is disconnected, the CLI exec fails, or the gateway is restarting), no close event will ever arrive and the stream will leak indefinitely.

  A defensive fallback (closing the stream directly after a short grace period if no `aborted` lifecycle event arrives) would prevent this accumulation of dangling streams.
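The disconnect comment and the abort comment both ask for the same remedy, a grace-period fallback that closes the stream when no terminal event shows up. A minimal sketch of that idea follows. `GraceTimer`, `closeStream`, and the 15 s / 5 s values are hypothetical and chosen only for illustration; none of them come from the PR.

```typescript
// Hypothetical helper: runs `onExpire` once unless disarmed first.
class GraceTimer {
  private handle: ReturnType<typeof setTimeout> | undefined;
  private readonly graceMs: number;
  private readonly onExpire: () => void;

  constructor(graceMs: number, onExpire: () => void) {
    this.graceMs = graceMs;
    this.onExpire = onExpire;
  }

  /** Start the countdown; calling again while armed is a no-op. */
  arm(): void {
    if (this.handle !== undefined) return;
    this.handle = setTimeout(() => {
      this.handle = undefined;
      this.onExpire();
    }, this.graceMs);
  }

  /** Cancel the countdown, e.g. on reconnect or when a terminal event arrives. */
  disarm(): void {
    if (this.handle === undefined) return;
    clearTimeout(this.handle);
    this.handle = undefined;
  }

  get armed(): boolean {
    return this.handle !== undefined;
  }
}

// `closeStream` stands in for whatever closes the SSE stream; first close wins.
let closedWith: string | undefined;
const closeStream = (reason: string): void => {
  if (closedWith === undefined) closedWith = reason;
};

const disconnectGrace = new GraceTimer(15_000, () => closeStream('error'));
const abortGrace = new GraceTimer(5_000, () => closeStream('aborted'));

// Intended wiring (not executed here):
//   observer disconnects      -> disconnectGrace.arm()
//   observer reconnects       -> disconnectGrace.disarm()
//   request AbortSignal fires -> abortGrace.arm(), then call abortChat
//   terminal chat event       -> disarm both, close normally
```

A reconnect only cancels the disconnect timer. It does not recover a `final` event that was broadcast while the socket was down, so a stream-level idle timeout may still be needed on top of this.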
Pivoting: Nikhil's browseros-openclaw fork adds gateway.auth.mode=none + private-ingress no-auth (browseros-ai/openclaw#1). With that we can run chat.send/abort directly over the observer's WS: no CLI exec, no token. Reworking on a fresh branch off dev. Closing in favor of the simpler approach. Branch preserved for reference / cherry-picking the observer fix + dedupe + stop button if useful.
Summary
Three things shipped together because they touch the same file:
- The observer handshake was failing with `CONTROL_UI_DEVICE_IDENTITY_REQUIRED` on every reconnect. Switched `client.id` from `'openclaw-tui'` to `'gateway-client'` and `mode` from `'ui'` to `'backend'`, OpenClaw's documented "trusted backend self-connect" identity. Dropped the explicit `operator.read` scope so the gateway grants default operator scopes for shared-secret auth. Dashboard SSE / queue dispatch now work again.
- Chat send was `POST /v1/chat/completions` (SSE). Now WS `chat.send`. The reason isn't speed: HTTP-initiated runs are NOT in OpenClaw's `chatAbortControllers` registry, so closing the SSE fetch never aborted the agent. PoC verified: agent kept generating for 70+ seconds after `fetch.abort()`. WS `chat.send` does register, which is what makes the Stop button actually stop things.
- `POST /claw/agents/:id/chat/stop`. New abort endpoint. Calls `chat.abort` over WS via the OpenClaw CLI from inside the gateway container so the connection appears as `direct_local` (required for `operator.write` scope). Live-verified end-to-end: returns `{aborted: true, runIds: […]}` and the in-flight stream closes with `lifecycle:aborted`.

Notable design choices
- `operator.write` is cleared for non-`direct_local` connections. Calling `chat.send`/`chat.abort` from outside the container loses `operator.write`. Calling via `nerdctl exec node dist/index.js gateway call <method>` keeps the connection `direct_local` → full operator scope.
- No fallback: `OpenClawHttpClient.streamChat` and ~290 lines of SSE/chunk parsing are gone. `isAuthenticated()` and session history reads stay on HTTP. Single chat path, no feature flag.
- Idempotency: `chat.send` requires `idempotencyKey`; we generate a per-send UUID, so retries are safe.
- Multimodal: `messageParts` translate to OpenClaw's `attachments` shape inline.

Richer event stream
The new stream emits more than the old `text-delta`/`done`/`error`: `reasoning-delta`, `tool-start` (with arguments + label + subject), `tool-end` (with `durationMs` + `isError`), `lifecycle` (start/end/aborted/error), `usage` (tokensIn / tokensOut / costUsd). Existing extension UI gracefully ignores unknown types, so no extension changes are required to ship this server-side; surfacing the new types is a follow-up UI PR.

Test plan
- `bun run typecheck` passes
- `bun --env-file=apps/server/.env.development test apps/server/tests/api/services/openclaw` passes (one pre-existing failure on `restart moves off a persisted ready port when auth rejects the current token`, verified failing on `dev` before this PR)
- […] `gateway-client` + `backend`
- […] `chat.send` returns `runId` and broadcasts events to the observer
- […] `chat.abort` actually aborts (saw `aborted: true` and stream close)
- […] `ClawSession` which is fed by the observer)