Skip to content

feat(openclaw): migrate chat to WS chat.send + add chat abort#845

Closed
shivammittal274 wants to merge 1 commit intodevfrom
feat/openclaw-chat-via-ws
Closed

feat(openclaw): migrate chat to WS chat.send + add chat abort#845
shivammittal274 wants to merge 1 commit intodevfrom
feat/openclaw-chat-via-ws

Conversation

@shivammittal274
Copy link
Copy Markdown
Contributor

Summary

Three things shipped together because they touch the same file:

  • Fix observer handshake (production breakage). Was failing with CONTROL_UI_DEVICE_IDENTITY_REQUIRED on every reconnect. Switched client.id from 'openclaw-tui' to 'gateway-client' and mode from 'ui' to 'backend' — OpenClaw's documented "trusted backend self-connect" identity. Dropped the explicit operator.read scope so the gateway grants default operator scopes for shared-secret auth. Dashboard SSE / queue dispatch now work again.
  • Migrate chat send from HTTP to WS. Was POST /v1/chat/completions (SSE). Now WS chat.send. The reason isn't speed — it's that HTTP-initiated runs are NOT in OpenClaw's chatAbortControllers registry, so closing the SSE fetch never aborted the agent. PoC verified: agent kept generating for 70+ seconds after fetch.abort(). WS chat.send does register, which is what makes the Stop button actually stop things.
  • Add POST /claw/agents/:id/chat/stop. New abort endpoint. Calls chat.abort over WS via the OpenClaw CLI from inside the gateway container so the connection appears as direct_local (required for operator.write scope). Live-verified end-to-end: returns {aborted: true, runIds: […]} and the in-flight stream closes with lifecycle:aborted.

Notable design choices

  • Why exec the CLI inside the container instead of opening our own WS? OpenClaw clears requested operator scopes for non-direct_local connections. Calling chat.send / chat.abort from outside the container loses operator.write. Calling via nerdctl exec node dist/index.js gateway call <method> keeps the connection direct_local → full operator scope.
  • No device pairing. Per the static-UUID + bearer-auth constraint, we don't pair a device.
  • HTTP path deleted, no fallback. OpenClawHttpClient.streamChat and ~290 lines of SSE/chunk parsing are gone. isAuthenticated() and session history reads stay on HTTP. Single chat path, no feature flag.
  • Idempotency. chat.send requires idempotencyKey; we generate a per-send UUID, so retries are safe.
  • Multimodal. OpenAI-shaped messageParts translate to OpenClaw's attachments shape inline.

Richer event stream

The new stream emits more than the old text-delta / done / error: reasoning-delta, tool-start (with arguments + label + subject), tool-end (with durationMs + isError), lifecycle (start/end/aborted/error), usage (tokensIn / tokensOut / costUsd). Existing extension UI gracefully ignores unknown types — no extension changes required to ship this server-side; surfacing the new types is a follow-up UI PR.

Test plan

  • bun run typecheck passes
  • bun --env-file=apps/server/.env.development test apps/server/tests/api/services/openclaw passes (one pre-existing failure on restart moves off a persisted ready port when auth rejects the current token — verified failing on dev before this PR)
  • Live PoC against running gateway: handshake succeeds with gateway-client + backend
  • Live PoC: chat.send returns runId and broadcasts events to the observer
  • Live PoC: chat.abort actually aborts (saw aborted: true and stream close)
  • Manual: smoke-test chat from BrowserOS extension (text + image attachment)
  • Manual: click Stop mid-execution, verify agent halts within ~1s
  • Manual: queue dispatch on idle still works (it watches ClawSession which is fed by the observer)

The OpenClaw observer was failing handshakes in production with
CONTROL_UI_DEVICE_IDENTITY_REQUIRED — the 'openclaw-tui' client.id
forces control-ui clients to present a paired device identity, which
we don't have. Switch to 'gateway-client' + mode:'backend' (the
documented "trusted backend self-connect" identity) and drop the
explicit operator.read scope so the gateway grants default operator
scopes for shared-secret auth.

Move chat send from POST /v1/chat/completions to WS chat.send so that
runs are registered in OpenClaw's chatAbortControllers registry —
HTTP-initiated runs are NOT in that registry, which is why closing
the SSE fetch never aborted the agent (verified empirically: agent
kept generating for 70+ seconds after fetch.abort()).

chat.send is invoked by execing OpenClaw's CLI inside the gateway
container so the connection appears as direct_local and gets full
operator scope (operator.write is required for chat.send/chat.abort
and gets cleared for non-direct_local connections).

Adds POST /claw/agents/:id/chat/stop {sessionKey, runId?} which calls
chat.abort the same way. Live-verified: returns aborted:true and the
in-flight stream closes with lifecycle:aborted.

The new stream emits richer events than the old HTTP path: text-delta,
reasoning-delta, tool-start (with arguments + label), tool-end (with
duration + isError), lifecycle (start/end/aborted/error), usage
(tokensIn/tokensOut/costUsd), done, and error (with typed errorKind).
Existing handlers gracefully ignore unknown types.

Removes the legacy streamChat() in OpenClawHttpClient and ~290 lines
of associated SSE/chunk parsing — chat goes through WS now, no
fallback. isAuthenticated() and the session history reads stay on
HTTP.

Idempotency: chat.send requires idempotencyKey; we generate a UUID
per send, so retries are safe.

Multimodal: messageParts (OpenAI shape) translate to OpenClaw
chat.send attachments shape inline.
@github-actions
Copy link
Copy Markdown
Contributor

❌ Tests failed — 6/574 failed

Suite Passed Failed Skipped
agent-sdk 44/44 0 0
agent 23/23 0 0
build 7/7 0 0
eval 8/8 0 0
server-agent 261/261 0 0
server-api 131/133 2 0
server-browser 3/3 0 0
server-integration 0/1 1 0
server-root 57/61 1 3
server-sdk 0/1 1 0
server-skills 31/31 0 0
server-tools 0/1 1 0
Failed tests
  • server-apicreateOpenClawRoutes > preserves BrowserOS SSE framing and normalizes recursive session keys for chat
  • server-apicreateOpenClawRoutes > passes prior chat history through to the OpenClaw chat stream
  • server-integrationHTTP Server Integration Tests > (unnamed)
  • server-rootMonitoringService lazy judge integration > logs safe judge results so judge activity is visible in stdout
  • server-sdk(unnamed)
  • server-toolsworkflow > server-tools setup

View workflow run

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR ships three tightly coupled changes: fixes the observer handshake identity (gateway-client/backend) that was blocking production reconnects, migrates chat sends from HTTP SSE to WS chat.send via the OpenClaw CLI (enabling abort registration), and adds a POST /claw/agents/:id/chat/stop endpoint backed by chat.abort.

  • P1 — Stream hang on WS disconnect: if the observer WS drops while a chat is in-flight, onChat/onAgent listeners are kept alive across reconnects but no close/timeout path fires on the stream side. If the OpenClaw run completes during the disconnect window, the SSE response to the browser will hang indefinitely.
  • P2 — Bearer token in CLI args: callGatewayRpc passes --token <value> on the command line; the token is readable from /proc/<pid>/cmdline inside the container.

Confidence Score: 3/5

Functionally correct for the happy path but has a stream-leak risk on WS disconnect that can cause hanging SSE connections in production.

One P1 (stream hang on observer disconnect) pulls the score to 3. The _history silent drop and token-in-CLI-args are P2s. The happy-path logic is well-structured and the handshake fix is clearly correct.

openclaw-service.ts (stream lifecycle on disconnect) and container-runtime.ts (token arg exposure)

Security Review

  • Token exposure via CLI argument (container-runtime.ts callGatewayRpc): the gateway bearer token is passed as --token <value> in the command line. Inside the container it is visible via /proc/<pid>/cmdline and ps aux to any co-process. Prefer stdin, a named pipe, or an env-var scoped to the spawned process.
  • No new injection vectors: params to callGatewayRpc are serialised with JSON.stringify before being passed as a single shell argument, avoiding shell-injection risks.
  • Abort endpoint access control: POST /agents/:id/chat/stop follows the same auth middleware pattern as other agent routes; no standalone escalation risk identified.

Important Files Changed

Filename Overview
packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts Core migration: chatStream rewritten to WS chat.send via CLI; new abortChat method added; _history silently dropped; stream-hang risk on observer disconnect
packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts New callGatewayRpc method execs OpenClaw CLI inside container; bearer token passed as a CLI arg (visible in /proc); fallback JSON parser is well-handled
packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-observer.ts Handshake fix (gateway-client/backend identity); RPC request/response layer and event listener system added cleanly; pending requests correctly drained on disconnect
packages/browseros-agent/apps/server/src/api/routes/openclaw.ts New /agents/:id/chat/stop endpoint added; AbortSignal threaded into chatStream; error handling consistent with existing routes
packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-http-client.ts ~290 lines of SSE/chat streaming code removed; parseChunk retained and still used by streamSessionHistory; remaining code clean
packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-types.ts Added reasoning-delta and usage to OpenClawStreamEvent union; additive change, no breakage
packages/browseros-agent/apps/server/tests/api/services/openclaw/openclaw-service.test.ts Test updated to verify chat.send is called with correct sessionKey and idempotencyKey; observer and runtime mocked appropriately
packages/browseros-agent/apps/server/tests/api/services/openclaw/openclaw-http-client.test.ts Legacy streamChat tests removed; remaining isAuthenticated and getSessionHistory tests unaffected

Sequence Diagram

sequenceDiagram
    participant C as Client (Extension)
    participant R as Route /agents/:id/chat
    participant S as OpenClawService
    participant RT as ContainerRuntime
    participant OC as OpenClaw Gateway (container)
    participant OBS as OpenClawObserver (WS)

    C->>R: POST /chat (message + sessionKey)
    R->>S: chatStream(agentId, sessionKey, message, signal)
    S->>S: runControlPlaneCall (ensure observer connected)
    S->>RT: callGatewayRpc(chat.send, params, --token)
    RT->>OC: nerdctl exec node dist/index.js gateway call chat.send
    OC-->>RT: {runId}
    RT-->>S: {runId}
    S->>OBS: on('chat', onChat) + on('agent', onAgent)
    S-->>R: ReadableStream
    R-->>C: SSE stream (text/event-stream)

    loop WS broadcast events
        OC->>OBS: chat event {runId, state, message.content}
        OBS->>S: onChat(payload)
        S-->>C: text-delta / reasoning-delta / done / lifecycle / usage
        OC->>OBS: agent event {runId, stream:'tool', ...}
        OBS->>S: onAgent(payload)
        S-->>C: tool-start / tool-end / lifecycle
    end

    alt Stop requested
        C->>R: POST /agents/:id/chat/stop {sessionKey, runId}
        R->>S: abortChat(agentId, sessionKey, runId)
        S->>RT: callGatewayRpc(chat.abort, params, --token)
        RT->>OC: nerdctl exec ... gateway call chat.abort
        OC-->>RT: {aborted:true, runIds:[...]}
        OC->>OBS: chat event {state:'aborted'}
        OBS->>S: onChat → close(lifecycle:aborted)
        S-->>C: lifecycle:aborted, stream closes
    end
Loading

Comments Outside Diff (4)

  1. packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 975-993 (link)

    P1 Stream can hang indefinitely on observer WS disconnect

    When the WS observer disconnects mid-stream, failAllPendingRequests rejects any pending RPCs but the event listeners (onChat, onAgent) are deliberately kept alive across reconnects. If the OpenClaw run completes or errors during the disconnect window, the final/aborted event is never delivered, and the stream never closes — the SSE response to the browser stays open forever.

    There is no stream-level timeout or disconnect-triggered close path. The only way out is if the caller's HTTP request is cancelled (which fires abortChat, but that itself uses the observer RPC path that's also broken while disconnected), or if the WS reconnects and a new final event fires.

    Consider adding a timeout fallback on the stream side, or hooking into the observer's reconnect/disconnect lifecycle to close open streams when the disconnect outlasts a grace period.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
    Line: 975-993
    
    Comment:
    **Stream can hang indefinitely on observer WS disconnect**
    
    When the WS observer disconnects mid-stream, `failAllPendingRequests` rejects any pending RPCs but the event listeners (`onChat`, `onAgent`) are deliberately kept alive across reconnects. If the OpenClaw run completes or errors during the disconnect window, the `final`/`aborted` event is never delivered, and the stream never closes — the SSE response to the browser stays open forever.
    
    There is no stream-level timeout or disconnect-triggered close path. The only way out is if the caller's HTTP request is cancelled (which fires `abortChat`, but that itself uses the observer RPC path that's also broken while disconnected), or if the WS reconnects and a new `final` event fires.
    
    Consider adding a timeout fallback on the stream side, or hooking into the observer's reconnect/disconnect lifecycle to close open streams when the disconnect outlasts a grace period.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts, line 64-98 (link)

    P2 security Bearer token exposed in process argument list

    The gateway bearer token is passed via --token input.token as a CLI argument. Inside the container this is readable from /proc/<pid>/cmdline and visible in ps aux output. If any other process in the container has read access to /proc, it can scrape the token.

    Consider passing the token via stdin, a temporary file with restricted permissions, or an environment variable instead of a positional CLI argument to avoid this exposure.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts
    Line: 64-98
    
    Comment:
    **Bearer token exposed in process argument list**
    
    The gateway bearer token is passed via `--token input.token` as a CLI argument. Inside the container this is readable from `/proc/<pid>/cmdline` and visible in `ps aux` output. If any other process in the container has read access to `/proc`, it can scrape the token.
    
    Consider passing the token via stdin, a temporary file with restricted permissions, or an environment variable instead of a positional CLI argument to avoid this exposure.
    
    How can I resolve this? If you propose a fix, please make it concise.
  3. packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 748 (link)

    P2 _history parameter silently dropped

    The _history parameter was previously forwarded to the HTTP chat/completions endpoint. It's now prefixed with _ and ignored — sendChatViaCli has no history param. Callers (e.g., the route handler at openclaw.ts:581) still pass populated history arrays and will get no error or warning that the history is being discarded.

    If OpenClaw's session-based history via sessionKey is always sufficient, consider removing the parameter from the public signature entirely to avoid the misleading contract. If explicit history injection is still needed for any flow (bootstrap, context override), that case is now silently broken.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
    Line: 748
    
    Comment:
    **`_history` parameter silently dropped**
    
    The `_history` parameter was previously forwarded to the HTTP `chat/completions` endpoint. It's now prefixed with `_` and ignored — `sendChatViaCli` has no `history` param. Callers (e.g., the route handler at `openclaw.ts:581`) still pass populated `history` arrays and will get no error or warning that the history is being discarded.
    
    If OpenClaw's session-based history via `sessionKey` is always sufficient, consider removing the parameter from the public signature entirely to avoid the misleading contract. If explicit history injection is still needed for any flow (bootstrap, context override), that case is now silently broken.
    
    How can I resolve this? If you propose a fix, please make it concise.
  4. packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts, line 986-993 (link)

    P2 signalListener does not close the stream directly on abort

    When the request AbortSignal fires, abortChat(runId) is called asynchronously and its error is silently swallowed. The stream only closes when OpenClaw replies with state: 'aborted' over the WS. If abortChat fails (e.g., the observer is disconnected, the CLI exec fails, or the gateway is restarting), no close event will ever arrive and the stream will leak indefinitely.

    A defensive fallback — closing the stream directly after a short grace period if no aborted lifecycle event arrives — would prevent this accumulation of dangling streams.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
    Line: 986-993
    
    Comment:
    **`signalListener` does not close the stream directly on abort**
    
    When the request `AbortSignal` fires, `abortChat(runId)` is called asynchronously and its error is silently swallowed. The stream only closes when OpenClaw replies with `state: 'aborted'` over the WS. If `abortChat` fails (e.g., the observer is disconnected, the CLI exec fails, or the gateway is restarting), no close event will ever arrive and the stream will leak indefinitely.
    
    A defensive fallback — closing the stream directly after a short grace period if no `aborted` lifecycle event arrives — would prevent this accumulation of dangling streams.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
Line: 975-993

Comment:
**Stream can hang indefinitely on observer WS disconnect**

When the WS observer disconnects mid-stream, `failAllPendingRequests` rejects any pending RPCs but the event listeners (`onChat`, `onAgent`) are deliberately kept alive across reconnects. If the OpenClaw run completes or errors during the disconnect window, the `final`/`aborted` event is never delivered, and the stream never closes — the SSE response to the browser stays open forever.

There is no stream-level timeout or disconnect-triggered close path. The only way out is if the caller's HTTP request is cancelled (which fires `abortChat`, but that itself uses the observer RPC path that's also broken while disconnected), or if the WS reconnects and a new `final` event fires.

Consider adding a timeout fallback on the stream side, or hooking into the observer's reconnect/disconnect lifecycle to close open streams when the disconnect outlasts a grace period.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/container-runtime.ts
Line: 64-98

Comment:
**Bearer token exposed in process argument list**

The gateway bearer token is passed via `--token input.token` as a CLI argument. Inside the container this is readable from `/proc/<pid>/cmdline` and visible in `ps aux` output. If any other process in the container has read access to `/proc`, it can scrape the token.

Consider passing the token via stdin, a temporary file with restricted permissions, or an environment variable instead of a positional CLI argument to avoid this exposure.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
Line: 748

Comment:
**`_history` parameter silently dropped**

The `_history` parameter was previously forwarded to the HTTP `chat/completions` endpoint. It's now prefixed with `_` and ignored — `sendChatViaCli` has no `history` param. Callers (e.g., the route handler at `openclaw.ts:581`) still pass populated `history` arrays and will get no error or warning that the history is being discarded.

If OpenClaw's session-based history via `sessionKey` is always sufficient, consider removing the parameter from the public signature entirely to avoid the misleading contract. If explicit history injection is still needed for any flow (bootstrap, context override), that case is now silently broken.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
Line: 986-993

Comment:
**`signalListener` does not close the stream directly on abort**

When the request `AbortSignal` fires, `abortChat(runId)` is called asynchronously and its error is silently swallowed. The stream only closes when OpenClaw replies with `state: 'aborted'` over the WS. If `abortChat` fails (e.g., the observer is disconnected, the CLI exec fails, or the gateway is restarting), no close event will ever arrive and the stream will leak indefinitely.

A defensive fallback — closing the stream directly after a short grace period if no `aborted` lifecycle event arrives — would prevent this accumulation of dangling streams.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(openclaw): migrate chat to WS chat...." | Re-trigger Greptile

@shivammittal274
Copy link
Copy Markdown
Contributor Author

Pivoting: Nikhil's browseros-openclaw fork adds gateway.auth.mode=none + private-ingress no-auth (browseros-ai/openclaw#1). With that we can run chat.send/abort directly over the observer's WS — no CLI exec, no token. Reworking on a fresh branch off dev. Closing in favor of the simpler approach. Branch preserved for reference / cherry-picking the observer fix + dedupe + stop button if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant