Open
19 commits
2fee1e5
Support full-access profile for websocket transport
Yehonal Apr 7, 2026
d597092
feat: add multi-endpoint CAS support with per-conversation endpoint s…
huntharo Apr 12, 2026
e526fb8
feat: route CAS worker tools from exec node context
Yehonal Apr 22, 2026
5819e31
feat: add manual and automatic CAS endpoint policy
Yehonal Apr 23, 2026
2870af7
feat: derive node websocket endpoint fallback for CAS workers
Yehonal Apr 23, 2026
73db523
feat: always include resolved endpoint in cas_resume replies
Yehonal Apr 23, 2026
18180ec
Plugin: transcribe inbound audio before Codex turns
Yehonal Apr 17, 2026
0bf4d37
Plugin: recover missing local Discord CAS bindings
Yehonal Apr 20, 2026
750d4d4
docs: document cas_reset recovery command
Yehonal Apr 20, 2026
7712fbe
fix(discord): isolate CAS thread bindings by thread scope
Yehonal Apr 21, 2026
4620e38
fix: apply node-derived endpoint fallback in cas_resume
Yehonal Apr 30, 2026
d0cc346
fix: preserve recovered endpoint selection and refresh controller tests
Yehonal Apr 30, 2026
722cb00
fix: use node-aware endpoint resolution across CAS controls
Yehonal May 1, 2026
22fd11a
fix: probe paired node ip for derived CAS endpoints
Yehonal May 1, 2026
f324dce
fix: honor auto exec host for derived CAS endpoints
Yehonal May 1, 2026
0e9ed17
fix: resolve CAS endpoint from agent exec context
Yehonal May 7, 2026
aa9c089
feat: allow default reasoning effort
Yehonal May 7, 2026
0ae865a
fix: support endpoint workspace defaults
Yehonal May 9, 2026
ad5670e
Add per-agent default endpoint resolution
Yehonal May 11, 2026
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,15 @@
# Changelog

## Unreleased

### Highlights

- Added an optional inbound audio transcription preprocessor so bound conversations can convert staged voice/audio attachments into normal text turn input before forwarding the turn into Codex. The plugin stays transport-agnostic by delegating transcription to a configurable local command that prints transcript text to stdout.

### Docs

- Documented the new `inboundAudioTranscription` plugin config and clarified the media bridge notes around staged inbound audio handling.

## v0.6.0 - 2026-04-03

### Highlights
69 changes: 69 additions & 0 deletions README.md
@@ -115,6 +115,38 @@ Pre-release packages are published on matching npm dist-tags instead of `latest`
5. Use `/cas_status` to inspect or adjust the binding in place, including model, reasoning, fast mode, permissions, compact, and stop controls.
6. If you leave plan mode through the normal `Implement this plan` button, you do not need `/cas_plan off`; use `/cas_plan off` only when you want to exit planning manually instead.

## Autonomous Worker Tools (experimental)

This plugin can also expose **agent-callable tools** so OpenClaw can talk to Codex workers **without manual `/cas_*` control**.

Use this mode when you want OpenClaw to act as an orchestrator over multiple Codex app-server endpoints, for example:

- `context-worker` as a browser or context worker
- `implementation-worker` as a development worker

Current tool surface:

- `codex_workers_describe_endpoints`
- `codex_workers_list_threads`
- `codex_workers_run_task`
- `codex_workers_read_thread_context`

Notes:

- These tools talk **directly to Codex app-server endpoints**. They do **not** use an MCP proxy layer.
- `codex_workers_run_task` can create a named thread, continue an existing `threadId`, or reuse a named thread when `reuseThreadByName=true`.
- If Codex requests interactive approval/input during an autonomous run, the tool records the pending input and interrupts the run instead of hanging forever.
- For fully autonomous write actions, you will usually want a worker endpoint that exposes the `full-access` profile.

Suggested pattern:

1. `codex_workers_describe_endpoints`
2. `codex_workers_run_task(endpointId="context-worker", ...)`
3. `codex_workers_run_task(endpointId="implementation-worker", threadName="job/...", ...)`
4. `codex_workers_read_thread_context(...)` when you need replay/state
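
The suggested pattern can be sketched as an orchestration loop. This is a minimal sketch, not the plugin's implementation: `ToolCall` is a hypothetical stand-in for however OpenClaw invokes agent tools, the thread name `job/example` is made up for illustration, and the `endpointId`/`threadId` parameters passed to `codex_workers_read_thread_context` are assumptions.

```typescript
// Hypothetical tool-call shape; the real OpenClaw invocation API may differ.
type ToolCall = (
  name: string,
  params: Record<string, unknown>,
) => Promise<Record<string, unknown>>;

async function runPipeline(
  callTool: ToolCall,
  goal: string,
): Promise<Record<string, unknown>> {
  // 1. Discover the configured endpoints and their profiles.
  await callTool("codex_workers_describe_endpoints", {});

  // 2. Gather context on the context worker.
  const context = await callTool("codex_workers_run_task", {
    endpointId: "context-worker",
    prompt: `Collect the context needed for: ${goal}`,
  });

  // 3. Hand the result to the implementation worker on a stable named thread.
  const result = await callTool("codex_workers_run_task", {
    endpointId: "implementation-worker",
    threadName: "job/example", // hypothetical thread name
    reuseThreadByName: true,
    prompt: `Implement: ${goal}\n\nContext:\n${JSON.stringify(context)}`,
  });

  // 4. Read replay/state only when the run stopped on pending input.
  if (result.pendingInput !== undefined) {
    return callTool("codex_workers_read_thread_context", {
      endpointId: "implementation-worker", // assumed parameter
      threadId: result.threadId, // assumed parameter
    });
  }
  return result;
}
```

Injecting `callTool` keeps the sketch independent of any particular runtime and makes the call sequence easy to exercise with a fake.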

The manual `/cas_*` commands remain useful as the human-facing fallback and debugging surface.

## Command Reference

| Command | What it does | Notes / examples |
@@ -135,6 +167,7 @@ Pre-release packages are published on matching npm dist-tags instead of `latest`
| `/cas_status --fast`, `/cas_status --no-fast` | Change fast mode and refresh the status card. | Fast mode is only available on supported models such as GPT-5.4+. |
| `/cas_status --yolo`, `/cas_status --no-yolo` | Change permissions mode and refresh the status card. | `--yolo` selects Full Access. |
| `/cas_detach` | Unbind this conversation from Codex. | Stops routing plain text from this conversation into the bound thread. |
| `/cas_reset` | Force-clear Codex state for this conversation. | Recovery command for stale binds; clears the binding plus pending bind/request/callback state, then tells you to run `/cas_resume`. |
| `/cas_stop` | Interrupt the active Codex run. | Only applies when a turn is currently in progress. |
| `/cas_steer <message>` | Send follow-up steer text to an active run. | Example: `/cas_steer focus on the failing tests first` |
| `/cas_plan <goal>` | Ask Codex to plan instead of execute. | The plugin relays plan questions and the final plan back into chat. |
@@ -208,10 +241,46 @@ The plugin schema in [`openclaw.plugin.json`](./openclaw.plugin.json) supports:

- `transport`: `stdio` or `websocket`
- `command` and `args`: the Codex executable and CLI args for `stdio`
- `execNodes`: optional list of `tools.exec.node` aliases that should auto-select a specific endpoint when agent tools run with `tools.exec.host=node`
- `url`, `authToken`, `headers`: connection settings for `websocket`
- `defaultWorkspaceDir`: fallback workspace for unbound actions
- `agentEndpoints`: optional map of OpenClaw agent id to default endpoint id, used after manual `/cas_endpoint` overrides and exec node-derived endpoint selection but before `defaultEndpoint`
- `endpoints[].defaultWorkspaceDir`: endpoint-specific fallback workspace; useful when a remote app-server cannot access the controller host path
- `defaultModel`: model used when a new thread starts without an explicit selection
- `defaultServiceTier`: default service tier for new turns
- `inboundAudioTranscription`: optional preprocessor for inbound audio/voice attachments before they are forwarded into Codex

### Optional inbound audio transcription

If your chat surface provides inbound audio files as local paths or media metadata, this plugin can transcribe them before forwarding the turn to Codex. This keeps the plugin transport-agnostic: Codex still receives normal text input, while transcription is delegated to any local command you choose.

Example config using an existing local script:

```json
{
  "inboundAudioTranscription": {
    "enabled": true,
    "command": "/root/.openclaw/workspace/scripts/local-stt-transcribe.sh",
    "args": ["{path}"],
    "timeoutMs": 20000
  }
}
```

Behavior:

- audio-only inbound messages become transcript text
- caption + audio keeps the caption and adds a labeled transcript block
- the command should print the transcript to stdout
- if stdout is JSON, `.text` or `.transcript` is used automatically

Argument placeholders supported in `args`:

- `{path}`
- `{mimeType}`
- `{fileName}`

If `{path}` is omitted from `args`, the plugin appends the media path automatically.
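
The placeholder rules above can be sketched as a small expansion helper. This is an illustrative sketch only: `expandTranscriptionArgs` and `MediaInfo` are hypothetical names, and expanding a missing `mimeType` or `fileName` to an empty string is an assumption rather than documented behavior.

```typescript
// Hypothetical description of a staged inbound media file.
interface MediaInfo {
  path: string;
  mimeType?: string;
  fileName?: string;
}

function expandTranscriptionArgs(args: string[], media: MediaInfo): string[] {
  const values: Record<string, string> = {
    "{path}": media.path,
    "{mimeType}": media.mimeType ?? "", // assumption: missing values become ""
    "{fileName}": media.fileName ?? "",
  };
  // Remember whether the caller placed the path explicitly.
  const hasPath = args.some((arg) => arg.includes("{path}"));
  const expanded = args.map((arg) =>
    arg.replace(
      /\{path\}|\{mimeType\}|\{fileName\}/g,
      (token) => values[token] ?? token,
    ),
  );
  // Without an explicit {path}, the media path is appended as the last argument.
  return hasPath ? expanded : [...expanded, media.path];
}
```

With `args: ["--lang", "en"]` and a staged file at `/tmp/a.ogg`, the helper would produce `["--lang", "en", "/tmp/a.ogg"]`.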

## Developer Workflow With A Local OpenClaw Checkout

144 changes: 144 additions & 0 deletions docs/autonomous-worker-tools.md
@@ -0,0 +1,144 @@
# Autonomous Worker Tools

This document describes the **agent-callable** tool layer added on top of `openclaw-codex-app-server`.

## Goal

Allow OpenClaw to orchestrate one or more Codex workers **directly via Codex app-server**, without requiring a human to drive `/cas_resume`, `/cas_status`, or `/cas_endpoint` manually.

This is intended for flows like:

- `context-worker` -> browser or authenticated context worker
- `implementation-worker` -> repo implementation worker
- OpenClaw -> planner / router / memory / reporting layer

## Why direct app-server instead of MCP here?

Because the worker relationship is conversational and stateful. It relies on:

- persistent threads
- turn execution
- resume / continue
- interrupt
- thread state and replay
- native Codex approvals / pending input semantics

MCP is still useful **inside** Codex for tools, but for **OpenClaw -> Codex worker control**, app-server is the primary transport.

## Exposed tools

### `codex_workers_describe_endpoints`

Returns:

- default endpoint
- per-agent default endpoint map
- default workspace/model
- configured endpoints
- whether each endpoint supports `full-access`

Worker tools resolve endpoints in this order: explicit `endpointId`, exec-context/node-derived endpoint, `agentEndpoints[agentId]`, then `defaultEndpoint`.
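
The resolution order can be sketched as a chain of fallbacks. This is a minimal sketch under stated assumptions: `resolveWorkerEndpoint` and the field names of `EndpointResolutionInput` are hypothetical, and only the ordering is taken from the text above.

```typescript
// Hypothetical input shape; only the precedence order comes from the docs.
interface EndpointResolutionInput {
  explicitEndpointId?: string; // `endpointId` passed to the tool
  nodeDerivedEndpointId?: string; // derived from the exec context / node
  agentId?: string;
  agentEndpoints?: Record<string, string>;
  defaultEndpoint: string;
}

function resolveWorkerEndpoint(input: EndpointResolutionInput): string {
  const agentDefault =
    input.agentId !== undefined
      ? input.agentEndpoints?.[input.agentId]
      : undefined;
  // First defined value wins, in the documented order.
  return (
    input.explicitEndpointId ??
    input.nodeDerivedEndpointId ??
    agentDefault ??
    input.defaultEndpoint
  );
}
```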

### `codex_workers_list_threads`

Lists threads on an endpoint.

Useful before reusing a thread or when trying to resolve a stable worker thread by name.

Key params:

- `endpointId`
- `workspaceDir`
- `includeAllWorkspaces`
- `filter`
- `permissionsMode`

### `codex_workers_run_task`

Runs a prompt on a Codex worker.

Supports:

- starting a fresh turn
- continuing an existing `threadId`
- creating a named thread with `threadName`
- reusing a named thread with `reuseThreadByName=true`
- optional model / reasoning / service tier overrides
- optional collaboration payload
- optional multimodal `input`

Key params:

- `endpointId`
- `prompt`
- `workspaceDir`
- `threadId`
- `threadName`
- `reuseThreadByName`
- `permissionsMode`
- `model`
- `reasoningEffort`
- `serviceTier`
- `collaborationMode`
- `input`

Return shape includes:

- resolved endpoint/workspace/profile
- resulting `threadId`
- whether a thread was created or reused
- any captured `pendingInput`
- the Codex turn result
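
The parameter list and thread modes can be sketched as types. This is a sketch, not the plugin's real schema: field names follow the "Key params" list, but every type is an assumption, and the precedence in `threadModeFor` (an explicit `threadId` wins over name-based lookup) is also assumed.

```typescript
// Field names follow the "Key params" list; the types are assumptions.
interface RunTaskParams {
  endpointId?: string;
  prompt: string;
  workspaceDir?: string;
  threadId?: string;
  threadName?: string;
  reuseThreadByName?: boolean;
  permissionsMode?: string;
  model?: string;
  reasoningEffort?: string;
  serviceTier?: string;
  collaborationMode?: unknown;
  input?: unknown[];
}

type ThreadMode = "continue-existing" | "reuse-named" | "create-named" | "fresh";

// Assumed precedence: threadId first, then name-based reuse, then creation.
function threadModeFor(params: RunTaskParams): ThreadMode {
  if (params.threadId) return "continue-existing";
  if (params.threadName && params.reuseThreadByName === true) return "reuse-named";
  if (params.threadName) return "create-named";
  return "fresh";
}
```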

### `codex_workers_read_thread_context`

Reads:

- thread state
- thread replay/context summary

Useful when OpenClaw wants to inspect a worker thread before resuming it.

## Pending input behavior

Autonomous tool calls cannot complete an interactive approval loop by themselves.

So the current behavior is:

1. detect pending approval/input
2. capture a compact `pendingInput` summary
3. interrupt the run
4. return control to OpenClaw

This avoids deadlocks.
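
The four steps can be sketched as an event loop that bails out on the first pending-input event. This is a sketch under stated assumptions: `TurnEvent`, `TurnHandle`, and `AutonomousResult` are hypothetical shapes, not the plugin's real types.

```typescript
// Hypothetical event and handle shapes; the plugin's real types may differ.
type TurnEvent =
  | { kind: "output"; text: string }
  | { kind: "pendingInput"; summary: string }
  | { kind: "done" };

interface TurnHandle {
  events: AsyncIterable<TurnEvent>;
  interrupt(): Promise<void>;
}

interface AutonomousResult {
  output: string;
  pendingInput?: string;
  interrupted: boolean;
}

async function runAutonomously(turn: TurnHandle): Promise<AutonomousResult> {
  let output = "";
  for await (const event of turn.events) {
    if (event.kind === "pendingInput") {
      // Steps 1-3: detect the request, keep its summary, interrupt the run.
      await turn.interrupt();
      // Step 4: hand control back to the caller instead of waiting.
      return { output, pendingInput: event.summary, interrupted: true };
    }
    if (event.kind === "output") output += event.text;
    if (event.kind === "done") break;
  }
  return { output, interrupted: false };
}
```

Returning the captured summary lets the orchestrator decide whether to escalate to a human or retry on an endpoint with a different permissions profile.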

## Recommended orchestration pattern

### Phase 1 — direct autonomous orchestration

Use these tools directly from OpenClaw:

1. gather context on `context-worker`
2. pass the structured result to `implementation-worker`
3. continue the same named thread when useful
4. inspect thread context if a run needs to be resumed later

### Phase 2 — add ClawFlow above it

ClawFlow is the natural next layer when you want:

- persistent multi-step jobs
- waiting/resume states
- small persisted outputs
- one owner session around multiple worker turns

So the intended stack is:

- **Codex app-server plugin tools first**
- **ClawFlow second**

## Safety / ops notes

- Prefer loopback or authenticated websocket endpoints.
- For autonomous write actions, use a dedicated endpoint/profile intentionally configured for that purpose.
- Keep the `/cas_*` commands as the human fallback/debug surface even after autonomous tools are enabled.
40 changes: 38 additions & 2 deletions docs/specs/MEDIA.md
@@ -5,7 +5,8 @@ This document captures the current state of media handling relevant to this plug
- how Codex app-server accepts image input
- what this plugin currently sends
- what OpenClaw currently exposes to plugins
- the remaining gap for richer inbound media
- the staged-audio transcription bridge this plugin now supports
- a recommended bridge design for future implementation

This is a spec/notes document only. It does not imply that inbound media support has already been implemented here.
@@ -15,9 +16,11 @@ This is a spec/notes document only. It does not imply that inbound media support
- Codex app-server already supports multimodal turn input via `UserInput`.
- The supported image-shaped input items are remote/data URL images and local filesystem images.
- This plugin now supports mixed text + image turn input and forwards inbound image media into Codex when OpenClaw provides a staged media path or URL.
- This plugin can also transcribe staged inbound audio/voice attachments into plain text turn input when a local transcription command is configured.
- OpenClaw’s plugin SDK already supports outbound attachments from a plugin via `mediaUrl` and `mediaUrls`.
- OpenClaw’s plugin SDK still does not model inbound attachments as a first-class typed field on command or `inbound_claim` events.
- In practice, current `inbound_claim` hook metadata already carries `mediaPath` / `mediaType`, which is enough for this plugin to forward a staged inbound image.
- The same staged inbound path is also enough to transcribe audio before Codex sees the turn, as long as the plugin can execute an external transcription command against the staged file.
- The cleanest future bridge is: OpenClaw stages inbound files locally, then this plugin maps image paths to Codex `localImage` items.

## Codex App-Server Input Model
@@ -177,8 +180,41 @@ That means:
- text-only turns still work as before
- mixed text + image turns can be forwarded into Codex
- image-only inbound turns can be forwarded into Codex
- audio-only inbound turns can be converted into transcript text before the turn starts when `inboundAudioTranscription` is configured
- mixed caption + audio inbound turns can keep the original text and append a labeled transcript block
- staged text attachments such as `.txt`, `.md`, `.json`, `.yaml`, and `.yml` can be read and forwarded as additional `text` items
- unsupported binary non-image inbound media is still ignored for now unless a future bridge teaches the plugin how to reinterpret it
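
The audio-only and caption + audio cases in this list can be sketched as a small merge helper. This is illustrative only: `buildTurnText` is a hypothetical name, and the `[Audio transcript]` label is an assumed format, since the source does not specify how the transcript block is labeled.

```typescript
// Returns the text to send as the turn input, or undefined when there is none.
function buildTurnText(
  caption: string | undefined,
  transcript: string | undefined,
): string | undefined {
  const cap = caption?.trim() ?? "";
  const tr = transcript?.trim() ?? "";
  // No transcript: keep whatever caption text exists.
  if (tr === "") return cap === "" ? undefined : cap;
  // Audio-only message: the transcript becomes the turn text.
  if (cap === "") return tr;
  // Caption + audio: keep the caption and append a labeled block.
  return `${cap}\n\n[Audio transcript]\n${tr}`; // label format is an assumption
}
```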

## Inbound Audio Transcription Bridge

The plugin does not send raw audio into Codex. Instead, it can optionally reinterpret staged audio files as text by invoking a configurable local command.

Configuration shape:

```json
{
  "inboundAudioTranscription": {
    "enabled": true,
    "command": "/path/to/transcribe",
    "args": ["{path}"],
    "timeoutMs": 20000
  }
}
```

Behavior:

- The command receives the staged media path either through an explicit `{path}` placeholder or as an appended trailing argument.
- Optional placeholders `{mimeType}` and `{fileName}` are available for wrappers that need them.
- The command should print the transcript to stdout.
- If stdout is JSON, the plugin uses `.text` first and then `.transcript`.
- On transcription failure or timeout, the plugin logs the failure and falls back to the previous behavior instead of crashing the inbound turn.

This keeps the bridge generic:

- no hard dependency on a specific speech-to-text engine
- no plugin-side audio decoding logic
- no transport-specific behavior baked into the Codex turn layer
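
The stdout handling described in this section can be sketched as a pure function. This is a minimal sketch, assuming a hypothetical helper named `extractTranscript`; treating JSON output that has neither field as "no transcript" is an assumption, as is trimming the result.

```typescript
// Interprets the transcription command's stdout as transcript text.
function extractTranscript(stdout: string): string | undefined {
  const trimmed = stdout.trim();
  if (trimmed === "") return undefined;
  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed === "object" && parsed !== null) {
      const record = parsed as Record<string, unknown>;
      // `.text` is checked first, then `.transcript`.
      for (const key of ["text", "transcript"]) {
        const value = record[key];
        if (typeof value === "string" && value.trim() !== "") {
          return value.trim();
        }
      }
      return undefined; // assumption: JSON without either field yields nothing
    }
  } catch {
    // Not JSON: fall through and treat stdout as the transcript itself.
  }
  return trimmed;
}
```

Returning `undefined` rather than throwing matches the documented fallback, where a failed transcription leaves the inbound turn on its previous path.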

## OpenClaw Plugin SDK: Outbound Media

3 changes: 3 additions & 0 deletions index.test.ts
@@ -27,13 +27,15 @@ describe("plugin registration", () => {
  it("loads without the binding resolved hook on older OpenClaw cores", () => {
    const api = {
      registerService: vi.fn(),
      registerTool: vi.fn(),
      registerInteractiveHandler: vi.fn(),
      registerCommand: vi.fn(),
      on: vi.fn(),
    };

    expect(() => plugin.register(api as never)).not.toThrow();
    expect(api.registerService).toHaveBeenCalledTimes(1);
    expect(api.registerTool).toHaveBeenCalledTimes(4);
    expect(api.on).toHaveBeenCalledWith("inbound_claim", expect.any(Function));
    expect(api.registerInteractiveHandler).toHaveBeenCalledTimes(2);
    expect(api.registerCommand).toHaveBeenCalled();
@@ -45,6 +47,7 @@ describe("plugin registration", () => {
  it("registers the binding resolved hook when available", () => {
    const api = {
      registerService: vi.fn(),
      registerTool: vi.fn(),
      registerInteractiveHandler: vi.fn(),
      registerCommand: vi.fn(),
      on: vi.fn(),
12 changes: 12 additions & 0 deletions index.ts
@@ -1,4 +1,5 @@
import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
import { createAgentTools } from "./src/agent-tools.js";
import { CodexPluginController } from "./src/controller.js";
import { COMMANDS } from "./src/commands.js";
import { INTERACTIVE_NAMESPACE } from "./src/types.js";
@@ -11,6 +12,17 @@ const plugin = {

    api.registerService(controller.createService());

    const toolRegistrar = (
      api as OpenClawPluginApi & {
        registerTool?: (tool: unknown) => void;
      }
    ).registerTool;
    if (typeof toolRegistrar === "function") {
      for (const tool of createAgentTools(controller)) {
        toolRegistrar(tool);
      }
    }

    const bindingResolvedHook = (
      api as OpenClawPluginApi & {
        onConversationBindingResolved?: OpenClawPluginApi["onConversationBindingResolved"];