Commit babd601

Plugin: transcribe inbound audio before Codex turns
1 parent 4a87dce commit babd601

8 files changed

Lines changed: 479 additions & 6 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,15 @@
 # Changelog

+## Unreleased
+
+### Highlights
+
+- Added an optional inbound audio transcription preprocessor so bound conversations can convert staged voice/audio attachments into normal text turn input before forwarding the turn into Codex. The plugin stays transport-agnostic by delegating transcription to a configurable local command that prints transcript text to stdout.
+
+### Docs
+
+- Documented the new `inboundAudioTranscription` plugin config and clarified the media bridge notes around staged inbound audio handling.
+
 ## v0.6.0 - 2026-04-03

 ### Highlights

README.md

Lines changed: 33 additions & 0 deletions
@@ -212,6 +212,39 @@ The plugin schema in [`openclaw.plugin.json`](./openclaw.plugin.json) supports:
 - `defaultWorkspaceDir`: fallback workspace for unbound actions
 - `defaultModel`: model used when a new thread starts without an explicit selection
 - `defaultServiceTier`: default service tier for new turns
+- `inboundAudioTranscription`: optional preprocessor for inbound audio/voice attachments before they are forwarded into Codex
+
+### Optional inbound audio transcription
+
+If your chat surface provides inbound audio files as local paths or media metadata, this plugin can transcribe them before forwarding the turn to Codex. This keeps the plugin transport-agnostic: Codex still receives normal text input, while transcription is delegated to any local command you choose.
+
+Example config using an existing local script:
+
+```json
+{
+  "inboundAudioTranscription": {
+    "enabled": true,
+    "command": "/root/.openclaw/workspace/scripts/local-stt-transcribe.sh",
+    "args": ["{path}"],
+    "timeoutMs": 20000
+  }
+}
+```
+
+Behavior:
+
+- audio-only inbound messages become transcript text
+- caption + audio keeps the caption and adds a labeled transcript block
+- the command should print the transcript to stdout
+- if stdout is JSON, `.text` or `.transcript` is used automatically
+
+Argument placeholders supported in `args`:
+
+- `{path}`
+- `{mimeType}`
+- `{fileName}`
+
+If `{path}` is omitted from `args`, the plugin appends the media path automatically.

 ## Developer Workflow With A Local OpenClaw Checkout
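
The placeholder rules in the README hunk above can be sketched as a small pure function. This is an illustration only, not the code from this commit: the `StagedMedia` shape and the `expandTranscriptionArgs` name are hypothetical, and expanding a missing `mimeType` or `fileName` to an empty string is an assumption.

```typescript
// Hypothetical shape of the staged media handed to the preprocessor.
interface StagedMedia {
  path: string;
  mimeType?: string;
  fileName?: string;
}

// Expand {path}, {mimeType}, and {fileName} in each configured argument.
// Assumption: a missing mimeType or fileName expands to an empty string.
function expandTranscriptionArgs(args: string[], media: StagedMedia): string[] {
  const values: Record<string, string> = {
    path: media.path,
    mimeType: media.mimeType ?? "",
    fileName: media.fileName ?? "",
  };
  // A single regex pass avoids re-expanding placeholders that happen to
  // appear inside an already substituted value.
  const expanded = args.map((arg) =>
    arg.replace(
      /\{(path|mimeType|fileName)\}/g,
      (_match: string, key: string) => values[key] ?? "",
    ),
  );
  // Per the README: when no argument references {path}, append the media path.
  const usesPath = args.some((arg) => arg.includes("{path}"));
  return usesPath ? expanded : [...expanded, media.path];
}
```

With `args` set to `["--lang", "en"]`, the sketch would pass the staged path as a trailing third argument, which is the behavior the README describes for configs that omit `{path}`.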

docs/specs/MEDIA.md

Lines changed: 38 additions & 2 deletions
@@ -5,7 +5,8 @@ This document captures the current state of media handling relevant to this plug
 - how Codex app-server accepts image input
 - what this plugin currently sends
 - what OpenClaw currently exposes to plugins
-- the gap for inbound media
+- the remaining gap for richer inbound media
+- the staged-audio transcription bridge this plugin now supports
 - a recommended bridge design for future implementation

 This is a spec/notes document only. It does not imply that inbound media support has already been implemented here.
@@ -15,9 +16,11 @@ This is a spec/notes document only. It does not imply that inbound media support
 - Codex app-server already supports multimodal turn input via `UserInput`.
 - The supported image-shaped input items are remote/data URL images and local filesystem images.
 - This plugin now supports mixed text + image turn input and forwards inbound image media into Codex when OpenClaw provides a staged media path or URL.
+- This plugin can also transcribe staged inbound audio/voice attachments into plain text turn input when a local transcription command is configured.
 - OpenClaw’s plugin SDK already supports outbound attachments from a plugin via `mediaUrl` and `mediaUrls`.
 - OpenClaw’s plugin SDK still does not model inbound attachments as a first-class typed field on command or `inbound_claim` events.
 - In practice, current `inbound_claim` hook metadata already carries `mediaPath` / `mediaType`, which is enough for this plugin to forward a staged inbound image.
+- The same staged inbound path is also enough to transcribe audio before Codex sees the turn, as long as the plugin can execute an external transcription command against the staged file.
 - The cleanest future bridge is: OpenClaw stages inbound files locally, then this plugin maps image paths to Codex `localImage` items.

 ## Codex App-Server Input Model
@@ -177,8 +180,41 @@ That means:
 - text-only turns still work as before
 - mixed text + image turns can be forwarded into Codex
 - image-only inbound turns can be forwarded into Codex
+- audio-only inbound turns can be converted into transcript text before the turn starts when `inboundAudioTranscription` is configured
+- mixed caption + audio inbound turns can keep the original text and append a labeled transcript block
 - staged text attachments such as `.txt`, `.md`, `.json`, `.yaml`, and `.yml` can be read and forwarded as additional `text` items
-- unsupported binary non-image inbound media is still ignored for now
+- unsupported binary non-image inbound media is still ignored for now unless a future bridge teaches the plugin how to reinterpret it
+
+## Inbound Audio Transcription Bridge
+
+The plugin does not send raw audio into Codex. Instead, it can optionally reinterpret staged audio files as text by invoking a configurable local command.
+
+Configuration shape:
+
+```json
+{
+  "inboundAudioTranscription": {
+    "enabled": true,
+    "command": "/path/to/transcribe",
+    "args": ["{path}"],
+    "timeoutMs": 20000
+  }
+}
+```
+
+Behavior:
+
+- The command receives the staged media path either through an explicit `{path}` placeholder or as an appended trailing argument.
+- Optional placeholders `{mimeType}` and `{fileName}` are available for wrappers that need them.
+- The command should print the transcript to stdout.
+- If stdout is JSON, the plugin uses `.text` first and then `.transcript`.
+- On transcription failure or timeout, the plugin logs the failure and falls back to the previous behavior instead of crashing the inbound turn.
+
+This keeps the bridge generic:
+
+- no hard dependency on a specific speech-to-text engine
+- no plugin-side audio decoding logic
+- no transport-specific behavior baked into the Codex turn layer

 ## OpenClaw Plugin SDK: Outbound Media
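
The stdout and caption rules in the bridge notes above might look roughly like the following. This is a minimal sketch under stated assumptions, not the plugin's implementation: `extractTranscript`, `buildTurnText`, and the `[Audio transcript]` label are hypothetical names, and treating a JSON object with neither field as "no transcript" is a guess, since the notes do not cover that case.

```typescript
// Pull the transcript out of the command's stdout.
// Per the spec notes: JSON output uses `.text` first, then `.transcript`;
// anything that is not a JSON object is treated as plain transcript text.
function extractTranscript(stdout: string): string | undefined {
  const trimmed = stdout.trim();
  if (trimmed.length === 0) {
    return undefined;
  }
  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      const record = parsed as Record<string, unknown>;
      for (const key of ["text", "transcript"]) {
        const value = record[key];
        if (typeof value === "string" && value.trim().length > 0) {
          return value.trim();
        }
      }
      // Assumption: a JSON object without a usable field yields no transcript.
      return undefined;
    }
  } catch {
    // Not JSON: fall through and use the raw stdout text.
  }
  return trimmed;
}

// Hypothetical label; the commit only says the transcript block is "labeled".
const TRANSCRIPT_LABEL = "[Audio transcript]";

// Combine the original caption (if any) with the transcript (if any).
function buildTurnText(
  caption: string | undefined,
  transcript: string | undefined,
): string | undefined {
  const text = caption?.trim() ?? "";
  if (!transcript) {
    // Transcription failed or was empty: keep whatever text the turn already had.
    return text.length > 0 ? text : undefined;
  }
  if (text.length === 0) {
    return transcript; // audio-only message becomes transcript text
  }
  return `${text}\n\n${TRANSCRIPT_LABEL}\n${transcript}`;
}
```

Keeping both steps as pure functions means the failure path reduces to passing `undefined` as the transcript, which mirrors the "fall back to the previous behavior" rule without touching the process-spawning code.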

openclaw.plugin.json

Lines changed: 27 additions & 0 deletions
@@ -53,6 +53,28 @@
       },
       "defaultServiceTier": {
         "type": "string"
+      },
+      "inboundAudioTranscription": {
+        "type": "object",
+        "additionalProperties": false,
+        "properties": {
+          "enabled": {
+            "type": "boolean"
+          },
+          "command": {
+            "type": "string"
+          },
+          "args": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            }
+          },
+          "timeoutMs": {
+            "type": "number",
+            "minimum": 100
+          }
+        }
       }
     }
   },
@@ -100,6 +122,11 @@
     "defaultServiceTier": {
       "label": "Default Service Tier",
       "advanced": true
+    },
+    "inboundAudioTranscription": {
+      "label": "Inbound Audio Transcription",
+      "advanced": true,
+      "help": "Optional preprocessor for inbound audio/voice attachments. The command should print the transcript to stdout. Use {path}, {mimeType}, and {fileName} placeholders in args when needed."
     }
   }
 }
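
To make the constraints in the schema fragment above concrete, here is a hand-written check that mirrors them. It is only a sketch: `validateInboundAudioTranscription` is a hypothetical helper and the error strings are invented; how OpenClaw actually validates plugin config against this schema is not shown in the commit.

```typescript
// Returns a list of problems; an empty list means the value fits the schema fragment.
function validateInboundAudioTranscription(value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["must be an object"];
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];

  // "additionalProperties": false, so only the four declared keys are allowed.
  const allowed = ["enabled", "command", "args", "timeoutMs"];
  for (const key of Object.keys(record)) {
    if (!allowed.includes(key)) {
      errors.push(`unknown property: ${key}`);
    }
  }

  // No property is listed as required, so each check applies only when present.
  if ("enabled" in record && typeof record["enabled"] !== "boolean") {
    errors.push("enabled must be a boolean");
  }
  if ("command" in record && typeof record["command"] !== "string") {
    errors.push("command must be a string");
  }
  if ("args" in record) {
    const args = record["args"];
    if (!Array.isArray(args) || !args.every((item) => typeof item === "string")) {
      errors.push("args must be an array of strings");
    }
  }
  if ("timeoutMs" in record) {
    const timeoutMs = record["timeoutMs"];
    if (typeof timeoutMs !== "number" || timeoutMs < 100) {
      errors.push("timeoutMs must be a number >= 100");
    }
  }
  return errors;
}
```

The schema declares no `required` list, so an empty object passes this check.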

src/config.ts

Lines changed: 23 additions & 1 deletion
@@ -1,4 +1,8 @@
-import type { PluginSettings } from "./types.js";
+import type {
+  EndpointSettings,
+  InboundAudioTranscriptionSettings,
+  PluginSettings,
+} from "./types.js";
 import {
   DEFAULT_REQUEST_TIMEOUT_MS,
 } from "./types.js";
@@ -56,6 +60,23 @@ function readNumber(
   return fallback;
 }

+function resolveInboundAudioTranscription(
+  record: Record<string, unknown>,
+): InboundAudioTranscriptionSettings | undefined {
+  const nested = asRecord(record.inboundAudioTranscription);
+  const legacy = asRecord(record.audioTranscription);
+  const source = Object.keys(nested).length > 0 ? nested : legacy;
+  if (Object.keys(source).length === 0) {
+    return undefined;
+  }
+  return {
+    enabled: source.enabled !== false,
+    command: readString(source, "command"),
+    args: readStringArray(source, "args"),
+    timeoutMs: readNumber(source, "timeoutMs", 20_000, 100),
+  };
+}
+
 export function resolvePluginSettings(rawConfig: unknown): PluginSettings {
   const record = asRecord(rawConfig);
   const transport = record.transport === "websocket" ? "websocket" : "stdio";
@@ -82,6 +103,7 @@ export function resolvePluginSettings(rawConfig: unknown): PluginSettings {
     defaultWorkspaceDir: readString(record, "defaultWorkspaceDir"),
     defaultModel: readString(record, "defaultModel"),
     defaultServiceTier: readString(record, "defaultServiceTier"),
+    inboundAudioTranscription: resolveInboundAudioTranscription(record),
   };
 }
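
The resolver added in the `src/config.ts` hunks above has two behaviors that are easy to miss: it falls back to a legacy `audioTranscription` key, and `enabled` defaults to `true` whenever the object is present. The self-contained sketch below reproduces the function so those cases can be exercised. The helper bodies and the settings interface are stand-ins written for this example, since the commit shows only their call sites. In particular, having `readNumber` return the fallback for values below the minimum is an assumption.

```typescript
// Stand-in for the type imported from ./types.js (its real definition is not shown).
interface InboundAudioTranscriptionSettings {
  enabled: boolean;
  command: string | undefined;
  args: string[];
  timeoutMs: number;
}

// Stand-in helpers; the real ones in src/config.ts may differ in detail.
function asRecord(value: unknown): Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : {};
}
function readString(record: Record<string, unknown>, key: string): string | undefined {
  const value = record[key];
  return typeof value === "string" && value.trim() ? value.trim() : undefined;
}
function readStringArray(record: Record<string, unknown>, key: string): string[] {
  const value = record[key];
  return Array.isArray(value)
    ? value.filter((item): item is string => typeof item === "string")
    : [];
}
function readNumber(
  record: Record<string, unknown>,
  key: string,
  fallback: number,
  minimum: number,
): number {
  const value = record[key];
  // Assumption: values below the minimum are replaced by the fallback.
  return typeof value === "number" && Number.isFinite(value) && value >= minimum
    ? value
    : fallback;
}

// Same logic as the function added in the diff.
function resolveInboundAudioTranscription(
  record: Record<string, unknown>,
): InboundAudioTranscriptionSettings | undefined {
  const nested = asRecord(record["inboundAudioTranscription"]);
  const legacy = asRecord(record["audioTranscription"]);
  const source = Object.keys(nested).length > 0 ? nested : legacy;
  if (Object.keys(source).length === 0) {
    return undefined; // neither key configured
  }
  return {
    enabled: source["enabled"] !== false, // defaults to true when omitted
    command: readString(source, "command"),
    args: readStringArray(source, "args"),
    timeoutMs: readNumber(source, "timeoutMs", 20_000, 100),
  };
}

const fromLegacy = resolveInboundAudioTranscription({
  audioTranscription: { enabled: false, command: "/path/to/transcribe" },
});
console.log(fromLegacy?.enabled, fromLegacy?.timeoutMs);
```

Note that the legacy `audioTranscription` key is not declared in the `openclaw.plugin.json` schema shown in this commit, so whether it survives schema validation is an open question.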
