This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
@verygoodplugins/autobench — a plugin-based benchmark harness for voice/chat pipelines. Sweeps configurations across four slots (VAD → STT → LLM → TTS) and reports TTFT, tokens/sec, first-audio latency, and end-to-end turn time.
Extracted from ../uncensored-voice-server on 2026-04-17. The two repos are fully decoupled — autobench does not import from uncensored-voice-server and vice versa. Plugin implementations were copied, not shared.
Part of the Auto- family: AutoMem, AutoJack, AutoHub, AutoBench.
This is an ES module, strict TypeScript project (Node 20+). Use import/export, not require.
| mode | slots used | input |
|---|---|---|
| text-to-text | LLM | prompt text |
| voice-to-text | STT + LLM (+ VAD) | audio file |
| voice-to-voice | STT + LLM + TTS (+ VAD) | audio file |
Mode is declared in the matrix YAML. The runner gates which slots are required.
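The gating described above can be sketched as a lookup from mode to required slots. This is a minimal illustration built from the table, not the runner's actual code; `REQUIRED_SLOTS` and `missingSlots` are hypothetical names.

```typescript
type Mode = "text-to-text" | "voice-to-text" | "voice-to-voice";
type Slot = "vad" | "stt" | "llm" | "tts";

// Required slots per mode, taken from the mode table. VAD is optional in the
// voice modes, so it never appears here.
const REQUIRED_SLOTS: Record<Mode, Slot[]> = {
  "text-to-text": ["llm"],
  "voice-to-text": ["stt", "llm"],
  "voice-to-voice": ["stt", "llm", "tts"],
};

// Hypothetical helper: list the required slots that a pipeline leaves undefined.
function missingSlots(
  mode: Mode,
  pipeline: Partial<Record<Slot, unknown>>,
): Slot[] {
  return REQUIRED_SLOTS[mode].filter((slot) => pipeline[slot] === undefined);
}
```

A validator could reject a pipeline whenever `missingSlots(mode, pipeline)` is non-empty and name the missing slots in the error message.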
# Build TS -> dist/
npm run build
npm run typecheck # no emit
# List all registered plugins (sanity check the registry loaded)
node bin/autobench.js list
# Run a matrix, emit runs/<name>.jsonl + runs/<name>.summary.md
node bin/autobench.js run configs/smoke.yaml --out runs/smoke.jsonl
node bin/autobench.js run configs/m5-max.yaml
node bin/autobench.js run configs/voice-to-voice.yaml
# Serve HTTP + SSE on :8782 for the dashboard
node bin/autobench.js serve
# or: npm run serve
# Dashboard (Vite + React) on :5173, proxies API to :8782
npm run dashboard:dev
npm run dashboard:build
src/
├── core/
│ ├── types.ts # PluginBase, VadPlugin, SttPlugin, LlmPlugin, TtsPlugin, RunRecord
│ ├── registry.ts # Global `registry` singleton + loadBuiltins()
│ ├── runner.ts # loadMatrix, runMatrix (plugin cache lives here)
│ ├── metrics.ts # percentile, summaryStats, wer, rtf, Stopwatch
│ ├── jsonl.ts # JsonlWriter, readRuns
│ ├── hardware.ts # sampleHardware() — darwin memory_pressure + RSS
│ └── report.ts # toMarkdownSummary(records) — groups by pipeline
├── plugins/
│ ├── vad/{sox-silence,silero}.ts
│ ├── stt/{whisper-server,parakeet}.ts
│ ├── llm/{ollama,claude}.ts
│ └── tts/{kokoro,macos-say,piper}.ts
├── cli/{run,serve,list}.ts
├── server.ts # Express + SSE
├── playground.ts # /playground/chat/stream + /playground/voice/turn
├── core/plugin-cache.ts # shared PluginCache (runner + playground)
└── index.ts # public exports
bin/autobench.js # dispatcher to dist/cli/*
configs/*.yaml # matrix definitions
dashboard/ # Vite + React + Recharts, proxies to server
runs/ # JSONL output (gitignored except .gitkeep)
fixtures/ # reference audio + transcripts (gitignored placeholder)
- `bin/autobench.js run <matrix.yaml>` → `src/cli/run.ts`.
- `loadMatrix()` parses YAML, validates mode + prompts + pipelines.
- `runMatrix()` creates a `PluginCache`, iterates `pipeline × prompt × runs`, calls `runOnce()` for each.
- `runOnce()` chains STT → LLM → TTS as required by mode, assembles a `RunRecord`.
- `JsonlWriter` appends one JSON line per run to `runs/<name>.jsonl`.
- After the stream ends, `toMarkdownSummary()` groups by pipeline and emits P50/P95/P99 as `<name>.summary.md`.
- The dashboard reads `/runs` (list), `/runs/:file` (records), `/plugins` (registry) from the server.
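As a rough illustration of the JSONL and grouping steps in this flow, the sketch below writes one JSON document per line, reads it back, and groups samples by pipeline. `RunRecordSketch` is a hypothetical, reduced shape; the real `RunRecord` is defined in `src/core/types.ts` and carries more fields.

```typescript
// Hypothetical, reduced record shape for illustration only.
interface RunRecordSketch {
  pipeline: string;
  totalMs: number;
}

// Serialize records as JSONL: one JSON document per line.
function toJsonl(records: RunRecordSketch[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// Parse JSONL back into records, skipping blank lines.
function parseJsonl(text: string): RunRecordSketch[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunRecordSketch);
}

// Group totalMs samples by pipeline name, as a summary step might do
// before computing percentiles.
function groupByPipeline(records: RunRecordSketch[]): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (const r of records) {
    const samples = groups.get(r.pipeline) ?? [];
    samples.push(r.totalMs);
    groups.set(r.pipeline, samples);
  }
  return groups;
}
```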
- Pick a slot (`vad | stt | llm | tts`) and implement the corresponding interface from `src/core/types.ts`.
- `registry.register(kind, name, async (config) => new YourPlugin())` at module load.
- Add `await import("../plugins/<slot>/<name>.js")` to `loadBuiltins()` in `src/core/registry.ts`.
- Users wire it into a matrix via `{ name: "<your-name>", config: {...} }`.
Plugins live for the process lifetime. Expensive init (ONNX model load, HTTP connections) should happen on first synthesize/transcribe/generate call and be memoized on this. Do not implement teardown() for model-holding plugins — the cache handles reuse.
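A minimal sketch of the lazy, memoized init pattern described above. `FakeModel`, `loadFakeModel`, and `LazyTtsSketch` are hypothetical stand-ins, not the repo's plugin interfaces; the load counter exists only to make reuse observable.

```typescript
// Hypothetical stand-in for an expensive resource such as an ONNX model.
interface FakeModel {
  synthesize(text: string): string;
}

let loadCount = 0;

// Simulated expensive load; counts invocations so reuse can be observed.
async function loadFakeModel(): Promise<FakeModel> {
  loadCount += 1;
  return { synthesize: (text: string) => `audio(${text})` };
}

class LazyTtsSketch {
  // Memoize the promise rather than the resolved model, so concurrent
  // first calls share a single load.
  private modelPromise: Promise<FakeModel> | undefined;

  ready(): Promise<FakeModel> {
    if (this.modelPromise === undefined) {
      this.modelPromise = loadFakeModel();
    }
    return this.modelPromise;
  }

  // Nothing is loaded at construction time; the first call pays the cost.
  async synthesize(text: string): Promise<string> {
    const model = await this.ready();
    return model.synthesize(text);
  }
}
```

Storing the promise on `this` means two overlapping `synthesize()` calls on a cold plugin still trigger only one load.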
runMatrix() in src/core/runner.ts caches plugin instances keyed by ${slot}:${name}:${stableStringify(config)}. Two pipelines with identical config reuse the same instance. teardownAll() runs once in the outer finally, not per-run.
This is why Kokoro.teardown() was removed. Before the cache, Kokoro reloaded its 150MB ONNX model every run and firstAudioMs was dominated by load time.
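The cache key above depends on a stringify that does not vary with key order. The sketch below shows one way that could work; `stableStringify` here is a hypothetical reimplementation and `PluginCacheSketch` is an illustrative stand-in for the shared `PluginCache`, whose real API may differ.

```typescript
// Serialize with object keys sorted, so { a: 1, b: 2 } and { b: 2, a: 1 }
// produce the same string. Assumes JSON-like input (no cycles, no functions).
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical cache: one instance per (slot, name, config) triple.
class PluginCacheSketch<T> {
  private instances = new Map<string, Promise<T>>();

  get(
    slot: string,
    name: string,
    config: unknown,
    factory: (config: unknown) => Promise<T>,
  ): Promise<T> {
    const key = `${slot}:${name}:${stableStringify(config)}`;
    let hit = this.instances.get(key);
    if (hit === undefined) {
      hit = factory(config);
      this.instances.set(key, hit);
    }
    return hit;
  }

  get size(): number {
    return this.instances.size;
  }
}
```

With a key built this way, two pipelines that list the same config keys in a different order still resolve to the same instance.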
See src/core/types.ts. Key fields:
- `timings`: raw per-stage numbers, prefixed (`stt.*`, `llm.*`, `tts.*`).
- `metrics`: derived, comparable numbers (TTFT, TPS, firstAudioMs, totalMs).
- `pipeline`: frozen snapshot of the slot refs used, including config. The dashboard keys on this.
- `hardware.memoryResidentGb`: currently hardcoded to sample the `ollama` process. See follow-ups.
- `sampleHardware("ollama")` is hardcoded in `src/core/runner.ts:222`. LLM plugins should eventually declare their own process name. Fine for now since Ollama is the only LLM backend.
- `GGML_METAL_TENSOR_DISABLE=1` + `GGML_METAL_BF16_DISABLE=1` must be set on `ollama serve` before running long M5 Max benchmarks, or Metal tensor kernel stalls will tank TPS numbers after 500s. See `.env.example`.
- `BodyInit` and `DOM` lib: tsconfig includes `"DOM"` so `fetch` + `FormData` type with `BodyInit` without shims. Do not remove.
- Dashboard `tsconfig.tsbuildinfo` is gitignored because `tsc -b` writes it. Use `npx tsc --noEmit` to typecheck without emit.
- First run is slow. Ollama loads the model on first request (often 3–30s depending on size). Report P50/P95 over `runs: 3+` so the first run doesn't dominate.
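To make the cold-start point concrete, here is a nearest-rank percentile sketch. It is an assumption about one reasonable definition; the repo's own `percentile` in `src/core/metrics.ts` may interpolate or differ in other ways, and `nearestRankPercentile` is a hypothetical name.

```typescript
// Nearest-rank percentile: sort ascending, then take the value at
// rank ceil(p/100 * n), clamped to the valid index range.
function nearestRankPercentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Illustrative TTFT samples in ms (made up): one cold run, two warm runs.
const ttftSamples = [3000, 120, 110];
const medianTtft = nearestRankPercentile(ttftSamples, 50);
```

With three samples the median lands on a warm run, whereas a single run would report only the cold-start number.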
mode: text-to-text | voice-to-text | voice-to-voice
runs: 3                          # repetitions per (pipeline × prompt)
prompts:
  - id: short
    text: "prompt text here"     # required for *-to-text modes
    audioPath: fixtures/x.wav    # required for voice-* modes
    reference: "ground truth"    # optional, used for WER
pipelines:
  - name: fast                   # optional display name
    vad: { name: sox-silence, config: {...} }
    stt: { name: whisper-server, config: { serverUrl: "..." } }
    llm: { name: ollama, config: { model: "...", numCtx: 8192, numPredict: 512, think: false } }
    tts: { name: kokoro, config: { voice: af_heart } }

Unused slots can be omitted. The runner validates that required slots for the mode are present.
See .env.example. Key variables:
- `AUTOBENCH_PORT`: server port (default 8782)
- `OLLAMA_BASE_URL`: default `http://localhost:11434`
- `WHISPER_SERVER_URL`: default `http://localhost:8178`
- `ANTHROPIC_API_KEY`: required by `llm/claude`; falls through to `config.apiKey` if unset
- Parakeet STT defaults to `http://localhost:8179`. autohub's `parakeet-server` defaults to `:8178`, so start it with `PARAKEET_PORT=8179` (or override via the plugin's `serverUrl` config) to avoid collision with whisper-server. For the playground endpoint, set `PARAKEET_SERVER_URL` on the serve process (clients can't override it; see allowlist)
- `PIPER_MODEL`, `PIPER_BINARY`, `PIPER_MODEL_CONFIG`: required on the serve process to enable piper in the playground (filesystem paths are not client-configurable)
- `GGML_METAL_TENSOR_DISABLE=1`, `GGML_METAL_BF16_DISABLE=1`: set on `ollama serve` for M5 Max stability
Additive to the existing runs review. Two Express endpoints and a React tab:
- `POST /playground/chat/stream`: JSON `{ llm: { name, config }, messages, maxTokens?, temperature? }`. SSE events: `ready`, `token`, `done`, `error`. Token events carry a running `{ ttftMs, totalMs }`. The `done` event's metadata includes `promptTokens`, `completionTokens`, and `evalDurationMs` when the LLM reports them.
- `POST /playground/voice/turn`: JSON `{ stt, llm, tts?, audio (base64), audioFormat? }`. One-shot voice turn (PTT or hands-free). SSE events: `stt-start` → `transcript` → `token` (repeated, streaming) → `llm-done` → `audio` (repeated, one per TTS segment) → `done`. `audio` events carry `{ base64, format, ms, index, text }`; the server buffers LLM output into sentence-sized segments (hard punctuation + min chars, max-chars fallback) and synthesizes them via a concurrent TTS worker that runs alongside further token streaming, so the client starts playback well before the full response arrives. `tts-error` events surface per-segment synthesis failures without killing the turn.
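The sentence-sized segmentation used by the voice endpoint can be sketched as a small buffer that cuts on hard punctuation once a minimum length is reached and falls back to a hard cut at a maximum length. The thresholds, the punctuation set, and the name `SegmentBufferSketch` are assumptions for illustration; the server's actual values and logic live in `src/playground.ts`.

```typescript
class SegmentBufferSketch {
  private buf = "";
  private readonly minChars: number;
  private readonly maxChars: number;

  // Thresholds are assumed values, not the server's real configuration.
  constructor(minChars = 20, maxChars = 200) {
    this.minChars = minChars;
    this.maxChars = maxChars;
  }

  // Feed one streamed token; returns any segments now ready for TTS.
  push(token: string): string[] {
    this.buf += token;
    const ready: string[] = [];
    for (;;) {
      const cut = this.findCut();
      if (cut < 0) break;
      const segment = this.buf.slice(0, cut).trim();
      this.buf = this.buf.slice(cut);
      if (segment !== "") ready.push(segment);
    }
    return ready;
  }

  // Call once the LLM stream ends to emit whatever text remains.
  flush(): string[] {
    const rest = this.buf.trim();
    this.buf = "";
    return rest === "" ? [] : [rest];
  }

  // Returns the cut offset, or -1 when no segment is ready yet.
  private findCut(): number {
    for (let i = 0; i < this.buf.length; i++) {
      const isHardStop = ".!?".includes(this.buf.charAt(i));
      if (isHardStop && i + 1 >= this.minChars) return i + 1;
    }
    // Max-chars fallback: no usable punctuation, but the buffer is too long.
    return this.buf.length >= this.maxChars ? this.maxChars : -1;
  }
}
```

The minimum length keeps very short fragments such as "Hi." from becoming their own TTS calls; they ride along with the next sentence instead.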
Config allowlist. Clients can only override safe keys per slot/plugin (model, temperature, maxTokens, voice, etc.). Secrets and paths (apiKey, baseUrl, serverUrl, binary, model for piper) are stripped and re-injected server-side from environment variables. An unknown slot/plugin is refused with 400. See src/playground.ts::CONFIG_ALLOWLIST.
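A minimal sketch of the strip-and-re-inject idea, assuming a per-slot, per-plugin list of safe keys. The table contents, `ALLOWLIST_SKETCH`, and `sanitizeConfig` are hypothetical; the authoritative list is `CONFIG_ALLOWLIST` in `src/playground.ts`.

```typescript
// Hypothetical allowlist: which config keys a client may set per slot/plugin.
const ALLOWLIST_SKETCH: Record<string, Record<string, string[]>> = {
  llm: { ollama: ["model", "temperature", "maxTokens"] },
  tts: { kokoro: ["voice"] },
};

// Keep only allowlisted client keys, then overlay server-side values
// (for example, values read from environment variables) so they always win.
function sanitizeConfig(
  slot: string,
  name: string,
  clientConfig: Record<string, unknown>,
  serverValues: Record<string, unknown>,
): Record<string, unknown> {
  const allowed = ALLOWLIST_SKETCH[slot]?.[name];
  if (allowed === undefined) {
    // A server would translate this into an HTTP 400 response.
    throw new Error(`unknown plugin: ${slot}/${name}`);
  }
  const safe: Record<string, unknown> = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(clientConfig, key)) {
      safe[key] = clientConfig[key];
    }
  }
  return { ...safe, ...serverValues };
}
```

Iterating over the allowlist rather than over the client's keys means anything not explicitly listed is dropped by construction.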
Plugin cache. src/playground.ts holds its own PluginCache (shared class with the runner via core/plugin-cache.ts) for the server process lifetime. Repeated chat turns against the same model hit a warm plugin instance and skip cold-start.
Dashboard. dashboard/src/components/Playground.tsx with chat and voice sub-tabs. Chat panel streams tokens with a blinking-cursor tail and live TTFT/tok-s/token-count readout. Voice panel has a push-to-talk | hands-free (vad) toggle. PTT uses an inline AudioWorklet @ 16 kHz, encodes PCM16 WAV client-side, POSTs, and queues returned audio segments for sequential playback. Hands-free mode runs @ricky0123/vad-web (Silero, self-hosted under dashboard/public/vad/ via a postinstall script) which auto-detects speech start/end, drives the turn on silence, and supports barge-in (speech detected during TTS playback after a 300 ms arm-delay pauses playback and aborts the in-flight fetch). Both modes play audio via a programmatic queue of Audio() elements so the first segment from streaming TTS starts playing before later segments arrive. Reset clears history. Stop aborts in-flight fetches via AbortController.
SSE client quirk. EventSource can't POST, so dashboard/src/lib/sse.ts implements POST + ReadableStream + manual event:/data: parsing.
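The manual `event:`/`data:` parsing can be sketched as an incremental parser fed with decoded text chunks. This is an illustrative reduction, not the code in `dashboard/src/lib/sse.ts`: it assumes LF line endings, ignores `id:` and `retry:` fields, and `SseParserSketch` is a hypothetical name.

```typescript
interface SseEventSketch {
  event: string;
  data: string;
}

class SseParserSketch {
  private buffer = "";

  // Feed a decoded text chunk; returns every event completed by this chunk.
  feed(chunk: string): SseEventSketch[] {
    this.buffer += chunk;
    const events: SseEventSketch[] = [];
    let sep = this.buffer.indexOf("\n\n");
    while (sep >= 0) {
      const block = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      let event = "message"; // default event name when no event: line is present
      const data: string[] = [];
      for (const line of block.split("\n")) {
        if (line.startsWith("event:")) {
          event = line.slice(6).trim();
        } else if (line.startsWith("data:")) {
          // Strip at most one leading space after the colon.
          data.push(line.slice(5).replace(/^ /, ""));
        }
      }
      if (data.length > 0) events.push({ event, data: data.join("\n") });
      sep = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

In a browser, the chunks would typically come from `response.body.getReader()` on a `fetch` POST, decoded with a `TextDecoder` in streaming mode before being passed to `feed()`.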
- Hands-free VAD + streaming TTS segments (2026-04-18). Voice playground now has a `push-to-talk | hands-free (vad)` toggle. Hands-free uses `@ricky0123/vad-web` (Silero, self-hosted at `dashboard/public/vad/`) to auto-detect speech start/end; speech-end auto-submits the turn; speech during TTS playback (after a 300 ms arm-delay) fires barge-in: it pauses audio, aborts the fetch, and captures the new utterance. Server-side, LLM output is buffered into sentence-sized segments and synthesized by a concurrent TTS worker, so each segment is emitted as its own `audio` SSE event. Client queues segments for sequential playback. First-audio latency is now (first-segment + first-TTS), not (full LLM + full TTS). Kokoro still hits onnxruntime-node's ONNX error (follow-up #1); macos-say and piper work end-to-end.
- Interactive playground UI, `feat/playground-ui` (2026-04-18). Chat streaming verified live in Chrome against ollama (qwen2.5-coder:32b, TTFT 342ms, 26.2 tok/s). Voice turn verified server-side via synthesized input (parakeet + ollama + macos-say, STT 75ms, LLM TTFT 109ms warm). Also fixed a real STT plugin bug: the `form-data` npm package produced multipart bodies that parakeet-mlx's FastAPI parser rejected; switched both parakeet + whisper-server to native `FormData` + `Blob`, which also drops the `form-data` dep.
- Second plugin per slot, merged as PR #1, `feat/second-plugin-per-slot` → `main` (2026-04-17). Commits `f1631cd` (claude LLM), `9114144` (silero VAD), `a165133` (parakeet STT), `28d1a20` (piper TTS + demo matrix + doc update), `855cd71` (piper flag note), `b529ca8` (merge polish: opts.stream===false in claude, onnxruntime-node pinned to 1.21.0, YAML comment fix). `claude` verified end-to-end (~800ms TTFT on haiku-4-5); parakeet verified via playground turn (2026-04-18); silero/piper still dry-run only, pending fixtures/ audio.
1. Kokoro ONNX runtime error: `kokoro-js` throws `Preferred output locations must have the same size as output names` on `synthesize()` with onnxruntime-node 1.21.0 in this repo. Surfaced when wiring the voice playground; unrelated to playground code. Needs triage against kokoro-js versions or onnxruntime-node config. Workaround: use `macos-say` or `piper` in Playground and benchmark matrices for now.
2. Streaming STT partials: the hands-free loop still does one-shot STT at end-of-utterance. A streaming STT plugin interface (whisper-server supports partial transcripts; parakeet-mlx does not) would let the client show a live transcript as the user speaks. Not a latency win for the turn itself (LLM still blocks on final transcript), but a UX win.
3. Decouple hardware sampling from "ollama": make the process name a field on LLM plugin metadata so non-Ollama LLMs still report RSS. `src/core/runner.ts:213`.
4. Wire SSE live-view in the dashboard Runs tab: server already emits `/run/stream` events; the UI currently polls `/runs` only. Would share the `lib/sse.ts` helper the playground already uses.
5. Add fixtures/ audio: short WAV clips + reference transcripts so `voice-to-voice.yaml` runs without manual setup. Include a `fixtures/README.md` with provenance. Unblocks end-to-end verification of silero + piper.
6. WER computation: `core/metrics.ts::wer` exists but `runOnce` doesn't invoke it. Compute when `prompt.reference` is set, write to `metrics.wer`.
7. `npm run bench` smoke in CI: a headless matrix + text-only LLM mock plugin for GitHub Actions.
8. Publish: `npm publish --access public` once CI is green.
9. README badge row (npm, CI, license) once 7–8 land.
# Requires Ollama running with at least one model pulled
node bin/autobench.js run configs/smoke.yaml --out runs/smoke.jsonl
cat runs/smoke.summary.md

First run will include cold model-load latency (~3s for 32B Q4). Re-running with runs: 2+ confirms the plugin + Ollama cache keep subsequent runs fast (~100–200ms TTFT).
MIT © Very Good Plugins