Pi SDK local-repo search benchmark for coding-agent models. It prepares local repositories/packages once, runs one fresh Pi SDK session per (model, test, run), and scores source-backed answers with a Pi SDK judge council.
Install dependencies:

```sh
bun install
```

Ensure:

- `OPENCODE_API_KEY` is set (required for OpenCode bench and council models).
- `OPENROUTER_API_KEY` is set (required for OpenRouter bench models).
- `git` is available for resource checkout.
Run the benchmark:

```sh
bun run bench
```

Flags:

- `--runs <n>` or `-r <n>`: set per-test run count (default `1`).
- `--model <name>` or `-m <name>`: run only one model.

Example:

```sh
bun run bench --runs 1 --model gpt-5.4-mini
```

If `--model` is not in the default list, it runs with provider `opencode`.
The default bench model set is:

- `gpt-5.3-codex-spark` via the `opencode-spark` custom provider, thinking `medium`
- `claude-haiku-4-5` via `opencode`
- `gpt-5.4-mini` via `opencode`, thinking `medium`
- `gpt-5.5` via `opencode`, thinking `low`
- `x-ai/grok-4.3` via `openrouter`
Council judges:

- `claude-opus-4-7` via `opencode`, thinking `low`
- `gpt-5.5` via `opencode`, thinking `low`
The suite runs the V2 fixed test list in `src/bench.ts`:

- 11 test prompts total
- Resources used: `svelte`, `tailwindcss`, `just`, `bash`, `daytona`
- Git repos are cached under `.pi-bench/cache/repos`
- Timestamped run workspaces are created under `.pi-bench/workspaces/<timestamp>`
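The workspace layout above can be sketched as a small path helper. This is an illustrative assumption, not the benchmark's actual code: the real timestamp format used by `src/bench.ts` may differ.

```typescript
import { join } from "node:path";

// Hypothetical sketch of deriving a timestamped workspace directory
// under `.pi-bench/workspaces/<timestamp>`. The exact timestamp format
// is an assumption for illustration.
function workspaceDir(root = ".pi-bench"): string {
  // e.g. "2025-01-01T12-00-00-000Z" (colons/dots replaced for filesystem safety)
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  return join(root, "workspaces", stamp);
}

console.log(workspaceDir());
```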
For each selected model, `src/bench.ts`:

- Prepares all local resources before model runs.
- Creates a bench-local `models.json` with the `gpt-5.3-codex-spark` custom provider workaround.
- Runs tests sequentially within each model, and models in parallel.
- Creates a fresh in-memory Pi SDK coding-agent session per `(model, test, run)`.
- Exposes only `read` and `bash` tools to benchmark agents.
- Runs council judging through Pi SDK sessions with no tools.
- Extracts wall-clock time, time to first model output event, tool calls, input/output/cache/total tokens, output tokens/sec, and cost.
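The run topology described above (models in parallel, tests and runs sequential within each model) can be sketched as follows. `runOneSession` is a hypothetical stand-in for the real Pi SDK session call, which is not shown here.

```typescript
type RunResult = { model: string; testId: string; run: number };

// Hypothetical stand-in for creating a fresh Pi SDK session and
// executing one (model, test, run) combination.
async function runOneSession(model: string, testId: string, run: number): Promise<RunResult> {
  return { model, testId, run };
}

async function runBench(models: string[], testIds: string[], runs: number): Promise<RunResult[]> {
  // Each model gets its own async task, so models proceed in parallel.
  const perModel = models.map(async (model) => {
    const results: RunResult[] = [];
    for (const testId of testIds) {           // tests: sequential within a model
      for (let run = 1; run <= runs; run++) { // repeated runs: sequential too
        results.push(await runOneSession(model, testId, run));
      }
    }
    return results;
  });
  return (await Promise.all(perModel)).flat();
}
```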
Raw records are written to `results/bench-results-<timestamp>.jsonl`.
Each line is a full JSON record with:
- question metadata
- local resource path and git commit
- raw answer
- token/cost/timing metrics, including prompt-cache read/write tokens when providers report them
- tool usage and error fields
- judge score/clarity/votes and disagreement stats
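Since each line of the results file is a standalone JSON record, post-processing is a line-by-line parse. A minimal sketch, assuming hypothetical field names (`judgeScore`, `costUsd`) that may not match the actual record shape:

```typescript
// Field names here are assumptions for illustration; inspect a real
// results/bench-results-<timestamp>.jsonl line for the actual shape.
interface BenchRecord {
  model: string;
  judgeScore: number;
  costUsd: number;
}

// Parse JSONL text and compute a mean judge score and total cost.
function summarize(jsonl: string): { meanScore: number; totalCost: number } {
  const records: BenchRecord[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const meanScore = records.reduce((sum, r) => sum + r.judgeScore, 0) / records.length;
  const totalCost = records.reduce((sum, r) => sum + r.costUsd, 0);
  return { meanScore, totalCost };
}
```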
The terminal summary prints averages for:
- duration
- time to first model output event
- tool calls
- judge score
- clarity
- input/output/cache tokens
- output wall-clock tokens/sec
- cost
- failed run count
With `--model`, the summary is grouped by `testId` instead of model.
`src/pi_smoke_test.ts` is kept as a Pi SDK regression smoke test:

```sh
bun run src/pi_smoke_test.ts
```