Commit c37ddb8

dennys246 and claude committed
docs: add simulation benchmark plan (multi-model comparative testing)
Designs a `maxim --sim benchmark` subcommand that automates multi-model comparison using existing campaign YAML, experiment recording, and tool alias tracking. Defines repo-specific metrics (memory recall, behavioral recall, hallucination rate, think-before-act) alongside industry-standard metrics (instruction following, JSON compliance, context retention, token efficiency). Covers CLI interface, BenchmarkRunner architecture, output formats, and 7 implementation phases (~680 LOC total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4875b7a commit c37ddb8

1 file changed

Lines changed: 330 additions & 0 deletions

docs/plans/benchmark_plan.md

@@ -0,0 +1,330 @@
# Simulation Benchmark Plan: Multi-Model Comparative Testing

## Context

The tool refactoring work (2026-04-06) revealed that model choice dramatically affects AUT behavior. Mistral-7B discovered `think` and attempted `remember`; Qwen 14B got stuck in a `respond` loop. Tool hallucination rates, recall fidelity, and narrative engagement all vary per model. Currently, comparing models requires manually re-running sims and reading through logs.

This plan adds a `maxim --sim benchmark` subcommand that automates multi-model comparison, computes standardized metrics, and outputs a comparative report, reusing the existing research protocol, campaign YAML, and experiment recording infrastructure.

## What Already Exists

| Component | Reuse for benchmarks |
|-----------|---------------------|
| `--aut-model` flag | Per-run model selection (creates separate AUT router) |
| Campaign YAML + expectations | Standardized test scenarios with pass/fail criteria |
| `SimulationReport` | Tool usage, success rates, cost, timing, AUT cognitive state |
| `ExperimentLog` | UMR-tracked experiment recording with metrics dict |
| `validation.py` expectations | action_count_range, tool_success_rate, response_latency_ms |
| `TOOL_ALIASES` + `alias_redirects` | Hallucination rate tracking per model |
| `sweep` persona | Systematic boundary exploration (could drive benchmark probes) |
| `researcher` persona | Evidence-based experiment flow |
| Research protocol (Writer + Reviewer) | Auto-generate comparative paper from results |
| NAc causal links | Learning efficiency metric |
| Hippocampus memory count | Memory formation metric |

## CLI Interface

```bash
# Run a benchmark suite against multiple models.
# --runs repeats each model N times to estimate variance; --output sets the output directory.
maxim --sim benchmark \
  --models mistral-7b,qwen2.5-14b,llama-3-8b \
  --campaign scenarios/benchmarks/cognitive_suite.yaml \
  --runs 3 \
  --output data/benchmarks/

# Run against a single new model (quick smoke test)
maxim --sim benchmark \
  --models phi-3-mini \
  --campaign scenarios/benchmarks/quick_check.yaml

# Compare against a previous benchmark baseline
maxim --sim benchmark \
  --models qwen2.5-14b \
  --campaign scenarios/benchmarks/cognitive_suite.yaml \
  --baseline data/benchmarks/baseline_20260406.json
```
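
As a rough illustration of how the options above might be parsed, here is a minimal `argparse` sketch. It is not the project's actual CLI wiring: the function name `build_benchmark_parser`, the defaults, and the choice to split `--models` on commas inside the parser are all assumptions made for this example.

```python
import argparse


def build_benchmark_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the benchmark options shown in this plan."""
    parser = argparse.ArgumentParser(prog="maxim --sim benchmark")
    parser.add_argument(
        "--models",
        required=True,
        # Split "a,b,c" into ["a", "b", "c"], dropping empty entries.
        type=lambda s: [m.strip() for m in s.split(",") if m.strip()],
    )
    parser.add_argument("--campaign", required=True)
    parser.add_argument("--runs", type=int, default=1)  # default is an assumption
    parser.add_argument("--output", default="data/benchmarks/")  # assumed default
    parser.add_argument("--baseline", default=None)
    parser.add_argument("--write-paper", action="store_true")
    return parser


args = build_benchmark_parser().parse_args(
    ["--models", "mistral-7b,qwen2.5-14b", "--campaign", "quick_check.yaml", "--runs", "3"]
)
print(args.models, args.runs, args.write_paper)
```

Parsing the model list once at the CLI boundary keeps the runner itself working with a plain `list[str]`.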

## Benchmark Metrics

### Repo-Specific (Cognitive Architecture)

These test whether the AUT's biological subsystems work correctly with a given model.

| Metric | What it measures | How to compute |
|--------|-----------------|----------------|
| **Memory recall success** | Did the AUT recall the seed detail when prompted? | Check hippocampus for expected content at recall turn |
| **Behavioral recall** | Did the AUT *act* on the recalled memory (not just report it)? | Check for `say("Verath")` (not `respond("Verath")`) at the door |
| **Tool hallucination rate** | % of tool calls that were unregistered names | `(len(alias_redirects) + failed_unregistered) / total_calls` |
| **Alias redirect rate** | % of calls that needed alias resolution | `len(alias_redirects) / total_calls` |
| **Correct tool usage rate** | % of calls to tools the model chose from the available list | `1 - hallucination_rate` |
| **NAc learning efficiency** | How many causal links formed per action | `causal_links / total_actions` |
| **Think-before-act rate** | % of turns where `think` preceded another action | Count `think → X` sequences |
| **Memory formation rate** | Episodic memories per turn | `aut_memories_formed / turns` |
| **Narrative engagement** | Did the AUT respond to scene content (not just repeat instructions)? | Semantic diversity of responses across turns |
| **Interference resistance** | Did the seed memory survive interference turns? | Check hippocampus for seed content after interference phase |
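
The tool-usage and think-before-act formulas in the table could be computed along these lines. This is only a sketch: the `ToolCall` record and its fields are hypothetical stand-ins for whatever the executor actually records, and "think preceded another action" is interpreted here as a `think` call immediately followed by a different tool within the same turn.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool call (not the project's real type)."""
    turn: int
    name: str
    alias_redirected: bool = False  # resolved through an alias
    unregistered: bool = False      # failed: unknown tool with no alias


def tool_rates(calls: list[ToolCall]) -> dict[str, float]:
    """Hallucination, alias redirect, and correct-usage rates as fractions."""
    total = len(calls)
    if total == 0:
        return {
            "tool_hallucination_rate": 0.0,
            "alias_redirect_rate": 0.0,
            "correct_tool_usage_rate": 0.0,
        }
    redirects = sum(c.alias_redirected for c in calls)
    failed = sum(c.unregistered for c in calls)
    hallucination = (redirects + failed) / total
    return {
        "tool_hallucination_rate": hallucination,
        "alias_redirect_rate": redirects / total,
        "correct_tool_usage_rate": 1.0 - hallucination,
    }


def think_before_act_rate(calls: list[ToolCall], turns: int) -> float:
    """Fraction of turns containing a `think` call directly followed by another tool."""
    by_turn: dict[int, list[str]] = {}
    for call in calls:
        by_turn.setdefault(call.turn, []).append(call.name)
    hits = sum(
        any(a == "think" and b != "think" for a, b in zip(names, names[1:]))
        for names in by_turn.values()
    )
    return hits / turns if turns else 0.0
```

Passing `turns` explicitly lets turns with no tool calls still count in the denominator.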

### Industry-Standard

These are model capability metrics that apply to any agent system.

| Metric | What it measures | How to compute |
|--------|-----------------|----------------|
| **Instruction following** | Did the model use tools from the available list? | `correct_tool_usage_rate` (complement of the hallucination rate) |
| **JSON compliance** | % of LLM responses that parsed as valid JSON on first try | Track parse success in router |
| **Context retention** | Does the model retain information across turns? | Verath recall at turn 6 (after 3 interference turns) |
| **Action latency** | Time from percept to action (p50, p95) | Timestamps on bridge send → action record |
| **Token efficiency** | Actions per 1K tokens consumed | `total_actions / (total_tokens / 1000)` |
| **Cost per turn** | USD per simulation turn | `cost_usd / turns` |
| **Reasoning depth** | Does the model chain actions (think → recall → act)? | Detect multi-step action chains within a turn |
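
The latency, token-efficiency, and cost formulas above are simple enough to sketch directly. The function name and argument list below are assumptions for illustration; the percentile method (inclusive interpolation from the standard library) is one reasonable choice, not something this plan specifies.

```python
import statistics


def efficiency_metrics(
    latencies_ms: list[float],
    total_actions: int,
    total_tokens: int,
    cost_usd: float,
    turns: int,
) -> dict[str, float]:
    """Hypothetical helper: p50/p95 latency, actions per 1K tokens, USD per turn."""
    if len(latencies_ms) >= 2:
        # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
        p50, p95 = cuts[49], cuts[94]
    else:
        # With zero or one sample there is nothing to interpolate.
        p50 = p95 = latencies_ms[0] if latencies_ms else 0.0
    return {
        "latency_p50_ms": p50,
        "latency_p95_ms": p95,
        "token_efficiency": total_actions / (total_tokens / 1000) if total_tokens else 0.0,
        "cost_per_turn": cost_usd / turns if turns else 0.0,
    }
```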

## Benchmark Campaign Format

Extends the existing campaign YAML with benchmark-specific metadata:

```yaml
name: cognitive_suite_v1
type: benchmark
description: |
  Comprehensive cognitive architecture benchmark.
  Tests memory recall, tool usage, reasoning, and narrative engagement.

# Models to test (can be overridden by --models CLI flag)
default_models:
  - mistral-7b
  - qwen2.5-14b

# Scenarios to run (in order)
scenarios:
  - path: scenarios/experiments/hippocampal_recall_short.yaml
    weight: 2.0  # counts double in overall score
    category: memory
    metrics:
      - memory_recall_success  # did Verath survive?
      - behavioral_recall  # did AUT say("Verath") at the door?
      - interference_resistance

  - path: scenarios/benchmarks/tool_discovery.yaml
    weight: 1.0
    category: tool_usage
    metrics:
      - tool_hallucination_rate
      - alias_redirect_rate
      - correct_tool_usage_rate

  - path: scenarios/benchmarks/reasoning_chain.yaml
    weight: 1.5
    category: reasoning
    metrics:
      - think_before_act_rate
      - reasoning_depth

  - path: scenarios/benchmarks/narrative_engagement.yaml
    weight: 1.0
    category: engagement
    metrics:
      - narrative_engagement
      - memory_formation_rate

# Scoring thresholds for pass/fail
scoring:
  memory_recall_success: { pass: 1.0 }  # binary: recalled or not
  behavioral_recall: { pass: 1.0 }  # binary: said it or not
  tool_hallucination_rate: { pass_below: 0.3 }  # <30% hallucinated
  correct_tool_usage_rate: { pass_above: 0.7 }  # >70% correct
  think_before_act_rate: { pass_above: 0.2 }  # >20% of turns
  interference_resistance: { pass: 1.0 }  # binary
```
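
One way the `scoring` thresholds and scenario `weight` values might be applied is sketched below. The plan does not define how the composite score is calculated, so the weighted mean over per-category scores is an assumption, as are the function names and the reading of `pass` as "value must reach this number".

```python
SCORING = {
    "memory_recall_success": {"pass": 1.0},
    "tool_hallucination_rate": {"pass_below": 0.3},
    "correct_tool_usage_rate": {"pass_above": 0.7},
    "think_before_act_rate": {"pass_above": 0.2},
}


def passes(value: float, rule: dict[str, float]) -> bool:
    """Apply one threshold rule; `pass` is read as 'at least this value' (assumption)."""
    if "pass" in rule:
        return value >= rule["pass"]
    if "pass_below" in rule:
        return value < rule["pass_below"]
    if "pass_above" in rule:
        return value > rule["pass_above"]
    raise ValueError(f"unknown scoring rule: {rule}")


def evaluate(metrics: dict[str, float], scoring: dict[str, dict[str, float]]) -> dict[str, bool]:
    """Pass/fail per metric, for metrics that have a threshold defined."""
    return {name: passes(metrics[name], rule) for name, rule in scoring.items() if name in metrics}


def weighted_composite(category_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-category scores (assumed definition of the composite)."""
    total_weight = sum(weights[c] for c in category_scores)
    return sum(category_scores[c] * weights[c] for c in category_scores) / total_weight
```

With this reading, a scenario with `weight: 2.0` contributes twice as much to the composite as one with `weight: 1.0`, which matches the "counts double" comment in the YAML.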

## Architecture

### BenchmarkRunner class

```
src/maxim/simulation/benchmark.py (new)

BenchmarkRunner
├── __init__(models, campaign_path, runs, output_dir, baseline)
├── run() → BenchmarkReport
│   ├── for each model:
│   │   ├── for each run (1..N):
│   │   │   ├── start_simulation_mode(aut_model=model, campaign=scenario)
│   │   │   ├── collect SimulationReport + executor.alias_redirects
│   │   │   └── compute per-run metrics
│   │   └── aggregate across runs (mean, stddev)
│   ├── compute comparative metrics
│   ├── score against thresholds
│   └── build BenchmarkReport
├── _compute_metrics(report, executor, hippocampus) → ModelMetrics
├── _score(metrics, thresholds) → ModelScore
└── _compare(scores, baseline) → ComparisonTable
```
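
The nested model/run loop and the mean/stddev aggregation in the outline above might look roughly like this. It is a sketch under stated assumptions: `run_once` is a hypothetical callback standing in for "start the simulation and compute per-run metrics", and every run is assumed to report the same metric names.

```python
import statistics
from typing import Callable

RunFn = Callable[[str, int], dict[str, float]]  # (model, run_index) -> per-run metrics


def run_benchmark(models: list[str], runs: int, run_once: RunFn) -> dict[str, dict[str, dict[str, float]]]:
    """Run each model `runs` times sequentially and aggregate each metric."""
    results: dict[str, dict[str, dict[str, float]]] = {}
    for model in models:
        per_run = [run_once(model, i) for i in range(1, runs + 1)]
        names = list(per_run[0])
        results[model] = {
            "mean": {n: statistics.fmean([r[n] for r in per_run]) for n in names},
            # Sample standard deviation needs at least two runs.
            "stddev": {
                n: statistics.stdev([r[n] for r in per_run]) if runs > 1 else 0.0
                for n in names
            },
        }
    return results
```

Injecting `run_once` keeps the aggregation testable without starting a real simulation.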

### BenchmarkReport

```python
# Postponed annotations let BenchmarkReport reference ModelResult before it is defined.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BenchmarkReport:
    timestamp: str
    campaign: str
    models: list[str]
    runs_per_model: int

    # Per-model results
    results: dict[str, ModelResult]  # model_name → ModelResult

    # Comparative
    rankings: dict[str, list[str]]  # metric_name → [models ranked]
    overall_ranking: list[str]  # models ordered by weighted composite score


@dataclass
class ModelResult:
    model: str
    runs: list[RunResult]  # individual run data (RunResult is not defined in this plan)
    metrics: dict[str, float]  # aggregated (mean)
    metrics_stddev: dict[str, float]  # standard deviation across runs
    score: float  # weighted composite
    passed: bool  # met all pass thresholds
    expectations_met: int
    expectations_total: int
```
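
The `rankings` field could be filled by sorting models per metric. The sketch below assumes a direction for each metric: the `LOWER_IS_BETTER` set is a guess made for this example, and the plan does not say how ties should be broken. The sample values are the illustrative ones from this plan's example terminal output.

```python
# Assumption: these metrics improve as they decrease; all others improve as they increase.
LOWER_IS_BETTER = {
    "tool_hallucination_rate",
    "alias_redirect_rate",
    "latency_p50_ms",
    "cost_per_turn",
}


def rank_models(metrics: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """Return metric_name -> models ordered best-first."""
    names = sorted({name for per_model in metrics.values() for name in per_model})
    rankings: dict[str, list[str]] = {}
    for name in names:
        candidates = [model for model in metrics if name in metrics[model]]
        rankings[name] = sorted(
            candidates,
            key=lambda model: metrics[model][name],
            reverse=name not in LOWER_IS_BETTER,
        )
    return rankings


sample = {
    "mistral-7b": {"tool_hallucination_rate": 0.38, "think_before_act_rate": 0.14},
    "qwen2.5-14b": {"tool_hallucination_rate": 0.43, "think_before_act_rate": 0.00},
    "llama-3-8b": {"tool_hallucination_rate": 0.21, "think_before_act_rate": 0.29},
}
print(rank_models(sample))
```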

### Integration with Research Protocol

The benchmark runner can optionally feed results into the research protocol's Writer + Reviewer:

```bash
# Benchmark only (fast, metrics + table)
maxim --sim benchmark --models mistral-7b,qwen2.5-14b --campaign ...

# Benchmark + paper (slower, includes analysis)
maxim --sim benchmark --models mistral-7b,qwen2.5-14b --campaign ... --write-paper
```

With `--write-paper`, the benchmark feeds `BenchmarkReport` into the Writer agent as experiment data, producing a comparative research paper with the Reviewer validating claims against the metrics.

## Output Format

### Terminal Output

```
============================================================
BENCHMARK REPORT - cognitive_suite_v1
Models: mistral-7b, qwen2.5-14b, llama-3-8b
Scenarios: 4 | Runs per model: 3
============================================================

MEMORY
  memory_recall_success     mistral-7b: 1.00   qwen-14b: 1.00   llama-8b: 0.67
  behavioral_recall         mistral-7b: 0.33   qwen-14b: 0.00   llama-8b: 0.33
  interference_resistance   mistral-7b: 1.00   qwen-14b: 1.00   llama-8b: 1.00

TOOL USAGE
  hallucination_rate        mistral-7b: 0.38   qwen-14b: 0.43   llama-8b: 0.21
  correct_tool_usage        mistral-7b: 0.62   qwen-14b: 0.57   llama-8b: 0.79
  alias_redirect_rate       mistral-7b: 0.25   qwen-14b: 0.29   llama-8b: 0.14

REASONING
  think_before_act_rate     mistral-7b: 0.14   qwen-14b: 0.00   llama-8b: 0.29

EFFICIENCY
  cost_per_turn             mistral-7b: $0.00  qwen-14b: $0.00  llama-8b: $0.00
  actions_per_turn          mistral-7b: 1.86   qwen-14b: 2.00   llama-8b: 1.57
  latency_p50_ms            mistral-7b: 2800   qwen-14b: 2500   llama-8b: 3100

OVERALL RANKING
  1. llama-3-8b     score: 0.78
  2. mistral-7b     score: 0.65
  3. qwen2.5-14b    score: 0.52
============================================================
```

### Persisted Files

```
data/benchmarks/{timestamp}/
  benchmark_report.json    # Full BenchmarkReport
  summary.md               # Human-readable markdown table
  per_model/
    mistral-7b/
      run_1/               # Standard sim_reports structure
      run_2/
      run_3/
      aggregated.json      # Mean metrics across runs
    qwen2.5-14b/
      ...
  comparison.json          # Cross-model comparison data
  paper.md                 # (if --write-paper) Comparative analysis
```
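
Writing part of this layout to disk could be sketched as follows. Only `benchmark_report.json` and the per-model `aggregated.json` files are covered; the function name `persist` and the use of plain dictionaries instead of the report dataclasses are assumptions made to keep the example self-contained.

```python
import json
from pathlib import Path


def persist(output_dir: str, timestamp: str, report: dict, per_model: dict[str, dict[str, float]]) -> Path:
    """Write benchmark_report.json and per_model/<model>/aggregated.json under a timestamped root."""
    root = Path(output_dir) / timestamp
    root.mkdir(parents=True, exist_ok=True)
    (root / "benchmark_report.json").write_text(json.dumps(report, indent=2))
    for model, aggregated in per_model.items():
        model_dir = root / "per_model" / model
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "aggregated.json").write_text(json.dumps(aggregated, indent=2))
    return root
```

Using a timestamped root means repeated benchmark runs do not overwrite each other, which is what a later `--baseline` comparison would rely on.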

## Benchmark Scenarios to Create

### 1. `cognitive_suite.yaml`: Full cognitive architecture test

Combines all existing experiment scenarios into a single benchmark:
- Hippocampal recall (short): memory formation + recall under interference
- Tool discovery: are narrative/introspection tools used correctly?
- Reasoning chain: does think→recall→act chaining work?
- Narrative engagement: does the AUT respond to scene content?

### 2. `quick_check.yaml`: Fast smoke test for new models

Minimal 3-turn scenario: seed → interference → recall. Takes ~30s per model. Useful for quick validation when a new model is released.

### 3. `instruction_following.yaml`: Industry-standard tool compliance

Tests whether the model reads and follows the available tool list. It uses deliberately ambiguous percepts that could map to any tool, and measures whether the model hallucinates a tool name or picks one from the list.

### 4. `stress_test.yaml`: Context window pressure

Long campaign (20+ turns) that tests context retention, memory formation under load, and cost efficiency. Useful for comparing 7B vs 14B vs 70B models on the same narrative.
280+
281+
## Implementation Phases
282+
283+
| Phase | What | LOC | Depends on |
284+
|-------|------|-----|-----------|
285+
| 1 | `BenchmarkRunner` class + `ModelMetrics` computation | ~200 | Existing sim infrastructure |
286+
| 2 | `--sim benchmark` CLI integration + model sweep loop | ~80 | Phase 1 |
287+
| 3 | Benchmark YAML format + scenario loader | ~60 | Phase 1 |
288+
| 4 | Terminal output + JSON/markdown persistence | ~100 | Phase 1 |
289+
| 5 | Create benchmark scenarios (cognitive_suite, quick_check) | ~150 (YAML) | Phase 3 |
290+
| 6 | Baseline comparison (`--baseline`) | ~50 | Phase 4 |
291+
| 7 | Research protocol integration (`--write-paper`) | ~40 | Phase 4 |
292+
| **Total** | | **~680** | |
293+
294+
**Recommended first session:** Phases 1-2 (~280 LOC) — the runner + CLI. This gives you `maxim --sim benchmark --models X,Y --campaign Z` working end-to-end. Phases 3-7 add polish, scenarios, and paper generation.

## Open Questions

1. **Should benchmark runs use the `researcher` or `sweep` persona for the orchestrator?**
   - `researcher` follows a structured hypothesis→experiment→conclusion flow
   - `sweep` does systematic boundary exploration
   - For benchmarks, the orchestrator isn't probing; it only delivers campaigns. The `campaign` persona (which stays hands-off) may be best.
   - Recommendation: `campaign` persona (no orchestrator probing, just deliver and measure)

2. **How to handle model loading for self-hosted models?**
   - `--aut-model mistral-7b` requires the leader to have that model available
   - `maxim peer llm mistral-7b` hot-swaps the leader's model, but only one at a time
   - Benchmark would need to swap models between runs: `peer llm model_A → run → peer llm model_B → run`
   - For cloud models (Claude, GPT-4), no swap is needed, only different API profiles
   - Recommendation: benchmark runner calls `peer llm` between self-hosted model runs, uses profiles for cloud models

3. **Should multiple runs be sequential or parallel?**
   - Sequential: simpler, one GPU at a time, deterministic ordering
   - Parallel: faster but needs multiple GPU slots or cloud dispatch
   - Recommendation: sequential by default, parallel as future optimization

4. **How to handle cloud model costs?**
   - Claude-sonnet at ~$3/Mtok could get expensive with 3 runs × 4 scenarios
   - Recommendation: apply the `--cloud-budget` cap per model, with a default of $0.50 per benchmarked model, and skip the remaining scenarios once the budget is exceeded.

5. **Should the benchmark report feed back into the tool alias map?**
   - New hallucinated tool names discovered during benchmarks could auto-update TOOL_ALIASES
   - Risk: auto-updating could add bad mappings
   - Recommendation: report new hallucinations in the benchmark output with suggested aliases, but require manual review before adding to the map
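
The per-model budget cap recommended in question 4 could be enforced with a loop like the one below. This is a sketch only: `run_scenario` is a hypothetical callback that runs one scenario and returns its cost in USD, and the check happens before each scenario, so the scenario that crosses the cap still completes.

```python
from typing import Callable


def run_with_budget(
    scenarios: list[str],
    run_scenario: Callable[[str], float],
    budget_usd: float = 0.50,
) -> tuple[list[str], list[str], float]:
    """Run scenarios in order; skip the rest once spending reaches the cap."""
    spent = 0.0
    completed: list[str] = []
    skipped: list[str] = []
    for scenario in scenarios:
        if spent >= budget_usd:
            skipped.append(scenario)  # budget exhausted: record and move on
            continue
        spent += run_scenario(scenario)
        completed.append(scenario)
    return completed, skipped, spent
```

Returning the skipped scenarios explicitly lets the report distinguish "failed" from "not run for budget reasons".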
324+
325+
## Related Plans
326+
327+
- [Tool refactoring plan](tool_refactoring_plan.md) — tool aliases and hallucination tracking that feed into benchmark metrics
328+
- [Realtime refinement plan](realtime_refinement_plan.md) — refinement persona and metric expectations that benchmarks reuse
329+
- [Research protocol plan](research_protocol_plan.md) — Writer + Reviewer pipeline for benchmark papers
330+
- [Generative campaign plan](generative_campaign_plan.md) — LLM-generated campaigns could auto-create benchmark scenarios
