| 1 | +# Simulation Benchmark Plan — Multi-Model Comparative Testing |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +The tool refactoring work (2026-04-06) revealed that model choice dramatically affects AUT behavior. Mistral-7B discovered `think` and attempted `remember`; Qwen 14B got stuck in a `respond` loop. Tool hallucination rates, recall fidelity, and narrative engagement all vary per model. Currently, comparing models requires manually re-running sims and reading through logs. |
| 6 | + |
| 7 | +This plan adds a `maxim --sim benchmark` subcommand that automates multi-model comparison, computes standardized metrics, and outputs a comparative report — reusing the existing research protocol, campaign YAML, and experiment recording infrastructure. |
| 8 | + |
| 9 | +## What Already Exists |
| 10 | + |
| 11 | +| Component | Reuse for benchmarks | |
| 12 | +|-----------|---------------------| |
| 13 | +| `--aut-model` flag | Per-run model selection (creates separate AUT router) | |
| 14 | +| Campaign YAML + expectations | Standardized test scenarios with pass/fail criteria | |
| 15 | +| `SimulationReport` | Tool usage, success rates, cost, timing, AUT cognitive state | |
| 16 | +| `ExperimentLog` | UMR-tracked experiment recording with metrics dict | |
| 17 | +| `validation.py` expectations | action_count_range, tool_success_rate, response_latency_ms | |
| 18 | +| `TOOL_ALIASES` + `alias_redirects` | Hallucination rate tracking per model | |
| 19 | +| `sweep` persona | Systematic boundary exploration (could drive benchmark probes) | |
| 20 | +| `researcher` persona | Evidence-based experiment flow | |
| 21 | +| Research protocol (Writer + Reviewer) | Auto-generate comparative paper from results | |
| 22 | +| NAc causal links | Learning efficiency metric | |
| 23 | +| Hippocampus memory count | Memory formation metric | |
| 24 | + |
| 25 | +## CLI Interface |
| 26 | + |
| 27 | +```bash |
| 28 | +# Run a benchmark suite against multiple models (--runs: repeats per model for variance; --output: output directory) |
| 29 | +maxim --sim benchmark \ |
| 30 | + --models mistral-7b,qwen2.5-14b,llama-3-8b \ |
| 31 | + --campaign scenarios/benchmarks/cognitive_suite.yaml \ |
| 32 | + --runs 3 \ |
| 33 | + --output data/benchmarks/ |
| 34 | + |
| 35 | +# Run against a single new model (quick smoke test) |
| 36 | +maxim --sim benchmark \ |
| 37 | + --models phi-3-mini \ |
| 38 | + --campaign scenarios/benchmarks/quick_check.yaml |
| 39 | + |
| 40 | +# Compare against a previous benchmark baseline |
| 41 | +maxim --sim benchmark \ |
| 42 | + --models qwen2.5-14b \ |
| 43 | + --campaign scenarios/benchmarks/cognitive_suite.yaml \ |
| 44 | + --baseline data/benchmarks/baseline_20260406.json |
| 45 | +``` |
| 46 | + |
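The flags shown in the commands above could be parsed along these lines. This is a minimal sketch using `argparse`, not the existing `maxim` CLI code: the function names `parse_models` and `build_parser`, and the defaults for `--runs` and `--output`, are assumptions made for illustration.

```python
import argparse


def parse_models(value: str) -> list[str]:
    """Split a comma-separated --models value into a clean list of names."""
    models = [name.strip() for name in value.split(",") if name.strip()]
    if not models:
        raise argparse.ArgumentTypeError("expected at least one model name")
    return models


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the benchmark flags named in this plan."""
    parser = argparse.ArgumentParser(prog="maxim --sim benchmark")
    parser.add_argument("--models", type=parse_models, required=True)
    parser.add_argument("--campaign", required=True)
    # Defaults below are assumptions; the plan does not specify them.
    parser.add_argument("--runs", type=int, default=1)
    parser.add_argument("--output", default="data/benchmarks/")
    parser.add_argument("--baseline", default=None)
    parser.add_argument("--write-paper", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--models", "mistral-7b, qwen2.5-14b", "--campaign", "cognitive_suite.yaml", "--runs", "3"]
)
print(args.models, args.runs, args.write_paper)
```

Parsing `--models` into a list at the argument layer keeps the model sweep loop free of string handling.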
| 47 | +## Benchmark Metrics |
| 48 | + |
| 49 | +### Repo-Specific (Cognitive Architecture) |
| 50 | + |
| 51 | +These test whether the AUT's biological subsystems work correctly with a given model. |
| 52 | + |
| 53 | +| Metric | What it measures | How to compute | |
| 54 | +|--------|-----------------|----------------| |
| 55 | +| **Memory recall success** | Did the AUT recall the seed detail when prompted? | Check hippocampus for expected content at recall turn | |
| 56 | +| **Behavioral recall** | Did the AUT *act* on the recalled memory (not just report it)? | Check for `say("Verath")` (not `respond("Verath")`) at the door | |
| 57 | +| **Tool hallucination rate** | % of tool calls that used unregistered tool names | `(len(alias_redirects) + failed_unregistered) / total_calls` | |
| 58 | +| **Alias redirect rate** | % of calls that needed alias resolution | `len(alias_redirects) / total_calls` | |
| 59 | +| **Correct tool usage rate** | % of calls to tools the model chose from the available list | `1 - hallucination_rate` | |
| 60 | +| **NAc learning efficiency** | How many causal links formed per action | `causal_links / total_actions` | |
| 61 | +| **Think-before-act rate** | % of turns where `think` preceded another action | Turns containing a `think → X` sequence / total turns | |
| 62 | +| **Memory formation rate** | Episodic memories per turn | `aut_memories_formed / turns` | |
| 63 | +| **Narrative engagement** | Did the AUT respond to scene content (not just repeat instructions)? | Semantic diversity of responses across turns | |
| 64 | +| **Interference resistance** | Did the seed memory survive interference turns? | Check hippocampus for seed content after interference phase | |
| 65 | + |
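The tool-usage and think-before-act formulas in the table above can be sketched as follows. The `ToolCallStats` container and its field names are hypothetical stand-ins for whatever the executor actually records, and treating a run with zero tool calls as all-zero rates is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToolCallStats:
    """Hypothetical per-run counters; field names are illustrative."""
    total_calls: int
    alias_redirects: int      # calls resolved through the alias map
    failed_unregistered: int  # calls to names with no tool and no alias


def tool_usage_metrics(stats: ToolCallStats) -> dict[str, float]:
    """Compute the three tool-usage rates from raw counters."""
    if stats.total_calls == 0:
        # Assumption: with no calls there is nothing to rate, so report zeros.
        return {
            "tool_hallucination_rate": 0.0,
            "alias_redirect_rate": 0.0,
            "correct_tool_usage_rate": 0.0,
        }
    hallucinated = stats.alias_redirects + stats.failed_unregistered
    hallucination_rate = hallucinated / stats.total_calls
    return {
        "tool_hallucination_rate": hallucination_rate,
        "alias_redirect_rate": stats.alias_redirects / stats.total_calls,
        "correct_tool_usage_rate": 1.0 - hallucination_rate,
    }


def think_before_act_rate(turns: list[list[str]]) -> float:
    """Fraction of turns in which `think` is directly followed by a different action."""
    if not turns:
        return 0.0
    hits = sum(
        1
        for actions in turns
        if any(a == "think" and b != "think" for a, b in zip(actions, actions[1:]))
    )
    return hits / len(turns)
```

Note that the alias redirect count is a subset of the hallucination count here, so `alias_redirect_rate` can never exceed `tool_hallucination_rate`.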
| 66 | +### Industry-Standard |
| 67 | + |
| 68 | +These are model capability metrics that apply to any agent system. |
| 69 | + |
| 70 | +| Metric | What it measures | How to compute | |
| 71 | +|--------|-----------------|----------------| |
| 72 | +| **Instruction following** | Did the model use tools from the available list? | `correct_tool_usage_rate` (`1 - tool_hallucination_rate`) | |
| 73 | +| **JSON compliance** | % of LLM responses that parsed as valid JSON on first try | Track parse success in router | |
| 74 | +| **Context retention** | Does the model retain information across turns? | Verath recall at turn 6 (after 3 interference turns) | |
| 75 | +| **Action latency** | Time from percept to action (p50, p95) | Timestamps on bridge send → action record | |
| 76 | +| **Token efficiency** | Actions per 1K tokens consumed | `total_actions / (total_tokens / 1000)` | |
| 77 | +| **Cost per turn** | USD per simulation turn | `cost_usd / turns` | |
| 78 | +| **Reasoning depth** | Does the model chain actions (think → recall → act)? | Detect multi-step action chains within a turn | |
| 79 | + |
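A rough sketch of the latency, token-efficiency, and cost formulas from the table above, using only the standard library. The function names and the choice of the inclusive quantile method are assumptions; the plan does not say how percentiles should be interpolated.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50 and p95 of per-action latencies in milliseconds."""
    if len(latencies_ms) < 2:
        # statistics.quantiles needs at least two points on older Python versions.
        value = float(latencies_ms[0]) if latencies_ms else 0.0
        return {"latency_p50_ms": value, "latency_p95_ms": value}
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"latency_p50_ms": cuts[49], "latency_p95_ms": cuts[94]}


def efficiency_metrics(
    total_actions: int, total_tokens: int, cost_usd: float, turns: int
) -> dict[str, float]:
    """Actions per 1K tokens and USD per turn, guarding against empty runs."""
    return {
        "token_efficiency": total_actions / (total_tokens / 1000) if total_tokens else 0.0,
        "cost_per_turn": cost_usd / turns if turns else 0.0,
    }
```

With only a handful of actions per run, p95 is dominated by the slowest one or two samples, so it may be more informative when pooled across runs.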
| 80 | +## Benchmark Campaign Format |
| 81 | + |
| 82 | +Extends the existing campaign YAML with benchmark-specific metadata: |
| 83 | + |
| 84 | +```yaml |
| 85 | +name: cognitive_suite_v1 |
| 86 | +type: benchmark |
| 87 | +description: | |
| 88 | + Comprehensive cognitive architecture benchmark. |
| 89 | + Tests memory recall, tool usage, reasoning, and narrative engagement. |
| 90 | + |
| 91 | +# Models to test (can be overridden by --models CLI flag) |
| 92 | +default_models: |
| 93 | + - mistral-7b |
| 94 | + - qwen2.5-14b |
| 95 | + |
| 96 | +# Scenarios to run (in order) |
| 97 | +scenarios: |
| 98 | + - path: scenarios/experiments/hippocampal_recall_short.yaml |
| 99 | + weight: 2.0 # counts double in overall score |
| 100 | + category: memory |
| 101 | + metrics: |
| 102 | + - memory_recall_success # did Verath survive? |
| 103 | + - behavioral_recall # did AUT say("Verath") at the door? |
| 104 | + - interference_resistance |
| 105 | + |
| 106 | + - path: scenarios/benchmarks/tool_discovery.yaml |
| 107 | + weight: 1.0 |
| 108 | + category: tool_usage |
| 109 | + metrics: |
| 110 | + - tool_hallucination_rate |
| 111 | + - alias_redirect_rate |
| 112 | + - correct_tool_usage_rate |
| 113 | + |
| 114 | + - path: scenarios/benchmarks/reasoning_chain.yaml |
| 115 | + weight: 1.5 |
| 116 | + category: reasoning |
| 117 | + metrics: |
| 118 | + - think_before_act_rate |
| 119 | + - reasoning_depth |
| 120 | + |
| 121 | + - path: scenarios/benchmarks/narrative_engagement.yaml |
| 122 | + weight: 1.0 |
| 123 | + category: engagement |
| 124 | + metrics: |
| 125 | + - narrative_engagement |
| 126 | + - memory_formation_rate |
| 127 | + |
| 128 | +# Scoring thresholds for pass/fail |
| 129 | +scoring: |
| 130 | + memory_recall_success: { pass: 1.0 } # binary: recalled or not |
| 131 | + behavioral_recall: { pass: 1.0 } # binary: said it or not |
| 132 | + tool_hallucination_rate: { pass_below: 0.3 } # <30% hallucinated |
| 133 | + correct_tool_usage_rate: { pass_above: 0.7 } # >70% correct |
| 134 | + think_before_act_rate: { pass_above: 0.2 } # >20% of turns |
| 135 | + interference_resistance: { pass: 1.0 } # binary |
| 136 | +``` |
| 137 | + |
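One way the `scoring` block above might be evaluated is sketched here. The interpretation of the three rule keys is an assumption drawn from the YAML comments: `pass` is read as "at least this value", while `pass_below` and `pass_above` are strict comparisons. The function names are hypothetical.

```python
def meets_threshold(value: float, rule: dict[str, float]) -> bool:
    """Check one metric value against one scoring rule."""
    if "pass" in rule:
        return value >= rule["pass"]
    if "pass_below" in rule:
        return value < rule["pass_below"]
    if "pass_above" in rule:
        return value > rule["pass_above"]
    raise ValueError(f"unrecognized scoring rule: {rule}")


def score_model(
    metrics: dict[str, float], scoring: dict[str, dict[str, float]]
) -> tuple[bool, dict[str, bool]]:
    """Return (passed_all, per-metric results); rules without a metric value are skipped."""
    results = {
        name: meets_threshold(metrics[name], rule)
        for name, rule in scoring.items()
        if name in metrics
    }
    return all(results.values()), results


# Illustrative values only, not measured results.
scoring = {
    "memory_recall_success": {"pass": 1.0},
    "tool_hallucination_rate": {"pass_below": 0.3},
    "correct_tool_usage_rate": {"pass_above": 0.7},
}
metrics = {
    "memory_recall_success": 1.0,
    "tool_hallucination_rate": 0.25,
    "correct_tool_usage_rate": 0.75,
}
print(score_model(metrics, scoring))
```

Because binary metrics are averaged across runs, a `pass: 1.0` rule under this reading requires success in every run.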
| 138 | +## Architecture |
| 139 | + |
| 140 | +### BenchmarkRunner class |
| 141 | + |
| 142 | +``` |
| 143 | +src/maxim/simulation/benchmark.py (new) |
| 144 | + |
| 145 | +BenchmarkRunner |
| 146 | + ├── __init__(models, campaign_path, runs, output_dir, baseline) |
| 147 | + ├── run() → BenchmarkReport |
| 148 | + │ ├── for each model: |
| 149 | + │ │ ├── for each run (1..N): |
| 150 | + │ │ │ ├── start_simulation_mode(aut_model=model, campaign=scenario) |
| 151 | + │ │ │ ├── collect SimulationReport + executor.alias_redirects |
| 152 | + │ │ │ └── compute per-run metrics |
| 153 | + │ │ └── aggregate across runs (mean, stddev) |
| 154 | + │ ├── compute comparative metrics |
| 155 | + │ ├── score against thresholds |
| 156 | + │ └── build BenchmarkReport |
| 157 | + ├── _compute_metrics(report, executor, hippocampus) → ModelMetrics |
| 158 | + ├── _score(metrics, thresholds) → ModelScore |
| 159 | + └── _compare(scores, baseline) → ComparisonTable |
| 160 | +``` |
| 161 | + |
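The "aggregate across runs (mean, stddev)" step in the tree above might look like the sketch below. The function name `aggregate_runs` and the choice of sample standard deviation (with 0.0 for a single run) are assumptions.

```python
import statistics


def aggregate_runs(
    runs: list[dict[str, float]],
) -> tuple[dict[str, float], dict[str, float]]:
    """Return (mean, stddev) per metric across a model's runs."""
    names = sorted({name for run in runs for name in run})
    means: dict[str, float] = {}
    stddevs: dict[str, float] = {}
    for name in names:
        # A metric may be missing from some runs, e.g. if a scenario was skipped.
        values = [run[name] for run in runs if name in run]
        means[name] = statistics.fmean(values)
        # Sample standard deviation needs two or more values.
        stddevs[name] = statistics.stdev(values) if len(values) > 1 else 0.0
    return means, stddevs
```

With `--runs 3`, a binary metric such as `behavioral_recall` can only average to 0.00, 0.33, 0.67, or 1.00, so its standard deviation is coarse and best read alongside the raw per-run values.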
| 162 | +### BenchmarkReport |
| 163 | + |
| 164 | +```python |
| 165 | +@dataclass |
| 166 | +class BenchmarkReport: |
| 167 | + timestamp: str |
| 168 | + campaign: str |
| 169 | + models: list[str] |
| 170 | + runs_per_model: int |
| 171 | + |
| 172 | + # Per-model results |
| 173 | + results: dict[str, ModelResult] # model_name → ModelResult |
| 174 | + |
| 175 | + # Comparative |
| 176 | + rankings: dict[str, list[str]] # metric_name → [models ranked] |
| 177 | + overall_ranking: list[str] # weighted composite score |
| 178 | + |
| 179 | +@dataclass |
| 180 | +class ModelResult: |
| 181 | + model: str |
| 182 | + runs: list[RunResult] # individual run data |
| 183 | + metrics: dict[str, float] # aggregated (mean) |
| 184 | + metrics_stddev: dict[str, float] # standard deviation across runs |
| 185 | + score: float # weighted composite |
| 186 | + passed: bool # met all pass thresholds |
| 187 | + expectations_met: int |
| 188 | + expectations_total: int |
| 189 | +``` |
| 190 | + |
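The `rankings` field (metric name to ranked model list) could be filled in roughly as follows. Which metrics count as lower-is-better is not stated in this plan, so the `LOWER_IS_BETTER` set is an assumption, as is the function name `rank_models`.

```python
# Assumption: for these metrics a smaller value is the better result.
LOWER_IS_BETTER = {
    "tool_hallucination_rate",
    "alias_redirect_rate",
    "latency_p50_ms",
    "cost_per_turn",
}


def rank_models(metrics_by_model: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """Rank models best-first for every metric that at least one model reports."""
    metric_names = sorted({m for metrics in metrics_by_model.values() for m in metrics})
    rankings: dict[str, list[str]] = {}
    for metric in metric_names:
        models = [name for name, metrics in metrics_by_model.items() if metric in metrics]
        # sorted() is stable, so tied models keep their input order.
        rankings[metric] = sorted(
            models,
            key=lambda name: metrics_by_model[name][metric],
            reverse=metric not in LOWER_IS_BETTER,
        )
    return rankings
```

Keeping the direction of each metric in one explicit set makes it easy to review, which matters because a wrong direction silently inverts a ranking.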
| 191 | +### Integration with Research Protocol |
| 192 | + |
| 193 | +The benchmark runner can optionally feed results into the research protocol's Writer + Reviewer: |
| 194 | + |
| 195 | +```bash |
| 196 | +# Benchmark only (fast, metrics + table) |
| 197 | +maxim --sim benchmark --models mistral-7b,qwen2.5-14b --campaign ... |
| 198 | + |
| 199 | +# Benchmark + paper (slower, includes analysis) |
| 200 | +maxim --sim benchmark --models mistral-7b,qwen2.5-14b --campaign ... --write-paper |
| 201 | +``` |
| 202 | + |
| 203 | +With `--write-paper`, the benchmark feeds `BenchmarkReport` into the Writer agent as experiment data, producing a comparative research paper with the Reviewer validating claims against the metrics. |
| 204 | + |
| 205 | +## Output Format |
| 206 | + |
| 207 | +### Terminal Output |
| 208 | + |
| 209 | +``` |
| 210 | +============================================================ |
| 211 | + BENCHMARK REPORT — cognitive_suite_v1 |
| 212 | + Models: mistral-7b, qwen2.5-14b, llama-3-8b |
| 213 | + Scenarios: 4 | Runs per model: 3 |
| 214 | +============================================================ |
| 215 | + |
| 216 | + MEMORY |
| 217 | + memory_recall_success mistral-7b: 1.00 qwen-14b: 1.00 llama-8b: 0.67 |
| 218 | + behavioral_recall mistral-7b: 0.33 qwen-14b: 0.00 llama-8b: 0.33 |
| 219 | + interference_resistance mistral-7b: 1.00 qwen-14b: 1.00 llama-8b: 1.00 |
| 220 | + |
| 221 | + TOOL USAGE |
| 222 | + hallucination_rate mistral-7b: 0.38 qwen-14b: 0.43 llama-8b: 0.21 |
| 223 | + correct_tool_usage mistral-7b: 0.62 qwen-14b: 0.57 llama-8b: 0.79 |
| 224 | + alias_redirect_rate mistral-7b: 0.25 qwen-14b: 0.29 llama-8b: 0.14 |
| 225 | + |
| 226 | + REASONING |
| 227 | + think_before_act_rate mistral-7b: 0.14 qwen-14b: 0.00 llama-8b: 0.29 |
| 228 | + |
| 229 | + EFFICIENCY |
| 230 | + cost_per_turn mistral-7b: $0.00 qwen-14b: $0.00 llama-8b: $0.00 |
| 231 | + actions_per_turn mistral-7b: 1.86 qwen-14b: 2.00 llama-8b: 1.57 |
| 232 | + latency_p50_ms mistral-7b: 2800 qwen-14b: 2500 llama-8b: 3100 |
| 233 | + |
| 234 | + OVERALL RANKING |
| 235 | + 1. llama-3-8b score: 0.78 |
| 236 | + 2. mistral-7b score: 0.65 |
| 237 | + 3. qwen2.5-14b score: 0.52 |
| 238 | +============================================================ |
| 239 | +``` |
| 240 | + |
| 241 | +### Persisted Files |
| 242 | + |
| 243 | +``` |
| 244 | +data/benchmarks/{timestamp}/ |
| 245 | + benchmark_report.json # Full BenchmarkReport |
| 246 | + summary.md # Human-readable markdown table |
| 247 | + per_model/ |
| 248 | + mistral-7b/ |
| 249 | + run_1/ # Standard sim_reports structure |
| 250 | + run_2/ |
| 251 | + run_3/ |
| 252 | + aggregated.json # Mean metrics across runs |
| 253 | + qwen2.5-14b/ |
| 254 | + ... |
| 255 | + comparison.json # Cross-model comparison data |
| 256 | + paper.md # (if --write-paper) Comparative analysis |
| 257 | +``` |
| 258 | + |
| 259 | +## Benchmark Scenarios to Create |
| 260 | + |
| 261 | +### 1. `cognitive_suite.yaml` — Full cognitive architecture test |
| 262 | + |
| 263 | +Combines all existing experiment scenarios into a single benchmark: |
| 264 | +- Hippocampal recall (short) — memory formation + recall under interference |
| 265 | +- Tool discovery — are narrative/introspection tools used correctly? |
| 266 | +- Reasoning chain — does think→recall→act chaining work? |
| 267 | +- Narrative engagement — does the AUT respond to scene content? |
| 268 | + |
| 269 | +### 2. `quick_check.yaml` — Fast smoke test for new models |
| 270 | + |
| 271 | +Minimal 3-turn scenario: seed → interference → recall. Takes ~30s per model. Good for quick validation when a new model drops. |
| 272 | + |
| 273 | +### 3. `instruction_following.yaml` — Industry-standard tool compliance |
| 274 | + |
| 275 | +Tests whether the model reads and follows the available tool list. Percepts are deliberately ambiguous and could map to any tool, so the scenario measures whether the model hallucinates a tool name or picks one from the list. |
| 276 | + |
| 277 | +### 4. `stress_test.yaml` — Context window pressure |
| 278 | + |
| 279 | +Long campaign (20+ turns) that tests context retention, memory formation under load, and cost efficiency. Useful for comparing 7B vs 14B vs 70B models on the same narrative. |
| 280 | + |
| 281 | +## Implementation Phases |
| 282 | + |
| 283 | +| Phase | What | LOC | Depends on | |
| 284 | +|-------|------|-----|-----------| |
| 285 | +| 1 | `BenchmarkRunner` class + `ModelMetrics` computation | ~200 | Existing sim infrastructure | |
| 286 | +| 2 | `--sim benchmark` CLI integration + model sweep loop | ~80 | Phase 1 | |
| 287 | +| 3 | Benchmark YAML format + scenario loader | ~60 | Phase 1 | |
| 288 | +| 4 | Terminal output + JSON/markdown persistence | ~100 | Phase 1 | |
| 289 | +| 5 | Create benchmark scenarios (cognitive_suite, quick_check) | ~150 (YAML) | Phase 3 | |
| 290 | +| 6 | Baseline comparison (`--baseline`) | ~50 | Phase 4 | |
| 291 | +| 7 | Research protocol integration (`--write-paper`) | ~40 | Phase 4 | |
| 292 | +| **Total** | | **~680** | | |
| 293 | + |
| 294 | +**Recommended first session:** Phases 1-2 (~280 LOC) — the runner + CLI. This gives you `maxim --sim benchmark --models X,Y --campaign Z` working end-to-end. Phases 3-7 add polish, scenarios, and paper generation. |
| 295 | + |
| 296 | +## Open Questions |
| 297 | + |
| 298 | +1. **Should benchmark runs use the `researcher` or `sweep` persona for the orchestrator?** |
| 299 | + - `researcher` follows a structured hypothesis→experiment→conclusion flow |
| 300 | + - `sweep` does systematic boundary exploration |
| 301 | + - For benchmarks, the orchestrator isn't probing — it's just delivering campaigns. The `campaign` persona (which stays hands-off) may be best. |
| 302 | + - Recommendation: `campaign` persona (no orchestrator probing, just deliver and measure) |
| 303 | + |
| 304 | +2. **How to handle model loading for self-hosted models?** |
| 305 | + - `--aut-model mistral-7b` requires the leader to have that model available |
| 306 | + - `maxim peer llm mistral-7b` hot-swaps the leader's model, but only one at a time |
| 307 | + - Benchmark would need to swap models between runs: `peer llm model_A → run → peer llm model_B → run` |
| 308 | + - For cloud models (Claude, GPT-4), no swap needed — just different API profiles |
| 309 | + - Recommendation: benchmark runner calls `peer llm` between self-hosted model runs, uses profiles for cloud models |
| 310 | + |
| 311 | +3. **Should multiple runs be sequential or parallel?** |
| 312 | + - Sequential: simpler, one GPU at a time, deterministic ordering |
| 313 | + - Parallel: faster but needs multiple GPU slots or cloud dispatch |
| 314 | + - Recommendation: sequential by default, parallel as future optimization |
| 315 | + |
| 316 | +4. **How to handle cloud model costs?** |
| 317 | + - Claude-sonnet at ~$3/Mtok could get expensive with 3 runs × 4 scenarios |
| 318 | + - Recommendation: `--cloud-budget` cap applies per model. Default budget of $0.50 per benchmark model. Skip remaining scenarios if budget exceeded. |
| 319 | + |
| 320 | +5. **Should the benchmark report feed back into the tool alias map?** |
| 321 | + - New hallucinated tool names discovered during benchmarks could auto-update TOOL_ALIASES |
| 322 | + - Risk: auto-updating could add bad mappings |
| 323 | + - Recommendation: report new hallucinations in the benchmark output with suggested aliases, but require manual review before adding to the map |
| 324 | + |
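The per-model budget cap recommended under question 4 could behave like the sketch below. Everything here is hypothetical: `run_with_budget` is an invented name, and `cost_of` stands in for "run this scenario and return what it cost in USD". The check happens before each scenario, so the run that crosses the cap still completes and only later scenarios are skipped.

```python
from typing import Callable


def run_with_budget(
    scenarios: list[str],
    cost_of: Callable[[str], float],
    budget_usd: float = 0.50,
) -> tuple[list[str], list[str], float]:
    """Run scenarios in order; skip the rest once cumulative cost exceeds the budget.

    Returns (completed, skipped, spent_usd).
    """
    completed: list[str] = []
    spent = 0.0
    for index, scenario in enumerate(scenarios):
        if spent > budget_usd:
            # Budget already exceeded: skip this and all remaining scenarios.
            return completed, scenarios[index:], spent
        spent += cost_of(scenario)
        completed.append(scenario)
    return completed, [], spent


# Illustrative costs only: each scenario is assumed to cost $0.30.
done, skipped, spent = run_with_budget(["a", "b", "c", "d"], lambda _name: 0.30)
print(done, skipped, round(spent, 2))
```

Reporting the skipped scenarios explicitly matters for scoring, since a model that hit its budget has missing metrics and not failing ones.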
| 325 | +## Related Plans |
| 326 | + |
| 327 | +- [Tool refactoring plan](tool_refactoring_plan.md) — tool aliases and hallucination tracking that feed into benchmark metrics |
| 328 | +- [Realtime refinement plan](realtime_refinement_plan.md) — refinement persona and metric expectations that benchmarks reuse |
| 329 | +- [Research protocol plan](research_protocol_plan.md) — Writer + Reviewer pipeline for benchmark papers |
| 330 | +- [Generative campaign plan](generative_campaign_plan.md) — LLM-generated campaigns could auto-create benchmark scenarios |