A planned ablation under §4.5a of The inquiry loop on SWE-bench Pro. Sibling run to the headline
prereg-pro-v1 Sonnet+codex artifact, not a restart of it. Both runs ship under the same Zenodo
DOI bundle as two model-pair receipts.
This document inherits every discipline from PREREGISTRATION.md at the prereg-pro-v1 SHA:
predicate (§1), two modes (§2), restart scope (§3), failure-mode state machine (§4), overnight
recovery (§4a), stopping rule (§5), eligible denominator (§6, 728), provenance (§10), Q1–Q22
checklist (§11), confound discipline (§12), freeze mechanism (§13). The deltas below override only
what is explicitly named; everything else is in force unchanged.
Replaces parent §7 / §12 "Sonnet 4.5 generator + GPT-5.5 craft challenger" with:
- Recon (abduction): Gemini Flash 3.5 + Cursor Composer 2.5, blind-blind pushout per §3.3 of the paper. Cross-family preserved (Google × Cursor).
- Craft (deduction): Cursor Composer 2.5 writes the patch (impl strength); Gemini Flash 3.5 adversarially critiques the diff against the spec.
- Audit: deterministic, unchanged.
Stage assignment is role-specialized by design: Composer is code-specialized, so it takes the stage where implementation strength pays; Flash is fast and general-purpose, so it takes the stage where cheap, divergent hypothesis brainstorming pays. This is methodologically distinct from the parent's symmetric strong-model pair. If it holds, the publishable heuristic is "match model to stage", not "use a symmetric pair".
Auth: Composer via CURSOR_API_KEY; Flash via GEMINI_API_KEY (env vars in operator shell, not
committed to repo).
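A minimal preflight sketch for the auth requirement above, assuming the harness is driven from Python. Only the two variable names come from this document; the helper `missing_keys` is hypothetical.

```python
import os

# Variable names are from this document; the helper itself is illustrative.
REQUIRED_KEYS = ("CURSOR_API_KEY", "GEMINI_API_KEY")


def missing_keys(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_keys()` with no argument inspects the operator shell's environment; an empty list means both keys are present. The check reads the keys without printing them, so nothing secret reaches logs.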
Parent ran both Verified (swebench-verified companion) and Pro. This ablation runs Pro only
(728 eligible). Verified is contaminated for every submitter and ports cleanly under any reasonable
model pair; the contamination-resistant tier is where the cheap-model question is informative. No
Verified rerun.
Per-instance cost projection is ~$0.40 combined (Composer ~$0.23/problem per SWE-rebench + Flash near-zero increment), against the parent's ~$2.50/instance Sonnet-API rate. The budget headroom is the lever: where the parent ran on ~4 boxes serialized by API spend, this run can sustain up to 32 boxes in parallel (~8× concurrency). The bottleneck shifts from API cost to box-hours and Cursor / Google API rate limits.
Concrete: shard the 728 eligible into 32 stripes of ~23 instances each from tasks/run_order.txt
(parent §3, frozen order unchanged). Wall-clock target: 728 / 32 ≈ 23 instances/box at ~15 min/instance
(cheap models are faster as well as cheaper), so 23 × 15 min = 345 min ≈ 6 hours, against the parent's multi-day run.
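The shard arithmetic can be sketched as follows. This assumes round-robin striping (instance i goes to stripe i mod 32), which is only one deterministic function of the frozen order; the actual assignment is whatever parent §3 and the committed shard plan define. The instance IDs are placeholders standing in for the lines of tasks/run_order.txt.

```python
def stripe(run_order, n_stripes=32):
    """Assign instance i to stripe i % n_stripes (round-robin; an assumption)."""
    stripes = [[] for _ in range(n_stripes)]
    for i, instance_id in enumerate(run_order):
        stripes[i % n_stripes].append(instance_id)
    return stripes


# Placeholder IDs; the real list is the frozen run order.
order = [f"instance-{i:03d}" for i in range(728)]
stripes = stripe(order)
sizes = [len(s) for s in stripes]

# 728 = 32 * 22 + 24, so 24 stripes hold 23 instances and 8 hold 22.
# The slowest box sets the wall-clock: 23 instances at ~15 min each.
hours = max(sizes) * 15 / 60
```

Under these assumptions the longest stripe takes 5.75 hours, which is where the ~6 hour target comes from.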
Higher parallelism does not alter the measurement contract: the run remains a single scored artifact under one frozen tag, with the same fault-class state machine (§4), the same checkpoint / resume mechanism (§4a), and the same one-shot held-out discipline (§9).
$500 hard cap (projection ~$290, i.e. 728 × ~$0.40; the cap absorbs cost overruns from retries within
INCOMPLETE re-runs per §4 and EC2 spend ~$30-50 per memory). At cap, the run enters PAUSE(QUOTA_EXHAUSTED) per
parent §4; results published per parent §10 as a non-headline partial if completion is impossible.
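The budget figures can be checked with a short sketch. It works in integer cents to avoid floating-point drift, and assumes the cost ledger is a single cumulative total compared against the cap. The `run_state` helper and the "RUNNING" label are hypothetical; only PAUSE(QUOTA_EXHAUSTED) is named in this document.

```python
ELIGIBLE = 728
COST_CENTS = 40        # ~$0.40 combined per-instance projection
CAP_CENTS = 500 * 100  # $500 hard cap

projected_cents = ELIGIBLE * COST_CENTS       # 29120, i.e. $291.20
headroom_cents = CAP_CENTS - projected_cents  # 20880, i.e. $208.80


def run_state(spent_cents, cap_cents=CAP_CENTS):
    """Return the PAUSE state once cumulative spend reaches the cap."""
    if spent_cents >= cap_cents:
        return "PAUSE(QUOTA_EXHAUSTED)"
    return "RUNNING"  # hypothetical label for the non-paused case
```

The roughly $209 of headroom is what has to cover retries and the EC2 estimate.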
This artifact is frozen as prereg-pro-v1-cheap (annotated tag). SHA recorded in this doc's
worklog before the scored run begins. Subsequent amendments follow parent §13 (new commit + new
tag + timestamped rationale; old tags never move).
- Eligible denominator: 728 (parent §6 audit; no re-audit, defects list inherited).
- Run order: tasks/run_order.txt from parent (lexicographic, frozen). Shard stripe assignment is a deterministic function of this order per parent §3.
- Grader: official SWE-bench Pro harness; no bespoke graders (parent §1.3, §4).
- State machine: WIN / LOSS / INCOMPLETE / PAUSE exactly as parent §4, including INCOMPLETE corroboration discipline. A run-time fault on Cursor or Google APIs is a provider-class incident cross-checked against the relevant statuspage (Cursor + Google Cloud); no overlap → LOSS, not re-roll.
- Provenance contract: per-instance trajectories, captured source-only diff, official grader output, agent logs, fault codes, cost ledger: all pulled off-box continuously per parent §14 amendment.
- Headline discipline: a partial run is non-headline (parent §5). The cheap-ablation headline is the full-eligible-set resolve rate under this frozen artifact, reported alongside the Sonnet+codex headline under the shared Zenodo DOI.
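The provider-fault rule in the state-machine item above can be sketched as an interval check. The document states only that no overlap with a statuspage incident means LOSS; treating an overlap as INCOMPLETE is an assumed reading of the corroboration discipline in parent §4. The function name and the incident window below are hypothetical.

```python
from datetime import datetime


def classify_fault(fault_time, incident_windows):
    """Classify a provider-class run-time fault.

    incident_windows: (start, end) datetime pairs transcribed from the
    provider statuspage. Overlap -> "INCOMPLETE" is an assumed reading;
    no overlap -> "LOSS" is the rule stated in this document.
    """
    for start, end in incident_windows:
        if start <= fault_time <= end:
            return "INCOMPLETE"
    return "LOSS"


# Hypothetical incident window, for illustration only.
windows = [(datetime(2026, 6, 1, 2, 0), datetime(2026, 6, 1, 3, 30))]
during = classify_fault(datetime(2026, 6, 1, 2, 45), windows)
outside = classify_fault(datetime(2026, 6, 1, 5, 0), windows)
```

Keeping the incident windows as explicit data, rather than querying the statuspage at classification time, lets the verdict be re-derived later from the archived provenance.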
Pre-committed interpretations of the resolve rate, against the Sonnet+codex headline (97.1% / N=278 in-flight at preregistration time):
| Cheap-pair rate | Reading |
|---|---|
| Comparable | Loop is the lever; model selection is not. Strongest possible read. |
| Modestly lower | Loop is necessary but not sufficient; frontier capability matters on hard-tail repos. |
| Collapses | Frontier capability does most of the work; loop helps marginally. |
Whichever outcome lands, the receipts publish per §10. The answer is the answer.
Parent §3 ("Artifact changed → whole-set restart under a new frozen tag") governs restarts within one scientific track, i.e., when a methodological change makes prior verdicts stale and the prior headline should be retired. This run does not retire the Sonnet+codex headline; both headlines stand, side by side, under the shared Zenodo DOI. The motivation is §4.5a of the paper (isolate loop contribution from model contribution), not a result-driven restart.
The §3 accountability discipline still binds: this doc's worklog opens with the failure-class motivation ("model selection is unablated hyperparameter"; see §4.5a of the paper), written before the scored run begins. Both headlines remain auditable independently.
- CURSOR_API_KEY and GEMINI_API_KEY available in operator shell; tested with a single-instance smoke run.
- Shard plan committed (tasks/shards-v1-cheap.txt: 32 stripes of ~23 instances).
- Fleet provisioning script updated for 32 boxes; preflight checks (parent §13.3) confirmed green on each.
- Watchdog wall-clock cap sized ≥1.5× expected per-instance wall-time at the new model rate (parent §4a).
- This doc committed and prereg-pro-v1-cheap tag cut; SHA recorded in worklog.
- First-day cost ledger smoke-test on a 4-box subset (~92 instances) to verify per-instance cost lands near projection before scaling to 32 boxes.
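As a worked example of the watchdog item in the checklist above, the sketch below sizes the cap from the ~15 min/instance projection. Rounding up to whole minutes and the helper name are assumptions; the checklist only requires a cap of at least 1.5× the expected per-instance wall-time.

```python
import math

EXPECTED_MIN_PER_INSTANCE = 15  # ~15 min/instance projection
SAFETY_FACTOR = 1.5             # lower bound from the checklist


def watchdog_cap_minutes(expected=EXPECTED_MIN_PER_INSTANCE, factor=SAFETY_FACTOR):
    """Smallest whole-minute cap that is >= factor * expected."""
    return math.ceil(expected * factor)


cap = watchdog_cap_minutes()  # 1.5 * 15 = 22.5, rounded up to 23
```

If the single-instance smoke run shows a different per-instance wall-time, the cap should be recomputed from the measured value rather than the projection.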
Preregistered: 2026-05-28. Run start: [TBD, post-checklist]. Expected wall-clock: ~6h
on 32 boxes. Budget cap: $500. Tag: prereg-pro-v1-cheap.