The goal of this project was to vibe-code a parser in OCaml that is more performant and memory-efficient than Jane Street's, using cloud coding agents, and it was a success.
Given the original code and existing test suite, the agent was asked to create a parser more performant than the existing one and to include benchmarks for measurable scores. It produced benchmarks for performance and memory, with a few of the categories actually beating the original parser, all in exactly 69 steps.
However, it wasn't instructed to beat all the benchmarks thoroughly, which resulted in a disappointing outcome.
Using minimal context (the tests and parser-equivalence checks, but none of the underlying logic), a new session was created with the explicit instruction to beat all the benchmarks in both performance and memory.
This was a success but introduced a few correctness issues (i.e. specific character sequences or atoms that aren't covered by the test suite).
The prior phases reveal two key things:
- A coding agent can be fired off and work until a test suite is satisfied
- A single agent can write more performant OCaml code than Jane Street
This led to the third phase of the project: a swarm is spun up with one subordinate to build a parser that beats the benchmarks while passing the tests, one to write additional tests for strings not currently covered, and one to create additional benchmarks to compare against the original parser.
The results speak for themselves: all 7 original benchmarks won in both speed and memory (up to 3.07× faster, up to 5.75× less memory), plus the test agent generated 370 new test cases — see the Phase 3 test results for details.
The Phase 3 parser is 239 lines vs Phase 2's 378 (37% smaller), using direct mutual recursion instead of array-stack machinery. It trades some performance on list-heavy shapes for a simpler, better-tested parser: Phase 3 wins 4/7 benchmarks on speed (up to 47% faster on quoted strings) while Phase 2 holds an edge on large flat lists and deep nesting. The real gain is in coverage — 370 tests vs ~30, and 19 benchmark scenarios vs 7.
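For intuition, here is a toy sketch of the direct mutual-recursion style (this is not the actual Phase 3 code; it handles only space-separated atoms and parentheses, with no quoted strings, comments, or error reporting):

```ocaml
(* Toy sketch of the mutual-recursion parsing style: [parse_sexp] and
   [parse_list] call each other directly instead of maintaining an
   explicit stack of partially built lists. *)
type sexp = Atom of string | List of sexp list

let rec parse_sexp (s : string) (pos : int) : sexp * int =
  if s.[pos] = '(' then
    let items, pos' = parse_list s (pos + 1) in
    (List items, pos')
  else begin
    (* Scan an atom up to the next space or paren *)
    let stop = ref pos in
    while !stop < String.length s
          && s.[!stop] <> ' ' && s.[!stop] <> '(' && s.[!stop] <> ')' do
      incr stop
    done;
    (Atom (String.sub s pos (!stop - pos)), !stop)
  end

and parse_list s pos =
  if s.[pos] = ')' then ([], pos + 1)        (* close current list *)
  else if s.[pos] = ' ' then parse_list s (pos + 1)  (* skip spaces *)
  else
    let item, pos' = parse_sexp s pos in
    let rest, pos'' = parse_list s pos' in
    (item :: rest, pos'')
```

For example, `parse_sexp "(a (b c) d)" 0` yields `(List [Atom "a"; List [Atom "b"; Atom "c"]; Atom "d"], 11)`. The recursion mirrors the grammar directly, which is what keeps the line count down relative to a stack machine.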
Ergo, distributing work among purpose-specific agents in a swarm can result in more performant outcomes than a single agent!
- OCaml 5.2+ (tested with 5.3.0)
- opam with the following packages installed:
```sh
opam install dune sexplib0 parsexp core_bench
```
Each phase is a self-contained dune project. From the repo root:
```sh
eval $(opam env)

# Phase 1 — fast_parsexp (4/7 benchmarks won)
cd fast_parsexp && dune runtest && cd ..

# Phase 2 — benchmarked_parsexp (7/7 benchmarks won, all tests pass)
cd benchmarked_parsexp && dune runtest && cd ..

# Phase 3 — self_improving_parsexp (7/7 benchmarks won, 5 known test failures)
cd self_improving_parsexp && dune runtest && cd ..
```

Phase 3 will exit with code 1 due to 5 known failures in the comprehensive test suite. The 5 failures come from two bugs: `##` atom parsing and comments before close-parens in lists. See the Phase 3 README for a full breakdown.
Benchmarks use Jane Street's core_bench. Each run takes ~2–3 minutes:
```sh
eval $(opam env)

# Phase 1
cd fast_parsexp && dune exec bench/bench.exe && cd ..

# Phase 2
cd benchmarked_parsexp && dune exec bench/bench.exe && cd ..

# Phase 3 — core 7 benchmarks
cd self_improving_parsexp && dune exec bench/bench.exe && cd ..
```

Each benchmark prints a `core_bench` table comparing the custom parser against `parsexp` on the same inputs, followed by a memory usage analysis. Lower `Time/Run` and `mWd/Run` numbers are better.
Phase 3 also has 12 extended benchmarks (`dune exec bench/bench_extra.exe`), but these use large inputs (up to 100KB) and may take significantly longer to run.
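For reference, a minimal `core_bench` harness comparing two parsers on the same input looks roughly like this. This is a sketch, not the actual benchmark code from any phase; `My_parser.parse_string` is a hypothetical name standing in for the phase's custom parser:

```ocaml
(* Sketch of a core_bench comparison: run both parsers on the same
   input so Time/Run and mWd/Run are directly comparable.
   [My_parser.parse_string] is a hypothetical stand-in for the
   custom parser module. *)
open Core
open Core_bench

let input = "(foo (bar baz) \"a quoted atom\")"

let () =
  Command_unix.run
    (Bench.make_command
       [ Bench.Test.create ~name:"custom parser" (fun () ->
             ignore (My_parser.parse_string input))
       ; Bench.Test.create ~name:"parsexp" (fun () ->
             ignore (Parsexp.Single.parse_string_exn input))
       ])
```

Running the resulting executable prints the table described below, one row per `Bench.Test.create` entry.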
- Speed: In the `core_bench` table, compare `Time/Run` between the `(Benchmarked_parsexp)` and `(Parsexp)` rows for each benchmark.
- Memory: The `mWd/Run` column shows minor-heap words allocated per run. The memory analysis section at the bottom prints explicit ratios.
- Correctness: `dune runtest` runs equivalence tests comparing output against `parsexp` as the reference implementation. Phases 1 and 2 should pass cleanly. Phase 3 has the 5 known failures documented above.
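The equivalence tests follow a simple pattern: parse the same input with both parsers and compare the resulting `Sexp.t` values structurally. A sketch, again with a hypothetical `My_parser` module standing in for the custom parser:

```ocaml
(* Equivalence-check sketch: parsexp is the reference implementation.
   [My_parser.parse_string] is a hypothetical name for the custom
   parser; it is assumed to return (Sexp.t, _) result like parsexp. *)
let check_equivalent (input : string) : bool =
  match Parsexp.Single.parse_string input, My_parser.parse_string input with
  | Ok reference, Ok candidate ->
      (* Structural equality on Sexplib0.Sexp.t *)
      Sexplib0.Sexp.equal reference candidate
  | Error _, Error _ ->
      (* Both reject the input: the parsers agree *)
      true
  | _ -> false  (* one accepts, the other rejects: a divergence *)
```

The Phase 2 correctness bugs mentioned earlier are exactly the divergences this kind of check catches only when the test suite feeds it the right inputs, which is why Phase 3 dedicated a subordinate agent to generating new test strings.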
