The goal of this project was to vibe-code a parser in OCaml that is more performant and memory-efficient than Jane Street's, using cloud coding agents, and it was a success.
Given the original code and existing test suite, the agent was asked to create a parser more performant than the existing one and to include benchmarks for measurable scores. It produced benchmarks for performance and memory, with a few of the categories actually beating the original parser, all in exactly 69 steps.
However, it wasn't instructed to beat all the benchmarks thoroughly, which resulted in a disappointing outcome.
Using minimal context (the tests and parser-equivalence checks, but none of the underlying logic), a new session was created with the explicit instruction to beat all the benchmarks in both performance and memory.
This was a success but introduced a few correctness issues (i.e. specific character sequences or atoms that aren't covered by the test suite).
The prior phases reveal two key things:
- A coding agent can be fired off and work until a test suite is satisfied
- A single agent can write more performant OCaml code than Jane Street
This led to the third phase of the project: a swarm is spun up with one subordinate to build a parser that beats the benchmarks while passing the tests, one to write additional tests for strings not currently covered, and one to create additional benchmarks to compare against the original parser.
The results speak for themselves: all 7 original benchmarks won in both speed and memory (up to 3.07× faster, up to 5.75× less memory), plus the test agent generated 370 new test cases — see the Phase 3 test results for details.
The Phase 3 parser is 239 lines vs Phase 2's 378 (37% smaller), using direct mutual recursion instead of array-stack machinery. It trades some performance on list-heavy shapes for a simpler, better-tested parser: Phase 3 wins 4/7 benchmarks on speed (up to 47% faster on quoted strings) while Phase 2 holds an edge on large flat lists and deep nesting. The real gain is in coverage — 370 tests vs ~30, and 19 benchmark scenarios vs 7.
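For intuition, here is a toy sketch of the direct mutual-recursion style (this is not the actual Phase 3 code; it handles only space-separated atoms and parentheses, with no quoted strings, comments, or error reporting):

```ocaml
(* Toy sketch of the mutual-recursion parsing style: [parse_sexp] and
   [parse_list] call each other directly instead of maintaining an
   explicit stack of partially built lists. *)
type sexp = Atom of string | List of sexp list

let rec parse_sexp (s : string) (pos : int) : sexp * int =
  if s.[pos] = '(' then
    let items, pos' = parse_list s (pos + 1) in
    (List items, pos')
  else begin
    (* Scan an atom up to the next space or paren *)
    let stop = ref pos in
    while !stop < String.length s
          && s.[!stop] <> ' ' && s.[!stop] <> '(' && s.[!stop] <> ')' do
      incr stop
    done;
    (Atom (String.sub s pos (!stop - pos)), !stop)
  end

and parse_list s pos =
  if s.[pos] = ')' then ([], pos + 1)        (* close current list *)
  else if s.[pos] = ' ' then parse_list s (pos + 1)  (* skip spaces *)
  else
    let item, pos' = parse_sexp s pos in
    let rest, pos'' = parse_list s pos' in
    (item :: rest, pos'')
```

For example, `parse_sexp "(a (b c) d)" 0` yields `(List [Atom "a"; List [Atom "b"; Atom "c"]; Atom "d"], 11)`. The recursion mirrors the grammar directly, which is what keeps the line count down relative to a stack machine.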
Ergo, distributing work among purpose-specific agents in a swarm can result in more performant outcomes than a single agent!
- OCaml 5.2+ (tested with 5.3.0)
- opam with the following packages installed:
```sh
opam install dune sexplib0 parsexp core_bench
```
Each phase is a self-contained dune project. From the repo root:
```sh
eval $(opam env)

# Phase 1 — fast_parsexp (4/7 benchmarks won)
cd fast_parsexp && dune runtest && cd ..

# Phase 2 — benchmarked_parsexp (7/7 benchmarks won, all tests pass)
cd benchmarked_parsexp && dune runtest && cd ..

# Phase 3 — self_improving_parsexp (7/7 benchmarks won, 5 known test failures)
cd self_improving_parsexp && dune runtest && cd ..
```

Phase 3 will exit with code 1 due to 5 known failures in the comprehensive test suite. The 5 failures come from two bugs: `##` atom parsing and comments before close-parens in lists. See the Phase 3 README for a full breakdown.
Benchmarks use Jane Street's core_bench. Each run takes ~2–3 minutes:
```sh
eval $(opam env)

# Phase 1
cd fast_parsexp && dune exec bench/bench.exe && cd ..

# Phase 2
cd benchmarked_parsexp && dune exec bench/bench.exe && cd ..

# Phase 3 — core 7 benchmarks
cd self_improving_parsexp && dune exec bench/bench.exe && cd ..
```

Each benchmark prints a `core_bench` table comparing the custom parser against `parsexp` on the same inputs, followed by a memory usage analysis. Lower `Time/Run` and `mWd/Run` numbers are better.
Phase 3 also has 12 extended benchmarks (`dune exec bench/bench_extra.exe`), but these use large inputs (up to 100KB) and may take significantly longer to run.
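For reference, a minimal `core_bench` harness comparing two parsers on the same input looks roughly like this. This is a sketch, not the actual benchmark code from any phase; `My_parser.parse_string` is a hypothetical name standing in for the phase's custom parser:

```ocaml
(* Sketch of a core_bench comparison: run both parsers on the same
   input so Time/Run and mWd/Run are directly comparable.
   [My_parser.parse_string] is a hypothetical stand-in for the
   custom parser module. *)
open Core
open Core_bench

let input = "(foo (bar baz) \"a quoted atom\")"

let () =
  Command_unix.run
    (Bench.make_command
       [ Bench.Test.create ~name:"custom parser" (fun () ->
             ignore (My_parser.parse_string input))
       ; Bench.Test.create ~name:"parsexp" (fun () ->
             ignore (Parsexp.Single.parse_string_exn input))
       ])
```

Running the resulting executable prints the table described below, one row per `Bench.Test.create` entry.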
- Speed: In the `core_bench` table, compare `Time/Run` between the `(Benchmarked_parsexp)` and `(Parsexp)` rows for each benchmark.
- Memory: The `mWd/Run` column shows minor-heap words allocated per run. The memory analysis section at the bottom prints explicit ratios.
- Correctness: `dune runtest` runs equivalence tests comparing output against `parsexp` as the reference implementation. Phases 1 and 2 should pass cleanly. Phase 3 has the 5 known failures documented above.
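The equivalence tests follow a simple pattern: parse the same input with both parsers and compare the resulting `Sexp.t` values structurally. A sketch, again with a hypothetical `My_parser` module standing in for the custom parser:

```ocaml
(* Equivalence-check sketch: parsexp is the reference implementation.
   [My_parser.parse_string] is a hypothetical name for the custom
   parser; it is assumed to return (Sexp.t, _) result like parsexp. *)
let check_equivalent (input : string) : bool =
  match Parsexp.Single.parse_string input, My_parser.parse_string input with
  | Ok reference, Ok candidate ->
      (* Structural equality on Sexplib0.Sexp.t *)
      Sexplib0.Sexp.equal reference candidate
  | Error _, Error _ ->
      (* Both reject the input: the parsers agree *)
      true
  | _ -> false  (* one accepts, the other rejects: a divergence *)
```

The Phase 2 correctness bugs mentioned earlier are exactly the divergences this kind of check catches only when the test suite feeds it the right inputs, which is why Phase 3 dedicated a subordinate agent to generating new test strings.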
