pm-bench compare — diff two leaderboard JSON snapshots by protosphinx · Pull Request #16 · erphq/pm-bench

protosphinx · 2026-05-01T05:28:17Z

Stacked on top of #15 (floor-baselines). Merge order: ... → #14 → #15 → this.

Summary

New CLI verb: pm-bench compare A.json B.json. It emits per-model score deltas as JSON and lists models that appear on only one side separately. Both boards must be on the same task and dataset.
The "did my change move the numbers?" command. Useful both interactively (snapshot, edit, re-snapshot) and in CI (post the diff as a comment on PRs that touch scoring code).
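For the CI use case, one possible way to turn the diff into a PR comment body is sketched below. It relies only on the output shape shown under Smoke; the function name render_comment and the Markdown table layout are illustrative assumptions, not part of pm-bench, and actually posting the comment is left to whatever CI tooling is in use.

```python
import json


def render_comment(diff: dict) -> str:
    """Render a `pm-bench compare` result as a Markdown table (hypothetical helper)."""
    lines = [
        f"### Leaderboard diff: {diff['task']} / {diff['dataset']}",
        "",
        "| model | metric | a | b | delta |",
        "| --- | --- | --- | --- | --- |",
    ]
    for row in diff["compared"]:
        for metric, cell in row["scores"].items():
            # "+g" keeps the sign visible, so regressions stand out at a glance.
            lines.append(
                f"| {row['model']} | {metric} | {cell['a']} | {cell['b']} "
                f"| {cell['delta']:+g} |"
            )
    for side in ("only_in_a", "only_in_b"):
        if diff[side]:
            # The element type of these lists is not shown in the PR,
            # so they are dumped as JSON rather than formatted.
            lines.append("")
            lines.append(f"{side}: {json.dumps(diff[side])}")
    return "\n".join(lines)


sample = {
    "task": "next-event",
    "dataset": "synthetic-toy",
    "compared": [
        {
            "model": "markov-ref",
            "scores": {"top1": {"a": 0.9304, "b": 0.9504, "delta": 0.02}},
        }
    ],
    "only_in_a": [],
    "only_in_b": [],
}
print(render_comment(sample))
```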

What's new

pm_bench/leaderboard.compare_boards(a, b): pure function returning a dict with the keys "task", "dataset", "compared", "only_in_a", and "only_in_b". ~25 lines.
CLI: pm-bench compare A.json B.json emits JSON; exits non-zero with a clear message if the boards are on different (task, dataset) pairs.
6 new tests including a click-runner smoke and the cross-task rejection canary. 123 total.
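To make the behaviour above concrete, here is a minimal sketch of a function with the same output shape. It is not the code in pm_bench/leaderboard.py. The input layout (an "entries" list of {"model", "scores"} objects), the rounding of deltas, and the choice to report only_in_a / only_in_b as model names are all assumptions made for illustration.

```python
from typing import Any


def compare_boards_sketch(a: dict[str, Any], b: dict[str, Any]) -> dict[str, Any]:
    """Illustrative board diff; NOT the pm_bench implementation.

    Assumed (hypothetical) input layout:
        {"task": str, "dataset": str,
         "entries": [{"model": str, "scores": {metric: number}}]}
    """
    key_a = (a["task"], a["dataset"])
    key_b = (b["task"], b["dataset"])
    if key_a != key_b:
        # Boards on different (task, dataset) pairs are rejected loudly.
        raise ValueError(f"boards are not comparable: {key_a} vs {key_b}")

    rows_a = {e["model"]: e["scores"] for e in a["entries"]}
    rows_b = {e["model"]: e["scores"] for e in b["entries"]}

    compared = []
    for model in sorted(rows_a.keys() & rows_b.keys()):
        scores = {}
        for metric in sorted(rows_a[model].keys() & rows_b[model].keys()):
            va, vb = rows_a[model][metric], rows_b[model][metric]
            # Rounding is an assumption here; it hides float noise such as
            # 0.9504 - 0.9304 == 0.020000000000000018.
            scores[metric] = {"a": va, "b": vb, "delta": round(vb - va, 6)}
        compared.append({"model": model, "scores": scores})

    return {
        "task": a["task"],
        "dataset": a["dataset"],
        "compared": compared,
        # Assumption: models unique to one side are reported by name.
        "only_in_a": sorted(rows_a.keys() - rows_b.keys()),
        "only_in_b": sorted(rows_b.keys() - rows_a.keys()),
    }
```

Because the function neither reads files nor mutates its inputs, it can be tested without touching the CLI layer.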

Smoke

$ pm-bench compare \
    leaderboard/next-event/synthetic-toy.json \
    /tmp/snapshot-after-my-change.json | head -20
{
  "task": "next-event",
  "dataset": "synthetic-toy",
  "compared": [
    {
      "model": "markov-ref",
      "scores": {
        "n":    {"a": 158, "b": 158, "delta": 0},
        "top1": {"a": 0.9304, "b": 0.9504, "delta": 0.02},
        "top3": {"a": 1.0, "b": 1.0, "delta": 0.0}
      }
    },
    ...
  ],
  "only_in_a": [],
  "only_in_b": []
}
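A short sketch of how a script might consume this output to answer "what moved?". It assumes only the JSON shape printed above; the helper name moved_metrics and the tolerance value are hypothetical and not part of pm-bench.

```python
import json


def moved_metrics(diff: dict, tol: float = 1e-9) -> list[tuple[str, str, float]]:
    """Return (model, metric, delta) for every score whose delta exceeds tol."""
    moved = []
    for row in diff["compared"]:
        for metric, cell in row["scores"].items():
            if abs(cell["delta"]) > tol:
                moved.append((row["model"], metric, cell["delta"]))
    return moved


# The markov-ref entry from the smoke output above, as a complete document.
diff = json.loads("""
{
  "task": "next-event",
  "dataset": "synthetic-toy",
  "compared": [
    {
      "model": "markov-ref",
      "scores": {
        "n":    {"a": 158, "b": 158, "delta": 0},
        "top1": {"a": 0.9304, "b": 0.9504, "delta": 0.02},
        "top3": {"a": 1.0, "b": 1.0, "delta": 0.0}
      }
    }
  ],
  "only_in_a": [],
  "only_in_b": []
}
""")

print(moved_metrics(diff))  # [('markov-ref', 'top1', 0.02)]
```

A CI job could exit non-zero when this list is non-empty, or when any delta is negative, depending on how strict the gate should be.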

Test plan

pytest -q — 123 passed (was 117 on PR #15, "Floor baselines: zero-time + empty-dfg — multi-entry on 3 boards")
ruff check pm_bench tests — clean
Identical-board diff produces all-zero deltas
Tampered top1 surfaces as expected delta
only_in_b surfaces newly-added entries
Cross-task comparison rejected loudly

- pm_bench/leaderboard.py:compare_boards(a, b) → dict: per-model score deltas. Tasks/datasets must match (loud ValueError otherwise)
- CLI: pm-bench compare A.json B.json emits the diff as JSON
- Use case: snapshot today, change something, re-snapshot, diff to see what moved. Models unique to one side get surfaced separately
- 6 new tests including click-runner smoke + cross-task rejection
- 123 total, ruff clean

protosphinx · 2026-05-01T17:54:32Z

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

This was referenced May 1, 2026

test: exercise auto-download fetch path with tmp HTTP server #17

Closed

bench/seeds.py — cross-seed variance harness for the baselines #18

Closed

protosphinx deleted the branch floor-baselines May 1, 2026 17:54

protosphinx closed this May 1, 2026

protosphinx deleted the compare-command branch May 1, 2026 17:54

protosphinx wants to merge 1 commit into floor-baselines from compare-command
