
pm-bench compare — diff two leaderboard JSON snapshots#16

Closed
protosphinx wants to merge 1 commit into floor-baselines from compare-command



@protosphinx
Member

Stacked on top of #15 (floor-baselines). Merge order: ... → #14 → #15 → this.

Summary

  • New CLI verb pm-bench compare A.json B.json. It emits per-model score deltas as JSON and lists models that appear in only one board separately. Both boards must share the same task and dataset.
  • The "did my change move the numbers?" command. Useful both interactively (snapshot, edit, re-snapshot) and in CI (post the diff as a comment on PRs that touch scoring code).

What's new

  • pm_bench/leaderboard.compare_boards(a, b) — pure function returning {"task", "dataset", "compared", "only_in_a", "only_in_b"}. ~25 lines.
  • CLI: pm-bench compare A.json B.json emits JSON; exits non-zero with a clear message if the boards are on different (task, dataset) pairs.
  • 6 new tests, including a click-runner smoke test and the cross-task rejection canary. 123 tests total.
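
For readers who want a feel for the shape of compare_boards, here is a minimal sketch rather than the code from this PR. It assumes each board is a dict with "task", "dataset", and an "entries" list of {"model", "scores"} rows; the "entries" key and the 6-decimal rounding of deltas are assumptions for illustration, not taken from the repository.

```python
from __future__ import annotations

from typing import Any


def compare_boards(a: dict[str, Any], b: dict[str, Any]) -> dict[str, Any]:
    """Diff two leaderboard snapshots (illustrative sketch, layout assumed)."""
    key_a = (a["task"], a["dataset"])
    key_b = (b["task"], b["dataset"])
    if key_a != key_b:
        # Boards on different (task, dataset) pairs are not comparable.
        raise ValueError(f"boards differ: {key_a} vs {key_b}")

    # ASSUMPTION: rows live under "entries"; the real key may differ.
    rows_a = {row["model"]: row["scores"] for row in a["entries"]}
    rows_b = {row["model"]: row["scores"] for row in b["entries"]}

    compared = []
    for model in sorted(rows_a.keys() & rows_b.keys()):
        scores = {}
        # Only metrics present on both sides get a delta.
        for metric in sorted(rows_a[model].keys() & rows_b[model].keys()):
            va, vb = rows_a[model][metric], rows_b[model][metric]
            # ASSUMPTION: round to hide float noise such as 0.020000000000000018.
            scores[metric] = {"a": va, "b": vb, "delta": round(vb - va, 6)}
        compared.append({"model": model, "scores": scores})

    return {
        "task": a["task"],
        "dataset": a["dataset"],
        "compared": compared,
        "only_in_a": sorted(rows_a.keys() - rows_b.keys()),
        "only_in_b": sorted(rows_b.keys() - rows_a.keys()),
    }
```

Keeping the function pure (dicts in, dict out, no file or CLI handling) is what lets the CLI verb stay a thin wrapper that loads two JSON files and prints the result.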

Smoke

$ pm-bench compare \
    leaderboard/next-event/synthetic-toy.json \
    /tmp/snapshot-after-my-change.json | head -20
{
  "task": "next-event",
  "dataset": "synthetic-toy",
  "compared": [
    {
      "model": "markov-ref",
      "scores": {
        "n":    {"a": 158, "b": 158, "delta": 0},
        "top1": {"a": 0.9304, "b": 0.9504, "delta": 0.02},
        "top3": {"a": 1.0, "b": 1.0, "delta": 0.0}
      }
    },
    ...
  ],
  "only_in_a": [],
  "only_in_b": []
}
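
For the CI use case (posting the diff as a PR comment), the JSON above can be reduced to the scores that actually moved. The helper below is a hypothetical post-processing sketch, not part of pm-bench; the function name summarize_diff and the tolerance parameter are assumptions, and the sample input reuses only the one model shown in the smoke output.

```python
from __future__ import annotations

from typing import Any


def summarize_diff(diff: dict[str, Any], tol: float = 1e-9) -> list[str]:
    """Return one human-readable line per score that changed (hypothetical helper)."""
    lines = []
    for entry in diff["compared"]:
        for metric, s in entry["scores"].items():
            # Skip scores whose delta is zero (within tolerance).
            if abs(s["delta"]) > tol:
                lines.append(
                    f"{entry['model']} {metric}: {s['a']} -> {s['b']} ({s['delta']:+.4f})"
                )
    # Models present in only one snapshot are reported separately.
    lines += [f"{model}: only in A" for model in diff["only_in_a"]]
    lines += [f"{model}: only in B" for model in diff["only_in_b"]]
    return lines


sample = {
    "task": "next-event",
    "dataset": "synthetic-toy",
    "compared": [
        {
            "model": "markov-ref",
            "scores": {
                "n": {"a": 158, "b": 158, "delta": 0},
                "top1": {"a": 0.9304, "b": 0.9504, "delta": 0.02},
                "top3": {"a": 1.0, "b": 1.0, "delta": 0.0},
            },
        }
    ],
    "only_in_a": [],
    "only_in_b": [],
}

for line in summarize_diff(sample):
    print(line)  # markov-ref top1: 0.9304 -> 0.9504 (+0.0200)
```

In CI, the input would come from json.loads on the command's stdout instead of the inline sample.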

Test plan

- pm_bench/leaderboard.py:compare_boards(a, b) → dict — per-model
  score deltas. Tasks/datasets must match (loud ValueError otherwise)
- CLI: pm-bench compare A.json B.json emits the diff as JSON
- Use case: snapshot today, change something, re-snapshot, diff to
  see what moved. Models unique to one side get surfaced separately
- 6 new tests including click-runner smoke + cross-task rejection
- 123 total, ruff clean
@protosphinx
Member Author

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

@protosphinx deleted the compare-command branch on May 1, 2026 at 17:54