Leaderboard scaffold: standings JSON + Markov reference + verify CLI #4
Closed
protosphinx wants to merge 1 commit into
Conversation
- `leaderboard/next-event/synthetic-toy.json` — first standings file, with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- `leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz` — reference predictions, checked in so the loop is reproducible without hitting the network
- `pm_bench/leaderboard.py` — `load_board`, `rescore`, `verify`, `standings`. Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` — pretty-prints standings, optionally re-runs scoring against the checked-in predictions and fails if recorded != actual
- `tests/test_leaderboard.py` — 8 tests including a drift-detection canary that tampers with the recorded score and confirms `verify()` flags it
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
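The rescore-and-compare loop above can be sketched as follows. This is a minimal illustration under assumed details, not the PR's actual code: the prediction CSV columns (`case_id`, `truth`, `pred1`..`pred3`), the rounding, and the message wording are hypothetical; only the function names (`rescore`, `verify`), the gzipped-or-plain CSV handling, and the `top1`/`top3`/`n` fields come from the PR.

```python
import csv
import gzip
import io
from pathlib import Path


def _open_csv(path: Path) -> csv.DictReader:
    # Read gzipped or plain CSV transparently (the PR's loader does both);
    # detection via the gzip magic bytes is an assumption here.
    raw = path.read_bytes()
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return csv.DictReader(io.StringIO(raw.decode("utf-8")))


def rescore(pred_path: Path) -> dict:
    # Recompute top-1 / top-3 accuracy from the checked-in predictions.
    # Column names are illustrative, not the PR's schema.
    n = hit1 = hit3 = 0
    for row in _open_csv(pred_path):
        n += 1
        preds = [row["pred1"], row["pred2"], row["pred3"]]
        hit1 += preds[0] == row["truth"]
        hit3 += row["truth"] in preds
    return {"top1": round(hit1 / n, 4), "top3": round(hit3 / n, 4), "n": n}


def verify(entry: dict, pred_path: Path) -> list[str]:
    # Compare the recorded standings entry against a fresh rescore and
    # return one drift message per mismatched field (empty list = clean).
    actual = rescore(pred_path)
    return [
        f"{k} drift: recorded {entry[k]} != actual {actual[k]}"
        for k in ("top1", "top3", "n")
        if entry[k] != actual[k]
    ]
```

A caller (the CLI's `--verify` path, in this sketch) would exit non-zero whenever `verify` returns any messages.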
This was referenced May 1, 2026
protosphinx (Member, Author): Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.
Stacked on top of #3 (v0.1 fetch). Merge #2 → #3 → this in order.
Summary
- `leaderboard/next-event/synthetic-toy.json`, with the Markov reference baseline as the inaugural entry (top-1 0.9756, top-3 1.0, n=41).
- `pm-bench leaderboard <task> <dataset> [--verify]` pretty-prints the table and, with `--verify`, re-runs scoring on the checked-in predictions to catch drift.
- Reference predictions are checked in (`leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`) so the loop is reproducible offline.

What's new
- `pm_bench/leaderboard.py` — `load_board`, `rescore`, `verify`, `standings`. Pure CPython, reads gzipped or plain CSV. Truth dispatch is keyed on dataset name; today only `synthetic-toy` is wired (the dispatch grows a branch per pinned dataset).
- `pm-bench leaderboard <task> <dataset>` prints standings; `--verify` fails non-zero if recorded scores don't match a fresh rescore.
- `leaderboard/README.md` — submission convention; how to verify locally.
- `tests/test_leaderboard.py` — 8 tests, including a drift canary that tampers with `top1` in a tmp copy of the JSON and asserts `verify` flags it.

Why this matters
`pm-bench leaderboard --verify` on the changed files.

Smoke
Test plan
- `pytest -q` — 45 passed (was 37 on PR v0.1: dataset fetch + sha256 verification + cache resolution #3)
- `ruff check pm_bench tests` — clean
- Tampering with `top1` produces a `top1 drift` message via `verify()`

Roadmap impact
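As a footnote on the truth dispatch noted under "What's new" (keyed on dataset name, with only `synthetic-toy` wired today), a minimal sketch of how such a loader table grows a branch per pinned dataset — every identifier, case ID, and activity name below is illustrative, not from the PR:

```python
from typing import Callable


def _synthetic_toy_truth() -> dict[str, str]:
    # Placeholder: the real module would load the pinned synthetic-toy
    # ground truth; these case IDs and labels are made up for the sketch.
    return {"case-001": "start", "case-002": "approve"}


# One entry per pinned dataset; adding a dataset means adding a loader here.
_TRUTH_LOADERS: dict[str, Callable[[], dict[str, str]]] = {
    "synthetic-toy": _synthetic_toy_truth,
}


def truth_for(dataset: str) -> dict[str, str]:
    # Fail loudly for datasets that aren't wired yet, so verify can't
    # silently score against missing truth.
    try:
        return _TRUTH_LOADERS[dataset]()
    except KeyError:
        raise ValueError(f"no truth wired for dataset {dataset!r}") from None
```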