diff --git a/GOALS.md b/GOALS.md
index 8b88dae..0610c65 100644
--- a/GOALS.md
+++ b/GOALS.md
@@ -15,6 +15,14 @@ Be the default benchmark for new process-mining methods. Within 18 months,
 - End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes → predict →
   score, covered by `tests/test_e2e.py`
 
+## Leaderboard
+- Standings JSON format and reference Markov entry on `synthetic-toy`
+  shipped (`leaderboard/next-event/synthetic-toy.json`)
+- `pm-bench leaderboard --verify` re-scores entries to catch drift; a
+  test guards the Markov-ref score
+- Remaining: CI workflow that re-scores submission PRs (URL- or
+  in-repo-predictions flow)
+
 ## v1 success criteria
 - ≥3 external groups submit to the leaderboard
 - Cited in ≥5 papers
diff --git a/README.md b/README.md
index e369897..547eb73 100644
--- a/README.md
+++ b/README.md
@@ -222,7 +222,10 @@ honesty. The point of the benchmark is to make the comparison real.
   pending the one-time TOS-gated downloads from 4TU and Mendeley.
 - [ ] v0.2 — splits: next-event, remaining-time
 - [ ] v0.3 — scoring scripts for all 5 tasks
-- [ ] v0.4 — leaderboard CI + landing page
+- [🟡] v0.4 — leaderboard CI + landing page. Standings format,
+  reference Markov entry, and `pm-bench leaderboard --verify`
+  shipped (`leaderboard/next-event/synthetic-toy.json`); CI
+  workflow that re-scores submission PRs is the remaining piece.
 - [ ] v0.5 — baselines: `gnn`, transformer, LSTM, Markov ✅ (Markov shipped)
 - [ ] v1.0 — first external submissions; cited in ≥1 paper
 
diff --git a/STATUS.md b/STATUS.md
index 92518d0..46d8a02 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -4,7 +4,7 @@ _Last updated: 2026-04-30._
 
 ## Where we are
 
-Pre-v0. Two pieces shipped on top of v0.0:
+Pre-v0. Three pieces shipped on top of v0.0:
 
 1. The end-to-end loop runs on the bundled `synthetic-toy` dataset
    (split → prefixes → predict → score; Markov reference baseline
@@ -14,6 +14,13 @@ Pre-v0. Two pieces shipped on top of v0.0:
    sha256, and prints precise instructions for the TOS-gated download
    step on 4TU / Mendeley. `--pin` emits the `registry.yml` patch a
    contributor pastes into a PR after the manual download.
+3. The leaderboard scaffold is live: standings JSON under
+   `leaderboard//.json`, reference predictions checked
+   in under `leaderboard/predictions/...`, and `pm-bench leaderboard
+   [--verify]` re-scores entries to catch drift. The
+   Markov-ref entry on `synthetic-toy` is the first row, and a
+   determinism test in CI fails if the recorded score doesn't match a
+   fresh rescore.
 
 What's still left in v0.1 is purely a per-dataset operational task: do
 the one-time download, run `--pin`, open seven small PRs to pin the
@@ -53,6 +60,16 @@ pm-bench fetch bpi2020 --pin
 
 ## Recently shipped
 
+- **Leaderboard scaffold** (`leaderboard-scaffold` branch).
+  - `leaderboard/next-event/synthetic-toy.json` with the Markov
+    reference entry; predictions checked in under
+    `leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`.
+  - `pm_bench/leaderboard.py` — load/rescore/verify/standings, all
+    pure CPython; reads gzipped or plain CSV.
+  - CLI: `pm-bench leaderboard [--verify]` —
+    pretty-prints standings, optionally re-runs scoring.
+  - 8 new tests, including a drift-detection canary that tampers with
+    the recorded score and confirms `verify` flags it.
 - **v0.1 fetch + hash machinery** (`dataset-fetch` branch).
   - `pm_bench/cache.py` — cache root resolution (`$PM_BENCH_CACHE` →
     `~/.cache/pm-bench/`), per-dataset path with
diff --git a/leaderboard/README.md b/leaderboard/README.md
new file mode 100644
index 0000000..0df6604
--- /dev/null
+++ b/leaderboard/README.md
@@ -0,0 +1,42 @@
+# Leaderboard
+
+Standings JSON files live under `leaderboard//.json`.
+
+Each file describes:
+- the task and metric
+- the split convention (case-level chronological is the only blessed
+  one in v0)
+- a list of submission entries, each with: model name, version, code
+  link, predictions file, score, and timestamp
+
+Reference baselines that ship with `pm-bench` keep their predictions
+checked in under `leaderboard/predictions///.csv.gz`,
+so the loop is reproducible without hitting the network.
+
+## Verifying a leaderboard file
+
+```bash
+pm-bench leaderboard next-event synthetic-toy --verify
+```
+
+This re-scores every entry: it reads the file at `predictions_path`
+(relative to the repo root), scores it against the dataset's freshly
+extracted prefixes, and asserts that the recorded `score` matches what
+`pm_bench.score` produces today. Any drift fails loudly: pinned numbers
+must match the code that produced them.
+
+## Submitting
+
+Today (pre-v0): open a PR adding a new entry to the relevant JSON
+file, with your `predictions.csv.gz` checked in under
+`leaderboard/predictions/...`. Once the leaderboard CI is wired
+(v0.4) submissions will move to a URL-based flow where CI fetches the
+predictions and fills in the score.
+
+## Score convention
+
+For `next-event`, the score block carries `top1`, `top3`, and `n` (the
+number of (case, prefix_idx) targets scored). `top1` and `top3` are
+floats in `[0, 1]`, higher is better; `n` is an integer count. `n` makes
+split sizes auditable across entries: if your `n` differs from the
+reference, you used a different split.
diff --git a/leaderboard/next-event/synthetic-toy.json b/leaderboard/next-event/synthetic-toy.json
new file mode 100644
index 0000000..110c50a
--- /dev/null
+++ b/leaderboard/next-event/synthetic-toy.json
@@ -0,0 +1,27 @@
+{
+  "task": "next-event",
+  "dataset": "synthetic-toy",
+  "metric": "top1 / top3 accuracy",
+  "scored_with": "pm_bench.score.score_next_event",
+  "split": {
+    "kind": "case-chrono",
+    "train_frac": 0.7,
+    "val_frac": 0.1
+  },
+  "entries": [
+    {
+      "model": "markov-ref",
+      "version": "0.1.0",
+      "predictions_path": "leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz",
+      "code": "https://github.com/erphq/pm-bench/blob/main/pm_bench/baselines/markov.py",
+      "paper": null,
+      "score": {
+        "top1": 0.975609756097561,
+        "top3": 1.0,
+        "n": 41
+      },
+      "scored_at": "2026-04-30T00:00:00Z",
+      "notes": "First-order Markov reference baseline shipped with pm-bench. Trained on the train partition only; falls back to unigram for unseen last-activities. The floor any 'real' sequence model has to clear."
+    }
+  ]
+}
diff --git a/leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz b/leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz
new file mode 100644
index 0000000..45d77cf
Binary files /dev/null and b/leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz differ
diff --git a/pm_bench/cli.py b/pm_bench/cli.py
index bf45ff7..941bea3 100644
--- a/pm_bench/cli.py
+++ b/pm_bench/cli.py
@@ -14,6 +14,7 @@
     ensure_cached,
     sha256_file,
 )
+from pm_bench.leaderboard import load_board, standings, verify
 from pm_bench.predictions import read_predictions_csv, write_predictions_csv
 from pm_bench.prefixes import extract_prefixes, read_prefixes_csv, write_prefixes_csv
 from pm_bench.registry import get_dataset, load_registry
@@ -290,5 +291,52 @@ def score(predictions_path: str, prefixes_path: str, task: str) -> None:
     )
 
 
+@main.command()
+@click.argument("task")
+@click.argument("dataset")
+@click.option(
+    "--verify",
+    "do_verify",
+    is_flag=True,
+    default=False,
+    help="Re-score every entry and fail if recorded scores have drifted.",
+)
+@click.option(
+    "--repo-root",
+    type=click.Path(exists=True, file_okay=False),
+    default=".",
+    show_default=True,
+    help="Repo root that `predictions_path` entries are relative to.",
+)
+def leaderboard(task: str, dataset: str, do_verify: bool, repo_root: str) -> None:
+    """Print standings for a (task, dataset) pair, optionally rescoring."""
+    path = f"leaderboard/{task}/{dataset}.json"
+    full = f"{repo_root.rstrip('/')}/{path}"
+    try:
+        board = load_board(full)
+    except FileNotFoundError:
+        click.echo(f"no leaderboard at {path}", err=True)
+        sys.exit(1)
+
+    if do_verify:
+        drifts = verify(board, repo_root=repo_root)
+        if drifts:
+            for d in drifts:
+                click.echo(d, err=True)
+            sys.exit(2)
+        click.echo(f"verified {len(board.entries)} entr(ies) — no drift")
+
+    width = max((len(e.model) for e in board.entries), default=10)
+    click.echo(f"{board.task} · {board.dataset} · {board.metric}")
+    click.echo("-" * (width + 30))
+    for e in standings(board):
+        top1 = e.score.get("top1")
+        top3 = e.score.get("top3")
+        n = e.score.get("n")
+        click.echo(
+            f"{e.model:<{width}} top1={top1:.4f} top3={top3:.4f} n={n}"
+        )
+
+
 if __name__ == "__main__":
     main()
diff --git a/pm_bench/leaderboard.py b/pm_bench/leaderboard.py
new file mode 100644
index 0000000..1446808
--- /dev/null
+++ b/pm_bench/leaderboard.py
@@ -0,0 +1,159 @@
+"""Leaderboard loading + verification.
+
+The on-disk format lives under `leaderboard//.json`.
+Reference predictions ship next to the JSON under
+`leaderboard/predictions///.csv[.gz]`. Reading
++ rescoring is pure Python — no torch, no network, deterministic.
+
+Score drift is the only failure mode: the recorded `score` must match
+what `pm_bench.score.score_next_event` produces today against the
+checked-in predictions and the freshly-extracted prefixes for the
+named dataset. If the model code changes the numbers, the leaderboard
+file changes alongside it — no exceptions.
+"""
+from __future__ import annotations
+
+import csv
+import gzip
+import json
+import math
+from collections.abc import Iterable
+from dataclasses import dataclass
+from pathlib import Path
+
+from pm_bench.predictions import Prediction
+from pm_bench.prefixes import PREFIX_SEP, Prefix, extract_prefixes
+from pm_bench.score import score_next_event
+
+
+@dataclass(frozen=True)
+class Entry:
+    model: str
+    version: str
+    predictions_path: str
+    score: dict
+    code: str | None = None
+    paper: str | None = None
+    scored_at: str | None = None
+    notes: str | None = None
+
+
+@dataclass(frozen=True)
+class Board:
+    task: str
+    dataset: str
+    metric: str
+    entries: list[Entry]
+    raw: dict
+
+
+def load_board(path: str | Path) -> Board:
+    """Load a leaderboard JSON file."""
+    p = Path(path)
+    raw = json.loads(p.read_text())
+    entries = [
+        Entry(
+            model=e["model"],
+            version=e["version"],
+            predictions_path=e["predictions_path"],
+            score=e["score"],
+            code=e.get("code"),
+            paper=e.get("paper"),
+            scored_at=e.get("scored_at"),
+            notes=e.get("notes"),
+        )
+        for e in raw["entries"]
+    ]
+    return Board(
+        task=raw["task"],
+        dataset=raw["dataset"],
+        metric=raw["metric"],
+        entries=entries,
+        raw=raw,
+    )
+
+
+def _open_predictions(path: Path) -> Iterable[Prediction]:
+    """Yield Prediction rows from a (gzipped or plain) CSV file."""
+    opener = gzip.open if str(path).endswith(".gz") else open
+    with opener(path, "rt", newline="") as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            ranked_str = row["predictions"]
+            ranked = tuple(ranked_str.split(PREFIX_SEP)) if ranked_str else ()
+            yield Prediction(
+                case_id=row["case_id"],
+                prefix_idx=int(row["prefix_idx"]),
+                ranked=ranked,
+            )
+
+
+def _truth_for_dataset(name: str) -> list[Prefix]:
+    """Build the canonical truth set for a known dataset.
+
+    Today only `synthetic-toy` is supported — once a real dataset is
+    pinned this dispatch grows a branch per dataset, gated on the cached
+    file's sha256.
+    """
+    if name == "synthetic-toy":
+        from pm_bench import _synth
+        from pm_bench.split import case_chrono_split
+
+        events = list(_synth.synthetic_log())
+        s = case_chrono_split(events)
+        return list(extract_prefixes(events, s.test))
+    raise ValueError(
+        f"truth for dataset {name!r} not yet wired; pin a registry hash "
+        "and add the dispatch branch"
+    )
+
+
+def rescore(board: Board, repo_root: str | Path = ".") -> list[tuple[Entry, dict]]:
+    """Re-run scoring for every entry; return (entry, fresh_score) pairs."""
+    if board.task != "next-event":
+        raise ValueError(f"rescore only supports next-event today (got {board.task})")
+    truth = _truth_for_dataset(board.dataset)
+    truth_keys = [(t.case_id, t.prefix_idx) for t in truth]
+    truth_next = [t.true_next for t in truth]
+
+    out: list[tuple[Entry, dict]] = []
+    for entry in board.entries:
+        pred_path = Path(repo_root) / entry.predictions_path
+        pred_lookup = {
+            (p.case_id, p.prefix_idx): list(p.ranked)
+            for p in _open_predictions(pred_path)
+        }
+        missing = [k for k in truth_keys if k not in pred_lookup]
+        if missing:
+            raise ValueError(
+                f"{entry.model}: predictions missing {len(missing)} target(s); "
+                f"first missing {missing[0]}"
+            )
+        ranked = [pred_lookup[k] for k in truth_keys]
+        s = score_next_event(ranked, truth_next)
+        out.append(
+            (entry, {"top1": s.top1, "top3": s.top3, "n": s.n}),
+        )
+    return out
+
+
+def verify(board: Board, repo_root: str | Path = ".", *, tol: float = 1e-9) -> list[str]:
+    """Return a list of human-readable drift messages (empty = clean)."""
+    drifts: list[str] = []
+    for entry, fresh in rescore(board, repo_root=repo_root):
+        for k in ("top1", "top3", "n"):
+            recorded = entry.score.get(k)
+            actual = fresh[k]
+            ok = recorded == actual if isinstance(actual, int) else (
+                recorded is not None and math.isclose(recorded, actual, abs_tol=tol)
+            )
+            if not ok:
+                drifts.append(
+                    f"{entry.model}: {k} drift — recorded={recorded} actual={actual}"
+                )
+    return drifts
+
+
+def standings(board: Board, *, key: str = "top1") -> list[Entry]:
+    """Return entries sorted by the given score key, descending."""
+    return sorted(board.entries, key=lambda e: e.score.get(key, float("-inf")), reverse=True)
diff --git a/tests/test_leaderboard.py b/tests/test_leaderboard.py
new file mode 100644
index 0000000..065d880
--- /dev/null
+++ b/tests/test_leaderboard.py
@@ -0,0 +1,104 @@
+"""Leaderboard determinism + drift detection.
+
+The Markov reference entry under leaderboard/next-event/synthetic-toy.json
+is the canary: if its recorded score doesn't match what we re-compute
+today, either the Markov code changed or the entry is stale. Either
+way, contributors should know.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from click.testing import CliRunner
+
+from pm_bench.cli import main
+from pm_bench.leaderboard import load_board, rescore, standings, verify
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+BOARD_PATH = REPO_ROOT / "leaderboard" / "next-event" / "synthetic-toy.json"
+
+
+def test_synthetic_toy_board_loads() -> None:
+    board = load_board(BOARD_PATH)
+    assert board.task == "next-event"
+    assert board.dataset == "synthetic-toy"
+    assert len(board.entries) >= 1
+
+
+def test_markov_ref_has_no_score_drift() -> None:
+    """Recorded score must match a fresh rescore — guards model code."""
+    board = load_board(BOARD_PATH)
+    drifts = verify(board, repo_root=REPO_ROOT)
+    assert drifts == [], drifts
+
+
+def test_rescore_returns_one_pair_per_entry() -> None:
+    board = load_board(BOARD_PATH)
+    pairs = rescore(board, repo_root=REPO_ROOT)
+    assert len(pairs) == len(board.entries)
+    for _entry, fresh in pairs:
+        assert "top1" in fresh and "top3" in fresh and "n" in fresh
+        assert 0.0 <= fresh["top1"] <= 1.0
+
+
+def test_standings_orders_by_top1_desc() -> None:
+    board = load_board(BOARD_PATH)
+    s = standings(board)
+    scores = [e.score["top1"] for e in s]
+    assert scores == sorted(scores, reverse=True)
+
+
+def test_verify_detects_drift(tmp_path) -> None:
+    """A tampered score must surface as a drift message."""
+    src = json.loads(BOARD_PATH.read_text())
+    src["entries"][0]["score"]["top1"] = 0.111
+    fake = tmp_path / "fake.json"
+    fake.write_text(json.dumps(src))
+    # The tampered file still references the real predictions; we point
+    # repo_root at the real repo so the predictions resolve.
+    board = load_board(fake)
+    drifts = verify(board, repo_root=REPO_ROOT)
+    assert any("top1 drift" in d for d in drifts)
+
+
+def test_cli_leaderboard_verify_passes() -> None:
+    """`pm-bench leaderboard ... --verify` must exit 0 and report 'no drift'."""
+    runner = CliRunner()
+    r = runner.invoke(
+        main,
+        ["leaderboard", "next-event", "synthetic-toy", "--verify", "--repo-root", str(REPO_ROOT)],
+    )
+    assert r.exit_code == 0, r.output
+    assert "no drift" in r.output
+
+
+def test_cli_leaderboard_missing_returns_nonzero(tmp_path) -> None:
+    """Asking for a nonexistent (task, dataset) pair must error cleanly."""
+    # Lay out a partial repo with no leaderboard file.
+    (tmp_path / "leaderboard" / "next-event").mkdir(parents=True)
+    runner = CliRunner()
+    r = runner.invoke(
+        main,
+        ["leaderboard", "next-event", "no-such-dataset", "--repo-root", str(tmp_path)],
+    )
+    assert r.exit_code == 1
+    assert "no leaderboard at" in r.output
+
+
+def test_predictions_file_is_readable_gz() -> None:
+    """The reference predictions must be a real gzip — not a placeholder."""
+    p = (
+        REPO_ROOT
+        / "leaderboard"
+        / "predictions"
+        / "next-event"
+        / "synthetic-toy"
+        / "markov-ref.csv.gz"
+    )
+    assert p.exists()
+    # First two bytes of a gzip file are 0x1f 0x8b.
+    head = p.read_bytes()[:2]
+    assert head == b"\x1f\x8b"
+
+