8 changes: 8 additions & 0 deletions GOALS.md
@@ -15,6 +15,14 @@ Be the default benchmark for new process-mining methods. Within 18 months,
- End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes →
predict → score, covered by `tests/test_e2e.py`

## Leaderboard
- Standings JSON format and reference Markov entry on `synthetic-toy`
shipped (`leaderboard/next-event/synthetic-toy.json`)
- `pm-bench leaderboard --verify` re-scores entries to catch drift; a
test guards the Markov-ref score
- Remaining: CI workflow that re-scores submission PRs (URL- or
in-repo-predictions flow)

## v1 success criteria
- ≥3 external groups submit to the leaderboard
- Cited in ≥5 papers
5 changes: 4 additions & 1 deletion README.md
@@ -222,7 +222,10 @@ honesty. The point of the benchmark is to make the comparison real.
pending the one-time TOS-gated downloads from 4TU and Mendeley.
- [ ] v0.2 — splits: next-event, remaining-time
- [ ] v0.3 — scoring scripts for all 5 tasks
- [ ] v0.4 — leaderboard CI + landing page
- [🟡] v0.4 — leaderboard CI + landing page. Standings format,
reference Markov entry, and `pm-bench leaderboard --verify`
shipped (`leaderboard/next-event/synthetic-toy.json`); CI
workflow that re-scores submission PRs is the remaining piece.
- [ ] v0.5 — baselines: `gnn`, transformer, LSTM, Markov ✅ (Markov shipped)
- [ ] v1.0 — first external submissions; cited in ≥1 paper

19 changes: 18 additions & 1 deletion STATUS.md
@@ -4,7 +4,7 @@ _Last updated: 2026-04-30._

## Where we are

Pre-v0. Two pieces shipped on top of v0.0:
Pre-v0. Three pieces shipped on top of v0.0:

1. The end-to-end loop runs on the bundled `synthetic-toy` dataset
(split → prefixes → predict → score; Markov reference baseline
@@ -14,6 +14,13 @@ Pre-v0. Two pieces shipped on top of v0.0:
sha256, and prints precise instructions for the TOS-gated download
step on 4TU / Mendeley. `--pin` emits the `registry.yml` patch a
contributor pastes into a PR after the manual download.
3. The leaderboard scaffold is live: standings JSON under
`leaderboard/<task>/<dataset>.json`, reference predictions checked
in under `leaderboard/predictions/...`, and `pm-bench leaderboard
<task> <dataset> [--verify]` re-scores entries to catch drift. The
Markov-ref entry on `synthetic-toy` is the first row, and a
determinism test in CI fails if the recorded score doesn't match a
fresh rescore.
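
A minimal sketch of what that determinism check amounts to, using the
`load_board` / `verify` API added in this change (the shipped test may
differ in name and structure):

```python
# Hypothetical sketch of the CI determinism check; the real test may differ.
from pm_bench.leaderboard import load_board, verify


def test_markov_ref_score_has_not_drifted():
    board = load_board("leaderboard/next-event/synthetic-toy.json")
    # verify() re-scores every entry from its checked-in predictions and
    # returns human-readable drift messages; an empty list means the
    # recorded scores still match a fresh rescore.
    assert verify(board, repo_root=".") == []
```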

What's still left in v0.1 is purely a per-dataset operational task: do
the one-time download, run `--pin`, open seven small PRs to pin the
@@ -53,6 +60,16 @@ pm-bench fetch bpi2020 --pin

## Recently shipped

- **Leaderboard scaffold** (`leaderboard-scaffold` branch).
- `leaderboard/next-event/synthetic-toy.json` with the Markov
reference entry; predictions checked in under
`leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`.
- `pm_bench/leaderboard.py` — load/rescore/verify/standings, all
pure CPython; reads gzipped or plain CSV.
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` —
pretty-prints standings, optionally re-runs scoring.
- 8 new tests, including a drift-detection canary that tampers with
the recorded score and confirms `verify` flags it.
- **v0.1 fetch + hash machinery** (`dataset-fetch` branch).
- `pm_bench/cache.py` — cache root resolution
(`$PM_BENCH_CACHE` → `~/.cache/pm-bench/`), per-dataset path with
42 changes: 42 additions & 0 deletions leaderboard/README.md
@@ -0,0 +1,42 @@
# Leaderboard

Standings JSON files live under `leaderboard/<task>/<dataset>.json`.

Each file describes:
- the task and metric
- the split convention (case-level chronological is the only blessed
one in v0)
- a list of submission entries, each with: model name, version, code
link, predictions file, score, and timestamp

Reference baselines that ship with `pm-bench` keep their predictions
checked in under `leaderboard/predictions/<task>/<dataset>/<model>.csv.gz`,
so the loop is reproducible without hitting the network.
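
For programmatic access, the same files can be read through
`pm_bench.leaderboard`. A quick illustration of the Python surface,
using the reference board that ships in this repo:

```python
from pm_bench.leaderboard import load_board, standings

board = load_board("leaderboard/next-event/synthetic-toy.json")
print(board.task, board.dataset, board.metric)
for entry in standings(board):  # sorted by top1, descending
    print(entry.model, entry.version, entry.score["top1"], entry.score["n"])
```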

## Verifying a leaderboard file

```bash
pm-bench leaderboard next-event synthetic-toy --verify
```

This re-scores every entry: it reads the entry's `predictions_path`
(relative to the repo root), rebuilds the dataset's test prefixes, and
asserts the recorded `score` matches what `pm_bench.score` produces
today. Any drift fails loudly — pinned numbers must match the code that
produced them.
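
The same check is exposed as a function. A short sketch of the Python
equivalent of `--verify`, run from the repo root:

```python
from pm_bench.leaderboard import load_board, verify

board = load_board("leaderboard/next-event/synthetic-toy.json")
drifts = verify(board, repo_root=".")  # re-scores every entry
if drifts:
    # each message names the entry, the metric, and recorded vs. fresh value
    raise SystemExit("\n".join(drifts))
```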

## Submitting

Today (pre-v0): open a PR adding a new entry to the relevant JSON
file, with your `predictions.csv.gz` checked in under
`leaderboard/predictions/...`. Once the leaderboard CI is wired
(v0.4), submissions will move to a URL-based flow where CI fetches the
predictions and fills in the score.
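
One way to fill in the `score` block before opening the PR is to add
your entry with placeholder numbers and let the same machinery that
`--verify` uses compute the real ones. A sketch (the model name and
placeholder values below are hypothetical):

```python
# Sketch: compute the numbers to record for a new entry before opening the PR.
# Assumes your entry (with placeholder scores) and predictions file are
# already added under leaderboard/.
from pm_bench.leaderboard import load_board, rescore

board = load_board("leaderboard/next-event/synthetic-toy.json")
for entry, fresh in rescore(board, repo_root="."):
    if entry.model == "my-model":  # hypothetical entry name
        print(fresh)  # e.g. {"top1": ..., "top3": ..., "n": ...}
```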

## Score convention

For `next-event`, the score block carries `top1`, `top3`, and `n` (the
number of (case, prefix_idx) targets scored). `top1` and `top3` are
floats in `[0, 1]`; higher is better. `n` is an integer that makes
split sizes auditable across entries — if your `n` differs from the
reference, you used a different split.
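
As a small illustration of the convention (the shipped scorer is
`pm_bench.score.score_next_event`; the activity names below are made
up):

```python
# Toy illustration of top-k accuracy over ranked next-activity predictions.
def topk_accuracy(ranked, truth, k):
    hits = sum(1 for preds, t in zip(ranked, truth) if t in preds[:k])
    return hits / len(truth)

ranked = [["ship", "pay"], ["pay", "close", "ship"], ["close"]]
truth = ["ship", "ship", "close"]
print(topk_accuracy(ranked, truth, 1))  # top1 = 2/3
print(topk_accuracy(ranked, truth, 3))  # top3 = 3/3, with n = len(truth) = 3
```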
27 changes: 27 additions & 0 deletions leaderboard/next-event/synthetic-toy.json
@@ -0,0 +1,27 @@
{
"task": "next-event",
"dataset": "synthetic-toy",
"metric": "top1 / top3 accuracy",
"scored_with": "pm_bench.score.score_next_event",
"split": {
"kind": "case-chrono",
"train_frac": 0.7,
"val_frac": 0.1
},
"entries": [
{
"model": "markov-ref",
"version": "0.1.0",
"predictions_path": "leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz",
"code": "https://github.com/erphq/pm-bench/blob/main/pm_bench/baselines/markov.py",
"paper": null,
"score": {
"top1": 0.975609756097561,
"top3": 1.0,
"n": 41
},
"scored_at": "2026-04-30T00:00:00Z",
"notes": "First-order Markov reference baseline shipped with pm-bench. Trained on the train partition only; falls back to unigram for unseen last-activities. The floor any 'real' sequence model has to clear."
}
]
}
Binary file not shown (`leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`).
48 changes: 48 additions & 0 deletions pm_bench/cli.py
@@ -14,6 +14,7 @@
ensure_cached,
sha256_file,
)
from pm_bench.leaderboard import load_board, standings, verify
from pm_bench.predictions import read_predictions_csv, write_predictions_csv
from pm_bench.prefixes import extract_prefixes, read_prefixes_csv, write_prefixes_csv
from pm_bench.registry import get_dataset, load_registry
@@ -290,5 +291,52 @@ def score(predictions_path: str, prefixes_path: str, task: str) -> None:
)


@main.command()
@click.argument("task")
@click.argument("dataset")
@click.option(
"--verify",
"do_verify",
is_flag=True,
default=False,
help="Re-score every entry and fail if recorded scores have drifted.",
)
@click.option(
"--repo-root",
type=click.Path(exists=True, file_okay=False),
default=".",
show_default=True,
help="Repo root that `predictions_path` entries are relative to.",
)
def leaderboard(task: str, dataset: str, do_verify: bool, repo_root: str) -> None:
"""Print standings for a (task, dataset) pair, optionally rescoring."""
path = f"leaderboard/{task}/{dataset}.json"
full = f"{repo_root.rstrip('/')}/{path}"
try:
board = load_board(full)
except FileNotFoundError:
click.echo(f"no leaderboard at {path}", err=True)
sys.exit(1)

if do_verify:
drifts = verify(board, repo_root=repo_root)
if drifts:
for d in drifts:
click.echo(d, err=True)
sys.exit(2)
click.echo(f"verified {len(board.entries)} entr(ies) — no drift")

width = max((len(e.model) for e in board.entries), default=10)
click.echo(f"{board.task} · {board.dataset} · {board.metric}")
click.echo("-" * (width + 30))
for e in standings(board):
top1 = e.score.get("top1")
top3 = e.score.get("top3")
n = e.score.get("n")
click.echo(
f"{e.model:<{width}} top1={top1:.4f} top3={top3:.4f} n={n}"
)


if __name__ == "__main__":
main()
159 changes: 159 additions & 0 deletions pm_bench/leaderboard.py
@@ -0,0 +1,159 @@
"""Leaderboard loading + verification.

The on-disk format lives under `leaderboard/<task>/<dataset>.json`.
Reference predictions ship next to the JSON under
`leaderboard/predictions/<task>/<dataset>/<model>.csv[.gz]`. Reading
+ rescoring is pure Python — no torch, no network, deterministic.

Score drift is the only failure mode: the recorded `score` must match
what `pm_bench.score.score_next_event` produces today against the
checked-in predictions and the freshly-extracted prefixes for the
named dataset. If the model code changes the numbers, the leaderboard
file changes alongside it — no exceptions.
"""
from __future__ import annotations

import csv
import gzip
import json
import math
from collections.abc import Iterable
from dataclasses import dataclass
from pathlib import Path

from pm_bench.predictions import Prediction
from pm_bench.prefixes import PREFIX_SEP, Prefix, extract_prefixes
from pm_bench.score import score_next_event


@dataclass(frozen=True)
class Entry:
model: str
version: str
predictions_path: str
score: dict
code: str | None = None
paper: str | None = None
scored_at: str | None = None
notes: str | None = None


@dataclass(frozen=True)
class Board:
task: str
dataset: str
metric: str
entries: list[Entry]
raw: dict


def load_board(path: str | Path) -> Board:
"""Load a leaderboard JSON file."""
p = Path(path)
raw = json.loads(p.read_text())
entries = [
Entry(
model=e["model"],
version=e["version"],
predictions_path=e["predictions_path"],
score=e["score"],
code=e.get("code"),
paper=e.get("paper"),
scored_at=e.get("scored_at"),
notes=e.get("notes"),
)
for e in raw["entries"]
]
return Board(
task=raw["task"],
dataset=raw["dataset"],
metric=raw["metric"],
entries=entries,
raw=raw,
)


def _open_predictions(path: Path) -> Iterable[Prediction]:
"""Yield Prediction rows from a (gzipped or plain) CSV file."""
opener = gzip.open if str(path).endswith(".gz") else open
with opener(path, "rt", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
ranked_str = row["predictions"]
ranked = tuple(ranked_str.split(PREFIX_SEP)) if ranked_str else ()
yield Prediction(
case_id=row["case_id"],
prefix_idx=int(row["prefix_idx"]),
ranked=ranked,
)


def _truth_for_dataset(name: str) -> list[Prefix]:
"""Build the canonical truth set for a known dataset.

Today only `synthetic-toy` is supported — once a real dataset is
pinned this dispatch grows a branch per dataset, gated on the cached
file's sha256.
"""
if name == "synthetic-toy":
from pm_bench import _synth
from pm_bench.split import case_chrono_split

events = list(_synth.synthetic_log())
s = case_chrono_split(events)
return list(extract_prefixes(events, s.test))
raise ValueError(
f"truth for dataset {name!r} not yet wired; pin a registry hash "
"and add the dispatch branch"
)


def rescore(board: Board, repo_root: str | Path = ".") -> list[tuple[Entry, dict]]:
"""Re-run scoring for every entry; return (entry, fresh_score) pairs."""
if board.task != "next-event":
raise ValueError(f"rescore only supports next-event today (got {board.task})")
truth = _truth_for_dataset(board.dataset)
truth_keys = [(t.case_id, t.prefix_idx) for t in truth]
truth_next = [t.true_next for t in truth]

out: list[tuple[Entry, dict]] = []
for entry in board.entries:
pred_path = Path(repo_root) / entry.predictions_path
pred_lookup = {
(p.case_id, p.prefix_idx): list(p.ranked)
for p in _open_predictions(pred_path)
}
missing = [k for k in truth_keys if k not in pred_lookup]
if missing:
raise ValueError(
f"{entry.model}: predictions missing {len(missing)} target(s); "
f"first missing {missing[0]}"
)
ranked = [pred_lookup[k] for k in truth_keys]
s = score_next_event(ranked, truth_next)
out.append(
(entry, {"top1": s.top1, "top3": s.top3, "n": s.n}),
)
return out


def verify(board: Board, repo_root: str | Path = ".", *, tol: float = 1e-9) -> list[str]:
"""Return a list of human-readable drift messages (empty = clean)."""
drifts: list[str] = []
for entry, fresh in rescore(board, repo_root=repo_root):
for k in ("top1", "top3", "n"):
recorded = entry.score.get(k)
actual = fresh[k]
ok = recorded == actual if isinstance(actual, int) else (
recorded is not None and math.isclose(recorded, actual, abs_tol=tol)
)
if not ok:
drifts.append(
f"{entry.model}: {k} drift — recorded={recorded} actual={actual}"
)
return drifts


def standings(board: Board, *, key: str = "top1") -> list[Entry]:
"""Return entries sorted by the given score key, descending."""
return sorted(board.entries, key=lambda e: e.score.get(key, float("-inf")), reverse=True)