8 changes: 8 additions & 0 deletions GOALS.md
@@ -15,6 +15,14 @@ Be the default benchmark for new process-mining methods. Within 18 months,
- End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes →
predict → score, covered by `tests/test_e2e.py`

## Leaderboard
- Standings JSON format and reference Markov entry on `synthetic-toy`
shipped (`leaderboard/next-event/synthetic-toy.json`)
- `pm-bench leaderboard --verify` re-scores entries to catch drift; a
test guards the Markov-ref score
- Remaining: CI workflow that re-scores submission PRs (URL- or
in-repo-predictions flow)

## v1 success criteria
- ≥3 external groups submit to the leaderboard
- Cited in ≥5 papers
5 changes: 4 additions & 1 deletion README.md
@@ -222,7 +222,10 @@ honesty. The point of the benchmark is to make the comparison real.
pending the one-time TOS-gated downloads from 4TU and Mendeley.
- [ ] v0.2 — splits: next-event, remaining-time
- [ ] v0.3 — scoring scripts for all 5 tasks
- [ ] v0.4 — leaderboard CI + landing page
- [🟡] v0.4 — leaderboard CI + landing page. Standings format,
reference Markov entry, and `pm-bench leaderboard --verify`
shipped (`leaderboard/next-event/synthetic-toy.json`); CI
workflow that re-scores submission PRs is the remaining piece.
- [ ] v0.5 — baselines: `gnn`, transformer, LSTM, Markov ✅ (Markov shipped)
- [ ] v1.0 — first external submissions; cited in ≥1 paper

19 changes: 18 additions & 1 deletion STATUS.md
@@ -4,7 +4,7 @@ _Last updated: 2026-04-30._

## Where we are

Pre-v0. Two pieces shipped on top of v0.0:
Pre-v0. Three pieces shipped on top of v0.0:

1. The end-to-end loop runs on the bundled `synthetic-toy` dataset
(split → prefixes → predict → score; Markov reference baseline
@@ -14,6 +14,13 @@ Pre-v0. Two pieces shipped on top of v0.0:
sha256, and prints precise instructions for the TOS-gated download
step on 4TU / Mendeley. `--pin` emits the `registry.yml` patch a
contributor pastes into a PR after the manual download.
3. The leaderboard scaffold is live: standings JSON under
`leaderboard/<task>/<dataset>.json`, reference predictions checked
in under `leaderboard/predictions/...`, and `pm-bench leaderboard
<task> <dataset> [--verify]` re-scores entries to catch drift. The
Markov-ref entry on `synthetic-toy` is the first row, and a
determinism test in CI fails if the recorded score doesn't match a
fresh rescore.
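
A minimal sketch of what that determinism check amounts to, using the
`load_board` / `verify` API added in this change (the shipped test may
differ in name and structure):

```python
# Hypothetical sketch of the CI determinism check; the real test may differ.
from pm_bench.leaderboard import load_board, verify


def test_markov_ref_score_has_not_drifted():
    board = load_board("leaderboard/next-event/synthetic-toy.json")
    # verify() re-scores every entry from its checked-in predictions and
    # returns human-readable drift messages; an empty list means the
    # recorded scores still match a fresh rescore.
    assert verify(board, repo_root=".") == []
```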

What's still left in v0.1 is purely a per-dataset operational task: do
the one-time download, run `--pin`, open seven small PRs to pin the
@@ -53,6 +60,16 @@ pm-bench fetch bpi2020 --pin

## Recently shipped

- **Leaderboard scaffold** (`leaderboard-scaffold` branch).
- `leaderboard/next-event/synthetic-toy.json` with the Markov
reference entry; predictions checked in under
`leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`.
- `pm_bench/leaderboard.py` — load/rescore/verify/standings, all
pure CPython; reads gzipped or plain CSV.
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` —
pretty-prints standings, optionally re-runs scoring.
- 8 new tests, including a drift-detection canary that tampers with
the recorded score and confirms `verify` flags it.
- **v0.1 fetch + hash machinery** (`dataset-fetch` branch).
- `pm_bench/cache.py` — cache root resolution
(`$PM_BENCH_CACHE` → `~/.cache/pm-bench/`), per-dataset path with
42 changes: 42 additions & 0 deletions leaderboard/README.md
@@ -0,0 +1,42 @@
# Leaderboard

Standings JSON files live under `leaderboard/<task>/<dataset>.json`.

Each file describes:
- the task and metric
- the split convention (case-level chronological is the only blessed
one in v0)
- a list of submission entries, each with: model name, version, code
link, predictions file, score, and timestamp

Reference baselines that ship with `pm-bench` keep their predictions
checked in under `leaderboard/predictions/<task>/<dataset>/<model>.csv.gz`,
so the loop is reproducible without hitting the network.
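
For programmatic access, the same files can be read through
`pm_bench.leaderboard`. A quick illustration of the Python surface,
using the reference board that ships in this repo:

```python
from pm_bench.leaderboard import load_board, standings

board = load_board("leaderboard/next-event/synthetic-toy.json")
print(board.task, board.dataset, board.metric)
for entry in standings(board):  # sorted by top1, descending
    print(entry.model, entry.version, entry.score["top1"], entry.score["n"])
```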

## Verifying a leaderboard file

```bash
pm-bench leaderboard next-event synthetic-toy --verify
```

This re-scores every entry: it reads the entry's `predictions_path`
(relative to the repo root), rebuilds the dataset's test prefixes, and
asserts the recorded `score` matches what `pm_bench.score` produces
today. Any drift fails loudly — pinned numbers must match the code that
produced them.
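
The same check is exposed as a function. A short sketch of the Python
equivalent of `--verify`, run from the repo root:

```python
from pm_bench.leaderboard import load_board, verify

board = load_board("leaderboard/next-event/synthetic-toy.json")
drifts = verify(board, repo_root=".")  # re-scores every entry
if drifts:
    # each message names the entry, the metric, and recorded vs. fresh value
    raise SystemExit("\n".join(drifts))
```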

## Submitting

Today (pre-v0): open a PR adding a new entry to the relevant JSON
file, with your `predictions.csv.gz` checked in under
`leaderboard/predictions/...`. Once the leaderboard CI is wired
(v0.4), submissions will move to a URL-based flow where CI fetches the
predictions and fills in the score.
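
One way to fill in the `score` block before opening the PR is to add
your entry with placeholder numbers and let the same machinery that
`--verify` uses compute the real ones. A sketch (the model name and
placeholder values below are hypothetical):

```python
# Sketch: compute the numbers to record for a new entry before opening the PR.
# Assumes your entry (with placeholder scores) and predictions file are
# already added under leaderboard/.
from pm_bench.leaderboard import load_board, rescore

board = load_board("leaderboard/next-event/synthetic-toy.json")
for entry, fresh in rescore(board, repo_root="."):
    if entry.model == "my-model":  # hypothetical entry name
        print(fresh)  # e.g. {"top1": ..., "top3": ..., "n": ...}
```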

## Score convention

For `next-event`, the score block carries `top1`, `top3`, and `n` (the
number of (case, prefix_idx) targets scored). `top1` and `top3` are
floats in `[0, 1]`; higher is better. `n` is an integer that makes
split sizes auditable across entries — if your `n` differs from the
reference, you used a different split.
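
As a small illustration of the convention (the shipped scorer is
`pm_bench.score.score_next_event`; the activity names below are made
up):

```python
# Toy illustration of top-k accuracy over ranked next-activity predictions.
def topk_accuracy(ranked, truth, k):
    hits = sum(1 for preds, t in zip(ranked, truth) if t in preds[:k])
    return hits / len(truth)

ranked = [["ship", "pay"], ["pay", "close", "ship"], ["close"]]
truth = ["ship", "ship", "close"]
print(topk_accuracy(ranked, truth, 1))  # top1 = 2/3
print(topk_accuracy(ranked, truth, 3))  # top3 = 3/3, with n = len(truth) = 3
```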
27 changes: 27 additions & 0 deletions leaderboard/next-event/synthetic-toy.json
@@ -0,0 +1,27 @@
{
"task": "next-event",
"dataset": "synthetic-toy",
"metric": "top1 / top3 accuracy",
"scored_with": "pm_bench.score.score_next_event",
"split": {
"kind": "case-chrono",
"train_frac": 0.7,
"val_frac": 0.1
},
"entries": [
{
"model": "markov-ref",
"version": "0.1.0",
"predictions_path": "leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz",
"code": "https://github.com/erphq/pm-bench/blob/main/pm_bench/baselines/markov.py",
"paper": null,
"score": {
"top1": 0.975609756097561,
"top3": 1.0,
"n": 41
},
"scored_at": "2026-04-30T00:00:00Z",
"notes": "First-order Markov reference baseline shipped with pm-bench. Trained on the train partition only; falls back to unigram for unseen last-activities. The floor any 'real' sequence model has to clear."
}
]
}
Binary file not shown (`leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz`).
48 changes: 48 additions & 0 deletions pm_bench/cli.py
@@ -14,6 +14,7 @@
ensure_cached,
sha256_file,
)
from pm_bench.leaderboard import load_board, standings, verify
from pm_bench.predictions import read_predictions_csv, write_predictions_csv
from pm_bench.prefixes import extract_prefixes, read_prefixes_csv, write_prefixes_csv
from pm_bench.registry import get_dataset, load_registry
@@ -290,5 +291,52 @@ def score(predictions_path: str, prefixes_path: str, task: str) -> None:
)


@main.command()
@click.argument("task")
@click.argument("dataset")
@click.option(
"--verify",
"do_verify",
is_flag=True,
default=False,
help="Re-score every entry and fail if recorded scores have drifted.",
)
@click.option(
"--repo-root",
type=click.Path(exists=True, file_okay=False),
default=".",
show_default=True,
help="Repo root that `predictions_path` entries are relative to.",
)
def leaderboard(task: str, dataset: str, do_verify: bool, repo_root: str) -> None:
"""Print standings for a (task, dataset) pair, optionally rescoring."""
path = f"leaderboard/{task}/{dataset}.json"
full = f"{repo_root.rstrip('/')}/{path}"
try:
board = load_board(full)
except FileNotFoundError:
click.echo(f"no leaderboard at {path}", err=True)
sys.exit(1)

if do_verify:
drifts = verify(board, repo_root=repo_root)
if drifts:
for d in drifts:
click.echo(d, err=True)
sys.exit(2)
click.echo(f"verified {len(board.entries)} entr(ies) — no drift")

width = max((len(e.model) for e in board.entries), default=10)
click.echo(f"{board.task} · {board.dataset} · {board.metric}")
click.echo("-" * (width + 30))
for e in standings(board):
top1 = e.score.get("top1")
top3 = e.score.get("top3")
n = e.score.get("n")
click.echo(
f"{e.model:<{width}} top1={top1:.4f} top3={top3:.4f} n={n}"
)


if __name__ == "__main__":
main()
159 changes: 159 additions & 0 deletions pm_bench/leaderboard.py
@@ -0,0 +1,159 @@
"""Leaderboard loading + verification.

The on-disk format lives under `leaderboard/<task>/<dataset>.json`.
Reference predictions ship next to the JSON under
`leaderboard/predictions/<task>/<dataset>/<model>.csv[.gz]`. Reading
+ rescoring is pure Python — no torch, no network, deterministic.

Score drift is the only failure mode: the recorded `score` must match
what `pm_bench.score.score_next_event` produces today against the
checked-in predictions and the freshly-extracted prefixes for the
named dataset. If the model code changes the numbers, the leaderboard
file changes alongside it — no exceptions.
"""
from __future__ import annotations

import csv
import gzip
import json
import math
from collections.abc import Iterable
from dataclasses import dataclass
from pathlib import Path

from pm_bench.predictions import Prediction
from pm_bench.prefixes import PREFIX_SEP, Prefix, extract_prefixes
from pm_bench.score import score_next_event


@dataclass(frozen=True)
class Entry:
model: str
version: str
predictions_path: str
score: dict
code: str | None = None
paper: str | None = None
scored_at: str | None = None
notes: str | None = None


@dataclass(frozen=True)
class Board:
task: str
dataset: str
metric: str
entries: list[Entry]
raw: dict


def load_board(path: str | Path) -> Board:
"""Load a leaderboard JSON file."""
p = Path(path)
raw = json.loads(p.read_text())
entries = [
Entry(
model=e["model"],
version=e["version"],
predictions_path=e["predictions_path"],
score=e["score"],
code=e.get("code"),
paper=e.get("paper"),
scored_at=e.get("scored_at"),
notes=e.get("notes"),
)
for e in raw["entries"]
]
return Board(
task=raw["task"],
dataset=raw["dataset"],
metric=raw["metric"],
entries=entries,
raw=raw,
)


def _open_predictions(path: Path) -> Iterable[Prediction]:
"""Yield Prediction rows from a (gzipped or plain) CSV file."""
opener = gzip.open if str(path).endswith(".gz") else open
with opener(path, "rt", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
ranked_str = row["predictions"]
ranked = tuple(ranked_str.split(PREFIX_SEP)) if ranked_str else ()
yield Prediction(
case_id=row["case_id"],
prefix_idx=int(row["prefix_idx"]),
ranked=ranked,
)


def _truth_for_dataset(name: str) -> list[Prefix]:
"""Build the canonical truth set for a known dataset.

Today only `synthetic-toy` is supported — once a real dataset is
pinned this dispatch grows a branch per dataset, gated on the cached
file's sha256.
"""
if name == "synthetic-toy":
from pm_bench import _synth
from pm_bench.split import case_chrono_split

events = list(_synth.synthetic_log())
s = case_chrono_split(events)
return list(extract_prefixes(events, s.test))
raise ValueError(
f"truth for dataset {name!r} not yet wired; pin a registry hash "
"and add the dispatch branch"
)


def rescore(board: Board, repo_root: str | Path = ".") -> list[tuple[Entry, dict]]:
"""Re-run scoring for every entry; return (entry, fresh_score) pairs."""
if board.task != "next-event":
raise ValueError(f"rescore only supports next-event today (got {board.task})")
truth = _truth_for_dataset(board.dataset)
truth_keys = [(t.case_id, t.prefix_idx) for t in truth]
truth_next = [t.true_next for t in truth]

out: list[tuple[Entry, dict]] = []
for entry in board.entries:
pred_path = Path(repo_root) / entry.predictions_path
pred_lookup = {
(p.case_id, p.prefix_idx): list(p.ranked)
for p in _open_predictions(pred_path)
}
missing = [k for k in truth_keys if k not in pred_lookup]
if missing:
raise ValueError(
f"{entry.model}: predictions missing {len(missing)} target(s); "
f"first missing {missing[0]}"
)
ranked = [pred_lookup[k] for k in truth_keys]
s = score_next_event(ranked, truth_next)
out.append(
(entry, {"top1": s.top1, "top3": s.top3, "n": s.n}),
)
return out


def verify(board: Board, repo_root: str | Path = ".", *, tol: float = 1e-9) -> list[str]:
"""Return a list of human-readable drift messages (empty = clean)."""
drifts: list[str] = []
for entry, fresh in rescore(board, repo_root=repo_root):
for k in ("top1", "top3", "n"):
recorded = entry.score.get(k)
actual = fresh[k]
ok = recorded == actual if isinstance(actual, int) else (
recorded is not None and math.isclose(recorded, actual, abs_tol=tol)
)
if not ok:
drifts.append(
f"{entry.model}: {k} drift — recorded={recorded} actual={actual}"
)
return drifts


def standings(board: Board, *, key: str = "top1") -> list[Entry]:
"""Return entries sorted by the given score key, descending."""
return sorted(board.entries, key=lambda e: e.score.get(key, float("-inf")), reverse=True)