v0.3: bottleneck task — NDCG@10 + mean-wait baseline + leaderboard by protosphinx · Pull Request #8 · erphq/pm-bench

protosphinx · 2026-05-01T04:21:34Z

Stacked on top of #7 (outcome). Merge order: #2 → #3 → #4 → #5 → #6 → #7 → this.

Summary

Adds bottleneck detection as the fourth task — per-transition wait-time ranking, NDCG@10. CLI is fully task-aware; the leaderboard scaffold now hosts 3 entries on synthetic-toy (next-event, remaining-time, bottleneck) plus the outcome pipeline.
Mean-wait baseline scores NDCG@10 0.9786 on synthetic-toy. Strong floor — ranking by training-mean already nails the ordering.
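
For readers unfamiliar with the idea, here is a minimal sketch of how a mean-wait baseline of this kind might work: average the observed wait per (a, b) transition on the training traces, and fall back to the observation-weighted global mean for transitions never seen in training. The function names, the trace layout (lists of `(activity, timestamp_seconds)` pairs), and the toy data are assumptions for illustration, not the code in pm_bench/baselines/mean_wait.py.

```python
from collections import defaultdict


def fit_mean_wait(traces):
    """Hypothetical helper: per-transition mean wait plus a global fallback.

    traces: iterable of lists of (activity, timestamp_seconds), time-ordered.
    """
    total = defaultdict(float)  # summed wait per (a, b)
    count = defaultdict(int)    # number of observations per (a, b)
    for trace in traces:
        # Consecutive event pairs form the transitions.
        for (a, t_a), (b, t_b) in zip(trace, trace[1:]):
            total[(a, b)] += t_b - t_a
            count[(a, b)] += 1
    means = {ab: total[ab] / count[ab] for ab in total}
    n = sum(count.values())
    # Total wait over total observations equals the observation-weighted
    # mean of the per-transition means.
    global_mean = sum(total.values()) / n if n else 0.0
    return means, global_mean


def predict_mean_wait(means, global_mean, transitions):
    """Score each requested transition; unseen ones get the global mean."""
    return {ab: means.get(ab, global_mean) for ab in transitions}


# Toy training data (made up for this sketch).
train = [
    [("A", 0.0), ("B", 10.0), ("C", 40.0)],
    [("A", 0.0), ("B", 20.0), ("C", 30.0)],
]
means, global_mean = fit_mean_wait(train)
preds = predict_mean_wait(
    means, global_mean, [("A", "B"), ("B", "C"), ("C", "D")]
)
```

In this toy run, ("C", "D") never appears in training, so it receives the global fallback rather than being dropped.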

What's new

score_bottleneck — pure-CPython NDCG@k. Predicted scores rank transitions; truth is the held-out per-(a,b) mean wait time. Missing predictions sink to the bottom (refusing to predict doesn't earn credit).
pm_bench/bottleneck.py — BottleneckTarget, BottleneckPrediction, extract_bottleneck_targets, CSV r/w. Per-transition shape (activity_a, activity_b, mean_wait_seconds, n_observations) — different from the per-prefix tasks, intentionally so.
pm_bench/baselines/mean_wait.py — train-mean-per-transition with observation-weighted global fallback. ~30 lines.
CLI dispatch — --task bottleneck, --baseline mean-wait. UsageError on mismatched (task, baseline) pairs, consistent with the other tasks.
leaderboard.py — _rescore_bottleneck and a new dispatch branch in rescore; standings now sorts by ndcg_at_k for bottleneck and auc for outcome (correctly higher-better for both).
CLI pm-bench leaderboard prints task-appropriate columns: mae_days for time, auc for outcome, ndcg@10 for bottleneck, top1/top3 for next-event.
7 new tests (test_bottleneck.py); 86 total, ruff clean.
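
To make the scoring rule concrete, here is a minimal sketch of NDCG@k with the "missing predictions sink to the bottom" behavior. It assumes linear gain (the held-out mean wait itself) and the standard log2 position discount; the PR does not state which gain function score_bottleneck uses, so treat the function name, signature, and toy values as illustrative only.

```python
import math


def ndcg_at_k(truth, preds, k=10):
    """Hypothetical sketch of NDCG@k over transitions.

    truth: {transition: held-out mean wait in seconds}
    preds: {transition: predicted score}; may omit transitions.
    """
    def dcg(order):
        # Linear gain with log2 discount (an assumption, see lead-in).
        return sum(truth[t] / math.log2(i + 2) for i, t in enumerate(order[:k]))

    # A missing prediction is treated as -inf, so it ranks below every
    # transition that was actually scored.
    ranked = sorted(truth, key=lambda t: preds.get(t, -math.inf), reverse=True)
    ideal = sorted(truth, key=truth.get, reverse=True)
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else 0.0


# Toy values, made up for this sketch.
truth = {("A", "B"): 30.0, ("B", "C"): 20.0, ("C", "D"): 10.0}
perfect = ndcg_at_k(truth, {("A", "B"): 3.0, ("B", "C"): 2.0, ("C", "D"): 1.0})
swapped = ndcg_at_k(truth, {("A", "B"): 2.0, ("B", "C"): 3.0, ("C", "D"): 1.0})
missing = ndcg_at_k(truth, {("B", "C"): 2.0, ("C", "D"): 1.0})
```

Here the slowest transition, ("A", "B"), is left unpredicted in the last call, so it lands in last place and the score drops below the perfect ranking.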

Smoke

$ pm-bench prefixes synthetic-toy --split split.json --out bt.csv --task bottleneck
wrote 6 prefixes to bt.csv (task=bottleneck partition=test)

$ pm-bench predict synthetic-toy --split split.json --prefixes bt.csv \
    --out bpreds.csv --baseline mean-wait --task bottleneck
wrote 6 predictions to bpreds.csv (task=bottleneck baseline=mean-wait)

$ pm-bench score bpreds.csv --prefixes bt.csv --task bottleneck
{ "task": "bottleneck", "ndcg_at_k": 0.9786..., "k": 10, "n_transitions": 6 }

$ pm-bench leaderboard --all --verify
bottleneck/synthetic-toy: OK — 1 entry(ies)
next-event/synthetic-toy: OK — 1 entry(ies)
remaining-time/synthetic-toy: OK — 1 entry(ies)

Test plan

pytest -q — 86 passed (was 73 on #7, "v0.3: outcome task — AUC scoring + prior baseline")
ruff check pm_bench tests — clean
NDCG math: perfect ranking, inverted ranking, missing-predictions, hand-checked 3-transition example
Drift canary on the new leaderboard entry runs through the existing --all --verify workflow
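
As an illustration of what a hand-checked 3-transition case can look like, the arithmetic below works through one. The wait times (30, 20, 10 seconds) and the linear-gain form of DCG are assumptions chosen for this sketch; they are not the fixture used in test_bottleneck.py.

```latex
% Truth (held-out mean waits): w_1 = 30, w_2 = 20, w_3 = 10.
% DCG with linear gain: \mathrm{DCG} = \sum_{i=1}^{k} \frac{w_{\pi(i)}}{\log_2(i+1)}
\begin{align*}
\mathrm{IDCG} &= \frac{30}{\log_2 2} + \frac{20}{\log_2 3} + \frac{10}{\log_2 4}
               \approx 30 + 12.62 + 5 = 47.62 \\
\mathrm{DCG}_{\text{top two swapped}} &= \frac{20}{\log_2 2} + \frac{30}{\log_2 3} + \frac{10}{\log_2 4}
               \approx 20 + 18.93 + 5 = 43.93 \\
\mathrm{NDCG}_{\text{top two swapped}} &\approx \frac{43.93}{47.62} \approx 0.9225 \\
\mathrm{DCG}_{\text{inverted}} &= \frac{10}{\log_2 2} + \frac{20}{\log_2 3} + \frac{30}{\log_2 4}
               \approx 10 + 12.62 + 15 = 37.62 \\
\mathrm{NDCG}_{\text{inverted}} &\approx \frac{37.62}{47.62} \approx 0.7900
\end{align*}
```

A perfect ranking gives DCG = IDCG and therefore NDCG = 1. Under linear gain even a fully inverted ranking stays well above zero, which is worth keeping in mind when reading high NDCG numbers on small boards.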

Roadmap impact

v0.3 (5-task scoring): now 4 of 5 ✅ — conformance is the only remaining task. That one will need pm4py (process discovery + token replay), so it's the natural moment to introduce a [bpi] extra.

- score_bottleneck — pure-CPython NDCG@k. Predictions rank transitions; truth is the held-out per-(a,b) mean wait time. Missing predictions sink to the bottom (refusing to predict doesn't earn credit)
- pm_bench/bottleneck.py — BottleneckTarget + extract; per-transition shape (4-tuple: a, b, mean_wait_seconds, n_observations) instead of per-prefix
- baselines/mean_wait.py — train-mean-per-transition with global-mean fallback. On synthetic-toy: NDCG@10 0.9786 over 6 transitions
- CLI: --task bottleneck, --baseline mean-wait wired through prefixes / predict / score
- leaderboard/bottleneck/synthetic-toy.json with mean-wait-ref entry (NDCG@10 0.9786, n_transitions 6); pm-bench leaderboard --all now walks 3 boards (next-event, remaining-time, bottleneck)
- 7 new tests; 86 total, ruff clean
- v0.3 marked partial → 4 of 5 tasks (conformance remains)

protosphinx · 2026-05-01T17:54:15Z

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

protosphinx deleted the branch outcome-task May 1, 2026 17:54

protosphinx closed this May 1, 2026

protosphinx deleted the bottleneck-task branch May 1, 2026 17:54

protosphinx wants to merge 1 commit into outcome-task from bottleneck-task
