synthetic-toy → 200 cases: outcome leaderboard row lands; all 5 boards real by protosphinx · Pull Request #12 · erphq/pm-bench

protosphinx · 2026-05-01T05:18:51Z

Stacked on top of #11 (conformance). Merge order: #2 → #3 → #4 → #5 → #6 → #7 → #8 → #9 → #10 → #11 → this.

Summary

Bumps the synthetic_log() default from 50 to 200 cases. The chronological tail of the bigger log naturally contains positive delivery_confirmed cases, so the outcome task finally has a real leaderboard entry.
All 4 existing reference predictions regenerated; AUC entry added for outcome. Five boards now verify under --all --verify.
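
Why the class balance of the test split matters for the outcome board: AUC is a statement about (positive, negative) pairs, so a split with no positive delivery_confirmed cases has nothing to rank. The sketch below is a plain pairwise AUC, not pm-bench's scorer. The function name `rank_auc` and the 0.5 fallback are assumptions for illustration; the fallback mirrors the commit message's description of the earlier result "degenerating to 0.5".

```python
def rank_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Hypothetical helper, not part of pm_bench.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # Assumed convention: with one class missing there are no pairs
        # to compare, so report the chance level.
        return 0.5
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# A split without positives carries no ranking information.
no_positives = rank_auc([0, 0, 0], [0.1, 0.2, 0.3])

# A split with both classes yields a real number.
mixed = rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With 45 positives among 158 test prefixes, every one of the 45 × 113 pairs contributes to the score, which is what makes the new entry meaningful.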

What's new

pm_bench/_synth.synthetic_log(n_cases=200, seed=42) — same path distribution, more cases. Test partition: 158 prefixes, 45 positives.
5th leaderboard board: leaderboard/outcome/synthetic-toy.json with prior-ref entry — AUC 0.6319, n_pos 45/158. Real floor for any temporal model on the outcome task.
_rescore_outcome + _outcome_truth_for_dataset wired into pm_bench/leaderboard.py.
All 4 existing entries regenerated:
- markov-ref: top-1 0.9304 (was 0.9756 on 50 cases; next-event prediction gets harder for the Markov baseline when the test set has more variety)
- mean-ref: MAE 1.3481 days
- mean-wait-ref: NDCG@10 0.9911 over 9 transitions
- dfg-ref: F = 1.0 (200 cases → both partitions observe every path-graph edge — the cleaner conformance result)
registry.yml synthetic-toy row updated (cases 200, events 965).
STANDINGS.md regenerated; README.md, STATUS.md updated.
Tests adjusted to the new numbers; the "n_pos=0 by accident" comments are gone.
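
The dfg-ref result of F = 1.0 follows directly from both partitions observing the same edges. A minimal sketch of that reasoning, under stated assumptions: fitness is taken as the share of test edges present in the training DFG, precision as the share of training DFG edges seen in test, and F as their harmonic mean. These definitions and the names `dfg_edges` and `dfg_f_score` are illustrative; the conformance scorer in pm-bench may define them differently.

```python
def dfg_edges(traces):
    """Collect directly-follows pairs (a, b) from activity sequences."""
    return {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}


def dfg_f_score(train_traces, test_traces):
    """Harmonic mean of edge-level fitness and precision (assumed definitions)."""
    model = dfg_edges(train_traces)
    observed = dfg_edges(test_traces)
    if not model or not observed:
        return 0.0
    shared = model & observed
    fitness = len(shared) / len(observed)   # test behaviour the model allows
    precision = len(shared) / len(model)    # model behaviour the test confirms
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)


# Both partitions cover every edge, so F is exactly 1.0.
full = dfg_f_score([["a", "b", "c"], ["a", "c"]], [["a", "c"], ["a", "b", "c"]])

# The test partition misses the (b, c) edge, so precision drops to 0.5.
partial = dfg_f_score([["a", "b", "c"]], [["a", "b"]])
```

On a small log the test partition can easily miss an edge, which lowers precision; with 200 cases both partitions see the full path graph.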

Smoke

$ pm-bench leaderboard --all --verify
bottleneck/synthetic-toy: OK - 1 entry(ies)
conformance/synthetic-toy: OK - 1 entry(ies)
next-event/synthetic-toy: OK - 1 entry(ies)
outcome/synthetic-toy: OK - 1 entry(ies)
remaining-time/synthetic-toy: OK - 1 entry(ies)
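
What a verify pass of this kind amounts to, sketched under assumptions: recompute each entry's score from its predictions and compare it with the stored value. The entry fields (`id`, `score`), the `rescore` callback, the tolerance, and the MISMATCH line are all hypothetical; only the OK line format is taken from the output above. This is not the pm-bench implementation.

```python
import math


def verify_board(name, entries, rescore, tol=1e-4):
    """Recompute each entry's score and compare it with the stored value.

    `entries` is a list of dicts with hypothetical keys "id" and "score";
    `rescore` is any callable that returns a fresh score for an entry.
    """
    for entry in entries:
        fresh = rescore(entry)
        if not math.isclose(fresh, entry["score"], abs_tol=tol):
            return (
                f"{name}: MISMATCH - {entry['id']} "
                f"stored {entry['score']}, rescored {fresh:.4f}"
            )
    return f"{name}: OK - {len(entries)} entry(ies)"


entries = [{"id": "prior-ref", "score": 0.6319}]

# Stored and recomputed scores agree.
ok_line = verify_board("outcome/synthetic-toy", entries, lambda e: 0.6319)

# A drifted score is reported instead of silently accepted.
bad_line = verify_board("outcome/synthetic-toy", entries, lambda e: 0.7000)
```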

Test plan

pytest -q: 109 passed (was 108 on the conformance PR #11; +1 for the new outcome-board verify test)
ruff check pm_bench tests — clean
pm-bench leaderboard --all --verify exits 0 across all 5 boards
STANDINGS.md staleness canary green
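
A staleness canary for a generated file can be as small as "render again and compare". The sketch below assumes a deterministic renderer and a made-up table layout; `render_standings`, `is_stale`, and the column names are hypothetical and do not describe the real STANDINGS.md format.

```python
def render_standings(boards):
    """Render a deterministic markdown table from board data (assumed layout)."""
    rows = ["| board | entry | score |", "| --- | --- | --- |"]
    for board in sorted(boards):
        for entry, score in sorted(boards[board].items()):
            rows.append(f"| {board} | {entry} | {score:.4f} |")
    return "\n".join(rows) + "\n"


def is_stale(committed_text, boards):
    """True when the committed file no longer matches a fresh render."""
    return committed_text != render_standings(boards)


boards = {"outcome/synthetic-toy": {"prior-ref": 0.6319}}
committed = render_standings(boards)

# Unchanged data: the committed file is current.
fresh_ok = is_stale(committed, boards)

# A score changed without regenerating the file: the canary trips.
boards["outcome/synthetic-toy"]["prior-ref"] = 0.7000
fresh_bad = is_stale(committed, boards)
```

Sorting the boards and entries keeps the render stable across runs, so any difference reflects a real change in the data rather than dictionary ordering.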

Roadmap impact

Closes the long-standing "outcome row waits on a dataset whose test split has both classes" caveat. Synthetic-toy is now self-sufficient for all 5 v0 tasks.
Numbers will move when a real BPI dataset gets pinned, which is fine — that's the whole point of having drift detection in CI.

- synthetic_log() default n_cases = 200 (was 50). Test partition now has ~45 positive cases for `delivery_confirmed`, so the outcome task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
  * markov-ref: top1 0.9304 (was 0.9756 on 50 cases)
  * mean-ref: MAE 1.3481 days
  * mean-wait-ref: NDCG@10 0.9911 over 9 transitions
  * dfg-ref: F = 1.0 (200 cases → both partitions cover the full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json with prior-ref entry (AUC 0.6319, n_pos 45 / 158)
- _rescore_outcome + _outcome_truth_for_dataset wired into leaderboard.py; pm-bench leaderboard --all --verify walks all 5 boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean

protosphinx · 2026-05-01T17:54:23Z

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

This was referenced May 1, 2026

pm-bench stats <name> — one-shot summary stats for any log #13

Closed

uniform-ref second baseline on next-event — multi-entry leaderboard demo #14

Closed

protosphinx deleted the branch conformance-task May 1, 2026 17:54

protosphinx closed this May 1, 2026

protosphinx deleted the synthetic-200 branch May 1, 2026 17:54

protosphinx wants to merge 1 commit into conformance-task from synthetic-200