End-to-end loop on synthetic-toy: prefixes + predict + score (#2)
Merged
- `prefixes.py` — extract `(case_id, prefix_idx, prefix, true_next)` targets from a split; CSV round-trip helpers
- `predictions.py` — predictions CSV format `(case_id, prefix_idx, ranked)`
- `baselines/markov.py` — first-order Markov reference (train-only fit, unigram fallback for unseen last-activity)
- CLI gains `prefixes`, `predict --baseline markov`, `score`; the full `split → prefixes → predict → score` loop now matches the README
- `tests/test_e2e.py` exercises the loop via click runner, locking the file formats the leaderboard depends on
- 24 tests pass (was 17); ruff clean
- Markov on synthetic-toy: top1 0.976, top3 1.000 — sets the floor any future model has to clear
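The first-order Markov reference described above (train-only fit, unigram fallback for an unseen last activity) can be sketched in a few lines. Note that `fit_markov`, `predict_ranked`, and the data shapes here are illustrative assumptions, not the actual API of `baselines/markov.py`:

```python
from collections import Counter, defaultdict

def fit_markov(train_cases):
    """Fit first-order transition counts plus a unigram fallback.

    `train_cases` is a list of activity sequences, one per case
    (an assumed input shape for this sketch).
    """
    transitions = defaultdict(Counter)  # last activity -> next-activity counts
    unigrams = Counter()                # global activity frequencies (fallback)
    for case in train_cases:
        unigrams.update(case)
        for prev, nxt in zip(case, case[1:]):
            transitions[prev][nxt] += 1
    return transitions, unigrams

def predict_ranked(transitions, unigrams, prefix):
    """Rank candidate next activities for a prefix, most likely first."""
    last = prefix[-1]
    counts = transitions.get(last)
    if not counts:  # unseen last activity: fall back to unigram frequencies
        counts = unigrams
    return [act for act, _ in counts.most_common()]

# Toy usage
trans, uni = fit_markov([["a", "b", "c"], ["a", "b", "b"]])
print(predict_ranked(trans, uni, ["a"]))  # -> ['b'] (only observed successor of "a")
print(predict_ranked(trans, uni, ["z"]))  # -> ['b', 'a', 'c'] ("z" unseen, unigram order)
```

Fitting on train cases only is what keeps the baseline honest: the fallback distribution never sees validation or test activities.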
auto-deferred for human review: LOC delta exceeds gate (652 added + 15 removed = 667 lines; threshold is 250). Generated by Claude Code
Summary
Adds `prefixes`, `predict --baseline markov`, and `score` so the README's command sequence actually runs end-to-end on `synthetic-toy`. Locks the file formats (`prefixes.csv`, `predictions.csv`) via `tests/test_e2e.py`.

What's new
- `pm_bench/prefixes.py` — extract `(case_id, prefix_idx, prefix, true_next)` targets from a split; skips length-1 cases. CSV round-trip.
- `pm_bench/predictions.py` — predictions CSV `(case_id, prefix_idx, predictions)`.
- `pm_bench/baselines/markov.py` — fit on train cases only; unigram fallback for unseen last-activity. No torch / sklearn — just CPython.
- CLI: `pm-bench prefixes <name> --split split.json --out prefixes.csv`, `pm-bench predict <name> --split split.json --prefixes prefixes.csv --out predictions.csv --baseline markov`, `pm-bench score predictions.csv --prefixes prefixes.csv --task next-event`.
- `STATUS.md` added; `GOALS.md` ticked.

Why this matters
Shipping `synthetic-toy` first means external contributors can build models against pm-bench now, on synthetic data, and slot the same code into BPI/Sepsis/Helpdesk the moment v0.1 lands. `gnn` is a `--baseline gnn` choice once the dataset machinery is in place.

Test plan
- `pytest -q` — 24 passed (was 17)
- `ruff check pm_bench tests` — clean
- `split → prefixes → predict → score` returns top1 0.976 / top3 1.000
- `tests/test_e2e.py` runs the same sequence via click runner — guards the file format

Roadmap impact
Ticks the `v0.0.1` checkbox in README + GOALS; v0.1 (dataset fetch) is the next gate.
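The `score` step in the loop above reports top-1 / top-3 accuracy over ranked predictions. A minimal sketch of such a top-k scorer, assuming targets and predictions keyed by `(case_id, prefix_idx)`; the function name, parameters, and dict-based shapes are illustrative, not pm-bench's actual `score` implementation:

```python
def topk_accuracy(targets, predictions, k):
    """Fraction of prefixes whose true next activity is in the top-k guesses.

    targets:     {(case_id, prefix_idx): true_next}
    predictions: {(case_id, prefix_idx): [activity, ...]} ranked best-first
    (Keys and shapes are assumptions for this sketch; the real tool reads
    the prefixes.csv / predictions.csv formats described above.)
    """
    hits = sum(
        1
        for key, true_next in targets.items()
        if true_next in predictions.get(key, [])[:k]
    )
    return hits / len(targets) if targets else 0.0

# Toy usage: one of two targets hit at k=1, both hit at k=3
targets = {("c1", 0): "b", ("c2", 0): "c"}
preds = {("c1", 0): ["b", "a"], ("c2", 0): ["a", "b", "c"]}
print(topk_accuracy(targets, preds, 1))  # -> 0.5
print(topk_accuracy(targets, preds, 3))  # -> 1.0
```

Missing predictions simply count as misses (`predictions.get(key, [])`), so a partial predictions file lowers the score rather than crashing the scorer.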