End-to-end loop on synthetic-toy: prefixes + predict + score (#2)
Merged
- `prefixes.py` — extract `(case_id, prefix_idx, prefix, true_next)` targets from a split; CSV round-trip helpers
- `predictions.py` — predictions CSV format `(case_id, prefix_idx, ranked)`
- `baselines/markov.py` — first-order Markov reference (train-only fit, unigram fallback for unseen last-activity)
- CLI gains `prefixes`, `predict --baseline markov`, `score`; the full `split → prefixes → predict → score` loop now matches the README
- `tests/test_e2e.py` exercises the loop via click runner, locking the file formats the leaderboard depends on
- 24 tests pass (was 17); ruff clean
- Markov on synthetic-toy: top1 0.976, top3 1.000 — sets the floor any future model has to clear
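The first-order Markov reference described above (train-only fit, unigram fallback for an unseen last activity) can be sketched in a few lines. Note that `fit_markov`, `predict_ranked`, and the data shapes here are illustrative assumptions, not the actual API of `baselines/markov.py`:

```python
from collections import Counter, defaultdict

def fit_markov(train_cases):
    """Fit first-order transition counts plus a unigram fallback.

    `train_cases` is a list of activity sequences, one per case
    (an assumed input shape for this sketch).
    """
    transitions = defaultdict(Counter)  # last activity -> next-activity counts
    unigrams = Counter()                # global activity frequencies (fallback)
    for case in train_cases:
        unigrams.update(case)
        for prev, nxt in zip(case, case[1:]):
            transitions[prev][nxt] += 1
    return transitions, unigrams

def predict_ranked(transitions, unigrams, prefix):
    """Rank candidate next activities for a prefix, most likely first."""
    last = prefix[-1]
    counts = transitions.get(last)
    if not counts:  # unseen last activity: fall back to unigram frequencies
        counts = unigrams
    return [act for act, _ in counts.most_common()]

# Toy usage
trans, uni = fit_markov([["a", "b", "c"], ["a", "b", "b"]])
print(predict_ranked(trans, uni, ["a"]))  # -> ['b'] (only observed successor of "a")
print(predict_ranked(trans, uni, ["z"]))  # -> ['b', 'a', 'c'] ("z" unseen, unigram order)
```

Fitting on train cases only is what keeps the baseline honest: the fallback distribution never sees validation or test activities.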
auto-deferred for human review: LOC delta exceeds gate (652 added + 15 removed = 667 lines; threshold is 250). Generated by Claude Code
Summary
Adds `prefixes`, `predict --baseline markov`, and `score` so the README's command sequence actually runs end-to-end on `synthetic-toy`. Locks the file formats (`prefixes.csv`, `predictions.csv`) via `tests/test_e2e.py`.

What's new
- `pm_bench/prefixes.py` — extract `(case_id, prefix_idx, prefix, true_next)` targets from a split; skips length-1 cases. CSV round-trip.
- `pm_bench/predictions.py` — predictions CSV `(case_id, prefix_idx, predictions)`.
- `pm_bench/baselines/markov.py` — fit on train cases only; unigram fallback for unseen last-activity. No torch / sklearn — just CPython.
- CLI: `pm-bench prefixes <name> --split split.json --out prefixes.csv`, `pm-bench predict <name> --split split.json --prefixes prefixes.csv --out predictions.csv --baseline markov`, `pm-bench score predictions.csv --prefixes prefixes.csv --task next-event`.
- `STATUS.md` added; `GOALS.md` ticked.

Why this matters
Shipping `synthetic-toy` first means external contributors can build models against pm-bench now, on synthetic data, and slot the same code into BPI/Sepsis/Helpdesk the moment v0.1 lands. `gnn` is a `--baseline gnn` choice once the dataset machinery is in place.

Test plan
- `pytest -q` — 24 passed (was 17)
- `ruff check pm_bench tests` — clean
- `split → prefixes → predict → score` returns top1 0.976 / top3 1.000
- `tests/test_e2e.py` runs the same sequence via click runner — guards the file format

Roadmap impact
Ticks the `v0.0.1` checkbox in README + GOALS; v0.1 (dataset fetch) is the next gate.
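The `score` step in the loop above reports top-1 / top-3 accuracy over ranked predictions. A minimal sketch of such a top-k scorer, assuming targets and predictions keyed by `(case_id, prefix_idx)`; the function name, parameters, and dict-based shapes are illustrative, not pm-bench's actual `score` implementation:

```python
def topk_accuracy(targets, predictions, k):
    """Fraction of prefixes whose true next activity is in the top-k guesses.

    targets:     {(case_id, prefix_idx): true_next}
    predictions: {(case_id, prefix_idx): [activity, ...]} ranked best-first
    (Keys and shapes are assumptions for this sketch; the real tool reads
    the prefixes.csv / predictions.csv formats described above.)
    """
    hits = sum(
        1
        for key, true_next in targets.items()
        if true_next in predictions.get(key, [])[:k]
    )
    return hits / len(targets) if targets else 0.0

# Toy usage: one of two targets hit at k=1, both hit at k=3
targets = {("c1", 0): "b", ("c2", 0): "c"}
preds = {("c1", 0): ["b", "a"], ("c2", 0): ["a", "b", "c"]}
print(topk_accuracy(targets, preds, 1))  # -> 0.5
print(topk_accuracy(targets, preds, 3))  # -> 1.0
```

Missing predictions simply count as misses (`predictions.get(key, [])`), so a partial predictions file lowers the score rather than crashing the scorer.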