v0.3: outcome task — AUC scoring + prior baseline by protosphinx · Pull Request #7 · erphq/pm-bench

protosphinx · 2026-05-01T03:27:52Z

Stacked on top of #6 (remaining-time). Merge order: #2 → #3 → #4 → #5 → #6 → this.

Summary

Adds outcome as the third task — binary classification, pure-Python AUC. CLI is fully task-aware: 3-of-5 v0 tasks now ship.
Last-activity-conditioned prior baseline: the dumbest model that uses any prefix information. A model that merely ties it is extracting nothing from the trace beyond its last activity.
No leaderboard entry on synthetic-toy yet: seed=42's test partition has n_pos=0, so AUC degenerates. The full pipeline still runs and is tested; a real entry lands when a BPI dataset gets pinned.
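For readers who want to see what a last-activity-conditioned prior looks like, here is a minimal sketch. It is not the code in baselines/prior_outcome.py. The function names `fit_prior` and `predict_prior`, the `(prefix, outcome)` input shape, and the 0.5 fallback for an empty training set are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def fit_prior(
    train: Iterable[Tuple[Sequence[str], int]],
) -> Tuple[Dict[str, float], float]:
    """Estimate P(outcome = 1 | last activity) plus a global positive rate.

    `train` yields (prefix, outcome) pairs with non-empty prefixes and
    outcomes in {0, 1}. Hypothetical interface, not the repo's.
    """
    counts = defaultdict(lambda: [0, 0])  # last activity -> [positives, total]
    positives = total = 0
    for prefix, outcome in train:
        bucket = counts[prefix[-1]]
        bucket[0] += outcome
        bucket[1] += 1
        positives += outcome
        total += 1
    # Assumption: with no training rows at all, fall back to 0.5.
    global_rate = positives / total if total else 0.5
    rates = {activity: p / n for activity, (p, n) in counts.items()}
    return rates, global_rate


def predict_prior(
    rates: Dict[str, float], global_rate: float, prefix: Sequence[str]
) -> float:
    """Score a prefix by its last activity; unseen activities get the global rate."""
    return rates.get(prefix[-1], global_rate)


train = [
    (["a", "b"], 1),
    (["a", "b"], 0),
    (["a"], 0),
    (["a"], 0),
    (["a", "c"], 1),
]
rates, global_rate = fit_prior(train)
```

Because the score depends only on `prefix[-1]`, any model that reads earlier events should be able to beat it whenever those events carry signal.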

What's new

score_outcome — rank-sum AUC with average-rank tie breaking. Degenerate single-class case returns 0.5 by convention rather than NaN, so leaderboard rows stay readable.
OutcomeTarget + extract_outcome_targets — repeats the case's final 0/1 outcome at every prefix length. Models see the same target with progressively more context — a clean way to measure "how soon can you tell?"
prior_outcome baseline — last-activity → empirical positive rate, with global-rate fallback. ~30 lines of pure Python.
_synth.is_positive_outcome — synthetic-toy rule: case ends with delivery_confirmed.
CLI dispatch — --task outcome, --baseline prior. UsageError on mismatched (task, baseline) pairs. Outcome rule resolved by dataset name.
8 new tests (test_outcome.py) — extraction, baseline determinism, per-last-activity rates, CSV round-trips, e2e click-runner pipeline. 73 total.
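The rank-sum AUC described in the list above can be sketched in pure Python as follows. This is an illustration of the technique (Mann-Whitney U from average ranks), not the repository's score_outcome. The names `average_ranks` and `rank_sum_auc` are hypothetical, and the 0.5 return for single-class input mirrors the convention stated in this PR.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """1-based ranks in ascending score order; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC = (R_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # single-class input: AUC is undefined, report 0.5 by convention
    ranks = average_ranks(y_score)
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    u_statistic = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)
```

The single-class guard is needed because the denominator `n_pos * n_neg` is zero in that case, which is the situation the synthetic-toy test partition hits with n_pos=0.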

Smoke (synthetic-toy)

$ pm-bench prefixes synthetic-toy --split split.json --out o.csv --task outcome
$ pm-bench predict synthetic-toy --split split.json --prefixes o.csv \
    --out opreds.csv --baseline prior --task outcome
$ pm-bench score opreds.csv --prefixes o.csv --task outcome
{ "task": "outcome", "auc": 0.5, "n": 41, "n_pos": 0 }

The n_pos: 0 is honest about what the data supports; the rest of the pipeline is verified by tests/test_outcome.py against a hand-built event set with controlled class balance.

Test plan

pytest -q: 73 passed (was 59 on #6, v0.2: remaining-time task + mean reference baseline)
ruff check pm_bench tests — clean
AUC math: perfect separation, perfect inversion, all-tied, single-class, hand-checked 8/9 case
e2e CLI on synthetic-toy completes cleanly even with degenerate AUC
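One way to hand-check cases like these is to compute AUC directly from its pairwise definition: the fraction of (positive, negative) pairs where the positive scores higher, with ties counting half. The sketch below does that with a hypothetical helper `pairwise_auc`. The score vectors are made up for illustration, and the 8/9 example is one possible instance, not necessarily the case used in the repository's tests.

```python
from itertools import product
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """O(n_pos * n_neg) reference AUC; ties between a positive and a negative count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return 0.5  # same single-class convention as described in this PR
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


perfect = pairwise_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
inverted = pairwise_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
all_tied = pairwise_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
single_class = pairwise_auc([0, 0, 0], [0.1, 0.2, 0.3])
# 3 positives x 3 negatives = 9 pairs; only (0.4 vs 0.5) is discordant.
eight_ninths = pairwise_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
```

A quadratic reference like this is slow on large inputs but makes a useful oracle for testing a faster rank-based scorer.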

Roadmap impact

v0.3 (5-task scoring): now 3 of 5 (next-event, remaining-time, outcome) ✅; conformance + bottleneck remain
The outcome leaderboard slot is real — it just waits on a dataset whose test split has both classes

- score_outcome: pure-Python rank-sum AUC with average-rank tie breaking; single-class degenerate case returns 0.5 (rather than NaN) so leaderboard rows stay readable
- prefixes.py: OutcomeTarget + extract_outcome_targets, which repeats the case's final 0/1 outcome at every prefix length so models see the same target with progressively more context
- baselines/prior_outcome.py: last-activity-conditioned positive rate (with global-rate fallback for unseen activities). The dumbest baseline that uses *any* prefix signal; tying it means the model isn't using the trace at all
- _synth.is_positive_outcome: synthetic-toy outcome rule (case ends with delivery_confirmed)
- CLI: --task outcome, --baseline prior wired through prefixes / predict / score; outcome rule dispatch by dataset name
- 8 new tests (test_outcome.py): extraction, baseline determinism, per-last-activity rates, CSV round-trips, e2e click pipeline; 73 total
- No leaderboard entry on synthetic-toy yet: seed=42's test partition happens to have n_pos=0, so AUC degenerates. The pipeline still runs cleanly; a real leaderboard row waits on a pinned BPI dataset
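The target-extraction idea in the commit message (repeat the case's final outcome at every prefix length) can be illustrated with a short sketch. Everything here is hypothetical: `OutcomeRow` stands in for the repo's OutcomeTarget, whose actual fields are not shown in this PR, and `outcome_rows` and the activity names other than delivery_confirmed are invented. The sketch also assumes the full-length prefix is included, which the PR does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class OutcomeRow:
    """Hypothetical row type: one prefix of a case plus the case-level label."""
    case_id: str
    prefix: Tuple[str, ...]
    outcome: int


def ends_with_delivery_confirmed(activities: Sequence[str]) -> bool:
    """Synthetic-toy rule as described in the PR: positive iff the case ends with delivery_confirmed."""
    return bool(activities) and activities[-1] == "delivery_confirmed"


def outcome_rows(
    cases: Dict[str, List[str]],
    is_positive: Callable[[Sequence[str]], bool] = ends_with_delivery_confirmed,
) -> List[OutcomeRow]:
    """Emit one row per prefix length 1..len(case), each carrying the same final label."""
    rows = []
    for case_id, activities in cases.items():
        label = int(is_positive(activities))  # decided once, from the complete trace
        for k in range(1, len(activities) + 1):
            rows.append(OutcomeRow(case_id, tuple(activities[:k]), label))
    return rows


cases = {
    "c1": ["order_placed", "shipped", "delivery_confirmed"],
    "c2": ["order_placed", "cancelled"],
}
rows = outcome_rows(cases)
```

Since the label is fixed per case while the prefix grows, grouping scores by `len(row.prefix)` shows how early a model can separate the classes.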

protosphinx · 2026-05-01T17:54:13Z

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

protosphinx deleted the branch remaining-time May 1, 2026 17:54

protosphinx closed this May 1, 2026

protosphinx deleted the outcome-task branch May 1, 2026 17:54

protosphinx wants to merge 1 commit into remaining-time from outcome-task
