
synthetic-toy → 200 cases: outcome leaderboard row lands; all 5 boards real #12

Closed
protosphinx wants to merge 1 commit into conformance-task from synthetic-200


Conversation

@protosphinx
Member

Stacked on top of #11 (conformance). Merge order: #2 → #3 → #4 → #5 → #6 → #7 → #8 → #9 → #10 → #11 → this.

Summary

  • Bumps synthetic_log() default from 50 → 200 cases. The chronological tail of the bigger log naturally contains positive delivery_confirmed cases, so the outcome task finally has a real leaderboard entry.
  • All 4 existing reference predictions regenerated; AUC entry added for outcome. Five boards now verify under --all --verify.
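Why the outcome board needed positives in the test partition can be illustrated with a rank-based AUC. This is a minimal sketch, not pm-bench's scorer: `roc_auc` is a hypothetical helper, the toy labels and scores are invented, and returning `None` for a single-class split is this sketch's choice (the commit message says the real task degenerated to 0.5 instead).

```python
from typing import Optional, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Rank-based (Mann-Whitney) AUC; None when one class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # no positive/negative pairs to rank, so AUC is undefined
    # Count pairs where the positive outranks the negative; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not produced by synthetic_log().
print(roc_auc([0, 0, 0, 0], [0.1, 0.4, 0.2, 0.3]))  # test split without positives
print(roc_auc([0, 1, 0, 1], [0.1, 0.8, 0.6, 0.4]))  # both classes present
```

With no positives there is nothing to rank, so any reported number is a placeholder. Once both classes appear, a constant-score predictor lands at exactly 0.5 and anything above that reflects real signal.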

What's new

  • pm_bench/_synth.synthetic_log(n_cases=200, seed=42) — same path distribution, more cases. Test partition: 158 prefixes, 45 positives.
  • 5th leaderboard board: leaderboard/outcome/synthetic-toy.json with prior-ref entry — AUC 0.6319, n_pos 45/158. Real floor for any temporal model on the outcome task.
  • _rescore_outcome + _outcome_truth_for_dataset wired into pm_bench/leaderboard.py.
  • All 4 existing entries regenerated:
    • markov-ref: top-1 0.9304 (was 0.9756 on 50 cases; the larger test set has more variety, so next-event prediction is harder for the Markov baseline)
    • mean-ref: MAE 1.3481 days
    • mean-wait-ref: NDCG@10 0.9911 over 9 transitions
    • dfg-ref: F = 1.0 (200 cases → both partitions observe every path-graph edge — the cleaner conformance result)
  • registry.yml synthetic-toy row updated (cases 200, events 965).
  • STANDINGS.md regenerated; README.md, STATUS.md updated.
  • Tests adjusted to the new numbers; the "n_pos=0 by accident" comments are gone.
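The dfg-ref result (F = 1.0 once both partitions observe every path-graph edge) can be made concrete with an edge-overlap score. This is a sketch under assumptions: it treats F as the harmonic mean of edge precision and recall between the training and test directly-follows graphs, which may not match how pm-bench scores conformance. `dfg_edges`, `edge_f_score`, and the activity names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def dfg_edges(traces: Iterable[List[str]]) -> Set[Edge]:
    """Collect directly-follows pairs (a, b) over all traces."""
    edges: Set[Edge] = set()
    for trace in traces:
        edges.update(zip(trace, trace[1:]))
    return edges


def edge_f_score(train: Iterable[List[str]], test: Iterable[List[str]]) -> float:
    """Harmonic mean of edge precision and recall (assumed definition)."""
    model, observed = dfg_edges(train), dfg_edges(test)
    if not model or not observed:
        return 0.0
    shared = len(model & observed)
    precision = shared / len(model)   # model edges confirmed by the test split
    recall = shared / len(observed)   # test edges the model already knows
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical traces: the small training log misses the a -> c shortcut.
small_train = [["a", "b", "c"]]
full_train = [["a", "b", "c"], ["a", "c"]]
test = [["a", "b", "c"], ["a", "c"]]
print(edge_f_score(small_train, test))
print(edge_f_score(full_train, test))
```

Under this definition the score only reaches 1.0 when the two edge sets are identical, which is why a log large enough for both partitions to cover the full path graph gives the cleaner result.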

Smoke

$ pm-bench leaderboard --all --verify
bottleneck/synthetic-toy: OK - 1 entry(ies)
conformance/synthetic-toy: OK - 1 entry(ies)
next-event/synthetic-toy: OK - 1 entry(ies)
outcome/synthetic-toy: OK - 1 entry(ies)
remaining-time/synthetic-toy: OK - 1 entry(ies)

Test plan

Roadmap impact

  • Closes the long-standing "outcome row waits on a dataset whose test split has both classes" caveat. Synthetic-toy is now self-sufficient for all 5 v0 tasks.
  • Numbers will move when a real BPI dataset gets pinned, which is fine — that's the whole point of having drift detection in CI.
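The drift-detection idea amounts to re-scoring each stored entry and comparing the result with the number on the board. The sketch below assumes a board shaped as a dict with an `entries` list of `name`/`score` records and a caller-supplied `rescore` function; the actual JSON schema and the internals of `--verify` are not shown in this PR, so every name here is hypothetical.

```python
import math
from typing import Callable, Dict, List


def verify_board(
    board: Dict,
    rescore: Callable[[Dict], float],
    tol: float = 1e-4,
) -> List[str]:
    """Return one message per drifted entry; an empty list means OK."""
    problems: List[str] = []
    for entry in board["entries"]:
        fresh = rescore(entry)  # recompute the metric from the stored predictions
        if not math.isclose(fresh, entry["score"], abs_tol=tol):
            problems.append(
                f"{entry['name']}: stored {entry['score']} != rescored {fresh:.4f}"
            )
    return problems


# Hypothetical board using the AUC reported in this PR.
board = {"entries": [{"name": "prior-ref", "score": 0.6319}]}
print(verify_board(board, lambda entry: 0.6319))  # reproduces
print(verify_board(board, lambda entry: 0.7000))  # drifted
```

A check like this fails loudly when a dataset or scorer change moves a number, so regenerating the board becomes an explicit step rather than a silent one.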

- synthetic_log() default n_cases = 200 (was 50). Test partition now
  has ~45 positive cases for `delivery_confirmed`, so the outcome
  task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
  * markov-ref:    top1 0.9304  (was 0.9756 on 50 cases)
  * mean-ref:      MAE 1.3481 days
  * mean-wait-ref: NDCG@10 0.9911 over 9 transitions
  * dfg-ref:       F = 1.0  (200 cases → both partitions cover the
                             full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json
  with prior-ref entry — AUC 0.6319, n_pos 45 / 158
- _rescore_outcome + _outcome_truth_for_dataset wired into
  leaderboard.py; pm-bench leaderboard --all --verify walks all 5
  boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted
  to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean
@protosphinx
Member Author

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

@protosphinx protosphinx deleted the branch conformance-task May 1, 2026 17:54
@protosphinx protosphinx closed this May 1, 2026
@protosphinx protosphinx deleted the synthetic-200 branch May 1, 2026 17:54