synthetic-toy → 200 cases: outcome leaderboard row lands; all 5 boards real #12
Closed
protosphinx wants to merge 1 commit into
- synthetic_log() default n_cases = 200 (was 50). Test partition now
has ~45 positive cases for `delivery_confirmed`, so the outcome
task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
* markov-ref: top1 0.9304 (was 0.9756 on 50 cases)
* mean-ref: MAE 1.3481 days
* mean-wait-ref: NDCG@10 0.9911 over 9 transitions
* dfg-ref: F = 1.0 (200 cases → both partitions cover the
full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json
with prior-ref entry — AUC 0.6319, n_pos 45 / 158
- _rescore_outcome + _outcome_truth_for_dataset wired into
leaderboard.py; pm-bench leaderboard --all --verify walks all 5
boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted
to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean
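The first bullet says the outcome task degenerates to AUC 0.5 when the test partition has no positive cases. A minimal sketch of why, assuming a rank-based (Mann-Whitney) AUC: the function `roc_auc` below is hypothetical and is not pm_bench's scorer, and returning 0.5 when a class is missing is an assumption chosen to mirror the behaviour described above.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Rank-based ROC AUC: fraction of (positive, negative) pairs in which
    the positive case gets the higher score; ties count as half a win.

    Assumption: returns 0.5 when either class is absent. pm_bench's own
    scorer may handle that case differently.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        # No (positive, negative) pairs exist, so AUC is undefined;
        # fall back to the chance level.
        return 0.5
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# n_pos = 0: every model scores 0.5, whatever it predicts.
print(roc_auc([0, 0, 0, 0], [0.1, 0.7, 0.3, 0.9]))  # 0.5
# Both classes present: the score now depends on the ranking.
print(roc_auc([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.9]))  # 0.75
```

With 45 positives among 158 test prefixes there are 45 × 113 positive/negative pairs to compare, which is enough for the ranking to matter.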
This was referenced May 1, 2026
Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.
Stacked on top of #11 (conformance). Merge order: #2 → #3 → #4 → #5 → #6 → #7 → #8 → #9 → #10 → #11 → this.
Summary
- `synthetic_log()` default from 50 → 200 cases. The chronological tail of the bigger log naturally contains positive `delivery_confirmed` cases, so the outcome task finally has a real leaderboard entry.
- `pm-bench leaderboard --all --verify` now walks all 5 boards.
What's new
- `pm_bench/_synth.synthetic_log(n_cases=200, seed=42)`: same path distribution, more cases. Test partition: 158 prefixes, 45 positives.
- `leaderboard/outcome/synthetic-toy.json` with `prior-ref` entry: AUC 0.6319, n_pos 45/158. Real floor for any temporal model on the outcome task.
- `_rescore_outcome` + `_outcome_truth_for_dataset` wired into `pm_bench/leaderboard.py`.
- `markov-ref`: top-1 0.9304 (was 0.9756 on 50 cases; Markov is harder when the test set has more variety)
- `mean-ref`: MAE 1.3481 days
- `mean-wait-ref`: NDCG@10 0.9911 over 9 transitions
- `dfg-ref`: F = 1.0 (200 cases → both partitions observe every path-graph edge, the cleaner conformance result)
- `registry.yml` synthetic-toy row updated (cases 200, events 965).
- `STANDINGS.md` regenerated; `README.md`, `STATUS.md` updated.
- `n_pos=0 by accident` comments are gone.
Smoke
Test plan
- `pytest -q`: 109 passed (was 108 on PR #11; +1 for the new outcome-board verify test)
- `ruff check pm_bench tests`: clean
- `pm-bench leaderboard --all --verify` exits 0 across all 5 boards
Roadmap impact