chore: add dependabot config #24
Merged
Conversation
… score
- prefixes.py — extract (case_id, prefix_idx, prefix, true_next) targets from a split; CSV round-trip helpers
- predictions.py — predictions CSV format (case_id, prefix_idx, ranked)
- baselines/markov.py — first-order Markov reference (train-only fit, unigram fallback for unseen last-activity)
- CLI gains `prefixes`, `predict --baseline markov`, `score`; the full `split → prefixes → predict → score` loop now matches the README
- tests/test_e2e.py exercises the loop via click runner, locking the file formats the leaderboard depends on
- 24 tests pass (was 17); ruff clean
- Markov on synthetic-toy: top1 0.976, top3 1.000 — sets the floor any future model has to clear
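The Markov reference is intentionally tiny. A rough sketch of the idea (illustrative names, not the repo's markov.py): count last-activity → next-activity transitions on training cases only, rank candidates by count, and fall back to unigram frequencies when a prefix ends in an activity never seen in training.

```python
from collections import Counter, defaultdict

def fit_markov(train_cases):
    """train_cases: iterable of activity sequences (one per case)."""
    transitions = defaultdict(Counter)   # last activity -> Counter of next activities
    unigrams = Counter()                 # fallback for unseen last activities
    for case in train_cases:
        unigrams.update(case)
        for a, b in zip(case, case[1:]):
            transitions[a][b] += 1
    return transitions, unigrams

def rank_next(model, prefix):
    """Return candidate next activities, most likely first, for the given prefix."""
    transitions, unigrams = model
    counts = transitions.get(prefix[-1]) or unigrams  # unigram fallback
    return [act for act, _ in counts.most_common()]

# Fit on three toy cases, then predict after the prefix ["a", "b"]
model = fit_markov([["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]])
print(rank_next(model, ["a", "b"]))  # ['c', 'd']
```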
- cache.py — `$PM_BENCH_CACHE` → `~/.cache/pm-bench/` with per-dataset paths; rejects synthetic and unknown formats
- fetch.py — `ensure_cached(dataset)` covers cached+match, cached+mismatch (loud HashMismatchError), cached+unpinned (returns actual hash), not-cached (auto-download if URL set, otherwise ManualFetchRequired with the precise landing URL + on-disk path). Streams in 1 MiB chunks; atomic .part-then-rename writes
- CLI: `pm-bench fetch <name> [--pin]` — prints status, emits a pasteable registry.yml sha256 patch when `--pin` is set against an unpinned-but-present cached file (the path the TOS-gated workflow takes)
- 13 new tests (test_cache.py, test_fetch.py); 37 total, ruff clean
- STATUS / GOALS / README updated: v0.1 marked partial — machinery shipped, per-dataset hash pins pending one-time manual downloads
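The streaming plus atomic-rename part is the piece worth spelling out. A minimal sketch of that pattern, assuming a plain urllib fetch (names and error handling are illustrative, not fetch.py's actual API): write 1 MiB chunks to a `.part` file, hash as you go, and only rename into place once the whole body has landed, so an interrupted download never leaves a half-written file at the final path.

```python
import hashlib
import urllib.request
from pathlib import Path

CHUNK = 1 << 20  # 1 MiB

def download_atomic(url: str, dest: Path) -> str:
    """Stream url to dest via a .part temp file; return the sha256 hex digest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_suffix(dest.suffix + ".part")
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as resp, open(part, "wb") as out:
        while True:
            chunk = resp.read(CHUNK)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
    part.rename(dest)  # same-filesystem rename: no partial file ever sits at dest
    return digest.hexdigest()
```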
- leaderboard/next-event/synthetic-toy.json — first standings file, with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz — reference predictions, checked in so the loop is reproducible without hitting the network
- pm_bench/leaderboard.py — load_board, rescore, verify, standings. Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` — pretty-prints standings, optionally re-runs scoring against the checked-in predictions and fails if recorded != actual
- tests/test_leaderboard.py — 8 tests including a drift-detection canary that tampers with the recorded score and confirms verify() flags it
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
- `pm-bench leaderboard --all [--verify]` walks every leaderboard/<task>/<dataset>.json so CI and contributors share one command. Without --verify, prints OK/DRIFT per board; with it, exits non-zero on any drift
- .github/workflows/leaderboard.yml — dedicated job that runs the --all --verify command on every push / PR touching scoring code or standings files. Surfaces leaderboard health as its own check rather than burying it inside pytest
- 2 new tests cover the --all path on both clean and tampered trees; 47 total, ruff clean
- README v0.4 milestone narrowed: only the landing page remains
- score_remaining_time → MAE in days; equally-weighted across prefixes
- pm_bench/prefixes.py grows TimeTarget + extract_remaining_time_targets
+ CSV r/w; truth file shape parallels next-event so models share a
loader
- pm_bench/baselines/mean_time.py — fit_mean_time predicts the mean
remaining-time observed on training prefixes; the dumbest model that
still respects the train/test split
- CLI: prefixes / predict / score all dispatch on
--task {next-event,remaining-time}. predict --baseline mean is
remaining-time only; markov is next-event only
- leaderboard.py — rescore + verify handle both tasks; standings sorts
ascending for MAE, descending for accuracy
- leaderboard/remaining-time/synthetic-toy.json with mean-ref entry
(MAE 1.255 days, n 41); predictions checked in
- 8 new tests; 59 total, ruff clean
- README v0.2 marked shipped; v0.3 marked partial (2 of 5 tasks)
- score_outcome — pure-CPython rank-sum AUC with average-rank tie breaking; single-class degenerate case returns 0.5 (rather than NaN) so leaderboard rows stay readable
- prefixes.py: OutcomeTarget + extract_outcome_targets — repeats the case's final 0/1 outcome at every prefix length so models see the same target with progressively more context
- baselines/prior_outcome.py: last-activity-conditioned positive rate (with global-rate fallback for unseen activities). The dumbest baseline that uses *any* prefix signal — tying it means the model isn't using the trace at all
- _synth.is_positive_outcome — synthetic-toy outcome rule (case ends with delivery_confirmed)
- CLI: --task outcome, --baseline prior wired through prefixes / predict / score; outcome rule dispatch by dataset name
- 8 new tests (test_outcome.py) — extraction, baseline determinism, per-last-activity rates, CSV round-trips, e2e click pipeline; 73 total
- No leaderboard entry on synthetic-toy yet: seed=42's test partition happens to have n_pos=0, so AUC degenerates. The pipeline still runs cleanly; a real leaderboard row waits on a pinned BPI dataset
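For reference, the rank-sum formulation the scorer describes fits in a few lines of pure Python (a sketch with illustrative names, not the repo's score_outcome): assign average ranks to tied scores, then AUC = (R_pos - n_pos(n_pos+1)/2) / (n_pos * n_neg), returning 0.5 whenever one class is empty.

```python
def rank_sum_auc(scores, labels):
    """AUC via the Mann-Whitney rank-sum statistic; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # degenerate single-class case: keep the leaderboard row readable

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1            # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(rank_sum_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # perfect separation -> 1.0
print(rank_sum_auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))  # all tied -> 0.5
```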
- score_bottleneck — pure-CPython NDCG@k. Predictions rank transitions; truth is the held-out per-(a,b) mean wait time. Missing predictions sink to the bottom (refusing to predict doesn't earn credit)
- pm_bench/bottleneck.py — BottleneckTarget + extract; per-transition shape (4-tuple: a, b, mean_wait_seconds, n_observations) instead of per-prefix
- baselines/mean_wait.py — train-mean-per-transition with global-mean fallback. On synthetic-toy: NDCG@10 0.9786 over 6 transitions
- CLI: --task bottleneck, --baseline mean-wait wired through prefixes / predict / score
- leaderboard/bottleneck/synthetic-toy.json with mean-wait-ref entry (NDCG@10 0.9786, n_transitions 6); pm-bench leaderboard --all now walks 3 boards (next-event, remaining-time, bottleneck)
- 7 new tests; 86 total, ruff clean
- v0.3 marked partial → 4 of 5 tasks (conformance remains)
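A sketch of NDCG@k under this setup (illustrative, not the repo's score_bottleneck): relevance is the held-out mean wait per transition, the submission supplies the ranking, and anything the submission leaves out simply contributes no gain, which is one way to make refusing to predict earn nothing.

```python
import math

def ndcg_at_k(ranked_transitions, true_wait, k=10):
    """ranked_transitions: (a, b) pairs best-first; true_wait: {(a, b): mean wait seconds}."""
    dcg = 0.0
    for i, t in enumerate(ranked_transitions[:k], start=1):
        dcg += true_wait.get(t, 0.0) / math.log2(i + 1)   # unranked or unknown pairs earn nothing
    ideal = sorted(true_wait.values(), reverse=True)[:k]
    idcg = sum(w / math.log2(i + 1) for i, w in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

truth = {("pick", "pack"): 3600.0, ("pack", "ship"): 900.0, ("ship", "deliver"): 60.0}
print(ndcg_at_k([("pick", "pack"), ("pack", "ship"), ("ship", "deliver")], truth))  # 1.0
print(ndcg_at_k([("ship", "deliver"), ("pack", "ship")], truth))                    # < 1.0
```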
- `pm-bench leaderboard [--all] --markdown` emits a markdown rendering with task-aware columns (top1/top3 for next-event, mae_days for remaining-time, AUC/n_pos for outcome, NDCG@k for bottleneck)
- STANDINGS.md checked in at repo root; tests/test_leaderboard.py fails if it drifts from what `--all --markdown` produces today (regenerate via that one command)
- README links STANDINGS so headline numbers are one click away
- v0.4 milestone closed: standings format + reference entries + verify CLI + leaderboard.yml CI + auto-generated landing page all shipped
- 89 tests pass (was 86), ruff clean
…ry plumbing
- pm_bench/io.py: read_csv_log accepts CSV / .csv.gz with either pm-bench-native columns (case_id, activity, timestamp) or PM4Py XES-derived names (case:concept:name, concept:name, time:timestamp). Bad timestamps fail with file:line context
- _load_events auto-detects path-like inputs (slashes, .csv, .csv.gz, .tsv) and routes to the loader. Registry names still work
- `pm-bench split path/to/log.csv` (and the full pipeline) runs end-to-end on any user CSV — no registry entry, no hash pin
- 8 new tests including a click-runner e2e against a tmp CSV; 97 total
- Unblocks the obvious "let me try pm-bench on my own data" path that previously required wiring a registry entry first
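The column aliasing is the whole trick. A sketch of how such a loader can work, assuming a DictReader-based reader (illustrative; io.py's real signature and error types may differ):

```python
import csv
import gzip
from datetime import datetime

# Either header set maps onto the same three logical columns
ALIASES = {
    "case_id": ("case_id", "case:concept:name"),
    "activity": ("activity", "concept:name"),
    "timestamp": ("timestamp", "time:timestamp"),
}

def read_csv_log(path):
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        reader = csv.DictReader(fh)
        cols = {}
        for logical, names in ALIASES.items():
            match = next((n for n in names if n in (reader.fieldnames or [])), None)
            if match is None:
                raise ValueError(f"{path}: no column for {logical!r} (tried {names})")
            cols[logical] = match
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            try:
                ts = datetime.fromisoformat(row[cols["timestamp"]])
            except ValueError as exc:
                raise ValueError(f"{path}:{lineno}: bad timestamp: {exc}") from None
            yield row[cols["case_id"]], row[cols["activity"]], ts
```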
- score_conformance — pure CPython, no pm4py dep. F = 2fp/(f+p),
where f is fitness and p is precision, both computed from set
overlap of the submitted DFG and the test-partition DFG (see the
sketch after this list)
- pm_bench/conformance.py — extract_dfg, write/read_model_json.
Submission format: {"transitions": [["a","b"], ...]}
- New CLI verb `pm-bench discover <name> --baseline dfg --out
model.json` — discovers a DFG from training cases. Score path
takes --dataset and --split (instead of --prefixes) since the
model is global, not per-prefix
- leaderboard/conformance/synthetic-toy.json with dfg-ref entry
(F=0.857, fitness 1.0, precision 0.75); pm-bench leaderboard
--all now walks 4 boards
- leaderboard.py + CLI standings printer + STANDINGS.md learn the
conformance column set
- 11 new tests (test_conformance.py); 108 total, ruff clean
- v0.3 (5-task scoring) closed: every task has a baseline + entry
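A hedged sketch of that overlap scoring (illustrative code, not the repo's conformance.py): treat each DFG as a set of (a, b) directly-follows pairs, take fitness as the fraction of test-partition transitions the submission covers, precision as the fraction of submitted transitions that occur in the test partition, and F as their harmonic mean.

```python
def extract_dfg(cases):
    """Directly-follows graph as a set of (a, b) pairs over activity sequences."""
    return {(a, b) for case in cases for a, b in zip(case, case[1:])}

def score_conformance(submitted, truth):
    """submitted / truth: sets of (a, b) transitions. Returns (fitness, precision, F)."""
    overlap = submitted & truth
    fitness = len(overlap) / len(truth) if truth else 0.0
    precision = len(overlap) / len(submitted) if submitted else 0.0
    f = 2 * fitness * precision / (fitness + precision) if fitness + precision else 0.0
    return fitness, precision, f

truth = extract_dfg([["order", "pick", "ship"], ["order", "pick", "cancel"]])
model = {("order", "pick"), ("pick", "ship"), ("pick", "refund")}  # one spurious edge
print(score_conformance(model, truth))  # fitness 2/3, precision 2/3, F 2/3
```

The checked-in dfg-ref numbers are consistent with this shape: 2 * 1.0 * 0.75 / 1.75 = 0.857.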
No semantic change; ASCII-only punctuation across READMEs, GOALS, source comments, doctests, and config. Verified by running the existing test suite (no test asserts on em-dash text).
- synthetic_log() default n_cases = 200 (was 50). Test partition now
has ~45 positive cases for `delivery_confirmed`, so the outcome
task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
* markov-ref: top1 0.9304 (was 0.9756 on 50 cases)
* mean-ref: MAE 1.3481 days
* mean-wait-ref: NDCG@10 0.9911 over 9 transitions
* dfg-ref: F = 1.0 (200 cases → both partitions cover the
full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json
with prior-ref entry — AUC 0.6319, n_pos 45 / 158
- _rescore_outcome + _outcome_truth_for_dataset wired into
leaderboard.py; pm-bench leaderboard --all --verify walks all 5
boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted
to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean
- pm_bench/stats.py:summarize(events, top_n) → LogStats with n_events, n_cases, n_activities, time span, earliest/latest, mean/median case length, top-N activities, top-N transitions
- CLI: pm-bench stats <name-or-path> [--top-n N] emits JSON
- Works on synthetic-toy and any CSV path that the existing _load_events dispatch accepts
- 7 new tests (test_stats.py); 116 total, ruff clean
- README gets a one-liner pointing at the command
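The summary is Counter-and-minmax work end to end. A rough sketch of the shape (the dict return and field grouping are illustrative; the real stats.py returns a LogStats object):

```python
from collections import Counter
from statistics import mean, median

def summarize(events, top_n=5):
    """events: iterable of (case_id, activity, timestamp) tuples."""
    events = sorted(events, key=lambda e: (e[0], e[2]))
    cases = {}
    for case_id, activity, ts in events:
        cases.setdefault(case_id, []).append((activity, ts))
    lengths = [len(trace) for trace in cases.values()]
    activities = Counter(a for _, a, _ in events)
    transitions = Counter(
        (a1, a2)
        for trace in cases.values()
        for (a1, _), (a2, _) in zip(trace, trace[1:])
    )
    times = [ts for _, _, ts in events]
    return {
        "n_events": len(events),
        "n_cases": len(cases),
        "n_activities": len(activities),
        "earliest": min(times),
        "latest": max(times),
        "mean_case_length": mean(lengths),
        "median_case_length": median(lengths),
        "top_activities": activities.most_common(top_n),
        "top_transitions": transitions.most_common(top_n),
    }
```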
…oard
- pm_bench/baselines/uniform.py: ranks every training-set activity in lexicographic order, identical for every prefix. The "didn't read the trace at all" floor that any real model has to clear
- predict --baseline uniform wired through CLI for next-event
- leaderboard/next-event/synthetic-toy.json now has 2 entries:
  * markov-ref: top1 0.9304, top3 1.0000
  * uniform-ref: top1 0.2025, top3 0.2785
  Standings sort puts markov above uniform (asserted by test)
- Demonstrates the leaderboard scales beyond 1 entry per (task, dataset)
- STANDINGS.md regenerated; STATUS updated
- 1 new test (multi-entry sort canary); 117 total, ruff clean
- pm_bench/baselines/zero_time.py: predicts 0 days for every prefix (absolute MAE floor)
- discover --baseline empty: submits an empty DFG (fitness 0, F 0 — absolute conformance floor)
- CLI: predict --baseline zero --task remaining-time wired alongside mean; discover --baseline empty wired alongside dfg
- New leaderboard entries:
  * remaining-time/synthetic-toy: zero-ref MAE 2.7410 vs mean-ref 1.3481
  * conformance/synthetic-toy: empty-ref F 0.0 vs dfg-ref F 1.0
- 3 of 5 boards now have 2 entries (next-event already had uniform-ref from the previous PR; outcome and bottleneck still single-entry pending future submissions)
- STANDINGS.md regenerated; 117 tests, ruff clean
Sweep across STATUS.md, baselines (uniform, zero_time), stats.py, cli.py, and the leaderboard JSON fixtures. ASCII-only punctuation. Tests pass unchanged (117); ruff clean.
- pm_bench/leaderboard.py:compare_boards(a, b) → dict — per-model score deltas. Tasks/datasets must match (loud ValueError otherwise)
- CLI: pm-bench compare A.json B.json emits the diff as JSON
- Use case: snapshot today, change something, re-snapshot, diff to see what moved. Models unique to one side get surfaced separately
- 6 new tests including click-runner smoke + cross-task rejection
- 123 total, ruff clean
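A sketch of the diff logic as described (illustrative; the real compare_boards may shape its result differently): index each board's entries by model, emit per-model metric deltas for the overlap, and list models that only appear on one side.

```python
def compare_boards(a, b):
    """a, b: leaderboard dicts with 'task', 'dataset', and an 'entries' list."""
    if (a["task"], a["dataset"]) != (b["task"], b["dataset"]):
        raise ValueError("boards must share task and dataset")
    left = {e["model"]: e["score"] for e in a["entries"]}
    right = {e["model"]: e["score"] for e in b["entries"]}
    common = left.keys() & right.keys()
    return {
        # b-minus-a deltas per shared metric key (assumes numeric metric values)
        "deltas": {
            m: {k: right[m][k] - left[m][k] for k in left[m] if k in right[m]}
            for m in sorted(common)
        },
        "only_in_a": sorted(left.keys() - right.keys()),
        "only_in_b": sorted(right.keys() - left.keys()),
    }
```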
- tests/test_fetch.py:test_ensure_cached_auto_downloads_from_url
spins up http.server in a tmp dir, points a Dataset's download_url
at 127.0.0.1:<port>, and verifies ensure_cached (server fixture
sketched after this list):
* downloads on first call (downloaded=True, pinned=True since
sha256 matches)
* hits the cache on the second call (downloaded=False)
- The auto-download path previously had no direct coverage (only
the manual-fetch error case was tested); this fills the obvious
test gap before any actually-fetchable URLs land in registry.yml
- 124 tests, ruff clean (was 123)
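The local-server scaffolding is standard library only. A rough sketch of how such a fixture can be spun up (the fixture and test below are illustrative, not the repo's actual test code):

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

import pytest

@pytest.fixture
def local_file_server(tmp_path):
    """Serve tmp_path over HTTP on an ephemeral port; yield (base_url, docroot)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(tmp_path))
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_address[1]}", tmp_path
    finally:
        server.shutdown()

def test_downloads_then_caches(local_file_server):
    base_url, docroot = local_file_server
    (docroot / "toy.csv").write_text("case_id,activity,timestamp\n")
    # point the dataset's download_url at f"{base_url}/toy.csv", then call
    # ensure_cached twice: first call downloads, second call hits the cache
```

Binding to port 0 lets the OS pick a free port, so the test never collides with anything else running on the CI host.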
- python -m bench.seeds --n 30 runs each reference baseline at N seeds of synthetic_log; prints mean / std / min / max per metric as a markdown table (or JSON via --format json)
- Quantifies the noise band a real submission has to clear before "better than the baseline" is a statistically interesting claim
- All 5 tasks covered: next-event/markov, remaining-time/mean, outcome/prior, bottleneck/mean-wait, conformance/dfg
- README gets a "Baseline variance" section pointing at the script
- 4 new tests including parametrized smoke per task; 132 total, ruff clean
- First measurement at n=5: markov top1 0.9183 ± 0.0111
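The measurement itself is just repeated runs plus summary statistics. A minimal sketch of the loop (the run_once callable is a placeholder for whatever fit-and-score each task actually does):

```python
from statistics import mean, stdev

def seed_sweep(run_once, seeds=range(30)):
    """run_once(seed) -> float metric; returns summary stats across seeds."""
    values = [run_once(seed) for seed in seeds]
    return {
        "mean": mean(values),
        "std": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }

# e.g. seed_sweep(lambda s: markov_top1_on(synthetic_log(seed=s)), seeds=range(5))
```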
- Setup section for new clones
- Adding a leaderboard entry: 5-step recipe per task
- Predictions file format table for all 5 tasks (next-event, remaining-time, outcome, bottleneck, conformance)
- Pre-PR checklist: pytest, ruff, leaderboard verify, STANDINGS regen
- Noise-quantification recommendation pointing at bench/seeds.py
- Bug reporting template
- README's "Submitting" section now points at CONTRIBUTING for the long form
- _load_events parses "synthetic-toy@99" → synthetic_log(seed=99).
Bare "synthetic-toy" still uses canonical seed=42 — leaderboard
reproducibility preserved
- Saves users from either scripting Python or threading --seed
through every command verb
- Bad seed strings ("synthetic-toy@nope") fail cleanly with a
message
- README + STATUS updated; 2 new tests; 134 total, ruff clean
- pm_bench/leaderboard_schema.py:validate_board(dict) → list[str]. Stdlib-only structural checker (no jsonschema dep). Clear error paths like "$.entries[2].score: must be an object"
- Required top keys: task, dataset, metric, scored_with, split, entries. Required entry keys: model, version, predictions_path, score. Task must be one of the 5 v0 tasks
- Parametrized test exercises every leaderboard/<task>/<dataset>.json in the repo; 4 negative tests cover missing top keys, unknown tasks, missing entry.score, non-dict score
- 9 new tests; 143 total, ruff clean
- Catches structural drift in leaderboard files before they reach the rescore path
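A sketch of the stdlib-only structural check (abridged and illustrative; the real validate_board checks more than this): walk the dict, collect human-readable error strings with a JSON-path-style prefix, and let the caller treat an empty list as a pass.

```python
TASKS = {"next-event", "remaining-time", "outcome", "bottleneck", "conformance"}
TOP_KEYS = ("task", "dataset", "metric", "scored_with", "split", "entries")
ENTRY_KEYS = ("model", "version", "predictions_path", "score")

def validate_board(board):
    errors = []
    for key in TOP_KEYS:
        if key not in board:
            errors.append(f"$.{key}: missing")
    if board.get("task") not in TASKS:
        errors.append(f"$.task: {board.get('task')!r} is not a known task")
    for i, entry in enumerate(board.get("entries", [])):
        for key in ENTRY_KEYS:
            if key not in entry:
                errors.append(f"$.entries[{i}].{key}: missing")
        if "score" in entry and not isinstance(entry["score"], dict):
            errors.append(f"$.entries[{i}].score: must be an object")
    return errors
```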
- Combined schema check + score rescore on a single leaderboard file. Exit 0 = clean; exit 2 with schema-prefixed or score-prefixed error messages on stderr
- --no-rescore for fast schema-only sanity check
- CONTRIBUTING.md now recommends `pm-bench validate <file>` as the pre-PR step before regenerating STANDINGS
- 4 new tests covering clean board, --no-rescore path, schema error surfacing, score drift; 147 total, ruff clean
- pm_bench/baselines/global_rate.py: constant training-positive-rate prediction. AUC = 0.5 (tied ranks). The "doesn't condition on the trace at all" outcome floor
- pm_bench/baselines/random_rank.py: deterministic SHA-256-based pseudo-random score per transition. NDCG@10 0.943 — stable across CI runs (seeded by the hash, not Python's PRNG)
- CLI: predict --baseline global (outcome), --baseline random (bottleneck) wired alongside existing baselines
- New leaderboard entries:
  * outcome/synthetic-toy: global-ref AUC 0.5 vs prior-ref 0.6319
  * bottleneck/synthetic-toy: random-ref NDCG 0.9434 vs mean-wait-ref 0.9911
- All 5 boards now have 2 entries; multi-entry sort asserted across every task
- STANDINGS regenerated; 147 tests, ruff clean
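The determinism trick in random_rank is worth a note: hashing the transition name instead of calling a PRNG means the score is identical across machines, processes, and CI re-runs. One way to do it (illustrative, not the repo's exact code):

```python
import hashlib

def pseudo_random_score(a: str, b: str) -> float:
    """Deterministic score in [0, 1) for transition (a, b), stable across runs."""
    digest = hashlib.sha256(f"{a}->{b}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

print(pseudo_random_score("pack", "ship"))  # same value on every machine, every run
```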
Adds dependabot.yml so pip + github-actions ecosystems get monthly grouped dev updates and individually-PR'd runtime/major bumps. Matches the pattern already in place on skillcheck/mcprec/shruti.