chore: add dependabot config #24
Merged
Conversation
… score
- prefixes.py — extract (case_id, prefix_idx, prefix, true_next) targets from a split; CSV round-trip helpers
- predictions.py — predictions CSV format (case_id, prefix_idx, ranked)
- baselines/markov.py — first-order Markov reference (train-only fit, unigram fallback for unseen last-activity)
- CLI gains `prefixes`, `predict --baseline markov`, `score`; the full `split → prefixes → predict → score` loop now matches the README
- tests/test_e2e.py exercises the loop via click runner, locking the file formats the leaderboard depends on
- 24 tests pass (was 17); ruff clean
- Markov on synthetic-toy: top1 0.976, top3 1.000 — sets the floor any future model has to clear
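The Markov reference is intentionally tiny. A rough sketch of the idea (illustrative names, not the repo's markov.py): count last-activity → next-activity transitions on training cases only, rank candidates by count, and fall back to unigram frequencies when a prefix ends in an activity never seen in training.

```python
from collections import Counter, defaultdict

def fit_markov(train_cases):
    """train_cases: iterable of activity sequences (one per case)."""
    transitions = defaultdict(Counter)   # last activity -> Counter of next activities
    unigrams = Counter()                 # fallback for unseen last activities
    for case in train_cases:
        unigrams.update(case)
        for a, b in zip(case, case[1:]):
            transitions[a][b] += 1
    return transitions, unigrams

def rank_next(model, prefix):
    """Return candidate next activities, most likely first, for the given prefix."""
    transitions, unigrams = model
    counts = transitions.get(prefix[-1]) or unigrams  # unigram fallback
    return [act for act, _ in counts.most_common()]

# Fit on three toy cases, then predict after the prefix ["a", "b"]
model = fit_markov([["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]])
print(rank_next(model, ["a", "b"]))  # ['c', 'd']
```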
- cache.py — `$PM_BENCH_CACHE` → `~/.cache/pm-bench/` with per-dataset paths; rejects synthetic and unknown formats
- fetch.py — `ensure_cached(dataset)` covers cached+match, cached+mismatch (loud HashMismatchError), cached+unpinned (returns actual hash), not-cached (auto-download if URL set, otherwise ManualFetchRequired with the precise landing URL + on-disk path). Streams in 1 MiB chunks; atomic .part-then-rename writes
- CLI: `pm-bench fetch <name> [--pin]` — prints status, emits a pasteable registry.yml sha256 patch when `--pin` is set against an unpinned-but-present cached file (the path the TOS-gated workflow takes)
- 13 new tests (test_cache.py, test_fetch.py); 37 total, ruff clean
- STATUS / GOALS / README updated: v0.1 marked partial — machinery shipped, per-dataset hash pins pending one-time manual downloads
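The streaming plus atomic-rename part is the piece worth spelling out. A minimal sketch of that pattern, assuming a plain urllib fetch (names and error handling are illustrative, not fetch.py's actual API): write 1 MiB chunks to a `.part` file, hash as you go, and only rename into place once the whole body has landed, so an interrupted download never leaves a half-written file at the final path.

```python
import hashlib
import urllib.request
from pathlib import Path

CHUNK = 1 << 20  # 1 MiB

def download_atomic(url: str, dest: Path) -> str:
    """Stream url to dest via a .part temp file; return the sha256 hex digest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_suffix(dest.suffix + ".part")
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as resp, open(part, "wb") as out:
        while True:
            chunk = resp.read(CHUNK)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
    part.rename(dest)  # same-filesystem rename: no partial file ever sits at dest
    return digest.hexdigest()
```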
- leaderboard/next-event/synthetic-toy.json — first standings file, with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz — reference predictions, checked in so the loop is reproducible without hitting the network
- pm_bench/leaderboard.py — load_board, rescore, verify, standings. Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` — pretty-prints standings, optionally re-runs scoring against the checked-in predictions and fails if recorded != actual
- tests/test_leaderboard.py — 8 tests including a drift-detection canary that tampers with the recorded score and confirms verify() flags it
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
- `pm-bench leaderboard --all [--verify]` walks every leaderboard/<task>/<dataset>.json so CI and contributors share one command. Without --verify, prints OK/DRIFT per board; with it, exits non-zero on any drift
- .github/workflows/leaderboard.yml — dedicated job that runs the --all --verify command on every push / PR touching scoring code or standings files. Surfaces leaderboard health as its own check rather than burying it inside pytest
- 2 new tests cover the --all path on both clean and tampered trees; 47 total, ruff clean
- README v0.4 milestone narrowed: only the landing page remains
- score_remaining_time → MAE in days; equally-weighted across prefixes
- pm_bench/prefixes.py grows TimeTarget + extract_remaining_time_targets
+ CSV r/w; truth file shape parallels next-event so models share a
loader
- pm_bench/baselines/mean_time.py — fit_mean_time predicts the mean
remaining-time observed on training prefixes; the dumbest model that
still respects the train/test split
- CLI: prefixes / predict / score all dispatch on
--task {next-event,remaining-time}. predict --baseline mean is
remaining-time only; markov is next-event only
- leaderboard.py — rescore + verify handle both tasks; standings sorts
ascending for MAE, descending for accuracy
- leaderboard/remaining-time/synthetic-toy.json with mean-ref entry
(MAE 1.255 days, n 41); predictions checked in
- 8 new tests; 59 total, ruff clean
- README v0.2 marked shipped; v0.3 marked partial (2 of 5 tasks)
- score_outcome — pure-CPython rank-sum AUC with average-rank tie breaking; single-class degenerate case returns 0.5 (rather than NaN) so leaderboard rows stay readable
- prefixes.py: OutcomeTarget + extract_outcome_targets — repeats the case's final 0/1 outcome at every prefix length so models see the same target with progressively more context
- baselines/prior_outcome.py: last-activity-conditioned positive rate (with global-rate fallback for unseen activities). The dumbest baseline that uses *any* prefix signal — tying it means the model isn't using the trace at all
- _synth.is_positive_outcome — synthetic-toy outcome rule (case ends with delivery_confirmed)
- CLI: --task outcome, --baseline prior wired through prefixes / predict / score; outcome rule dispatch by dataset name
- 8 new tests (test_outcome.py) — extraction, baseline determinism, per-last-activity rates, CSV round-trips, e2e click pipeline; 73 total
- No leaderboard entry on synthetic-toy yet: seed=42's test partition happens to have n_pos=0, so AUC degenerates. The pipeline still runs cleanly; a real leaderboard row waits on a pinned BPI dataset
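For reference, the rank-sum formulation the scorer describes fits in a few lines of pure Python (a sketch with illustrative names, not the repo's score_outcome): assign average ranks to tied scores, then AUC = (R_pos - n_pos(n_pos+1)/2) / (n_pos * n_neg), returning 0.5 whenever one class is empty.

```python
def rank_sum_auc(scores, labels):
    """AUC via the Mann-Whitney rank-sum statistic; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # degenerate single-class case: keep the leaderboard row readable

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1            # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(rank_sum_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # perfect separation -> 1.0
print(rank_sum_auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))  # all tied -> 0.5
```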
- score_bottleneck — pure-CPython NDCG@k. Predictions rank transitions; truth is the held-out per-(a,b) mean wait time. Missing predictions sink to the bottom (refusing to predict doesn't earn credit)
- pm_bench/bottleneck.py — BottleneckTarget + extract; per-transition shape (4-tuple: a, b, mean_wait_seconds, n_observations) instead of per-prefix
- baselines/mean_wait.py — train-mean-per-transition with global-mean fallback. On synthetic-toy: NDCG@10 0.9786 over 6 transitions
- CLI: --task bottleneck, --baseline mean-wait wired through prefixes / predict / score
- leaderboard/bottleneck/synthetic-toy.json with mean-wait-ref entry (NDCG@10 0.9786, n_transitions 6); pm-bench leaderboard --all now walks 3 boards (next-event, remaining-time, bottleneck)
- 7 new tests; 86 total, ruff clean
- v0.3 marked partial → 4 of 5 tasks (conformance remains)
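A sketch of NDCG@k under this setup (illustrative, not the repo's score_bottleneck): relevance is the held-out mean wait per transition, the submission supplies the ranking, and anything the submission leaves out simply contributes no gain, which is one way to make refusing to predict earn nothing.

```python
import math

def ndcg_at_k(ranked_transitions, true_wait, k=10):
    """ranked_transitions: (a, b) pairs best-first; true_wait: {(a, b): mean wait seconds}."""
    dcg = 0.0
    for i, t in enumerate(ranked_transitions[:k], start=1):
        dcg += true_wait.get(t, 0.0) / math.log2(i + 1)   # unranked or unknown pairs earn nothing
    ideal = sorted(true_wait.values(), reverse=True)[:k]
    idcg = sum(w / math.log2(i + 1) for i, w in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

truth = {("pick", "pack"): 3600.0, ("pack", "ship"): 900.0, ("ship", "deliver"): 60.0}
print(ndcg_at_k([("pick", "pack"), ("pack", "ship"), ("ship", "deliver")], truth))  # 1.0
print(ndcg_at_k([("ship", "deliver"), ("pack", "ship")], truth))                    # < 1.0
```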
- `pm-bench leaderboard [--all] --markdown` emits a markdown rendering with task-aware columns (top1/top3 for next-event, mae_days for remaining-time, AUC/n_pos for outcome, NDCG@k for bottleneck)
- STANDINGS.md checked in at repo root; tests/test_leaderboard.py fails if it drifts from what `--all --markdown` produces today (regenerate via that one command)
- README links STANDINGS so headline numbers are one click away
- v0.4 milestone closed: standings format + reference entries + verify CLI + leaderboard.yml CI + auto-generated landing page all shipped
- 89 tests pass (was 86), ruff clean
…ry plumbing
- pm_bench/io.py: read_csv_log accepts CSV / .csv.gz with either pm-bench-native columns (case_id, activity, timestamp) or PM4Py XES-derived names (case:concept:name, concept:name, time:timestamp). Bad timestamps fail with file:line context
- _load_events auto-detects path-like inputs (slashes, .csv, .csv.gz, .tsv) and routes to the loader. Registry names still work
- `pm-bench split path/to/log.csv` (and the full pipeline) runs end-to-end on any user CSV — no registry entry, no hash pin
- 8 new tests including a click-runner e2e against a tmp CSV; 97 total
- Unblocks the obvious "let me try pm-bench on my own data" path that previously required wiring a registry entry first
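The column aliasing is the whole trick. A sketch of how such a loader can work, assuming a DictReader-based reader (illustrative; io.py's real signature and error types may differ):

```python
import csv
import gzip
from datetime import datetime

# Either header set maps onto the same three logical columns
ALIASES = {
    "case_id": ("case_id", "case:concept:name"),
    "activity": ("activity", "concept:name"),
    "timestamp": ("timestamp", "time:timestamp"),
}

def read_csv_log(path):
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        reader = csv.DictReader(fh)
        cols = {}
        for logical, names in ALIASES.items():
            match = next((n for n in names if n in (reader.fieldnames or [])), None)
            if match is None:
                raise ValueError(f"{path}: no column for {logical!r} (tried {names})")
            cols[logical] = match
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            try:
                ts = datetime.fromisoformat(row[cols["timestamp"]])
            except ValueError as exc:
                raise ValueError(f"{path}:{lineno}: bad timestamp: {exc}") from None
            yield row[cols["case_id"]], row[cols["activity"]], ts
```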
- score_conformance — pure CPython, no pm4py dep. F = 2fp/(f+p),
where f is fitness and p is precision, both computed from set
overlap of the submitted DFG and the test-partition DFG (see the
sketch after this list)
- pm_bench/conformance.py — extract_dfg, write/read_model_json.
Submission format: {"transitions": [["a","b"], ...]}
- New CLI verb `pm-bench discover <name> --baseline dfg --out
model.json` — discovers a DFG from training cases. Score path
takes --dataset and --split (instead of --prefixes) since the
model is global, not per-prefix
- leaderboard/conformance/synthetic-toy.json with dfg-ref entry
(F=0.857, fitness 1.0, precision 0.75); pm-bench leaderboard
--all now walks 4 boards
- leaderboard.py + CLI standings printer + STANDINGS.md learn the
conformance column set
- 11 new tests (test_conformance.py); 108 total, ruff clean
- v0.3 (5-task scoring) closed: every task has a baseline + entry
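A hedged sketch of that overlap scoring (illustrative code, not the repo's conformance.py): treat each DFG as a set of (a, b) directly-follows pairs, take fitness as the fraction of test-partition transitions the submission covers, precision as the fraction of submitted transitions that occur in the test partition, and F as their harmonic mean.

```python
def extract_dfg(cases):
    """Directly-follows graph as a set of (a, b) pairs over activity sequences."""
    return {(a, b) for case in cases for a, b in zip(case, case[1:])}

def score_conformance(submitted, truth):
    """submitted / truth: sets of (a, b) transitions. Returns (fitness, precision, F)."""
    overlap = submitted & truth
    fitness = len(overlap) / len(truth) if truth else 0.0
    precision = len(overlap) / len(submitted) if submitted else 0.0
    f = 2 * fitness * precision / (fitness + precision) if fitness + precision else 0.0
    return fitness, precision, f

truth = extract_dfg([["order", "pick", "ship"], ["order", "pick", "cancel"]])
model = {("order", "pick"), ("pick", "ship"), ("pick", "refund")}  # one spurious edge
print(score_conformance(model, truth))  # fitness 2/3, precision 2/3, F 2/3
```

The checked-in dfg-ref numbers are consistent with this shape: 2 * 1.0 * 0.75 / 1.75 = 0.857.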
No semantic change; ASCII-only punctuation across READMEs, GOALS, source comments, doctests, and config. Verified by running the existing test suite (no test asserts on em-dash text).
- synthetic_log() default n_cases = 200 (was 50). Test partition now
has ~45 positive cases for `delivery_confirmed`, so the outcome
task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
* markov-ref: top1 0.9304 (was 0.9756 on 50 cases)
* mean-ref: MAE 1.3481 days
* mean-wait-ref: NDCG@10 0.9911 over 9 transitions
* dfg-ref: F = 1.0 (200 cases → both partitions cover the
full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json
with prior-ref entry — AUC 0.6319, n_pos 45 / 158
- _rescore_outcome + _outcome_truth_for_dataset wired into
leaderboard.py; pm-bench leaderboard --all --verify walks all 5
boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted
to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean
- pm_bench/stats.py:summarize(events, top_n) → LogStats with n_events, n_cases, n_activities, time span, earliest/latest, mean/median case length, top-N activities, top-N transitions
- CLI: pm-bench stats <name-or-path> [--top-n N] emits JSON
- Works on synthetic-toy and any CSV path that the existing _load_events dispatch accepts
- 7 new tests (test_stats.py); 116 total, ruff clean
- README gets a one-liner pointing at the command
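The summary is Counter-and-minmax work end to end. A rough sketch of the shape (the dict return and field grouping are illustrative; the real stats.py returns a LogStats object):

```python
from collections import Counter
from statistics import mean, median

def summarize(events, top_n=5):
    """events: iterable of (case_id, activity, timestamp) tuples."""
    events = sorted(events, key=lambda e: (e[0], e[2]))
    cases = {}
    for case_id, activity, ts in events:
        cases.setdefault(case_id, []).append((activity, ts))
    lengths = [len(trace) for trace in cases.values()]
    activities = Counter(a for _, a, _ in events)
    transitions = Counter(
        (a1, a2)
        for trace in cases.values()
        for (a1, _), (a2, _) in zip(trace, trace[1:])
    )
    times = [ts for _, _, ts in events]
    return {
        "n_events": len(events),
        "n_cases": len(cases),
        "n_activities": len(activities),
        "earliest": min(times),
        "latest": max(times),
        "mean_case_length": mean(lengths),
        "median_case_length": median(lengths),
        "top_activities": activities.most_common(top_n),
        "top_transitions": transitions.most_common(top_n),
    }
```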
…oard
- pm_bench/baselines/uniform.py: ranks every training-set activity in lexicographic order, identical for every prefix. The "didn't read the trace at all" floor that any real model has to clear
- predict --baseline uniform wired through CLI for next-event
- leaderboard/next-event/synthetic-toy.json now has 2 entries:
  * markov-ref: top1 0.9304, top3 1.0000
  * uniform-ref: top1 0.2025, top3 0.2785
  Standings sort puts markov above uniform (asserted by test)
- Demonstrates the leaderboard scales beyond 1 entry per (task, dataset)
- STANDINGS.md regenerated; STATUS updated
- 1 new test (multi-entry sort canary); 117 total, ruff clean
- pm_bench/baselines/zero_time.py: predicts 0 days for every prefix (absolute MAE floor)
- discover --baseline empty: submits an empty DFG (fitness 0, F 0 — absolute conformance floor)
- CLI: predict --baseline zero --task remaining-time wired alongside mean; discover --baseline empty wired alongside dfg
- New leaderboard entries:
  * remaining-time/synthetic-toy: zero-ref MAE 2.7410 vs mean-ref 1.3481
  * conformance/synthetic-toy: empty-ref F 0.0 vs dfg-ref F 1.0
- 3 of 5 boards now have 2 entries (next-event already had uniform-ref from the previous PR; outcome and bottleneck still single-entry pending future submissions)
- STANDINGS.md regenerated; 117 tests, ruff clean
Sweep across STATUS.md, baselines (uniform, zero_time), stats.py, cli.py, and the leaderboard JSON fixtures. ASCII-only punctuation. Tests pass unchanged (117); ruff clean.
- pm_bench/leaderboard.py:compare_boards(a, b) → dict — per-model score deltas. Tasks/datasets must match (loud ValueError otherwise)
- CLI: pm-bench compare A.json B.json emits the diff as JSON
- Use case: snapshot today, change something, re-snapshot, diff to see what moved. Models unique to one side get surfaced separately
- 6 new tests including click-runner smoke + cross-task rejection
- 123 total, ruff clean
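A sketch of the diff logic as described (illustrative; the real compare_boards may shape its result differently): index each board's entries by model, emit per-model metric deltas for the overlap, and list models that only appear on one side.

```python
def compare_boards(a, b):
    """a, b: leaderboard dicts with 'task', 'dataset', and an 'entries' list."""
    if (a["task"], a["dataset"]) != (b["task"], b["dataset"]):
        raise ValueError("boards must share task and dataset")
    left = {e["model"]: e["score"] for e in a["entries"]}
    right = {e["model"]: e["score"] for e in b["entries"]}
    common = left.keys() & right.keys()
    return {
        # b-minus-a deltas per shared metric key (assumes numeric metric values)
        "deltas": {
            m: {k: right[m][k] - left[m][k] for k in left[m] if k in right[m]}
            for m in sorted(common)
        },
        "only_in_a": sorted(left.keys() - right.keys()),
        "only_in_b": sorted(right.keys() - left.keys()),
    }
```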
- tests/test_fetch.py:test_ensure_cached_auto_downloads_from_url
spins up http.server in a tmp dir, points a Dataset's download_url
at 127.0.0.1:<port>, and verifies ensure_cached (server fixture
sketched after this list):
* downloads on first call (downloaded=True, pinned=True since
sha256 matches)
* hits the cache on the second call (downloaded=False)
- The auto-download path previously had no direct coverage (only
the manual-fetch error case was tested); this fills the obvious
test gap before any actually-fetchable URLs land in registry.yml
- 124 tests, ruff clean (was 123)
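The local-server scaffolding is standard library only. A rough sketch of how such a fixture can be spun up (the fixture and test below are illustrative, not the repo's actual test code):

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

import pytest

@pytest.fixture
def local_file_server(tmp_path):
    """Serve tmp_path over HTTP on an ephemeral port; yield (base_url, docroot)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(tmp_path))
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_address[1]}", tmp_path
    finally:
        server.shutdown()

def test_downloads_then_caches(local_file_server):
    base_url, docroot = local_file_server
    (docroot / "toy.csv").write_text("case_id,activity,timestamp\n")
    # point the dataset's download_url at f"{base_url}/toy.csv", then call
    # ensure_cached twice: first call downloads, second call hits the cache
```

Binding to port 0 lets the OS pick a free port, so the test never collides with anything else running on the CI host.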
- python -m bench.seeds --n 30 runs each reference baseline at N seeds of synthetic_log; prints mean / std / min / max per metric as a markdown table (or JSON via --format json)
- Quantifies the noise band a real submission has to clear before "better than the baseline" is a statistically interesting claim
- All 5 tasks covered: next-event/markov, remaining-time/mean, outcome/prior, bottleneck/mean-wait, conformance/dfg
- README gets a "Baseline variance" section pointing at the script
- 4 new tests including parametrized smoke per task; 132 total, ruff clean
- First measurement at n=5: markov top1 0.9183 ± 0.0111
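The measurement itself is just repeated runs plus summary statistics. A minimal sketch of the loop (the run_once callable is a placeholder for whatever fit-and-score each task actually does):

```python
from statistics import mean, stdev

def seed_sweep(run_once, seeds=range(30)):
    """run_once(seed) -> float metric; returns summary stats across seeds."""
    values = [run_once(seed) for seed in seeds]
    return {
        "mean": mean(values),
        "std": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }

# e.g. seed_sweep(lambda s: markov_top1_on(synthetic_log(seed=s)), seeds=range(5))
```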
- Setup section for new clones
- Adding a leaderboard entry: 5-step recipe per task
- Predictions file format table for all 5 tasks (next-event, remaining-time, outcome, bottleneck, conformance)
- Pre-PR checklist: pytest, ruff, leaderboard verify, STANDINGS regen
- Noise-quantification recommendation pointing at bench/seeds.py
- Bug reporting template
- README's "Submitting" section now points at CONTRIBUTING for the long form
- _load_events parses "synthetic-toy@99" → synthetic_log(seed=99).
Bare "synthetic-toy" still uses canonical seed=42 — leaderboard
reproducibility preserved
- Saves users from either scripting Python or threading --seed
through every command verb
- Bad seed strings ("synthetic-toy@nope") fail cleanly with a
message
- README + STATUS updated; 2 new tests; 134 total, ruff clean
- pm_bench/leaderboard_schema.py:validate_board(dict) → list[str]. Stdlib-only structural checker (no jsonschema dep). Clear error paths like "$.entries[2].score: must be an object"
- Required top keys: task, dataset, metric, scored_with, split, entries. Required entry keys: model, version, predictions_path, score. Task must be one of the 5 v0 tasks
- Parametrized test exercises every leaderboard/<task>/<dataset>.json in the repo; 4 negative tests cover missing top keys, unknown tasks, missing entry.score, non-dict score
- 9 new tests; 143 total, ruff clean
- Catches structural drift in leaderboard files before they reach the rescore path
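A sketch of the stdlib-only structural check (abridged and illustrative; the real validate_board checks more than this): walk the dict, collect human-readable error strings with a JSON-path-style prefix, and let the caller treat an empty list as a pass.

```python
TASKS = {"next-event", "remaining-time", "outcome", "bottleneck", "conformance"}
TOP_KEYS = ("task", "dataset", "metric", "scored_with", "split", "entries")
ENTRY_KEYS = ("model", "version", "predictions_path", "score")

def validate_board(board):
    errors = []
    for key in TOP_KEYS:
        if key not in board:
            errors.append(f"$.{key}: missing")
    if board.get("task") not in TASKS:
        errors.append(f"$.task: {board.get('task')!r} is not a known task")
    for i, entry in enumerate(board.get("entries", [])):
        for key in ENTRY_KEYS:
            if key not in entry:
                errors.append(f"$.entries[{i}].{key}: missing")
        if "score" in entry and not isinstance(entry["score"], dict):
            errors.append(f"$.entries[{i}].score: must be an object")
    return errors
```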
- Combined schema check + score rescore on a single leaderboard file. Exit 0 = clean; exit 2 with schema-prefixed or score-prefixed error messages on stderr
- --no-rescore for fast schema-only sanity check
- CONTRIBUTING.md now recommends `pm-bench validate <file>` as the pre-PR step before regenerating STANDINGS
- 4 new tests covering clean board, --no-rescore path, schema error surfacing, score drift; 147 total, ruff clean
- pm_bench/baselines/global_rate.py: constant training-positive-rate prediction. AUC = 0.5 (tied ranks). The "doesn't condition on the trace at all" outcome floor
- pm_bench/baselines/random_rank.py: deterministic SHA-256-based pseudo-random score per transition. NDCG@10 0.943 — stable across CI runs (seeded by the hash, not Python's PRNG)
- CLI: predict --baseline global (outcome), --baseline random (bottleneck) wired alongside existing baselines
- New leaderboard entries:
  * outcome/synthetic-toy: global-ref AUC 0.5 vs prior-ref 0.6319
  * bottleneck/synthetic-toy: random-ref NDCG 0.9434 vs mean-wait-ref 0.9911
- All 5 boards now have 2 entries; multi-entry sort asserted across every task
- STANDINGS regenerated; 147 tests, ruff clean
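The determinism trick in random_rank is worth a note: hashing the transition name instead of calling a PRNG means the score is identical across machines, processes, and CI re-runs. One way to do it (illustrative, not the repo's exact code):

```python
import hashlib

def pseudo_random_score(a: str, b: str) -> float:
    """Deterministic score in [0, 1) for transition (a, b), stable across runs."""
    digest = hashlib.sha256(f"{a}->{b}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

print(pseudo_random_score("pack", "ship"))  # same value on every machine, every run
```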
Adds dependabot.yml so pip + github-actions ecosystems get monthly grouped dev updates and individually-PR'd runtime/major bumps. Matches the pattern already in place on skillcheck/mcprec/shruti.