
chore: add dependabot config #24

Merged
protosphinx merged 25 commits into main from chore/dependabot-config on May 1, 2026

Conversation

@protosphinx
Member

Adds dependabot.yml so the pip and github-actions ecosystems get monthly, grouped dev-dependency updates, with runtime and major bumps opened as individual PRs. Matches the pattern already in place on skillcheck/mcprec/shruti.

… score

- prefixes.py — extract (case_id, prefix_idx, prefix, true_next) targets
  from a split; CSV round-trip helpers
- predictions.py — predictions CSV format (case_id, prefix_idx, ranked)
- baselines/markov.py — first-order Markov reference (train-only fit,
  unigram fallback for unseen last-activity); see the sketch after this list
- CLI gains `prefixes`, `predict --baseline markov`, `score`; the full
  `split → prefixes → predict → score` loop now matches the README
- tests/test_e2e.py exercises the loop via click runner, locking the
  file formats the leaderboard depends on
- 24 tests pass (was 17); ruff clean
- Markov on synthetic-toy: top1 0.976, top3 1.000 — sets the floor any
  future model has to clear
- cache.py — `$PM_BENCH_CACHE` → `~/.cache/pm-bench/` with per-dataset
  paths; rejects synthetic and unknown formats
- fetch.py — `ensure_cached(dataset)` covers cached+match,
  cached+mismatch (loud HashMismatchError), cached+unpinned (returns
  actual hash), not-cached (auto-download if URL set, otherwise
  ManualFetchRequired with the precise landing URL + on-disk path).
  Streams in 1 MiB chunks; atomic .part-then-rename writes
- CLI: `pm-bench fetch <name> [--pin]` — prints status, emits a
  pasteable registry.yml sha256 patch when `--pin` is set against an
  unpinned-but-present cached file (the path the TOS-gated workflow
  takes)
- 13 new tests (test_cache.py, test_fetch.py); 37 total, ruff clean
- STATUS / GOALS / README updated: v0.1 marked partial — machinery
  shipped, per-dataset hash pins pending one-time manual downloads
- leaderboard/next-event/synthetic-toy.json — first standings file,
  with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz —
  reference predictions, checked in so the loop is reproducible
  without hitting the network
- pm_bench/leaderboard.py — load_board, rescore, verify, standings.
  Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` —
  pretty-prints standings, optionally re-runs scoring against the
  checked-in predictions and fails if recorded != actual
- tests/test_leaderboard.py — 8 tests including a drift-detection
  canary that tampers with the recorded score and confirms verify()
  flags it
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
- `pm-bench leaderboard --all [--verify]` walks every
  leaderboard/<task>/<dataset>.json so CI and contributors share one
  command. Without --verify, prints OK/DRIFT per board; with --verify,
  exits non-zero on any drift
- .github/workflows/leaderboard.yml — dedicated job that runs the
  --all --verify command on every push / PR touching scoring code or
  standings files. Surfaces leaderboard health as its own check
  rather than burying it inside pytest
- 2 new tests cover the --all path on both clean and tampered trees;
  47 total, ruff clean
- README v0.4 milestone narrowed: only the landing page remains
- score_remaining_time → MAE in days; equally-weighted across prefixes
- pm_bench/prefixes.py grows TimeTarget + extract_remaining_time_targets
  + CSV r/w; truth file shape parallels next-event so models share a
  loader
- pm_bench/baselines/mean_time.py — fit_mean_time predicts the mean
  remaining-time observed on training prefixes; the dumbest model that
  still respects the train/test split
- CLI: prefixes / predict / score all dispatch on
  --task {next-event,remaining-time}. predict --baseline mean is
  remaining-time only; markov is next-event only
- leaderboard.py — rescore + verify handle both tasks; standings sorts
  ascending for MAE, descending for accuracy
- leaderboard/remaining-time/synthetic-toy.json with mean-ref entry
  (MAE 1.255 days, n 41); predictions checked in
- 8 new tests; 59 total, ruff clean
- README v0.2 marked shipped; v0.3 marked partial (2 of 5 tasks)
- score_outcome — pure-CPython rank-sum AUC with average-rank tie
  breaking; single-class degenerate case returns 0.5 (rather than
  NaN) so leaderboard rows stay readable; see the sketch after this list
- prefixes.py: OutcomeTarget + extract_outcome_targets — repeats the
  case's final 0/1 outcome at every prefix length so models see the
  same target with progressively more context
- baselines/prior_outcome.py: last-activity-conditioned positive
  rate (with global-rate fallback for unseen activities). The
  dumbest baseline that uses *any* prefix signal — tying it means
  the model isn't using the trace at all
- _synth.is_positive_outcome — synthetic-toy outcome rule (case ends
  with delivery_confirmed)
- CLI: --task outcome, --baseline prior wired through prefixes /
  predict / score; outcome rule dispatch by dataset name
- 8 new tests (test_outcome.py) — extraction, baseline determinism,
  per-last-activity rates, CSV round-trips, e2e click pipeline; 73
  total
- No leaderboard entry on synthetic-toy yet: seed=42's test partition
  happens to have n_pos=0, so AUC degenerates. The pipeline still
  runs cleanly; a real leaderboard row waits on a pinned BPI dataset
- score_bottleneck — pure-CPython NDCG@k. Predictions rank
  transitions; truth is the held-out per-(a,b) mean wait time.
  Missing predictions sink to the bottom (refusing to predict
  doesn't earn credit); see the sketch after this list
- pm_bench/bottleneck.py — BottleneckTarget + extract; per-transition
  shape (4-tuple: a, b, mean_wait_seconds, n_observations) instead
  of per-prefix
- baselines/mean_wait.py — train-mean-per-transition with global-mean
  fallback. On synthetic-toy: NDCG@10 0.9786 over 6 transitions
- CLI: --task bottleneck, --baseline mean-wait wired through
  prefixes / predict / score
- leaderboard/bottleneck/synthetic-toy.json with mean-wait-ref entry
  (NDCG@10 0.9786, n_transitions 6); pm-bench leaderboard --all now
  walks 3 boards (next-event, remaining-time, bottleneck)
- 7 new tests; 86 total, ruff clean
- v0.3 marked partial → 4 of 5 tasks (conformance remains)
- `pm-bench leaderboard [--all] --markdown` emits a markdown
  rendering with task-aware columns (top1/top3 for next-event,
  mae_days for remaining-time, AUC/n_pos for outcome, NDCG@k for
  bottleneck)
- STANDINGS.md checked in at repo root; tests/test_leaderboard.py
  fails if it drifts from what `--all --markdown` produces today
  (regenerate via that one command)
- README links STANDINGS so headline numbers are one click away
- v0.4 milestone closed: standings format + reference entries +
  verify CLI + leaderboard.yml CI + auto-generated landing page all
  shipped
- 89 tests pass (was 86), ruff clean
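
The first-order Markov reference described above (baselines/markov.py) is small enough to sketch end to end. This is a minimal illustration under assumed shapes, not the shipped code: the helper names `fit_markov` / `rank_next` and the list-of-activity-sequences input are made up for the example.

```python
from collections import Counter, defaultdict

def fit_markov(train_cases):
    """Fit transition counts plus a unigram fallback from training traces only.

    train_cases: list of activity sequences, one list per case (assumed shape).
    """
    transitions = defaultdict(Counter)   # last activity -> Counter of next activities
    unigrams = Counter()                 # overall activity frequencies (fallback)
    for case in train_cases:
        unigrams.update(case)
        for a, b in zip(case, case[1:]):
            transitions[a][b] += 1
    return transitions, unigrams

def rank_next(prefix, transitions, unigrams):
    """Rank candidate next activities for a prefix, most likely first."""
    counts = transitions.get(prefix[-1])
    if not counts:                       # unseen last activity -> unigram fallback
        counts = unigrams
    return [activity for activity, _ in counts.most_common()]

# Toy usage: top-1 prediction after "pick" is "ship".
train = [["register", "pick", "ship"], ["register", "pick", "pick", "ship"]]
transitions, unigrams = fit_markov(train)
print(rank_next(["register", "pick"], transitions, unigrams)[:3])  # ['ship', 'pick']
```
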
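
The outcome scorer above is a rank-sum (Mann-Whitney) AUC with average-rank tie breaking and a 0.5 return on single-class splits. A self-contained sketch of that computation; the function name and argument shapes are illustrative, not the pm_bench API:

```python
def rank_sum_auc(scores, labels):
    """Mann-Whitney AUC from rank sums; tied scores share their average rank.

    scores: predicted positive-class scores; labels: 0/1 ground truth.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:          # degenerate single-class split
        return 0.5
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                 # assign average ranks across tie blocks
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1        # 1-based rank shared by the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(rank_sum_auc([0.2, 0.8, 0.5, 0.5], [0, 1, 1, 0]))  # 0.875
```
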
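
Similarly, the bottleneck NDCG@k can be sketched with the held-out mean wait as the gain. How pm_bench orders the unranked transitions is only summarised above as "sink to the bottom"; this sketch simply appends them after the submitted ranking, and every name here is illustrative:

```python
import math

def ndcg_at_k(ranked, truth, k=10):
    """NDCG@k with the held-out mean wait as gain; transitions the submission
    never ranks are appended at the end so they earn no early credit."""
    ranked_set = set(ranked)
    full = list(ranked) + [t for t in truth if t not in ranked_set]
    dcg = sum(truth.get(t, 0.0) / math.log2(i + 2) for i, t in enumerate(full[:k]))
    ideal = sorted(truth.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

truth = {("pick", "pack"): 7200.0, ("pack", "ship"): 600.0, ("register", "pick"): 60.0}
print(round(ndcg_at_k([("pack", "ship"), ("pick", "pack")], truth), 2))  # 0.68
```
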
…ry plumbing

- pm_bench/io.py: read_csv_log accepts CSV / .csv.gz with either
  pm-bench-native columns (case_id, activity, timestamp) or PM4Py
  XES-derived names (case:concept:name, concept:name,
  time:timestamp). Bad timestamps fail with file:line context
- _load_events auto-detects path-like inputs (slashes, .csv, .csv.gz,
  .tsv) and routes to the loader. Registry names still work
- `pm-bench split path/to/log.csv` (and the full pipeline) runs
  end-to-end on any user CSV — no registry entry, no hash pin
- 8 new tests including a click-runner e2e against a tmp CSV; 97 total
- Unblocks the obvious "let me try pm-bench on my own data" path
  that previously required wiring a registry entry first
- score_conformance — pure CPython, no pm4py dep. F = 2fp/(f+p)
  where f (fitness) and p (precision) are computed from set overlap of
  the submitted DFG and the test-partition DFG; see the sketch after
  this list
- pm_bench/conformance.py — extract_dfg, write/read_model_json.
  Submission format: {"transitions": [["a","b"], ...]}
- New CLI verb `pm-bench discover <name> --baseline dfg --out
  model.json` — discovers a DFG from training cases. Score path
  takes --dataset and --split (instead of --prefixes) since the
  model is global, not per-prefix
- leaderboard/conformance/synthetic-toy.json with dfg-ref entry
  (F=0.857, fitness 1.0, precision 0.75); pm-bench leaderboard
  --all now walks 4 boards
- leaderboard.py + CLI standings printer + STANDINGS.md learn the
  conformance column set
- 11 new tests (test_conformance.py); 108 total, ruff clean
- v0.3 (5-task scoring) closed: every task has a baseline + entry
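
The conformance score above boils down to set arithmetic over directly-follows pairs. A minimal sketch, assuming f is the fraction of test-DFG edges the submission covers and p is the fraction of submitted edges present in the test DFG; the shipped pm_bench/conformance.py may differ in detail:

```python
def extract_dfg(cases):
    """Directly-follows graph: every (a, b) pair where b immediately follows a."""
    return {(a, b) for case in cases for a, b in zip(case, case[1:])}

def conformance_f(submitted, reference):
    """F = 2fp/(f+p) from set overlap of submitted vs. test-partition DFG edges."""
    if not submitted or not reference:
        return 0.0                      # empty DFG -> absolute floor
    hits = submitted & reference
    fitness = len(hits) / len(reference)
    precision = len(hits) / len(submitted)
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

test_cases = [["register", "pick", "ship"], ["register", "pick", "pack", "ship"]]
model = {("register", "pick"), ("pick", "ship"), ("pick", "pack"), ("pack", "ship")}
print(conformance_f(model, extract_dfg(test_cases)))  # 1.0
```
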

No semantic change; ASCII-only punctuation across READMEs, GOALS,
source comments, doctests, and config. Verified by running the
existing test suite (no test asserts on em-dash text).

- synthetic_log() default n_cases = 200 (was 50). Test partition now
  has ~45 positive cases for `delivery_confirmed`, so the outcome
  task gets a real AUC instead of degenerating to 0.5
- All 4 existing reference entries regenerated and re-scored:
  * markov-ref:    top1 0.9304  (was 0.9756 on 50 cases)
  * mean-ref:      MAE 1.3481 days
  * mean-wait-ref: NDCG@10 0.9911 over 9 transitions
  * dfg-ref:       F = 1.0  (200 cases → both partitions cover the
                             full path graph)
- 5th leaderboard board added: leaderboard/outcome/synthetic-toy.json
  with prior-ref entry — AUC 0.6319, n_pos 45 / 158
- _rescore_outcome + _outcome_truth_for_dataset wired into
  leaderboard.py; pm-bench leaderboard --all --verify walks all 5
  boards
- registry.yml synthetic-toy row updated (cases 200, events 965)
- STANDINGS.md regenerated; README + STATUS updated; tests adjusted
  to the new numbers (drop "n_pos=0 by accident" comments)
- 109 tests, ruff clean
- pm_bench/stats.py:summarize(events, top_n) → LogStats with
  n_events, n_cases, n_activities, time span, earliest/latest,
  mean/median case length, top-N activities, top-N transitions
  (approximated in the sketch after this list)
- CLI: pm-bench stats <name-or-path> [--top-n N] emits JSON
- Works on synthetic-toy and any CSV path that the existing
  _load_events dispatch accepts
- 7 new tests (test_stats.py); 116 total, ruff clean
- README gets a one-liner pointing at the command
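
A rough approximation of the summary the stats command computes, written against plain (case_id, activity, timestamp) tuples. The real pm_bench/stats.py returns a LogStats object and covers more fields (time span, earliest/latest), so this is only a sketch:

```python
from collections import Counter
from statistics import mean, median

def summarize(events, top_n=3):
    """Summarize an event log given as (case_id, activity, timestamp) tuples."""
    traces = {}
    for case_id, activity, ts in sorted(events, key=lambda e: (e[0], e[2])):
        traces.setdefault(case_id, []).append(activity)
    lengths = [len(t) for t in traces.values()]
    activities = Counter(a for _, a, _ in events)
    transitions = Counter(p for t in traces.values() for p in zip(t, t[1:]))
    return {
        "n_events": len(events),
        "n_cases": len(traces),
        "n_activities": len(activities),
        "mean_case_length": mean(lengths),
        "median_case_length": median(lengths),
        "top_activities": activities.most_common(top_n),
        "top_transitions": transitions.most_common(top_n),
    }

log = [("c1", "register", 1), ("c1", "ship", 2), ("c2", "register", 1)]
print(summarize(log)["n_cases"])  # 2
```
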
…oard

- pm_bench/baselines/uniform.py: ranks every training-set activity
  in lexicographic order, identical for every prefix. The "didn't
  read the trace at all" floor that any real model has to clear
- predict --baseline uniform wired through CLI for next-event
- leaderboard/next-event/synthetic-toy.json now has 2 entries:
  * markov-ref:  top1 0.9304, top3 1.0000
  * uniform-ref: top1 0.2025, top3 0.2785
  Standings sort puts markov above uniform (asserted by test)
- Demonstrates leaderboard scales beyond 1 entry per (task, dataset)
- STANDINGS.md regenerated; STATUS updated
- 1 new test (multi-entry sort canary); 117 total, ruff clean
- pm_bench/baselines/zero_time.py: predicts 0 days for every prefix
  (absolute MAE floor)
- discover --baseline empty: submits an empty DFG (fitness 0, F 0 —
  absolute conformance floor)
- CLI: predict --baseline zero --task remaining-time wired alongside
  mean; discover --baseline empty wired alongside dfg
- New leaderboard entries:
  * remaining-time/synthetic-toy: zero-ref MAE 2.7410 vs mean-ref 1.3481
  * conformance/synthetic-toy:    empty-ref F 0.0    vs dfg-ref F 1.0
- 3 of 5 boards now have 2 entries (next-event already had uniform-ref
  from the previous PR; outcome and bottleneck still single-entry
  pending future submissions)
- STANDINGS.md regenerated; 117 tests, ruff clean

Sweep across STATUS.md, baselines (uniform, zero_time), stats.py,
cli.py, and the leaderboard JSON fixtures. ASCII-only punctuation.
Tests pass unchanged (117); ruff clean.

- pm_bench/leaderboard.py:compare_boards(a, b) → dict — per-model
  score deltas. Tasks/datasets must match (loud ValueError otherwise)
- CLI: pm-bench compare A.json B.json emits the diff as JSON
- Use case: snapshot today, change something, re-snapshot, diff to
  see what moved. Models unique to one side get surfaced separately
- 6 new tests including click-runner smoke + cross-task rejection
- 123 total, ruff clean
- tests/test_fetch.py:test_ensure_cached_auto_downloads_from_url
  spins up http.server in a tmp dir, points a Dataset's download_url
  at 127.0.0.1:<port>, and verifies ensure_cached:
  * downloads on first call (downloaded=True, pinned=True since
    sha256 matches)
  * hits the cache on the second call (downloaded=False)
- Previously only the manual-fetch error case had coverage; this
  fills the obvious auto-download test gap before any
  actually-fetchable URLs land in registry.yml
- 124 tests, ruff clean (was 123)
- python -m bench.seeds --n 30 runs each reference baseline at N
  seeds of synthetic_log; prints mean / std / min / max per metric
  as a markdown table (or JSON via --format json)
- Quantifies the noise band a real submission has to clear before
  "better than the baseline" is a statistically interesting claim
- All 5 tasks covered: next-event/markov, remaining-time/mean,
  outcome/prior, bottleneck/mean-wait, conformance/dfg
- README gets a "Baseline variance" section pointing at the script
- 4 new tests including parametrized smoke per task; 132 total,
  ruff clean
- First measurement at n=5: markov top1 0.9183 ± 0.0111
- Setup section for new clones
- Adding a leaderboard entry: 5-step recipe per task
- Predictions file format table for all 5 tasks (next-event,
  remaining-time, outcome, bottleneck, conformance)
- Pre-PR checklist: pytest, ruff, leaderboard verify, STANDINGS regen
- Noise-quantification recommendation pointing at bench/seeds.py
- Bug reporting template
- README's "Submitting" section now points at CONTRIBUTING for the
  long form
- _load_events parses "synthetic-toy@99" → synthetic_log(seed=99).
  Bare "synthetic-toy" still uses canonical seed=42 — leaderboard
  reproducibility preserved
- Saves users from either scripting Python or threading --seed
  through every command verb
- Bad seed strings ("synthetic-toy@nope") fail cleanly with a
  message
- README + STATUS updated; 2 new tests; 134 total, ruff clean
- pm_bench/leaderboard_schema.py:validate_board(dict) → list[str].
  Stdlib-only structural checker (no jsonschema dep). Clear error
  paths like "$.entries[2].score: must be an object"; see the sketch
  after this list
- Required top keys: task, dataset, metric, scored_with, split,
  entries. Required entry keys: model, version, predictions_path,
  score. Task must be one of the 5 v0 tasks
- Parametrized test exercises every leaderboard/<task>/<dataset>.json
  in the repo; 4 negative tests cover missing top keys, unknown
  tasks, missing entry.score, non-dict score
- 9 new tests; 143 total, ruff clean
- Catches structural drift in leaderboard files before they reach
  the rescore path
- `pm-bench validate <file>`: combined schema check + score rescore on
  a single leaderboard file. Exit 0 = clean; exit 2 with schema-prefixed
  or score-prefixed error messages on stderr
- --no-rescore for fast schema-only sanity check
- CONTRIBUTING.md now recommends `pm-bench validate <file>` as the
  pre-PR step before regenerating STANDINGS
- 4 new tests covering clean board, --no-rescore path, schema error
  surfacing, score drift; 147 total, ruff clean
- pm_bench/baselines/global_rate.py: constant training-positive-rate
  prediction. AUC = 0.5 (tied ranks). The "doesn't condition on the
  trace at all" outcome floor
- pm_bench/baselines/random_rank.py: deterministic SHA-256-based
  pseudo-random score per transition. NDCG@10 0.943 — stable across
  CI runs (seeded by the SHA, not Python's PRNG); see the sketch after
  this list
- CLI: predict --baseline global (outcome), --baseline random
  (bottleneck) wired alongside existing baselines
- New leaderboard entries:
  * outcome/synthetic-toy:    global-ref AUC 0.5    vs prior-ref 0.6319
  * bottleneck/synthetic-toy: random-ref NDCG 0.9434 vs mean-wait-ref 0.9911
- All 5 boards now have 2 entries; multi-entry sort asserted across
  every task
- STANDINGS regenerated; 147 tests, ruff clean
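
The structural checker described above is straightforward to approximate with stdlib-only code. The required key lists and the example error path come from the bullets; the helper structure and exact messages are guesses, not the shipped pm_bench/leaderboard_schema.py:

```python
REQUIRED_TOP = ("task", "dataset", "metric", "scored_with", "split", "entries")
REQUIRED_ENTRY = ("model", "version", "predictions_path", "score")
KNOWN_TASKS = {"next-event", "remaining-time", "outcome", "bottleneck", "conformance"}

def validate_board(board):
    """Return a list of error strings; an empty list means the board is structurally sound."""
    errors = []
    for key in REQUIRED_TOP:
        if key not in board:
            errors.append(f"$.{key}: missing required key")
    if board.get("task") not in KNOWN_TASKS:
        errors.append("$.task: must be one of the 5 v0 tasks")
    for i, entry in enumerate(board.get("entries", [])):
        for key in REQUIRED_ENTRY:
            if key not in entry:
                errors.append(f"$.entries[{i}].{key}: missing required key")
        if not isinstance(entry.get("score"), dict):
            errors.append(f"$.entries[{i}].score: must be an object")
    return errors

bad = {"task": "next-event", "dataset": "synthetic-toy",
       "entries": [{"model": "markov-ref", "score": 0.93}]}
for err in validate_board(bad):
    print(err)
```
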
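
The random-rank floor above gets its CI stability from hashing rather than a seeded PRNG. A sketch of that idea; the hash input format is an assumption, and the real pm_bench/baselines/random_rank.py may key the digest differently:

```python
import hashlib

def pseudo_random_score(transition):
    """Deterministic score in [0, 1) for an (a, b) transition, derived from SHA-256."""
    digest = hashlib.sha256("->".join(transition).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

transitions = [("register", "pick"), ("pick", "pack"), ("pack", "ship")]
ranked = sorted(transitions, key=pseudo_random_score, reverse=True)
print(ranked)  # same ordering on every run and every machine
```
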
protosphinx added the automated (Opened by the daily bot) label on May 1, 2026
protosphinx merged commit 522e972 into main on May 1, 2026
6 checks passed
protosphinx deleted the chore/dependabot-config branch on May 1, 2026 at 05:48
