2 changes: 2 additions & 0 deletions STANDINGS.md
@@ -15,6 +15,7 @@ _DFG fitness × precision → F-score (higher is better)_
| Model | F | Fitness | Precision | n_test | n_model |
|---|---:|---:|---:|---:|---:|
| `dfg-ref` | 1.0000 | 1.0000 | 1.0000 | 9 | 9 |
| `empty-ref` | 0.0000 | 0.0000 | 0.0000 | 9 | 0 |
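
How the F column relates to the other two - a minimal sketch assuming the standard harmonic-mean F-score (consistent with both rows above; pm-bench's exact formula isn't shown in this diff):

```python
def fscore(fitness: float, precision: float) -> float:
    # Harmonic mean of fitness and precision. Define F = 0 when both
    # are 0 so the empty model lands on the floor instead of dividing
    # by zero - matching the `empty-ref` row above.
    if fitness + precision == 0.0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

assert fscore(1.0, 1.0) == 1.0  # dfg-ref
assert fscore(0.0, 0.0) == 0.0  # empty-ref
```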

### next-event · synthetic-toy
_top1 / top3 accuracy_
@@ -37,3 +38,4 @@ _MAE in days (lower is better)_
| Model | mae_days | n |
|---|---:|---:|
| `mean-ref` | 1.3481 | 158 |
| `zero-ref` | 2.7410 | 158 |
18 changes: 14 additions & 4 deletions STATUS.md
@@ -60,17 +60,27 @@ pm-bench fetch bpi2020 --pin

## Recently shipped

- **Floor baselines for time + conformance** (`floor-baselines` branch).
  - `zero-time` for remaining-time: predicts 0 days for every prefix.
    MAE 2.741 on synthetic-toy - almost exactly twice mean-ref's 1.348
    (2.03×), roughly the ratio a constant zero predictor should land at
    when remaining times are spread fairly evenly around their mean
    (see the sketch just below).
  - `empty` for conformance: submits a model with no transitions.
    F = 0 by construction; the absolute conformance floor.
  - Both wired through CLI (`predict --baseline zero`, `discover
    --baseline empty`) and added as second entries on their boards.
    3 of 5 boards now have >1 entry; outcome and bottleneck are
    still single-entry pending future submissions.
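
  A standalone sketch of why the zero predictor lands near 2× mean-ref
  (not pm-bench code; the remaining times below are fabricated and
  roughly uniform):

  ```python
  # Constant-zero vs constant-mean MAE on made-up remaining times (days).
  remaining = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

  mean = sum(remaining) / len(remaining)  # 2.25
  mae_zero = sum(abs(y) for y in remaining) / len(remaining)         # 2.25
  mae_mean = sum(abs(y - mean) for y in remaining) / len(remaining)  # 1.00

  print(mae_zero / mae_mean)  # 2.25 - near the 2.03x seen on synthetic-toy
  ```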
- **Uniform second baseline + multi-entry leaderboard demo**
  (`uniform-baseline` branch).
  - `pm_bench/baselines/uniform.py` ranks every training-set
  - `pm_bench/baselines/uniform.py` - ranks every training-set
    activity in lexicographic order, identical for every prefix. The
    "didn't read the trace at all" floor (sketch below).
  - `predict --baseline uniform` for next-event.
  - `leaderboard/next-event/synthetic-toy.json` now has **2 entries**:
    markov-ref top-1 0.9304 vs uniform-ref top-1 0.2025. Standings
    sort puts markov on top.
  - Demonstrates the leaderboard scales to multiple submissions on
    one (task, dataset) pair a precondition for accepting external
    one (task, dataset) pair - a precondition for accepting external
    submissions.
  - 1 new test asserting the standings order; 117 total.
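
  For intuition, a standalone re-implementation of the uniform floor
  (hypothetical - not the shipped `uniform.py`, whose `fit_uniform` /
  `predict_uniform` operate on pm-bench's Event/Prefix types):

  ```python
  # Uniform "didn't read the trace" baseline: one fixed lexicographic
  # ranking of the training vocabulary, reused for every prefix.
  def fit_uniform_ranking(train_next_activities: list[str]) -> list[str]:
      return sorted(set(train_next_activities))

  def top1_accuracy(ranking: list[str], truths: list[str]) -> float:
      # top-1 is always the same activity, so top-1 accuracy collapses
      # to that activity's frequency as the true next event
      # (0.2025 on the board above).
      return sum(t == ranking[0] for t in truths) / len(truths)
  ```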
- **`pm-bench stats <name>`** (`stats-command` branch).
@@ -80,7 +90,7 @@ pm-bench fetch bpi2020 --pin
  - Pure CPython; works on synthetic-toy and any CSV path the
    existing `_load_events` accepts.
  - 7 new tests; 116 total.
- **Synthetic-toy bumped to 200 cases outcome row finally lands**
- **Synthetic-toy bumped to 200 cases - outcome row finally lands**
  (`synthetic-200` branch).
  - `synthetic_log()` default `n_cases` = 200 (was 50). Test partition
    now has ~45 positive cases (`delivery_confirmed`) so AUC is
@@ -90,7 +100,7 @@ pm-bench fetch bpi2020 --pin
    mean-ref MAE 1.3481, mean-wait-ref NDCG@10 0.9911,
    dfg-ref F=1.0 (both partitions now cover the full path graph).
  - **5th leaderboard board added**: `outcome/synthetic-toy.json`
    with `prior-ref` entry AUC 0.6319, n_pos 45 / 158. Real floor
    with `prior-ref` entry - AUC 0.6319, n_pos 45 / 158. Real floor
    for any temporal model on the outcome task.
  - `_rescore_outcome` + `_outcome_truth_for_dataset` added to
    `leaderboard.py`. `pm-bench leaderboard --all --verify` now
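
  For reference, a minimal rank-based ROC AUC - a sketch assuming the
  board's AUC is standard ROC AUC, not pm-bench's actual
  `_rescore_outcome`:

  ```python
  def roc_auc(scores: list[float], labels: list[int]) -> float:
      # Mann-Whitney form: the probability that a random positive case
      # outscores a random negative one, counting ties as half a win.
      pos = [s for s, y in zip(scores, labels) if y == 1]
      neg = [s for s, y in zip(scores, labels) if y == 0]
      wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
      return wins / (len(pos) * len(neg))
  ```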
16 changes: 16 additions & 0 deletions leaderboard/conformance/synthetic-toy.json
@@ -24,6 +24,22 @@
      },
      "scored_at": "2026-04-30T00:00:00Z",
      "notes": "Directly-follows graph extracted from training cases. With 200 cases the train and test partitions both observe every path-graph edge, so fitness = precision = 1.0. Any future submission has to keep this floor while generalizing to a different test split."
    },
    {
      "model": "empty-ref",
      "version": "0.1.0",
      "predictions_path": "leaderboard/predictions/conformance/synthetic-toy/empty-ref.json",
      "code": "https://github.com/erphq/pm-bench/blob/main/pm_bench/cli.py",
      "paper": null,
      "score": {
        "fitness": 0.0,
        "precision": 0.0,
        "fscore": 0.0,
        "n_test_transitions": 9,
        "n_model_transitions": 0
      },
      "scored_at": "2026-04-30T00:00:00Z",
      "notes": "Empty model - zero transitions. Absolute floor: fitness 0, F 0. A submission has to do strictly better."
    }
  ]
}
@@ -0,0 +1,3 @@
{
  "transitions": []
}
15 changes: 14 additions & 1 deletion leaderboard/remaining-time/synthetic-toy.json
@@ -20,7 +20,20 @@
        "n": 158
      },
      "scored_at": "2026-04-30T00:00:00Z",
      "notes": "Constant prediction = mean remaining-time observed on training prefixes. The dumbest model that still respects the train/test split — the floor any temporal model must clear."
      "notes": "Constant prediction = mean remaining-time observed on training prefixes. The dumbest model that still respects the train/test split - the floor any temporal model must clear."
    },
    {
      "model": "zero-ref",
      "version": "0.1.0",
      "predictions_path": "leaderboard/predictions/remaining-time/synthetic-toy/zero-ref.csv.gz",
      "code": "https://github.com/erphq/pm-bench/blob/main/pm_bench/baselines/zero_time.py",
      "paper": null,
      "score": {
        "mae_days": 2.7410337552742616,
        "n": 158
      },
      "scored_at": "2026-04-30T00:00:00Z",
      "notes": "Predicts 0 days remaining for every prefix. The absolute MAE floor - beats nothing, exists to anchor the lower bound."
    }
  ]
}
2 changes: 2 additions & 0 deletions pm_bench/baselines/__init__.py
@@ -26,6 +26,7 @@
    write_outcome_predictions_csv,
)
from pm_bench.baselines.uniform import UniformBaseline, fit_uniform, predict_uniform
from pm_bench.baselines.zero_time import predict_zero_time

__all__ = [
    "MarkovBaseline",
@@ -44,6 +45,7 @@
    "predict_mean_wait",
    "predict_prior_outcome",
    "predict_uniform",
    "predict_zero_time",
    "read_outcome_predictions_csv",
    "read_time_predictions_csv",
    "write_outcome_predictions_csv",
4 changes: 2 additions & 2 deletions pm_bench/baselines/uniform.py
@@ -7,7 +7,7 @@
real model has to clear that floor or it isn't using the trace.

This baseline exists alongside `markov` to demonstrate the leaderboard
scales to multiple entries on the same (task, dataset) pair and to
scales to multiple entries on the same (task, dataset) pair - and to
give an honest "did the trace help at all?" comparison number.
"""
from __future__ import annotations
@@ -40,7 +40,7 @@ def fit_uniform(events: Iterable[Event], train_case_ids: Iterable[CaseId]) -> Un
def predict_uniform(
    model: UniformBaseline, prefixes: Iterable[Prefix]
) -> list[Prediction]:
    """Same ranked list for every target the entire activity vocab."""
    """Same ranked list for every target - the entire activity vocab."""
    ranking = model.activities_sorted
    return [
        Prediction(case_id=p.case_id, prefix_idx=p.prefix_idx, ranked=ranking)
21 changes: 21 additions & 0 deletions pm_bench/baselines/zero_time.py
@@ -0,0 +1,21 @@
"""Zero-remaining-time floor baseline.

Predicts 0 days of remaining time for every prefix. The absolute MAE
floor - any model that ties this isn't using time information at all.
Sits below `mean-time` on the leaderboard and gives the dumbest
possible reference number.
"""
from __future__ import annotations

from collections.abc import Iterable

from pm_bench.baselines.mean_time import TimePrediction
from pm_bench.prefixes import TimeTarget


def predict_zero_time(targets: Iterable[TimeTarget]) -> list[TimePrediction]:
    """Constant 0.0 for every prefix."""
    return [
        TimePrediction(case_id=t.case_id, prefix_idx=t.prefix_idx, predicted_days=0.0)
        for t in targets
    ]
32 changes: 19 additions & 13 deletions pm_bench/cli.py
@@ -22,6 +22,7 @@
    write_outcome_predictions_csv,
)
from pm_bench.baselines.uniform import fit_uniform, predict_uniform
from pm_bench.baselines.zero_time import predict_zero_time
from pm_bench.bottleneck import (
    extract_bottleneck_targets,
    read_bottleneck_predictions_csv,
@@ -357,11 +358,11 @@ def prefixes(name: str, split_path: str, out_path: str, partition: str, task: str)
)
@click.option(
    "--baseline",
    type=click.Choice(["markov", "uniform", "mean", "prior", "mean-wait"]),
    type=click.Choice(["markov", "uniform", "mean", "zero", "prior", "mean-wait"]),
    default="markov",
    show_default=True,
    help=(
        "markov / uniform → next-event; mean → remaining-time; "
        "markov / uniform → next-event; mean / zero → remaining-time; "
        "prior → outcome; mean-wait → bottleneck."
    ),
)
@@ -398,11 +399,14 @@ def predict(
        preds = predict_uniform(uni_model, targets)
        n = write_predictions_csv(preds, out_path)
    elif task == "remaining-time":
        if baseline != "mean":
        if baseline not in ("mean", "zero"):
            raise click.UsageError(f"baseline {baseline!r} doesn't apply to remaining-time")
        time_model = fit_mean_time(events, split_data["train"])
        time_targets = read_time_targets_csv(prefixes_path)
        time_preds = predict_mean_time(time_model, time_targets)
        if baseline == "mean":
            time_model = fit_mean_time(events, split_data["train"])
            time_preds = predict_mean_time(time_model, time_targets)
        else:
            time_preds = predict_zero_time(time_targets)
        n = write_time_predictions_csv(time_preds, out_path)
    elif task == "outcome":
        if baseline != "prior":
@@ -445,25 +449,27 @@
)
@click.option(
    "--baseline",
    type=click.Choice(["dfg"]),
    type=click.Choice(["dfg", "empty"]),
    default="dfg",
    show_default=True,
    help="dfg → directly-follows graph from training cases.",
    help="dfg → DFG from training cases; empty → no transitions (absolute floor).",
)
def discover(name: str, split_path: str, out_path: str, baseline: str) -> None:
    """Discover a process model from training cases.

    The submission for the conformance task is a model JSON. Today only
    the DFG baseline ships in-tree - other discoverers (alpha, inductive)
    can be added behind a `[discovery]` extra without changing the
    submission format.
    The submission for the conformance task is a model JSON. `dfg`
    extracts the directly-follows graph; `empty` submits no transitions
    (the absolute conformance floor - fitness 0, F-score 0).
    """
    events = _load_events(name)
    with open(split_path) as f:
        split_data = json.load(f)
    if baseline != "dfg":
    if baseline == "dfg":
        dfg = extract_dfg(events, split_data["train"])
    elif baseline == "empty":
        dfg = set()
    else:
        raise click.UsageError(f"unknown discoverer: {baseline}")
    dfg = extract_dfg(events, split_data["train"])
    n = write_model_json(dfg, out_path)
    click.echo(f"wrote model with {n} transitions to {out_path} (baseline={baseline})")

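The two commands above enforce a fixed baseline→task pairing; summarized here as a table (a hypothetical restructuring for illustration - the shipped CLI uses the if/elif chains shown in the diff):

```python
# Which --baseline values `predict` accepts per task, per the help text.
BASELINES_BY_TASK: dict[str, set[str]] = {
    "next-event": {"markov", "uniform"},
    "remaining-time": {"mean", "zero"},
    "outcome": {"prior"},
    "bottleneck": {"mean-wait"},
}
# `discover` is separate: {"dfg", "empty"} for the conformance task.

def check_baseline(task: str, baseline: str) -> None:
    if baseline not in BASELINES_BY_TASK.get(task, set()):
        raise ValueError(f"baseline {baseline!r} doesn't apply to {task}")
```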
2 changes: 1 addition & 1 deletion pm_bench/stats.py
@@ -1,6 +1,6 @@
"""Quick summary stats for an event log.

Useful when inspecting a new dataset n_cases, n_events, distinct
Useful when inspecting a new dataset - n_cases, n_events, distinct
activity count, time span, top-N most-frequent activities and
transitions, mean / median case length. Pure CPython; runs in the
same process as the rest of pm-bench so it works on `synthetic-toy`,
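A standalone sketch of the summary the docstring describes (hypothetical - simplified event shape, not `stats.py`'s actual implementation):

```python
from collections import Counter

def quick_stats(events: list[tuple[str, str]]) -> dict:
    # events as (case_id, activity) pairs - a stand-in for pm-bench's
    # real Event records.
    cases = Counter(cid for cid, _ in events)
    activities = Counter(act for _, act in events)
    lengths = sorted(cases.values())
    return {
        "n_cases": len(cases),
        "n_events": len(events),
        "n_activities": len(activities),
        "top_activities": activities.most_common(5),
        "mean_case_len": len(events) / len(cases),
        # upper median for even counts - fine for a quick summary
        "median_case_len": lengths[len(lengths) // 2],
    }
```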