8 changes: 6 additions & 2 deletions GOALS.md
@@ -6,8 +6,12 @@ Be the default benchmark for new process-mining methods. Within 18 months,

## v0 success criteria
- 7 datasets fetchable + hash-verified
- 5 tasks with fixed scoring scripts (next-event ✅; remaining-time, outcome,
  conformance, bottleneck pending)
- `gnn` runs end-to-end as the reference baseline (Markov reference ✅;
  `gnn` integration pending v0.1 dataset machinery)
- End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes →
  predict → score, covered by `tests/test_e2e.py`

## v1 success criteria
- ≥3 external groups submit to the leaderboard
25 changes: 19 additions & 6 deletions README.md
@@ -105,14 +105,24 @@ and cache locally. Datasets carry their original licenses (linked in
```bash
pip install pm-bench

pm-bench list                                  # available datasets
pm-bench split synthetic-toy > split.json      # train/val/test case ids
pm-bench prefixes synthetic-toy \
    --split split.json --out prefixes.csv      # prediction targets
pm-bench predict synthetic-toy \
    --split split.json --prefixes prefixes.csv \
    --out predictions.csv --baseline markov    # reference baseline
pm-bench score predictions.csv \
    --prefixes prefixes.csv --task next-event  # top-1 / top-3
```

The full loop (`split → prefixes → predict → score`) runs end-to-end on
`synthetic-toy` today; it's covered by `tests/test_e2e.py` and locks
the file formats the leaderboard depends on. BPI / Sepsis / Helpdesk
will use the same commands once v0.1's fetch+cache machinery lands —
4TU's interactive TOS makes the download itself a one-time manual
step, but everything downstream is automated.
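
The formats themselves are plain CSV. A minimal round-trip sketch in Python
(assumptions: `Prediction` compares by value, and the read/write helpers take
`(path)` and `(path, rows)` respectively):

```python
from pm_bench import Prediction, read_predictions_csv, write_predictions_csv

# One row per (case, prefix): ranked candidate next activities, best first.
rows = [Prediction(case_id="c1", prefix_idx=1, ranked=("approve", "reject"))]

write_predictions_csv("predictions.csv", rows)  # assumed (path, rows) order
assert read_predictions_csv("predictions.csv") == rows  # assumes value equality
```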

The full pipeline:

```mermaid
@@ -191,11 +201,14 @@ honesty. The point of the benchmark is to make the comparison real.
## ✦ Roadmap

- [x] v0.0 — scaffold, dataset registry, split design
- [x] v0.0.1 — end-to-end loop on `synthetic-toy`: split → prefixes →
  predict (Markov) → score, with a smoke test that locks the file
  formats
- [ ] v0.1 — fetch + cache + hash for all 7 datasets
- [ ] v0.2 — splits: next-event, remaining-time
- [ ] v0.3 — scoring scripts for all 5 tasks
- [ ] v0.4 — leaderboard CI + landing page
- [ ] v0.5 — baselines: `gnn`, transformer, LSTM, Markov ✅ (Markov shipped)
- [ ] v1.0 — first external submissions; cited in ≥1 paper

## ✦ Topics
64 changes: 64 additions & 0 deletions STATUS.md
@@ -0,0 +1,64 @@
# Status

_Last updated: 2026-04-30._

## Where we are

Pre-v0. The end-to-end loop runs on the bundled `synthetic-toy`
dataset; the seven public datasets are still pending v0.1's fetch +
hash machinery.

A submission today looks like:

```bash
pm-bench split synthetic-toy > split.json
pm-bench prefixes synthetic-toy --split split.json --out prefixes.csv
pm-bench predict synthetic-toy --split split.json \
    --prefixes prefixes.csv --out predictions.csv --baseline markov
pm-bench score predictions.csv --prefixes prefixes.csv --task next-event
# → top1 0.976, top3 1.000 (Markov on synthetic-toy)
```

That sequence is the contract — it's what `tests/test_e2e.py` runs in
CI, and it's what the leaderboard CI will run once datasets are pinned.
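
The test's rough shape, as a sketch only (assumption: the click group is
importable as `pm_bench.cli.main`; the real thing lives in `tests/test_e2e.py`):

```python
from click.testing import CliRunner

from pm_bench.cli import main  # assumed entry-point module and name

runner = CliRunner()
with runner.isolated_filesystem():
    # First step of the loop; the real test chains all four commands
    # and asserts each step's output feeds the next.
    result = runner.invoke(main, ["split", "synthetic-toy"])
    assert result.exit_code == 0, result.output
```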

## Recently shipped

- **End-to-end loop on synthetic-toy** (`end-to-end-loop` branch).
  - `pm_bench/prefixes.py` — extract prediction targets from a split,
    write/read CSV. Skips length-1 cases.
  - `pm_bench/predictions.py` — predictions CSV format
    (`case_id,prefix_idx,predictions`).
  - `pm_bench/baselines/markov.py` — first-order Markov reference
    baseline. Trained on the train partition only; falls back to
    unigram for unseen last activities (see the sketch after this list).
  - CLI gained `prefixes`, `predict`, `score`. The full
    `split → prefixes → predict → score` loop now matches what the
    README advertises.
  - `tests/test_e2e.py` covers the loop end-to-end via the click
    runner; format changes will trip it.
- **v0.0** (initial release): scaffold, registry, case-chrono split,
  next-event scoring function, CLI `list` / `info` / `split`.
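
To make the baseline concrete, a standalone sketch (assumptions: plain tuples
satisfy `Event`, and `Prefix` is a dataclass constructed by keyword):

```python
from pm_bench.baselines.markov import fit_markov, predict_markov
from pm_bench.prefixes import Prefix

# Toy log: (case_id, activity, timestamp) rows from two training cases.
events = [
    ("c1", "register", 1), ("c1", "review", 2), ("c1", "approve", 3),
    ("c2", "register", 1), ("c2", "review", 2), ("c2", "reject", 3),
]
model = fit_markov(events, train_case_ids=["c1", "c2"])

# What follows "review"? Both successors were seen once; most_common()
# keeps first-seen order on ties, so "approve" ranks first.
prefixes = [Prefix(case_id="c3", prefix_idx=2, prefix=("register", "review"))]
print(predict_markov(model, prefixes)[0].ranked)  # ('approve', 'reject')
```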

## Next up

- **v0.1 — dataset fetch + hash** for the seven public logs. The 4TU
portal needs interactive TOS acceptance per dataset, so the fetch
itself is a one-time manual step; the rest (cache → verify hash →
parse XES → run the same loop) is automated. This is the work that
unblocks every downstream milestone.
- **`gnn` as the second reference baseline** once v0.1 lands. `gnn`'s
v0.5 milestone is symmetrical with this — it's been waiting for a
pinned dataset registry, which `pm-bench` is meant to provide.
- Additional tasks beyond next-event (remaining-time, outcome,
conformance, bottleneck). The split + prefixes machinery is shared;
scoring is the per-task piece.

## Known gaps

- No `pm-bench fetch` yet. README still hints at it; the install &
use section now shows the loop that actually works (synthetic-toy
only) so the doc and the CLI line up.
- `predict` currently only knows `markov`. The `--baseline` flag is a
  click choice, so adding a second is a one-liner (sketched just
  below); the second one worth adding is `gnn`, which depends on v0.1.
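
What that one-liner looks like in practice, as a hypothetical sketch rather
than the real CLI wiring:

```python
import click


def run_markov() -> None:
    click.echo("markov predictions written")  # stand-in for the real path


# A second baseline is one more dict entry plus its runner function.
BASELINES = {"markov": run_markov}


@click.command()
@click.option("--baseline", type=click.Choice(sorted(BASELINES)), default="markov")
def predict(baseline: str) -> None:
    BASELINES[baseline]()
```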
9 changes: 9 additions & 0 deletions pm_bench/__init__.py
@@ -3,6 +3,8 @@

__version__ = "0.1.0"

from pm_bench.predictions import Prediction, read_predictions_csv, write_predictions_csv
from pm_bench.prefixes import Prefix, extract_prefixes, read_prefixes_csv, write_prefixes_csv
from pm_bench.registry import Dataset, get_dataset, load_registry
from pm_bench.score import NextEventScore, score_next_event
from pm_bench.split import Event, Split, case_chrono_split
@@ -11,9 +13,16 @@
"Dataset",
"Event",
"NextEventScore",
"Prediction",
"Prefix",
"Split",
"case_chrono_split",
"extract_prefixes",
"get_dataset",
"load_registry",
"read_predictions_csv",
"read_prefixes_csv",
"score_next_event",
"write_predictions_csv",
"write_prefixes_csv",
]
12 changes: 12 additions & 0 deletions pm_bench/baselines/__init__.py
@@ -0,0 +1,12 @@
"""Reference baselines that ship with pm-bench.

Baselines exist to anchor the leaderboard: a submission that loses to
the markov reference is an immediate red flag. They're deliberately
simple — no torch, no scikit-learn, no GPUs, just CPython — so anyone
can read the code and trust the number.
"""
from __future__ import annotations

from pm_bench.baselines.markov import MarkovBaseline, predict_markov

__all__ = ["MarkovBaseline", "predict_markov"]
67 changes: 67 additions & 0 deletions pm_bench/baselines/markov.py
@@ -0,0 +1,67 @@
"""First-order Markov reference baseline.

Counts (current_activity → next_activity) transitions on training cases
only, then ranks candidates by frequency. Falls back to the global
unigram distribution when a prefix ends in an activity unseen during
training. No smoothing — the leaderboard reports raw frequencies.

Why first-order: it's the dumbest model that has any business being on
the leaderboard, and it sets the floor any "real" sequence model has to
clear. A transformer that ties or loses to first-order Markov is
broken or overfit.
"""
from __future__ import annotations

from collections import Counter, defaultdict
from collections.abc import Iterable
from dataclasses import dataclass

from pm_bench.predictions import Prediction
from pm_bench.prefixes import Prefix
from pm_bench.split import Activity, Event


@dataclass
class MarkovBaseline:
    transitions: dict[Activity, Counter[Activity]]
    unigram: Counter[Activity]

    def rank(self, last_activity: Activity | None) -> list[Activity]:
        """Return candidate next activities, best first."""
        if last_activity is not None and last_activity in self.transitions:
            counts = self.transitions[last_activity]
            if counts:
                return [a for a, _ in counts.most_common()]
        return [a for a, _ in self.unigram.most_common()]


def fit_markov(events: Iterable[Event], train_case_ids: Iterable[str]) -> MarkovBaseline:
    """Fit a first-order Markov model on the training cases only."""
    keep = set(train_case_ids)
    # Group (activity, timestamp) rows by case id, training cases only.
    by_case: dict[str, list[tuple[Activity, object]]] = {}
    for case_id, activity, ts in events:
        if case_id not in keep:
            continue
        by_case.setdefault(case_id, []).append((activity, ts))

    transitions: dict[Activity, Counter[Activity]] = defaultdict(Counter)
    unigram: Counter[Activity] = Counter()
    for rows in by_case.values():
        rows.sort(key=lambda r: r[1])  # chronological within the case
        activities = [a for a, _ in rows]
        for a in activities:
            unigram[a] += 1
        for prev, nxt in zip(activities, activities[1:], strict=False):
            transitions[prev][nxt] += 1

    return MarkovBaseline(transitions=dict(transitions), unigram=unigram)


def predict_markov(model: MarkovBaseline, prefixes: Iterable[Prefix]) -> list[Prediction]:
    """Score each prefix with the Markov model."""
    out: list[Prediction] = []
    for p in prefixes:
        last = p.prefix[-1] if p.prefix else None
        ranked = model.rank(last)
        out.append(Prediction(case_id=p.case_id, prefix_idx=p.prefix_idx, ranked=tuple(ranked)))
    return out