8 changes: 6 additions & 2 deletions GOALS.md
@@ -6,8 +6,12 @@ Be the default benchmark for new process-mining methods. Within 18 months,

## v0 success criteria
- 7 datasets fetchable + hash-verified
- 5 tasks with fixed scoring scripts (next-event ✅; remaining-time, outcome,
  conformance, bottleneck pending)
- `gnn` runs end-to-end as the reference baseline (Markov reference ✅;
  `gnn` integration pending v0.1 dataset machinery)
- End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes →
  predict → score, covered by `tests/test_e2e.py`

## v1 success criteria
- ≥3 external groups submit to the leaderboard
25 changes: 19 additions & 6 deletions README.md
@@ -105,14 +105,24 @@ and cache locally. Datasets carry their original licenses (linked in
```bash
pip install pm-bench

pm-bench list                                  # available datasets
pm-bench split synthetic-toy > split.json      # train/val/test case ids
pm-bench prefixes synthetic-toy \
    --split split.json --out prefixes.csv      # prediction targets
pm-bench predict synthetic-toy \
    --split split.json --prefixes prefixes.csv \
    --out predictions.csv --baseline markov    # reference baseline
pm-bench score predictions.csv \
    --prefixes prefixes.csv --task next-event  # top-1 / top-3
```

The full loop (`split → prefixes → predict → score`) runs end-to-end on
`synthetic-toy` today; it's covered by `tests/test_e2e.py` and locks
the file formats the leaderboard depends on. BPI / Sepsis / Helpdesk
will use the same commands once v0.1's fetch+cache machinery lands —
4TU's interactive TOS makes the download itself a one-time manual
step, but everything downstream is automated.
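
The formats themselves are plain CSV. A minimal round-trip sketch in Python
(assumptions: `Prediction` compares by value, and the read/write helpers take
`(path)` and `(path, rows)` respectively):

```python
from pm_bench import Prediction, read_predictions_csv, write_predictions_csv

# One row per (case, prefix): ranked candidate next activities, best first.
rows = [Prediction(case_id="c1", prefix_idx=1, ranked=("approve", "reject"))]

write_predictions_csv("predictions.csv", rows)  # assumed (path, rows) order
assert read_predictions_csv("predictions.csv") == rows  # assumes value equality
```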

The full pipeline:

```mermaid
@@ -191,11 +201,14 @@ honesty. The point of the benchmark is to make the comparison real.
## ✦ Roadmap

- [x] v0.0 — scaffold, dataset registry, split design
- [x] v0.0.1 — end-to-end loop on `synthetic-toy`: split → prefixes →
  predict (Markov) → score, with a smoke test that locks the file
  formats
- [ ] v0.1 — fetch + cache + hash for all 7 datasets
- [ ] v0.2 — splits: next-event, remaining-time
- [ ] v0.3 — scoring scripts for all 5 tasks
- [ ] v0.4 — leaderboard CI + landing page
- [ ] v0.5 — baselines: `gnn`, transformer, LSTM, Markov ✅ (Markov shipped)
- [ ] v1.0 — first external submissions; cited in ≥1 paper

## ✦ Topics
64 changes: 64 additions & 0 deletions STATUS.md
@@ -0,0 +1,64 @@
# Status

_Last updated: 2026-04-30._

## Where we are

Pre-v0. The end-to-end loop runs on the bundled `synthetic-toy`
dataset; the seven public datasets are still pending v0.1's fetch +
hash machinery.

A submission today looks like:

```bash
pm-bench split synthetic-toy > split.json
pm-bench prefixes synthetic-toy --split split.json --out prefixes.csv
pm-bench predict synthetic-toy --split split.json \
    --prefixes prefixes.csv --out predictions.csv --baseline markov
pm-bench score predictions.csv --prefixes prefixes.csv --task next-event
# → top1 0.976, top3 1.000 (Markov on synthetic-toy)
```

That sequence is the contract — it's what `tests/test_e2e.py` runs in
CI, and it's what the leaderboard CI will run once datasets are pinned.
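
The test's rough shape, as a sketch only (assumption: the click group is
importable as `pm_bench.cli.main`; the real thing lives in `tests/test_e2e.py`):

```python
from click.testing import CliRunner

from pm_bench.cli import main  # assumed entry-point module and name

runner = CliRunner()
with runner.isolated_filesystem():
    # First step of the loop; the real test chains all four commands
    # and asserts each step's output feeds the next.
    result = runner.invoke(main, ["split", "synthetic-toy"])
    assert result.exit_code == 0, result.output
```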

## Recently shipped

- **End-to-end loop on synthetic-toy** (`end-to-end-loop` branch).
  - `pm_bench/prefixes.py` — extract prediction targets from a split,
    write/read CSV. Skips length-1 cases.
  - `pm_bench/predictions.py` — predictions CSV format
    (`case_id,prefix_idx,predictions`).
  - `pm_bench/baselines/markov.py` — first-order Markov reference
    baseline. Trained on the train partition only; falls back to
    unigram for unseen last activities (see the sketch after this list).
  - CLI gained `prefixes`, `predict`, `score`. The full
    `split → prefixes → predict → score` loop now matches what the
    README advertises.
  - `tests/test_e2e.py` covers the loop end-to-end via the click
    runner; format changes will trip it.
- **v0.0** (initial release): scaffold, registry, case-chrono split,
  next-event scoring function, CLI `list` / `info` / `split`.
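
To make the baseline concrete, a standalone sketch (assumptions: plain tuples
satisfy `Event`, and `Prefix` is a dataclass constructed by keyword):

```python
from pm_bench.baselines.markov import fit_markov, predict_markov
from pm_bench.prefixes import Prefix

# Toy log: (case_id, activity, timestamp) rows from two training cases.
events = [
    ("c1", "register", 1), ("c1", "review", 2), ("c1", "approve", 3),
    ("c2", "register", 1), ("c2", "review", 2), ("c2", "reject", 3),
]
model = fit_markov(events, train_case_ids=["c1", "c2"])

# What follows "review"? Both successors were seen once; most_common()
# keeps first-seen order on ties, so "approve" ranks first.
prefixes = [Prefix(case_id="c3", prefix_idx=2, prefix=("register", "review"))]
print(predict_markov(model, prefixes)[0].ranked)  # ('approve', 'reject')
```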

## Next up

- **v0.1 — dataset fetch + hash** for the seven public logs. The 4TU
portal needs interactive TOS acceptance per dataset, so the fetch
itself is a one-time manual step; the rest (cache → verify hash →
parse XES → run the same loop) is automated. This is the work that
unblocks every downstream milestone.
- **`gnn` as the second reference baseline** once v0.1 lands. `gnn`'s
v0.5 milestone is symmetrical with this — it's been waiting for a
pinned dataset registry, which `pm-bench` is meant to provide.
- Additional tasks beyond next-event (remaining-time, outcome,
conformance, bottleneck). The split + prefixes machinery is shared;
scoring is the per-task piece.

## Known gaps

- No `pm-bench fetch` yet. README still hints at it; the install &
use section now shows the loop that actually works (synthetic-toy
only) so the doc and the CLI line up.
- `predict` currently only knows `markov`. The `--baseline` flag is a
  click choice, so adding a second is a one-liner (sketched just
  below); the second one worth adding is `gnn`, which depends on v0.1.
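
What that one-liner looks like in practice, as a hypothetical sketch rather
than the real CLI wiring:

```python
import click


def run_markov() -> None:
    click.echo("markov predictions written")  # stand-in for the real path


# A second baseline is one more dict entry plus its runner function.
BASELINES = {"markov": run_markov}


@click.command()
@click.option("--baseline", type=click.Choice(sorted(BASELINES)), default="markov")
def predict(baseline: str) -> None:
    BASELINES[baseline]()
```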
9 changes: 9 additions & 0 deletions pm_bench/__init__.py
@@ -3,6 +3,8 @@

__version__ = "0.1.0"

from pm_bench.predictions import Prediction, read_predictions_csv, write_predictions_csv
from pm_bench.prefixes import Prefix, extract_prefixes, read_prefixes_csv, write_prefixes_csv
from pm_bench.registry import Dataset, get_dataset, load_registry
from pm_bench.score import NextEventScore, score_next_event
from pm_bench.split import Event, Split, case_chrono_split
@@ -11,9 +13,16 @@
"Dataset",
"Event",
"NextEventScore",
"Prediction",
"Prefix",
"Split",
"case_chrono_split",
"extract_prefixes",
"get_dataset",
"load_registry",
"read_predictions_csv",
"read_prefixes_csv",
"score_next_event",
"write_predictions_csv",
"write_prefixes_csv",
]
12 changes: 12 additions & 0 deletions pm_bench/baselines/__init__.py
@@ -0,0 +1,12 @@
"""Reference baselines that ship with pm-bench.

Baselines exist to anchor the leaderboard: a submission that loses to
the markov reference is an immediate red flag. They're deliberately
simple — no torch, no scikit-learn, no GPUs, just CPython — so anyone
can read the code and trust the number.
"""
from __future__ import annotations

from pm_bench.baselines.markov import MarkovBaseline, predict_markov

__all__ = ["MarkovBaseline", "predict_markov"]
67 changes: 67 additions & 0 deletions pm_bench/baselines/markov.py
@@ -0,0 +1,67 @@
"""First-order Markov reference baseline.

Counts (current_activity → next_activity) transitions on training cases
only, then ranks candidates by frequency. Falls back to the global
unigram distribution when a prefix ends in an activity unseen during
training. No smoothing — the leaderboard reports raw frequencies.

Why first-order: it's the dumbest model that has any business being on
the leaderboard, and it sets the floor any "real" sequence model has to
clear. A transformer that ties or loses to first-order Markov is
broken or overfit.
"""
from __future__ import annotations

from collections import Counter, defaultdict
from collections.abc import Iterable
from dataclasses import dataclass

from pm_bench.predictions import Prediction
from pm_bench.prefixes import Prefix
from pm_bench.split import Activity, Event


@dataclass
class MarkovBaseline:
    transitions: dict[Activity, Counter[Activity]]
    unigram: Counter[Activity]

    def rank(self, last_activity: Activity | None) -> list[Activity]:
        """Return candidate next activities, best first."""
        if last_activity is not None and last_activity in self.transitions:
            counts = self.transitions[last_activity]
            if counts:
                return [a for a, _ in counts.most_common()]
        return [a for a, _ in self.unigram.most_common()]


def fit_markov(events: Iterable[Event], train_case_ids: Iterable[str]) -> MarkovBaseline:
    """Fit a first-order Markov model on the training cases only."""
    keep = set(train_case_ids)
    # Group (activity, timestamp) rows by case id, training cases only.
    by_case: dict[str, list[tuple[Activity, object]]] = {}
    for case_id, activity, ts in events:
        if case_id not in keep:
            continue
        by_case.setdefault(case_id, []).append((activity, ts))

    transitions: dict[Activity, Counter[Activity]] = defaultdict(Counter)
    unigram: Counter[Activity] = Counter()
    for rows in by_case.values():
        rows.sort(key=lambda r: r[1])  # chronological within the case
        activities = [a for a, _ in rows]
        for a in activities:
            unigram[a] += 1
        for prev, nxt in zip(activities, activities[1:], strict=False):
            transitions[prev][nxt] += 1

    return MarkovBaseline(transitions=dict(transitions), unigram=unigram)


def predict_markov(model: MarkovBaseline, prefixes: Iterable[Prefix]) -> list[Prediction]:
    """Score each prefix with the Markov model."""
    out: list[Prediction] = []
    for p in prefixes:
        last = p.prefix[-1] if p.prefix else None
        ranked = model.rank(last)
        out.append(Prediction(case_id=p.case_id, prefix_idx=p.prefix_idx, ranked=tuple(ranked)))
    return out