erphq · protosphinx · May 1, 2026
diff --git a/GOALS.md b/GOALS.md
@@ -5,11 +5,13 @@ Be the default benchmark for new process-mining methods. Within 18 months,
 ≥10 external papers report `pm-bench` numbers in their abstract.
 
 ## v0 success criteria
-- 7 datasets fetchable + hash-verified
+- 7 datasets fetchable + hash-verified — fetch/hash machinery shipped
+  (`pm-bench fetch <name> [--pin]`); per-dataset hash pins pending
+  the one-time TOS-gated downloads
 - 5 tasks with fixed scoring scripts (next-event ✅; remaining-time, outcome,
   conformance, bottleneck pending)
 - `gnn` runs end-to-end as the reference baseline (Markov reference ✅;
-  `gnn` integration pending v0.1 dataset machinery)
+  `gnn` integration pending the first pinned dataset)
 - End-to-end loop runs on `synthetic-toy` ✅ — split → prefixes →
   predict → score, covered by `tests/test_e2e.py`
 

diff --git a/README.md b/README.md
@@ -118,10 +118,22 @@ pm-bench score predictions.csv \
 
 The full loop (`split → prefixes → predict → score`) runs end-to-end on
 `synthetic-toy` today; it's covered by `tests/test_e2e.py` and locks
-the file formats the leaderboard depends on. BPI / Sepsis / Helpdesk
-will use the same commands once v0.1's fetch+cache machinery lands —
-4TU's interactive TOS makes the download itself a one-time manual
-step, but everything downstream is automated.
+the file formats the leaderboard depends on.
+
+For the public datasets, the fetch + hash machinery is in place:
+
+```bash
+pm-bench fetch bpi2020                    # auto-downloads if URL is set
+pm-bench fetch bpi2020 --pin              # after manual TOS-gated download,
+                                          # emits a registry.yml sha256 patch
+```
+
+`pm-bench fetch` resolves a cache directory (`$PM_BENCH_CACHE`, else
+`~/.cache/pm-bench/`), verifies the registry sha256 if pinned, and,
+for TOS-gated 4TU / Mendeley datasets, prints the landing URL to visit
+and the cache path to save the archive to. The per-dataset hash pins are the
+last manual step before BPI / Sepsis / Helpdesk run through the same
+loop as `synthetic-toy`.
 
 The full pipeline:
 
@@ -204,7 +216,10 @@ honesty. The point of the benchmark is to make the comparison real.
 - [x] v0.0.1 — end-to-end loop on `synthetic-toy`: split → prefixes →
       predict (Markov) → score, with a smoke test that locks the file
       formats
-- [ ] v0.1 — fetch + cache + hash for all 7 datasets
+- [🟡] v0.1 — fetch + cache + hash for all 7 datasets. Machinery
+      shipped (`pm-bench fetch <name> [--pin]`, sha256 verification,
+      `$PM_BENCH_CACHE` resolution); per-dataset hash-pinning PRs
+      pending the one-time TOS-gated downloads from 4TU and Mendeley.
 - [ ] v0.2 — splits: next-event, remaining-time
 - [ ] v0.3 — scoring scripts for all 5 tasks
 - [ ] v0.4 — leaderboard CI + landing page

diff --git a/STATUS.md b/STATUS.md
@@ -4,61 +4,106 @@ _Last updated: 2026-04-30._
 
 ## Where we are
 
-Pre-v0. The end-to-end loop runs on the bundled `synthetic-toy`
-dataset; the seven public datasets are still pending v0.1's fetch +
-hash machinery.
+Pre-v0. Two pieces shipped on top of v0.0:
 
-A submission today looks like:
+1. The end-to-end loop runs on the bundled `synthetic-toy` dataset
+   (split → prefixes → predict → score; Markov reference baseline
+   gets top-1 0.976, top-3 1.000).
+2. The fetch + hash + cache machinery is in place. `pm-bench fetch
+   <name>` resolves a dataset to a local path, verifies the registry
+   sha256, and prints precise instructions for the TOS-gated download
+   step on 4TU / Mendeley. `--pin` emits the `registry.yml` patch a
+   contributor pastes into a PR after the manual download.
+
+What's still left in v0.1 is mostly per-dataset operational work: do
+the one-time download, run `--pin`, open seven small PRs to pin the
+hashes, then wire the XES parser to `_load_events` so `split`/
+`prefixes`/`predict` work on real BPI data. None of it requires
+further code design.
+
+A submission today on the bundled toy:
 
 ```bash
 pm-bench split synthetic-toy > split.json
 pm-bench prefixes synthetic-toy --split split.json --out prefixes.csv
 pm-bench predict synthetic-toy --split split.json \
   --prefixes prefixes.csv --out predictions.csv --baseline markov
 pm-bench score predictions.csv --prefixes prefixes.csv --task next-event
-# → top1 0.976, top3 1.000 (Markov on synthetic-toy)
+# → top1 0.976, top3 1.000
 ```
 
-That sequence is the contract — it's what `tests/test_e2e.py` runs in
-CI, and it's what the leaderboard CI will run once datasets are pinned.
+The fetch flow on a TOS-gated dataset:
+
+```bash
+pm-bench fetch bpi2020
+# → bpi2020: no download_url (TOS-gated). Visit https://data.4tu.nl/...,
+#   accept the terms, and save the archive to ~/.cache/pm-bench/bpi2020.xes.gz.
+#   Then re-run `pm-bench fetch bpi2020 --pin` to compute the sha256.
+
+# (manual download + place in cache dir)
+
+pm-bench fetch bpi2020 --pin
+# → bpi2020: cached at ~/.cache/pm-bench/bpi2020.xes.gz (unpinned)
+#   sha256: <hex>
+#
+#   # paste under the matching dataset entry in pm_bench/registry.yml:
+#     - name: bpi2020
+#       sha256: <hex>
+```
 
 ## Recently shipped
 
-- **End-to-end loop on synthetic-toy** (`end-to-end-loop` branch).
+- **v0.1 fetch + hash machinery** (`dataset-fetch` branch).
+  - `pm_bench/cache.py` — cache root resolution
+    (`$PM_BENCH_CACHE` → `~/.cache/pm-bench/`), per-dataset path with
+    correct extension by format.
+  - `pm_bench/fetch.py` — `ensure_cached(dataset)` covers the four
+    cases: cached+match, cached+mismatch (loud failure),
+    cached+unpinned (returns actual hash), not-cached (auto-download
+    if URL set, otherwise raise `ManualFetchRequired`). Streams in
+    1 MiB chunks; atomic `.part`-then-rename writes; sha256 verified
+    against the registry pin.
+  - CLI `pm-bench fetch <name> [--pin]` — prints status, emits a
+    pasteable `registry.yml` patch when `--pin` is set.
+  - 13 new tests across `test_cache.py` and `test_fetch.py`. 37 total.
+- **End-to-end loop on synthetic-toy** (`end-to-end-loop` branch,
+  PR #2).
   - `pm_bench/prefixes.py` — extract prediction targets from a split,
     write/read CSV. Skips length-1 cases.
   - `pm_bench/predictions.py` — predictions CSV format
     (`case_id,prefix_idx,predictions`).
   - `pm_bench/baselines/markov.py` — first-order Markov reference
     baseline. Trained on the train partition only; falls back to
     unigram for unseen last-activities.
-  - CLI gained `prefixes`, `predict`, `score`. The full
-    `split → prefixes → predict → score` loop now matches what the
-    README advertises.
+  - CLI gained `prefixes`, `predict`, `score`.
   - `tests/test_e2e.py` covers the loop end-to-end via the click
     runner; format changes will trip it.
 - **v0.0** (initial release): scaffold, registry, case-chrono split,
   next-event scoring function, CLI `list` / `info` / `split`.
 
 ## Next up
 
-- **v0.1 — dataset fetch + hash** for the seven public logs. The 4TU
-  portal needs interactive TOS acceptance per dataset, so the fetch
-  itself is a one-time manual step; the rest (cache → verify hash →
-  parse XES → run the same loop) is automated. This is the work that
-  unblocks every downstream milestone.
-- **`gnn` as the second reference baseline** once v0.1 lands. `gnn`'s
-  v0.5 milestone is symmetrical with this — it's been waiting for a
-  pinned dataset registry, which `pm-bench` is meant to provide.
+- **One-time dataset pinning.** Per dataset (BPI 2012/2017/2018/2019/
+  2020 collection, Sepsis, Helpdesk): accept the TOS, save to the
+  cache, run `pm-bench fetch <name> --pin`, open the registry PR.
+  This is the gate on every downstream milestone.
+- **XES parser wiring.** `_load_events` currently rejects everything
+  except `synthetic-toy`. Once a dataset is pinned, swap that branch
+  for a pm4py-backed XES read (move pm4py to `[bpi]` extras so the
+  base install stays light).
+- **`gnn` as the second reference baseline.** `gnn`'s v0.5 milestone
+  has been waiting for a pinned dataset registry, which `pm-bench`
+  will provide as soon as any single dataset is pinned.
 - Additional tasks beyond next-event (remaining-time, outcome,
   conformance, bottleneck). The split + prefixes machinery is shared;
   scoring is the per-task piece.
 
 ## Known gaps
 
-- No `pm-bench fetch` yet. README still hints at it; the install &
-  use section now shows the loop that actually works (synthetic-toy
-  only) so the doc and the CLI line up.
-- `predict` currently only knows `markov`. The `--baseline` flag is a
-  click choice so adding a second is a one-liner, but the second one
-  worth adding is `gnn`, which depends on v0.1.
+- The base install does not pull `pm4py`, so XES parsing isn't wired
+  yet. Adding a `[bpi]` extra is the right move when we pin the
+  first dataset — keeps `pip install pm-bench` fast for users who
+  only need scoring.
+- No leaderboard CI yet (v0.4). The file formats are stable, so this
+  is "wire up a workflow that runs `pm-bench score`" — orthogonal to
+  the dataset work.
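
Reviewer sketch for the `ensure_cached` behaviour that STATUS.md lists under "Recently shipped": streaming sha256 in 1 MiB chunks, `.part`-then-rename writes, and the four cache states. This is not the code in `pm_bench/fetch.py`. The names `sha256_stream`, `atomic_write`, and `classify`, and the state labels they return, are hypothetical and only illustrate how the pieces might fit together using the standard library.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB, the chunk size named in STATUS.md


def sha256_stream(path: Path) -> str:
    """Hash a file chunk by chunk so large logs never sit fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()


def atomic_write(dest: Path, chunks) -> None:
    """Write to `<dest>.part`, then rename, so readers never see a partial file."""
    part = dest.with_name(dest.name + ".part")
    with part.open("wb") as f:
        for chunk in chunks:
            f.write(chunk)
    os.replace(part, dest)  # atomic when both paths are on the same filesystem


def classify(path: Path, pinned: str | None) -> str:
    """Name which of the four cases applies (labels are hypothetical)."""
    if not path.exists():
        return "not-cached"
    if pinned is None:
        return "cached+unpinned"
    return "cached+match" if sha256_stream(path) == pinned else "cached+mismatch"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.xes.gz"
    before = classify(target, None)
    atomic_write(target, [b"abc", b"def"])
    digest = sha256_stream(target)
    states = [
        before,
        classify(target, None),
        classify(target, digest),
        classify(target, "0" * 64),
    ]
    leftover_part = target.with_name(target.name + ".part").exists()
    print(states, leftover_part)
```

Writing to a sibling `.part` file keeps the rename on one filesystem, which is what makes `os.replace` atomic; an interrupted download then leaves only the `.part` file behind rather than a truncated archive at the final path.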
diff --git a/pm_bench/cache.py b/pm_bench/cache.py
@@ -0,0 +1,58 @@
+"""Local cache directory for downloaded event logs.
+
+Datasets land in `$PM_BENCH_CACHE` if set, else `~/.cache/pm-bench/`.
+We never write inside the install tree — the cache survives uninstalls
+and wheel rebuilds, and a single cache can be shared across virtualenvs.
+
+The on-disk layout is one file per dataset:
+
+    <cache_root>/<name>.<ext>
+
+where `<ext>` is `xes.gz` for XES logs (the canonical 4TU
+distribution form) and `csv` for CSV. The synthetic-toy dataset is
+generated on demand and never touches the cache.
+"""
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+from pm_bench.registry import Dataset
+
+
+def cache_root(override: str | None = None) -> Path:
+    """Return the cache root, creating it if needed.
+
+    Resolution order: explicit `override`, then `$PM_BENCH_CACHE`, then
+    `~/.cache/pm-bench/`. The directory is created on first call so
+    callers don't have to.
+    """
+    if override:
+        root = Path(override).expanduser()
+    elif env := os.environ.get("PM_BENCH_CACHE"):
+        root = Path(env).expanduser()
+    else:
+        root = Path.home() / ".cache" / "pm-bench"
+    root.mkdir(parents=True, exist_ok=True)
+    return root
+
+
+_EXT_BY_FORMAT = {
+    "xes": "xes.gz",
+    "csv": "csv",
+}
+
+
+def cache_path(dataset: Dataset, override_root: str | None = None) -> Path:
+    """Return the on-disk path where this dataset's archive lives.
+
+    The path is purely a function of `(cache_root, name, format)`; we
+    do not check whether the file actually exists. Callers should test
+    `path.exists()` before reading.
+    """
+    if dataset.format == "synthetic":
+        raise ValueError(f"{dataset.name} is generated on demand, not cached")
+    ext = _EXT_BY_FORMAT.get(dataset.format)
+    if ext is None:
+        raise ValueError(f"unknown dataset format: {dataset.format}")
+    return cache_root(override_root) / f"{dataset.name}.{ext}"
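
Reviewer sketch of the resolution order in `cache_root`, written as a pure function so the precedence can be checked without creating directories. `resolve_root` and its `env` / `home` parameters are hypothetical stand-ins for `os.environ` and `Path.home()`; only the precedence logic is meant to mirror the diff above.

```python
from __future__ import annotations

from pathlib import Path


def resolve_root(override: str | None, env: dict[str, str], home: Path) -> Path:
    """Mirror the precedence in cache_root without touching the filesystem."""
    if override:
        return Path(override).expanduser()
    if value := env.get("PM_BENCH_CACHE"):
        return Path(value).expanduser()
    return home / ".cache" / "pm-bench"


home = Path("/home/alice")  # illustrative home directory
explicit = resolve_root("/data/cache", {"PM_BENCH_CACHE": "/mnt/shared"}, home)
from_env = resolve_root(None, {"PM_BENCH_CACHE": "/mnt/shared"}, home)
fallback = resolve_root(None, {}, home)
# An empty variable is falsy, so it falls through to the default.
empty_env = resolve_root(None, {"PM_BENCH_CACHE": ""}, home)
print(explicit, from_env, fallback, empty_env)
```

One consequence of the truthiness checks is that an empty override or an empty `PM_BENCH_CACHE` is treated as unset and falls through to the next option.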
diff --git a/pm_bench/cli.py b/pm_bench/cli.py
@@ -8,6 +8,12 @@
 
 from pm_bench import _synth
 from pm_bench.baselines.markov import fit_markov, predict_markov
+from pm_bench.fetch import (
+    FetchError,
+    ManualFetchRequired,
+    ensure_cached,
+    sha256_file,
+)
 from pm_bench.predictions import read_predictions_csv, write_predictions_csv
 from pm_bench.prefixes import extract_prefixes, read_prefixes_csv, write_prefixes_csv
 from pm_bench.registry import get_dataset, load_registry
@@ -71,6 +77,68 @@ def info(name: str) -> None:
     )
 
 
+@main.command()
+@click.argument("name")
+@click.option(
+    "--pin",
+    is_flag=True,
+    default=False,
+    help="After locating the cached file, print a registry.yml patch with its sha256.",
+)
+def fetch(name: str, pin: bool) -> None:
+    """Make a dataset available locally and verify its hash.
+
+    Auto-downloads when `download_url` is set; otherwise prints
+    instructions for the manual TOS-gated download path (4TU / Mendeley).
+    """
+    try:
+        d = get_dataset(name)
+    except KeyError:
+        click.echo(f"unknown dataset: {name}", err=True)
+        sys.exit(1)
+
+    if d.format == "synthetic":
+        click.echo(f"{name}: generated on demand, no fetch needed")
+        return
+
+    try:
+        result = ensure_cached(d)
+    except ManualFetchRequired as exc:
+        # Special-cased so --pin also works against a file the user just
+        # placed by hand. If the file is now there, hash it directly and
+        # report it; otherwise print the instructions and bail.
+        path = exc.expected_path
+        if path.exists():
+            actual = sha256_file(path)
+            click.echo(f"{name}: cached at {path}")
+            click.echo(f"  sha256: {actual}")
+            if pin:
+                _print_pin_patch(name, actual)
+            elif d.sha256 is None:
+                click.echo("  (registry hash unset — re-run with --pin to emit a patch)")
+            return
+        click.echo(str(exc), err=True)
+        sys.exit(2)
+    except FetchError as exc:
+        click.echo(f"{name}: {exc}", err=True)
+        sys.exit(2)
+
+    state = "downloaded" if result.downloaded else "cached"
+    pinned = "verified" if result.pinned else "unpinned"
+    click.echo(f"{name}: {state} at {result.path} ({pinned})")
+    click.echo(f"  sha256: {result.sha256}")
+    if pin and not result.pinned:
+        _print_pin_patch(name, result.sha256)
+
+
+def _print_pin_patch(name: str, digest: str) -> None:
+    """Print a YAML snippet the user can paste into registry.yml."""
+    click.echo("")
+    click.echo("# paste under the matching dataset entry in pm_bench/registry.yml:")
+    click.echo(f"  - name: {name}")
+    click.echo(f"    sha256: {digest}")
+
+
 @main.command()
 @click.argument("name")
 @click.option("--task", default="next-event", show_default=True)