
v0.1: dataset fetch + sha256 verification + cache resolution#3

Closed

protosphinx wants to merge 1 commit into end-to-end-loop from dataset-fetch

Conversation

@protosphinx
Member

Stacked on top of #2 (end-to-end loop). Merge #2 first; then this rebases trivially onto main.

Summary

  • Ships the v0.1 machinery: pm-bench fetch <name> [--pin], sha256 verification, cache directory resolution.
  • Per-dataset hash-pinning is now a one-time manual step (download from 4TU/Mendeley → place in cache → run --pin → PR the patch). Everything downstream is automated.

What's new

  • pm_bench/cache.py — cache root from $PM_BENCH_CACHE (else ~/.cache/pm-bench/), per-dataset path with the right extension by format. Synthetic datasets are rejected (generated on demand).
  • pm_bench/fetch.py — ensure_cached(dataset) covers all four cases:
    1. cached + hash matches → return path
    2. cached + hash mismatch → HashMismatchError (loud, refuses silent proceed)
    3. cached + registry hash unset → returns actual hash (caller can --pin it)
    4. not cached → auto-download if URL set, else ManualFetchRequired with the precise landing URL and on-disk path the user needs to fill in
      Atomic .part-then-rename writes; streams in 1 MiB chunks.
  • CLI: pm-bench fetch <name> [--pin] — prints status; with --pin against an unpinned cached file, emits a pasteable registry.yml sha256 patch.
  • 13 new tests (test_cache.py, test_fetch.py). 37 total.
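The cache-resolution and atomic-download behavior described above can be sketched roughly as follows. This is illustrative, not the PR's actual code: `cache_root` and `atomic_download` are assumed names, and the real fetcher may use a different HTTP client than stdlib `urlopen`.

```python
import hashlib
import os
from pathlib import Path
from urllib.request import urlopen  # stdlib; the real code may use another client

CHUNK = 1024 * 1024  # stream in 1 MiB chunks, as the PR describes


def cache_root() -> Path:
    # $PM_BENCH_CACHE wins; otherwise fall back to ~/.cache/pm-bench/
    env = os.environ.get("PM_BENCH_CACHE")
    return Path(env) if env else Path.home() / ".cache" / "pm-bench"


def atomic_download(url: str, dest: Path) -> str:
    """Stream to dest's .part sibling, then rename; return the sha256 hex digest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.parent / (dest.name + ".part")
    digest = hashlib.sha256()
    with urlopen(url) as resp, open(part, "wb") as out:
        while chunk := resp.read(CHUNK):
            digest.update(chunk)
            out.write(chunk)
    # rename is atomic on the same filesystem, so dest never holds a partial file
    part.rename(dest)
    return digest.hexdigest()
```

Hashing while streaming means the digest comes for free on the download path; the unpinned-cache path would re-hash the file on disk instead.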

Why this matters

  • v0.1's blocker was always the 4TU TOS — we can't automate the download. This PR makes that the only manual step. Once a contributor accepts the TOS, downloads, and runs --pin, the registry update is a trivial PR; the rest of the toolchain (parse → split → prefixes → predict → score) is wired.
  • Unblocks gnn's v0.5 milestone the moment any single dataset is pinned.

Smoke

$ pm-bench fetch synthetic-toy
synthetic-toy: generated on demand, no fetch needed

$ pm-bench fetch bpi2020
bpi2020: no download_url (TOS-gated). Visit https://data.4tu.nl/...,
  accept the terms, and save the archive to ~/.cache/pm-bench/bpi2020.xes.gz.
  Then re-run `pm-bench fetch bpi2020 --pin` ...

# (manual download)

$ pm-bench fetch bpi2020 --pin
bpi2020: cached at ~/.cache/pm-bench/bpi2020.xes.gz (unpinned)
  sha256: <hex>

  # paste under the matching dataset entry in pm_bench/registry.yml:
    - name: bpi2020
      sha256: <hex>
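The four-case resolution driving the output above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Dataset` record is a hypothetical stand-in for a registry.yml entry, and only the exception names and case semantics come from the PR description; the auto-download branch is elided.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


class HashMismatchError(Exception):
    """Cached file's sha256 disagrees with the pinned registry hash."""


class ManualFetchRequired(Exception):
    """No download_url (TOS-gated); the user must fetch the file by hand."""


@dataclass
class Dataset:  # hypothetical stand-in for a registry.yml entry
    name: str
    sha256: Optional[str]        # None means unpinned
    download_url: Optional[str]  # None means TOS-gated
    landing_url: str


def ensure_cached(ds: Dataset, cache_path: Path) -> Union[Path, str]:
    if cache_path.exists():
        actual = hashlib.sha256(cache_path.read_bytes()).hexdigest()
        if ds.sha256 is None:
            return actual                     # case 3: caller may --pin it
        if actual != ds.sha256:
            raise HashMismatchError(          # case 2: refuse to proceed silently
                f"{ds.name}: expected {ds.sha256}, got {actual}")
        return cache_path                     # case 1: cached and verified
    if ds.download_url is None:               # case 4, manual branch
        raise ManualFetchRequired(
            f"{ds.name}: visit {ds.landing_url}, accept the terms, "
            f"and save the archive to {cache_path}")
    # case 4, auto branch: streaming download + atomic rename (elided here)
    raise NotImplementedError("download path omitted in this sketch")
```

Returning the actual hash in the unpinned case is what lets `--pin` print a pasteable registry.yml patch without a second hashing pass.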

Roadmap impact

  • v0.1 marked 🟡 in README/GOALS (machinery shipped, per-dataset pins pending). The next concrete step is per-dataset PRs that pin sha256, plus wiring an XES parser to _load_events (likely behind a [bpi] extra to keep the base install light).

- cache.py — `$PM_BENCH_CACHE` → `~/.cache/pm-bench/` with per-dataset
  paths; rejects synthetic and unknown formats
- fetch.py — `ensure_cached(dataset)` covers cached+match,
  cached+mismatch (loud HashMismatchError), cached+unpinned (returns
  actual hash), not-cached (auto-download if URL set, otherwise
  ManualFetchRequired with the precise landing URL + on-disk path).
  Streams in 1 MiB chunks; atomic .part-then-rename writes
- CLI: `pm-bench fetch <name> [--pin]` — prints status, emits a
  pasteable registry.yml sha256 patch when `--pin` is set against an
  unpinned-but-present cached file (the path the TOS-gated workflow
  takes)
- 13 new tests (test_cache.py, test_fetch.py); 37 total, ruff clean
- STATUS / GOALS / README updated: v0.1 marked partial — machinery
  shipped, per-dataset hash pins pending one-time manual downloads
@protosphinx
Member Author

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

@protosphinx protosphinx closed this May 1, 2026
@protosphinx protosphinx deleted the dataset-fetch branch May 1, 2026 17:54
