v0.1: real download URLs + fetch implementation (skeleton) #1

Closed
protosphinx wants to merge 1 commit into main from v0.1-starter

@protosphinx
Member

Scope

Adds the plumbing for fetching real datasets: per-dataset TODO comments in registry.yml (with API hints for 4TU.ResearchData and Mendeley), a fetch_dataset() skeleton in pm_bench/fetch.py, a _cache.py helper for ~/.cache/pm-bench/, and a pm-bench fetch <name> CLI subcommand. No real URLs or sha256 hashes are guessed — all nullable fields remain null.

Checklist

Registry (pm_bench/registry.yml)

  • bpi2012: resolve 4TU direct download URL and pin sha256
  • bpi2017: resolve 4TU direct download URL and pin sha256
  • bpi2018: resolve 4TU direct download URL and pin sha256
  • bpi2019: resolve 4TU direct download URL and pin sha256
  • bpi2020: decide which sub-files to include; resolve individual URLs and pin sha256
  • sepsis: resolve 4TU direct download URL and pin sha256
  • helpdesk: resolve Mendeley direct CSV download URL and pin sha256

Fetch implementation (pm_bench/fetch.py)

  • HTTP download with resume support (Range header)
  • sha256 verification after download
  • Atomic move from .tmp to final path
  • Wire _cache.cache_dir() as the default cache root
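
The first three items above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in pm_bench/fetch.py; the function names (sha256_of, download_with_resume, verify_and_finalize) are hypothetical, and treating a null sha256 as "skip verification" is an assumption.

```python
import hashlib
import os
import urllib.request
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex sha256 of a file, read in chunks so large logs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_resume(url: str, tmp_path: Path, chunk_size: int = 1 << 20) -> None:
    """Download url into tmp_path, continuing from existing bytes when possible.

    A 416 response (the temp file is already complete) surfaces as
    urllib.error.HTTPError and is deliberately not handled in this sketch.
    """
    offset = tmp_path.stat().st_size if tmp_path.exists() else 0
    request = urllib.request.Request(url)
    if offset:
        request.add_header("Range", f"bytes={offset}-")
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; any other success
        # status carries the full body, so the temp file is overwritten.
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(tmp_path, mode) as out:
            for chunk in iter(lambda: response.read(chunk_size), b""):
                out.write(chunk)


def verify_and_finalize(
    tmp_path: Path, final_path: Path, expected_sha256: Optional[str]
) -> Path:
    """Check the hash when one is pinned, then atomically move the file into place."""
    if expected_sha256 is not None:
        actual = sha256_of(tmp_path)
        if actual != expected_sha256.lower():
            tmp_path.unlink()  # never leave a corrupt download behind
            raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # os.replace is atomic when both paths live on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

Keeping the temp file next to the final path is what makes the last step atomic: readers see either no file or the complete, verified one.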

Tests (tests/test_fetch.py)

  • Fill in all TODO tests once fetch_dataset is implemented

Roadmap context

See the Roadmap section of the README.


Generated by Claude Code

@protosphinx
Member Author

Closing as superseded — every TODO in this draft is now complete on main (commit 9c00b47):

Registry: fetch + hash machinery shipped (pm-bench fetch <name> [--pin]); per-dataset hash pins remain pending the one-time TOS-gated downloads (a human step, not a code task).

Fetch implementation: pm_bench/fetch.py ships full HTTP download, atomic move from .part to final path, sha256 verification, PID+UUID-staged tmp files for concurrency safety, partial-download cleanup, and explicit handling for the bundled synthetic-toy case.

Cache: pm_bench/cache.py (note: dropped the leading underscore, since it's part of the public API surface for tests) handles the PM_BENCH_CACHE override and per-dataset paths.

Tests: tests/test_fetch.py (16 tests, including a tmp-HTTP-server test for the auto-download path) and tests/test_cache.py (path resolution).
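
For readers unfamiliar with the pattern, the cache override and PID+UUID staging described above might look roughly like this. It is a sketch under assumptions, not the contents of pm_bench/cache.py: the helper names dataset_path and staging_path, the <cache>/<dataset>/<filename> layout, and the exact temp-file naming are all hypothetical. Only the PM_BENCH_CACHE variable, the ~/.cache/pm-bench/ default, and the .part suffix come from this thread.

```python
import os
import uuid
from pathlib import Path


def cache_dir() -> Path:
    """Cache root: PM_BENCH_CACHE if set and non-empty, else ~/.cache/pm-bench."""
    override = os.environ.get("PM_BENCH_CACHE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "pm-bench"


def dataset_path(name: str, filename: str) -> Path:
    """Per-dataset location inside the cache (layout assumed for illustration)."""
    return cache_dir() / name / filename


def staging_path(final_path: Path) -> Path:
    """Unique sibling of final_path for an in-progress download.

    The PID separates concurrent processes and the UUID separates threads or
    retries within one process, so no two fetches share a temp file.
    """
    token = f"{os.getpid()}.{uuid.uuid4().hex}"
    return final_path.with_name(f"{final_path.name}.{token}.part")
```

Because the staging file is a sibling of the final path, the closing rename stays on one filesystem and can be done atomically.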

The CLI gained pm-bench fetch, plus stats, validate, compare, leaderboard, predict, discover, prefixes, score — all wired up.

If you'd like to keep TODO.md for tracking the per-dataset pinning step (the only remaining v0.1 work), I can open a fresh PR adding just that file.

@protosphinx protosphinx closed this May 1, 2026