
Migrate dataset distribution to HuggingFace-native #2

@mohitgargai

Description

Context

Today the dataset is distributed in two different ways, and the two copies have drifted:

  • GCS zip: gs://lica-ml/gdb-dataset.zip (public URL: https://storage.googleapis.com/lica-ml/gdb-dataset.zip), pointed at by scripts/download_data.py. Contains the canonical on-disk layout that both the upstream run_benchmarks.py and the Harbor adapter load from:
    gdb-dataset/
    ├── benchmarks/      # ~3.4 GB per-benchmark JSON definitions
    └── lica-data/       # ~1.0 GB (metadata.csv, layouts, images, annotations)
    
  • Hugging Face: lica-world/GDB, currently ~62 parquet files across 39 benchmark configs. Only has per-benchmark sample rows; no lica-data/.
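A small layout check makes the canonical structure explicit and catches drift between the two copies early. A minimal sketch, assuming only the tree shown above (the helper name `validate_layout` is hypothetical):

```python
from pathlib import Path

def validate_layout(root: Path) -> list[str]:
    """Return the pieces of the expected gdb-dataset/ layout missing under root."""
    # Expected structure per the canonical zip: benchmarks/ JSON definitions
    # plus lica-data/ with metadata.csv (layouts/images/annotations omitted here).
    expected = ["benchmarks", "lica-data", "lica-data/metadata.csv"]
    return [p for p in expected if not (root / p).exists()]
```

Returning the missing subpaths (rather than a bool) gives download scripts and CI something actionable to print.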

Recent incidents that motivate this cleanup:

  1. template-{4,5}.json shipped context_layouts / designated_layout / ground_truth as JSON-encoded strings instead of dicts. This crashed the template-5 evaluator and produced doubly-escaped prompts for template-4. The fix landed locally and on HF, but the GCS zip (and the harbor-datasets tasks built from it) stayed out of sync until a manual refresh (harbor PR #1433 plus companion harbor-datasets PR #196).
  2. A stalled gsutil upload in the same session left gs://lica-ml/gdb-dataset.zip returning 404 for several hours, silently breaking scripts/download_data.py.

Proposal

Move the dataset to a HF-native layout and make lica-world/GDB the single source of truth.

Scope

  • Decide on the HF representation for lica-data/:
    • Option A: Embed image/layout bytes in the parquet rows (datasets.Image() / datasets.Value('binary')). Clean but grows the parquet size.
    • Option B: Publish lica-data/ as a separate HF dataset (lica-world/GDB-assets) using the imagefolder builder or raw LFS files, and have datasets.load_dataset('lica-world/GDB', <benchmark>) reference asset paths.
    • Option C: Keep lica-data/ on GCS as a pure CDN and migrate only benchmarks/ to HF. Simplest, but it doesn't get us off GCS.
  • Rewrite scripts/download_data.py:
    • Default source: HF (via huggingface_hub.snapshot_download or datasets.load_dataset(...).save_to_disk).
    • Reconstruct the on-disk gdb-dataset/ layout so run_benchmarks.py and the Harbor adapter keep working unchanged.
    • Keep --from-zip as the local-archive escape hatch; drop the GCS URL as the default.
  • Update BaseBenchmark.load_data(...) paths and/or the per-task load_data implementations if we go Option A (reading from HF rows instead of disk).
  • Add a CI job that publishes a tagged dataset release to lica-world/GDB on HF on every merge to main (prevents future GCS/HF drift).
  • Deprecate the GCS zip: leave it up with a README note pointing at HF, or redirect via a CDN alias.
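To make the download_data.py rewrite concrete, here is a rough sketch assuming the HF snapshot mirrors the benchmarks/ and lica-data/ directory names. The flag names, helper names, and snapshot structure are assumptions for illustration, not the final design:

```python
import argparse
import shutil
from pathlib import Path

# Repo id taken from the issue; snapshot layout is an assumption.
REPO_ID = "lica-world/GDB"

def reconstruct_layout(snapshot_dir: Path, out_dir: Path) -> Path:
    """Rebuild the canonical gdb-dataset/ tree from a snapshot so
    run_benchmarks.py and the Harbor adapter keep working unchanged."""
    root = out_dir / "gdb-dataset"
    for sub in ("benchmarks", "lica-data"):
        src = snapshot_dir / sub
        if src.is_dir():
            shutil.copytree(src, root / sub, dirs_exist_ok=True)
    return root

def fetch_from_hf(out_dir: Path) -> Path:
    # Default source: HF. Imported lazily so --from-zip works offline.
    from huggingface_hub import snapshot_download
    snap = snapshot_download(repo_id=REPO_ID, repo_type="dataset")
    return reconstruct_layout(Path(snap), out_dir)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--out", type=Path, default=Path("."))
    parser.add_argument("--from-zip", type=Path, default=None,
                        help="local-archive escape hatch")
    args = parser.parse_args()
    if args.from_zip:
        tmp = args.out / "_unzipped"
        shutil.unpack_archive(args.from_zip, tmp)
        reconstruct_layout(tmp / "gdb-dataset", args.out)
    else:
        fetch_from_hf(args.out)

if __name__ == "__main__":
    main()
```

Keeping reconstruction in one function means both the HF path and the --from-zip path produce the same tree, which is exactly the parity the success criteria ask for.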

Non-goals (explicitly out of scope)

  • Moving the media URLs embedded inside layouts (https://storage.googleapis.com/lica-video/<uuid>.png) off GCS. These are public, stable, and referenced across hundreds of assets; that's a separate cleanup.

Success criteria

  • pip install lica-gdb && python -c 'from datasets import load_dataset; load_dataset("lica-world/GDB", "template-5")' works standalone.
  • python scripts/download_data.py fetches from HF by default and produces an identical on-disk layout to the current zip extract.
  • Removing gs://lica-ml/gdb-dataset.zip does not break any supported workflow.
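The "identical on-disk layout" criterion can be checked mechanically by comparing file manifests of the two extracted trees. A minimal sketch (helper name hypothetical; byte-level comparison would be a further step):

```python
from pathlib import Path

def tree_manifest(root: Path) -> set[str]:
    """Relative POSIX paths of every file under root, for layout parity checks."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
```

Usage: assert tree_manifest(hf_extract_dir) == tree_manifest(zip_extract_dir) in a one-off parity script or CI check.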

Why now

The Harbor adapter PR (harbor-framework/harbor#1433) is landing, and it will become a second public consumer of this dataset. This is the right time to converge on one distribution path before more downstream users build on top of the GCS URL.


Filed as a follow-up from the Scenario-2 parity work on the harbor-adapter branch.
