Context
Today the dataset is distributed two different ways and they have drifted:
- GCS zip — gs://lica-ml/gdb-dataset.zip (public URL: https://storage.googleapis.com/lica-ml/gdb-dataset.zip), pointed at by scripts/download_data.py. Contains the canonical on-disk layout that both the upstream run_benchmarks.py and the Harbor adapter load from:
gdb-dataset/
├── benchmarks/ # ~3.4 GB per-benchmark JSON definitions
└── lica-data/ # ~1.0 GB (metadata.csv, layouts, images, annotations)
- HuggingFace — lica-world/GDB, currently ~62 parquet files across 39 benchmark configs. Only has per-benchmark sample rows, no lica-data/.
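For reference, the canonical extract above can be sanity-checked with a few path lookups. The three required paths come from the tree listing; the helper itself is hypothetical, not part of the repo:

```python
from pathlib import Path

def check_gdb_layout(root: str) -> list[str]:
    """Return the pieces of a gdb-dataset/ extract that are missing."""
    base = Path(root)
    # Required paths taken from the layout above (assumed exhaustive here).
    required = ("benchmarks", "lica-data", "lica-data/metadata.csv")
    return [rel for rel in required if not (base / rel).exists()]
```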
Recent incidents that motivate this cleanup:
- template-{4,5}.json shipped context_layouts / designated_layout / ground_truth as JSON-encoded strings (instead of dicts). Caused a template-5 evaluator crash and doubly-escaped prompts for template-4. Fix landed locally + on HF, but the GCS zip and the harbor-datasets tasks built from it both drifted out of sync until a manual refresh (harbor PR #1433 + companion harbor-datasets PR #196).
- A stalled gsutil upload in the same session left gs://lica-ml/gdb-dataset.zip returning 404 for several hours, silently breaking scripts/download_data.py.
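The first incident above is the dicts-shipped-as-strings class of bug: fields that should be dicts arrived as JSON-encoded (sometimes doubly encoded) strings. A defensive normalizer is one way to make loaders robust to both shapes; the field names come from the incident description, but the helper is a sketch, not the actual fix that landed:

```python
import json

# Fields named in the template-{4,5} incident (assumed to be the full set).
FIELDS = ("context_layouts", "designated_layout", "ground_truth")

def normalize_record(record: dict) -> dict:
    """Decode any of FIELDS that arrived as JSON strings instead of dicts."""
    out = dict(record)
    for key in FIELDS:
        value = out.get(key)
        if isinstance(value, str):
            value = json.loads(value)
            # A doubly-escaped field decodes to a string the first time.
            if isinstance(value, str):
                value = json.loads(value)
            out[key] = value
    return out
```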
Proposal
Move the dataset to an HF-native layout and make lica-world/GDB the single source of truth.
Scope
- Decide how to handle lica-data/:
  - Option A: embed the assets in the parquet rows (datasets.Image() / datasets.Value('binary')). Clean but grows the parquet size.
  - Option B: publish lica-data/ as a separate HF dataset (lica-world/GDB-assets) using the imagefolder builder or raw LFS files, and have datasets.load_dataset('lica-world/GDB', <benchmark>) reference asset paths.
  - Option C: keep lica-data/ on GCS as a pure CDN and only migrate benchmarks/ to HF. Simplest, but doesn't get us off of GCS.
- Rework scripts/download_data.py: fetch from HF (huggingface_hub.snapshot_download or datasets.load_dataset(...).save_to_disk), reproduce the gdb-dataset/ layout so run_benchmarks.py and the Harbor adapter keep working unchanged, and keep --from-zip as the local-archive escape hatch; drop the GCS URL as the default.
- Update the BaseBenchmark.load_data(...) paths and/or the per-task load_data implementations if we go Option A (reading from HF rows instead of disk).
- Add CI for lica-world/GDB that publishes tagged releases to HF on every main merge (prevents future GCS/HF drift).
Non-goals (explicitly out of scope)
- Moving the media URLs embedded inside layouts (https://storage.googleapis.com/lica-video/<uuid>.png) off GCS. These are public, stable, and referenced across hundreds of assets; that's a separate cleanup.
Success criteria
- pip install lica-gdb && python -c 'from datasets import load_dataset; load_dataset("lica-world/GDB", "template-5")' works standalone.
- python scripts/download_data.py fetches from HF by default and produces an identical on-disk layout to the current zip extract.
- Removing gs://lica-ml/gdb-dataset.zip does not break any supported workflow.
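The second criterion ("identical on-disk layout") can be checked mechanically: the HF-based download and the zip extract should contain the same set of relative file paths. A generic sketch with no repo-specific assumptions beyond comparing two directory trees:

```python
from pathlib import Path

def tree_diff(a: str, b: str) -> tuple[set, set]:
    """Return (files only under a, files only under b) as relative POSIX paths."""
    def files(root: str) -> set:
        return {p.relative_to(root).as_posix()
                for p in Path(root).rglob("*") if p.is_file()}
    return files(a) - files(b), files(b) - files(a)
```

An empty diff in both directions (plus a content hash pass, if needed) would satisfy the criterion.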
Why now
The Harbor adapter PR (harbor-framework/harbor#1433) is landing, and it will become a second public consumer of this dataset. This is the right time to converge on one distribution path before more downstream users build on top of the GCS URL.
Filed as a follow-up from the Scenario-2 parity work on the harbor-adapter branch.