
Migrate dataset distribution to HuggingFace-native #2

@mohitgargai

Description

Context

Today the dataset is distributed in two different ways, and the two copies have drifted:

  • GCS zip: gs://lica-ml/gdb-dataset.zip (public URL: https://storage.googleapis.com/lica-ml/gdb-dataset.zip), pointed at by scripts/download_data.py. Contains the canonical on-disk layout that both the upstream run_benchmarks.py and the Harbor adapter load from:
    gdb-dataset/
    ├── benchmarks/      # ~3.4 GB per-benchmark JSON definitions
    └── lica-data/       # ~1.0 GB (metadata.csv, layouts, images, annotations)
    
  • Hugging Face: lica-world/GDB, currently ~62 parquet files across 39 benchmark configs. Only has per-benchmark sample rows; no lica-data/.
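A small layout check makes the canonical structure explicit and catches drift between the two copies early. A minimal sketch, assuming only the tree shown above (the helper name `validate_layout` is hypothetical):

```python
from pathlib import Path

def validate_layout(root: Path) -> list[str]:
    """Return the pieces of the expected gdb-dataset/ layout missing under root."""
    # Expected structure per the canonical zip: benchmarks/ JSON definitions
    # plus lica-data/ with metadata.csv (layouts/images/annotations omitted here).
    expected = ["benchmarks", "lica-data", "lica-data/metadata.csv"]
    return [p for p in expected if not (root / p).exists()]
```

Returning the missing subpaths (rather than a bool) gives download scripts and CI something actionable to print.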

Recent incidents that motivate this cleanup:

  1. template-{4,5}.json shipped context_layouts / designated_layout / ground_truth as JSON-encoded strings instead of dicts. This crashed the template-5 evaluator and produced doubly-escaped prompts for template-4. The fix landed locally and on HF, but the GCS zip (and the harbor-datasets tasks built from it) stayed out of sync until a manual refresh (harbor PR #1433 plus companion harbor-datasets PR #196).
  2. A stalled gsutil upload in the same session left gs://lica-ml/gdb-dataset.zip returning 404 for several hours, silently breaking scripts/download_data.py.

Proposal

Move the dataset to a HF-native layout and make lica-world/GDB the single source of truth.

Scope

  • Decide on the HF representation for lica-data/:
    • Option A: Embed image/layout bytes in the parquet rows (datasets.Image() / datasets.Value('binary')). Clean but grows the parquet size.
    • Option B: Publish lica-data/ as a separate HF dataset (lica-world/GDB-assets) using the imagefolder builder or raw LFS files, and have datasets.load_dataset('lica-world/GDB', <benchmark>) reference asset paths.
    • Option C: Keep lica-data/ on GCS as a pure CDN and migrate only benchmarks/ to HF. Simplest, but it doesn't get us off GCS.
  • Rewrite scripts/download_data.py:
    • Default source: HF (via huggingface_hub.snapshot_download or datasets.load_dataset(...).save_to_disk).
    • Reconstruct the on-disk gdb-dataset/ layout so run_benchmarks.py and the Harbor adapter keep working unchanged.
    • Keep --from-zip as the local-archive escape hatch; drop the GCS URL as the default.
  • Update BaseBenchmark.load_data(...) paths and/or the per-task load_data implementations if we go Option A (reading from HF rows instead of disk).
  • Add a CI job that publishes a tagged dataset release to lica-world/GDB on HF on every merge to main (prevents future GCS/HF drift).
  • Deprecate the GCS zip: leave it up with a README note pointing at HF, or redirect via a CDN alias.
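To make the download_data.py rewrite concrete, here is a rough sketch assuming the HF snapshot mirrors the benchmarks/ and lica-data/ directory names. The flag names, helper names, and snapshot structure are assumptions for illustration, not the final design:

```python
import argparse
import shutil
from pathlib import Path

# Repo id taken from the issue; snapshot layout is an assumption.
REPO_ID = "lica-world/GDB"

def reconstruct_layout(snapshot_dir: Path, out_dir: Path) -> Path:
    """Rebuild the canonical gdb-dataset/ tree from a snapshot so
    run_benchmarks.py and the Harbor adapter keep working unchanged."""
    root = out_dir / "gdb-dataset"
    for sub in ("benchmarks", "lica-data"):
        src = snapshot_dir / sub
        if src.is_dir():
            shutil.copytree(src, root / sub, dirs_exist_ok=True)
    return root

def fetch_from_hf(out_dir: Path) -> Path:
    # Default source: HF. Imported lazily so --from-zip works offline.
    from huggingface_hub import snapshot_download
    snap = snapshot_download(repo_id=REPO_ID, repo_type="dataset")
    return reconstruct_layout(Path(snap), out_dir)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--out", type=Path, default=Path("."))
    parser.add_argument("--from-zip", type=Path, default=None,
                        help="local-archive escape hatch")
    args = parser.parse_args()
    if args.from_zip:
        tmp = args.out / "_unzipped"
        shutil.unpack_archive(args.from_zip, tmp)
        reconstruct_layout(tmp / "gdb-dataset", args.out)
    else:
        fetch_from_hf(args.out)

if __name__ == "__main__":
    main()
```

Keeping reconstruction in one function means both the HF path and the --from-zip path produce the same tree, which is exactly the parity the success criteria ask for.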

Non-goals (explicitly out of scope)

  • Moving the media URLs embedded inside layouts (https://storage.googleapis.com/lica-video/<uuid>.png) off GCS. These are public, stable, and referenced across hundreds of assets; that's a separate cleanup.

Success criteria

  • pip install lica-gdb && python -c 'from datasets import load_dataset; load_dataset("lica-world/GDB", "template-5")' works standalone.
  • python scripts/download_data.py fetches from HF by default and produces an identical on-disk layout to the current zip extract.
  • Removing gs://lica-ml/gdb-dataset.zip does not break any supported workflow.
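The "identical on-disk layout" criterion can be checked mechanically by comparing file manifests of the two extracted trees. A minimal sketch (helper name hypothetical; byte-level comparison would be a further step):

```python
from pathlib import Path

def tree_manifest(root: Path) -> set[str]:
    """Relative POSIX paths of every file under root, for layout parity checks."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
```

Usage: assert tree_manifest(hf_extract_dir) == tree_manifest(zip_extract_dir) in a one-off parity script or CI check.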

Why now

The Harbor adapter PR (harbor-framework/harbor#1433) is landing, and it will become a second public consumer of this dataset. This is the right time to converge on one distribution path before more downstream users build on top of the GCS URL.


Filed as a follow-up from the Scenario-2 parity work on the harbor-adapter branch.
