docs: fall back to snapshot.json when a slug's parquet is missing

mprammer · claude · mprammer · commit 3d81c4579769 · 2026-05-07T14:56:07.000-04:00
`generate_datasets_md` previously read row count, row groups, and file
sizes only from disk, so any slug without a local parquet rendered as
four dashes. With 249 slugs and a total build cost in the hundreds of
CPU-hours, no maintainer has every output on hand at any one time, so a
partial-build regen would silently destroy ground truth in the v1
snapshot.

This commit teaches the regen path to fall back to `docs/snapshot.json`
(the existing TUI fallback) when a slug isn't built locally. Lookup
order:

1. `docs/snapshot.json`                 — gitignored scratch, wins
2. `docs/v{schema_version}/snapshot.json` — tracked canonical (fresh-clone path)
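
The lookup order above can be sketched roughly as follows. This is a
standalone illustration, not the code in `docs.py`: the function name
`load_snapshot_slugs` and the `docs_dir` parameter are hypothetical,
introduced so the snippet runs outside the repo. It assumes the
snapshot is a JSON object with a top-level `slugs` key.

```python
import json
from pathlib import Path
from typing import Optional


def load_snapshot_slugs(docs_dir: Path, schema_version: Optional[int] = None) -> dict:
    """Return the `slugs` mapping from the first readable snapshot, else {}."""
    # 1. gitignored scratch copy: wins when a maintainer regenerated locally
    candidates = [docs_dir / "snapshot.json"]
    if schema_version is not None:
        # 2. tracked canonical copy: the only one a fresh clone has
        candidates.append(docs_dir / f"v{schema_version}" / "snapshot.json")
    for path in candidates:
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # missing or malformed: try the next candidate
        if isinstance(data, dict):
            return data.get("slugs", {})
    return {}  # callers degrade to the dash placeholder
```

A malformed scratch file does not block the fallback here; the sketch
simply moves on to the tracked copy.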

Also captures `last_built_row_groups` in the snapshot so the fallback
can fill the row-groups column too.
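
As a minimal sketch of how a snapshot entry could turn into table
cells, assuming the entry carries the `last_built_rows` and
`last_built_row_groups` keys described above (the helper name
`fallback_cells` is hypothetical and does not exist in the repo):

```python
DASH = "—"  # placeholder rendered when no value is known


def fallback_cells(snap: dict) -> tuple:
    """Format (row_count, row_groups) cells from a snapshot entry."""

    def fmt(value) -> str:
        # Only integers are trusted; None or a missing key becomes a dash.
        return f"{value:,}" if isinstance(value, int) else DASH

    return fmt(snap.get("last_built_rows")), fmt(snap.get("last_built_row_groups"))
```

An entry written before this change has no `last_built_row_groups`
key, so under this sketch its row-groups cell would stay a dash until
the snapshot is regenerated, while the row count would still be filled.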

Surfaces the new "snapshot.json is load-bearing" invariant in agent
tooling: AGENTS.md, SKILLS.md, and the raincloud-docs skill description
all now flag that `docs.py datasets` alone won't refresh the snapshot,
so partial regens should use the no-args form (which already regens
all three artefacts in lockstep).

Tests: 5 new in tests/test_docs.py — disk-present unchanged, two
fallback paths (top-level + v{n}), both-missing dashes, snapshot
captures row groups for built slugs.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
diff --git a/.agents/skills/raincloud-docs/SKILL.md b/.agents/skills/raincloud-docs/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: raincloud-docs
-description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. The snapshot file is what the TUI falls back to for unbuilt-locally slugs in its columns / types modals. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
+description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. snapshot.json is load-bearing — it's the fallback both for the TUI's columns / types modals AND for `datasets.md` regen on slugs not built locally, so partial-build maintainers don't dash-out the table. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
 argument-hint: [datasets | handlers | snapshot]...
 allowed-tools: Bash(python -m scripts.pipeline.docs *)
 ---
@@ -15,7 +15,7 @@ Targets:
 - *(no args)* — regenerates all three: `datasets.md`, `handlers.md`, `snapshot.json`.
 - `datasets` — just `docs/datasets.md` (one row per dataset).
 - `handlers` — just `docs/handlers.md` (one row per registered transform handler).
-- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI as a fallback for slugs whose parquet isn't built locally).
+- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI for unbuilt-locally slugs AND by `datasets.md` regen as the row-count / size fallback). Run alongside `datasets` if you ever invoke `datasets` on its own — otherwise the markdown table will drift from the manifest.
 
 Output lands in **`docs/*.md` / `docs/*.json`** (gitignored scratch). Promote to the tracked canonical path with a manual copy:
 
diff --git a/AGENTS.md b/AGENTS.md
@@ -63,7 +63,7 @@ Hydration policy / philosophy lives in the hand-maintained [`HYDRATING.md`](HYDR
 
 ## What this repo does
 
-Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json`, read by the TUI as a fallback for unbuilt-locally slugs) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
+Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json` — read by the TUI as a fallback for unbuilt-locally slugs, AND used by `docs.py` itself as the row-count / file-size fallback when regenerating `datasets.md` on a partial build) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
 
 The pipeline flow is: **fetch → extract → parse → transform → write → validate → convert** (stage 7 opt-in per-spec), orchestrated by `scripts.pipeline.build`.
 
@@ -122,10 +122,12 @@ Small (<100 MB) parquets are fine to rebuild without asking.
 ## Regenerate derived docs after any pipeline change
 
 ```bash
-python -m scripts.pipeline.docs    # datasets.md + columns_{parquet,vortex}.md + coverage_{parquet,vortex}.md + handlers.md
+python -m scripts.pipeline.docs    # datasets.md + handlers.md + snapshot.json
 ```
 
-All six derived docs are regenerated in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+All three derived artefacts regenerate in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+
+**Keep `docs/snapshot.json` fresh — it's load-bearing.** `datasets.md` regen reads from disk for slugs you've built locally and falls back to `docs/snapshot.json` (or `docs/v{schema_version}/snapshot.json` on a fresh clone) for everything else. Without that fallback, regenerating on a partial build would dash-out 200+ rows and silently destroy ground truth in the tracked snapshot. The default no-args invocation regens snapshot + datasets in lockstep, so it's only at risk if you do partial regens — `docs.py datasets` alone won't refresh the snapshot. After a build, prefer the no-args form.
 
 ## Style and scope
 
diff --git a/SKILLS.md b/SKILLS.md
@@ -380,15 +380,18 @@ df -h .                                           # disk headroom
 ## Regenerating specific docs
 
 ```bash
-python -m scripts.pipeline.docs            # both files (datasets.md + handlers.md)
+python -m scripts.pipeline.docs            # all three (datasets.md + handlers.md + snapshot.json)
 python -m scripts.pipeline.docs datasets   # just datasets.md
 python -m scripts.pipeline.docs handlers   # just handlers.md (registry + manifest usage)
+python -m scripts.pipeline.docs snapshot   # just snapshot.json (per-slug schema + sizes)
 ```
 
-Writes land in `docs/*.md` (gitignored scratch). To promote a snapshot to the tracked canonical path, copy to `docs/v{schema_version}/`.
+Writes land in `docs/{datasets.md, handlers.md, snapshot.json}` (gitignored scratch). To promote, copy to the tracked `docs/v{schema_version}/`.
 
 Regenerate **after** any of: build, convert run, in-place tightening, manifest edit that changes short_name / license / description / family / expect.rows. Skip if the change doesn't affect the catalog or the handler registry.
 
+**`snapshot.json` is the load-bearing fallback** — `datasets.md` regen reads it for any slug whose parquet isn't on disk locally (otherwise the row would dash out the row count, sizes, and column-derived "Data Kind" tag). The no-args form keeps snapshot + datasets in lockstep; if you do a partial regen with `docs.py datasets`, run `docs.py snapshot` first (or just use the no-args form) so the table doesn't drift.
+
 The other catalog views (columns, coverage, vortex-skip, hydration candidates) are no longer markdown — query them via `python -m scripts.pipeline.list_datasets --columns / --coverage / --no-vortex / --hydrate` or interactively in the TUI (`python -m scripts.pipeline.browse`).
 
 ## Removing a dataset
diff --git a/scripts/pipeline/docs.py b/scripts/pipeline/docs.py
@@ -10,6 +10,13 @@
                           built state. Read by the TUI as a fallback when a
                           local parquet isn't built, so the columns / types
                           modals can still show *expected* contents.
+                          ALSO read by `generate_datasets_md` below as the
+                          fallback for row count / row-group count / file
+                          sizes when a slug's parquet isn't present locally
+                          — without it, regen by a maintainer who hasn't
+                          built every slug would dash-out the whole table.
+                          Keep snapshot.json regenerated whenever a new
+                          slug lands or a build's row count / size changes.
 
 Per-column / per-coverage / vortex-skip / hydrated detail used to live as
 markdown too, but the rendering was unscannable and duplicated state
@@ -68,8 +75,12 @@ def _generation_header(kind: str) -> str:
 }
 
 
-def _data_kind(spec: dict, pq_schema=None) -> str:
-    """Best-effort inference of the 'Data Kind' label."""
+def _data_kind(spec: dict, column_names: set[str] | None = None) -> str:
+    """Best-effort inference of the 'Data Kind' label.
+
+    `column_names` may come from a live parquet schema OR from the snapshot
+    fallback — both cases need to recognise the `content` blob convention.
+    """
     family = spec.get("family", "")
     if family in _KIND_BY_FAMILY:
         return _KIND_BY_FAMILY[family]
@@ -93,14 +104,39 @@ def _data_kind(spec: dict, pq_schema=None) -> str:
         base = "Custom"
     else:
         base = "Tabular (CSV)"
-    # Blob column bumps the label
-    if pq_schema is not None:
-        names = {f.name for f in pq_schema}
-        if "content" in names:
-            base = f"{base.split(' (')[0]} + Blobs"
+    if column_names and "content" in column_names:
+        base = f"{base.split(' (')[0]} + Blobs"
     return base
 
 
+def _load_snapshot_slugs(schema_version: int | None = None) -> dict[str, dict]:
+    """Return the `slugs` mapping from the on-disk snapshot, or `{}`.
+
+    Used by `generate_datasets_md` to fall back to the last-known row count
+    / sizes when a slug's parquet isn't present locally. Tries:
+
+        1. `docs/snapshot.json`                 (gitignored scratch — wins
+           if a maintainer regenerated locally)
+        2. `docs/v{schema_version}/snapshot.json`  (tracked canonical — what
+           a fresh clone has)
+
+    Returns `{}` on a missing or malformed snapshot — callers degrade to
+    the dash placeholder.
+    """
+    import json
+    candidates = [SNAPSHOT_JSON]
+    if schema_version is not None:
+        candidates.append(REPO_ROOT / "docs" / f"v{schema_version}" / "snapshot.json")
+    for path in candidates:
+        if not path.exists():
+            continue
+        try:
+            return json.loads(path.read_text()).get("slugs", {})
+        except (json.JSONDecodeError, OSError):
+            continue
+    return {}
+
+
 def _size_label(bytes_: int | None) -> str:
     if bytes_ is None:
         return "—"
@@ -112,24 +148,37 @@ def _size_label(bytes_: int | None) -> str:
 
 def generate_datasets_md():
     manifest = load_manifest()
+    snapshot_slugs = _load_snapshot_slugs(manifest.get("schema_version"))
     rows = []
     advisories: list[tuple[str, str, str]] = []  # (slug, short_name, advisory text)
     for spec in manifest["datasets"]:
         slug = spec["slug"]
         parquet = prepared_parquet(slug)
+        snap = snapshot_slugs.get(slug, {})
         if parquet.exists():
             meta = pq.ParquetFile(parquet).metadata
             schema = pq.ParquetFile(parquet).schema_arrow
             row_count = f"{meta.num_rows:,}"
             row_groups = f"{meta.num_row_groups:,}"
             parquet_size = _size_label(parquet.stat().st_size)
-            kind = _data_kind(spec, schema)
+            kind = _data_kind(spec, column_names={f.name for f in schema})
         else:
-            row_count = row_groups = parquet_size = "—"
-            kind = _data_kind(spec)
+            # Fall back to the last-known snapshot entry so partial-build
+            # maintainers don't dash-out everything they haven't built locally.
+            r = snap.get("last_built_rows")
+            rg = snap.get("last_built_row_groups")
+            row_count = f"{r:,}" if isinstance(r, int) else "—"
+            row_groups = f"{rg:,}" if isinstance(rg, int) else "—"
+            parquet_size = _size_label(snap.get("parquet_bytes"))
+            cols = snap.get("columns") or []
+            names = {c["name"] for c in cols if isinstance(c, dict) and "name" in c}
+            kind = _data_kind(spec, column_names=names or None)
 
         vortex = prepared_vortex(slug)
-        vortex_size = _size_label(vortex.stat().st_size) if vortex.exists() else "—"
+        if vortex.exists():
+            vortex_size = _size_label(vortex.stat().st_size)
+        else:
+            vortex_size = _size_label(snap.get("vortex_bytes"))
 
         short = spec["short_name"]
         advisory = spec_field(spec, "license.scrape_advisory")
@@ -342,7 +391,8 @@ def generate_snapshot(*, overwrite_missing: bool = False):
         expected_rows = spec_field(spec, "expect.rows")
         fresh: dict = {
             "expected_rows": expected_rows,
-            "last_built_rows": None,  # populated below from parquet metadata
+            "last_built_rows": None,        # populated below from parquet metadata
+            "last_built_row_groups": None,  # populated below from parquet metadata
             "parquet_bytes": parquet.stat().st_size if parquet.exists() else None,
             "vortex_bytes": vortex.stat().st_size if vortex.exists() else None,
             "columns": None,  # populated below when schema is readable
@@ -358,6 +408,7 @@ def generate_snapshot(*, overwrite_missing: bool = False):
                     {"name": f.name, "type": str(f.type)} for f in pf.schema_arrow
                 ]
                 fresh["last_built_rows"] = int(pf.metadata.num_rows)
+                fresh["last_built_row_groups"] = int(pf.metadata.num_row_groups)
                 n_with_schema += 1
             except Exception as e:
                 fresh["columns_error"] = f"{type(e).__name__}: {str(e)[:120]}"
diff --git a/tests/test_docs.py b/tests/test_docs.py