Commit 3d81c45

mprammer and claude committed
docs: fall back to snapshot.json when a slug's parquet is missing
`generate_datasets_md` previously read row count, row groups, and file sizes only from disk: any slug without a local parquet rendered as four dashes. With 249 slugs and total build cost in the hundreds of CPU-hours, no maintainer has every output on hand at any one time, so a partial-build regen would silently destroy ground truth in the v1 snapshot.

This commit teaches the regen path to fall back to `docs/snapshot.json` (the existing TUI fallback) when a slug isn't built locally. Lookup order:

1. `docs/snapshot.json`: gitignored scratch, wins
2. `docs/v{schema_version}/snapshot.json`: tracked canonical (fresh-clone path)

Also captures `last_built_row_groups` in the snapshot so the fallback can fill the row-groups column too.

Surfaces the new "snapshot.json is load-bearing" invariant in agent tooling: AGENTS.md, SKILLS.md, and the raincloud-docs skill description all now flag that `docs.py datasets` alone won't refresh the snapshot, so partial regens should use the no-args form (which already regens all three artefacts in lockstep).

Tests: 5 new in tests/test_docs.py (disk-present unchanged, two fallback paths (top-level + v{n}), both-missing dashes, snapshot captures row groups for built slugs).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
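The lookup order described in the message can be exercised in isolation. The following is a minimal sketch, not the repository's code: `load_snapshot_slugs` and its `docs_dir` parameter are hypothetical stand-ins (the real helper reads module-level paths), and the snapshot contents are made-up demo values.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_snapshot_slugs(docs_dir: Path, schema_version: int | None = None) -> dict:
    """Return the `slugs` mapping from the first readable snapshot, else {}."""
    # Scratch copy first, then the tracked per-version copy.
    candidates = [docs_dir / "snapshot.json"]
    if schema_version is not None:
        candidates.append(docs_dir / f"v{schema_version}" / "snapshot.json")
    for path in candidates:
        if not path.exists():
            continue
        try:
            return json.loads(path.read_text()).get("slugs", {})
        except (json.JSONDecodeError, OSError):
            continue  # malformed or unreadable: try the next candidate
    return {}


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "v1").mkdir()
    (docs / "v1" / "snapshot.json").write_text(
        json.dumps({"slugs": {"demo": {"last_built_rows": 10}}})
    )
    # Only the tracked copy exists: this is the fresh-clone path.
    fresh_clone = load_snapshot_slugs(docs, schema_version=1)

    (docs / "snapshot.json").write_text(
        json.dumps({"slugs": {"demo": {"last_built_rows": 99}}})
    )
    # The scratch copy now exists and takes precedence.
    with_scratch = load_snapshot_slugs(docs, schema_version=1)

    (docs / "snapshot.json").write_text("{not json")
    # A malformed scratch copy is skipped in favour of the tracked copy.
    malformed_scratch = load_snapshot_slugs(docs, schema_version=1)

print(fresh_clone, with_scratch, malformed_scratch)
```

In this sketch a malformed scratch file does not mask the tracked copy, because the loop continues to the next candidate instead of returning early.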
1 parent db7b205 commit 3d81c45

5 files changed

Lines changed: 262 additions & 19 deletions

.agents/skills/raincloud-docs/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 ---
 name: raincloud-docs
-description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. The snapshot file is what the TUI falls back to for unbuilt-locally slugs in its columns / types modals. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
+description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. snapshot.json is load-bearing — it's the fallback both for the TUI's columns / types modals AND for `datasets.md` regen on slugs not built locally, so partial-build maintainers don't dash-out the table. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
 argument-hint: [datasets | handlers | snapshot]...
 allowed-tools: Bash(python -m scripts.pipeline.docs *)
 ---
@@ -15,7 +15,7 @@ Targets:
 - *(no args)* — regenerates all three: `datasets.md`, `handlers.md`, `snapshot.json`.
 - `datasets` — just `docs/datasets.md` (one row per dataset).
 - `handlers` — just `docs/handlers.md` (one row per registered transform handler).
-- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI as a fallback for slugs whose parquet isn't built locally).
+- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI for unbuilt-locally slugs AND by `datasets.md` regen as the row-count / size fallback). Run alongside `datasets` if you ever invoke `datasets` on its own — otherwise the markdown table will drift from the manifest.
 
 Output lands in **`docs/*.md` / `docs/*.json`** (gitignored scratch). Promote to the tracked canonical path with a manual copy:
 

AGENTS.md

Lines changed: 5 additions & 3 deletions
@@ -63,7 +63,7 @@ Hydration policy / philosophy lives in the hand-maintained [`HYDRATING.md`](HYDR
 
 ## What this repo does
 
-Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json`, read by the TUI as a fallback for unbuilt-locally slugs) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
+Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json`, read by the TUI as a fallback for unbuilt-locally slugs, AND used by `docs.py` itself as the row-count / file-size fallback when regenerating `datasets.md` on a partial build) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
 
 The pipeline flow is: **fetch → extract → parse → transform → write → validate → convert** (stage 7 opt-in per-spec), orchestrated by `scripts.pipeline.build`.
 
@@ -122,10 +122,12 @@ Small (<100 MB) parquets are fine to rebuild without asking.
 ## Regenerate derived docs after any pipeline change
 
 ```bash
-python -m scripts.pipeline.docs # datasets.md + columns_{parquet,vortex}.md + coverage_{parquet,vortex}.md + handlers.md
+python -m scripts.pipeline.docs # datasets.md + handlers.md + snapshot.json
 ```
 
-All six derived docs are regenerated in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+All three derived artefacts regenerate in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+
+**Keep `docs/snapshot.json` fresh — it's load-bearing.** `datasets.md` regen reads from disk for slugs you've built locally and falls back to `docs/snapshot.json` (or `docs/v{schema_version}/snapshot.json` on a fresh clone) for everything else. Without that fallback, regenerating on a partial build would dash-out 200+ rows and silently destroy ground truth in the tracked snapshot. The default no-args invocation regens snapshot + datasets in lockstep, so it's only at risk if you do partial regens — `docs.py datasets` alone won't refresh the snapshot. After a build, prefer the no-args form.
 
 ## Style and scope
 

SKILLS.md

Lines changed: 5 additions & 2 deletions
@@ -380,15 +380,18 @@ df -h . # disk headroom
 ## Regenerating specific docs
 
 ```bash
-python -m scripts.pipeline.docs # both files (datasets.md + handlers.md)
+python -m scripts.pipeline.docs # all three (datasets.md + handlers.md + snapshot.json)
 python -m scripts.pipeline.docs datasets # just datasets.md
 python -m scripts.pipeline.docs handlers # just handlers.md (registry + manifest usage)
+python -m scripts.pipeline.docs snapshot # just snapshot.json (per-slug schema + sizes)
 ```
 
-Writes land in `docs/*.md` (gitignored scratch). To promote a snapshot to the tracked canonical path, copy to `docs/v{schema_version}/`.
+Writes land in `docs/{datasets.md, handlers.md, snapshot.json}` (gitignored scratch). To promote, copy to the tracked `docs/v{schema_version}/`.
 
 Regenerate **after** any of: build, convert run, in-place tightening, manifest edit that changes short_name / license / description / family / expect.rows. Skip if the change doesn't affect the catalog or the handler registry.
 
+**`snapshot.json` is the load-bearing fallback** — `datasets.md` regen reads it for any slug whose parquet isn't on disk locally (otherwise the row would dash out the row count, sizes, and column-derived "Data Kind" tag). The no-args form keeps snapshot + datasets in lockstep; if you do a partial regen with `docs.py datasets`, run `docs.py snapshot` first (or just use the no-args form) so the table doesn't drift.
+
 The other catalog views (columns, coverage, vortex-skip, hydration candidates) are no longer markdown — query them via `python -m scripts.pipeline.list_datasets --columns / --coverage / --no-vortex / --hydrate` or interactively in the TUI (`python -m scripts.pipeline.browse`).
 
 ## Removing a dataset

scripts/pipeline/docs.py

Lines changed: 63 additions & 12 deletions
@@ -10,6 +10,13 @@
 built state. Read by the TUI as a fallback when a
 local parquet isn't built, so the columns / types
 modals can still show *expected* contents.
+ALSO read by `generate_datasets_md` below as the
+fallback for row count / row-group count / file
+sizes when a slug's parquet isn't present locally
+— without it, regen by a maintainer who hasn't
+built every slug would dash-out the whole table.
+Keep snapshot.json regenerated whenever a new
+slug lands or a build's row count / size changes.
 
 Per-column / per-coverage / vortex-skip / hydrated detail used to live as
 markdown too, but the rendering was unscannable and duplicated state
@@ -68,8 +75,12 @@ def _generation_header(kind: str) -> str:
 }
 
 
-def _data_kind(spec: dict, pq_schema=None) -> str:
-    """Best-effort inference of the 'Data Kind' label."""
+def _data_kind(spec: dict, column_names: set[str] | None = None) -> str:
+    """Best-effort inference of the 'Data Kind' label.
+
+    `column_names` may come from a live parquet schema OR from the snapshot
+    fallback — both cases need to recognise the `content` blob convention.
+    """
     family = spec.get("family", "")
     if family in _KIND_BY_FAMILY:
         return _KIND_BY_FAMILY[family]
@@ -93,14 +104,39 @@ def _data_kind(spec: dict, pq_schema=None) -> str:
         base = "Custom"
     else:
         base = "Tabular (CSV)"
-    # Blob column bumps the label
-    if pq_schema is not None:
-        names = {f.name for f in pq_schema}
-        if "content" in names:
-            base = f"{base.split(' (')[0]} + Blobs"
+    if column_names and "content" in column_names:
+        base = f"{base.split(' (')[0]} + Blobs"
     return base
 
 
+def _load_snapshot_slugs(schema_version: int | None = None) -> dict[str, dict]:
+    """Return the `slugs` mapping from the on-disk snapshot, or `{}`.
+
+    Used by `generate_datasets_md` to fall back to the last-known row count
+    / sizes when a slug's parquet isn't present locally. Tries:
+
+      1. `docs/snapshot.json` (gitignored scratch — wins
+         if a maintainer regenerated locally)
+      2. `docs/v{schema_version}/snapshot.json` (tracked canonical — what
+         a fresh clone has)
+
+    Returns `{}` on a missing or malformed snapshot — callers degrade to
+    the dash placeholder.
+    """
+    import json
+    candidates = [SNAPSHOT_JSON]
+    if schema_version is not None:
+        candidates.append(REPO_ROOT / "docs" / f"v{schema_version}" / "snapshot.json")
+    for path in candidates:
+        if not path.exists():
+            continue
+        try:
+            return json.loads(path.read_text()).get("slugs", {})
+        except (json.JSONDecodeError, OSError):
+            continue
+    return {}
+
+
 def _size_label(bytes_: int | None) -> str:
     if bytes_ is None:
         return "—"
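The blob-label change in `_data_kind` hinges on two small behaviours: the parenthesised suffix is dropped before `+ Blobs` is appended, and a `None` or empty column set leaves the label alone. A minimal sketch of just that step, using a hypothetical helper name `bump_for_blobs` that does not exist in the repository:

```python
from __future__ import annotations


def bump_for_blobs(base: str, column_names: set[str] | None = None) -> str:
    """Swap any parenthesised suffix on `base` for '+ Blobs' when a `content` column exists."""
    # `None` and an empty set are both falsy, so "no schema known" and
    # "schema with no columns" leave the label unchanged.
    if column_names and "content" in column_names:
        return f"{base.split(' (')[0]} + Blobs"
    return base


print(bump_for_blobs("Tabular (CSV)", {"id", "content"}))  # Tabular + Blobs
print(bump_for_blobs("Tabular (CSV)", {"id"}))             # Tabular (CSV)
print(bump_for_blobs("Custom", {"content"}))               # Custom + Blobs
print(bump_for_blobs("Tabular (CSV)", None))               # Tabular (CSV)
```

Taking a plain set of names instead of a parquet schema is what lets the same check run against column names recovered from the snapshot.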
@@ -112,24 +148,37 @@ def _size_label(bytes_: int | None) -> str:
 
 def generate_datasets_md():
     manifest = load_manifest()
+    snapshot_slugs = _load_snapshot_slugs(manifest.get("schema_version"))
     rows = []
     advisories: list[tuple[str, str, str]] = []  # (slug, short_name, advisory text)
     for spec in manifest["datasets"]:
         slug = spec["slug"]
         parquet = prepared_parquet(slug)
+        snap = snapshot_slugs.get(slug, {})
         if parquet.exists():
             meta = pq.ParquetFile(parquet).metadata
             schema = pq.ParquetFile(parquet).schema_arrow
             row_count = f"{meta.num_rows:,}"
             row_groups = f"{meta.num_row_groups:,}"
             parquet_size = _size_label(parquet.stat().st_size)
-            kind = _data_kind(spec, schema)
+            kind = _data_kind(spec, column_names={f.name for f in schema})
         else:
-            row_count = row_groups = parquet_size = "—"
-            kind = _data_kind(spec)
+            # Fall back to the last-known snapshot entry so partial-build
+            # maintainers don't dash-out everything they haven't built locally.
+            r = snap.get("last_built_rows")
+            rg = snap.get("last_built_row_groups")
+            row_count = f"{r:,}" if isinstance(r, int) else "—"
+            row_groups = f"{rg:,}" if isinstance(rg, int) else "—"
+            parquet_size = _size_label(snap.get("parquet_bytes"))
+            cols = snap.get("columns") or []
+            names = {c["name"] for c in cols if isinstance(c, dict) and "name" in c}
+            kind = _data_kind(spec, column_names=names or None)
 
         vortex = prepared_vortex(slug)
-        vortex_size = _size_label(vortex.stat().st_size) if vortex.exists() else "—"
+        if vortex.exists():
+            vortex_size = _size_label(vortex.stat().st_size)
+        else:
+            vortex_size = _size_label(snap.get("vortex_bytes"))
 
         short = spec["short_name"]
         advisory = spec_field(spec, "license.scrape_advisory")
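To see how the fallback branch degrades field by field, the cell formatting can be pulled out and run against sample snapshot entries. This is a sketch under assumptions: `fallback_cells` is a hypothetical helper, `size_label` only mirrors the `None` case visible in the diff (its unit and precision are invented here), and the sample entries are made up.

```python
from __future__ import annotations

DASH = "—"  # placeholder used in the table when a value is unknown


def size_label(bytes_: int | None) -> str:
    """Hypothetical stand-in for `_size_label`; only the None case comes from the diff."""
    if bytes_ is None:
        return DASH
    return f"{bytes_ / 1_000_000:.1f} MB"  # assumed unit and precision


def fallback_cells(snap: dict) -> tuple[str, str, str, str]:
    """Render (rows, row groups, parquet size, vortex size) from a snapshot entry."""
    r = snap.get("last_built_rows")
    rg = snap.get("last_built_row_groups")
    row_count = f"{r:,}" if isinstance(r, int) else DASH
    row_groups = f"{rg:,}" if isinstance(rg, int) else DASH
    return (
        row_count,
        row_groups,
        size_label(snap.get("parquet_bytes")),
        size_label(snap.get("vortex_bytes")),
    )


full = fallback_cells(
    {
        "last_built_rows": 1234567,
        "last_built_row_groups": 12,
        "parquet_bytes": 52_400_000,
        "vortex_bytes": None,
    }
)
older = fallback_cells({"last_built_rows": 10})  # snapshot written before row groups were captured
missing = fallback_cells({})                     # slug absent from every snapshot
print(full, older, missing)
```

Each cell falls back independently, so a snapshot entry that predates `last_built_row_groups` still fills the row count and only the row-groups cell shows the dash.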
@@ -342,7 +391,8 @@ def generate_snapshot(*, overwrite_missing: bool = False):
         expected_rows = spec_field(spec, "expect.rows")
         fresh: dict = {
             "expected_rows": expected_rows,
-            "last_built_rows": None,  # populated below from parquet metadata
+            "last_built_rows": None,  # populated below from parquet metadata
+            "last_built_row_groups": None,  # populated below from parquet metadata
             "parquet_bytes": parquet.stat().st_size if parquet.exists() else None,
             "vortex_bytes": vortex.stat().st_size if vortex.exists() else None,
             "columns": None,  # populated below when schema is readable
@@ -358,6 +408,7 @@ def generate_snapshot(*, overwrite_missing: bool = False):
                     {"name": f.name, "type": str(f.type)} for f in pf.schema_arrow
                 ]
                 fresh["last_built_rows"] = int(pf.metadata.num_rows)
+                fresh["last_built_row_groups"] = int(pf.metadata.num_row_groups)
                 n_with_schema += 1
             except Exception as e:
                 fresh["columns_error"] = f"{type(e).__name__}: {str(e)[:120]}"
