docs: fall back to snapshot.json when a slug's parquet is missing
`generate_datasets_md` previously read row count, row groups, and file
sizes only from disk: any slug without a local parquet rendered as four
dashes. With 249 slugs and total build cost in the hundreds of CPU-hours,
no maintainer has every output on hand at any one time — partial-build
regen would silently destroy ground truth in the v1 snapshot.
This commit teaches the regen path to fall back to `docs/snapshot.json`
(the existing TUI fallback) when a slug isn't built locally. Lookup
order:
1. `docs/snapshot.json` — gitignored scratch, wins
2. `docs/v{schema_version}/snapshot.json` — tracked canonical (fresh-clone path)
Also captures `last_built_row_groups` in the snapshot so the fallback
can fill the row-groups column too.
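
A minimal sketch of the lookup order above, assuming each snapshot file is a JSON object keyed by slug (the real layout of `snapshot.json` is not shown in this commit). The function names and the per-slug fall-through, where a slug absent from the scratch file is still looked up in the tracked copy, are illustrative assumptions rather than the code in `docs.py`:

```python
import json
from pathlib import Path
from typing import Optional


def snapshot_candidates(docs_dir: Path, schema_version: int) -> list[Path]:
    """Return snapshot paths in priority order: gitignored scratch, then tracked canonical."""
    return [
        docs_dir / "snapshot.json",
        docs_dir / f"v{schema_version}" / "snapshot.json",
    ]


def lookup_snapshot_entry(slug: str, docs_dir: Path, schema_version: int) -> Optional[dict]:
    """Return the first snapshot entry found for `slug`, or None if no file has one.

    Hypothetical helper: assumes a top-level {slug: {...}} mapping in each file.
    """
    for path in snapshot_candidates(docs_dir, schema_version):
        if not path.is_file():
            continue  # e.g. a fresh clone has no scratch snapshot yet
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # treat an unreadable snapshot as absent
        entry = data.get(slug) if isinstance(data, dict) else None
        if isinstance(entry, dict):
            return entry
    return None
```

On a fresh clone only the `v{schema_version}` file exists, so the second candidate answers every lookup; once a local regen writes `docs/snapshot.json`, that copy takes precedence.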
Surfaces the new "snapshot.json is load-bearing" invariant in agent
tooling: AGENTS.md, SKILLS.md, and the raincloud-docs skill description
all now flag that `docs.py datasets` alone won't refresh the snapshot,
so partial regens should use the no-args form (which already regens
all three artefacts in lockstep).
Tests: 5 new in tests/test_docs.py — disk-present unchanged, two
fallback paths (top-level + v{n}), both-missing dashes, snapshot
captures row groups for built slugs.
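
The three cases the tests cover (parquet on disk, snapshot fallback, neither) can be sketched as a single cell-rendering helper. This is an illustration under stated assumptions: only `last_built_row_groups` is a field name taken from this commit; `rows`, `parquet_bytes`, `vortex_bytes`, the `stats_cells` helper, and the plain hyphen used as the placeholder are hypothetical:

```python
from typing import Optional

DASH = "-"  # placeholder cell; the real table may use a different dash character

# Hypothetical field names, except last_built_row_groups (named in the commit message).
STAT_KEYS = ("rows", "last_built_row_groups", "parquet_bytes", "vortex_bytes")


def stats_cells(disk: Optional[dict], snapshot: Optional[dict]) -> list[str]:
    """Render the four stats cells for one slug.

    Stats read from a local parquet win; otherwise the snapshot entry is used;
    if neither is available, every cell is a dash.
    """
    source = disk if disk is not None else snapshot
    if source is None:
        return [DASH] * len(STAT_KEYS)
    # A field missing from the chosen source still renders as a dash.
    return [str(source[k]) if source.get(k) is not None else DASH for k in STAT_KEYS]
```

With this shape, an older snapshot entry that predates `last_built_row_groups` would dash out only the row-groups cell and keep the other three.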
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
.agents/skills/raincloud-docs/SKILL.md (2 additions, 2 deletions)
@@ -1,6 +1,6 @@
 ---
 name: raincloud-docs
-description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. The snapshot file is what the TUI falls back to for unbuilt-locally slugs in its columns / types modals. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
+description: Regenerate the derived docs (datasets.md, handlers.md, snapshot.json). Use after a build, a manifest edit, or a handler add/remove/rename. snapshot.json is load-bearing — it's the fallback both for the TUI's columns / types modals AND for `datasets.md` regen on slugs not built locally, so partial-build maintainers don't dash-out the table. Other catalog views (columns, coverage, vortex-skip, hydrate candidates) are queryable via `/raincloud-list-datasets` and the TUI.
 - *(no args)* — regenerates all three: `datasets.md`, `handlers.md`, `snapshot.json`.
 - `datasets` — just `docs/datasets.md` (one row per dataset).
 - `handlers` — just `docs/handlers.md` (one row per registered transform handler).
-- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI as a fallback for slugs whose parquet isn't built locally).
+- `snapshot` — just `docs/snapshot.json` (per-slug schema + file sizes; read by the TUI for unbuilt-locally slugs AND by `datasets.md` regen as the row-count / size fallback). Run alongside `datasets` if you ever invoke `datasets` on its own — otherwise the markdown table will drift from the manifest.

 Output lands in **`docs/*.md` / `docs/*.json`** (gitignored scratch). Promote to the tracked canonical path with a manual copy:
AGENTS.md (5 additions, 3 deletions)
@@ -63,7 +63,7 @@ Hydration policy / philosophy lives in the hand-maintained [`HYDRATING.md`](HYDR

 ## What this repo does

-Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json`, read by the TUI as a fallback for unbuilt-locally slugs) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
+Raincloud is a **client-reproducible pipeline** for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json` — read by the TUI as a fallback for unbuilt-locally slugs, AND used by `docs.py` itself as the row-count / file-size fallback when regenerating `datasets.md` on a partial build) is **derived** — regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.

 The pipeline flow is: **fetch → extract → parse → transform → write → validate → convert** (stage 7 opt-in per-spec), orchestrated by `scripts.pipeline.build`.

@@ -122,10 +122,12 @@ Small (<100 MB) parquets are fine to rebuild without asking.
 ## Regenerate derived docs after any pipeline change
-All six derived docs are regenerated in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+All three derived artefacts regenerate in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
+
+**Keep `docs/snapshot.json` fresh — it's load-bearing.** `datasets.md` regen reads from disk for slugs you've built locally and falls back to `docs/snapshot.json` (or `docs/v{schema_version}/snapshot.json` on a fresh clone) for everything else. Without that fallback, regenerating on a partial build would dash-out 200+ rows and silently destroy ground truth in the tracked snapshot. The default no-args invocation regens snapshot + datasets in lockstep, so it's only at risk if you do partial regens — `docs.py datasets` alone won't refresh the snapshot. After a build, prefer the no-args form.
-Writes land in `docs/*.md` (gitignored scratch). To promote a snapshot to the tracked canonical path, copy to `docs/v{schema_version}/`.
+Writes land in `docs/{datasets.md, handlers.md, snapshot.json}` (gitignored scratch). To promote, copy to the tracked `docs/v{schema_version}/`.

 Regenerate **after** any of: build, convert run, in-place tightening, manifest edit that changes short_name / license / description / family / expect.rows. Skip if the change doesn't affect the catalog or the handler registry.

+**`snapshot.json` is the load-bearing fallback** — `datasets.md` regen reads it for any slug whose parquet isn't on disk locally (otherwise the row would dash out the row count, sizes, and column-derived "Data Kind" tag). The no-args form keeps snapshot + datasets in lockstep; if you do a partial regen with `docs.py datasets`, run `docs.py snapshot` first (or just use the no-args form) so the table doesn't drift.
+
 The other catalog views (columns, coverage, vortex-skip, hydration candidates) are no longer markdown — query them via `python -m scripts.pipeline.list_datasets --columns / --coverage / --no-vortex / --hydrate` or interactively in the TUI (`python -m scripts.pipeline.browse`).