feat(cluster): add snapshot-then-merge architecture by mvanhorn · Pull Request #29 · pwrdrvr/ghcrawl

mvanhorn · 2026-03-20T22:47:12Z

Summary

Implements the snapshot-then-merge cluster storage model from the design doc. Each cluster rebuild now stores a frozen snapshot of cluster membership, and a repo_cluster_state pointer table tracks active/previous runs per repo.

Why this matters

Follow-up to the PR #19 discussion where @huntharo confirmed this direction. The current prune logic deletes all but the current run on every rebuild (service.ts:pruneOldClusterRuns), which is destructive and prevents run-to-run comparison once the lineage tracking from PR #19 lands. The design doc (docs/designs/cluster-storage-cleanup.md) already proposed repo_cluster_state and an append-only model - this PR implements it.

Changes

New tables (db/migrate.ts):

repo_cluster_state - per-repo pointer to active and previous cluster runs
cluster_snapshots - frozen cluster membership per run (JSON array of thread IDs)
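
As a rough illustration of the two tables described above, the sketch below shows one possible shape for the DDL and the JSON membership encoding. The column names and helper functions are assumptions made for this example; they are not taken from db/migrate.ts.

```typescript
// Hypothetical DDL: table names come from the PR, column names are illustrative only.
const snapshotMigrations: string[] = [
  `CREATE TABLE IF NOT EXISTS repo_cluster_state (
     repo_id INTEGER PRIMARY KEY,
     active_run_id INTEGER NOT NULL,
     previous_run_id INTEGER,
     updated_at TEXT NOT NULL
   )`,
  `CREATE TABLE IF NOT EXISTS cluster_snapshots (
     run_id INTEGER NOT NULL,
     cluster_id INTEGER NOT NULL,
     member_thread_ids TEXT NOT NULL, -- JSON array of thread IDs
     PRIMARY KEY (run_id, cluster_id)
   )`,
];

// Membership is stored as JSON text. Sorting first gives a stable encoding,
// so two snapshots with the same members serialize identically.
function encodeMembers(threadIds: number[]): string {
  return JSON.stringify([...threadIds].sort((a, b) => a - b));
}

// Inverse of encodeMembers; assumes the stored text is a JSON array of numbers.
function decodeMembers(json: string): number[] {
  return JSON.parse(json) as number[];
}
```

Keeping one row per (run, cluster) means a run's membership is never rewritten after it is stored, which is what makes the snapshot "frozen".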

New module (cluster/snapshot.ts):

mergeClusterSnapshots() - Jaccard-based comparison of current vs previous snapshots
Greedy matching with configurable threshold (default 0.5)
Produces typed outcomes: updated, new, dissolved
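
The matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in cluster/snapshot.ts: the `Snapshot` and `MergeOutcome` shapes, the `mergeSketch` name, and the choice to treat a similarity equal to the threshold as a match are all assumptions.

```typescript
// Hypothetical shapes; the real types in cluster/snapshot.ts may differ.
type Snapshot = { clusterId: number; members: number[] };

type MergeOutcome =
  | { kind: "updated"; clusterId: number; previousClusterId: number; similarity: number }
  | { kind: "new"; clusterId: number }
  | { kind: "dissolved"; previousClusterId: number };

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 for two empty sets.
function jaccard(a: Set<number>, b: Set<number>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const id of a) if (b.has(id)) shared++;
  return shared / (a.size + b.size - shared);
}

function mergeSketch(current: Snapshot[], previous: Snapshot[], threshold = 0.5): MergeOutcome[] {
  const curSets = current.map((c) => new Set(c.members));
  const prevSets = previous.map((p) => new Set(p.members));

  // Score every current/previous pair and keep those at or above the threshold.
  const candidates: { cur: number; prev: number; similarity: number }[] = [];
  for (let i = 0; i < current.length; i++) {
    for (let j = 0; j < previous.length; j++) {
      const similarity = jaccard(curSets[i], prevSets[j]);
      if (similarity >= threshold) candidates.push({ cur: i, prev: j, similarity });
    }
  }

  // Greedy step: take the best-scoring pair first, then skip anything already matched.
  candidates.sort((x, y) => y.similarity - x.similarity);
  const usedCur = new Set<number>();
  const usedPrev = new Set<number>();
  const outcomes: MergeOutcome[] = [];
  for (const c of candidates) {
    if (usedCur.has(c.cur) || usedPrev.has(c.prev)) continue;
    usedCur.add(c.cur);
    usedPrev.add(c.prev);
    outcomes.push({
      kind: "updated",
      clusterId: current[c.cur].clusterId,
      previousClusterId: previous[c.prev].clusterId,
      similarity: c.similarity,
    });
  }

  // Unmatched current clusters are new; unmatched previous clusters are dissolved.
  current.forEach((c, i) => {
    if (!usedCur.has(i)) outcomes.push({ kind: "new", clusterId: c.clusterId });
  });
  previous.forEach((p, j) => {
    if (!usedPrev.has(j)) outcomes.push({ kind: "dissolved", previousClusterId: p.clusterId });
  });
  return outcomes;
}
```

Greedy matching is simple and deterministic, but it is not guaranteed to find the pairing with the highest total similarity; an optimal assignment would need something like the Hungarian algorithm.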

Service updates (service.ts):

clusterRepository() now calls persistClusterSnapshots() and flipClusterState() after persisting the run
getLatestClusterRun() reads from repo_cluster_state first, falls back to raw query for backward compatibility
listClusters() uses the updated read path
pruneOldClusterRuns() keeps both active and previous runs instead of deleting everything
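
To make the pointer flip, the fallback read path, and the gentler prune concrete, here is a small in-memory sketch. It models the behaviour described above under stated assumptions: the `Store` shape and the function names `flipState`, `latestRunId`, and `pruneRuns` are hypothetical stand-ins, and the real service.ts works against database tables rather than arrays and maps.

```typescript
// Hypothetical in-memory model of cluster runs and repo_cluster_state.
type ClusterRun = { id: number; repoId: number; status: "completed" | "failed" };
type RepoClusterState = { activeRunId: number; previousRunId: number | null };
type Store = { runs: ClusterRun[]; state: Map<number, RepoClusterState> };

// Flip: the old active run becomes previous, the new run becomes active.
function flipState(store: Store, repoId: number, newRunId: number): void {
  const existing = store.state.get(repoId);
  store.state.set(repoId, {
    activeRunId: newRunId,
    previousRunId: existing ? existing.activeRunId : null,
  });
}

// Read path: prefer the state pointer, fall back to the latest completed run
// for repos that were clustered before the pointer table existed.
function latestRunId(store: Store, repoId: number): number | null {
  const pointer = store.state.get(repoId);
  if (pointer) return pointer.activeRunId;
  const completed = store.runs
    .filter((r) => r.repoId === repoId && r.status === "completed")
    .map((r) => r.id);
  return completed.length > 0 ? Math.max(...completed) : null;
}

// Prune: keep the active and previous runs, delete the rest for this repo.
// Assumption: with no pointer recorded, nothing is deleted.
function pruneRuns(store: Store, repoId: number): number[] {
  const pointer = store.state.get(repoId);
  if (!pointer) return [];
  const keep = new Set<number>([pointer.activeRunId]);
  if (pointer.previousRunId !== null) keep.add(pointer.previousRunId);
  const removed = store.runs
    .filter((r) => r.repoId === repoId && !keep.has(r.id))
    .map((r) => r.id);
  store.runs = store.runs.filter((r) => r.repoId !== repoId || keep.has(r.id));
  return removed;
}
```

Keeping exactly two runs per repo bounds storage growth while still leaving a previous snapshot to compare against on the next rebuild.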

Tests (cluster/snapshot.test.ts):

9 tests covering: first run, identical clusters, overlapping, non-overlapping, mixed, below-threshold, custom threshold, greedy matching

Testing

pnpm typecheck passes
All 9 snapshot tests pass (pnpm --filter @ghcrawl/api-core exec tsx --tsconfig tsconfig.test.json --test 'src/cluster/*.test.ts')
Pre-existing config test failures on main are unrelated (verified by running config.test.ts on main)

This contribution was developed with AI assistance (Claude Code).

@huntharo

Store per-run cluster snapshots in a new cluster_snapshots table and track active/previous run pointers in repo_cluster_state. The read path prefers the state pointer over raw "latest completed run" queries. Prune now keeps both active and previous runs instead of deleting all but the current one.

Follow-up to PR pwrdrvr#19 discussion with @huntharo on the cluster lineage tracking design.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

frankekn · 2026-03-27T05:50:36Z

This is a good foundational direction. One thing I would explicitly leave room for: today cluster_snapshots stores frozen member sets and repo_cluster_state flips active/previous pointers, but the actual merge step still looks like a pure helper (mergeClusterSnapshots() / loadClusterSnapshots()) rather than something that is persisted or exposed yet.

Because of that, I would avoid treating snapshot state as permanently “members only”. If you later want lineage and maintainer-facing decision outputs, you will probably want an extension point for derived annotations such as matched previous cluster id, merge/split classification, representative rationale, or other decision metadata, without another storage rewrite.

Not a blocker, just a request to keep the current schema/read path extensible in that direction.

mvanhorn · 2026-03-27T05:58:34Z

Agreed on keeping the extension path open. The cluster_snapshots table and mergeClusterSnapshots() output types (updated/new/dissolved) were designed with exactly that in mind - the typed outcomes from the merge step are the natural place to attach decision metadata (matched previous cluster ID, rationale, confidence scores) without a storage rewrite.

Right now the merge output is consumed and discarded after the flip. When lineage tracking from #19 lands, the plan is to persist those merge outcomes alongside the snapshot - either as additional columns on cluster_snapshots or a lightweight cluster_merge_decisions table. The current schema supports both without breaking changes.
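
One way the second option could look is sketched below: a row shape for a hypothetical cluster_merge_decisions table, with open-ended metadata kept as JSON so new annotations do not require a schema change. Every name here (`MergeDecisionRow`, `DecisionMetadata`, `toDecisionRow`, and the field names) is an assumption for illustration; nothing in this PR defines them.

```typescript
// Hypothetical annotations a later lineage feature might attach to a merge outcome.
type DecisionMetadata = {
  rationale?: string;
  confidence?: number;
  classification?: "merge" | "split";
};

// Hypothetical row for a cluster_merge_decisions table.
type MergeDecisionRow = {
  runId: number;
  outcome: "updated" | "new" | "dissolved";
  clusterId: number | null; // null when the outcome is "dissolved"
  previousClusterId: number | null; // null when the outcome is "new"
  metadataJson: string; // serialized DecisionMetadata
};

// Build a row from a merge outcome; metadata defaults to an empty object.
function toDecisionRow(
  runId: number,
  outcome: MergeDecisionRow["outcome"],
  clusterId: number | null,
  previousClusterId: number | null,
  metadata: DecisionMetadata = {},
): MergeDecisionRow {
  return { runId, outcome, clusterId, previousClusterId, metadataJson: JSON.stringify(metadata) };
}
```

A separate table keeps cluster_snapshots strictly about membership, at the cost of one extra join when reading decisions back alongside a snapshot.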

mvanhorn wants to merge 1 commit into pwrdrvr:main from mvanhorn:osc/feat-cluster-snapshot-merge
