feat(cluster): add snapshot-then-merge architecture #29

Open: mvanhorn wants to merge 1 commit into pwrdrvr:main from mvanhorn:osc/feat-cluster-snapshot-merge

@mvanhorn
Collaborator

Summary

Implements the snapshot-then-merge cluster storage model from the design doc. Each cluster rebuild now stores a frozen snapshot of cluster membership, and a repo_cluster_state pointer table tracks active/previous runs per repo.

Why this matters

Follow-up to the PR #19 discussion where @huntharo confirmed this direction. The current prune logic deletes all but the current run on every rebuild (service.ts:pruneOldClusterRuns), which is destructive and prevents run-to-run comparison once the lineage tracking from PR #19 lands. The design doc (docs/designs/cluster-storage-cleanup.md) already proposed repo_cluster_state and an append-only model - this PR implements it.

Changes

New tables (db/migrate.ts):

  • repo_cluster_state - per-repo pointer to active and previous cluster runs
  • cluster_snapshots - frozen cluster membership per run (JSON array of thread IDs)
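
A minimal sketch of how a frozen membership row could be written and read back, assuming the members are stored as a JSON array of thread IDs as described above. The row shape, field names, and helper names (ClusterSnapshotRow, freezeMembers, readMembers) are hypothetical and not taken from db/migrate.ts.

```typescript
// Hypothetical row shape for cluster_snapshots; the real column names may differ.
type ClusterSnapshotRow = {
  runId: number;
  clusterId: number;
  membersJson: string; // JSON array of thread IDs
};

// Freeze a cluster's membership: de-duplicate, sort for stable output, serialize.
function freezeMembers(runId: number, clusterId: number, threadIds: number[]): ClusterSnapshotRow {
  const unique = Array.from(new Set(threadIds)).sort((a, b) => a - b);
  return { runId, clusterId, membersJson: JSON.stringify(unique) };
}

// Read the frozen membership back, rejecting anything that is not a number array.
function readMembers(row: ClusterSnapshotRow): number[] {
  const parsed: unknown = JSON.parse(row.membersJson);
  if (!Array.isArray(parsed) || !parsed.every((v) => typeof v === "number")) {
    throw new Error(`Malformed snapshot members for cluster ${row.clusterId}`);
  }
  return parsed as number[];
}
```

Sorting before serializing is a design choice in this sketch only; it makes two snapshots with the same members compare equal as strings.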

New module (cluster/snapshot.ts):

  • mergeClusterSnapshots() - Jaccard-based comparison of current vs previous snapshots
  • Greedy matching with configurable threshold (default 0.5)
  • Produces typed outcomes: updated, new, dissolved
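
For reviewers, the matching described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in cluster/snapshot.ts: the type names, the function name mergeSnapshotsSketch, the inclusive threshold comparison, and the highest-similarity-first greedy order are all hypothetical.

```typescript
// Hypothetical shapes; the PR's actual types may differ.
type Snapshot = { clusterId: number; memberThreadIds: number[] };

type MergeOutcome =
  | { kind: "updated"; previousClusterId: number; clusterId: number; similarity: number }
  | { kind: "new"; clusterId: number }
  | { kind: "dissolved"; previousClusterId: number };

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets score 0 in this sketch.
function jaccard(a: number[], b: number[]): number {
  const uniqueA = Array.from(new Set(a));
  const setB = new Set(b);
  const intersection = uniqueA.filter((id) => setB.has(id)).length;
  const union = uniqueA.length + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

function mergeSnapshotsSketch(
  previous: Snapshot[],
  current: Snapshot[],
  threshold = 0.5,
): MergeOutcome[] {
  // Score every previous/current pair and keep those at or above the threshold
  // (treating the threshold as inclusive is an assumption).
  const candidates: { prev: Snapshot; cur: Snapshot; similarity: number }[] = [];
  for (const prev of previous) {
    for (const cur of current) {
      const similarity = jaccard(prev.memberThreadIds, cur.memberThreadIds);
      if (similarity >= threshold) candidates.push({ prev, cur, similarity });
    }
  }

  // Greedy step: take the best-scoring pairs first; each cluster matches at most once.
  candidates.sort((x, y) => y.similarity - x.similarity);
  const usedPrev = new Set<number>();
  const usedCur = new Set<number>();
  const outcomes: MergeOutcome[] = [];
  for (const c of candidates) {
    if (usedPrev.has(c.prev.clusterId) || usedCur.has(c.cur.clusterId)) continue;
    usedPrev.add(c.prev.clusterId);
    usedCur.add(c.cur.clusterId);
    outcomes.push({
      kind: "updated",
      previousClusterId: c.prev.clusterId,
      clusterId: c.cur.clusterId,
      similarity: c.similarity,
    });
  }

  // Unmatched current clusters are new; unmatched previous clusters are dissolved.
  for (const cur of current) {
    if (!usedCur.has(cur.clusterId)) outcomes.push({ kind: "new", clusterId: cur.clusterId });
  }
  for (const prev of previous) {
    if (!usedPrev.has(prev.clusterId)) {
      outcomes.push({ kind: "dissolved", previousClusterId: prev.clusterId });
    }
  }
  return outcomes;
}
```

On a first run, previous is empty, so every current cluster comes out as new.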

Service updates (service.ts):

  • clusterRepository() now calls persistClusterSnapshots() and flipClusterState() after persisting the run
  • getLatestClusterRun() reads from repo_cluster_state first and falls back to the raw "latest completed run" query for backward compatibility
  • listClusters() uses the updated read path
  • pruneOldClusterRuns() keeps both the active and previous runs instead of deleting all but the current run
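
The pointer flip, read-path fallback, and prune behavior listed above can be sketched with an in-memory stand-in for the database. Everything here (the ClusterStateSketch class, its method names, and the field names) is hypothetical and only illustrates the intended behavior, not the code in service.ts.

```typescript
// Hypothetical in-memory model of cluster runs and repo_cluster_state.
type ClusterRun = { id: number; repoId: number; status: "completed" | "failed" };
type RepoClusterState = { activeRunId: number; previousRunId: number | null };

class ClusterStateSketch {
  private runs: ClusterRun[] = [];
  private state = new Map<number, RepoClusterState>();

  addRun(run: ClusterRun): void {
    this.runs.push(run);
  }

  // Flip: the new run becomes active and the old active run becomes previous.
  flip(repoId: number, newRunId: number): void {
    const current = this.state.get(repoId);
    this.state.set(repoId, {
      activeRunId: newRunId,
      previousRunId: current ? current.activeRunId : null,
    });
  }

  // Read path: prefer the state pointer, else fall back to the latest completed run.
  latestRunId(repoId: number): number | null {
    const pointer = this.state.get(repoId);
    if (pointer) return pointer.activeRunId;
    const completed = this.runs.filter((r) => r.repoId === repoId && r.status === "completed");
    return completed.length === 0 ? null : Math.max(...completed.map((r) => r.id));
  }

  // Prune: keep the active and previous runs, drop the rest; returns removed run IDs.
  prune(repoId: number): number[] {
    const pointer = this.state.get(repoId);
    if (!pointer) return []; // nothing to anchor on, so delete nothing in this sketch
    const keep = new Set<number | null>([pointer.activeRunId, pointer.previousRunId]);
    const removed = this.runs
      .filter((r) => r.repoId === repoId && !keep.has(r.id))
      .map((r) => r.id);
    this.runs = this.runs.filter((r) => r.repoId !== repoId || keep.has(r.id));
    return removed;
  }
}
```

Using the highest run ID as "latest" in the fallback is an assumption of this sketch; the real query may order by a completion timestamp instead.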

Tests (cluster/snapshot.test.ts):

  • 9 tests covering: first run, identical clusters, overlapping, non-overlapping, mixed, below-threshold, custom threshold, greedy matching

Testing

  • pnpm typecheck passes
  • All 9 snapshot tests pass (pnpm --filter @ghcrawl/api-core exec tsx --tsconfig tsconfig.test.json --test 'src/cluster/*.test.ts')
  • Pre-existing config test failures on main are unrelated (verified by running config.test.ts on main)

This contribution was developed with AI assistance (Claude Code).

Store per-run cluster snapshots in a new cluster_snapshots table and
track active/previous run pointers in repo_cluster_state. The read
path prefers the state pointer over raw "latest completed run" queries.
Prune now keeps both active and previous runs instead of deleting all
but the current one.

Follow-up to PR pwrdrvr#19 discussion with @huntharo on the cluster lineage
tracking design.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@frankekn

This is a good foundation direction. One thing I would explicitly leave room for: today cluster_snapshots stores frozen member sets and repo_cluster_state flips active/previous pointers, but the actual merge step still looks like a pure helper (mergeClusterSnapshots() / loadClusterSnapshots()) rather than something persisted or exposed yet.

Because of that, I would avoid treating snapshot state as permanently “members only”. If you later want lineage and maintainer-facing decision outputs, you will probably want an extension point for derived annotations such as matched previous cluster id, merge/split classification, representative rationale, or other decision metadata, without another storage rewrite.

Not a blocker, just a request to keep the current schema/read path extensible in that direction.

@mvanhorn
Collaborator Author

Agreed on keeping the extension path open. The cluster_snapshots table and mergeClusterSnapshots() output types (updated/new/dissolved) were designed with exactly that in mind - the typed outcomes from the merge step are the natural place to attach decision metadata (matched previous cluster ID, rationale, confidence scores) without a storage rewrite.

Right now the merge output is consumed and discarded after the flip. When lineage tracking from #19 lands, the plan is to persist those merge outcomes alongside the snapshot - either as additional columns on cluster_snapshots or a lightweight cluster_merge_decisions table. The current schema supports both without breaking changes.
