From a2c3fe13204aae2d78c36292f7a1961f508d07e3 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Fri, 27 Mar 2026 16:57:33 +0800 Subject: [PATCH 1/6] docs: add maintainer decision layer roadmap --- docs/PLAN.md | 9 + docs/designs/maintainer-decision-layer.md | 248 ++++++++++++++++++++++ 2 files changed, 257 insertions(+) create mode 100644 docs/designs/maintainer-decision-layer.md diff --git a/docs/PLAN.md b/docs/PLAN.md index 41c4308..8f8258a 100644 --- a/docs/PLAN.md +++ b/docs/PLAN.md @@ -117,6 +117,15 @@ Decision note: - [ ] Test on a real or sanitized fixture corpus to inspect false positives and false negatives. - [ ] Testing goal: add golden cluster fixtures proving known related threads end up together. +## Phase 6.5: Maintainer Decision Analysis + +- [ ] Add a reusable decision-analysis layer above clusters and semantic neighbors. +- [ ] Introduce a seed-centric `analyze-pr` style workflow for explicit maintainer decisions. +- [ ] Reuse a shared score/explanation model instead of duplicating logic across CLI, report, and API surfaces. +- [ ] Define explicit first-pass roles such as `best_base`, `same_cluster_candidate`, `superseded_candidate`, and `excluded_neighbor`. +- [ ] Keep the first iteration additive: no cluster-model replacement and no storage redesign required. +- [ ] Track the target shape in `docs/designs/maintainer-decision-layer.md`. + ## Phase 7: API And Future UI - [x] Implement local API endpoints for health, repositories, threads, search, clusters, and rerun actions. 
diff --git a/docs/designs/maintainer-decision-layer.md b/docs/designs/maintainer-decision-layer.md new file mode 100644 index 0000000..6465655 --- /dev/null +++ b/docs/designs/maintainer-decision-layer.md @@ -0,0 +1,248 @@ +# Maintainer Decision Layer Roadmap + +## Context + +`ghcrawl` already does the hard part of local maintainer discovery: + +- GitHub sync into local SQLite state +- canonical thread summaries +- embeddings and semantic neighbors +- deterministic cluster construction +- cluster summaries and detail views + +That is enough to answer: + +- which issues and PRs are about the same problem area +- which threads are near each other semantically + +It is not yet enough to answer stronger maintainer questions such as: + +- which PR is the best base to keep +- which nearby variant is probably superseded +- which neighbor is too weak or too noisy to promote + +This document proposes a clean next layer for those questions without replacing the current cluster model. + +## Problem + +Today the main semantic model is: + +- retrieve nearby threads +- materialize similarity edges +- build connected-component clusters +- expose cluster summaries and cluster detail + +That is the right base, but it leaves a gap between similarity grouping and maintainer action. + +`ghcrawl` can say: + +- these items are related + +It cannot yet say: + +- start review here +- this is the strongest base +- this variant likely lost +- this neighbor is related but should stay excluded + +## Target Shape + +The next architecture step should be a reusable decision-analysis layer above retrieval and clustering. 
+ +```mermaid +flowchart LR + A[Sync / Normalize / Summarize / Embed] --> B[Search / Neighbors / Clusters] + B --> C[Candidate Retrieval Layer] + C --> D[Decision Analysis Layer] + D --> E[Maintainer Outputs] + + C --> C1[cluster members] + C --> C2[semantic neighbors] + C --> C3[path / issue expansion] + + D --> D1[feature extraction] + D --> D2[stage-2 rerank] + D --> D3[role classification] + D --> D4[explanations] + + E --> E1[analyze-pr] + E --> E2[triage report] + E --> E3[API] + E --> E4[future UI] +``` + +The key idea is additive layering: + +- keep the current search and cluster pipeline +- reuse current cluster and neighbor data as candidate recall +- add a second-stage maintainer decision pass +- expose that decision pass through one or more surfaces + +## Layer Responsibilities + +### Candidate Retrieval Layer + +Purpose: + +- collect a bounded candidate set around a seed thread + +Inputs may include: + +- cluster members +- semantic neighbors +- path-overlap candidates +- issue-linked candidates + +This layer should optimize for recall, not for final ranking. + +### Decision Analysis Layer + +Purpose: + +- score and classify the candidate set using maintainer-oriented signals + +Expected signals: + +- linked issue overlap +- changed-path relevance +- companion test relevance +- unrelated churn or noise penalty +- state and recency + +This layer should optimize for maintainer usefulness, not for raw semantic similarity. 
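As a rough illustration of the candidate retrieval step described above, the following TypeScript sketch collects a bounded, deduplicated candidate set around a seed thread and records which retrieval inputs produced each candidate. Every name here (`SourceHit`, `Candidate`, `collectCandidates`, the source labels, and the `limit` parameter) is hypothetical and is not part of `ghcrawl` today; the sketch only assumes that each retrieval input can be reduced to a list of thread ids.

```typescript
// Hypothetical labels for the four retrieval inputs listed above.
type RetrievalSource =
  | "cluster_member"
  | "semantic_neighbor"
  | "path_overlap"
  | "issue_linked";

// One raw hit from one retrieval input (assumed shape).
interface SourceHit {
  threadId: number;
  source: RetrievalSource;
}

// One deduplicated candidate, remembering every input that surfaced it.
interface Candidate {
  threadId: number;
  sources: RetrievalSource[];
}

// Collect at most `limit` distinct candidates around a seed thread.
// Recall-oriented: nothing is ranked here, hits are only merged and bounded.
function collectCandidates(
  seedThreadId: number,
  hits: SourceHit[],
  limit: number
): Candidate[] {
  const byThread: { [threadId: number]: Candidate } = {};
  const ordered: Candidate[] = [];
  for (const hit of hits) {
    if (hit.threadId === seedThreadId) continue; // the seed is not its own candidate
    const existing = byThread[hit.threadId];
    if (existing === undefined) {
      if (ordered.length >= limit) continue; // keep the set bounded
      const created: Candidate = { threadId: hit.threadId, sources: [hit.source] };
      byThread[hit.threadId] = created;
      ordered.push(created);
    } else if (existing.sources.indexOf(hit.source) === -1) {
      existing.sources.push(hit.source); // same thread found through another input
    }
  }
  return ordered;
}
```

Keeping this step free of scoring means the later decision pass can be tuned or replaced without changing how candidates are gathered.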
+ +### Explanation Layer + +Purpose: + +- make the result auditable and operationally safe + +Expected outputs: + +- score breakdown by signal +- short explanation text +- reason codes +- decision trace + +### Presentation Layer + +Purpose: + +- reuse the same analyzer core in different maintainer surfaces + +Consumers should include: + +- `ghcrawl analyze-pr` +- triage report generation +- local API responses +- future TUI or web views + +## Proposed Initial Roles + +The first decision-aware outputs should stay explicit and narrow: + +- `best_base` +- `same_cluster_candidate` +- `superseded_candidate` +- `excluded_neighbor` + +These roles are intentionally stronger than "same cluster" but weaker than a fully automated duplicate-close policy. + +## Roadmap + +### Phase 0: Fix Current Contracts + +Make the existing maintainer surfaces safer before adding a new layer. + +- soften over-claiming triage wording +- align report wording with real count semantics +- improve section-aware PR-template heuristics for edited tails +- keep current cluster storage work extensible for future decision metadata + +### Phase 1: Introduce A Reusable Decision Core + +Add a reusable analysis module, not just a command-specific heuristic bundle. + +- seed thread in +- candidate set out of existing retrieval sources +- stage-two scoring +- explicit role classification +- explanation payload + +This phase should not require storage redesign. + +### Phase 2: Add `analyze-pr` + +Expose the decision core through one focused CLI surface first. 
+ +Suggested command: + +```bash +ghcrawl analyze-pr owner/repo --number 123 --json +``` + +Suggested output: + +- chosen best base +- nearby alternatives +- superseded candidates +- excluded neighbors +- score breakdown and explanation metadata + +### Phase 3: Reuse The Core In Triage And API + +Once the core is stable: + +- feed decision outputs into triage reports +- expose decision payloads through the local HTTP API +- let future UI surfaces render the same outputs + +This prevents decision logic from being duplicated in each surface. + +### Phase 4: Attach Decision Artifacts To Run State + +After the decision model is useful and stable: + +- attach decision metadata to cluster snapshots or adjacent run-state tables +- preserve explanation and lineage context across rebuilds +- make it easier to compare how maintainer recommendations evolve over time + +This phase should build on the snapshot/current-view work rather than replace it. + +### Phase 5: Evaluation And Feedback Loop + +Turn the decision layer into a measured subsystem instead of a one-off feature. + +- build a small labeled maintainer corpus +- add regression fixtures for best-base and superseded classification +- track false positives and false negatives +- tune thresholds and explanations using real maintainer review cases + +## Relationship To Current Work + +This roadmap is designed to fit current open work rather than compete with it. + +- Cluster storage and lineage work remains the persistence foundation. +- Triage report work remains the reporting surface. +- PR-template heuristic work remains a deterministic maintainer signal. + +The decision layer sits above those efforts and gives them a cleaner long-term destination. 
+ +## Non-Goals + +- do not replace connected-component clustering +- do not redesign snapshot storage in the first iteration +- do not change embedding backends as part of this roadmap +- do not claim mathematically perfect duplicate adjudication +- do not force all maintainer logic into one giant command or report + +## Why This Is Better + +This roadmap keeps the architecture clean: + +- retrieval stays retrieval +- decision logic stays reusable +- explanations stay first-class +- output surfaces stay thin + +That gives `ghcrawl` a credible path from semantic grouping to maintainer decision support without turning every new feature into a special-case heuristic branch. From 177926e391b37e66ff2236ead06813d988e6e327 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Fri, 27 Mar 2026 17:17:45 +0800 Subject: [PATCH 2/6] docs: add roadmap diagrams --- docs/designs/maintainer-decision-layer.md | 86 +++++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/docs/designs/maintainer-decision-layer.md b/docs/designs/maintainer-decision-layer.md index 6465655..52c9792 100644 --- a/docs/designs/maintainer-decision-layer.md +++ b/docs/designs/maintainer-decision-layer.md @@ -45,6 +45,37 @@ It cannot yet say: - this variant likely lost - this neighbor is related but should stay excluded +## Current State vs Target State + +The architectural gap is small but important. 
+ +Today, `ghcrawl` mainly stops at semantic grouping: + +```mermaid +flowchart LR + A[Sync / Summaries / Embeddings] --> B[Neighbors / Clusters] + B --> C[Similarity grouping] + C --> D[Cluster summary and detail] +``` + +The proposed target keeps that pipeline and adds one reusable decision layer above it: + +```mermaid +flowchart LR + A[Sync / Summaries / Embeddings] --> B[Neighbors / Clusters] + B --> C[Candidate retrieval] + C --> D[Decision analysis] + D --> E[Maintainer-facing outputs] + + E --> E1[best_base] + E --> E2[superseded_candidate] + E --> E3[excluded_neighbor] + E --> E4[decision trace] + + style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px + style E fill:#fff3e0,stroke:#ef6c00,stroke-width:2px +``` + ## Target Shape The next architecture step should be a reusable decision-analysis layer above retrieval and clustering. @@ -137,6 +168,31 @@ Consumers should include: - local API responses - future TUI or web views +## Current Work Fit + +This proposal is designed to give the current open work a clearer long-term shape rather than competing with it. + +```mermaid +flowchart TD + A[PR #29
snapshot and current-view foundation] + B[PR #20
triage report surface] + C[PR #14
deterministic PR-template heuristic] + D[Maintainer decision layer] + E[Future maintainer outputs] + + A --> D + B --> D + C --> D + D --> E + + E --> E1[analyze-pr] + E --> E2[triage suggestions] + E --> E3[API payloads] + E --> E4[future UI] + + style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px +``` + ## Proposed Initial Roles The first decision-aware outputs should stay explicit and narrow: @@ -218,6 +274,36 @@ Turn the decision layer into a measured subsystem instead of a one-off feature. - track false positives and false negatives - tune thresholds and explanations using real maintainer review cases +## Delivery Sequence + +The intended delivery path is incremental rather than rewrite-oriented. + +```mermaid +flowchart TD + P0[Phase 0
fix current contracts] + P1[Phase 1
reusable decision core] + P2[Phase 2
analyze-pr] + P3[Phase 3
triage and API reuse] + P4[Phase 4
decision artifacts on run state] + P5[Phase 5
evaluation loop] + + P0 --> P1 --> P2 --> P3 --> P4 --> P5 + + P0 --> P0a[soften triage claims] + P0 --> P0b[improve template heuristics] + P0 --> P0c[keep storage extensible] + + P1 --> P1a[candidate retrieval] + P1 --> P1b[stage-2 rerank] + P1 --> P1c[role classification] + P1 --> P1d[explanations] + + P2 --> P2a[JSON-first CLI output] + P3 --> P3a[shared API payloads] + P4 --> P4a[snapshot annotations] + P5 --> P5a[labeled maintainer fixtures] +``` + ## Relationship To Current Work This roadmap is designed to fit current open work rather than compete with it. From 6bd34405e3238dccef6922115e2633f34667d5c1 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Fri, 27 Mar 2026 17:21:05 +0800 Subject: [PATCH 3/6] docs: add decision scoring proposal --- docs/designs/maintainer-decision-layer.md | 68 +++++++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/docs/designs/maintainer-decision-layer.md b/docs/designs/maintainer-decision-layer.md index 52c9792..8ef2bd4 100644 --- a/docs/designs/maintainer-decision-layer.md +++ b/docs/designs/maintainer-decision-layer.md @@ -142,6 +142,74 @@ Expected signals: This layer should optimize for maintainer usefulness, not for raw semantic similarity. +## Initial Scoring Proposal + +This is the part that should carry the main maintainer-facing value. + +The first iteration should ship with an explicit score model instead of hiding the decision logic behind vague heuristics. The exact weights can be tuned later, but the shape should be clear from the start. 
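A minimal sketch of what an explicit, auditable score model could look like in TypeScript. It assumes every signal has already been normalized to `[0, 1]`, uses the v1 starting coefficients suggested in this proposal as placeholder weights, and returns a per-signal breakdown alongside the total. The names `DecisionSignals`, `ScoreBreakdown`, `V1_WEIGHTS`, and `scoreCandidate` are hypothetical and do not exist in `ghcrawl`.

```typescript
// Hypothetical signal bundle; each field is assumed to be pre-normalized to [0, 1].
interface DecisionSignals {
  semanticSimilarity: number;
  linkedIssueOverlap: number;
  pathRelevance: number;
  companionTestRelevance: number;
  stateRecencyBonus: number;
  unrelatedChurnPenalty: number;
}

// Total score plus the signed contribution of each signal, for explanations.
interface ScoreBreakdown {
  total: number;
  contributions: { [signal: string]: number };
}

// Placeholder weights taken from the suggested v1 score shape; expected to be retuned.
const V1_WEIGHTS: DecisionSignals = {
  semanticSimilarity: 0.35,
  linkedIssueOverlap: 0.2,
  pathRelevance: 0.2,
  companionTestRelevance: 0.15,
  stateRecencyBonus: 0.1,
  unrelatedChurnPenalty: 0.2,
};

// Defensive clamp so an out-of-range input cannot dominate the score.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

function scoreCandidate(signals: DecisionSignals): ScoreBreakdown {
  const w = V1_WEIGHTS;
  const contributions: { [signal: string]: number } = {
    semanticSimilarity: w.semanticSimilarity * clamp01(signals.semanticSimilarity),
    linkedIssueOverlap: w.linkedIssueOverlap * clamp01(signals.linkedIssueOverlap),
    pathRelevance: w.pathRelevance * clamp01(signals.pathRelevance),
    companionTestRelevance: w.companionTestRelevance * clamp01(signals.companionTestRelevance),
    stateRecencyBonus: w.stateRecencyBonus * clamp01(signals.stateRecencyBonus),
    // The only negative term: noisy, unrelated churn lowers the score.
    unrelatedChurnPenalty: -w.unrelatedChurnPenalty * clamp01(signals.unrelatedChurnPenalty),
  };
  let total = 0;
  for (const key of Object.keys(contributions)) {
    total += contributions[key];
  }
  return { total, contributions };
}
```

Returning the contributions rather than only the total is a deliberate choice in this sketch: it lets a CLI or report surface show why a candidate ranked where it did without recomputing anything.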
+ +```mermaid +flowchart LR + A[semantic similarity] + B[linked issue overlap] + C[path relevance] + D[companion test relevance] + E[state and recency] + F[unrelated churn penalty] + + A --> G[decision score] + B --> G + C --> G + D --> G + E --> G + F --> G + + G --> H[best_base] + G --> I[same_cluster_candidate] + G --> J[superseded_candidate] + G --> K[excluded_neighbor] + + style G fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px +``` + +Suggested v1 score shape: + +```text +decisionScore = + 0.35 * semanticSimilarity + + 0.20 * linkedIssueOverlap + + 0.20 * pathRelevance + + 0.15 * companionTestRelevance + + 0.10 * stateRecencyBonus + - 0.20 * unrelatedChurnPenalty +``` + +The exact coefficients above are a starting point, not a claim that tuning is finished. The important part is the composition: + +- semantic similarity keeps the decision layer grounded in the current retrieval model +- linked issue overlap adds deterministic problem affinity +- path relevance rewards candidates that touch the same implementation area +- companion test relevance rewards candidates that validate the same fix surface +- state and recency give maintainers a slight operational preference +- unrelated churn penalty suppresses broad but noisy neighbors + +One reasonable normalization target for the first iteration is a `[0, 1]` score per positive signal and a `[0, 1]` penalty for unrelated churn. + +### Suggested Role Logic + +The role assignment should stay explicit and rule-driven on top of the score model. 
+ +- `excluded_neighbor` + - semantic candidate is below the minimum affinity threshold, or the noise penalty dominates the score +- `best_base` + - highest non-excluded candidate after score and tie-break review +- `superseded_candidate` + - strong affinity but materially lower score than the best base, especially when coverage or validation is weaker +- `same_cluster_candidate` + - strong enough to keep, but not clearly dominant or clearly superseded + +This makes the decision layer stronger than raw cluster membership while still being auditable. + ### Explanation Layer Purpose: From 06a50d182a563d87e5b66d823e51da0a6d756369 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Fri, 27 Mar 2026 19:26:01 +0800 Subject: [PATCH 4/6] docs: refine decision-layer contract --- docs/PLAN.md | 6 +- docs/designs/maintainer-decision-layer.md | 119 ++++++++++++++++++---- 2 files changed, 102 insertions(+), 23 deletions(-) diff --git a/docs/PLAN.md b/docs/PLAN.md index 8f8258a..a7a4fb5 100644 --- a/docs/PLAN.md +++ b/docs/PLAN.md @@ -122,8 +122,12 @@ Decision note: - [ ] Add a reusable decision-analysis layer above clusters and semantic neighbors. - [ ] Introduce a seed-centric `analyze-pr` style workflow for explicit maintainer decisions. - [ ] Reuse a shared score/explanation model instead of duplicating logic across CLI, report, and API surfaces. -- [ ] Define explicit first-pass roles such as `best_base`, `same_cluster_candidate`, `superseded_candidate`, and `excluded_neighbor`. +- [ ] Split retrieval provenance from decision role so candidates can say how they were found separately from what decision they received. +- [ ] Define explicit first-pass roles such as `best_base`, `same_cluster_candidate`, `superseded_candidate`, and `excluded_neighbor`, with role eligibility depending on thread kind. +- [ ] Keep the first iteration local-data-only by default and add richer feature providers only after the data is first-class and normalized. 
+- [ ] Add small golden fixtures early so score tuning and role classification have regression protection from the first implementation. - [ ] Keep the first iteration additive: no cluster-model replacement and no storage redesign required. +- [ ] If persisted later, store decision artifacts in adjacent decision tables rather than cluster snapshots. - [ ] Track the target shape in `docs/designs/maintainer-decision-layer.md`. ## Phase 7: API And Future UI diff --git a/docs/designs/maintainer-decision-layer.md b/docs/designs/maintainer-decision-layer.md index 8ef2bd4..41615e4 100644 --- a/docs/designs/maintainer-decision-layer.md +++ b/docs/designs/maintainer-decision-layer.md @@ -126,6 +126,15 @@ Inputs may include: This layer should optimize for recall, not for final ranking. +The candidate set should preserve retrieval provenance explicitly, for example: + +- `cluster_member` +- `semantic_neighbor` +- `path_overlap` +- `issue_linked` + +Those are not decision outcomes. They are evidence about how a candidate entered the set. + ### Decision Analysis Layer Purpose: @@ -146,7 +155,9 @@ This layer should optimize for maintainer usefulness, not for raw semantic simil This is the part that should carry the main maintainer-facing value. -The first iteration should ship with an explicit score model instead of hiding the decision logic behind vague heuristics. The exact weights can be tuned later, but the shape should be clear from the start. +The first iteration should ship with an explicit score model instead of hiding the decision logic behind vague heuristics. The roadmap should be concrete about feature families and monotonic direction, but exact coefficients should live in code or config plus fixture-based evaluation. + +There is already an existence proof for weighted stage-two scoring in `claw-maintainer-tui`. That implementation uses explicit weighted composition for semantic reranking and related maintainer decision heuristics in a live maintainer workflow. 
This roadmap should treat that as a reference implementation and starting point, rather than presenting one fixed coefficient set as universal truth for `ghcrawl`. ```mermaid flowchart LR @@ -172,19 +183,7 @@ flowchart LR style G fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px ``` -Suggested v1 score shape: - -```text -decisionScore = - 0.35 * semanticSimilarity - + 0.20 * linkedIssueOverlap - + 0.20 * pathRelevance - + 0.15 * companionTestRelevance - + 0.10 * stateRecencyBonus - - 0.20 * unrelatedChurnPenalty -``` - -The exact coefficients above are a starting point, not a claim that tuning is finished. The important part is the composition: +The important part is the composition: - semantic similarity keeps the decision layer grounded in the current retrieval model - linked issue overlap adds deterministic problem affinity @@ -193,20 +192,95 @@ The exact coefficients above are a starting point, not a claim that tuning is fi - state and recency give maintainers a slight operational preference - unrelated churn penalty suppresses broad but noisy neighbors -One reasonable normalization target for the first iteration is a `[0, 1]` score per positive signal and a `[0, 1]` penalty for unrelated churn. +One reasonable starting rule is: + +- normalize each positive feature into a bounded local score +- normalize unrelated churn into a bounded penalty +- keep the exact weighting and thresholds in implementation config +- allow the first `ghcrawl` implementation to start from the `claw-maintainer-tui` weighting profile and then retune against `ghcrawl` fixtures +- tune those values against labeled fixtures before widening the feature set + +## Feature Availability Boundary + +The roadmap should be explicit about what the first iteration can use without hidden live fetches or ad hoc parsing. + +### V1 default + +V1 should be local-data-only by default. 
+ +That means the first implementation should prefer signals that are already cheap and local: + +- semantic similarity +- retrieval provenance such as active-cluster membership or semantic-neighbor membership +- thread kind and thread state +- draft / merged / closed facts when locally available +- updated-at recency +- PR-template residue, if the PR-template heuristic lands + +### Later feature providers + +These are good targets, but should be treated as later feature providers unless they become first-class normalized local signals: + +- rich linked-issue overlap +- changed-path relevance +- companion test relevance +- unrelated-churn metrics derived from structured diff data + +This keeps the analyzer deterministic and stops V1 from quietly becoming a live GitHub fetch workflow. + +## Seed And Candidate Contract + +The core contract should separate retrieval provenance from decision outcome. + +Suggested shape: + +```ts +analyzeSeed(seedThreadId, opts) => { + seed, + activeClusterRunId, + candidates: [ + { + thread, + retrievalSources: ["cluster_member", "semantic_neighbor"], + features: { + semanticSimilarity, + inActiveCluster, + isSemanticNeighbor, + threadKind, + threadState, + updatedAt, + templateResidue, + }, + score, + decisionRole, + reasonCodes, + }, + ], + bestBaseThreadId, +} +``` + +This is the main contract boundary that future CLI, API, triage, and UI surfaces should share. ### Suggested Role Logic The role assignment should stay explicit and rule-driven on top of the score model. 
-- `excluded_neighbor` - - semantic candidate is below the minimum affinity threshold, or the noise penalty dominates the score +- retrieval provenance and decision role should remain separate fields +- role eligibility should depend on thread kind + - `best_base` - - highest non-excluded candidate after score and tie-break review + - only valid for pull request candidates + - highest non-excluded pull request after score and tie-break review - `superseded_candidate` + - only valid for pull request candidates - strong affinity but materially lower score than the best base, especially when coverage or validation is weaker - `same_cluster_candidate` - - strong enough to keep, but not clearly dominant or clearly superseded + - valid for retained neighbors, but should not mean “won the decision” +- `excluded_neighbor` + - semantic candidate is below the minimum affinity threshold, or the noise penalty dominates the score + +Issues may still appear as candidates and evidence, but they should not compete for `best_base`. This makes the decision layer stronger than raw cluster membership while still being auditable. @@ -292,8 +366,9 @@ Add a reusable analysis module, not just a command-specific heuristic bundle. - stage-two scoring - explicit role classification - explanation payload +- a tiny labeled fixture set for regression protection -This phase should not require storage redesign. +This phase should not require storage redesign and should stay local-data-only by default. ### Phase 2: Add `analyze-pr` @@ -327,11 +402,11 @@ This prevents decision logic from being duplicated in each surface. 
After the decision model is useful and stable: -- attach decision metadata to cluster snapshots or adjacent run-state tables +- persist results in adjacent `decision_runs` or `decision_candidates` style tables - preserve explanation and lineage context across rebuilds - make it easier to compare how maintainer recommendations evolve over time -This phase should build on the snapshot/current-view work rather than replace it. +This phase should build on the snapshot/current-view work rather than replace it, and should stay decoupled from cluster snapshots themselves. ### Phase 5: Evaluation And Feedback Loop From fd0416c717ba825395953992b638b5e20bbd6519 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Fri, 27 Mar 2026 19:36:03 +0800 Subject: [PATCH 5/6] docs: clarify roadmap boundaries and diagram contrast --- docs/PLAN.md | 2 +- docs/designs/maintainer-decision-layer.md | 23 +++++++++++++++-------- 2 files changed, 16 insertions(+), 9 deletions(-) diff --git a/docs/PLAN.md b/docs/PLAN.md index a7a4fb5..ecbad61 100644 --- a/docs/PLAN.md +++ b/docs/PLAN.md @@ -124,7 +124,7 @@ Decision note: - [ ] Reuse a shared score/explanation model instead of duplicating logic across CLI, report, and API surfaces. - [ ] Split retrieval provenance from decision role so candidates can say how they were found separately from what decision they received. - [ ] Define explicit first-pass roles such as `best_base`, `same_cluster_candidate`, `superseded_candidate`, and `excluded_neighbor`, with role eligibility depending on thread kind. -- [ ] Keep the first iteration local-data-only by default and add richer feature providers only after the data is first-class and normalized. +- [ ] Keep the first iteration read-only against the latest local snapshot, with freshness coming from the existing explicit `refresh` or `sync -> embed -> cluster` pipeline rather than hidden live fetches during scoring. 
- [ ] Add small golden fixtures early so score tuning and role classification have regression protection from the first implementation. - [ ] Keep the first iteration additive: no cluster-model replacement and no storage redesign required. - [ ] If persisted later, store decision artifacts in adjacent decision tables rather than cluster snapshots. diff --git a/docs/designs/maintainer-decision-layer.md b/docs/designs/maintainer-decision-layer.md index 41615e4..4612b1e 100644 --- a/docs/designs/maintainer-decision-layer.md +++ b/docs/designs/maintainer-decision-layer.md @@ -72,8 +72,8 @@ flowchart LR E --> E3[excluded_neighbor] E --> E4[decision trace] - style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px - style E fill:#fff3e0,stroke:#ef6c00,stroke-width:2px + style D fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff + style E fill:#3a2812,stroke:#fbbf24,stroke-width:2px,color:#ffffff ``` ## Target Shape @@ -180,7 +180,7 @@ flowchart LR G --> J[superseded_candidate] G --> K[excluded_neighbor] - style G fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px + style G fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff ``` The important part is the composition: @@ -204,9 +204,16 @@ One reasonable starting rule is: The roadmap should be explicit about what the first iteration can use without hidden live fetches or ad hoc parsing. -### V1 default +### V1 boundary -V1 should be local-data-only by default. +V1 analysis should operate on the latest local repository snapshot by default. + +Freshness should come from the existing explicit maintenance pipeline: + +- `refresh` +- or `sync` -> `embed` -> `cluster` + +The decision layer itself should stay read-only against that local state. Scoring should not perform hidden live GitHub or OpenAI fetches. 
That means the first implementation should prefer signals that are already cheap and local: @@ -226,7 +233,7 @@ These are good targets, but should be treated as later feature providers unless - companion test relevance - unrelated-churn metrics derived from structured diff data -This keeps the analyzer deterministic and stops V1 from quietly becoming a live GitHub fetch workflow. +This keeps the analyzer deterministic, preserves a clear operator contract for freshness, and stops V1 from quietly becoming a live GitHub fetch workflow. ## Seed And Candidate Contract @@ -332,7 +339,7 @@ flowchart TD E --> E3[API payloads] E --> E4[future UI] - style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px + style D fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff ``` ## Proposed Initial Roles @@ -368,7 +375,7 @@ Add a reusable analysis module, not just a command-specific heuristic bundle. - explanation payload - a tiny labeled fixture set for regression protection -This phase should not require storage redesign and should stay local-data-only by default. +This phase should not require storage redesign and should analyze the latest local snapshot without introducing hidden live fetches into the scoring path. ### Phase 2: Add `analyze-pr` From 23618332907f03943d7ed13eaa4eaa80b7a0ac51 Mon Sep 17 00:00:00 2001 From: Frank Yang Date: Sun, 29 Mar 2026 15:00:30 +0800 Subject: [PATCH 6/6] docs: add durable family identity design --- docs/designs/durable-family-identity.md | 320 ++++++++++++++++++++++++ 1 file changed, 320 insertions(+) create mode 100644 docs/designs/durable-family-identity.md diff --git a/docs/designs/durable-family-identity.md b/docs/designs/durable-family-identity.md new file mode 100644 index 0000000..bee5883 --- /dev/null +++ b/docs/designs/durable-family-identity.md @@ -0,0 +1,320 @@ +# Durable Family Identity + +## Purpose + +This document defines the storage and identity model needed to make cluster identity durable across reruns. 
+ +The core problem is simple: + +- cluster rebuilds are currently valid as local grouping runs +- but a run-local `clusters.id` is not a durable maintainer-facing identity +- unchanged families must retain identity across reruns + +This document proposes a stable `family_id` layer above run-local cluster rows. + +## Problem Statement + +A local cluster rebuild produces a fresh set of `clusters` rows on every run. + +That is acceptable for: + +- run-local storage +- transient similarity edges +- rebuild internals + +It is not acceptable for: + +- visible maintainer-facing cluster identity +- durable links between reruns +- explanations such as "this is the same family as before" + +The visible identity contract must not depend on a newly inserted row id. + +## Design Goal + +The durable identity contract should satisfy all of the following: + +- unchanged families keep the same visible identity across reruns +- changed families retain lineage to the previous visible identity +- unrelated families do not churn when a local change affects only part of the repo +- family identity remains separate from maintainer decision scoring + +## Layer Model + +The architecture should be split into four layers: + +1. retrieval layer +2. family identity layer +3. snapshot and lineage layer +4. maintainer decision layer + +### Retrieval Layer + +Purpose: + +- collect nearby candidates with high recall + +Examples: + +- semantic neighbors +- linked issue overlap +- path overlap +- exact family members from prior state + +This layer should not own durable identity. + +### Family Identity Layer + +Purpose: + +- define the visible maintainer-facing identity for a family + +The visible identity should be `family_id`, not `clusters.id`. + +This is the layer that must remain stable across reruns. 
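One way to keep the two identifiers from being confused in code is to give them distinct types. The TypeScript sketch below is only an illustration under assumed names (`ClusterRowId`, `FamilyId`, `ClusterRow`, `VisibleFamily`, `toVisibleFamily`); it assumes a run-local cluster row carries a reference to its durable family, which this document proposes but does not yet specify at the schema level.

```typescript
// Hypothetical branded types: both wrap primitives, but the compiler will not
// accept one where the other is expected.
type ClusterRowId = number & { readonly __brand: "ClusterRowId" };
type FamilyId = string & { readonly __brand: "FamilyId" };

// Assumed run-local row shape.
interface ClusterRow {
  id: ClusterRowId; // run-local; allowed to change on every rebuild
  familyId: FamilyId; // durable; inherited across reruns
  memberThreadIds: number[];
}

// What a maintainer-facing surface would expose: no run-local id at all.
interface VisibleFamily {
  familyId: FamilyId;
  memberCount: number;
}

function toVisibleFamily(row: ClusterRow): VisibleFamily {
  return { familyId: row.familyId, memberCount: row.memberThreadIds.length };
}

// Two rebuilds of the same family: different row ids, same visible identity.
const firstRun: ClusterRow = {
  id: 101 as ClusterRowId,
  familyId: "fam-7" as FamilyId,
  memberThreadIds: [1, 2, 3],
};
const secondRun: ClusterRow = {
  id: 245 as ClusterRowId,
  familyId: "fam-7" as FamilyId,
  memberThreadIds: [1, 2, 3],
};
console.log(toVisibleFamily(firstRun).familyId === toVisibleFamily(secondRun).familyId);
```

The visible payload deliberately omits `id`, so a UI, API, or report built on it cannot accidentally depend on a row id that changes between runs.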
+ +### Snapshot And Lineage Layer + +Purpose: + +- store per-run snapshots of family membership +- connect current families back to previous families + +This layer answers: + +- unchanged +- updated +- new +- dissolved +- later: split +- later: merged + +### Maintainer Decision Layer + +Purpose: + +- rank and classify family members for maintainer action + +Examples: + +- `best_base` +- `same_family_candidate` +- `superseded_candidate` +- `excluded_neighbor` + +This layer should consume family identity and retrieval evidence. It should not be coupled into the storage identity contract. + +## Identity Contract + +### Run-local ids + +`clusters.id` may remain a run-local row id. + +That id is allowed to change every rebuild. + +It must not be the visible maintainer-facing family identity. + +### Durable visible id + +Introduce `family_id` as the canonical visible identity. + +Rules: + +- `family_id` is created once for a new family +- `family_id` is inherited by unchanged or updated families on later reruns +- UI, API, and reports should prefer `family_id` + +## Initial Family Identity Strategy + +### Linked-issue families + +Linked-issue families can start with a deterministic canonical identity. + +For example: + +- sorted linked issue set +- repository-scoped canonical family key + +This gives a strong and cheap first source of stable identity. + +### Semantic-only families + +Semantic-only families should not use a fresh row id and should not rely only on the current member set as their long-term identity contract. + +Instead: + +- compare current family snapshots against previous family snapshots +- inherit previous `family_id` when the family is judged to be the same continuation +- allocate a new `family_id` only when no suitable previous family matches + +## Storage Proposal + +The exact schema can vary, but the minimum durable model needs: + +### `families` + +One row per durable family identity. 
+ +Suggested fields: + +- `id` +- `repo_id` +- `basis` +- `created_at` +- `retired_at` + +### `family_snapshots` + +One row per family per snapshot run. + +Suggested fields: + +- `id` +- `family_id` +- `repo_id` +- `snapshot_run_id` +- `representative_thread_id` +- `member_thread_ids` +- `member_count` +- `basis` +- `created_at` + +### `family_transitions` + +Lineage and transition classification between reruns. + +Suggested fields: + +- `id` +- `repo_id` +- `snapshot_run_id` +- `family_id` +- `previous_family_id` +- `transition_type` +- `similarity_score` +- `created_at` + +## Transition Semantics + +The first iteration needs only a small set of transitions: + +- `unchanged` +- `updated` +- `new` +- `dissolved` + +Later, if needed: + +- `split` +- `merged` + +The key point is not naming richness. The key point is that every changed family must have an explicit lineage explanation instead of silently changing visible identity. + +## Matching Rule + +The initial matching rule can remain simple: + +- compare current and previous member sets +- compute Jaccard similarity +- greedily match pairs in descending score order, keeping only scores above the threshold + +That is enough for a first durable family implementation. + +The important requirement is what happens after matching: + +- matched current family inherits previous `family_id` +- unmatched current family gets a new `family_id` +- unmatched previous family is marked `dissolved` + +Without the inheritance step, the match result is informational only and does not produce durable visible identity. + +## Maintainer Decision Integration + +Once family identity is durable, the maintainer decision layer should sit on top of it. + +Recommended responsibilities: + +- retrieval determines candidate recall +- family identity determines continuity across reruns +- maintainer scoring determines what action to take + +This prevents three different concepts from being collapsed into one number or one row id. + +## Acceptance Tests + +### 1.
No-op rerun + +Test: + +- run clustering +- rerun clustering on the same DB with the same data + +Expected result: + +- unchanged family membership: `100%` +- unchanged visible `family_id` retention: `100%` + +### 2. Incremental rerun + +Test: + +- run clustering on snapshot A +- sync to snapshot B in the same DB +- rerun clustering + +Expected result: + +- unchanged families keep `family_id` +- unrelated family churn is near zero +- changed families receive explicit lineage transitions + +### 3. Synthetic fixture coverage + +Fixture scenarios: + +- unchanged family +- add one member to family +- split one family into two +- merge two families into one + +Expected result: + +- transition types are correct +- `family_id` inheritance is correct + +## Success Criteria + +The durable identity implementation is ready only if all of the following are true: + +- no-op rerun keeps visible family identity stable +- incremental rerun preserves identity for unchanged families +- transition lineage is explicit and machine-readable +- maintainer-facing outputs no longer depend on run-local `clusters.id` + +## Non-Goals + +- not a replacement for the retrieval pipeline +- not a replacement for cluster scoring or maintainer ranking +- not a requirement to finalize split and merge semantics in v1 +- not a requirement to redesign embeddings or storage wholesale + +## Recommended Delivery Order + +1. introduce `family_id` +2. stop exposing raw `clusters.id` as the visible family identity +3. persist snapshot lineage +4. pass the no-op rerun test +5. pass the incremental rerun test +6. layer maintainer decision outputs on top + +## Bottom Line + +The durable identity problem is not solved by storing snapshots alone. + +It is solved only when: + +- current families are matched to previous families +- matched families inherit the same visible identity +- unchanged families no longer churn on rerun + +That is the minimum contract required for a durable family implementation.
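
The matching and inheritance contract summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the `Transition` type, the `assign_family_ids` function, the integer id allocator, and the `0.5` threshold are all hypothetical, and a score equal to the threshold is treated as a match.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Transition:
    """Loosely mirrors the suggested `family_transitions` fields (assumed shape)."""
    family_id: int
    previous_family_id: Optional[int]
    transition_type: str
    similarity_score: Optional[float]


def jaccard(a, b):
    """Jaccard similarity of two member sets; two empty sets score 0.0 here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_family_ids(previous, current, new_ids, threshold=0.5):
    """Greedy highest-score-first matching with family_id inheritance.

    previous: dict of family_id -> frozenset of thread ids (last snapshot)
    current:  list of frozensets of thread ids (this run's clusters)
    new_ids:  iterator yielding unused family ids (assumed allocator)
    Returns (family id per current index, list of Transition records).
    """
    scored = sorted(
        (
            (jaccard(members, prev_members), prev_id, idx)
            for idx, members in enumerate(current)
            for prev_id, prev_members in previous.items()
        ),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    assigned, used_previous, transitions = {}, set(), []
    for score, prev_id, idx in scored:
        if score < threshold:
            break  # remaining pairs score lower still
        if idx in assigned or prev_id in used_previous:
            continue  # each family takes part in at most one match
        assigned[idx] = prev_id  # the inheritance step
        used_previous.add(prev_id)
        kind = "unchanged" if current[idx] == previous[prev_id] else "updated"
        transitions.append(Transition(prev_id, prev_id, kind, score))
    for idx in range(len(current)):
        if idx not in assigned:
            assigned[idx] = next(new_ids)
            transitions.append(Transition(assigned[idx], None, "new", None))
    for prev_id in previous:
        if prev_id not in used_previous:
            # Assumption: a dissolved row records the retired id in both columns.
            transitions.append(Transition(prev_id, prev_id, "dissolved", None))
    return assigned, transitions
```

Calling `assign_family_ids(previous, current, count(next_free_id))` twice on identical membership would return the same ids both times, which is the property the no-op rerun test checks. Split and merge detection is deliberately left out, in line with the v1 non-goals.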