diff --git a/.gitignore b/.gitignore index ba90e8d..4fca5e8 100644 --- a/.gitignore +++ b/.gitignore @@ -13,3 +13,4 @@ packages/*/src/**/*.d.ts.map **/__pycache__/ **/*.pyc **/.DS_Store +.context/ diff --git a/README.md b/README.md index a8ad294..b409617 100644 --- a/README.md +++ b/README.md @@ -34,6 +34,7 @@ GitHub is required to crawl issue and PR data. OpenAI is required for embeddings ```bash ghcrawl init +ghcrawl configure ghcrawl doctor ghcrawl refresh owner/repo ghcrawl tui owner/repo @@ -44,11 +45,33 @@ ghcrawl tui owner/repo - save plaintext keys in `~/.config/ghcrawl/config.json` - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file -`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI. +`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items only when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI. 
+ +## One-Time Migration + +Upgrading to this release changes the local vector and cluster pipeline: + +- vectors now use a persistent `vectorlite` sidecar index +- the active vector is one vector per open thread +- old multi-row `document_embeddings` are removed after the first successful rebuild + +For an existing repo, the one-time migration command is: + +```bash +ghcrawl refresh owner/repo +``` + +Important notes: + +- `refresh` performs the migration; plain `sync` does not +- with the default `title_original` basis, the migration rebuilds vectors and clusters without running LLM summaries +- if you switch to `title_summary`, `refresh` also runs the summarize step before embedding +- after the first successful migration refresh, ghcrawl removes legacy embeddings, compacts the local DB, and rebuilds clusters from the current vectors ## Typical Commands ```bash +ghcrawl configure ghcrawl doctor ghcrawl refresh owner/repo ghcrawl tui owner/repo @@ -56,7 +79,7 @@ ghcrawl tui owner/repo `refresh`, `sync`, and `embed` call remote services and should be run intentionally. -`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`. +`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant. `clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path. 
@@ -72,6 +95,7 @@ ghcrawl refresh --help For agent-facing and script-facing commands, prefer explicit machine mode: ```bash +ghcrawl configure --json ghcrawl doctor --json ghcrawl threads owner/repo --numbers 42,43,44 --json ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json @@ -118,11 +142,12 @@ If you need tighter control, you can run the three stages yourself: ```bash ghcrawl sync owner/repo # pull the latest open issues and pull requests from GitHub -ghcrawl embed owner/repo # generate or refresh OpenAI embeddings for changed items +ghcrawl summarize owner/repo # optional explicit summary refresh when using title_summary +ghcrawl embed owner/repo # generate or refresh the single active vector per thread ghcrawl cluster owner/repo # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo) ``` -Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you. +Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you. With the default `title_original` basis, `refresh` does not summarize unless you run `summarize` explicitly. 
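The stage ordering described above can be sketched as a small function. This is a minimal illustration only: `planRefreshStages`, `Stage`, and `EmbeddingBasis` are hypothetical names introduced here, not ghcrawl's actual internals. The only behavior assumed is what the README text states, namely that `refresh` adds a summarize stage before embed when the basis is `title_summary` and skips it for `title_original`.

```typescript
// Hypothetical sketch: these names are illustrative, not ghcrawl's internals.
type EmbeddingBasis = 'title_original' | 'title_summary';
type Stage = 'sync' | 'summarize' | 'embed' | 'cluster';

// Return the ordered stages a refresh would run for the given basis.
function planRefreshStages(basis: EmbeddingBasis): Stage[] {
  const stages: Stage[] = ['sync'];
  if (basis === 'title_summary') {
    // Summaries feed the vector text, so they must exist before embedding.
    stages.push('summarize');
  }
  stages.push('embed', 'cluster');
  return stages;
}

console.log(planRefreshStages('title_original').join(' -> '));
console.log(planRefreshStages('title_summary').join(' -> '));
```

Keeping the plan as data (an ordered array) rather than a chain of conditionals makes it easy to log or test the sequence before any remote call is made.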
## Init And Doctor @@ -158,8 +183,29 @@ GitHub token guidance: - local DB path wiring - GitHub token presence, token-shape validation, and a live auth smoke check - OpenAI key presence, key-shape validation, and a live auth smoke check +- `vectorlite` runtime readiness - if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly +## Configure + +Use `configure` to inspect or change the active summary model and embedding basis: + +```bash +ghcrawl configure +ghcrawl configure --summary-model gpt-5.4-mini +ghcrawl configure --embedding-basis title_original +``` + +Current defaults: + +- summary model: `gpt-5-mini` +- embedding basis: `title_original` (`title + original body`) +- vector backend: `vectorlite` + +Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo. + +If you opt into `title_summary`, ghcrawl summarizes before embedding and uses `title + dedupe summary` as the active vector text. On `openclaw/openclaw`, that improved non-solo cluster membership by about 50% versus `title_original`, but it adds OpenAI spend. A first summarize of roughly `18k` open issues and PRs in that repo typically costs about `$15-$30` with `gpt-5-mini`; later refreshes are usually much cheaper because only changed items need summaries. + ### 1Password CLI Example If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly: @@ -214,10 +260,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau ## Cost To Operate -The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings). +The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings). 
On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee. +For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release: + +- `gpt-5-mini`: about **$12 USD** one time +- `gpt-5.4-mini`: about **$30 USD** one time + +`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend. + This screenshot is the reference point for that estimate: ![OpenAI embeddings cost for a 12k-issue repo](./docs/images/openai-embeddings-12k-issue-repo.png) @@ -265,15 +318,16 @@ The agent and build contract for this repo lives in [SPEC.md](./SPEC.md). 
- a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo - `sync` is metadata-only by default - `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context -- `embed` defaults to `text-embedding-3-large` -- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present -- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding +- `embed` defaults to `text-embedding-3-large` with `dimensions=1024` +- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index +- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding +- the default embedding basis is `title + original body`; use `ghcrawl configure --embedding-basis title_summary` if you want to summarize before embedding - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo` - `sync --limit ` is the best smoke-test path on a busy repository - `tui` remembers sort order and min cluster size per repository in the persisted config file - the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them - on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right -- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+` +- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+` ## Responsibility Attestation diff --git a/apps/cli/README.md b/apps/cli/README.md index 4580f5e..80dc101 100644 --- a/apps/cli/README.md +++ b/apps/cli/README.md @@ -36,6 +36,7 @@ GitHub is required to crawl issue and PR data. 
OpenAI is required for embeddings ```bash ghcrawl init +ghcrawl configure ghcrawl doctor ghcrawl refresh owner/repo ghcrawl tui owner/repo @@ -46,11 +47,12 @@ ghcrawl tui owner/repo - save plaintext keys in `~/.config/ghcrawl/config.json` - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file -`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI. +`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI. ## Typical Commands ```bash +ghcrawl configure ghcrawl doctor ghcrawl refresh owner/repo ghcrawl tui owner/repo @@ -58,7 +60,7 @@ ghcrawl tui owner/repo `refresh`, `sync`, and `embed` call remote services and should be run intentionally. -`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`. +`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant. `clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path. 
@@ -74,6 +76,7 @@ ghcrawl refresh --help For agent-facing and script-facing commands, prefer explicit machine mode: ```bash +ghcrawl configure --json ghcrawl doctor --json ghcrawl threads owner/repo --numbers 42,43,44 --json ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json @@ -120,11 +123,12 @@ If you need tighter control, you can run the three stages yourself: ```bash ghcrawl sync owner/repo # pull the latest open issues and pull requests from GitHub -ghcrawl embed owner/repo # generate or refresh OpenAI embeddings for changed items +ghcrawl summarize owner/repo # optional explicit summary refresh when using title_summary +ghcrawl embed owner/repo # generate or refresh the single active vector per thread ghcrawl cluster owner/repo # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo) ``` -Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you. +Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you. 
## Init And Doctor @@ -160,8 +164,27 @@ GitHub token guidance: - local DB path wiring - GitHub token presence, token-shape validation, and a live auth smoke check - OpenAI key presence, key-shape validation, and a live auth smoke check +- `vectorlite` runtime readiness - if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly +## Configure + +Use `configure` to inspect or change the active summary model and embedding basis: + +```bash +ghcrawl configure +ghcrawl configure --summary-model gpt-5.4-mini +ghcrawl configure --embedding-basis title_original +``` + +Current defaults: + +- summary model: `gpt-5-mini` +- embedding basis: `title_summary` (`title + dedupe summary`) +- vector backend: `vectorlite` + +Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo. + ### 1Password CLI Example If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly: @@ -216,10 +239,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau ## Cost To Operate -The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings). +The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings). On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee. 
+For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release: + +- `gpt-5-mini`: about **$12 USD** one time +- `gpt-5.4-mini`: about **$30 USD** one time + +`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend. + This screenshot is the reference point for that estimate: ![OpenAI embeddings cost for a 12k-issue repo](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/openai-embeddings-12k-issue-repo.png) @@ -267,15 +297,16 @@ The agent and build contract for this repo lives in [SPEC.md](https://github.com - a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo - `sync` is metadata-only by default - `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context -- `embed` defaults to `text-embedding-3-large` -- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present -- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding +- `embed` defaults to `text-embedding-3-large` with `dimensions=1024` +- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index +- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding +- the default embedding basis is `title + dedupe summary`; use `ghcrawl configure` to switch to `title + original body` - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo` - `sync --limit ` is the best smoke-test path on a busy repository - `tui` remembers sort order and min cluster size per repository in the persisted config file - the TUI shows 
locally closed threads and clusters in gray; press `x` to hide or show them - on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right -- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+` +- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+` ## Responsibility Attestation diff --git a/apps/cli/src/main.test.ts b/apps/cli/src/main.test.ts index 020e06b..c018b8d 100644 --- a/apps/cli/src/main.test.ts +++ b/apps/cli/src/main.test.ts @@ -5,7 +5,7 @@ import os from 'node:os'; import path from 'node:path'; import { fileURLToPath } from 'node:url'; -import { GHCrawlService } from '@ghcrawl/api-core'; +import { GHCrawlService, readPersistedConfig } from '@ghcrawl/api-core'; import { formatDoctorReport, formatLogLine, getExitCode, parseOwnerRepo, parseRepoFlags, resolveSinceValue, run, runCli } from './main.js'; function createWritableCapture(isTTY?: boolean) { @@ -39,6 +39,7 @@ function makeRunContext(): { env: NodeJS.ProcessEnv; cwd: string; cleanup: () => const publicCommands = [ 'init', 'doctor', + 'configure', 'version', 'sync', 'refresh', @@ -171,6 +172,38 @@ test('run prints json doctor output when explicitly requested', async () => { assert.match(stdout.read(), /"github"/); }); +test('configure prints current persisted settings and cost estimates', async () => { + const stdout = createWritableCapture(true); + const context = makeRunContext(); + + try { + await run(['configure'], stdout.stream, { env: context.env, cwd: context.cwd }); + } finally { + context.cleanup(); + } + + assert.match(stdout.read(), /ghcrawl configure/); + assert.match(stdout.read(), /summary model: gpt-5-mini/); + assert.match(stdout.read(), /embedding basis: title_original/); + assert.match(stdout.read(), /gpt-5\.4-mini: ~\$30 USD/); 
+}); + +test('configure persists summary model changes', async () => { + const stdout = createWritableCapture(); + const context = makeRunContext(); + + try { + await run(['configure', '--summary-model', 'gpt-5.4-mini', '--json'], stdout.stream, { + env: context.env, + cwd: context.cwd, + }); + const persisted = readPersistedConfig({ env: context.env, cwd: context.cwd }); + assert.equal(persisted.data.summaryModel, 'gpt-5.4-mini'); + } finally { + context.cleanup(); + } +}); + test('unknown command exits with code 2 and a top-level help hint', async () => { const stderr = createWritableCapture(); const code = await runCli(['wat'], { stderr: stderr.stream }); @@ -410,6 +443,11 @@ test('formatDoctorReport renders a human-readable health summary', () => { authOk: false, error: 'missing', }, + vectorlite: { + configured: true, + runtimeOk: true, + error: null, + }, }); assert.match(rendered, /config path: \/tmp\/config\.json/); diff --git a/apps/cli/src/main.ts b/apps/cli/src/main.ts index 7919338..4446002 100644 --- a/apps/cli/src/main.ts +++ b/apps/cli/src/main.ts @@ -5,7 +5,7 @@ import path from 'node:path'; import { parseArgs } from 'node:util'; import { fileURLToPath } from 'node:url'; -import { createApiServer, GHCrawlService, loadConfig, type LoadConfigOptions } from '@ghcrawl/api-core'; +import { createApiServer, GHCrawlService, loadConfig, readPersistedConfig, writePersistedConfig, type LoadConfigOptions } from '@ghcrawl/api-core'; import { createHeapDiagnostics, type HeapDiagnostics } from './heap-diagnostics.js'; import { runInitWizard } from './init-wizard.js'; import { startTui } from './tui/app.js'; @@ -13,6 +13,7 @@ import { startTui } from './tui/app.js'; type CommandName = | 'init' | 'doctor' + | 'configure' | 'version' | 'sync' | 'refresh' @@ -24,6 +25,7 @@ type CommandName = | 'purge-comments' | 'embed' | 'cluster' + | 'cluster-experiment' | 'clusters' | 'cluster-detail' | 'search' @@ -42,7 +44,28 @@ type CommandSpec = { }; type DoctorResult = 
Awaited>; -type DoctorReport = DoctorResult & { version: string }; +type DoctorReport = DoctorResult & { + version: string; + vectorlite?: { + configured: boolean; + runtimeOk: boolean; + error: string | null; + }; +}; + +type ConfigureReport = { + configPath: string; + updated: boolean; + summaryModel: 'gpt-5-mini' | 'gpt-5.4-mini'; + embeddingBasis: 'title_original' | 'title_summary'; + vectorBackend: 'vectorlite'; + costEstimateUsd: { + sampleThreads: number; + pricingDate: string; + gpt5Mini: number; + gpt54Mini: number; + }; +}; type ParsedGlobalFlags = { argv: string[]; @@ -79,6 +102,18 @@ const COMMAND_SPECS: readonly CommandSpec[] = [ examples: ['ghcrawl doctor', 'ghcrawl doctor --json'], agentJson: true, }, + { + name: 'configure', + synopsis: 'configure [--summary-model gpt-5-mini|gpt-5.4-mini] [--embedding-basis title_original|title_summary] [--json]', + description: 'Show or update persisted summarization and embedding settings.', + options: [ + '--summary-model Select gpt-5-mini or gpt-5.4-mini for summarization', + '--embedding-basis Select title_original or title_summary for active vectors', + '--json Emit machine-readable JSON output explicitly', + ], + examples: ['ghcrawl configure', 'ghcrawl configure --summary-model gpt-5.4-mini', 'ghcrawl configure --embedding-basis title_original --json'], + agentJson: true, + }, { name: 'version', synopsis: 'version', @@ -430,6 +465,8 @@ export function parseRepoFlags(command: CommandName, args: string[]): ParsedRepo query: { type: 'string' }, mode: { type: 'string' }, k: { type: 'string' }, + backend: { type: 'string' }, + 'candidate-k': { type: 'string' }, threshold: { type: 'string' }, port: { type: 'string' }, id: { type: 'string' }, @@ -575,6 +612,24 @@ function parseEnum(command: CommandName, flagName: string, val throw new CliUsageError(`Invalid --${flagName}: ${value}. 
Use one of ${allowed.join(', ')}.`, command); } +function buildConfigureReport(options: { + configPath: string; + updated: boolean; + summaryModel: 'gpt-5-mini' | 'gpt-5.4-mini'; + embeddingBasis: 'title_original' | 'title_summary'; + vectorBackend: 'vectorlite'; +}): ConfigureReport { + return { + ...options, + costEstimateUsd: { + sampleThreads: 20_000, + pricingDate: 'April 1, 2026', + gpt5Mini: 12, + gpt54Mini: 30, + }, + }; +} + export function formatDoctorReport(result: DoctorReport): string { const lines = [ 'ghcrawl doctor', @@ -607,6 +662,43 @@ export function formatDoctorReport(result: DoctorReport): string { if (result.openai.error) { lines.push(` note: ${result.openai.error}`); } + lines.push( + '', + 'Vectorlite', + ` configured: ${formatBooleanStatus(result.vectorlite?.configured ?? false)}`, + ` runtime ok: ${formatBooleanStatus(result.vectorlite?.runtimeOk ?? false)}`, + ); + if (result.vectorlite?.error) { + lines.push(` note: ${result.vectorlite.error}`); + } + return `${lines.join('\n')}\n`; +} + +export function formatConfigureReport(result: ConfigureReport): string { + const basisLabel = result.embeddingBasis === 'title_summary' + ? 'title + dedupe summary' + : 'title + original body'; + const summaryModeNote = result.embeddingBasis === 'title_summary' + ? 'enabled automatically during refresh' + : 'disabled by default; enable title_summary to summarize before embedding'; + const lines = [ + 'ghcrawl configure', + `config path: ${result.configPath}`, + `updated: ${result.updated ? 
'yes' : 'no'}`, + '', + 'Active settings', + ` summary model: ${result.summaryModel}`, + ` embedding basis: ${result.embeddingBasis} (${basisLabel})`, + ` llm summaries: ${summaryModeNote}`, + ` vector backend: ${result.vectorBackend}`, + '', + `Estimated one-time summary cost for ~${result.costEstimateUsd.sampleThreads.toLocaleString()} threads`, + ` pricing date: ${result.costEstimateUsd.pricingDate}`, + ` gpt-5-mini: ~$${result.costEstimateUsd.gpt5Mini.toFixed(0)} USD`, + ` gpt-5.4-mini: ~$${result.costEstimateUsd.gpt54Mini.toFixed(0)} USD`, + '', + 'Changing summary model or embedding basis will make the next refresh rebuild vectors and clusters.', + ]; return `${lines.join('\n')}\n`; } @@ -782,6 +874,41 @@ export async function run( stdout.write(shouldWriteJson ? `${JSON.stringify(result, null, 2)}\n` : formatDoctorReport(result)); return; } + case 'configure': { + const parsed = parseArgsForCommand('configure', rest, { + 'summary-model': { type: 'string' }, + 'embedding-basis': { type: 'string' }, + json: { type: 'boolean' }, + }); + const values = parsed.values as RepoCommandValues; + const summaryModel = parseEnum('configure', 'summary-model', values['summary-model'], ['gpt-5-mini', 'gpt-5.4-mini']); + const embeddingBasis = parseEnum('configure', 'embedding-basis', values['embedding-basis'], ['title_original', 'title_summary']); + const current = getConfig(); + const stored = readPersistedConfig(loadConfigOptions); + const next = { + ...stored.data, + summaryModel: summaryModel ?? current.summaryModel, + embeddingBasis: embeddingBasis ?? 
current.embeddingBasis, + vectorBackend: 'vectorlite' as const, + }; + const updated = + next.summaryModel !== current.summaryModel || + next.embeddingBasis !== current.embeddingBasis || + next.vectorBackend !== current.vectorBackend; + if (updated) { + writePersistedConfig(next, loadConfigOptions); + } + const result = buildConfigureReport({ + configPath: current.configPath, + updated, + summaryModel: next.summaryModel as 'gpt-5-mini' | 'gpt-5.4-mini', + embeddingBasis: next.embeddingBasis as 'title_original' | 'title_summary', + vectorBackend: 'vectorlite', + }); + const shouldWriteJson = values.json === true || (stdout as NodeJS.WriteStream).isTTY !== true; + stdout.write(shouldWriteJson ? `${JSON.stringify(result, null, 2)}\n` : formatConfigureReport(result)); + return; + } case 'version': { stdout.write(`${CLI_VERSION}\n`); return; @@ -934,6 +1061,21 @@ export async function run( heapDiagnostics?.dispose(); } } + case 'cluster-experiment': { + const { owner, repo, values } = parseRepoFlags('cluster-experiment', rest); + const backend = values.backend === 'exact' || values.backend === 'vectorlite' ? values.backend : undefined; + const result = getService().clusterExperiment({ + owner, + repo, + backend, + k: typeof values.k === 'string' ? Number(values.k) : undefined, + minScore: typeof values.threshold === 'string' ? Number(values.threshold) : undefined, + candidateK: typeof values['candidate-k'] === 'string' ? 
Number(values['candidate-k']) : undefined, + onProgress: (message: string) => writeProgress(message, stderr), + }); + stdout.write(`${JSON.stringify(result, null, 2)}\n`); + return; + } case 'clusters': { const { owner, repo, values } = parseRepoFlags('clusters', rest); const sort = parseEnum('clusters', 'sort', values.sort, ['recent', 'size']); diff --git a/apps/cli/src/tui/app.test.ts b/apps/cli/src/tui/app.test.ts index 86ff761..19289d9 100644 --- a/apps/cli/src/tui/app.test.ts +++ b/apps/cli/src/tui/app.test.ts @@ -6,6 +6,7 @@ import type { TuiClusterDetail, TuiRepoStats, TuiThreadDetail } from '@ghcrawl/a import { buildRefreshCliArgs, buildHelpContent, + buildUpdatePipelineHelpContent, buildUpdatePipelineLabels, describeUpdateTask, escapeBlessedText, @@ -73,9 +74,11 @@ test('renderDetailPane escapes user-provided text before rendering into a tags-e const rendered = renderDetailPane(detail, cluster, 'detail'); assert.match(rendered, /Cluster 1 \(#42 representative issue\)/); assert.match(rendered, /Bad \\{bold\\}title\\{\/bold\\}/); + assert.match(rendered, /LLM Summary:/); assert.match(rendered, /Body with \\{red-fg\\}tags\\{\/red-fg\\}/); assert.match(rendered, /Summary \\{yellow-fg\\}text\\{\/yellow-fg\\}/); assert.match(rendered, /Neighbor \\{blue-fg\\}title\\{\/blue-fg\\}/); + assert.ok(rendered.indexOf('LLM Summary:') < rendered.indexOf('{bold}Body{/bold}')); }); test('parseOwnerRepoValue accepts owner slash repo values and rejects invalid ones', () => { @@ -199,6 +202,17 @@ test('buildHelpContent includes the full key command list', () => { assert.match(content, /This popup scrolls\./); }); +test('buildUpdatePipelineHelpContent explains the LLM summary tradeoff for both modes', () => { + const disabled = buildUpdatePipelineHelpContent('title_original'); + assert.match(disabled, /LLM summaries: disabled/); + assert.match(disabled, /configure --embedding-basis title_summary/); + assert.match(disabled, /\$15-\$30/); + + const enabled = 
buildUpdatePipelineHelpContent('title_summary'); + assert.match(enabled, /LLM summaries: enabled/); + assert.match(enabled, /about 50%/); +}); + test('buildRefreshCliArgs maps the staged selection to refresh skip flags', () => { assert.deepEqual(buildRefreshCliArgs({ owner: 'openclaw', repo: 'openclaw' }, { sync: true, embed: true, cluster: true }), [ 'refresh', diff --git a/apps/cli/src/tui/app.ts b/apps/cli/src/tui/app.ts index 0fd8226..d0c5920 100644 --- a/apps/cli/src/tui/app.ts +++ b/apps/cli/src/tui/app.ts @@ -836,7 +836,11 @@ export async function startTui(params: StartTuiParams): Promise { void (async () => { modalOpen = true; try { - const selection = await promptUpdatePipelineSelection(widgets.screen, snapshot?.stats ?? null); + const selection = await promptUpdatePipelineSelection( + widgets.screen, + snapshot?.stats ?? null, + params.service.config.embeddingBasis, + ); if (!selection) { render(); return; @@ -1241,7 +1245,10 @@ export function renderDetailPane( ? `{bold}Closed:{/bold} ${escapeBlessedText(thread.closedAtLocal ?? thread.closedAtGh ?? 'yes')} ${thread.closeReasonLocal ? `(${escapeBlessedText(thread.closeReasonLocal)})` : ''}`.trimEnd() : '{bold}Closed:{/bold} no'; const summaries = Object.entries(threadDetail.summaries) - .map(([key, value]) => `{bold}${key}:{/bold}\n${escapeBlessedText(value)}`) + .map(([key, value]) => { + const label = key === 'dedupe_summary' ? 'LLM Summary' : key; + return `{bold}${escapeBlessedText(label)}:{/bold}\n${escapeBlessedText(value)}`; + }) .join('\n\n'); const neighbors = threadDetail.neighbors.length > 0 @@ -1261,10 +1268,10 @@ export function renderDetailPane( `{bold}Updated:{/bold} ${thread.updatedAtGh ?? 'unknown'}`, `{bold}Labels:{/bold} ${labels}`, `{bold}URL:{/bold} ${escapeBlessedText(thread.htmlUrl)}`, + summaries ? `\n\n${summaries}` : '', '', `{bold}Body{/bold}`, escapeBlessedText(thread.body ?? '(no body)'), - summaries ? 
`\n\n${summaries}` : '', `\n\n{bold}Neighbors{/bold}\n${neighbors}`, ] .filter(Boolean) @@ -1348,6 +1355,23 @@ export function buildUpdatePipelineLabels( }); } +export function buildUpdatePipelineHelpContent(embeddingBasis: 'title_original' | 'title_summary'): string { + const summariesEnabled = embeddingBasis === 'title_summary'; + const summaryStatus = summariesEnabled + ? 'LLM summaries: enabled via title_summary.' + : 'LLM summaries: disabled; current basis is title_original.'; + const summaryAction = summariesEnabled + ? 'On openclaw/openclaw this improved non-solo cluster membership by about 50% versus title_original.' + : 'Enable with `ghcrawl configure --embedding-basis title_summary` if you want richer clustering; on openclaw/openclaw that improved non-solo cluster membership by about 50%.'; + return [ + 'Usually you want all three. Run order is fixed: GitHub sync/reconcile -> embeddings -> clusters.', + `${summaryStatus} ${summaryAction}`, + 'A first summarize of ~18k open issues/PRs in openclaw/openclaw typically costs about $15-$30 with gpt-5-mini.', + 'Later refreshes are usually much cheaper because only changed items need summaries.', + 'Toggle with space, move with arrows, Enter to start, Esc to cancel.', + ].join('\n'); +} + export function buildHelpContent(): string { return [ '{bold}ghcrawl TUI Help{/bold}', @@ -1474,6 +1498,7 @@ async function promptHelp(screen: blessed.Widgets.Screen): Promise { async function promptUpdatePipelineSelection( screen: blessed.Widgets.Screen, stats: TuiRepoStats | null, + embeddingBasis: 'title_original' | 'title_summary', ): Promise { const selection: UpdateTaskSelection = { sync: true, embed: true, cluster: true }; const modalWidth = '76%'; @@ -1487,7 +1512,7 @@ async function promptUpdatePipelineSelection( top: 'center', left: 'center', width: modalWidth, - height: 11, + height: 14, style: { border: { fg: '#5bc0eb' }, item: { fg: 'white' }, @@ -1497,14 +1522,12 @@ async function promptUpdatePipelineSelection( 
}); const help = blessed.box({ parent: screen, - top: 'center-4', + top: 'center-5', left: 'center', width: modalWidth, - height: 4, + height: 7, style: { fg: 'white', bg: '#101522' }, - content: - 'Usually you want all three. Run order is fixed: GitHub sync/reconcile -> embeddings -> clusters.\n' + - 'Toggle with space, move with arrows, Enter to start, Esc to cancel.', + content: buildUpdatePipelineHelpContent(embeddingBasis), }); box.focus(); diff --git a/docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md b/docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md new file mode 100644 index 0000000..ed7b5bb --- /dev/null +++ b/docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md @@ -0,0 +1,124 @@ +--- +date: 2026-04-01 +topic: vectorlite-default-search-and-summary-migration +--- + +# Vectorlite Default Search And Summary Migration + +## Problem Frame + +`ghcrawl` currently treats `vectorlite` as an experiment, keeps embeddings in `document_embeddings.embedding_json`, and rebuilds clusters from in-memory exact similarity logic. That creates three problems: + +1. Clustering and semantic lookups do not scale cleanly as repos grow past 10k-20k issues/PRs. +2. The product does not yet have a persistent ANN index that can answer semantic search or cluster-membership questions quickly for day-to-day use. +3. The current embedding/summarization pipeline is not versioned strongly enough to support a deliberate migration to shorter embeddings, a persistent vector index, and user-selectable summary models without confusion about what is stale. + +The release needs to move `ghcrawl` to a persistent `vectorlite`-backed search model, migrate embeddings to 1024-dimensional `text-embedding-3-large`, preserve summary-skip behavior for unchanged inputs, and make the operator experience clear about cost, migration state, and rebuild behavior. 
+
+```mermaid
+flowchart TB
+    A["User upgrades ghcrawl"] --> B["doctor / configure shows migration state"]
+    B --> C["User runs refresh"]
+    C --> D["sync GitHub metadata"]
+    D --> E["summarize only stale threads when summary-based embeddings are enabled"]
+    E --> F["rebuild embeddings at 1024 dimensions"]
+    F --> G["write/update persistent vectorlite index"]
+    G --> H["rebuild clusters from vectorlite search"]
+    H --> I["mark repo migrated and ready for semantic search"]
+```
+
+## Requirements
+
+**Release Structure**
+- R1. Ship this as a single coordinated breaking release rather than splitting it across three releases.
+- R2. Treat `vectorlite` as a runtime requirement for supported installs in this release, not an experiment-only optional dependency.
+- R3. Promote vector-backed semantic search and clustering to the supported default path for day-to-day operation.
+
+**Persistent Vector Search**
+- R4. `ghcrawl` must maintain a persistent `vectorlite`-backed vector index for each repository so semantic lookup does not require loading the full embedding corpus into memory.
+- R5. Semantic search must query the persistent vector index directly rather than the current exact in-memory scan path.
+- R6. The product must support fast “find likely cluster membership / nearest neighbors for a newly synced thread” using the persistent vector index.
+- R7. The release must define one supported persistence strategy for the vector index and keep it stable for the release. The recommended default is a managed sidecar index or sidecar SQLite store rather than storing `vectorlite` virtual tables in the canonical issue/PR database.
+
+**Summarization And Embedding Pipeline**
+- R8. The release must support two summarization models for operator choice:
+  - `gpt-5-mini`
+  - `gpt-5.4-mini`
+- R9. The default summarization model for new and upgraded installs must be `gpt-5-mini`.
+- R10.
The release must continue skipping summarization work when the summary input has not changed, and automated tests must prove this behavior.
+- R11. The release must move embeddings to `text-embedding-3-large` with explicit `dimensions=1024`.
+- R12. The release must store only one active embedding per thread for the active embedding basis, not multiple long-lived parallel embedding sources for old and new strategies.
+- R13. The active embedding basis must be operator-selectable between:
+  - title + original description
+  - title + summarized description
+- R14. The default embedding basis for this release should be title + summarized description because recent repo experiments indicate better clustering quality from the dedupe-summary path.
+- R15. The stored embedding metadata must record the embedding basis and pipeline version used to create it so stale rows can be detected deterministically after config or model changes.
+- R16. If the operator changes summarization model or embedding basis, `ghcrawl` must mark the affected summaries and/or embeddings stale instead of silently treating old data as current.
+
+**Refresh And Migration Behavior**
+- R17. The first `refresh` after upgrade must detect that pre-release embeddings are obsolete and rebuild embeddings before vector search and clustering are treated as current.
+- R18. If the active embedding basis depends on summaries, `refresh` must become summary-aware and run a summarize phase before embedding whenever the relevant summary content is missing or stale.
+- R19. Existing cluster runs built from pre-migration embeddings must be treated as stale after upgrade and must not continue to appear as if they are current once the repo is known to need migration.
+- R20. The operator experience must make migration status obvious before and during the first post-upgrade rebuild.
+- R21. Rebuild behavior must be repository-scoped, so one repo can complete migration without forcing all repos to migrate immediately.
+- R22. If a repo has not completed the required rebuild yet, commands that depend on current vectors or current clusters must say so clearly instead of returning misleading old results.
+
+**Operator Controls And UX**
+- R23. Add a first-class `configure` command that shows the selected summarization model, embedding basis, vector backend status, and whether the current repo data is migrated or stale.
+- R24. `configure` must allow the operator to switch summarization models intentionally.
+- R25. The operator-facing docs and CLI help must explain that `gpt-5-mini` is the cheaper default and `gpt-5.4-mini` is the higher-quality, more expensive option.
+- R26. `doctor` should report whether `vectorlite` loads successfully on the current machine, because this release makes it a supported runtime dependency.
+
+**Cost And Spend Transparency**
+- R27. The docs must include an estimate, using April 1, 2026 USD pricing, for summarizing roughly 20k `openclaw/openclaw` issues/PRs with both supported summary models.
+- R28. The release should communicate that a one-time full summarize of ~20k threads is expected to cost roughly:
+  - about `$30` with `gpt-5.4-mini`
+  - about `$11-$13` with `gpt-5-mini`
+  based on current OpenAI pricing and the repo’s observed token profile.
+- R29. Long-running summarize/refresh progress output should continue reporting spend and estimated total cost so operators can stop early if needed.
+
+**Validation And Release Safety**
+- R30. Tests must prove summary skipping still works when thread input is unchanged after this migration.
+- R31. Tests must prove stale detection for summary model changes, embedding basis changes, and pre-migration embeddings.
+- R32. Tests must prove semantic search uses the persistent vector index successfully after migration.
+- R33. Release verification must include upgrade testing from a pre-vectorlite database to the new release on at least one realistic repo dataset.
+- R34.
Release verification must include packaging/install validation for the supported desktop/server platforms because `vectorlite` becomes a hard dependency.
+
+## Success Criteria
+- Operators can upgrade and complete the first repo migration with a normal `refresh` flow instead of manually orchestrating summarize/embed/cluster recovery steps.
+- Large repos no longer require loading the full embedding corpus into RAM for normal semantic lookup or cluster rebuild workflows.
+- Semantic search and cluster-neighbor lookup are fast enough to feel interactive on migrated repos.
+- Unchanged threads are not re-summarized on repeated refreshes.
+- Docs and CLI clearly communicate model choice, migration status, and likely one-time spend before operators trigger a full rebuild.
+
+## Scope Boundaries
+- Not in scope for this release: keeping exact in-memory search as a co-equal supported production path.
+- Not in scope for this release: shipping three separate rollout releases for vector backend, embedding migration, and summary model controls.
+- Not in scope for this release: introducing a web UI as part of the migration.
+- Not in scope for this release: preserving old cluster results as “current” after the release knows the repo must be re-embedded.
+
+## Key Decisions
+- One coordinated breaking release: The migration behavior is too coupled across vectors, summaries, search, and clustering to justify stretching it across three separate releases.
+- `vectorlite` is a real dependency now: the release should treat it as supported runtime infrastructure, not an experiment hiding behind a side command.
+- Default summary model is `gpt-5-mini`: it keeps the out-of-the-box one-time migration cost materially lower while preserving a higher-quality paid-up option.
+- Default embedding basis is title + summarized description: recent repo experiments indicate this produces cleaner clusters than title/body alone.
+- Existing summary skip behavior stays: the release should migrate the pipeline without turning `refresh` into an always-re-summarize money sink.
+
+## Dependencies / Assumptions
+- The supported `vectorlite` packaging story is good enough on the platforms `ghcrawl` intends to support in this release.
+- OpenAI pricing as of April 1, 2026 remains:
+  - `gpt-5-mini`: $0.25 / 1M input, $0.025 / 1M cached input, $2.00 / 1M output
+  - `gpt-5.4-mini`: $0.75 / 1M input, $0.075 / 1M cached input, $4.50 / 1M output
+- The recent repo experiment result in `docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md` remains directionally valid for the release default.
+
+## Outstanding Questions
+
+### Deferred To Planning
+- [Affects R7][Technical] Should the persistent vector index live in a sidecar SQLite DB, a sidecar vector index file, or inside the primary DB schema with `vectorlite` virtual tables?
+- [Affects R18][Technical] How should `refresh` expose the new summarize-aware phase in progress output and skip flags without making the CLI surface confusing?
+- [Affects R19][Technical] What is the cleanest way to mark old cluster runs invalid after upgrade while still letting the app explain why clusters are unavailable?
+- [Affects R23][Technical] Should `configure` remain fully interactive, accept flags for non-interactive use, or support both?
+- [Affects R31][Needs research] What is the smallest durable pipeline-versioning scheme that covers summary model, summary prompt version, embedding basis, embedding model, and dimensions without becoming hard to maintain?
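
The per-token prices listed under Dependencies / Assumptions can be turned into a rough one-time estimate. The sketch below only shows the arithmetic: `estimateSummaryCostUsd` and the per-thread token counts (1,000 input, 150 output) are hypothetical placeholders, not the repo's observed token profile, and cached-input pricing is ignored for simplicity.

```typescript
// Per-million-token USD prices copied from the pricing assumption in this document.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const PRICING: Record<string, ModelPricing> = {
  'gpt-5-mini': { inputPerMillion: 0.25, outputPerMillion: 2.0 },
  'gpt-5.4-mini': { inputPerMillion: 0.75, outputPerMillion: 4.5 },
};

// Hypothetical helper: estimates a full summarize pass, assuming no cached input.
function estimateSummaryCostUsd(
  model: string,
  threads: number,
  inputTokensPerThread: number,
  outputTokensPerThread: number,
): number {
  const pricing = PRICING[model];
  if (!pricing) throw new Error(`unknown model: ${model}`);
  const inputCost = ((threads * inputTokensPerThread) / 1e6) * pricing.inputPerMillion;
  const outputCost = ((threads * outputTokensPerThread) / 1e6) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

// Illustrative token counts only; substitute measured values before quoting a number.
console.log(estimateSummaryCostUsd('gpt-5-mini', 20000, 1000, 150).toFixed(2)); // prints "11.00"
console.log(estimateSummaryCostUsd('gpt-5.4-mini', 20000, 1000, 150).toFixed(2)); // prints "28.50"
```

Because output tokens cost several times more than input tokens for both models, the estimate is most sensitive to the average summary length, so that is the figure worth measuring first.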
+
+## Next Steps
+→ `/prompts:ce-plan` for structured implementation planning
diff --git a/docs/plans/2026-04-01-001-feat-persistent-vectorlite-migration-plan.md b/docs/plans/2026-04-01-001-feat-persistent-vectorlite-migration-plan.md
new file mode 100644
index 0000000..b01802a
--- /dev/null
+++ b/docs/plans/2026-04-01-001-feat-persistent-vectorlite-migration-plan.md
@@ -0,0 +1,479 @@
+---
+title: feat: adopt persistent vectorlite search and summary-aware embedding migration
+type: feat
+status: active
+date: 2026-04-01
+origin: docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md
+---
+
+# feat: adopt persistent vectorlite search and summary-aware embedding migration
+
+## Overview
+
+This plan upgrades `ghcrawl` from exact in-memory embedding usage to persistent `vectorlite`-backed search and clustering, while simultaneously migrating the embedding pipeline to a single active vector per thread using `text-embedding-3-large` at 1024 dimensions. The same release also formalizes operator choice of summarization model, makes `refresh` summary-aware when the active embedding basis depends on summaries, and preserves summary skip behavior so ongoing refreshes do not become an unnecessary spend multiplier.
+
+## Problem Frame
+
+The current product has a split personality: the repo now has a promising `vectorlite` experiment, but production behavior still centers on `document_embeddings.embedding_json`, exact in-memory scanning, and a `refresh` pipeline that does not know how to summarize before embedding. That leaves `ghcrawl` with slow clustering on large repos, no persistent ANN search surface, stale/ambiguous upgrade behavior for old embeddings and cluster runs, and no supported operator workflow for choosing summary quality versus cost. The release needs to unify those into one supported model without making users manually recover from migration state.
(see origin: `docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md`)
+
+## Requirements Trace
+
+- R1. Ship one coordinated breaking release for vector search, embedding migration, and summary model controls.
+- R2. Make `vectorlite` a supported runtime requirement.
+- R4. Maintain a persistent vector index per repository.
+- R5. Query semantic search from the persistent vector index rather than exact in-memory scans.
+- R6. Support fast nearest-neighbor and likely-cluster lookups for newly synced threads.
+- R8. Support `gpt-5-mini` and `gpt-5.4-mini` as summary model choices.
+- R9. Default summary model to `gpt-5-mini`.
+- R10. Preserve skip-on-unchanged summarization behavior with tests.
+- R11. Move embeddings to `text-embedding-3-large` with explicit `dimensions=1024`.
+- R12. Store one active embedding per thread for the active basis.
+- R13. Support operator-selectable embedding basis: title + original description or title + summarized description.
+- R14. Default embedding basis to title + summarized description.
+- R15. Record pipeline metadata strongly enough to detect stale summaries, vectors, and clusters deterministically.
+- R17. Force pre-release embeddings to rebuild on first `refresh` after upgrade.
+- R18. Make `refresh` summary-aware when the active embedding basis depends on summaries.
+- R19. Treat pre-migration cluster runs as stale after upgrade.
+- R23. Add `configure` to show and change active operator settings.
+- R26. Make `doctor` report `vectorlite` runtime readiness.
+- R27. Publish operator cost estimates for ~20k thread one-time summarization.
+- R30. Add tests for summary skipping after migration.
+- R31. Add tests for stale detection across model/basis/pipeline changes.
+- R32. Add tests proving semantic search uses the persistent vector index.
+- R34. Validate packaging/install behavior with `vectorlite` as a hard dependency.
+
+## Scope Boundaries
+
+- Not in scope: keeping the old exact in-memory vector path as a co-equal supported production backend.
+- Not in scope: a web UI redesign or any browser surface changes.
+- Not in scope: preserving old cluster runs as “current” after a repo is known to need vector migration.
+- Not in scope: multi-release rollout staging for vector search, embedding migration, and summary controls.
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `packages/api-core/src/service.ts`
+  - current `refreshRepository()` only runs `sync -> embed -> cluster`
+  - current `summarizeRepository()` already skips unchanged work using `document_summaries.content_hash`
+  - current `clusterExperiment()` contains the best existing `vectorlite` integration pattern, including extension loading and HNSW queries
+  - current `searchRepository()` still builds semantic results from exact local embeddings rather than a persistent vector index
+- `packages/api-core/src/openai/provider.ts`
+  - already centralizes summary and embedding API calls
+  - embeddings currently do not pass an explicit `dimensions` parameter
+- `packages/api-core/src/config.ts`
+  - already persists `summaryModel` and `embedModel`
+  - does not yet expose embedding basis, vector backend state, or a first-class configuration workflow
+- `packages/api-core/src/db/migrate.ts`
+  - currently persists `document_summaries`, `document_embeddings`, cluster tables, and run tables
+  - has no first-class pipeline-state table for migration invalidation or vector store metadata
+- `apps/cli/src/main.ts`
+  - already has stable command/help patterns for `doctor`, `refresh`, `search`, `neighbors`, and `cluster`
+  - currently treats `summarize` as a dev-only command and has no `configure` command
+- `packages/api-contract/src/contracts.ts`
+  - already defines the response contracts for `doctor`, `search`, `neighbors`, and refresh results
+
+### Institutional Learnings
+
+- 
`docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md`
+  - recent internal evaluation found the strongest clustering quality from summary-oriented embeddings rather than raw title/body-only embeddings
+  - `vectorlite` HNSW materially improved clustering performance on larger datasets
+  - the repo’s observed summary costs support the release’s one-time migration cost messaging
+
+### External References
+
+- `vectorlite` README: persistent HNSW indexes can be backed by `index_file_path`, virtual tables support insert/update/delete by `rowid`, and metadata filtering is supported.
+  -
+- OpenAI `text-embedding-3-large` model docs confirm the current model and pricing surface for embeddings.
+  -
+- OpenAI `gpt-5-mini` and `gpt-5.4-mini` model docs confirm current per-token pricing and positioning.
+  -
+  -
+
+## Key Technical Decisions
+
+- **Persistent vector storage will use a sidecar repository-scoped SQLite/vectorlite store rather than embedding vectorlite tables into the canonical main DB.**
+  - Rationale: this keeps the issue/PR database as ordinary SQLite for easier rollback, inspection, and future recovery while still giving the product a durable ANN index. It also lets the vector store be rebuilt or replaced without entangling core relational data.
+
+- **The release will introduce one active vector per thread keyed by `thread_id`, with vector bytes living in the sidecar vector store and pipeline metadata living in the main DB.**
+  - Rationale: the current `document_embeddings` multi-row-per-thread model was useful for experimentation, but it is the wrong shape for “always-on” ANN search and creates unnecessary migration ambiguity.
+
+- **`refresh` will become a four-stage pipeline in behavior: `sync -> summarize-if-needed -> embed -> cluster`, while preserving the operator mental model that `refresh` is still the one command to run.**
+  - Rationale: if the active embedding basis depends on summaries, embedding can no longer be treated as independent of summarize.
+
+- **Staleness will be managed by explicit pipeline versioning, not by ad hoc assumptions about timestamps alone.**
+  - Rationale: this release changes multiple dimensions at once: summary model, summary prompt/version, embedding basis, embedding model, dimensions, and vector backend. A single pipeline compatibility stamp is easier to reason about than a loose mix of heuristics.
+
+- **Default embedding basis will be `title + dedupe summary`; default summary model will be `gpt-5-mini`.**
+  - Rationale: recent repo learnings suggest the summary-based embedding basis gives better clustering quality, while `gpt-5-mini` keeps one-time upgrade cost tolerable by default.
+
+- **`configure` will support both read and write behavior.**
+  - Rationale: `ghcrawl configure` with no mutation flags should show current settings and migration state; explicit flags should update persisted operator choices. This keeps the command useful for both humans and scripts.
+
+## Open Questions
+
+### Resolved During Planning
+
+- **Where should the persistent vector index live?**
+  - Resolution: in a repository-scoped sidecar vector store managed by `ghcrawl`, with the main DB retaining only relational metadata and pipeline state.
+
+- **Should `configure` be interactive-only or scriptable?**
+  - Resolution: support both. No-flag invocation shows current config and state; explicit flags mutate persisted config.
+
+- **Should `refresh` remain unaware of summarize?**
+  - Resolution: no. Summary-aware refresh is required whenever the active embedding basis depends on summaries.
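
The summary-aware `refresh` decision resolved above can be sketched as a small stage planner. This is a minimal illustration, assuming hypothetical names (`planRefreshStages`, `RefreshPlanInput`, `staleSummaryCount`) that are not taken from the `ghcrawl` source; only the stage order and the basis values come from this plan.

```typescript
// Hypothetical types for illustration; not the actual ghcrawl API.
type EmbeddingBasis = 'title_original' | 'title_summary';
type RefreshStage = 'sync' | 'summarize' | 'embed' | 'cluster';

interface RefreshPlanInput {
  embeddingBasis: EmbeddingBasis;
  // Threads whose summary is missing or stale (assumed to be computed elsewhere).
  staleSummaryCount: number;
}

// Returns the ordered stages a refresh would run. Summarize is included only
// when the active basis depends on summaries and some summaries need work.
function planRefreshStages(input: RefreshPlanInput): RefreshStage[] {
  const stages: RefreshStage[] = ['sync'];
  if (input.embeddingBasis === 'title_summary' && input.staleSummaryCount > 0) {
    stages.push('summarize');
  }
  stages.push('embed', 'cluster');
  return stages;
}

console.log(planRefreshStages({ embeddingBasis: 'title_summary', staleSummaryCount: 42 }).join(' -> '));
// prints "sync -> summarize -> embed -> cluster"
console.log(planRefreshStages({ embeddingBasis: 'title_original', staleSummaryCount: 42 }).join(' -> '));
// prints "sync -> embed -> cluster"
```

Keeping the decision in one pure function is one way to let the CLI, TUI, and tests agree on which stages a given repo state implies.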
+
+### Deferred to Implementation
+
+- **Exact sidecar path naming**
+  - Decide the final on-disk naming convention during implementation, but keep it repo-scoped and colocated with existing runtime data.
+
+- **How aggressively to scrub legacy `document_embeddings` rows**
+  - The release should stop treating them as authoritative. Whether old rows are deleted immediately or left as inert historical data for one release can be finalized during implementation.
+
+- **Whether to add a distinct “vector migration pending” status to TUI headers**
+  - This is a UX refinement worth deciding while touching the TUI status code, not a blocker to planning.
+
+## High-Level Technical Design
+
+> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*
+
+```mermaid
+flowchart TB
+    A["Main DB: threads / documents / document_summaries"] --> B["Pipeline state tables"]
+    A --> C["thread_vectors metadata (one row per thread)"]
+    C --> D["Repo-scoped vector sidecar"]
+    D --> E["vectorlite HNSW table (rowid = thread_id)"]
+    B --> F["stale detection"]
+    F --> G["refresh"]
+    G --> H["sync"]
+    H --> I["summarize if basis=title+summary and stale"]
+    I --> J["embed at 1024 dimensions"]
+    J --> K["upsert vector sidecar + thread_vectors metadata"]
+    K --> L["cluster from vectorlite neighbors"]
+    E --> M["semantic search / neighbors / cluster membership checks"]
+```
+
+## Implementation Units
+
+```mermaid
+flowchart TB
+    U1["Unit 1: Config and migration state"] --> U2["Unit 2: Persistent vector store"]
+    U1 --> U3["Unit 3: Summary/vector pipeline migration"]
+    U2 --> U4["Unit 4: Search and clustering on vectorlite"]
+    U3 --> U4
+    U3 --> U5["Unit 5: Refresh and operator UX"]
+    U4 --> U5
+    U5 --> U6["Unit 6: Docs and release validation"]
+```
+
+- [ ] **Unit 1: Add pipeline config and repo migration state**
+
+**Goal:** Introduce durable configuration and
per-repo state strong enough to drive a controlled migration instead of implicit behavior.
+
+**Requirements:** R8, R9, R13, R15, R17, R19, R23, R26, R31
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `packages/api-core/src/config.ts`
+- Modify: `packages/api-core/src/db/migrate.ts`
+- Modify: `packages/api-core/src/service.ts`
+- Modify: `packages/api-core/src/index.ts`
+- Test: `packages/api-core/src/config.test.ts`
+- Test: `packages/api-core/src/db/migrate.test.ts`
+- Test: `packages/api-core/src/service.test.ts`
+
+**Approach:**
+- Extend persisted config with:
+  - summary model selection
+  - embedding basis selection
+  - any explicit vector backend / vector store config needed for this release
+- Add repo-scoped pipeline state in the main DB to record:
+  - current migration compatibility version
+  - active summary model / prompt version
+  - active embedding basis / embedding model / dimensions
+  - vector index freshness
+  - cluster freshness relative to vectors
+- Make stale-state computation explicit in service helpers rather than distributing it across command handlers.
+
+**Execution note:** Start with migration-state and config tests before touching orchestration logic so later units have a stable contract to build on.
+
+**Patterns to follow:**
+- `packages/api-core/src/config.ts` persisted config loading/writing
+- `packages/api-core/src/db/migrate.ts` additive schema migration style
+- `packages/api-core/src/service.ts` repo-scoped run and state helpers
+
+**Test scenarios:**
+- Happy path: loading config with no new settings defaults to `gpt-5-mini` and the default embedding basis.
+- Happy path: persisted config round-trips the chosen summary model and embedding basis.
+- Edge case: a repo with no pipeline state is treated as migration-pending rather than current.
+- Edge case: changing summary model or embedding basis marks repo vector/cluster state stale.
+- Error path: malformed persisted config values are ignored or rejected consistently with current config behavior.
+- Integration: startup migration on an existing pre-release DB adds the new pipeline state tables without destroying existing threads, summaries, or run history.
+
+**Verification:**
+- The config object exposes the new settings predictably.
+- A migrated DB can answer “is this repo current, stale, or migration-pending?” deterministically.
+
+- [ ] **Unit 2: Introduce a persistent vectorlite sidecar store**
+
+**Goal:** Replace ephemeral experiment-only vectorlite usage with a supported persistent vector store abstraction.
+
+**Requirements:** R2, R4, R5, R6, R7, R12, R15, R32, R34
+
+**Dependencies:** Unit 1
+
+**Files:**
+- Create: `packages/api-core/src/vector/store.ts`
+- Create: `packages/api-core/src/vector/vectorlite-store.ts`
+- Modify: `packages/api-core/src/db/sqlite.ts`
+- Modify: `packages/api-core/src/service.ts`
+- Modify: `packages/api-core/src/index.ts`
+- Test: `packages/api-core/src/vector/vectorlite-store.test.ts`
+- Test: `packages/api-core/src/service.test.ts`
+
+**Approach:**
+- Introduce a dedicated vector-store abstraction that owns:
+  - extension loading
+  - sidecar file lifecycle
+  - persistent HNSW virtual table creation
+  - upsert/delete/query by `thread_id`
+- Use a single active vector table per repo with `rowid = thread_id`.
+- Keep vector values in the sidecar store rather than in the main DB as JSON.
+- Keep enough metadata in the main DB to detect whether a thread’s stored vector is current without needing to inspect vector payloads.
+- Make vector store open/health behavior explicit so `doctor` and runtime commands can report extension problems cleanly.
+
+**Execution note:** Build this behind a narrow interface first; do not spread raw vectorlite SQL across service methods.
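
One possible shape for the narrow interface mentioned in the execution note is sketched below. The names `VectorStore`, `VectorNeighbor`, and `InMemoryVectorStore` are hypothetical, and the in-memory class is a brute-force cosine stand-in meant for tests, not a `vectorlite` integration; a real implementation would sit behind the same interface.

```typescript
// Hypothetical narrow interface; method names are assumptions for illustration.
interface VectorNeighbor {
  threadId: number;
  score: number;
}

interface VectorStore {
  upsert(threadId: number, vector: number[]): void;
  delete(threadId: number): void;
  queryNearest(vector: number[], limit: number): VectorNeighbor[];
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Exact-scan test double: one vector per thread id, so upsert replaces in place.
class InMemoryVectorStore implements VectorStore {
  private readonly vectors = new Map<number, number[]>();

  upsert(threadId: number, vector: number[]): void {
    this.vectors.set(threadId, vector.slice());
  }

  delete(threadId: number): void {
    this.vectors.delete(threadId);
  }

  queryNearest(vector: number[], limit: number): VectorNeighbor[] {
    return Array.from(this.vectors.entries())
      .map(([threadId, stored]) => ({ threadId, score: cosine(vector, stored) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, limit);
  }
}

const store: VectorStore = new InMemoryVectorStore();
store.upsert(1, [1, 0]);
store.upsert(2, [0, 1]);
store.upsert(3, [0.9, 0.1]);
console.log(store.queryNearest([1, 0], 2).map((n) => n.threadId)); // thread 1 first, then thread 3
```

Service code that depends only on `VectorStore` can be exercised against the stand-in in unit tests, while the sidecar-backed implementation is covered separately by reopen and extension-loading tests.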
+
+**Patterns to follow:**
+- `packages/api-core/src/service.ts` current `clusterExperiment()` vectorlite extension-loading pattern
+- `packages/api-core/src/db/sqlite.ts` SQLite open/pragma helpers
+
+**Test scenarios:**
+- Happy path: a vector store can create/open a repo sidecar and persist vectors across connection reopen.
+- Happy path: querying nearest neighbors returns the expected thread ids after reopen, without rebuilding the index.
+- Edge case: deleting a thread vector removes it from later ANN results.
+- Edge case: updating an existing thread vector replaces the old vector rather than duplicating it.
+- Error path: missing vectorlite extension surfaces an actionable error that `doctor` can report.
+- Integration: service helpers can upsert/query/delete vectors through the abstraction without touching temp experiment paths.
+
+**Verification:**
+- A migrated repo can reopen its vector store and serve neighbor queries without loading the entire corpus into process memory.
+
+- [ ] **Unit 3: Migrate summaries and embeddings to one active vector per thread**
+
+**Goal:** Rebuild the content pipeline around one active vector per thread, explicit 1024-dimensional embeddings, and summary-aware stale detection.
+
+**Requirements:** R10, R11, R12, R13, R14, R15, R16, R17, R18, R30, R31
+
+**Dependencies:** Unit 1, Unit 2
+
+**Files:**
+- Modify: `packages/api-core/src/openai/provider.ts`
+- Modify: `packages/api-core/src/service.ts`
+- Modify: `packages/api-core/src/db/migrate.ts`
+- Modify: `packages/api-core/src/search/exact.ts`
+- Test: `packages/api-core/src/service.test.ts`
+- Test: `packages/api-core/src/openai/provider.test.ts` (or add a new targeted provider/unit test file if one does not exist yet)
+
+**Approach:**
+- Add explicit embedding dimensions support to the provider and pass `dimensions=1024` for `text-embedding-3-large`.
+- Replace the current multi-source embedding workset with one active vector input per thread derived from the configured basis.
+- Keep summary generation as a prerequisite only when the chosen basis requires it.
+- Extend summary stale detection so it remains keyed by content hash and summary model, and add any prompt/pipeline version metadata needed to invalidate old summaries intentionally.
+- Add vector metadata keyed by thread id so the system can decide whether a thread needs re-embedding without reading the sidecar vector value.
+- Ensure upgraded repos treat legacy vectors as stale and rebuild them on the next refresh.
+
+**Execution note:** Characterization-first around summary skipping and embedding workset selection will reduce migration regressions here.
+
+**Patterns to follow:**
+- `packages/api-core/src/service.ts` current `summarizeRepository()` skip-on-content-hash behavior
+- `packages/api-core/src/service.ts` current `embedRepository()` batching/progress patterns
+- `packages/api-core/src/openai/provider.ts` retry behavior and model call centralization
+
+**Test scenarios:**
+- Happy path: unchanged summary input under the same summary model is skipped on repeated runs.
+- Happy path: basis=`title+original` produces vectors without requiring summaries.
+- Happy path: basis=`title+summary` triggers summarize first when dedupe summaries are missing or stale.
+- Edge case: changing summary model marks summary-derived vectors stale.
+- Edge case: changing embedding basis marks vectors stale even when summaries remain current.
+- Edge case: pre-release `document_embeddings` rows are ignored as current vectors after migration.
+- Error path: embedding provider failures do not falsely mark vectors current.
+- Integration: a repo upgraded from the old schema rebuilds vectors to 1024 dimensions on first eligible refresh.
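
The stale-detection scenarios above can be expressed as a comparison between a stored stamp and the active configuration. The sketch below is an assumption-laden illustration: `PipelineStamp`, its fields, and `classifyVectorState` are hypothetical names, and treating the summary model as relevant only for the `title_summary` basis is one reading of the edge cases listed here, not a confirmed design.

```typescript
// Hypothetical compatibility stamp; field names are assumptions for illustration.
type EmbeddingBasis = 'title_original' | 'title_summary';

interface PipelineStamp {
  pipelineVersion: number;
  summaryModel: string;
  embeddingBasis: EmbeddingBasis;
  embedModel: string;
  dimensions: number;
}

type VectorState = 'migration-pending' | 'stale' | 'current';

// A repo with no stored stamp has never been migrated. Otherwise the vectors
// are current only if every input that shaped them still matches the config.
function classifyVectorState(stored: PipelineStamp | null, active: PipelineStamp): VectorState {
  if (stored === null) return 'migration-pending';
  const vectorInputsMatch =
    stored.pipelineVersion === active.pipelineVersion &&
    stored.embeddingBasis === active.embeddingBasis &&
    stored.embedModel === active.embedModel &&
    stored.dimensions === active.dimensions;
  // The summary model only affects vectors that were derived from summaries.
  const summaryMatches =
    active.embeddingBasis !== 'title_summary' || stored.summaryModel === active.summaryModel;
  return vectorInputsMatch && summaryMatches ? 'current' : 'stale';
}

const active: PipelineStamp = {
  pipelineVersion: 1,
  summaryModel: 'gpt-5-mini',
  embeddingBasis: 'title_summary',
  embedModel: 'text-embedding-3-large',
  dimensions: 1024,
};

console.log(classifyVectorState(null, active)); // prints "migration-pending"
console.log(classifyVectorState({ ...active, summaryModel: 'gpt-5.4-mini' }, active)); // prints "stale"
console.log(classifyVectorState({ ...active }, active)); // prints "current"
```

A single function like this keeps the "current, stale, or migration-pending" answer deterministic and easy to cover with table-driven tests.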
+
+**Verification:**
+- After migration, each active thread has exactly one current vector metadata row and one current sidecar vector entry for the chosen basis.
+
+- [ ] **Unit 4: Move semantic search, neighbors, and cluster building onto vectorlite**
+
+**Goal:** Make persistent vectorlite the supported engine for semantic lookup and cluster construction.
+
+**Requirements:** R3, R4, R5, R6, R19, R22, R32
+
+**Dependencies:** Unit 2, Unit 3
+
+**Files:**
+- Modify: `packages/api-core/src/service.ts`
+- Modify: `packages/api-core/src/api/server.ts`
+- Modify: `packages/api-core/src/cluster/build.ts`
+- Modify: `packages/api-contract/src/contracts.ts`
+- Test: `packages/api-core/src/service.test.ts`
+- Test: `packages/api-core/src/api/server.test.ts`
+- Test: `packages/api-core/src/cluster/perf.integration.ts`
+
+**Approach:**
+- Replace semantic search candidate generation with ANN neighbor lookup from the persistent vector store.
+- Replace `neighbors` lookup with vector store queries rather than exact local scans.
+- Rebuild cluster edge generation from vector store neighbor queries instead of loading the full embedding corpus into RAM.
+- Make service methods return clear “migration pending / vectors stale / clusters stale” responses when the repo has not completed the required rebuild yet.
+- Keep cluster persistence and local close-state behavior aligned with the current cluster run model unless the implementation uncovers a simpler equivalent.
+
+**Execution note:** Start with search/neighbors characterization coverage so the ANN transition preserves the public contract while changing the engine underneath.
+
+**Patterns to follow:**
+- `packages/api-core/src/service.ts` existing `searchRepository()`, `listNeighbors()`, and cluster persistence flow
+- `packages/api-contract/src/contracts.ts` current JSON contract style
+- `packages/api-core/src/cluster/perf.integration.ts` current perf comparison harness
+
+**Test scenarios:**
+- Happy path: semantic search on a migrated repo returns ANN-backed hits without loading exact embeddings.
+- Happy path: neighbors for a migrated thread come from the persistent vector store.
+- Happy path: cluster rebuild on a migrated repo uses vectorlite and persists clusters successfully.
+- Edge case: repo marked migration-pending returns a clear stale-state error or status rather than misleading old clusters.
+- Edge case: closed/deleted threads are excluded from later neighbor and cluster results after vector-store cleanup.
+- Integration: API server endpoints for search and neighbors continue returning valid contracts after the backend swap.
+- Integration: perf harness can compare exact legacy-style behavior versus persistent vectorlite behavior on synthetic fixtures.
+
+**Verification:**
+- Semantic queries and cluster builds no longer require loading the full embedding corpus from JSON rows into process memory.
+
+- [ ] **Unit 5: Redefine refresh and add operator configuration/health UX**
+
+**Goal:** Make the operator experience understandable and safe during and after the migration.
+
+**Requirements:** R17, R18, R19, R20, R21, R22, R23, R24, R25, R26, R29
+
+**Dependencies:** Unit 1, Unit 3, Unit 4
+
+**Files:**
+- Modify: `apps/cli/src/main.ts`
+- Modify: `apps/cli/src/main.test.ts`
+- Modify: `apps/cli/src/init-wizard.ts`
+- Modify: `apps/cli/src/tui/app.ts`
+- Modify: `packages/api-core/src/service.ts`
+- Modify: `packages/api-contract/src/contracts.ts`
+- Test: `apps/cli/src/main.test.ts`
+- Test: `packages/api-core/src/service.test.ts`
+
+**Approach:**
+- Add `ghcrawl configure`:
+  - no flags: show current summary model, embedding basis, vector readiness, and migration state
+  - explicit flags: change summary model and embedding basis intentionally
+- Extend `doctor` to report `vectorlite` loadability and any migration readiness signals worth surfacing globally.
+- Redefine `refresh` so it can automatically run summarize when the active basis requires it, while preserving the one-command operator experience.
+- Update TUI status text where needed so stale vectors/clusters are visible and not mistaken for current data.
+- Preserve existing spend progress messaging and expand it where the first post-upgrade refresh may trigger a large summary rebuild.
+
+**Patterns to follow:**
+- `apps/cli/src/main.ts` current command/help formatting and doctor output patterns
+- `apps/cli/src/init-wizard.ts` current config write and operator guidance patterns
+- `apps/cli/src/tui/app.ts` current repo status/header conventions
+
+**Test scenarios:**
+- Happy path: `configure` with no flags reports the active summary model, embedding basis, and migration status.
+- Happy path: `configure` updates the summary model and persists it.
+- Happy path: refresh on a migration-pending repo runs summarize before embed when basis=`title+summary`.
+- Edge case: refresh on basis=`title+original` skips summarize unless the user explicitly requests summary work.
+- Edge case: doctor reports vectorlite readiness failures cleanly when the extension cannot load.
+- Error path: user switches summary model and is told the repo now needs rebuild work rather than being shown stale current state. +- Integration: TUI and CLI both surface repo migration/stale status consistently after settings changes. + +**Verification:** +- Operators can discover current settings and complete the first migration refresh without consulting source code or guessing which hidden command to run. + +- [ ] **Unit 6: Update docs, release notes, and upgrade validation** + +**Goal:** Make the release understandable to operators and safe to ship. + +**Requirements:** R1, R27, R28, R29, R33, R34 + +**Dependencies:** Units 1-5 + +**Files:** +- Modify: `README.md` +- Modify: `apps/cli/README.md` +- Modify: `CONTRIBUTING.md` +- Modify: `docs/DESIGN.md` +- Modify: `docs/PLAN.md` +- Modify: `.github/workflows/ci.yml` +- Modify: `.github/workflows/publish.yml` (if packaging validation needs to change) +- Test: `packages/api-core/src/cluster/perf.integration.ts` +- Test expectation: none -- docs files themselves do not need behavioral tests, but the release validation steps must be represented in automated checks where practical + +**Approach:** +- Update operator docs to explain: + - `vectorlite` is now required + - first refresh after upgrade may trigger a one-time rebuild + - `gpt-5-mini` vs `gpt-5.4-mini` tradeoffs and costs + - semantic search now depends on the persistent vector index +- Publish a concrete upgrade story for pre-vectorlite users. +- Extend CI/package validation enough to catch missing `vectorlite` packaging or install regressions on supported platforms. +- Preserve or refine the existing cluster perf workflow so future PRs can still compare the vector path meaningfully. 
+ +**Patterns to follow:** +- current README operator sections +- current CI/package smoke workflows +- current cluster perf workflow/comment structure + +**Test scenarios:** +- Happy path: package smoke or install validation confirms `vectorlite` is present and the CLI still launches on supported platforms. +- Edge case: upgrade validation from a pre-vectorlite DB shows first refresh drives the required rebuild path without manual DB surgery. +- Integration: CI continues to publish meaningful cluster perf comparisons after the migration lands. + +**Verification:** +- A maintainer can follow the docs to understand migration cost, migration steps, and expected behavior on first refresh. + +## System-Wide Impact + +- **Interaction graph:** `sync`, `summarize`, `embed`, `search`, `neighbors`, `cluster`, `refresh`, `doctor`, `configure`, TUI status surfaces, and package/install flows all participate in this release. +- **Error propagation:** vector store open/load failures must surface as health/configuration errors, not as silent empty search results. +- **State lifecycle risks:** the main DB and vector sidecar must stay logically synchronized when threads close, content changes, configs change, or refresh is interrupted mid-run. +- **API surface parity:** CLI and local HTTP API should report migration state consistently for search- and cluster-dependent commands. +- **Integration coverage:** upgrade-from-old-DB scenarios, vector store reopen scenarios, and refresh-after-setting-change scenarios need integration coverage beyond isolated unit tests. +- **Unchanged invariants:** GitHub sync remains the source of truth for thread metadata and local close-state behavior remains intact; this plan changes vector/search infrastructure, not the core sync contract. 
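The error-propagation requirement above (vector store failures must not look like empty search results) can be illustrated with a small sketch. Everything in it is hypothetical: `SearchOutcome`, `VectorIndex`, and `searchWithHealth` are names invented for this example and are not taken from the ghcrawl source or the `vectorlite` API. It only shows the shape of the idea, which is that a failed open or query yields a distinct, reportable outcome instead of an empty hit list.

```typescript
// Hypothetical sketch: none of these names come from ghcrawl or vectorlite.
type VectorIndex = { query: (text: string) => number[] };

type SearchOutcome =
  | { kind: 'ok'; hits: number[] }
  | { kind: 'vector_store_unavailable'; reason: string };

function searchWithHealth(openIndex: () => VectorIndex, text: string): SearchOutcome {
  try {
    // Opening the index may throw, for example if a native extension cannot load.
    const index = openIndex();
    return { kind: 'ok', hits: index.query(text) };
  } catch (error) {
    // Report the failure explicitly rather than returning an empty result set.
    const reason = error instanceof Error ? error.message : String(error);
    return { kind: 'vector_store_unavailable', reason };
  }
}
```

A caller such as a CLI command or HTTP handler could then map `vector_store_unavailable` to a health or configuration error, while `ok` with zero hits remains a genuine "no matches" answer.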
+ +## Risks & Dependencies + +| Risk | Mitigation | +|------|------------| +| `vectorlite` native packaging breaks on one or more supported platforms | Add package smoke and install validation, surface `doctor` readiness clearly, and keep sidecar/vector-store code isolated so failures are diagnosable | +| Old cluster runs appear current after upgrade | Add explicit repo pipeline state and invalidate cluster freshness when pipeline compatibility changes | +| Summary-derived embedding migration silently re-summarizes too much and causes unnecessary spend | Preserve content-hash-based skipping, add model/basis-aware stale detection, and keep spend progress reporting visible during summarize/refresh | +| Main DB and vector sidecar drift out of sync | Centralize vector upsert/delete through one abstraction and update pipeline state only after successful writes | +| Search quality regresses if vectorlite and summary-basis defaults are not aligned with current learnings | Use the recent internal clustering solution doc as the default rationale and keep perf/quality comparison harnesses intact | + +## Documentation / Operational Notes + +- Document the one-time cost estimate for ~20k-thread full summarization with both summary models. +- Document that the first post-upgrade refresh may take meaningfully longer than future refreshes because it performs migration work. +- Document that semantic search and nearest-neighbor features are only trustworthy after repo migration completes. +- Include a short maintainer release note calling out the breaking nature of the vector/search migration. 
+ +## Sources & References + +- **Origin document:** `docs/brainstorms/2026-04-01-vectorlite-default-search-and-summary-migration-requirements.md` +- Related code: + - `packages/api-core/src/service.ts` + - `packages/api-core/src/openai/provider.ts` + - `packages/api-core/src/config.ts` + - `packages/api-core/src/db/migrate.ts` + - `packages/api-contract/src/contracts.ts` + - `apps/cli/src/main.ts` +- Institutional learning: + - `docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md` +- External docs: + - + - + - + - diff --git a/docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md b/docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md new file mode 100644 index 0000000..78151df --- /dev/null +++ b/docs/solutions/performance-issues/clustering-vectorlite-hnsw-embedding-optimization-2026-03-30.md @@ -0,0 +1,116 @@ +--- +title: "Clustering optimization: dedupe_summary embeddings with vectorlite HNSW outperform exact kNN" +date: 2026-03-30 +category: performance-issues +module: clustering +problem_type: performance_issue +component: tooling +symptoms: + - "Original exact kNN clustering produced ungoverned 455-member mega-clusters" + - "Only 31.2% of threads ended up in multi-member clusters" + - "Raw title+body embeddings included template boilerplate noise" +root_cause: missing_tooling +resolution_type: tooling_addition +severity: medium +tags: + - clustering + - embeddings + - vectorlite + - hnsw + - llm-as-judge + - prompt-optimization + - dedupe-summary +--- + +# Clustering optimization: dedupe_summary embeddings with vectorlite HNSW outperform exact kNN + +## Problem + +GitCrawl clusters ~18,500 GitHub issues/PRs by embedding similarity to identify duplicates and related threads. The original pipeline used exact k-nearest-neighbor search on title+body embeddings with unbounded clustering. 
This produced 4.73/5 coherence but ungoverned cluster growth (max 455 members), only 31.2% coverage, and 1.70% outlier rate. The goal was to improve coherence, control cluster sizes, and increase coverage. + +## Approach + +### Phase 1: Summarization Prompt Optimization + +Tested 11 prompt variants for summarizing issue/PR content before embedding. An LLM-as-judge (gpt-5-mini) scored each on boilerplate removal, signal density, and clustering suitability (1-5 scale). + +**Winner: `v5-component-focused` (4.97/5 vs baseline 2.65/5).** Key insight: explicit component-first structure (e.g., "Discord gateway: connection drops on resume") clusters far better than generic summaries. Full summarization of 18.5k threads cost $26 using gpt-5.4-mini. + +### Phase 2: Clustering Experiments + +Sixteen configurations tested across four dimensions: + +- **Embedding sources**: title, body, dedupe_summary (optimized summaries), and combinations +- **Search backend**: Vectorlite HNSW approximate nearest-neighbor vs exact kNN +- **Score aggregation**: max, mean, weighted, min-of-2, boost +- **Parameters**: similarity threshold (0.75-0.88), neighbor count k (6, 12), max cluster size (200, 400) + +Clustering used **size-bounded Union-Find**: edges sorted by descending score, merges refused when exceeding maxSize cap. + +**Evaluation** used LLM-as-judge with stratified sampling: 30 clusters (10 large, 10 mid, 10 small) scored for coherence, plus 15 singletons scored for false-negative detection. 
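The size-bounded Union-Find merge described in Phase 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Edge` shape, the `boundedClusters` function, and its parameters are hypothetical names chosen for the example. It processes edges from highest to lowest score and refuses any merge that would push a component past `maxSize`.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the ghcrawl API.
type Edge = { left: number; right: number; score: number };

function boundedClusters(ids: number[], edges: Edge[], maxSize: number): number[][] {
  const parent = new Map<number, number>();
  const size = new Map<number, number>();
  for (const id of ids) {
    parent.set(id, id);
    size.set(id, 1);
  }

  // Iterative find with path halving, so deep chains cannot overflow the stack.
  const find = (value: number): number => {
    let current = value;
    while (parent.get(current) !== current) {
      const up = parent.get(current) as number;
      const grandparent = parent.get(up) as number;
      parent.set(current, grandparent);
      current = grandparent;
    }
    return current;
  };

  // Strongest edges first, so the cap discards only the weaker connections.
  const sorted = edges.slice().sort((a, b) => b.score - a.score);
  for (const edge of sorted) {
    if (!parent.has(edge.left) || !parent.has(edge.right)) continue; // ignore unknown ids
    const leftRoot = find(edge.left);
    const rightRoot = find(edge.right);
    if (leftRoot === rightRoot) continue; // already in the same component
    const merged = (size.get(leftRoot) as number) + (size.get(rightRoot) as number);
    if (merged > maxSize) continue; // refuse merges that would exceed the cap
    parent.set(rightRoot, leftRoot);
    size.set(leftRoot, merged);
  }

  // Group ids by their final root and return the largest clusters first.
  const byRoot = new Map<number, number[]>();
  for (const id of ids) {
    const root = find(id);
    const list = byRoot.get(root) ?? [];
    list.push(id);
    byRoot.set(root, list);
  }
  return Array.from(byRoot.values()).sort((a, b) => b.length - a.length);
}
```

With a chain of four threads and `maxSize` of 2, the middle edge is refused and the chain splits into two pairs. Raising the cap to 4 lets all four merge into one cluster.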
+ +## Results + +### Baselines + +| Configuration | Sources | Backend | Mode | Coherence | Multi% | MaxSz | AvgSz | Outlier% | Duration | +|---|---|---|---|---|---|---|---|---|---| +| **Original** | title+body | exact kNN | basic (unbounded) | 4.73 | 31.2% | 455 | 3.52 | 1.70% | 800s | +| All sources, max agg | all 3 | vectorlite | bounded (200) | 4.62 | 49.5% | 200 | 4.64 | 2.07% | 180s | + +### Key Experiments (vectorlite HNSW, bounded mode, maxSize=200) + +| Experiment | Sources | Aggregation | Threshold | Coherence | Multi% | Outlier% | Takeaway | +|---|---|---|---|---|---|---|---| +| **source-dedupe-only** | dedupe_summary | max | 0.82 | **4.93** | 44.6% | **0.85%** | **Recommended.** Best coherence at reasonable coverage. | +| agg-min-of-2 | all 3 | min-of-2 | 0.82 | 4.97 | 23.8% | 0.67% | Highest coherence but low coverage. Precision champion. | +| param-high-threshold | all 3 | max | 0.88 | 5.00 | 14.4% | 0.00% | Perfect coherence, too conservative for general use. | +| agg-boost | all 3 | boost | 0.82 | 4.85 | 49.5% | 0.94% | Best multi-source option if more coverage needed. | +| source-body-dedupe | body+dedupe | max | 0.82 | 4.89 | 48.8% | 1.47% | Adding body helps coverage slightly, hurts coherence. | +| param-low-threshold | dedupe_summary | max | 0.75 | 4.77 | 77.4% | 0.96% | High coverage but coherence drops and clusters get large (avg 11.7). | +| baseline-all-max | all 3 | max | 0.82 | 4.62 | 49.5% | 2.07% | Adding dedupe_summary to max agg made things *worse*. | + +### What the Columns Mean + +- **Coherence** (1-5): LLM judge score for how well cluster members relate. Stratified sample of 30 clusters. +- **Multi%**: Percentage of threads in multi-member clusters (coverage). +- **Outlier%**: Percentage of cluster members judged as not belonging. 
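As a concrete reading of the Multi% and MaxSz columns, the sketch below computes both from a list of clusters. It is an assumption-labeled illustration only: `Cluster` and `coverageStats` are hypothetical names, and the write-up does not state that the reported figures were produced by code of this shape. Coherence and Outlier% are not computed here because they depend on the LLM judge.

```typescript
// Hypothetical helper: illustrates the column definitions, not the actual evaluation code.
type Cluster = { members: number[] };

function coverageStats(clusters: Cluster[]): { multiPercent: number; maxSize: number } {
  let totalThreads = 0;
  let multiThreads = 0;
  let maxSize = 0;
  for (const cluster of clusters) {
    const memberCount = cluster.members.length;
    totalThreads += memberCount;
    // Only threads in clusters with two or more members count toward coverage.
    if (memberCount > 1) multiThreads += memberCount;
    if (memberCount > maxSize) maxSize = memberCount;
  }
  // Guard against division by zero when there are no threads at all.
  const multiPercent = totalThreads === 0 ? 0 : (100 * multiThreads) / totalThreads;
  return { multiPercent, maxSize };
}
```

For example, clusters of sizes 3, 2, 1, 1, and 1 cover 8 threads, of which 5 sit in multi-member clusters, so Multi% is 62.5 and MaxSz is 3.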
+ +## Recommended Configuration + +**`source-dedupe-only` with Vectorlite HNSW, threshold 0.82, maxSize 200.** + +- **+0.20 coherence** over original (4.93 vs 4.73) +- **+13.4 points of coverage** (44.6% vs 31.2% multi-member) +- **Half the outlier rate** (0.85% vs 1.70%) +- **~15x faster** (55s vs 800s) +- **Simplest**: single embedding source, no aggregation complexity +- **Controlled cluster sizes**: max 200 vs unbounded 455 + +## Key Learnings + +1. **Summarization prompt quality is the biggest lever.** The prompt improvement (2.65 to 4.97 judge score) drove more quality gain than any clustering algorithm change. Good embeddings matter more than clever aggregation. + +2. **More sources do not mean better clusters.** Naive multi-source max aggregation (4.62) was *worse* than single-source dedupe-summary (4.93). Title and body embeddings introduce noise that dilutes the optimized summary signal. + +3. **Multi-source only helps with strict aggregation.** The only multi-source configs that beat single-source used min-of-2 or high thresholds -- essentially filtering out noise from weaker sources. Added complexity for marginal gain. + +4. **HNSW approximate search outperforms exact kNN in practice.** The approximate search found ~2x more edges because it casts a wider net. This produced better clusters, not worse, while being ~15x faster. + +5. **Size-bounded Union-Find is essential.** The original system's largest cluster (455 members) was incoherent. Capping at 200 with score-ordered merging ensures the best edges are used first. + +6. **Mid-range and small clusters are consistently perfect (5.0).** Quality issues concentrate in the largest clusters. The top_by_size bucket is the discriminator between configs.
+ +## Future Work + +- **Threshold tuning per component**: Different areas may cluster at different similarity levels +- **Hierarchical clustering**: Tight clusters first (0.88), then looser grouping (0.78) for topic organization +- **Coverage gap analysis**: 55.4% of threads remain singletons -- sampling these would quantify false-negative rate +- **Incremental updates**: Delta-based matching against existing cluster centroids instead of full rebuild + +## Related + +- `docs/DESIGN.md` -- Original architecture describing exact cosine similarity kNN approach +- `docs/PLAN.md` -- Phase 4 (Embeddings) and Phase 5 (OpenSearch Evaluation) +- `.context/compound-engineering/ce-optimize/embedding-clustering/` -- Raw experiment results (16 JSON files) +- `.context/compound-engineering/ce-optimize/summary-prompt/` -- Prompt optimization results (11 variants) diff --git a/package.json b/package.json index 67bacc8..599ad15 100644 --- a/package.json +++ b/package.json @@ -33,6 +33,11 @@ "serve": "node ./apps/cli/bin/ghcrawl.js serve", "project:sync": "node ./.agents/skills/project-manager/scripts/sync-work-items.mjs", "test:cluster-perf": "pnpm --filter @ghcrawl/api-core test:cluster-perf", + "perf:cluster:large": "pnpm --filter @ghcrawl/api-core build && node ./scripts/cluster-perf-large-compare.mjs", + "perf:cluster:real": "pnpm --filter @ghcrawl/api-core build && node ./scripts/cluster-perf-real-compare.mjs", + "perf:cluster:population": "pnpm --filter @ghcrawl/api-core build && node ./scripts/cluster-population-compare.mjs", + "perf:cluster:topology": "pnpm --filter @ghcrawl/api-core build && node ./scripts/cluster-topology-compare.mjs", + "perf:cluster:refine": "pnpm --filter @ghcrawl/api-core build && node ./scripts/cluster-refine-component.mjs", "pack:smoke": "node ./scripts/pack-smoke.mjs", "release:metadata": "node ./scripts/release-metadata.mjs", "release:apply-version": "node ./scripts/apply-release-version.mjs", diff --git a/packages/api-core/package.json 
b/packages/api-core/package.json index f644160..6db0773 100644 --- a/packages/api-core/package.json +++ b/packages/api-core/package.json @@ -46,13 +46,14 @@ }, "dependencies": { "@ghcrawl/api-contract": "workspace:*", - "@shutterstock/p-map-iterable": "^1.1.2", "@octokit/plugin-retry": "^8.0.3", "@octokit/plugin-throttling": "^11.0.1", - "octokit": "^5.0.3", - "better-sqlite3": "^12.8.0", + "@shutterstock/p-map-iterable": "^1.1.2", + "better-sqlite3": "^12.2.0", "dotenv": "^17.2.2", + "octokit": "^5.0.3", "openai": "^6.33.0", + "vectorlite": "^0.2.0", "zod": "^4.3.6" } } diff --git a/packages/api-core/src/api/server.test.ts b/packages/api-core/src/api/server.test.ts index 0661f04..869e189 100644 --- a/packages/api-core/src/api/server.test.ts +++ b/packages/api-core/src/api/server.test.ts @@ -21,6 +21,8 @@ test('health endpoint returns contract payload', async () => { openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, @@ -71,6 +73,8 @@ test('neighbors endpoint returns contract payload', async () => { openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, @@ -163,6 +167,8 @@ test('threads endpoint can filter by a bulk number list', async () => { openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, @@ -232,6 +238,8 @@ test('author-threads endpoint returns one author with strongest same-author matc openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 
8, embedConcurrency: 10, embedMaxUnread: 20, @@ -311,6 +319,8 @@ test('close-thread and includeClosed thread routes expose locally closed items', openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, @@ -394,6 +404,8 @@ test('server returns 400 for malformed request inputs', async () => { openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, @@ -448,6 +460,8 @@ test('cluster summary and detail endpoints return contract payloads', async () = openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 8, embedConcurrency: 10, embedMaxUnread: 20, diff --git a/packages/api-core/src/cluster/build.ts b/packages/api-core/src/cluster/build.ts index e585305..b956fa1 100644 --- a/packages/api-core/src/cluster/build.ts +++ b/packages/api-core/src/cluster/build.ts @@ -12,30 +12,60 @@ type Node = { class UnionFind { private readonly parent = new Map(); + private readonly size = new Map(); add(value: number): void { - if (!this.parent.has(value)) this.parent.set(value, value); + if (!this.parent.has(value)) { + this.parent.set(value, value); + this.size.set(value, 1); + } } find(value: number): number { - const parent = this.parent.get(value); + let parent = this.parent.get(value); if (parent === undefined) { this.parent.set(value, value); + this.size.set(value, 1); return value; } - if (parent === value) return value; - const root = this.find(parent); - this.parent.set(value, root); - return root; + // Iterative path-finding to avoid stack overflow on deep chains + let current: number = value; + while (parent !== current) { + const 
grandparent: number = this.parent.get(parent) ?? parent; + this.parent.set(current, grandparent); // path splitting + current = parent; + parent = grandparent; + } + return current; } union(left: number, right: number): void { const leftRoot = this.find(left); const rightRoot = this.find(right); if (leftRoot !== rightRoot) { + const leftSize = this.size.get(leftRoot) ?? 1; + const rightSize = this.size.get(rightRoot) ?? 1; this.parent.set(rightRoot, leftRoot); + this.size.set(leftRoot, leftSize + rightSize); } } + + /** Merge only if the combined component would not exceed maxSize. Returns true if merged. */ + unionBounded(left: number, right: number, maxSize: number): boolean { + const leftRoot = this.find(left); + const rightRoot = this.find(right); + if (leftRoot === rightRoot) return true; // already same component + const leftSize = this.size.get(leftRoot) ?? 1; + const rightSize = this.size.get(rightRoot) ?? 1; + if (leftSize + rightSize > maxSize) return false; + this.parent.set(rightRoot, leftRoot); + this.size.set(leftRoot, leftSize + rightSize); + return true; + } + + getSize(value: number): number { + return this.size.get(this.find(value)) ?? 1; + } } export function buildClusters(nodes: Node[], edges: SimilarityEdge[]): Array<{ representativeThreadId: number; members: number[] }> { @@ -51,6 +81,169 @@ export function buildClusters(nodes: Node[], edges: SimilarityEdge[]): Array<{ r byRoot.set(root, list); } + return formatClusters(nodes, edges, byRoot); +} + +/** + * Build clusters with size-bounded Union-Find. + * + * Process edges from highest to lowest score, merging components only when + * the combined size stays within `maxClusterSize`. Strongest connections are + * preserved; weaker edges that would create oversized clusters are skipped. + * This avoids the "threshold raising" problem where splitting mega-clusters + * creates many solos. 
+ */ +export function buildSizeBoundedClusters( + nodes: Node[], + edges: SimilarityEdge[], + options: { maxClusterSize: number }, +): Array<{ representativeThreadId: number; members: number[] }> { + const uf = new UnionFind(); + for (const node of nodes) uf.add(node.threadId); + + // Sort edges by score descending — strongest connections first + const sortedEdges = [...edges].sort((a, b) => b.score - a.score); + const keptEdges: SimilarityEdge[] = []; + + for (const edge of sortedEdges) { + if (uf.unionBounded(edge.leftThreadId, edge.rightThreadId, options.maxClusterSize)) { + keptEdges.push(edge); + } + } + + const byRoot = new Map(); + for (const node of nodes) { + const root = uf.find(node.threadId); + const list = byRoot.get(root) ?? []; + list.push(node.threadId); + byRoot.set(root, list); + } + + return formatClusters(nodes, keptEdges, byRoot); +} + +/** + * Build clusters with iterative refinement of oversized components. + * + * 1. Run Union-Find at the base threshold (edges already filtered by minScore). + * 2. For any cluster above `maxClusterSize`, re-cluster its members using only + * edges above a progressively higher threshold (raised by `refineStep` each + * iteration) until all clusters are within limits or threshold reaches 1.0. + */ +export function buildRefinedClusters( + nodes: Node[], + edges: SimilarityEdge[], + options: { maxClusterSize: number; refineStep: number }, +): Array<{ representativeThreadId: number; members: number[] }> { + const nodesById = new Map(nodes.map((node) => [node.threadId, node])); + const result: Array<{ representativeThreadId: number; members: number[] }> = []; + + // Initial Union-Find pass + const uf = new UnionFind(); + for (const node of nodes) uf.add(node.threadId); + for (const edge of edges) uf.union(edge.leftThreadId, edge.rightThreadId); + + const byRoot = new Map(); + for (const node of nodes) { + const root = uf.find(node.threadId); + const list = byRoot.get(root) ?? 
[]; + list.push(node.threadId); + byRoot.set(root, list); + } + + // Build adjacency list for O(E) iteration instead of O(n²) pair scanning + const adjacency = new Map(); + for (const edge of edges) { + let list = adjacency.get(edge.leftThreadId); + if (!list) { list = []; adjacency.set(edge.leftThreadId, list); } + list.push(edge); + let rList = adjacency.get(edge.rightThreadId); + if (!rList) { rList = []; adjacency.set(edge.rightThreadId, rList); } + rList.push(edge); + } + + // Process each initial cluster + type WorkItem = { memberIds: number[]; currentThreshold: number }; + const workQueue: WorkItem[] = []; + + for (const members of byRoot.values()) { + if (members.length <= options.maxClusterSize) { + const clusterNodes = members.map((id) => nodesById.get(id)).filter((n): n is Node => n !== undefined); + const clusterEdges = edgesWithinSet(new Set(members), adjacency); + result.push(...formatClusters(clusterNodes, clusterEdges, new Map([[0, members]]))); + } else { + workQueue.push({ memberIds: members, currentThreshold: 0 }); + } + } + + // Iteratively refine oversized clusters + while (workQueue.length > 0) { + const item = workQueue.pop()!; + const newThreshold = item.currentThreshold + options.refineStep; + if (newThreshold >= 1.0) { + for (const memberId of item.memberIds) { + result.push({ representativeThreadId: memberId, members: [memberId] }); + } + continue; + } + + // Filter edges within this component to the higher threshold + const memberSet = new Set(item.memberIds); + const filteredEdges: SimilarityEdge[] = []; + for (const memberId of item.memberIds) { + for (const edge of adjacency.get(memberId) ?? []) { + const otherId = edge.leftThreadId === memberId ? 
edge.rightThreadId : edge.leftThreadId; + if (otherId > memberId && memberSet.has(otherId) && edge.score >= newThreshold) { + filteredEdges.push(edge); + } + } + } + + // Re-cluster with filtered edges + const subUf = new UnionFind(); + for (const memberId of item.memberIds) subUf.add(memberId); + for (const edge of filteredEdges) subUf.union(edge.leftThreadId, edge.rightThreadId); + + const subByRoot = new Map(); + for (const memberId of item.memberIds) { + const root = subUf.find(memberId); + const list = subByRoot.get(root) ?? []; + list.push(memberId); + subByRoot.set(root, list); + } + + for (const subMembers of subByRoot.values()) { + if (subMembers.length <= options.maxClusterSize) { + const clusterNodes = subMembers.map((id) => nodesById.get(id)).filter((n): n is Node => n !== undefined); + const clusterEdges = edgesWithinSet(new Set(subMembers), adjacency); + result.push(...formatClusters(clusterNodes, clusterEdges, new Map([[0, subMembers]]))); + } else { + workQueue.push({ memberIds: subMembers, currentThreshold: newThreshold }); + } + } + } + + return result.sort((left, right) => right.members.length - left.members.length); +} + +function edgesWithinSet(memberSet: Set, adjacency: Map): SimilarityEdge[] { + const edges: SimilarityEdge[] = []; + for (const memberId of memberSet) { + for (const edge of adjacency.get(memberId) ?? []) { + const otherId = edge.leftThreadId === memberId ? edge.rightThreadId : edge.leftThreadId; + if (otherId > memberId && memberSet.has(otherId)) { + edges.push(edge); + } + } + } + return edges; +} + +function formatClusters( + nodes: Node[], + edges: SimilarityEdge[], + byRoot: Map, +): Array<{ representativeThreadId: number; members: number[] }> { const edgeCounts = new Map(); for (const edge of edges) { edgeCounts.set(edge.leftThreadId, (edgeCounts.get(edge.leftThreadId) ?? 
0) + 1); diff --git a/packages/api-core/src/cluster/perf-large.json b/packages/api-core/src/cluster/perf-large.json new file mode 100644 index 0000000..abae01f --- /dev/null +++ b/packages/api-core/src/cluster/perf-large.json @@ -0,0 +1,31 @@ +{ + "schemaVersion": 1, + "fixture": { + "clusterCount": 96, + "threadsPerCluster": 16, + "clusterBlockWidth": 4, + "noiseDimensions": 32, + "assertExactClusterCount": false, + "sourceKinds": [ + "title", + "body", + "dedupe_summary" + ], + "k": 7, + "minScore": 0.82 + }, + "benchmark": { + "warmupRuns": 0, + "runsPerSample": 1, + "minSamples": 1, + "maxSamples": 1, + "maxTotalMs": 600000 + }, + "baseline": { + "fixtureMedianMs": 1800, + "projectedOpenclawMs": 600000 + }, + "thresholds": { + "maxRegressionPercent": 1000 + } +} diff --git a/packages/api-core/src/cluster/perf.integration.ts b/packages/api-core/src/cluster/perf.integration.ts index b8bcf5d..bce518f 100644 --- a/packages/api-core/src/cluster/perf.integration.ts +++ b/packages/api-core/src/cluster/perf.integration.ts @@ -16,6 +16,7 @@ type PerfBaseline = { threadsPerCluster: number; clusterBlockWidth: number; noiseDimensions: number; + assertExactClusterCount?: boolean; sourceKinds: EmbeddingSourceKind[]; k: number; minScore: number; @@ -37,8 +38,28 @@ type PerfBaseline = { }; type PerfRunResult = { + backend: 'exact' | 'vectorlite'; + timingBasis: 'cluster-only'; sampleDurationsMs: number[]; + totalSampleDurationsMs: number[]; + loadSampleDurationsMs: number[]; + setupSampleDurationsMs: number[]; + edgeBuildSampleDurationsMs: number[]; + indexBuildSampleDurationsMs: number[]; + querySampleDurationsMs: number[]; + clusterBuildSampleDurationsMs: number[]; + peakRssBytesSamples: number[]; + peakHeapUsedBytesSamples: number[]; medianMs: number; + totalMedianMs: number; + loadMedianMs: number; + setupMedianMs: number; + edgeBuildMedianMs: number; + indexBuildMedianMs: number; + queryMedianMs: number; + clusterBuildMedianMs: number; + medianPeakRssBytes: number; + 
medianPeakHeapUsedBytes: number; baselineMedianMs: number; deltaMs: number; deltaPercent: number; @@ -58,16 +79,42 @@ type SuggestedBaseline = { projectedOpenclawMs: number; }; -const BASELINE_PATH = fileURLToPath(new URL('./perf-baseline.json', import.meta.url)); +const DEFAULT_BASELINE_PATH = fileURLToPath(new URL('./perf-baseline.json', import.meta.url)); + +function getBaselinePath(): string { + const configuredPath = process.env.GHCRAWL_CLUSTER_PERF_CONFIG_PATH?.trim(); + return configuredPath ? path.resolve(configuredPath) : DEFAULT_BASELINE_PATH; +} function loadBaseline(): PerfBaseline { - return JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf8')) as PerfBaseline; + return JSON.parse(fs.readFileSync(getBaselinePath(), 'utf8')) as PerfBaseline; } function shouldBootstrapBaseline(): boolean { return process.env.GHCRAWL_CLUSTER_PERF_BOOTSTRAP === '1'; } +function shouldIgnoreRegressionThreshold(): boolean { + return process.env.GHCRAWL_CLUSTER_PERF_IGNORE_THRESHOLD === '1'; +} + +function getPerfBackend(): 'exact' | 'vectorlite' { + return process.env.GHCRAWL_CLUSTER_PERF_BACKEND === 'vectorlite' ? 
'vectorlite' : 'exact'; +} + +function assertBenchmarkShape( + result: { clusters: number; edges: number }, + baseline: PerfBaseline, + backend: 'exact' | 'vectorlite', +): void { + if (backend === 'exact' && baseline.fixture.assertExactClusterCount !== false) { + assert.equal(result.clusters, baseline.fixture.clusterCount); + } else { + assert.ok(result.clusters > 0); + } + assert.ok(result.edges > baseline.fixture.clusterCount); +} + function formatDurationMs(durationMs: number): string { if (!Number.isFinite(durationMs)) return 'n/a'; if (durationMs < 1000) { @@ -82,6 +129,14 @@ function formatDurationMs(durationMs: number): string { return `${minutes}m ${seconds.toFixed(1)}s`; } +function formatBytes(bytes: number): string { + if (!Number.isFinite(bytes)) return 'n/a'; + if (bytes < 1024 * 1024) { + return `${(bytes / 1024).toFixed(1)} KiB`; + } + return `${(bytes / (1024 * 1024)).toFixed(1)} MiB`; +} + function formatPercent(value: number): string { const sign = value > 0 ? '+' : ''; return `${sign}${value.toFixed(1)}%`; @@ -146,6 +201,8 @@ function createService(dbPath: string): GHCrawlService { openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 2, embedConcurrency: 2, embedMaxUnread: 4, @@ -279,27 +336,82 @@ function seedBenchmarkDatabase(dbPath: string, baseline: PerfBaseline): void { } } -async function runSingleCluster(dbPath: string, baseline: PerfBaseline): Promise<{ durationMs: number; clusters: number; edges: number }> { +async function runSingleCluster( + dbPath: string, + baseline: PerfBaseline, + backend: 'exact' | 'vectorlite', +): Promise<{ + durationMs: number; + totalDurationMs: number; + loadMs: number; + setupMs: number; + edgeBuildMs: number; + indexBuildMs: number; + queryMs: number; + clusterBuildMs: number; + peakRssBytes: number; + peakHeapUsedBytes: number; + clusters: number; + edges: number; +}> { const service 
= createService(dbPath); try { - const startedAt = performance.now(); - const result = await service.clusterRepository({ + // clusterExperiment may not exist on older branches (e.g. base worktree in CI) + if (typeof service.clusterExperiment !== 'function') { + const startedAt = performance.now(); + const result = await service.clusterRepository({ + owner: 'openclaw', + repo: 'openclaw', + k: baseline.fixture.k, + minScore: baseline.fixture.minScore, + }); + const durationMs = performance.now() - startedAt; + return { + durationMs, + totalDurationMs: durationMs, + loadMs: 0, + setupMs: 0, + edgeBuildMs: durationMs, + indexBuildMs: 0, + queryMs: 0, + clusterBuildMs: 0, + peakRssBytes: 0, + peakHeapUsedBytes: 0, + clusters: result.clusters, + edges: result.edges, + }; + } + const result = service.clusterExperiment({ owner: 'openclaw', repo: 'openclaw', + backend, k: baseline.fixture.k, minScore: baseline.fixture.minScore, }); - const durationMs = performance.now() - startedAt; - return { durationMs, clusters: result.clusters, edges: result.edges }; + return { + durationMs: result.durationMs, + totalDurationMs: result.totalDurationMs, + loadMs: result.loadMs, + setupMs: result.setupMs, + edgeBuildMs: result.edgeBuildMs, + indexBuildMs: result.indexBuildMs, + queryMs: result.queryMs, + clusterBuildMs: result.clusterBuildMs, + peakRssBytes: result.memory.peakRssBytes, + peakHeapUsedBytes: result.memory.peakHeapUsedBytes, + clusters: result.clusters, + edges: result.edges, + }; } finally { service.close(); } } async function measureBenchmark(baseline: PerfBaseline): Promise { + const backend = getPerfBackend(); if (baseline.baseline.fixtureMedianMs <= 0 && !shouldBootstrapBaseline()) { throw new Error( - `Cluster perf baseline is not set in ${BASELINE_PATH}. Run the benchmark once, then record fixtureMedianMs before enforcing regressions.`, + `Cluster perf baseline is not set in ${getBaselinePath()}. 
Run the benchmark once, then record fixtureMedianMs before enforcing regressions.`, ); } @@ -311,28 +423,63 @@ async function measureBenchmark(baseline: PerfBaseline): Promise const warmupRuns = baseline.benchmark.warmupRuns; const runsPerSample = baseline.benchmark.runsPerSample; const sampleDurationsMs: number[] = []; + const totalSampleDurationsMs: number[] = []; + const loadSampleDurationsMs: number[] = []; + const setupSampleDurationsMs: number[] = []; + const edgeBuildSampleDurationsMs: number[] = []; + const indexBuildSampleDurationsMs: number[] = []; + const querySampleDurationsMs: number[] = []; + const clusterBuildSampleDurationsMs: number[] = []; + const peakRssBytesSamples: number[] = []; + const peakHeapUsedBytesSamples: number[] = []; const benchmarkStartedAt = performance.now(); let runCounter = 0; for (let warmupIndex = 0; warmupIndex < warmupRuns; warmupIndex += 1) { const warmupDbPath = path.join(tempRoot, `warmup-${warmupIndex}.sqlite`); fs.copyFileSync(seedDbPath, warmupDbPath); - const warmupResult = await runSingleCluster(warmupDbPath, baseline); - assert.equal(warmupResult.clusters, baseline.fixture.clusterCount); - assert.ok(warmupResult.edges > baseline.fixture.clusterCount); + const warmupResult = await runSingleCluster(warmupDbPath, baseline, backend); + assertBenchmarkShape(warmupResult, baseline, backend); } while (sampleDurationsMs.length < baseline.benchmark.maxSamples) { - const sampleStartedAt = performance.now(); + let sampleDurationMs = 0; + let totalSampleDurationMs = 0; + let loadSampleDurationMs = 0; + let setupSampleDurationMs = 0; + let edgeBuildSampleDurationMs = 0; + let indexBuildSampleDurationMs = 0; + let querySampleDurationMs = 0; + let clusterBuildSampleDurationMs = 0; + let samplePeakRssBytes = 0; + let samplePeakHeapUsedBytes = 0; for (let runIndex = 0; runIndex < runsPerSample; runIndex += 1) { const runDbPath = path.join(tempRoot, `run-${runCounter}.sqlite`); runCounter += 1; fs.copyFileSync(seedDbPath, runDbPath); 
- const result = await runSingleCluster(runDbPath, baseline); - assert.equal(result.clusters, baseline.fixture.clusterCount); - assert.ok(result.edges > baseline.fixture.clusterCount); + const result = await runSingleCluster(runDbPath, baseline, backend); + assertBenchmarkShape(result, baseline, backend); + sampleDurationMs += result.durationMs; + totalSampleDurationMs += result.totalDurationMs; + loadSampleDurationMs += result.loadMs; + setupSampleDurationMs += result.setupMs; + edgeBuildSampleDurationMs += result.edgeBuildMs; + indexBuildSampleDurationMs += result.indexBuildMs; + querySampleDurationMs += result.queryMs; + clusterBuildSampleDurationMs += result.clusterBuildMs; + samplePeakRssBytes = Math.max(samplePeakRssBytes, result.peakRssBytes); + samplePeakHeapUsedBytes = Math.max(samplePeakHeapUsedBytes, result.peakHeapUsedBytes); } - sampleDurationsMs.push(performance.now() - sampleStartedAt); + sampleDurationsMs.push(sampleDurationMs); + totalSampleDurationsMs.push(totalSampleDurationMs); + loadSampleDurationsMs.push(loadSampleDurationMs); + setupSampleDurationsMs.push(setupSampleDurationMs); + edgeBuildSampleDurationsMs.push(edgeBuildSampleDurationMs); + indexBuildSampleDurationsMs.push(indexBuildSampleDurationMs); + querySampleDurationsMs.push(querySampleDurationMs); + clusterBuildSampleDurationsMs.push(clusterBuildSampleDurationMs); + peakRssBytesSamples.push(samplePeakRssBytes); + peakHeapUsedBytesSamples.push(samplePeakHeapUsedBytes); const elapsedMs = performance.now() - benchmarkStartedAt; if (sampleDurationsMs.length >= baseline.benchmark.minSamples && elapsedMs >= baseline.benchmark.maxTotalMs) { @@ -341,6 +488,15 @@ async function measureBenchmark(baseline: PerfBaseline): Promise } const medianMs = median(sampleDurationsMs); + const totalMedianMs = median(totalSampleDurationsMs); + const loadMedianMs = median(loadSampleDurationsMs); + const setupMedianMs = median(setupSampleDurationsMs); + const edgeBuildMedianMs = 
median(edgeBuildSampleDurationsMs); + const indexBuildMedianMs = median(indexBuildSampleDurationsMs); + const queryMedianMs = median(querySampleDurationsMs); + const clusterBuildMedianMs = median(clusterBuildSampleDurationsMs); + const medianPeakRssBytes = median(peakRssBytesSamples); + const medianPeakHeapUsedBytes = median(peakHeapUsedBytesSamples); const baselineMedianMs = baseline.baseline.fixtureMedianMs > 0 ? baseline.baseline.fixtureMedianMs : medianMs; const deltaMs = medianMs - baselineMedianMs; const deltaPercent = baselineMedianMs > 0 ? (deltaMs / baselineMedianMs) * 100 : 0; @@ -350,8 +506,28 @@ async function measureBenchmark(baseline: PerfBaseline): Promise const projectedDeltaPercent = (projectedDeltaMs / projectedBaselineOpenclawMs) * 100; return { + backend, + timingBasis: 'cluster-only', sampleDurationsMs, + totalSampleDurationsMs, + loadSampleDurationsMs, + setupSampleDurationsMs, + edgeBuildSampleDurationsMs, + indexBuildSampleDurationsMs, + querySampleDurationsMs, + clusterBuildSampleDurationsMs, + peakRssBytesSamples, + peakHeapUsedBytesSamples, medianMs, + totalMedianMs, + loadMedianMs, + setupMedianMs, + edgeBuildMedianMs, + indexBuildMedianMs, + queryMedianMs, + clusterBuildMedianMs, + medianPeakRssBytes, + medianPeakHeapUsedBytes, baselineMedianMs, deltaMs, deltaPercent, @@ -374,6 +550,7 @@ function buildSummary(result: PerfRunResult): string { const status = result.deltaPercent > result.maxRegressionPercent ? 'FAIL' : 'PASS'; const sampleList = result.sampleDurationsMs.map((value) => formatDurationMs(value)).join(', '); const suggestedBaseline = buildSuggestedBaseline(result); + const timingLabel = 'Fixture median'; const bootstrapLine = result.baselineMedianMs === result.medianMs ? 
'- Bootstrap mode: using the current fixture median as the provisional baseline' @@ -384,8 +561,19 @@ function buildSummary(result: PerfRunResult): string { return [ '## Cluster Performance', '', + `- Backend: ${result.backend}`, + `- Timing basis: ${result.timingBasis}`, `- Status: ${status}`, - `- Fixture median: ${formatDurationMs(result.medianMs)} (${result.samples} samples, ${result.runsPerSample} cluster rebuilds/sample)`, + `- Fixture median (cluster-only): ${formatDurationMs(result.medianMs)} (${result.samples} samples, ${result.runsPerSample} cluster rebuilds/sample)`, + `- Fixture median (total run): ${formatDurationMs(result.totalMedianMs)}`, + `- Fixture median load stage: ${formatDurationMs(result.loadMedianMs)}`, + `- Fixture median setup stage: ${formatDurationMs(result.setupMedianMs)}`, + `- Fixture median exact edge-build stage: ${formatDurationMs(result.edgeBuildMedianMs)}`, + `- Fixture median vector index-build stage: ${formatDurationMs(result.indexBuildMedianMs)}`, + `- Fixture median vector query stage: ${formatDurationMs(result.queryMedianMs)}`, + `- Fixture median cluster-assembly stage: ${formatDurationMs(result.clusterBuildMedianMs)}`, + `- Median peak RSS: ${formatBytes(result.medianPeakRssBytes)}`, + `- Median peak heap used: ${formatBytes(result.medianPeakHeapUsedBytes)}`, `- Fixture baseline: ${formatDurationMs(result.baselineMedianMs)}`, `- Fixture delta: ${formatDurationMs(result.deltaMs)} (${formatPercent(result.deltaPercent)})`, `- Projected openclaw/openclaw duration: ${formatDurationMs(result.projectedOpenclawMs)}`, @@ -430,7 +618,7 @@ async function main(): Promise { const result = await measureBenchmark(baseline); const summary = buildSummary(result); const bootstrap = shouldBootstrapBaseline(); - const shouldFail = !bootstrap && result.deltaPercent > result.maxRegressionPercent; + const shouldFail = !bootstrap && !shouldIgnoreRegressionThreshold() && result.deltaPercent > result.maxRegressionPercent; 
process.stdout.write(`${summary}\n`); const suggestedBaseline = buildSuggestedBaseline(result); diff --git a/packages/api-core/src/config.test.ts b/packages/api-core/src/config.test.ts index 3b19483..7b7895f 100644 --- a/packages/api-core/src/config.test.ts +++ b/packages/api-core/src/config.test.ts @@ -55,6 +55,9 @@ test('loadConfig prefers persisted config and stores defaults under the user con assert.equal(config.githubTokenSource, 'config'); assert.equal(config.openaiApiKeySource, 'config'); assert.equal(config.dbPath, path.join(home, '.config', 'ghcrawl', 'ghcrawl.db')); + assert.equal(config.summaryModel, 'gpt-5-mini'); + assert.equal(config.embeddingBasis, 'title_original'); + assert.equal(config.vectorBackend, 'vectorlite'); }); test('loadConfig lets environment variables override persisted config', () => { @@ -163,6 +166,30 @@ test('writePersistedConfig creates a readable config file', () => { assert.equal(persisted.data.openaiApiKey, 'sk-proj-testkey1234567890'); }); +test('persisted config round-trips summary model, embedding basis, and vector backend', () => { + const home = makeTempHome(); + const env = { + ...makeTestEnv(), + HOME: home, + }; + + writePersistedConfig( + { + githubToken: 'ghp_testtoken1234567890', + openaiApiKey: 'sk-proj-testkey1234567890', + summaryModel: 'gpt-5.4-mini', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', + }, + { env }, + ); + + const loaded = loadConfig({ env, cwd: process.cwd() }); + assert.equal(loaded.summaryModel, 'gpt-5.4-mini'); + assert.equal(loaded.embeddingBasis, 'title_original'); + assert.equal(loaded.vectorBackend, 'vectorlite'); +}); + test('config path override redirects persisted config reads and writes', () => { const workspace = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-workspace-')); const overridePath = path.join(workspace, '.tmp-config', 'custom-config.json'); diff --git a/packages/api-core/src/config.ts b/packages/api-core/src/config.ts index a792dd7..f57726d 100644 --- 
a/packages/api-core/src/config.ts +++ b/packages/api-core/src/config.ts @@ -9,6 +9,8 @@ export type SecretProvider = 'plaintext' | 'op'; export type TuiSortPreference = 'recent' | 'size'; export type TuiMinClusterSize = 0 | 1 | 10 | 20 | 50; export type TuiWideLayoutPreference = 'columns' | 'right-stack'; +export type EmbeddingBasis = 'title_original' | 'title_summary'; +export type VectorBackend = 'vectorlite'; export type TuiRepositoryPreference = { minClusterSize: TuiMinClusterSize; @@ -26,6 +28,8 @@ export type PersistedGitcrawlConfig = { apiPort?: number; summaryModel?: string; embedModel?: string; + embeddingBasis?: EmbeddingBasis; + vectorBackend?: VectorBackend; embedBatchSize?: number; embedConcurrency?: number; embedMaxUnread?: number; @@ -51,6 +55,8 @@ export type GitcrawlConfig = { opItemName?: string; summaryModel: string; embedModel: string; + embeddingBasis: EmbeddingBasis; + vectorBackend: VectorBackend; embedBatchSize: number; embedConcurrency: number; embedMaxUnread: number; @@ -174,6 +180,14 @@ function getTuiWideLayoutPreference(value: unknown): TuiWideLayoutPreference | u return value === 'columns' || value === 'right-stack' ? value : undefined; } +function getEmbeddingBasis(value: unknown): EmbeddingBasis | undefined { + return value === 'title_original' || value === 'title_summary' ? value : undefined; +} + +function getVectorBackend(value: unknown): VectorBackend | undefined { + return value === 'vectorlite' ? 
value : undefined; +} + function getTuiPreferences(value: unknown): Record | undefined { if (!value || typeof value !== 'object') { return undefined; @@ -219,6 +233,8 @@ export function readPersistedConfig(options: LoadConfigOptions = {}): LoadedStor apiPort: getNumber(raw.apiPort), summaryModel: getString(raw.summaryModel), embedModel: getString(raw.embedModel), + embeddingBasis: getEmbeddingBasis(raw.embeddingBasis), + vectorBackend: getVectorBackend(raw.vectorBackend), embedBatchSize: getNumber(raw.embedBatchSize), embedConcurrency: getNumber(raw.embedConcurrency), embedMaxUnread: getNumber(raw.embedMaxUnread), @@ -337,6 +353,18 @@ export function loadConfig(options: LoadConfigOptions = {}): GitcrawlConfig { { source: 'dotenv', value: getDotenvString(dotenvValues, 'GHCRAWL_EMBED_MODEL', 'GHCRAWL_EMBED_MODEL') }, { source: 'default', value: 'text-embedding-3-large' }, ); + const embeddingBasis = pickDefined( + { source: 'env', value: getEmbeddingBasis(getEnvString(env, 'GHCRAWL_EMBEDDING_BASIS', 'GHCRAWL_EMBEDDING_BASIS')) }, + { source: 'config', value: stored.data.embeddingBasis }, + { source: 'dotenv', value: getEmbeddingBasis(getDotenvString(dotenvValues, 'GHCRAWL_EMBEDDING_BASIS', 'GHCRAWL_EMBEDDING_BASIS')) }, + { source: 'default', value: 'title_original' }, + ); + const vectorBackend = pickDefined( + { source: 'env', value: getVectorBackend(getEnvString(env, 'GHCRAWL_VECTOR_BACKEND', 'GHCRAWL_VECTOR_BACKEND')) }, + { source: 'config', value: stored.data.vectorBackend }, + { source: 'dotenv', value: getVectorBackend(getDotenvString(dotenvValues, 'GHCRAWL_VECTOR_BACKEND', 'GHCRAWL_VECTOR_BACKEND')) }, + { source: 'default', value: 'vectorlite' }, + ); const openSearchUrl = pickDefined( { source: 'env', value: getEnvString(env, 'GHCRAWL_OPENSEARCH_URL', 'GHCRAWL_OPENSEARCH_URL') }, { source: 'config', value: stored.data.openSearchUrl }, @@ -375,6 +403,8 @@ export function loadConfig(options: LoadConfigOptions = {}): GitcrawlConfig { opItemName: 
stored.data.opItemName, summaryModel: summaryModel.value ?? 'gpt-5-mini', embedModel: embedModel.value ?? 'text-embedding-3-large', + embeddingBasis: embeddingBasis.value ?? 'title_original', + vectorBackend: vectorBackend.value ?? 'vectorlite', embedBatchSize, embedConcurrency, embedMaxUnread, @@ -387,6 +417,7 @@ export function loadConfig(options: LoadConfigOptions = {}): GitcrawlConfig { export function ensureRuntimeDirs(config: GitcrawlConfig): void { fs.mkdirSync(config.configDir, { recursive: true }); fs.mkdirSync(path.dirname(config.dbPath), { recursive: true }); + fs.mkdirSync(path.join(config.configDir, 'vectors'), { recursive: true }); } export function getTuiRepositoryPreference(config: GitcrawlConfig, owner: string, repo: string): TuiRepositoryPreference { diff --git a/packages/api-core/src/db/migrate.test.ts b/packages/api-core/src/db/migrate.test.ts index 7aba573..476cc5a 100644 --- a/packages/api-core/src/db/migrate.test.ts +++ b/packages/api-core/src/db/migrate.test.ts @@ -17,13 +17,18 @@ test('migrate creates core tables', () => { assert.ok(names.includes('threads')); assert.ok(names.includes('documents')); assert.ok(names.includes('document_embeddings')); + assert.ok(names.includes('thread_vectors')); assert.ok(names.includes('cluster_runs')); assert.ok(names.includes('repo_sync_state')); + assert.ok(names.includes('repo_pipeline_state')); const threadColumns = db.prepare('pragma table_info(threads)').all() as Array<{ name: string }>; const threadColumnNames = threadColumns.map((column) => column.name); assert.ok(threadColumnNames.includes('first_pulled_at')); assert.ok(threadColumnNames.includes('last_pulled_at')); + + const summaryColumns = db.prepare('pragma table_info(document_summaries)').all() as Array<{ name: string }>; + assert.ok(summaryColumns.map((column) => column.name).includes('prompt_version')); } finally { db.close(); } diff --git a/packages/api-core/src/db/migrate.ts b/packages/api-core/src/db/migrate.ts index 7ec4059..36c4e3e 
100644 --- a/packages/api-core/src/db/migrate.ts +++ b/packages/api-core/src/db/migrate.ts @@ -125,6 +125,34 @@ const migrationStatements = [ ) `, ` + create table if not exists thread_vectors ( + thread_id integer primary key references threads(id) on delete cascade, + basis text not null, + model text not null, + dimensions integer not null, + content_hash text not null, + vector_json text not null, + vector_backend text not null, + created_at text not null, + updated_at text not null + ) + `, + ` + create table if not exists repo_pipeline_state ( + repo_id integer primary key references repositories(id) on delete cascade, + summary_model text not null, + summary_prompt_version text not null, + embedding_basis text not null, + embed_model text not null, + embed_dimensions integer not null, + embed_pipeline_version text not null, + vector_backend text not null, + vectors_current_at text, + clusters_current_at text, + updated_at text not null + ) + `, + ` create table if not exists sync_runs ( id integer primary key, repo_id integer references repositories(id) on delete cascade, @@ -249,8 +277,23 @@ export function migrate(db: SqliteDatabase): void { db.exec('alter table clusters add column close_reason_local text'); } + const summaryColumns = new Set( + (db.prepare('pragma table_info(document_summaries)').all() as Array<{ name: string }>).map((column) => column.name), + ); + if (!summaryColumns.has('prompt_version')) { + db.exec("alter table document_summaries add column prompt_version text default 'v1'"); + } + + const vectorColumns = new Set( + (db.prepare('pragma table_info(thread_vectors)').all() as Array<{ name: string }>).map((column) => column.name), + ); + if (!vectorColumns.has('vector_backend')) { + db.exec("alter table thread_vectors add column vector_backend text default 'vectorlite'"); + } + db.exec('create index if not exists idx_threads_repo_number on threads(repo_id, number)'); db.exec('create index if not exists idx_document_summaries_thread_model 
on document_summaries(thread_id, model)'); + db.exec('create index if not exists idx_thread_vectors_basis_model on thread_vectors(basis, model)'); db.exec('create index if not exists idx_cluster_runs_repo_status_id on cluster_runs(repo_id, status, id)'); db.exec('create index if not exists idx_clusters_repo_run_id on clusters(repo_id, cluster_run_id, id)'); db.exec('create index if not exists idx_cluster_members_thread_cluster on cluster_members(thread_id, cluster_id)'); diff --git a/packages/api-core/src/index.ts b/packages/api-core/src/index.ts index 15471e0..0119310 100644 --- a/packages/api-core/src/index.ts +++ b/packages/api-core/src/index.ts @@ -4,3 +4,5 @@ export * from './documents/normalize.js'; export * from './search/exact.js'; export * from './cluster/build.js'; export * from './service.js'; +export * from './vector/store.js'; +export * from './vector/vectorlite-store.js'; diff --git a/packages/api-core/src/openai/provider.ts b/packages/api-core/src/openai/provider.ts index cfd3b94..53a6cc3 100644 --- a/packages/api-core/src/openai/provider.ts +++ b/packages/api-core/src/openai/provider.ts @@ -21,7 +21,7 @@ export type SummaryUsage = { export type AiProvider = { checkAuth: () => Promise; summarizeThread: (params: { model: string; text: string }) => Promise<{ summary: SummaryResult; usage?: SummaryUsage }>; - embedTexts: (params: { model: string; texts: string[] }) => Promise; + embedTexts: (params: { model: string; texts: string[]; dimensions?: number }) => Promise; }; const summarySchema = z.object({ @@ -56,8 +56,20 @@ export class OpenAiProvider implements AiProvider { content: [ { type: 'input_text', - text: - 'Summarize this GitHub issue or pull request thread. Return concise JSON only with keys problem_summary, solution_summary, maintainer_signal_summary, dedupe_summary. Each field should be plain text, no markdown, and usually 1-3 sentences.', + text: [ + 'Summarize this GitHub issue or pull request for automated duplicate detection. 
Your summary will be embedded and clustered.', + '', + 'Structure your analysis:', + '1. First identify the COMPONENT or SUBSYSTEM (e.g., "Discord gateway", "WhatsApp delivery", "Telegram media handler", "CLI routing", "session management")', + '2. Then identify the SPECIFIC PROBLEM or CHANGE within that component', + '3. Combine into a clear dedupe_summary that starts with the component name', + '', + 'Ignore completely: template boilerplate, testing instructions, checklists, environment info, reproduction steps, deployment notes, version numbers, cross-references.', + '', + 'Return JSON with keys: problem_summary, solution_summary, maintainer_signal_summary, dedupe_summary.', + 'Plain text, no markdown, 1-3 sentences each.', + 'dedupe_summary format: "[Component]: [specific issue or change]" — this helps cluster by subsystem.', + ].join('\n'), }, ], }, @@ -104,7 +116,7 @@ export class OpenAiProvider implements AiProvider { throw new Error(`OpenAI summarization failed after 3 attempts: ${lastError?.message ?? 
'unknown error'}`); } - async embedTexts(params: { model: string; texts: string[] }): Promise { + async embedTexts(params: { model: string; texts: string[]; dimensions?: number }): Promise { if (params.texts.length === 0) { return []; } @@ -115,6 +127,7 @@ export class OpenAiProvider implements AiProvider { const response = await this.client.embeddings.create({ model: params.model, input: params.texts, + dimensions: params.dimensions, }); return response.data.map((item) => item.embedding); diff --git a/packages/api-core/src/service.test.ts b/packages/api-core/src/service.test.ts index 9d71229..478553f 100644 --- a/packages/api-core/src/service.test.ts +++ b/packages/api-core/src/service.test.ts @@ -1,13 +1,17 @@ import test from 'node:test'; import assert from 'node:assert/strict'; +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; import { GHCrawlService } from './service.js'; function makeTestConfig(overrides: Partial = {}): GHCrawlService['config'] { + const configDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-service-test-')); return { workspaceRoot: process.cwd(), - configDir: '/tmp/ghcrawl-test', - configPath: '/tmp/ghcrawl-test/config.json', + configDir, + configPath: path.join(configDir, 'config.json'), configFileExists: true, dbPath: ':memory:', dbPathSource: 'config', @@ -19,6 +23,8 @@ function makeTestConfig(overrides: Partial = {}): GHCr openaiApiKeySource: 'none', summaryModel: 'gpt-5-mini', embedModel: 'text-embedding-3-large', + embeddingBasis: 'title_original', + vectorBackend: 'vectorlite', embedBatchSize: 2, embedConcurrency: 2, embedMaxUnread: 4, @@ -38,6 +44,14 @@ function makeTestService( }); } +function makeEmbedding(seed: number, variant = 0): number[] { + return Array.from({ length: 1024 }, (_value, index) => { + if (index === 0) return seed; + if (index === 1) return variant; + return 0; + }); +} + test('doctor reports config path and successful auth smoke checks', async () => { let githubChecked = 0; 
let openAiChecked = 0; @@ -71,11 +85,13 @@ test('doctor reports config path and successful auth smoke checks', async () => try { const result = await service.doctor(); - assert.equal(result.health.configPath, '/tmp/ghcrawl-test/config.json'); + assert.equal(result.health.configPath, service.config.configPath); assert.equal(result.github.formatOk, true); assert.equal(result.github.authOk, true); assert.equal(result.openai.formatOk, true); assert.equal(result.openai.authOk, true); + assert.equal(result.vectorlite.configured, true); + assert.equal(result.vectorlite.runtimeOk, true); assert.equal(githubChecked, 1); assert.equal(openAiChecked, 1); } finally { @@ -584,6 +600,99 @@ test('summarizeRepository includes hydrated human comments when includeComments } }); +test('summarizeRepository prices progress output using the configured summary model', async () => { + const progress: string[] = []; + const service = makeTestService( + { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [], + getIssue: async () => { + throw new Error('not expected'); + }, + getPull: async () => { + throw new Error('not expected'); + }, + listIssueComments: async () => [], + listPullReviews: async () => [], + listPullReviewComments: async () => [], + }, + { + checkAuth: async () => undefined, + summarizeThread: async () => ({ + summary: { + problemSummary: 'Problem', + solutionSummary: 'Solution', + maintainerSignalSummary: 'Signal', + dedupeSummary: 'Dedupe', + }, + usage: { + inputTokens: 1_000_000, + outputTokens: 0, + totalTokens: 1_000_000, + cachedInputTokens: 0, + reasoningTokens: 0, + }, + }), + embedTexts: async () => [], + }, + ); + + try { + const now = '2026-03-09T00:00:00Z'; + service.db + .prepare( + `insert into repositories (id, owner, name, full_name, github_repo_id, raw_json, updated_at) + values (?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 'openclaw', 'openclaw', 'openclaw/openclaw', '1', 
'{}', now); + service.db + .prepare( + `insert into threads ( + id, repo_id, github_id, number, kind, state, title, body, author_login, author_type, html_url, + labels_json, assignees_json, raw_json, content_hash, is_draft, created_at_gh, updated_at_gh, closed_at_gh, + merged_at_gh, first_pulled_at, last_pulled_at, updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run( + 10, + 1, + '100', + 42, + 'issue', + 'open', + 'Downloader hangs', + 'The transfer never finishes.', + 'alice', + 'User', + 'https://github.com/openclaw/openclaw/issues/42', + '["bug"]', + '[]', + '{}', + 'hash-42', + 0, + now, + now, + null, + null, + now, + now, + now, + ); + + await service.summarizeRepository({ + owner: 'openclaw', + repo: 'openclaw', + threadNumber: 42, + onProgress: (message) => progress.push(message), + }); + + assert.ok(progress.some((message) => message.includes('cost=$0.25') && message.includes('est_total=$0.25'))); + } finally { + service.close(); + } +}); + test('purgeComments removes hydrated comments and refreshes canonical documents', () => { const service = makeTestService({ checkAuth: async () => undefined, @@ -703,7 +812,7 @@ test('embedRepository batches multi-source embeddings and skips unchanged inputs }, embedTexts: async ({ texts }) => { embedCalls.push(texts); - return texts.map((text, index) => [text.length, index]); + return texts.map((text, index) => makeEmbedding(text.length, index)); }, }, ); @@ -751,33 +860,174 @@ test('embedRepository batches multi-source embeddings and skips unchanged inputs ); service.db .prepare( - `insert into document_summaries (thread_id, summary_kind, model, content_hash, summary_text, created_at, updated_at) - values (?, ?, ?, ?, ?, ?, ?)`, + `insert into document_summaries (thread_id, summary_kind, model, prompt_version, content_hash, summary_text, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, ) - .run(10, 'dedupe_summary', 'gpt-5-mini', 'summary-hash', 
'Transfer hangs near completion.', now, now); + .run(10, 'dedupe_summary', 'gpt-5-mini', 'v1', 'summary-hash', 'Transfer hangs near completion.', now, now); const first = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); - assert.equal(first.embedded, 3); - assert.equal(embedCalls.length, 2); + assert.equal(first.embedded, 1); + assert.equal(embedCalls.length, 1); assert.deepEqual( service.db - .prepare('select source_kind from document_embeddings order by source_kind asc') + .prepare('select basis, vector_json from thread_vectors order by basis asc') .all() - .map((row: unknown) => (row as { source_kind: string }).source_kind), - ['body', 'dedupe_summary', 'title'], + .map((row: unknown) => { + const typed = row as { basis: string; vector_json: Buffer | string }; + return { basis: typed.basis, vectorKind: Buffer.isBuffer(typed.vector_json) ? 'blob' : typeof typed.vector_json }; + }), + [{ basis: 'title_original', vectorKind: 'blob' }], ); const second = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); assert.equal(second.embedded, 0); - assert.equal(embedCalls.length, 2); + assert.equal(embedCalls.length, 1); service.db .prepare('update threads set body = ?, updated_at = ? 
where id = ?') .run('The transfer now stalls at 99%.', now, 10); const third = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); assert.equal(third.embedded, 1); - assert.equal(embedCalls.length, 3); - assert.deepEqual(embedCalls[2], ['The transfer now stalls at 99%.']); + assert.equal(embedCalls.length, 2); + assert.deepEqual(embedCalls[1], ['title: Downloader hangs\n\nbody: The transfer now stalls at 99%.']); + } finally { + service.close(); + } +}); + +test('listNeighbors uses the vectorlite sidecar for current active vectors', async () => { + const service = new GHCrawlService({ + config: makeTestConfig(), + github: { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [], + getIssue: async () => { + throw new Error('not expected'); + }, + getPull: async () => { + throw new Error('not expected'); + }, + listIssueComments: async () => [], + listPullReviews: async () => [], + listPullReviewComments: async () => [], + }, + ai: { + checkAuth: async () => undefined, + summarizeThread: async () => { + throw new Error('not expected'); + }, + embedTexts: async ({ texts }) => texts.map((_text, index) => (index === 0 ? 
makeEmbedding(1, 0) : makeEmbedding(0.99, 0.01))), + }, + }); + + try { + const now = '2026-03-09T00:00:00Z'; + const insertThread = service.db.prepare( + `insert into repositories (id, owner, name, full_name, github_repo_id, raw_json, updated_at) + values (?, ?, ?, ?, ?, ?, ?)`, + ); + insertThread.run(1, 'openclaw', 'openclaw', 'openclaw/openclaw', '1', '{}', now); + const insert = service.db.prepare( + `insert into threads ( + id, repo_id, github_id, number, kind, state, title, body, author_login, author_type, html_url, + labels_json, assignees_json, raw_json, content_hash, is_draft, created_at_gh, updated_at_gh, closed_at_gh, + merged_at_gh, first_pulled_at, last_pulled_at, updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insert.run(10, 1, '100', 42, 'issue', 'open', 'Downloader hangs', 'The transfer never finishes.', 'alice', 'User', 'https://github.com/openclaw/openclaw/issues/42', '[]', '[]', '{}', 'hash-42', 0, now, now, null, null, now, now, now); + insert.run(11, 1, '101', 43, 'issue', 'open', 'Downloader retry issue', 'The transfer retries forever.', 'bob', 'User', 'https://github.com/openclaw/openclaw/issues/43', '[]', '[]', '{}', 'hash-43', 0, now, now, null, null, now, now, now); + + await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); + + const result = service.listNeighbors({ + owner: 'openclaw', + repo: 'openclaw', + threadNumber: 42, + limit: 2, + minScore: 0.1, + }); + + assert.equal(result.thread.number, 42); + assert.deepEqual(result.neighbors.map((neighbor) => neighbor.number), [43]); + } finally { + service.close(); + } +}); + +test('embedRepository prunes closed vectors before reusing current active vectors', async () => { + const service = new GHCrawlService({ + config: makeTestConfig(), + github: { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [], + getIssue: async () => { + 
throw new Error('not expected'); + }, + getPull: async () => { + throw new Error('not expected'); + }, + listIssueComments: async () => [], + listPullReviews: async () => [], + listPullReviewComments: async () => [], + }, + ai: { + checkAuth: async () => undefined, + summarizeThread: async () => { + throw new Error('not expected'); + }, + embedTexts: async ({ texts }) => + texts.map((text) => { + if (text.includes('Target issue')) return makeEmbedding(1, 0); + if (text.includes('Closed similar one')) return makeEmbedding(0.999, 0.001); + if (text.includes('Closed similar two')) return makeEmbedding(0.998, 0.002); + if (text.includes('Open fallback')) return makeEmbedding(0.9, 0.1); + throw new Error(`unexpected embedding input: ${text}`); + }), + }, + }); + + try { + const now = '2026-03-09T00:00:00Z'; + service.db + .prepare( + `insert into repositories (id, owner, name, full_name, github_repo_id, raw_json, updated_at) + values (?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 'openclaw', 'openclaw', 'openclaw/openclaw', '1', '{}', now); + const insertThread = service.db.prepare( + `insert into threads ( + id, repo_id, github_id, number, kind, state, title, body, author_login, author_type, html_url, + labels_json, assignees_json, raw_json, content_hash, is_draft, created_at_gh, updated_at_gh, closed_at_gh, + merged_at_gh, first_pulled_at, last_pulled_at, updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertThread.run(10, 1, '100', 42, 'issue', 'open', 'Target issue', 'Primary issue body.', 'alice', 'User', 'https://github.com/openclaw/openclaw/issues/42', '[]', '[]', '{}', 'hash-42', 0, now, now, null, null, now, now, now); + insertThread.run(11, 1, '101', 43, 'issue', 'open', 'Closed similar one', 'Very similar body.', 'bob', 'User', 'https://github.com/openclaw/openclaw/issues/43', '[]', '[]', '{}', 'hash-43', 0, now, now, null, null, now, now, now); + insertThread.run(12, 1, '102', 44, 'issue', 'open', 'Closed similar 
two', 'Also very similar body.', 'carol', 'User', 'https://github.com/openclaw/openclaw/issues/44', '[]', '[]', '{}', 'hash-44', 0, now, now, null, null, now, now, now); + insertThread.run(13, 1, '103', 45, 'issue', 'open', 'Open fallback', 'Somewhat similar body.', 'dave', 'User', 'https://github.com/openclaw/openclaw/issues/45', '[]', '[]', '{}', 'hash-45', 0, now, now, null, null, now, now, now); + + await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); + + service.db + .prepare('update threads set state = ?, closed_at_gh = ?, updated_at = ? where id in (?, ?)') + .run('closed', now, now, 11, 12); + + const rerun = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); + assert.equal(rerun.embedded, 0); + + const vectorCount = service.db.prepare('select count(*) as count from thread_vectors').get() as { count: number }; + assert.equal(vectorCount.count, 2); + + const result = service.listNeighbors({ + owner: 'openclaw', + repo: 'openclaw', + threadNumber: 42, + limit: 1, + minScore: 0.1, + }); + assert.deepEqual(result.neighbors.map((neighbor) => neighbor.number), [45]); } finally { service.close(); } @@ -812,7 +1062,7 @@ test('embedRepository truncates oversized inputs before submission', async () => }, embedTexts: async ({ texts }) => { embedCalls.push(texts); - return texts.map((text, index) => [text.length, index]); + return texts.map((text, index) => makeEmbedding(text.length, index)); }, }, }); @@ -895,7 +1145,7 @@ test('embedRepository truncates oversized inputs before submission', async () => const result = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); - assert.equal(result.embedded, 4); + assert.equal(result.embedded, 2); assert.ok(embedCalls.length >= 1); const truncatedBodies = embedCalls.flat().filter((text) => text.includes('[truncated for embedding]')); assert.equal(truncatedBodies.length, 2); @@ -943,7 +1193,7 @@ test('embedRepository isolates a failing oversized item from a mixed batch 
and r ); } } - return texts.map((text, index) => [text.length, index]); + return texts.map((text, index) => makeEmbedding(text.length, index)); }, }, }); @@ -1025,9 +1275,9 @@ test('embedRepository isolates a failing oversized item from a mixed batch and r const result = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); - assert.equal(result.embedded, 4); + assert.equal(result.embedded, 2); assert.ok(embedCalls.length >= 3); - assert.equal(embedCalls[0].length, 4); + assert.equal(embedCalls[0].length, 2); assert.ok(embedCalls.flat().some((text) => text.includes('[truncated for embedding]'))); } finally { service.close(); @@ -1069,7 +1319,7 @@ test('embedRepository recovers from wrapped maximum input length errors by shrin `OpenAI embeddings failed after 5 attempts: 400 Invalid 'input[${overLimitIndex}]': maximum input length is 8192 tokens.`, ); } - return texts.map((text, index) => [text.length, index]); + return texts.map((text, index) => makeEmbedding(text.length, index)); }, }, }); @@ -1151,7 +1401,7 @@ test('embedRepository recovers from wrapped maximum input length errors by shrin const result = await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); - assert.equal(result.embedded, 4); + assert.equal(result.embedded, 2); const shortenedAttempts = Array.from( new Set( embedCalls @@ -1503,6 +1753,154 @@ test('clusterRepository prunes older cluster runs for the repo after a successfu } }); +test('clusterRepository purges legacy embeddings and inline vector payloads after a current-vector rebuild', async () => { + const service = new GHCrawlService({ + config: makeTestConfig(), + github: { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [], + getIssue: async () => { + throw new Error('not expected'); + }, + getPull: async () => { + throw new Error('not expected'); + }, + listIssueComments: async () => [], + listPullReviews: async () => 
[], + listPullReviewComments: async () => [], + }, + ai: { + checkAuth: async () => undefined, + summarizeThread: async () => { + throw new Error('not expected'); + }, + embedTexts: async ({ texts }) => texts.map((_text, index) => (index === 0 ? makeEmbedding(1, 0) : makeEmbedding(0.99, 0.01))), + }, + }); + + try { + const now = '2026-03-09T00:00:00Z'; + service.db + .prepare( + `insert into repositories (id, owner, name, full_name, github_repo_id, raw_json, updated_at) + values (?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 'openclaw', 'openclaw', 'openclaw/openclaw', '1', '{}', now); + + const insertThread = service.db.prepare( + `insert into threads ( + id, repo_id, github_id, number, kind, state, title, body, author_login, author_type, html_url, + labels_json, assignees_json, raw_json, content_hash, is_draft, created_at_gh, updated_at_gh, closed_at_gh, + merged_at_gh, first_pulled_at, last_pulled_at, updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertThread.run(10, 1, '100', 42, 'issue', 'open', 'Downloader hangs', 'The transfer never finishes.', 'alice', 'User', 'https://github.com/openclaw/openclaw/issues/42', '[]', '[]', '{}', 'hash-42', 0, now, now, null, null, now, now, now); + insertThread.run(11, 1, '101', 43, 'issue', 'open', 'Fix downloader hang', 'Implements a fix.', 'bob', 'User', 'https://github.com/openclaw/openclaw/issues/43', '[]', '[]', '{}', 'hash-43', 0, now, now, null, null, now, now, now); + + const insertLegacy = service.db.prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ); + for (const sourceKind of ['title', 'body', 'dedupe_summary'] as const) { + insertLegacy.run(10, sourceKind, 'text-embedding-3-large', 2, `hash-42-${sourceKind}`, '[1,0]', now, now); + insertLegacy.run(11, sourceKind, 'text-embedding-3-large', 2, `hash-43-${sourceKind}`, '[0.99,0.01]', now, 
now); + } + + await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); + const beforeCluster = service.db.prepare('select count(*) as count from document_embeddings').get() as { count: number }; + assert.equal(beforeCluster.count, 6); + + await service.clusterRepository({ + owner: 'openclaw', + repo: 'openclaw', + k: 1, + minScore: 0.5, + }); + + const legacyCount = service.db.prepare('select count(*) as count from document_embeddings').get() as { count: number }; + const inlineVectors = service.db + .prepare('select typeof(vector_json) as vector_kind from thread_vectors order by thread_id asc') + .all() as Array<{ vector_kind: string }>; + + assert.equal(legacyCount.count, 0); + assert.deepEqual(inlineVectors.map((row) => row.vector_kind), ['blob', 'blob']); + } finally { + service.close(); + } +}); + +test('clusterExperiment falls back to active vectors when legacy embeddings are absent', async () => { + const service = new GHCrawlService({ + config: makeTestConfig(), + github: { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [], + getIssue: async () => { + throw new Error('not expected'); + }, + getPull: async () => { + throw new Error('not expected'); + }, + listIssueComments: async () => [], + listPullReviews: async () => [], + listPullReviewComments: async () => [], + }, + ai: { + checkAuth: async () => undefined, + summarizeThread: async () => { + throw new Error('not expected'); + }, + embedTexts: async ({ texts }) => texts.map((_text, index) => (index === 0 ? 
makeEmbedding(1, 0) : makeEmbedding(0.99, 0.01))), + }, + }); + + try { + const now = '2026-03-09T00:00:00Z'; + service.db + .prepare( + `insert into repositories (id, owner, name, full_name, github_repo_id, raw_json, updated_at) + values (?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 'openclaw', 'openclaw', 'openclaw/openclaw', '1', '{}', now); + + const insertThread = service.db.prepare( + `insert into threads ( + id, repo_id, github_id, number, kind, state, title, body, author_login, author_type, html_url, + labels_json, assignees_json, raw_json, content_hash, is_draft, created_at_gh, updated_at_gh, closed_at_gh, + merged_at_gh, first_pulled_at, last_pulled_at, updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertThread.run(10, 1, '100', 42, 'issue', 'open', 'Downloader hangs', 'The transfer never finishes.', 'alice', 'User', 'https://github.com/openclaw/openclaw/issues/42', '[]', '[]', '{}', 'hash-42', 0, now, now, null, null, now, now, now); + insertThread.run(11, 1, '101', 43, 'issue', 'open', 'Downloader retry issue', 'The transfer retries forever.', 'bob', 'User', 'https://github.com/openclaw/openclaw/issues/43', '[]', '[]', '{}', 'hash-43', 0, now, now, null, null, now, now, now); + + await service.embedRepository({ owner: 'openclaw', repo: 'openclaw' }); + + const exact = service.clusterExperiment({ + owner: 'openclaw', + repo: 'openclaw', + backend: 'exact', + k: 1, + minScore: 0.5, + }); + const vectorlite = service.clusterExperiment({ + owner: 'openclaw', + repo: 'openclaw', + backend: 'vectorlite', + k: 1, + minScore: 0.5, + }); + + assert.equal(exact.threads, 2); + assert.equal(exact.clusters, 1); + assert.equal(vectorlite.threads, 2); + assert.equal(vectorlite.clusters, 1); + } finally { + service.close(); + } +}); + test('clusterRepository does not retain a parsed embedding cache in-process', async () => { const service = makeTestService({ checkAuth: async () => undefined, @@ -1627,7 +2025,7 @@ test('tui 
snapshot returns mixed issue and pull request counts with default rece assert.equal(snapshot.stats.lastGithubReconciliationAt, '2026-03-09T12:00:00Z'); assert.equal(snapshot.stats.lastEmbedRefreshAt, '2026-03-09T13:00:00Z'); assert.equal(snapshot.stats.staleEmbedThreadCount, 5); - assert.equal(snapshot.stats.staleEmbedSourceCount, 10); + assert.equal(snapshot.stats.staleEmbedSourceCount, 5); assert.equal(snapshot.stats.latestClusterRunId, 1); assert.equal(snapshot.clusters.length, 0); @@ -1865,7 +2263,7 @@ test('refreshRepository runs sync, embed, and cluster in order and returns the c summarizeThread: async () => { throw new Error('not expected'); }, - embedTexts: async ({ texts }) => texts.map(() => [1, 0]), + embedTexts: async ({ texts }) => texts.map((_text, index) => makeEmbedding(1, index)), }, ); @@ -1880,7 +2278,7 @@ test('refreshRepository runs sync, embed, and cluster in order and returns the c assert.equal(result.selected.embed, true); assert.equal(result.selected.cluster, true); assert.equal(result.sync?.threadsSynced, 1); - assert.equal(result.embed?.embedded, 2); + assert.equal(result.embed?.embedded, 1); assert.equal(result.cluster?.clusters, 1); const syncIndex = messages.findIndex((message) => message.includes('[sync]')); diff --git a/packages/api-core/src/service.ts b/packages/api-core/src/service.ts index 21f05f8..5a8b45f 100644 --- a/packages/api-core/src/service.ts +++ b/packages/api-core/src/service.ts @@ -1,7 +1,10 @@ import http from 'node:http'; import crypto from 'node:crypto'; +import fs from 'node:fs'; import { existsSync } from 'node:fs'; +import { createRequire } from 'node:module'; import os from 'node:os'; +import path from 'node:path'; import { fileURLToPath } from 'node:url'; import { Worker } from 'node:worker_threads'; @@ -45,7 +48,7 @@ import { type ThreadsResponse, } from '@ghcrawl/api-contract'; -import { buildClusters } from './cluster/build.js'; +import { buildClusters, buildRefinedClusters, buildSizeBoundedClusters } from 
'./cluster/build.js'; import { buildSourceKindEdges } from './cluster/exact-edges.js'; import { ensureRuntimeDirs, @@ -54,6 +57,7 @@ import { loadConfig, requireGithubToken, requireOpenAiKey, + type EmbeddingBasis, type ConfigValueSource, type GitcrawlConfig, } from './config.js'; @@ -62,7 +66,9 @@ import { openDb, type SqliteDatabase } from './db/sqlite.js'; import { buildCanonicalDocument, isBotLikeAuthor } from './documents/normalize.js'; import { makeGitHubClient, type GitHubClient } from './github/client.js'; import { OpenAiProvider, type AiProvider } from './openai/provider.js'; -import { cosineSimilarity, normalizeEmbedding, rankNearestNeighbors } from './search/exact.js'; +import { cosineSimilarity, dotProduct, normalizeEmbedding, rankNearestNeighbors, rankNearestNeighborsByScore } from './search/exact.js'; +import type { VectorStore } from './vector/store.js'; +import { VectorliteStore } from './vector/vectorlite-store.js'; type RunTable = 'sync_runs' | 'summary_runs' | 'embedding_runs' | 'cluster_runs'; @@ -114,6 +120,94 @@ type StoredEmbeddingRow = ThreadRow & { embedding_json: string; }; +type ActiveVectorTask = { + threadId: number; + threadNumber: number; + basis: EmbeddingBasis; + text: string; + contentHash: string; + estimatedTokens: number; + wasTruncated: boolean; +}; + +type ActiveVectorRow = ThreadRow & { + basis: EmbeddingBasis; + model: string; + dimensions: number; + content_hash: string; + vector_json: Buffer | string; + vector_backend: string; +}; + +type RepoPipelineStateRow = { + repo_id: number; + summary_model: string; + summary_prompt_version: string; + embedding_basis: EmbeddingBasis; + embed_model: string; + embed_dimensions: number; + embed_pipeline_version: string; + vector_backend: string; + vectors_current_at: string | null; + clusters_current_at: string | null; + updated_at: string; +}; + +type ClusterExperimentMemoryStats = { + rssBeforeBytes: number; + rssAfterBytes: number; + peakRssBytes: number; + heapUsedBeforeBytes: 
number; + heapUsedAfterBytes: number; + peakHeapUsedBytes: number; +}; + +type ClusterExperimentSizeBucket = { + size: number; + count: number; +}; + +type ClusterExperimentClusterSizeStats = { + soloClusters: number; + maxClusterSize: number; + topClusterSizes: number[]; + histogram: ClusterExperimentSizeBucket[]; +}; + +type ClusterExperimentCluster = { + representativeThreadId: number; + memberThreadIds: number[]; +}; + +type ClusterExperimentResult = { + backend: 'exact' | 'vectorlite'; + repository: RepositoryDto; + tempDbPath: string | null; + threads: number; + sourceKinds: number; + edges: number; + clusters: number; + timingBasis: 'cluster-only'; + durationMs: number; + totalDurationMs: number; + loadMs: number; + setupMs: number; + edgeBuildMs: number; + indexBuildMs: number; + queryMs: number; + clusterBuildMs: number; + candidateK: number; + memory: ClusterExperimentMemoryStats; + clusterSizes: ClusterExperimentClusterSizeStats; + clustersDetail: ClusterExperimentCluster[] | null; +}; + +type SummaryModelPricing = { + inputCostPerM: number; + cachedInputCostPerM: number; + outputCostPerM: number; +}; + type EmbeddingWorkset = { rows: Array<{ id: number; @@ -121,9 +215,10 @@ type EmbeddingWorkset = { title: string; body: string | null; }>; - tasks: EmbeddingTask[]; + tasks: ActiveVectorTask[]; existing: Map; - pending: EmbeddingTask[]; + pending: ActiveVectorTask[]; + missingSummaryThreadNumbers: number[]; }; type SyncCursorState = { @@ -237,6 +332,11 @@ export type DoctorResult = { authOk: boolean; error: string | null; }; + vectorlite: { + configured: boolean; + runtimeOk: boolean; + error: string | null; + }; }; type SyncOptions = { @@ -261,10 +361,31 @@ const CLUSTER_PARALLEL_MIN_EMBEDDINGS = 5000; const EMBED_ESTIMATED_CHARS_PER_TOKEN = 3; const EMBED_MAX_ITEM_TOKENS = 7000; const EMBED_MAX_BATCH_TOKENS = 250000; +const requireFromHere = createRequire(import.meta.url); const EMBED_TRUNCATION_MARKER = '\n\n[truncated for embedding]'; const 
EMBED_CONTEXT_RETRY_ATTEMPTS = 5; const EMBED_CONTEXT_RETRY_FALLBACK_SHRINK_RATIO = 0.9; const EMBED_CONTEXT_RETRY_TARGET_BUFFER_RATIO = 0.95; +const SUMMARY_PROMPT_VERSION = 'v1'; +const ACTIVE_EMBED_DIMENSIONS = 1024; +const ACTIVE_EMBED_PIPELINE_VERSION = 'vectorlite-1024-v1'; +const DEFAULT_CLUSTER_MIN_SCORE = 0.78; +const VECTORLITE_CLUSTER_EXPANDED_K = 24; +const VECTORLITE_CLUSTER_EXPANDED_MULTIPLIER = 4; +const VECTORLITE_CLUSTER_EXPANDED_CANDIDATE_K = 512; +const VECTORLITE_CLUSTER_EXPANDED_EF_SEARCH = 1024; +const SUMMARY_MODEL_PRICING: Record<string, SummaryModelPricing> = { + 'gpt-5-mini': { + inputCostPerM: 0.25, + cachedInputCostPerM: 0.025, + outputCostPerM: 2.0, + }, + 'gpt-5.4-mini': { + inputCostPerM: 0.75, + cachedInputCostPerM: 0.075, + outputCostPerM: 4.5, + }, +}; function nowIso(): string { return new Date().toISOString(); @@ -428,12 +549,14 @@ export class GHCrawlService { readonly db: SqliteDatabase; readonly github?: GitHubClient; readonly ai?: AiProvider; + readonly vectorStore: VectorStore; constructor(options: { config?: GitcrawlConfig; db?: SqliteDatabase; github?: GitHubClient; ai?: AiProvider; + vectorStore?: VectorStore; } = {}) { this.config = options.config ?? loadConfig(); ensureRuntimeDirs(this.config); @@ -441,9 +564,11 @@ export class GHCrawlService { migrate(this.db); this.github = options.github ?? (this.config.githubToken ? makeGitHubClient({ token: this.config.githubToken }) : undefined); this.ai = options.ai ?? (this.config.openaiApiKey ? new OpenAiProvider(this.config.openaiApiKey) : undefined); + this.vectorStore = options.vectorStore ?? 
new VectorliteStore(); } close(): void { + this.vectorStore.close(); this.db.close(); } @@ -510,7 +635,18 @@ export class GHCrawlService { } } - return { health, github, openai }; + const vectorliteHealth = this.vectorStore.checkRuntime(); + + return { + health, + github, + openai, + vectorlite: { + configured: this.config.vectorBackend === 'vectorlite', + runtimeOk: vectorliteHealth.ok, + error: vectorliteHealth.error, + }, + }; } listRepositories(): RepositoriesResponse { @@ -930,10 +1066,12 @@ export class GHCrawlService { const pending = sources.filter((row) => { const latest = this.db .prepare( - 'select content_hash from document_summaries where thread_id = ? and summary_kind = ? and model = ? limit 1', + 'select content_hash, prompt_version from document_summaries where thread_id = ? and summary_kind = ? and model = ? limit 1', ) - .get(row.id, 'dedupe_summary', this.config.summaryModel) as { content_hash: string } | undefined; - return latest?.content_hash !== row.summaryContentHash; + .get(row.id, 'dedupe_summary', this.config.summaryModel) as + | { content_hash: string; prompt_version: string | null } + | undefined; + return latest?.content_hash !== row.summaryContentHash || latest?.prompt_version !== SUMMARY_PROMPT_VERSION; }); params.onProgress?.( @@ -944,25 +1082,80 @@ export class GHCrawlService { let inputTokens = 0; let outputTokens = 0; let totalTokens = 0; - for (const [index, row] of pending.entries()) { - params.onProgress?.(`[summarize] ${index + 1}/${pending.length} thread #${row.number}`); - const result = await ai.summarizeThread({ - model: this.config.summaryModel, - text: row.summaryInput, - }); - const summary = result.summary; - - this.upsertSummary(row.id, row.summaryContentHash, 'problem_summary', summary.problemSummary); - this.upsertSummary(row.id, row.summaryContentHash, 'solution_summary', summary.solutionSummary); - this.upsertSummary(row.id, row.summaryContentHash, 'maintainer_signal_summary', summary.maintainerSignalSummary); - 
this.upsertSummary(row.id, row.summaryContentHash, 'dedupe_summary', summary.dedupeSummary); - if (result.usage) { - inputTokens += result.usage.inputTokens; - outputTokens += result.usage.outputTokens; - totalTokens += result.usage.totalTokens; - params.onProgress?.( - `[summarize] tokens thread #${row.number} in=${result.usage.inputTokens} out=${result.usage.outputTokens} total=${result.usage.totalTokens} cached_in=${result.usage.cachedInputTokens} reasoning=${result.usage.reasoningTokens}`, - ); + let cachedInputTokens = 0; + const startTime = Date.now(); + + const pricing = SUMMARY_MODEL_PRICING[this.config.summaryModel] ?? null; + + // Stage 1: concurrent API calls + const fetcher = new IterableMapper( + pending, + async (row) => { + const result = await ai.summarizeThread({ + model: this.config.summaryModel, + text: row.summaryInput, + }); + return { row, result }; + }, + { concurrency: 5 }, + ); + + // Stage 2: sequential DB writes — consumes from fetcher without blocking API completions + const writer = new IterableMapper( + fetcher, + async ({ row, result }) => { + const summary = result.summary; + this.upsertSummary(row.id, row.summaryContentHash, 'problem_summary', summary.problemSummary); + this.upsertSummary(row.id, row.summaryContentHash, 'solution_summary', summary.solutionSummary); + this.upsertSummary(row.id, row.summaryContentHash, 'maintainer_signal_summary', summary.maintainerSignalSummary); + this.upsertSummary(row.id, row.summaryContentHash, 'dedupe_summary', summary.dedupeSummary); + return { row, usage: result.usage }; + }, + { concurrency: 1 }, + ); + + let index = 0; + for await (const { row, usage } of writer) { + index += 1; + if (usage) { + inputTokens += usage.inputTokens; + outputTokens += usage.outputTokens; + totalTokens += usage.totalTokens; + cachedInputTokens += usage.cachedInputTokens; + } + + // Compute cost and ETA every 10 items or on the last item + if (index % 10 === 0 || index === pending.length) { + const remaining = 
pending.length - index; + const avgIn = inputTokens / index; + const avgOut = outputTokens / index; + const avgCachedIn = cachedInputTokens / index; + + const elapsedSec = (Date.now() - startTime) / 1000; + const secPerItem = elapsedSec / index; + const etaSec = remaining * secPerItem; + const etaMin = Math.round(etaSec / 60); + const etaStr = etaMin >= 60 ? `${Math.floor(etaMin / 60)}h${etaMin % 60}m` : `${etaMin}m`; + + if (pricing) { + const uncachedInput = inputTokens - cachedInputTokens; + const costSoFar = + (uncachedInput / 1_000_000) * pricing.inputCostPerM + + (cachedInputTokens / 1_000_000) * pricing.cachedInputCostPerM + + (outputTokens / 1_000_000) * pricing.outputCostPerM; + const estTotalCost = + costSoFar + + ((remaining * (avgIn - avgCachedIn)) / 1_000_000) * pricing.inputCostPerM + + ((remaining * avgCachedIn) / 1_000_000) * pricing.cachedInputCostPerM + + ((remaining * avgOut) / 1_000_000) * pricing.outputCostPerM; + params.onProgress?.( + `[summarize] ${index}/${pending.length} thread #${row.number} | cost=$${costSoFar.toFixed(2)} est_total=$${estTotalCost.toFixed(2)} | avg_in=${Math.round(avgIn)} avg_out=${Math.round(avgOut)} | ETA ${etaStr}`, + ); + } else { + params.onProgress?.( + `[summarize] ${index}/${pending.length} thread #${row.number} | avg_in=${Math.round(avgIn)} avg_out=${Math.round(avgOut)} | ETA ${etaStr}`, + ); + } } summarized += 1; } @@ -1027,22 +1220,39 @@ export class GHCrawlService { const runId = this.startRun('embedding_runs', repository.id, params.threadNumber ? 
`thread:${params.threadNumber}` : repository.fullName); try { - const { rows, tasks, pending } = this.getEmbeddingWorkset(repository.id, params.threadNumber); + if (params.threadNumber === undefined) { + if (!this.isRepoVectorStateCurrent(repository.id)) { + this.resetRepositoryVectors(repository.id, repository.fullName); + } else { + const pruned = this.pruneInactiveRepositoryVectors(repository.id, repository.fullName); + if (pruned > 0) { + params.onProgress?.(`[embed] pruned ${pruned} closed or inactive vector(s) before refresh`); + } + } + } + + const { rows, tasks, pending, missingSummaryThreadNumbers } = this.getEmbeddingWorkset(repository.id, params.threadNumber); const skipped = tasks.length - pending.length; const truncated = tasks.filter((task) => task.wasTruncated).length; + if (missingSummaryThreadNumbers.length > 0) { + throw new Error( + `Embedding basis ${this.config.embeddingBasis} requires summaries before embedding. Missing summaries for thread(s): ${missingSummaryThreadNumbers.slice(0, 10).join(', ')}${missingSummaryThreadNumbers.length > 10 ? 
', …' : ''}.`, + ); + } + params.onProgress?.( - `[embed] loaded ${rows.length} open thread(s) and ${tasks.length} embedding source(s) for ${repository.fullName}`, + `[embed] loaded ${rows.length} open thread(s) and ${tasks.length} active vector task(s) for ${repository.fullName}`, ); params.onProgress?.( - `[embed] pending=${pending.length} skipped=${skipped} truncated=${truncated} model=${this.config.embedModel} batch_size=${this.config.embedBatchSize} concurrency=${this.config.embedConcurrency} max_unread=${this.config.embedMaxUnread} max_batch_tokens=${EMBED_MAX_BATCH_TOKENS}`, + `[embed] pending=${pending.length} skipped=${skipped} truncated=${truncated} model=${this.config.embedModel} dimensions=${ACTIVE_EMBED_DIMENSIONS} basis=${this.config.embeddingBasis} batch_size=${this.config.embedBatchSize} concurrency=${this.config.embedConcurrency} max_unread=${this.config.embedMaxUnread} max_batch_tokens=${EMBED_MAX_BATCH_TOKENS}`, ); let embedded = 0; const batches = this.chunkEmbeddingTasks(pending, this.config.embedBatchSize, EMBED_MAX_BATCH_TOKENS); const mapper = new IterableMapper( batches, - async (batch: EmbeddingTask[]) => { + async (batch: ActiveVectorTask[]) => { return this.embedBatchWithRecovery(ai, batch, params.onProgress); }, { @@ -1054,17 +1264,18 @@ export class GHCrawlService { let completedBatches = 0; for await (const batchResult of mapper) { completedBatches += 1; - const numbers = batchResult.map(({ task }) => `#${task.threadNumber}:${task.sourceKind}`); + const numbers = batchResult.map(({ task }) => `#${task.threadNumber}:${task.basis}`); const estimatedTokens = batchResult.reduce((sum, { task }) => sum + task.estimatedTokens, 0); params.onProgress?.( `[embed] batch ${completedBatches}/${Math.max(batches.length, 1)} size=${batchResult.length} est_tokens=${estimatedTokens} items=${numbers.join(',')}`, ); for (const { task, embedding } of batchResult) { - this.upsertEmbedding(task.threadId, task.sourceKind, task.contentHash, embedding); + 
this.upsertActiveVector(repository.id, repository.fullName, task.threadId, task.basis, task.contentHash, embedding); embedded += 1; } } + this.markRepoVectorsCurrent(repository.id); this.finishRun('embedding_runs', runId, 'completed', { embedded }); return embedResultSchema.parse({ runId, embedded }); } catch (error) { @@ -1082,20 +1293,73 @@ export class GHCrawlService { }): Promise { const repository = this.requireRepository(params.owner, params.repo); const runId = this.startRun('cluster_runs', repository.id, repository.fullName); - const minScore = params.minScore ?? 0.82; + const minScore = params.minScore ?? DEFAULT_CLUSTER_MIN_SCORE; const k = params.k ?? 6; try { - const { items, sourceKinds } = this.loadClusterableThreadMeta(repository.id); + let items: Array<{ id: number; number: number; title: string }>; + let aggregatedEdges: Map }>; + + if (this.isRepoVectorStateCurrent(repository.id)) { + const vectorItems = this.loadClusterableActiveVectorMeta(repository.id, repository.fullName); + const activeIds = new Set(vectorItems.map((item) => item.id)); + const annQuery = this.getVectorliteClusterQuery(vectorItems.length, k); + aggregatedEdges = new Map(); + let processed = 0; + let lastProgressAt = Date.now(); + + params.onProgress?.( + `[cluster] loaded ${vectorItems.length} active vector(s) for ${repository.fullName} backend=${this.config.vectorBackend} k=${k} query_limit=${annQuery.limit} candidateK=${annQuery.candidateK} efSearch=${annQuery.efSearch ?? 
'default'} minScore=${minScore}`, + ); + for (const item of vectorItems) { + const neighbors = this.vectorStore.queryNearest({ + storePath: this.repoVectorStorePath(repository.fullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + vector: item.embedding, + limit: annQuery.limit, + candidateK: annQuery.candidateK + 1, + efSearch: annQuery.efSearch, + excludeThreadId: item.id, + }); + for (const neighbor of neighbors) { + if (!activeIds.has(neighbor.threadId)) continue; + if (neighbor.score < minScore) continue; + const key = this.edgeKey(item.id, neighbor.threadId); + const existing = aggregatedEdges.get(key); + if (existing) { + existing.score = Math.max(existing.score, neighbor.score); + } else { + aggregatedEdges.set(key, { + leftThreadId: Math.min(item.id, neighbor.threadId), + rightThreadId: Math.max(item.id, neighbor.threadId), + score: neighbor.score, + sourceKinds: new Set(['dedupe_summary']), + }); + } + } + processed += 1; + const now = Date.now(); + if (params.onProgress && now - lastProgressAt >= CLUSTER_PROGRESS_INTERVAL_MS) { + params.onProgress(`[cluster] queried ${processed}/${vectorItems.length} vectors current_edges=${aggregatedEdges.size}`); + lastProgressAt = now; + } + } + items = vectorItems; + } else if (this.hasLegacyEmbeddings(repository.id)) { + const legacy = this.loadClusterableThreadMeta(repository.id); + items = legacy.items; + params.onProgress?.( + `[cluster] loaded ${items.length} legacy embedded thread(s) across ${legacy.sourceKinds.length} source kind(s) for ${repository.fullName} k=${k} minScore=${minScore}`, + ); + aggregatedEdges = await this.aggregateRepositoryEdges(repository.id, legacy.sourceKinds, { + limit: k, + minScore, + onProgress: params.onProgress, + }); + } else { + throw new Error(`Vectors for ${repository.fullName} are stale or missing. 
Run refresh or embed first.`); + } - params.onProgress?.( - `[cluster] loaded ${items.length} embedded thread(s) across ${sourceKinds.length} source kind(s) for ${repository.fullName} k=${k} minScore=${minScore}`, - ); - const aggregatedEdges = await this.aggregateRepositoryEdges(repository.id, sourceKinds, { - limit: k, - minScore, - onProgress: params.onProgress, - }); const edges = Array.from(aggregatedEdges.values()).map((entry) => ({ leftThreadId: entry.leftThreadId, rightThreadId: entry.rightThreadId, @@ -1110,6 +1374,10 @@ export class GHCrawlService { ); this.persistClusterRun(repository.id, runId, aggregatedEdges, clusters); this.pruneOldClusterRuns(repository.id, runId); + if (this.isRepoVectorStateCurrent(repository.id)) { + this.markRepoClustersCurrent(repository.id); + this.cleanupMigratedRepositoryArtifacts(repository.id, repository.fullName, params.onProgress); + } params.onProgress?.(`[cluster] persisted ${clusters.length} cluster(s) and pruned older cluster runs`); @@ -1121,6 +1389,312 @@ export class GHCrawlService { } } + clusterExperiment(params: { + owner: string; + repo: string; + backend?: 'exact' | 'vectorlite'; + minScore?: number; + k?: number; + candidateK?: number; + efSearch?: number; + maxClusterSize?: number; + refineStep?: number; + clusterMode?: 'basic' | 'refine' | 'bounded'; + includeClusters?: boolean; + sourceKinds?: EmbeddingSourceKind[]; + aggregation?: 'max' | 'mean' | 'weighted' | 'min-of-2' | 'boost'; + aggregationWeights?: Partial>; + onProgress?: (message: string) => void; + }): ClusterExperimentResult { + const backend = params.backend ?? 'vectorlite'; + const repository = this.requireRepository(params.owner, params.repo); + const loaded = this.loadClusterableThreadMeta(repository.id); + const activeVectors = this.isRepoVectorStateCurrent(repository.id) ? this.loadNormalizedActiveVectors(repository.id) : []; + const activeSourceKind: EmbeddingSourceKind = this.config.embeddingBasis === 'title_summary' ? 
'dedupe_summary' : 'body'; + const useActiveVectors = activeVectors.length > 0 && (params.sourceKinds === undefined || loaded.items.length === 0); + const sourceKinds = useActiveVectors ? [activeSourceKind] : (params.sourceKinds ?? loaded.sourceKinds); + const items = useActiveVectors + ? activeVectors.map((item) => ({ id: item.id, number: item.number, title: item.title })) + : loaded.items; + const aggregation = params.aggregation ?? 'max'; + const minScore = params.minScore ?? DEFAULT_CLUSTER_MIN_SCORE; + const k = params.k ?? 6; + const candidateK = Math.max(k, params.candidateK ?? Math.max(k * 16, 64)); + const efSearch = params.efSearch; + const startedAt = Date.now(); + const memoryBefore = process.memoryUsage(); + let peakRssBytes = memoryBefore.rss; + let peakHeapUsedBytes = memoryBefore.heapUsed; + const recordMemory = (): void => { + const usage = process.memoryUsage(); + peakRssBytes = Math.max(peakRssBytes, usage.rss); + peakHeapUsedBytes = Math.max(peakHeapUsedBytes, usage.heapUsed); + }; + recordMemory(); + + if (useActiveVectors && params.sourceKinds && loaded.items.length === 0) { + params.onProgress?.( + `[cluster-experiment] legacy source embeddings are unavailable for ${repository.fullName}; falling back to active ${this.config.embeddingBasis} vectors`, + ); + } + + params.onProgress?.( + `[cluster-experiment] loaded ${items.length} embedded thread(s) across ${sourceKinds.length} source kind(s) for ${repository.fullName} backend=${backend} k=${k} candidateK=${candidateK} minScore=${minScore} aggregation=${aggregation}`, + ); + + const perSourceScores = new Map }>(); + let loadMs = 0; + let setupMs = 0; + let edgeBuildMs = 0; + let indexBuildMs = 0; + let queryMs = 0; + let clusterBuildMs = 0; + let tempDbPath: string | null = null; + let tempDb: SqliteDatabase | null = null; + let tempDir: string | null = null; + + try { + if (backend === 'exact') { + if (useActiveVectors) { + const loadStartedAt = Date.now(); + const normalizedRows = 
activeVectors.map(({ id, embedding }) => ({ id, normalizedEmbedding: embedding })); + loadMs += Date.now() - loadStartedAt; + recordMemory(); + + const edgesStartedAt = Date.now(); + const edges = buildSourceKindEdges(normalizedRows, { + limit: k, + minScore, + progressIntervalMs: CLUSTER_PROGRESS_INTERVAL_MS, + onProgress: (progress) => { + recordMemory(); + if (!params.onProgress) return; + params.onProgress( + `[cluster-experiment] exact ${progress.processedItems}/${normalizedRows.length} active vectors processed current_edges~=${perSourceScores.size + progress.currentEdgeEstimate}`, + ); + }, + }); + edgeBuildMs += Date.now() - edgesStartedAt; + this.collectSourceKindScores(perSourceScores, edges, activeSourceKind); + recordMemory(); + } else { + const totalItems = sourceKinds.reduce((sum, sourceKind) => sum + this.countEmbeddingsForSourceKind(repository.id, sourceKind), 0); + let processedItems = 0; + + for (const sourceKind of sourceKinds) { + const loadStartedAt = Date.now(); + const normalizedRows = this.loadNormalizedEmbeddingsForSourceKind(repository.id, sourceKind); + loadMs += Date.now() - loadStartedAt; + recordMemory(); + + const edgesStartedAt = Date.now(); + const edges = buildSourceKindEdges(normalizedRows, { + limit: k, + minScore, + progressIntervalMs: CLUSTER_PROGRESS_INTERVAL_MS, + onProgress: (progress) => { + recordMemory(); + if (!params.onProgress) return; + params.onProgress( + `[cluster-experiment] exact ${processedItems + progress.processedItems}/${totalItems} source embeddings processed current_edges~=${perSourceScores.size + progress.currentEdgeEstimate}`, + ); + }, + }); + edgeBuildMs += Date.now() - edgesStartedAt; + processedItems += normalizedRows.length; + this.collectSourceKindScores(perSourceScores, edges, sourceKind); + recordMemory(); + } + } + } else { + const setupStartedAt = Date.now(); + tempDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-vectorlite-')); + tempDbPath = path.join(tempDir, 'cluster-experiment.db'); + 
tempDb = openDb(tempDbPath); + tempDb.pragma('journal_mode = MEMORY'); + tempDb.pragma('synchronous = OFF'); + tempDb.pragma('temp_store = MEMORY'); + const vectorlite = requireFromHere('vectorlite') as { vectorlitePath: () => string }; + (tempDb as SqliteDatabase & { loadExtension: (extensionPath: string) => void }).loadExtension(vectorlite.vectorlitePath()); + setupMs += Date.now() - setupStartedAt; + recordMemory(); + + const vectorSources = useActiveVectors + ? [ + { + sourceKind: activeSourceKind, + rows: activeVectors.map(({ id, embedding }) => ({ id, normalizedEmbedding: embedding })), + }, + ] + : sourceKinds.map((sourceKind) => ({ + sourceKind, + rows: this.loadNormalizedEmbeddingsForSourceKind(repository.id, sourceKind).map((row) => ({ + id: row.id, + normalizedEmbedding: row.normalizedEmbedding, + })), + })); + + for (const source of vectorSources) { + const sourceRowCount = source.rows.length; + if (sourceRowCount === 0) { + continue; + } + + const dimension = source.rows[0]!.normalizedEmbedding.length; + const safeCandidateK = Math.min(candidateK, Math.max(1, sourceRowCount - 1)); + const tableName = `vector_${source.sourceKind}`; + + params.onProgress?.( + `[cluster-experiment] building ${source.sourceKind} HNSW index with ${sourceRowCount} vector(s)`, + ); + const indexStartedAt = Date.now(); + tempDb.exec( + `create virtual table ${tableName} using vectorlite(vec float32[${dimension}], hnsw(max_elements=${sourceRowCount}));`, + ); + const insert = tempDb.prepare(`insert into ${tableName}(rowid, vec) values (?, ?)`); + tempDb.transaction(() => { + const loadStartedAt = Date.now(); + for (const row of source.rows) { + insert.run(row.id, this.normalizedEmbeddingBuffer(row.normalizedEmbedding)); + } + loadMs += Date.now() - loadStartedAt; + })(); + indexBuildMs += Date.now() - indexStartedAt; + recordMemory(); + + const queryStartedAt = Date.now(); + const querySql = + efSearch !== undefined + ? 
`select rowid, distance from ${tableName} where knn_search(vec, knn_param(?, ${safeCandidateK + 1}, ${efSearch}))` + : `select rowid, distance from ${tableName} where knn_search(vec, knn_param(?, ${safeCandidateK + 1}))`; + const query = tempDb.prepare(querySql); + let processed = 0; + let lastProgressAt = Date.now(); + const queryLoadStartedAt = Date.now(); + for (const row of source.rows) { + const candidates = query.all(this.normalizedEmbeddingBuffer(row.normalizedEmbedding)) as Array<{ + rowid: number; + distance: number; + }>; + const ranked = rankNearestNeighborsByScore(candidates, { + limit: k, + minScore, + score: (candidate) => { + if (candidate.rowid === row.id) { + return -1; + } + return this.normalizedDistanceToScore(candidate.distance); + }, + }); + let addedThisRow = 0; + for (const candidate of ranked) { + const score = candidate.score; + const key = this.edgeKey(row.id, candidate.item.rowid); + const existing = perSourceScores.get(key); + if (existing) { + existing.scores.set(source.sourceKind, Math.max(existing.scores.get(source.sourceKind) ?? 
-1, score)); + continue; + } + const scores = new Map(); + scores.set(source.sourceKind, score); + perSourceScores.set(key, { + leftThreadId: Math.min(row.id, candidate.item.rowid), + rightThreadId: Math.max(row.id, candidate.item.rowid), + scores, + }); + addedThisRow += 1; + } + processed += 1; + const now = Date.now(); + if (params.onProgress && now - lastProgressAt >= CLUSTER_PROGRESS_INTERVAL_MS) { + recordMemory(); + params.onProgress( + `[cluster-experiment] querying ${source.sourceKind} index ${processed}/${sourceRowCount} current_edges=${perSourceScores.size} added_this_step=${addedThisRow}`, + ); + lastProgressAt = now; + } + } + loadMs += Date.now() - queryLoadStartedAt; + queryMs += Date.now() - queryStartedAt; + tempDb.exec(`drop table ${tableName}`); + recordMemory(); + } + } + + // Finalize edge scores using the configured aggregation method + const defaultWeights: Record = { dedupe_summary: 0.5, title: 0.3, body: 0.2 }; + const weights = { ...defaultWeights, ...(params.aggregationWeights ?? {}) }; + const aggregated = this.finalizeEdgeScores(perSourceScores, aggregation, weights, minScore); + + params.onProgress?.( + `[cluster-experiment] finalized ${aggregated.length} edges from ${perSourceScores.size} candidate pairs using ${aggregation} aggregation`, + ); + + const clusterStartedAt = Date.now(); + const clusterNodes = items.map((item) => ({ threadId: item.id, number: item.number, title: item.title })); + const clusterEdges = aggregated; + const clusterMode = params.clusterMode ?? (params.maxClusterSize !== undefined ? 'refine' : 'basic'); + const clusters = clusterMode === 'bounded' + ? buildSizeBoundedClusters(clusterNodes, clusterEdges, { + maxClusterSize: params.maxClusterSize ?? 200, + }) + : clusterMode === 'refine' + ? buildRefinedClusters(clusterNodes, clusterEdges, { + maxClusterSize: params.maxClusterSize ?? 200, + refineStep: params.refineStep ?? 
0.02, + }) + : buildClusters(clusterNodes, clusterEdges); + clusterBuildMs += Date.now() - clusterStartedAt; + recordMemory(); + const memoryAfter = process.memoryUsage(); + const durationMs = + backend === 'vectorlite' + ? indexBuildMs + queryMs + clusterBuildMs + : edgeBuildMs + clusterBuildMs; + const totalDurationMs = Date.now() - startedAt; + + return { + backend, + repository, + tempDbPath, + threads: items.length, + sourceKinds: sourceKinds.length, + edges: aggregated.length, + clusters: clusters.length, + timingBasis: 'cluster-only', + durationMs, + totalDurationMs, + loadMs, + setupMs, + edgeBuildMs, + indexBuildMs, + queryMs, + clusterBuildMs, + candidateK, + memory: { + rssBeforeBytes: memoryBefore.rss, + rssAfterBytes: memoryAfter.rss, + peakRssBytes, + heapUsedBeforeBytes: memoryBefore.heapUsed, + heapUsedAfterBytes: memoryAfter.heapUsed, + peakHeapUsedBytes, + }, + clusterSizes: this.summarizeClusterSizes(clusters), + clustersDetail: params.includeClusters + ? clusters.map((cluster) => ({ + representativeThreadId: cluster.representativeThreadId, + memberThreadIds: [...cluster.members], + })) + : null, + }; + } finally { + tempDb?.close(); + if (tempDir) { + fs.rmSync(tempDir, { recursive: true, force: true }); + } + } + } + async searchRepository(params: { owner: string; repo: string; @@ -1152,11 +1726,30 @@ export class GHCrawlService { } if (mode !== 'keyword' && this.ai) { - const [queryEmbedding] = await this.ai.embedTexts({ model: this.config.embedModel, texts: [params.query] }); - for (const row of this.iterateStoredEmbeddings(repository.id)) { - const score = cosineSimilarity(queryEmbedding, JSON.parse(row.embedding_json) as number[]); - if (score < 0.2) continue; - semanticScores.set(row.id, Math.max(semanticScores.get(row.id) ?? 
-1, score)); + if (this.isRepoVectorStateCurrent(repository.id)) { + const [queryEmbedding] = await this.ai.embedTexts({ + model: this.config.embedModel, + texts: [params.query], + dimensions: ACTIVE_EMBED_DIMENSIONS, + }); + const neighbors = this.vectorStore.queryNearest({ + storePath: this.repoVectorStorePath(repository.fullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + vector: queryEmbedding, + limit: limit * 2, + candidateK: Math.max(limit * 8, 64), + }); + for (const neighbor of neighbors) { + if (neighbor.score < 0.2) continue; + semanticScores.set(neighbor.threadId, Math.max(semanticScores.get(neighbor.threadId) ?? -1, neighbor.score)); + } + } else if (this.hasLegacyEmbeddings(repository.id)) { + const [queryEmbedding] = await this.ai.embedTexts({ model: this.config.embedModel, texts: [params.query] }); + for (const row of this.iterateStoredEmbeddings(repository.id)) { + const score = cosineSimilarity(queryEmbedding, JSON.parse(row.embedding_json) as number[]); + if (score < 0.2) continue; + semanticScores.set(row.id, Math.max(semanticScores.get(row.id) ?? -1, score)); + } } } @@ -1252,45 +1845,109 @@ export class GHCrawlService { const limit = params.limit ?? 10; const minScore = params.minScore ?? 0.2; - const targetRows = this.loadStoredEmbeddingsForThreadNumber(repository.id, params.threadNumber); - if (targetRows.length === 0) { - throw new Error( - `Thread #${params.threadNumber} for ${repository.fullName} was not found with an embedding. 
Run embed first.`, - ); - } - const targetRow = targetRows[0]; - const targetBySource = new Map(); - for (const row of targetRows) { - targetBySource.set(row.source_kind, JSON.parse(row.embedding_json) as number[]); - } - - const aggregated = new Map(); - for (const row of this.iterateStoredEmbeddings(repository.id)) { - if (row.id === targetRow.id) continue; - const targetEmbedding = targetBySource.get(row.source_kind); - if (!targetEmbedding) continue; - const score = cosineSimilarity(targetEmbedding, JSON.parse(row.embedding_json) as number[]); - if (score < minScore) continue; - const previous = aggregated.get(row.id); - if (!previous || score > previous.score) { - aggregated.set(row.id, { number: row.number, kind: row.kind, title: row.title, score }); + const targetRow = this.db + .prepare( + `select t.*, tv.basis, tv.model, tv.dimensions, tv.content_hash, tv.vector_json, tv.vector_backend + from threads t + join thread_vectors tv on tv.thread_id = t.id + where t.repo_id = ? + and t.number = ? + and t.state = 'open' + and t.closed_at_local is null + and tv.model = ? + and tv.basis = ? + and tv.dimensions = ? 
+ limit 1`, + ) + .get( + repository.id, + params.threadNumber, + this.config.embedModel, + this.config.embeddingBasis, + ACTIVE_EMBED_DIMENSIONS, + ) as ActiveVectorRow | undefined; + let responseThread: ThreadRow | ActiveVectorRow; + let neighbors: Array<{ threadId: number; number: number; kind: 'issue' | 'pull_request'; title: string; score: number }>; + + if (targetRow) { + responseThread = targetRow; + const candidateRows = this.vectorStore + .queryNearest({ + storePath: this.repoVectorStorePath(repository.fullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + vector: this.parseStoredVector(targetRow.vector_json), + limit: limit * 2, + candidateK: Math.max(limit * 8, 64), + excludeThreadId: targetRow.id, + }) + .filter((row) => row.score >= minScore); + const candidateIds = candidateRows.map((row) => row.threadId); + const neighborMeta = candidateIds.length + ? (this.db + .prepare( + `select * from threads + where repo_id = ? and state = 'open' and closed_at_local is null and id in (${candidateIds.map(() => '?').join(',')})`, + ) + .all(repository.id, ...candidateIds) as ThreadRow[]) + : []; + const metaById = new Map(neighborMeta.map((row) => [row.id, row])); + neighbors = candidateRows + .map((row) => { + const meta = metaById.get(row.threadId); + if (!meta) { + return null; + } + return { + threadId: row.threadId, + number: meta.number, + kind: meta.kind, + title: meta.title, + score: row.score, + }; + }) + .filter((row): row is NonNullable => row !== null) + .slice(0, limit); + } else { + const targetRows = this.loadStoredEmbeddingsForThreadNumber(repository.id, params.threadNumber); + if (targetRows.length === 0) { + throw new Error( + `Thread #${params.threadNumber} for ${repository.fullName} was not found with an embedding. 
Run embed first.`, + ); + } + responseThread = targetRows[0]!; + const targetBySource = new Map(); + for (const row of targetRows) { + targetBySource.set(row.source_kind, JSON.parse(row.embedding_json) as number[]); } - } - const neighbors = Array.from(aggregated.entries()) - .map(([threadId, value]) => ({ - threadId, - number: value.number, - kind: value.kind, - title: value.title, - score: value.score, - })) - .sort((left, right) => right.score - left.score) - .slice(0, limit); + const aggregated = new Map(); + for (const row of this.iterateStoredEmbeddings(repository.id)) { + if (row.id === responseThread.id) continue; + const targetEmbedding = targetBySource.get(row.source_kind); + if (!targetEmbedding) continue; + const score = cosineSimilarity(targetEmbedding, JSON.parse(row.embedding_json) as number[]); + if (score < minScore) continue; + const previous = aggregated.get(row.id); + if (!previous || score > previous.score) { + aggregated.set(row.id, { number: row.number, kind: row.kind, title: row.title, score }); + } + } + + neighbors = Array.from(aggregated.entries()) + .map(([threadId, value]) => ({ + threadId, + number: value.number, + kind: value.kind, + title: value.title, + score: value.score, + })) + .sort((left, right) => right.score - left.score) + .slice(0, limit); + } return neighborsResponseSchema.parse({ repository, - thread: threadToDto(targetRow), + thread: threadToDto(responseThread), neighbors, }); } @@ -1399,6 +2056,16 @@ export class GHCrawlService { onProgress: params.onProgress, }); } + if (selected.embed && this.config.embeddingBasis === 'title_summary') { + params.onProgress?.( + `[refresh] embedding basis ${this.config.embeddingBasis} requires summaries; running summarize before embed`, + ); + await this.summarizeRepository({ + owner: params.owner, + repo: params.repo, + onProgress: params.onProgress, + }); + } if (selected.embed) { embed = await this.embedRepository({ owner: params.owner, @@ -1664,10 +2331,10 @@ export class 
GHCrawlService { .prepare( `select summary_kind, summary_text from document_summaries - where thread_id = ? and model = ? + where thread_id = ? and model = ? and prompt_version = ? order by summary_kind asc`, ) - .all(row.id, this.config.summaryModel) as Array<{ summary_kind: string; summary_text: string }>; + .all(row.id, this.config.summaryModel, SUMMARY_PROMPT_VERSION) as Array<{ summary_kind: string; summary_text: string }>; const summaries: TuiThreadDetail['summaries'] = {}; for (const summary of summaryRows) { if ( @@ -1862,7 +2529,225 @@ export class GHCrawlService { }; } + private getDesiredPipelineState(): Omit { + return { + summary_model: this.config.summaryModel, + summary_prompt_version: SUMMARY_PROMPT_VERSION, + embedding_basis: this.config.embeddingBasis, + embed_model: this.config.embedModel, + embed_dimensions: ACTIVE_EMBED_DIMENSIONS, + embed_pipeline_version: ACTIVE_EMBED_PIPELINE_VERSION, + vector_backend: this.config.vectorBackend, + }; + } + + private getRepoPipelineState(repoId: number): RepoPipelineStateRow | null { + return ( + (this.db.prepare('select * from repo_pipeline_state where repo_id = ? limit 1').get(repoId) as RepoPipelineStateRow | undefined) ?? 
+ null + ); + } + + private isRepoVectorStateCurrent(repoId: number): boolean { + const state = this.getRepoPipelineState(repoId); + if (!state || !state.vectors_current_at) { + return false; + } + const desired = this.getDesiredPipelineState(); + return ( + state.summary_model === desired.summary_model && + state.summary_prompt_version === desired.summary_prompt_version && + state.embedding_basis === desired.embedding_basis && + state.embed_model === desired.embed_model && + state.embed_dimensions === desired.embed_dimensions && + state.embed_pipeline_version === desired.embed_pipeline_version && + state.vector_backend === desired.vector_backend + ); + } + + private isRepoClusterStateCurrent(repoId: number): boolean { + const state = this.getRepoPipelineState(repoId); + return this.isRepoVectorStateCurrent(repoId) && Boolean(state?.clusters_current_at); + } + + private hasLegacyEmbeddings(repoId: number): boolean { + const row = this.db + .prepare( + `select count(*) as count + from document_embeddings e + join threads t on t.id = e.thread_id + where t.repo_id = ? + and t.state = 'open' + and t.closed_at_local is null + and e.model = ?`, + ) + .get(repoId, this.config.embedModel) as { count: number }; + return row.count > 0; + } + + private writeRepoPipelineState( + repoId: number, + overrides: Partial>, + ): void { + const desired = this.getDesiredPipelineState(); + const current = this.getRepoPipelineState(repoId); + this.db + .prepare( + `insert into repo_pipeline_state ( + repo_id, + summary_model, + summary_prompt_version, + embedding_basis, + embed_model, + embed_dimensions, + embed_pipeline_version, + vector_backend, + vectors_current_at, + clusters_current_at, + updated_at + ) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ on conflict(repo_id) do update set + summary_model = excluded.summary_model, + summary_prompt_version = excluded.summary_prompt_version, + embedding_basis = excluded.embedding_basis, + embed_model = excluded.embed_model, + embed_dimensions = excluded.embed_dimensions, + embed_pipeline_version = excluded.embed_pipeline_version, + vector_backend = excluded.vector_backend, + vectors_current_at = excluded.vectors_current_at, + clusters_current_at = excluded.clusters_current_at, + updated_at = excluded.updated_at`, + ) + .run( + repoId, + desired.summary_model, + desired.summary_prompt_version, + desired.embedding_basis, + desired.embed_model, + desired.embed_dimensions, + desired.embed_pipeline_version, + desired.vector_backend, + overrides.vectors_current_at ?? current?.vectors_current_at ?? null, + overrides.clusters_current_at ?? current?.clusters_current_at ?? null, + nowIso(), + ); + } + + private markRepoVectorsCurrent(repoId: number): void { + this.writeRepoPipelineState(repoId, { + vectors_current_at: nowIso(), + clusters_current_at: null, + }); + } + + private markRepoClustersCurrent(repoId: number): void { + const state = this.getRepoPipelineState(repoId); + this.writeRepoPipelineState(repoId, { + vectors_current_at: state?.vectors_current_at ?? 
nowIso(), + clusters_current_at: nowIso(), + }); + } + + private repoVectorStorePath(repoFullName: string): string { + const safeName = repoFullName.replace(/[^a-zA-Z0-9._-]+/g, '__'); + return path.join(this.config.configDir, 'vectors', `${safeName}.sqlite`); + } + + private resetRepositoryVectors(repoId: number, repoFullName: string): void { + this.db + .prepare( + `delete from thread_vectors + where thread_id in (select id from threads where repo_id = ?)`, + ) + .run(repoId); + this.vectorStore.resetRepository({ + storePath: this.repoVectorStorePath(repoFullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + }); + this.writeRepoPipelineState(repoId, { + vectors_current_at: null, + clusters_current_at: null, + }); + } + + private pruneInactiveRepositoryVectors(repoId: number, repoFullName: string): number { + const rows = this.db + .prepare( + `select tv.thread_id + from thread_vectors tv + join threads t on t.id = tv.thread_id + where t.repo_id = ? + and (t.state != 'open' or t.closed_at_local is not null)`, + ) + .all(repoId) as Array<{ thread_id: number }>; + if (rows.length === 0) { + return 0; + } + + const deleteVectorRow = this.db.prepare('delete from thread_vectors where thread_id = ?'); + this.db.transaction(() => { + for (const row of rows) { + deleteVectorRow.run(row.thread_id); + this.vectorStore.deleteVector({ + storePath: this.repoVectorStorePath(repoFullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + threadId: row.thread_id, + }); + } + })(); + return rows.length; + } + + private cleanupMigratedRepositoryArtifacts(repoId: number, repoFullName: string, onProgress?: (message: string) => void): void { + const legacyEmbeddingCount = this.countLegacyEmbeddings(repoId); + const inlineJsonVectorCount = this.countInlineJsonThreadVectors(repoId); + if (legacyEmbeddingCount === 0 && inlineJsonVectorCount === 0) { + return; + } + + if (legacyEmbeddingCount > 0) { + this.db + .prepare( + `delete from document_embeddings + where thread_id in (select id from threads 
where repo_id = ?)`, + ) + .run(repoId); + onProgress?.(`[cleanup] removed ${legacyEmbeddingCount} legacy document embedding row(s) after vector migration`); + } + + if (inlineJsonVectorCount > 0) { + const rows = this.db + .prepare( + `select tv.thread_id, tv.vector_json + from thread_vectors tv + join threads t on t.id = tv.thread_id + where t.repo_id = ? + and typeof(tv.vector_json) = 'text' + and tv.vector_json != ''`, + ) + .all(repoId) as Array<{ thread_id: number; vector_json: string }>; + const update = this.db.prepare('update thread_vectors set vector_json = ?, updated_at = ? where thread_id = ?'); + this.db.transaction(() => { + for (const row of rows) { + update.run(this.vectorBlob(JSON.parse(row.vector_json) as number[]), nowIso(), row.thread_id); + } + })(); + onProgress?.(`[cleanup] compacted ${inlineJsonVectorCount} inline SQLite vector payload(s) from JSON to binary blobs`); + } + + if (this.config.dbPath !== ':memory:') { + onProgress?.(`[cleanup] checkpointing WAL and vacuuming ${repoFullName} migration changes`); + this.db.pragma('wal_checkpoint(TRUNCATE)'); + this.db.exec('VACUUM'); + this.db.pragma('wal_checkpoint(TRUNCATE)'); + } + } + private getLatestClusterRun(repoId: number): { id: number; finished_at: string | null } | null { + const state = this.getRepoPipelineState(repoId); + if (state && !this.isRepoClusterStateCurrent(repoId)) { + return null; + } return ( (this.db .prepare("select id, finished_at from cluster_runs where repo_id = ? and status = 'completed' order by id desc limit 1") @@ -2563,7 +3448,9 @@ export class GHCrawlService { } const summaryInput = parts.join('\n\n'); - const summaryContentHash = stableContentHash(`summary:${includeComments ? 'with-comments' : 'metadata-only'}\n${summaryInput}`); + const summaryContentHash = stableContentHash( + `summary:${SUMMARY_PROMPT_VERSION}:${includeComments ? 
'with-comments' : 'metadata-only'}\n${summaryInput}`, + ); return { summaryInput, summaryContentHash }; } @@ -2617,6 +3504,45 @@ export class GHCrawlService { return tasks; } + private buildActiveVectorTask(params: { + threadId: number; + threadNumber: number; + title: string; + body: string | null; + dedupeSummary: string | null; + }): ActiveVectorTask | null { + const sections = [`title: ${normalizeSummaryText(params.title)}`]; + if (this.config.embeddingBasis === 'title_summary') { + const summary = normalizeSummaryText(params.dedupeSummary ?? ''); + if (!summary) { + return null; + } + sections.push(`summary: ${summary}`); + } else { + const body = normalizeSummaryText(params.body ?? ''); + if (body) { + sections.push(`body: ${body}`); + } + } + + const prepared = this.prepareEmbeddingText(sections.join('\n\n'), EMBED_MAX_ITEM_TOKENS); + if (!prepared) { + return null; + } + + return { + threadId: params.threadId, + threadNumber: params.threadNumber, + basis: this.config.embeddingBasis, + text: prepared.text, + contentHash: stableContentHash( + `embedding:${ACTIVE_EMBED_PIPELINE_VERSION}:${this.config.embeddingBasis}:${this.config.embedModel}:${ACTIVE_EMBED_DIMENSIONS}\n${prepared.text}`, + ), + estimatedTokens: prepared.estimatedTokens, + wasTruncated: prepared.wasTruncated, + }; + } + private prepareEmbeddingText( text: string, maxEstimatedTokens: number, @@ -2665,13 +3591,14 @@ export class GHCrawlService { private async embedBatchWithRecovery( ai: AiProvider, - batch: EmbeddingTask[], + batch: ActiveVectorTask[], onProgress?: (message: string) => void, - ): Promise> { + ): Promise> { try { const embeddings = await ai.embedTexts({ model: this.config.embedModel, texts: batch.map((task) => task.text), + dimensions: ACTIVE_EMBED_DIMENSIONS, }); return batch.map((task, index) => ({ task, embedding: embeddings[index] })); } catch (error) { @@ -2687,7 +3614,7 @@ export class GHCrawlService { `[embed] batch context error; isolating ${batch.length} item(s) to find 
oversized input(s)`, ); - const recovered: Array<{ task: EmbeddingTask; embedding: number[] }> = []; + const recovered: Array<{ task: ActiveVectorTask; embedding: number[] }> = []; for (const task of batch) { recovered.push(await this.embedSingleTaskWithRecovery(ai, task, onProgress)); } @@ -2697,9 +3624,9 @@ export class GHCrawlService { private async embedSingleTaskWithRecovery( ai: AiProvider, - task: EmbeddingTask, + task: ActiveVectorTask, onProgress?: (message: string) => void, - ): Promise<{ task: EmbeddingTask; embedding: number[] }> { + ): Promise<{ task: ActiveVectorTask; embedding: number[] }> { let current = task; for (let attempt = 0; attempt < EMBED_CONTEXT_RETRY_ATTEMPTS; attempt += 1) { @@ -2707,6 +3634,7 @@ export class GHCrawlService { const [embedding] = await ai.embedTexts({ model: this.config.embedModel, texts: [current.text], + dimensions: ACTIVE_EMBED_DIMENSIONS, }); return { task: current, embedding }; } catch (error) { @@ -2720,19 +3648,19 @@ export class GHCrawlService { throw error; } onProgress?.( - `[embed] shortened #${current.threadNumber}:${current.sourceKind} after context error est_tokens=${current.estimatedTokens}->${next.estimatedTokens}`, + `[embed] shortened #${current.threadNumber}:${current.basis} after context error est_tokens=${current.estimatedTokens}->${next.estimatedTokens}`, ); current = next; } } - throw new Error(`Unable to shrink embedding input for #${task.threadNumber}:${task.sourceKind} below model limits`); + throw new Error(`Unable to shrink embedding input for #${task.threadNumber}:${task.basis} below model limits`); } private shrinkEmbeddingTask( - task: EmbeddingTask, + task: ActiveVectorTask, context?: { limitTokens: number | null; requestedTokens: number | null }, - ): EmbeddingTask | null { + ): ActiveVectorTask | null { const withoutMarker = task.text.endsWith(EMBED_TRUNCATION_MARKER) ? 
task.text.slice(0, -EMBED_TRUNCATION_MARKER.length) : task.text; @@ -2751,7 +3679,9 @@ export class GHCrawlService { return { ...task, text: nextText, - contentHash: stableContentHash(`embedding:${task.sourceKind}\n${nextText}`), + contentHash: stableContentHash( + `embedding:${ACTIVE_EMBED_PIPELINE_VERSION}:${task.basis}:${this.config.embedModel}:${ACTIVE_EMBED_DIMENSIONS}\n${nextText}`, + ), estimatedTokens: this.estimateEmbeddingTokens(nextText), wasTruncated: true, }; @@ -2777,9 +3707,9 @@ export class GHCrawlService { return Math.floor(textLength * EMBED_CONTEXT_RETRY_FALLBACK_SHRINK_RATIO); } - private chunkEmbeddingTasks(items: EmbeddingTask[], maxItems: number, maxEstimatedTokens: number): EmbeddingTask[][] { - const chunks: EmbeddingTask[][] = []; - let current: EmbeddingTask[] = []; + private chunkEmbeddingTasks(items: ActiveVectorTask[], maxItems: number, maxEstimatedTokens: number): ActiveVectorTask[][] { + const chunks: ActiveVectorTask[][] = []; + let current: ActiveVectorTask[] = []; let currentEstimatedTokens = 0; for (const item of items) { @@ -2847,6 +3777,59 @@ export class GHCrawlService { .iterate(repoId, this.config.embedModel) as IterableIterator; } + private loadNormalizedEmbeddingForSourceKindHead( + repoId: number, + sourceKind: EmbeddingSourceKind, + ): { id: number; normalizedEmbedding: number[] } | null { + const row = this.db + .prepare( + `select t.id, e.embedding_json + from threads t + join document_embeddings e on e.thread_id = t.id + where t.repo_id = ? + and t.state = 'open' + and t.closed_at_local is null + and e.model = ? + and e.source_kind = ? 
+ order by t.number asc + limit 1`, + ) + .get(repoId, this.config.embedModel, sourceKind) as { id: number; embedding_json: string } | undefined; + if (!row) { + return null; + } + return { + id: row.id, + normalizedEmbedding: normalizeEmbedding(JSON.parse(row.embedding_json) as number[]).normalized, + }; + } + + private *iterateNormalizedEmbeddingsForSourceKind( + repoId: number, + sourceKind: EmbeddingSourceKind, + ): IterableIterator<{ id: number; normalizedEmbedding: number[] }> { + const rows = this.db + .prepare( + `select t.id, e.embedding_json + from threads t + join document_embeddings e on e.thread_id = t.id + where t.repo_id = ? + and t.state = 'open' + and t.closed_at_local is null + and e.model = ? + and e.source_kind = ? + order by t.number asc`, + ) + .iterate(repoId, this.config.embedModel, sourceKind) as IterableIterator<{ id: number; embedding_json: string }>; + + for (const row of rows) { + yield { + id: row.id, + normalizedEmbedding: normalizeEmbedding(JSON.parse(row.embedding_json) as number[]).normalized, + }; + } + } + private loadNormalizedEmbeddingsForSourceKind( repoId: number, sourceKind: EmbeddingSourceKind, @@ -2871,6 +3854,14 @@ export class GHCrawlService { })); } + private normalizedEmbeddingBuffer(values: number[]): Buffer { + return Buffer.from(Float32Array.from(values).buffer); + } + + private normalizedDistanceToScore(distance: number): number { + return 1 - distance / 2; + } + private loadClusterableThreadMeta(repoId: number): { items: Array<{ id: number; number: number; title: string }>; sourceKinds: EmbeddingSourceKind[]; @@ -2899,6 +3890,43 @@ export class GHCrawlService { }; } + private loadClusterableActiveVectorMeta(repoId: number, _repoFullName: string): Array<{ id: number; number: number; title: string; embedding: number[] }> { + const rows = this.db + .prepare( + `select t.id, t.number, t.title, tv.vector_json + from threads t + join thread_vectors tv on tv.thread_id = t.id + where t.repo_id = ? 
+ and t.state = 'open' + and t.closed_at_local is null + and tv.model = ? + and tv.basis = ? + and tv.dimensions = ? + order by t.number asc`, + ) + .all(repoId, this.config.embedModel, this.config.embeddingBasis, ACTIVE_EMBED_DIMENSIONS) as Array<{ + id: number; + number: number; + title: string; + vector_json: Buffer | string; + }>; + return rows.map((row) => ({ + id: row.id, + number: row.number, + title: row.title, + embedding: this.parseStoredVector(row.vector_json), + })); + } + + private loadNormalizedActiveVectors(repoId: number): Array<{ id: number; number: number; title: string; embedding: number[] }> { + return this.loadClusterableActiveVectorMeta(repoId, '').map((row) => ({ + id: row.id, + number: row.number, + title: row.title, + embedding: normalizeEmbedding(row.embedding).normalized, + })); + } + private listStoredClusterNeighbors(repoId: number, threadId: number, limit: number): SearchHitDto['neighbors'] { const latestRun = this.getLatestClusterRun(repoId); if (!latestRun) { @@ -2972,72 +4000,76 @@ export class GHCrawlService { title: string; body: string | null; }>; - const summaryTexts = this.loadCombinedSummaryTextMap(repoId, threadNumber); - const tasks = rows.flatMap((row) => - this.buildEmbeddingTasks({ + const summaryTexts = this.loadDedupeSummaryTextMap(repoId, threadNumber); + const missingSummaryThreadNumbers: number[] = []; + const tasks = rows.flatMap((row) => { + const task = this.buildActiveVectorTask({ threadId: row.id, threadNumber: row.number, title: row.title, body: row.body, dedupeSummary: summaryTexts.get(row.id) ?? 
null, - }), - ); + }); + if (task) { + return [task]; + } + if (this.config.embeddingBasis === 'title_summary') { + missingSummaryThreadNumbers.push(row.number); + } + return []; + }); + const pipelineCurrent = this.isRepoVectorStateCurrent(repoId); const existingRows = this.db .prepare( - `select e.thread_id, e.source_kind, e.content_hash - from document_embeddings e - join threads t on t.id = e.thread_id - where t.repo_id = ? and e.model = ?`, + `select tv.thread_id, tv.content_hash + from thread_vectors tv + join threads t on t.id = tv.thread_id + where t.repo_id = ? + and tv.model = ? + and tv.basis = ? + and tv.dimensions = ?`, ) - .all(repoId, this.config.embedModel) as Array<{ + .all(repoId, this.config.embedModel, this.config.embeddingBasis, ACTIVE_EMBED_DIMENSIONS) as Array<{ thread_id: number; - source_kind: EmbeddingSourceKind; content_hash: string; }>; const existing = new Map(); for (const row of existingRows) { - existing.set(`${row.thread_id}:${row.source_kind}`, row.content_hash); + existing.set(String(row.thread_id), row.content_hash); } - const pending = tasks.filter((task) => existing.get(`${task.threadId}:${task.sourceKind}`) !== task.contentHash); - return { rows, tasks, existing, pending }; + const pending = pipelineCurrent + ? tasks.filter((task) => existing.get(String(task.threadId)) !== task.contentHash) + : tasks; + return { rows, tasks, existing, pending, missingSummaryThreadNumbers }; } - private loadCombinedSummaryTextMap(repoId: number, threadNumber?: number): Map { + private loadDedupeSummaryTextMap(repoId: number, threadNumber?: number): Map { let sql = - `select s.thread_id, s.summary_kind, s.summary_text + `select s.thread_id, s.summary_text from document_summaries s join threads t on t.id = s.thread_id - where t.repo_id = ? and t.state = 'open' and t.closed_at_local is null and s.model = ?`; - const args: Array = [repoId, this.config.summaryModel]; + where t.repo_id = ? 
+ and t.state = 'open' + and t.closed_at_local is null + and s.model = ? + and s.summary_kind = 'dedupe_summary' + and s.prompt_version = ?`; + const args: Array = [repoId, this.config.summaryModel, SUMMARY_PROMPT_VERSION]; if (threadNumber) { sql += ' and t.number = ?'; args.push(threadNumber); } - sql += ' order by t.number asc, s.summary_kind asc'; + sql += ' order by t.number asc'; const rows = this.db.prepare(sql).all(...args) as Array<{ thread_id: number; - summary_kind: string; summary_text: string; }>; - const byThread = new Map>(); - for (const row of rows) { - const entry = byThread.get(row.thread_id) ?? new Map(); - entry.set(row.summary_kind, normalizeSummaryText(row.summary_text)); - byThread.set(row.thread_id, entry); - } - const combined = new Map(); - const order = ['problem_summary', 'solution_summary', 'maintainer_signal_summary', 'dedupe_summary']; - for (const [threadId, entry] of byThread.entries()) { - const parts = order - .map((summaryKind) => { - const text = entry.get(summaryKind); - return text ? `${summaryKind}: ${text}` : ''; - }) - .filter(Boolean); - if (parts.length > 0) { - combined.set(threadId, parts.join('\n\n')); + for (const row of rows) { + const text = normalizeSummaryText(row.summary_text); + if (text) { + combined.set(row.thread_id, text); } } return combined; @@ -3171,6 +4203,90 @@ export class GHCrawlService { } } + private collectSourceKindScores( + perSourceScores: Map }>, + edges: Array<{ leftThreadId: number; rightThreadId: number; score: number }>, + sourceKind: EmbeddingSourceKind, + ): void { + for (const edge of edges) { + const key = this.edgeKey(edge.leftThreadId, edge.rightThreadId); + const existing = perSourceScores.get(key); + if (existing) { + existing.scores.set(sourceKind, Math.max(existing.scores.get(sourceKind) ?? 
-1, edge.score)); + continue; + } + const scores = new Map(); + scores.set(sourceKind, edge.score); + perSourceScores.set(key, { + leftThreadId: edge.leftThreadId, + rightThreadId: edge.rightThreadId, + scores, + }); + } + } + + private finalizeEdgeScores( + perSourceScores: Map }>, + aggregation: 'max' | 'mean' | 'weighted' | 'min-of-2' | 'boost', + weights: Record, + minScore: number, + ): Array<{ leftThreadId: number; rightThreadId: number; score: number }> { + const result: Array<{ leftThreadId: number; rightThreadId: number; score: number }> = []; + + for (const entry of perSourceScores.values()) { + const scoreValues = Array.from(entry.scores.values()); + let finalScore: number; + + switch (aggregation) { + case 'max': + finalScore = Math.max(...scoreValues); + break; + + case 'mean': + finalScore = scoreValues.reduce((a, b) => a + b, 0) / scoreValues.length; + break; + + case 'weighted': { + let weightedSum = 0; + let weightSum = 0; + for (const [kind, score] of entry.scores) { + const w = weights[kind] ?? 0.1; + weightedSum += score * w; + weightSum += w; + } + finalScore = weightSum > 0 ? 
weightedSum / weightSum : 0; + break; + } + + case 'min-of-2': + // Require at least 2 source kinds to agree (both above minScore) + if (scoreValues.length < 2) { + continue; // Skip edges with only 1 source kind + } + finalScore = Math.max(...scoreValues); + break; + + case 'boost': { + // Best score + bonus per additional agreeing source + const best = Math.max(...scoreValues); + const bonusSources = scoreValues.length - 1; + finalScore = Math.min(1.0, best + bonusSources * 0.05); + break; + } + } + + if (finalScore >= minScore) { + result.push({ + leftThreadId: entry.leftThreadId, + rightThreadId: entry.rightThreadId, + score: finalScore, + }); + } + } + + return result; + } + private countEmbeddingsForSourceKind(repoId: number, sourceKind: EmbeddingSourceKind): number { const row = this.db .prepare( @@ -3253,17 +4369,149 @@ export class GHCrawlService { this.db.prepare('delete from cluster_runs where repo_id = ? and id <> ?').run(repoId, keepRunId); } + private summarizeClusterSizes( + clusters: Array<{ representativeThreadId: number; members: number[] }>, + ): ClusterExperimentClusterSizeStats { + const histogramCounts = new Map(); + const topClusterSizes = clusters.map((cluster) => cluster.members.length).sort((left, right) => right - left); + let soloClusters = 0; + + for (const cluster of clusters) { + const size = cluster.members.length; + histogramCounts.set(size, (histogramCounts.get(size) ?? 0) + 1); + if (size === 1) { + soloClusters += 1; + } + } + + return { + soloClusters, + maxClusterSize: topClusterSizes[0] ?? 
0, + topClusterSizes: topClusterSizes.slice(0, 50), + histogram: Array.from(histogramCounts.entries()) + .map(([size, count]) => ({ size, count })) + .sort((left, right) => left.size - right.size), + }; + } + private upsertSummary(threadId: number, contentHash: string, summaryKind: string, summaryText: string): void { this.db .prepare( - `insert into document_summaries (thread_id, summary_kind, model, content_hash, summary_text, created_at, updated_at) - values (?, ?, ?, ?, ?, ?, ?) + `insert into document_summaries (thread_id, summary_kind, model, prompt_version, content_hash, summary_text, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?) on conflict(thread_id, summary_kind, model) do update set + prompt_version = excluded.prompt_version, content_hash = excluded.content_hash, summary_text = excluded.summary_text, updated_at = excluded.updated_at`, ) - .run(threadId, summaryKind, this.config.summaryModel, contentHash, summaryText, nowIso(), nowIso()); + .run(threadId, summaryKind, this.config.summaryModel, SUMMARY_PROMPT_VERSION, contentHash, summaryText, nowIso(), nowIso()); + } + + private upsertActiveVector( + repoId: number, + repoFullName: string, + threadId: number, + basis: EmbeddingBasis, + contentHash: string, + embedding: number[], + ): void { + this.db + .prepare( + `insert into thread_vectors (thread_id, basis, model, dimensions, content_hash, vector_json, vector_backend, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ on conflict(thread_id) do update set + basis = excluded.basis, + model = excluded.model, + dimensions = excluded.dimensions, + content_hash = excluded.content_hash, + vector_json = excluded.vector_json, + vector_backend = excluded.vector_backend, + updated_at = excluded.updated_at`, + ) + .run( + threadId, + basis, + this.config.embedModel, + embedding.length, + contentHash, + this.vectorBlob(embedding), + this.config.vectorBackend, + nowIso(), + nowIso(), + ); + this.vectorStore.upsertVector({ + storePath: this.repoVectorStorePath(repoFullName), + dimensions: ACTIVE_EMBED_DIMENSIONS, + threadId, + vector: embedding, + }); + } + + private countLegacyEmbeddings(repoId: number): number { + const row = this.db + .prepare( + `select count(*) as count + from document_embeddings + where thread_id in (select id from threads where repo_id = ?)`, + ) + .get(repoId) as { count: number }; + return row.count; + } + + private countInlineJsonThreadVectors(repoId: number): number { + const row = this.db + .prepare( + `select count(*) as count + from thread_vectors + where thread_id in (select id from threads where repo_id = ?) 
+ and typeof(vector_json) = 'text' + and vector_json != ''`, + ) + .get(repoId) as { count: number }; + return row.count; + } + + private getVectorliteClusterQuery(totalItems: number, requestedK: number): { + limit: number; + candidateK: number; + efSearch?: number; + } { + if (totalItems < CLUSTER_PARALLEL_MIN_EMBEDDINGS) { + return { + limit: requestedK, + candidateK: Math.max(requestedK * 16, 64), + }; + } + + const limit = Math.min( + Math.max(requestedK * VECTORLITE_CLUSTER_EXPANDED_MULTIPLIER, VECTORLITE_CLUSTER_EXPANDED_K), + Math.max(1, totalItems - 1), + ); + const candidateK = Math.min( + Math.max(limit * 16, VECTORLITE_CLUSTER_EXPANDED_CANDIDATE_K), + Math.max(limit, totalItems - 1), + ); + return { + limit, + candidateK, + efSearch: Math.max(candidateK * 2, VECTORLITE_CLUSTER_EXPANDED_EF_SEARCH), + }; + } + + private vectorBlob(values: number[]): Buffer { + return Buffer.from(Float32Array.from(values).buffer); + } + + private parseStoredVector(value: Buffer | string): number[] { + if (typeof value === 'string') { + if (!value) { + throw new Error('Stored vector payload is empty. 
Run refresh or embed first.'); + } + return JSON.parse(value) as number[]; + } + const floats = new Float32Array(value.buffer, value.byteOffset, Math.floor(value.byteLength / Float32Array.BYTES_PER_ELEMENT)); + return Array.from(floats); } private upsertEmbedding(threadId: number, sourceKind: EmbeddingSourceKind, contentHash: string, embedding: number[]): void { diff --git a/packages/api-core/src/vector/store.ts b/packages/api-core/src/vector/store.ts new file mode 100644 index 0000000..73a6fa0 --- /dev/null +++ b/packages/api-core/src/vector/store.ts @@ -0,0 +1,28 @@ +export type VectorStoreHealth = { + ok: boolean; + error: string | null; +}; + +export type VectorNeighbor = { + threadId: number; + score: number; +}; + +export type VectorQueryParams = { + storePath: string; + dimensions: number; + vector: number[]; + limit: number; + candidateK?: number; + excludeThreadId?: number; + efSearch?: number; +}; + +export type VectorStore = { + checkRuntime: () => VectorStoreHealth; + resetRepository: (params: { storePath: string; dimensions: number }) => void; + upsertVector: (params: { storePath: string; dimensions: number; threadId: number; vector: number[] }) => void; + deleteVector: (params: { storePath: string; dimensions: number; threadId: number }) => void; + queryNearest: (params: VectorQueryParams) => VectorNeighbor[]; + close: () => void; +}; diff --git a/packages/api-core/src/vector/vectorlite-store.test.ts b/packages/api-core/src/vector/vectorlite-store.test.ts new file mode 100644 index 0000000..c7eab8e --- /dev/null +++ b/packages/api-core/src/vector/vectorlite-store.test.ts @@ -0,0 +1,88 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; + +import { VectorliteStore } from './vectorlite-store.js'; + +function makeStorePath(): string { + const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-vector-store-test-')); + return path.join(dir, 
'repo.sqlite'); +} + +test('vectorlite store persists vectors across reopen', () => { + const storePath = makeStorePath(); + const vector = [1, 0, 0, 0]; + const neighbor = [0.9, 0.1, 0, 0]; + const far = [0, 1, 0, 0]; + + const first = new VectorliteStore(); + try { + const health = first.checkRuntime(); + assert.equal(health.ok, true); + first.upsertVector({ storePath, dimensions: 4, threadId: 1, vector }); + first.upsertVector({ storePath, dimensions: 4, threadId: 2, vector: neighbor }); + first.upsertVector({ storePath, dimensions: 4, threadId: 3, vector: far }); + } finally { + first.close(); + } + + const reopened = new VectorliteStore(); + try { + const results = reopened.queryNearest({ + storePath, + dimensions: 4, + vector, + limit: 2, + excludeThreadId: 1, + candidateK: 3, + }); + assert.deepEqual(results.map((row) => row.threadId), [2, 3]); + assert.ok(results[0]!.score > results[1]!.score); + } finally { + reopened.close(); + } +}); + +test('vectorlite store update and delete affect later queries', () => { + const storePath = makeStorePath(); + const store = new VectorliteStore(); + try { + store.upsertVector({ storePath, dimensions: 3, threadId: 1, vector: [1, 0, 0] }); + store.upsertVector({ storePath, dimensions: 3, threadId: 2, vector: [0.8, 0.2, 0] }); + let results = store.queryNearest({ + storePath, + dimensions: 3, + vector: [1, 0, 0], + limit: 1, + excludeThreadId: 1, + candidateK: 2, + }); + assert.deepEqual(results.map((row) => row.threadId), [2]); + + store.upsertVector({ storePath, dimensions: 3, threadId: 2, vector: [0, 1, 0] }); + results = store.queryNearest({ + storePath, + dimensions: 3, + vector: [1, 0, 0], + limit: 1, + excludeThreadId: 1, + candidateK: 2, + }); + assert.ok(results[0]!.score < 0.5); + + store.deleteVector({ storePath, dimensions: 3, threadId: 2 }); + results = store.queryNearest({ + storePath, + dimensions: 3, + vector: [1, 0, 0], + limit: 1, + excludeThreadId: 1, + candidateK: 2, + }); + assert.deepEqual(results, 
[]); + } finally { + store.close(); + } +}); diff --git a/packages/api-core/src/vector/vectorlite-store.ts b/packages/api-core/src/vector/vectorlite-store.ts new file mode 100644 index 0000000..dfdb3cb --- /dev/null +++ b/packages/api-core/src/vector/vectorlite-store.ts @@ -0,0 +1,159 @@ +import { createRequire } from 'node:module'; +import fs from 'node:fs'; +import path from 'node:path'; + +import { openDb, type SqliteDatabase } from '../db/sqlite.js'; +import type { VectorNeighbor, VectorQueryParams, VectorStore, VectorStoreHealth } from './store.js'; + +const requireFromHere = createRequire(import.meta.url); +const TABLE_NAME = 'thread_vectors_ann'; +const META_TABLE_NAME = 'vector_store_meta'; +const HNSW_MAX_ELEMENTS = 1_000_000; + +type SqliteWithExtension = SqliteDatabase & { + loadExtension: (extensionPath: string) => void; +}; + +type StoreHandle = { + db: SqliteWithExtension; + storePath: string; + dimensions: number | null; +}; + +export class VectorliteStore implements VectorStore { + private readonly handles = new Map(); + + constructor( + private readonly options: { + extensionPathProvider?: () => string; + } = {}, + ) {} + + checkRuntime(): VectorStoreHealth { + try { + this.resolveExtensionPath(); + const db = openDb(':memory:') as SqliteWithExtension; + try { + db.loadExtension(this.resolveExtensionPath()); + db.prepare('select vectorlite_info()').get(); + } finally { + db.close(); + } + return { ok: true, error: null }; + } catch (error) { + return { + ok: false, + error: error instanceof Error ? 
error.message : String(error), + }; + } + } + + resetRepository(params: { storePath: string; dimensions: number }): void { + const handle = this.getHandle(params.storePath, params.dimensions); + handle.db.exec(`drop table if exists ${TABLE_NAME}`); + handle.db.exec(`delete from ${META_TABLE_NAME}`); + fs.rmSync(this.indexPath(params.storePath), { force: true }); + handle.dimensions = null; + this.ensureSchema(handle, params.dimensions); + } + + upsertVector(params: { storePath: string; dimensions: number; threadId: number; vector: number[] }): void { + const handle = this.getHandle(params.storePath, params.dimensions); + handle.db.exec(`delete from ${TABLE_NAME} where rowid = ${Math.trunc(params.threadId)}`); + handle.db + .prepare(`insert into ${TABLE_NAME}(rowid, vec) values (?, ?)`) + .run(params.threadId, this.vectorBuffer(params.vector)); + } + + deleteVector(params: { storePath: string; dimensions: number; threadId: number }): void { + const handle = this.getHandle(params.storePath, params.dimensions); + handle.db.exec(`delete from ${TABLE_NAME} where rowid = ${Math.trunc(params.threadId)}`); + } + + queryNearest(params: VectorQueryParams): VectorNeighbor[] { + const handle = this.getHandle(params.storePath, params.dimensions); + const safeLimit = Math.max(1, params.limit); + const safeCandidateK = Math.max(safeLimit, params.candidateK ?? safeLimit); + const querySql = + params.efSearch !== undefined + ? 
`select rowid, distance from ${TABLE_NAME} where knn_search(vec, knn_param(?, ${safeCandidateK}, ${params.efSearch}))` + : `select rowid, distance from ${TABLE_NAME} where knn_search(vec, knn_param(?, ${safeCandidateK}))`; + const rows = handle.db.prepare(querySql).all([this.vectorBuffer(params.vector)]) as Array<{ rowid: number; distance: number }>; + + return rows + .filter((row) => row.rowid !== params.excludeThreadId) + .slice(0, safeLimit) + .map((row) => ({ + threadId: row.rowid, + score: this.distanceToScore(row.distance), + })); + } + + close(): void { + for (const handle of this.handles.values()) { + handle.db.close(); + } + this.handles.clear(); + } + + private getHandle(storePath: string, dimensions: number): StoreHandle { + const existing = this.handles.get(storePath); + if (existing) { + this.ensureSchema(existing, dimensions); + return existing; + } + + const db = openDb(storePath) as SqliteWithExtension; + db.pragma('journal_mode = WAL'); + db.pragma('synchronous = NORMAL'); + db.loadExtension(this.resolveExtensionPath()); + const handle: StoreHandle = { db, storePath, dimensions: null }; + this.handles.set(storePath, handle); + this.ensureSchema(handle, dimensions); + return handle; + } + + private ensureSchema(handle: StoreHandle, dimensions: number): void { + handle.db.exec(`create table if not exists ${META_TABLE_NAME} (id integer primary key check (id = 1), dimensions integer not null)`); + const meta = handle.db.prepare(`select dimensions from ${META_TABLE_NAME} where id = 1`).get() as { dimensions: number } | undefined; + const tableExists = Boolean( + handle.db.prepare("select 1 from sqlite_master where type = 'table' and name = ? 
limit 1").get(TABLE_NAME), + ); + + if (!meta || meta.dimensions !== dimensions || !tableExists) { + handle.db.exec(`drop table if exists ${TABLE_NAME}`); + handle.db.exec(`delete from ${META_TABLE_NAME}`); + const indexPath = this.indexPath(handle.storePath); + handle.db.exec( + `create virtual table ${TABLE_NAME} using vectorlite(vec float32[${dimensions}], hnsw(max_elements=${HNSW_MAX_ELEMENTS}), '${this.escapeSqlString(indexPath)}')`, + ); + handle.db.prepare(`insert into ${META_TABLE_NAME}(id, dimensions) values (1, ?)`).run(dimensions); + } + + handle.dimensions = dimensions; + } + + private resolveExtensionPath(): string { + if (this.options.extensionPathProvider) { + return this.options.extensionPathProvider(); + } + const vectorlite = requireFromHere('vectorlite') as { vectorlitePath: () => string }; + return vectorlite.vectorlitePath(); + } + + private vectorBuffer(vector: number[]): Buffer { + return Buffer.from(Float32Array.from(vector).buffer); + } + + private distanceToScore(distance: number): number { + return 1 - distance / 2; + } + + private indexPath(storePath: string): string { + return path.join(path.dirname(storePath), `${path.basename(storePath, path.extname(storePath))}.hnsw`); + } + + private escapeSqlString(value: string): string { + return value.replace(/'/g, "''"); + } +} diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index b6f15cb..2127129 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -69,7 +69,7 @@ importers: specifier: ^1.1.2 version: 1.1.2(aggregate-error@3.1.0) better-sqlite3: - specifier: ^12.8.0 + specifier: ^12.2.0 version: 12.8.0 dotenv: specifier: ^17.2.2 @@ -80,12 +80,35 @@ importers: openai: specifier: ^6.33.0 version: 6.33.0(zod@4.3.6) + vectorlite: + specifier: ^0.2.0 + version: 0.2.0 zod: specifier: ^4.3.6 version: 4.3.6 packages: + '@1yefuwang1/vectorlite-darwin-arm64@0.2.0': + resolution: {integrity: sha512-taYA4xt4zFgQi0DwZvS1pr0bYmP7uAYVScxFGf5rl3CIxOXDxPyHytHRrvU5rqPUprLZlnxEb/gNGvCplYLNhA==} + cpu: [arm64] + os: 
[darwin] + + '@1yefuwang1/vectorlite-darwin-x64@0.2.0': + resolution: {integrity: sha512-vOU7h+PPE7VoJkUw6UQ2IT+a4Stzgj103yKdr85iLJqi4oX8TjxNH2OP5c1LKU8ZT/vQziKAqUE+p63XmFP8LA==} + cpu: [x64] + os: [darwin] + + '@1yefuwang1/vectorlite-linux-x64@0.2.0': + resolution: {integrity: sha512-VWgE4DVPVKoVzStnUR+PASWUloi5mUO+z7TEfsu6zHEBzPkOgL9ofC0DDr9/rFo1UNozN93j17ZOZL0HTsZKcQ==} + cpu: [x64] + os: [linux] + + '@1yefuwang1/vectorlite-win32-x64@0.2.0': + resolution: {integrity: sha512-NLZLfxQf2wS+PrhL1bavrULOSQhaOItGVsLebKy0Tu8Hpn81M3KPmweKwNiVTSqfLzy0IlhSalLlb+ypUdRm0Q==} + cpu: [x64] + os: [win32] + '@clack/core@1.1.0': resolution: {integrity: sha512-SVcm4Dqm2ukn64/8Gub2wnlA5nS2iWJyCkdNHcvNHPIeBTGojpdJ+9cZKwLfmqy7irD4N5qLteSilJlE0WLAtA==} @@ -590,6 +613,9 @@ packages: util-deprecate@1.0.2: resolution: {integrity: sha512-EPD5q1uXyFxJpCrLnCc1nHnq3gOa6DZBocAIiI2TaSCA7VCJ1UJDMagCzIkXNsUYfD1daK//LTEQ8xiIbrHtcw==} + vectorlite@0.2.0: + resolution: {integrity: sha512-hHFAISJuUblqTecD/EtNmEhIm4P6vTax4tswN486qCBDtse9uqBPLGdUsW4+CSjyG9Zoc0Jxj+dubQSZjvYGqg==} + wrappy@1.0.2: resolution: {integrity: sha512-l4Sp/DRseor9wL6EvV2+TuQn63dMkPjZ/sp9XkghTEbV9KlPS1xUsZ3u7/IQO4wxtcFB4bgpQPRcR3QCvezPcQ==} @@ -603,6 +629,18 @@ packages: snapshots: + '@1yefuwang1/vectorlite-darwin-arm64@0.2.0': + optional: true + + '@1yefuwang1/vectorlite-darwin-x64@0.2.0': + optional: true + + '@1yefuwang1/vectorlite-linux-x64@0.2.0': + optional: true + + '@1yefuwang1/vectorlite-win32-x64@0.2.0': + optional: true + '@clack/core@1.1.0': dependencies: sisteransi: 1.0.5 @@ -1090,6 +1128,13 @@ snapshots: util-deprecate@1.0.2: {} + vectorlite@0.2.0: + optionalDependencies: + '@1yefuwang1/vectorlite-darwin-arm64': 0.2.0 + '@1yefuwang1/vectorlite-darwin-x64': 0.2.0 + '@1yefuwang1/vectorlite-linux-x64': 0.2.0 + '@1yefuwang1/vectorlite-win32-x64': 0.2.0 + wrappy@1.0.2: {} yaml@2.8.3: {} diff --git a/scripts/cluster-judge-experiment.mjs b/scripts/cluster-judge-experiment.mjs new file mode 100644 index 0000000..f44d286 
--- /dev/null +++ b/scripts/cluster-judge-experiment.mjs @@ -0,0 +1,435 @@ +#!/usr/bin/env node +/** + * Run a clustering experiment with LLM-as-judge evaluation. + * + * 1. Runs clusterExperiment with given params + * 2. Samples clusters (stratified: top-by-size, mid-range, small) + * 3. Samples singletons for false-negative evaluation + * 4. Judges each sample with an LLM + * 5. Outputs aggregate scores + full results JSON + * + * Usage: + * node scripts/cluster-judge-experiment.mjs openclaw/openclaw \ + * --experiment-id baseline \ + * --source-kinds title,body,dedupe_summary \ + * --aggregation max \ + * --threshold 0.82 \ + * --output-dir .context/compound-engineering/ce-optimize/embedding-clustering/results + * + * Requires OPENAI_API_KEY in environment. + */ +import fs from 'node:fs'; +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { createRequire } from 'node:module'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); +const { GHCrawlService } = await import(serviceModulePath); + +const apiCoreRequire = createRequire(path.join(repoRoot, 'packages', 'api-core', 'package.json')); +const { default: OpenAI } = await import(apiCoreRequire.resolve('openai')); + +const CLUSTER_RUBRIC = `You are evaluating a cluster of GitHub issues/PRs that were grouped together by embedding similarity. Each item shows its number, kind (issue/PR), title, and dedupe_summary. 
+ +Rate this cluster 1-5 for COHERENCE: +- 5: All items clearly about the same specific issue, feature, or component +- 4: Strong theme with minor outliers (1 loosely related item) +- 3: Related topic area but covers 2-3 distinct sub-topics that could be split +- 2: Weak connection — items share superficial similarity only +- 1: Unrelated items grouped together, no meaningful connection + +Also report: +- distinct_topics: integer — how many distinct sub-topics are in this cluster +- outlier_count: integer — items that don't belong +- dominant_theme: string — 1 sentence describing the main topic + +Return JSON only: { "score": , "distinct_topics": , "outlier_count": , "dominant_theme": "", "reasoning": "" }`; + +const SINGLETON_RUBRIC = `This GitHub thread is currently a SINGLETON — it was not grouped with any other thread in a repository of ~18k issues/PRs. Given its title and dedupe_summary, evaluate whether this is correct. + +Rate 1-5: +- 5: Clearly unique topic, no plausible duplicates would exist +- 4: Probably unique, though a loose connection to other topics is possible +- 3: Uncertain — could go either way, might have related threads +- 2: Likely should be grouped — the topic is common enough to have duplicates +- 1: Obvious false negative — this clearly belongs with other threads on a common topic + +Return JSON only: { "score": , "reasoning": "" }`; + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let experimentId = 'unnamed'; + let outputDir = '.context/compound-engineering/ce-optimize/embedding-clustering/results'; + let sourceKinds; + let aggregation; + let aggregationWeights; + let threshold; + let k; + let candidateK; + let efSearch; + let backend = 'vectorlite'; + let maxClusterSize = 200; + let clusterMode = 'bounded'; + let clusterSampleSize = 30; + let singletonSampleSize = 15; + let judgeModel = 'gpt-5-mini'; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if 
(token === '--experiment-id') { experimentId = argv[++index]; continue; } + if (token === '--output-dir') { outputDir = argv[++index]; continue; } + if (token === '--source-kinds') { sourceKinds = argv[++index].split(','); continue; } + if (token === '--aggregation') { aggregation = argv[++index]; continue; } + if (token === '--weights') { aggregationWeights = JSON.parse(argv[++index]); continue; } + if (token === '--threshold') { threshold = Number(argv[++index]); continue; } + if (token === '--k') { k = Number(argv[++index]); continue; } + if (token === '--candidate-k') { candidateK = Number(argv[++index]); continue; } + if (token === '--ef-search') { efSearch = Number(argv[++index]); continue; } + if (token === '--backend') { backend = argv[++index]; continue; } + if (token === '--max-cluster-size') { maxClusterSize = Number(argv[++index]); continue; } + if (token === '--cluster-mode') { clusterMode = argv[++index]; continue; } + if (token === '--cluster-sample-size') { clusterSampleSize = Number(argv[++index]); continue; } + if (token === '--singleton-sample-size') { singletonSampleSize = Number(argv[++index]); continue; } + if (token === '--judge-model') { judgeModel = argv[++index]; continue; } + if (!token.startsWith('--')) repo = token; + } + + const [owner, name] = repo.split('/'); + return { + owner, repo: name, + experimentId, outputDir, + backend, + sourceKinds, aggregation, aggregationWeights, + threshold, k, candidateK, efSearch, + maxClusterSize, clusterMode, + clusterSampleSize, singletonSampleSize, + judgeModel, + }; +} + +function sampleClusters(clusters, sampleSize, seed = 42) { + // Separate multi-member clusters from singletons + const multiMember = clusters.filter(c => c.memberThreadIds.length > 1); + const singletons = clusters.filter(c => c.memberThreadIds.length === 1); + + // Sort by size descending + multiMember.sort((a, b) => b.memberThreadIds.length - a.memberThreadIds.length); + + const perBucket = Math.floor(sampleSize / 3); + const 
sampled = []; + + // Top by size + sampled.push(...multiMember.slice(0, perBucket).map(c => ({ ...c, bucket: 'top_by_size' }))); + + // Mid range + const midStart = Math.floor(multiMember.length * 0.3); + const midEnd = Math.floor(multiMember.length * 0.7); + const midPool = multiMember.slice(midStart, midEnd); + // Deterministic pseudo-random selection + const midSampled = deterministicSample(midPool, perBucket, seed); + sampled.push(...midSampled.map(c => ({ ...c, bucket: 'mid_range' }))); + + // Small clusters (size 2-3) + const smallPool = multiMember.filter(c => c.memberThreadIds.length <= 3); + const remaining = sampleSize - sampled.length; + const smallSampled = deterministicSample(smallPool, remaining, seed + 1); + sampled.push(...smallSampled.map(c => ({ ...c, bucket: 'small_clusters' }))); + + return { sampled, singletons }; +} + +function deterministicSample(pool, count, seed) { + if (pool.length <= count) return [...pool]; + // Simple seeded shuffle + const indices = pool.map((_, i) => i); + let s = seed; + for (let i = indices.length - 1; i > 0; i--) { + s = (s * 1103515245 + 12345) & 0x7fffffff; + const j = s % (i + 1); + [indices[i], indices[j]] = [indices[j], indices[i]]; + } + return indices.slice(0, count).map(i => pool[i]); +} + +async function judgeCluster(client, model, cluster, threadDetails) { + // For large clusters, show a sample of items to avoid exceeding context limits + let displayIds = cluster.memberThreadIds; + let truncationNote = ''; + if (displayIds.length > 25) { + // Show first 10, last 5, and 10 evenly spaced from the middle + const first = displayIds.slice(0, 10); + const last = displayIds.slice(-5); + const middle = []; + const step = Math.floor((displayIds.length - 15) / 10); + for (let i = 10; i < displayIds.length - 5 && middle.length < 10; i += Math.max(1, step)) { + middle.push(displayIds[i]); + } + displayIds = [...first, ...middle, ...last]; + truncationNote = `\n(Showing ${displayIds.length} of 
${cluster.memberThreadIds.length} items — sampled for brevity)`; + } + + const items = displayIds.map(id => { + const t = threadDetails.get(id); + if (!t) return ` - Thread ID ${id}: (details not found)`; + return ` - #${t.number} (${t.kind}): "${t.title}" — ${t.dedupeSummary || '(no summary)'}`; + }).join('\n'); + + const input = `Cluster with ${cluster.memberThreadIds.length} items:${truncationNote}\n${items}`; + + const response = await client.responses.create({ + model, + input: [ + { role: 'system', content: [{ type: 'input_text', text: CLUSTER_RUBRIC }] }, + { role: 'user', content: [{ type: 'input_text', text: input }] }, + ], + text: { + format: { type: 'json_schema', name: 'cluster_judge', strict: true, schema: { + type: 'object', + properties: { + score: { type: 'integer' }, + distinct_topics: { type: 'integer' }, + outlier_count: { type: 'integer' }, + dominant_theme: { type: 'string' }, + reasoning: { type: 'string' }, + }, + required: ['score', 'distinct_topics', 'outlier_count', 'dominant_theme', 'reasoning'], + additionalProperties: false, + }}, + }, + max_output_tokens: 800, + }); + + try { + return JSON.parse(response.output_text ?? 
'{}'); + } catch { + return { score: null, reasoning: 'parse error' }; + } +} + +async function judgeSingleton(client, model, threadDetail) { + const input = `Thread #${threadDetail.number} (${threadDetail.kind}): "${threadDetail.title}"\ndedupe_summary: ${threadDetail.dedupeSummary || '(none)'}`; + + const response = await client.responses.create({ + model, + input: [ + { role: 'system', content: [{ type: 'input_text', text: SINGLETON_RUBRIC }] }, + { role: 'user', content: [{ type: 'input_text', text: input }] }, + ], + text: { + format: { type: 'json_schema', name: 'singleton_judge', strict: true, schema: { + type: 'object', + properties: { + score: { type: 'integer' }, + reasoning: { type: 'string' }, + }, + required: ['score', 'reasoning'], + additionalProperties: false, + }}, + }, + max_output_tokens: 500, + }); + + try { + return JSON.parse(response.output_text ?? '{}'); + } catch { + return { score: null, reasoning: 'parse error' }; + } +} + +// Main execution +const args = parseArgs(process.argv.slice(2)); + +const apiKey = process.env.OPENAI_API_KEY; +if (!apiKey) throw new Error('OPENAI_API_KEY not set'); +const client = new OpenAI({ apiKey }); + +const service = new GHCrawlService(); + +try { + // Step 1: Run clustering + process.stderr.write(`[experiment] ${args.experimentId}: running clustering...\n`); + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.backend, + minScore: args.threshold, + k: args.k, + candidateK: args.candidateK, + efSearch: args.efSearch, + maxClusterSize: args.maxClusterSize, + clusterMode: args.clusterMode, + sourceKinds: args.sourceKinds, + aggregation: args.aggregation, + aggregationWeights: args.aggregationWeights, + includeClusters: true, + onProgress: (msg) => process.stderr.write(`${msg}\n`), + }); + + const totalThreads = result.threads; + const soloClusters = result.clusterSizes.soloClusters; + const multiMemberClusters = result.clusters - soloClusters; + const 
threadsInMulti = totalThreads - soloClusters; + const multiMemberPct = totalThreads > 0 ? threadsInMulti / totalThreads : 0; + + const metrics = { + multi_member_pct: Math.round(multiMemberPct * 10000) / 100, + edge_count: result.edges, + cluster_count: result.clusters, + solo_clusters: soloClusters, + multi_member_clusters: multiMemberClusters, + threads_in_multi: threadsInMulti, + total_threads: totalThreads, + max_cluster_size: result.clusterSizes.maxClusterSize, + solo_pct: Math.round((soloClusters / Math.max(result.clusters, 1)) * 10000) / 100, + avg_multi_size: multiMemberClusters > 0 ? Math.round((threadsInMulti / multiMemberClusters) * 100) / 100 : 0, + duration_ms: result.durationMs, + }; + + process.stderr.write(`[experiment] clustering done: ${metrics.multi_member_pct}% multi-member, ${metrics.edge_count} edges\n`); + + // Check degenerate gates + if (metrics.solo_pct >= 95 || metrics.max_cluster_size > 500 || metrics.multi_member_pct < 5) { + process.stderr.write(`[experiment] DEGENERATE: solo_pct=${metrics.solo_pct} max_cluster=${metrics.max_cluster_size} multi%=${metrics.multi_member_pct}\n`); + const output = { experiment_id: args.experimentId, outcome: 'degenerate', metrics, judge: null }; + fs.mkdirSync(path.resolve(args.outputDir), { recursive: true }); + fs.writeFileSync(path.resolve(args.outputDir, `${args.experimentId}.json`), JSON.stringify(output, null, 2)); + process.stdout.write(JSON.stringify({ experiment_id: args.experimentId, outcome: 'degenerate', ...metrics }, null, 2) + '\n'); + // process.exit() terminates immediately and skips the finally block, so close the service explicitly. + service.close(); + process.exit(0); + } + + // Step 2: Load thread details for judging + process.stderr.write(`[experiment] loading thread details for judging...\n`); + const clusters = result.clustersDetail; + const allThreadIds = new Set(); + for (const c of clusters) { + for (const id of c.memberThreadIds) allThreadIds.add(id); + } + + const threadDetails = new Map(); + const threadIds = Array.from(allThreadIds); + for (let i = 0; i < threadIds.length; i += 500) { + const 
batch = threadIds.slice(i, i + 500); + const placeholders = batch.map(() => '?').join(','); + const rows = service.db.prepare( + `select t.id, t.number, t.kind, t.title, s.summary_text as dedupe_summary + from threads t + left join document_summaries s on s.thread_id = t.id and s.summary_kind = 'dedupe_summary' + where t.id in (${placeholders})` + ).all(...batch); + for (const row of rows) { + threadDetails.set(row.id, { + number: row.number, + kind: row.kind, + title: row.title, + dedupeSummary: row.dedupe_summary, + }); + } + } + + // Step 3: Sample clusters + const { sampled, singletons } = sampleClusters(clusters, args.clusterSampleSize); + const singletonSample = deterministicSample(singletons, args.singletonSampleSize, 42); + + process.stderr.write(`[experiment] sampled ${sampled.length} clusters + ${singletonSample.length} singletons for judging\n`); + + // Step 4: Judge clusters + const clusterJudgments = []; + for (const [i, cluster] of sampled.entries()) { + process.stderr.write(`[judge] cluster ${i + 1}/${sampled.length} (size=${cluster.memberThreadIds.length}, bucket=${cluster.bucket})\n`); + const judgment = await judgeCluster(client, args.judgeModel, cluster, threadDetails); + clusterJudgments.push({ + bucket: cluster.bucket, + size: cluster.memberThreadIds.length, + representativeThreadId: cluster.representativeThreadId, + judgment, + }); + } + + // Step 5: Judge singletons + const singletonJudgments = []; + for (const [i, singleton] of singletonSample.entries()) { + const threadId = singleton.memberThreadIds[0]; + const detail = threadDetails.get(threadId); + if (!detail) continue; + process.stderr.write(`[judge] singleton ${i + 1}/${singletonSample.length} #${detail.number}\n`); + const judgment = await judgeSingleton(client, args.judgeModel, detail); + singletonJudgments.push({ + threadId, + number: detail.number, + title: detail.title, + judgment, + }); + } + + // Step 6: Aggregate + const scoredClusters = clusterJudgments.filter(j => 
j.judgment?.score != null); + const meanScore = scoredClusters.length > 0 + ? scoredClusters.reduce((s, j) => s + j.judgment.score, 0) / scoredClusters.length + : 0; + const meanDistinctTopics = scoredClusters.length > 0 + ? scoredClusters.reduce((s, j) => s + (j.judgment.distinct_topics ?? 0), 0) / scoredClusters.length + : 0; + const totalOutliers = scoredClusters.reduce((s, j) => s + (j.judgment.outlier_count ?? 0), 0); + const totalMembers = scoredClusters.reduce((s, j) => s + j.size, 0); + const outlierRate = totalMembers > 0 ? totalOutliers / totalMembers : 0; + + const scoredSingletons = singletonJudgments.filter(j => j.judgment?.score != null); + const singletonScore = scoredSingletons.length > 0 + ? scoredSingletons.reduce((s, j) => s + j.judgment.score, 0) / scoredSingletons.length + : 0; + + // Per-bucket breakdown + const bucketScores = {}; + for (const bucket of ['top_by_size', 'mid_range', 'small_clusters']) { + const bucketItems = scoredClusters.filter(j => j.bucket === bucket); + bucketScores[bucket] = bucketItems.length > 0 + ? Math.round(bucketItems.reduce((s, j) => s + j.judgment.score, 0) / bucketItems.length * 100) / 100 + : null; + } + + const judgeResults = { + mean_score: Math.round(meanScore * 100) / 100, + mean_distinct_topics: Math.round(meanDistinctTopics * 100) / 100, + outlier_rate: Math.round(outlierRate * 10000) / 100, + singleton_score: Math.round(singletonScore * 100) / 100, + bucket_scores: bucketScores, + clusters_judged: scoredClusters.length, + singletons_judged: scoredSingletons.length, + }; + + // Save full results + const output = { + experiment_id: args.experimentId, + outcome: 'measured', + timestamp: new Date().toISOString(), + params: { + source_kinds: args.sourceKinds ?? 'all', + aggregation: args.aggregation ?? 'max', + threshold: args.threshold ?? 0.82, + k: args.k ?? 
6, + max_cluster_size: args.maxClusterSize, + cluster_mode: args.clusterMode, + }, + metrics, + judge: judgeResults, + cluster_judgments: clusterJudgments, + singleton_judgments: singletonJudgments, + }; + + fs.mkdirSync(path.resolve(args.outputDir), { recursive: true }); + const outputPath = path.resolve(args.outputDir, `${args.experimentId}.json`); + fs.writeFileSync(outputPath, JSON.stringify(output, null, 2)); + process.stderr.write(`\n[experiment] results saved to ${outputPath}\n`); + + // Print summary to stdout + process.stdout.write(JSON.stringify({ + experiment_id: args.experimentId, + outcome: 'measured', + ...metrics, + ...judgeResults, + }, null, 2) + '\n'); +} finally { + service.close(); +} diff --git a/scripts/cluster-optimize-measure.mjs b/scripts/cluster-optimize-measure.mjs new file mode 100644 index 0000000..1568623 --- /dev/null +++ b/scripts/cluster-optimize-measure.mjs @@ -0,0 +1,151 @@ +/** + * Measurement harness for cluster optimization experiments. + * + * Runs clusterExperiment with configurable parameters and outputs JSON metrics. + * Does NOT modify the shared DB — clusterExperiment is read-only on the main DB. + * + * Usage: + * node scripts/cluster-optimize-measure.mjs [owner/repo] \ + * --k 6 --threshold 0.82 --candidate-k 96 --ef-search 200 --backend vectorlite + * + * Output: JSON object with all metrics to stdout (progress to stderr). 
+ */ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); + +const { GHCrawlService } = await import(serviceModulePath); + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let k; + let threshold; + let candidateK; + let efSearch; + let backend = 'vectorlite'; + let maxClusterSize; + let refineStep; + let clusterMode; + let sourceKinds; + let aggregation; + let aggregationWeights; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--k') { k = Number(argv[++index]); continue; } + if (token === '--threshold') { threshold = Number(argv[++index]); continue; } + if (token === '--candidate-k') { candidateK = Number(argv[++index]); continue; } + if (token === '--ef-search') { efSearch = Number(argv[++index]); continue; } + if (token === '--backend') { backend = argv[++index]; continue; } + if (token === '--max-cluster-size') { maxClusterSize = Number(argv[++index]); continue; } + if (token === '--refine-step') { refineStep = Number(argv[++index]); continue; } + if (token === '--cluster-mode') { clusterMode = argv[++index]; continue; } + if (token === '--source-kinds') { sourceKinds = argv[++index].split(','); continue; } + if (token === '--aggregation') { aggregation = argv[++index]; continue; } + if (token === '--weights') { aggregationWeights = JSON.parse(argv[++index]); continue; } + if (!token.startsWith('--')) repo = token; + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) throw new Error(`Expected owner/repo, received: ${repo}`); + + return { + owner, + repo: name, + fullName: `${owner}/${name}`, + k: Number.isFinite(k) ? k : undefined, + threshold: Number.isFinite(threshold) ? threshold : undefined, + candidateK: Number.isFinite(candidateK) ? 
candidateK : undefined, + efSearch: Number.isFinite(efSearch) ? efSearch : undefined, + backend, + maxClusterSize: Number.isFinite(maxClusterSize) ? maxClusterSize : undefined, + refineStep: Number.isFinite(refineStep) ? refineStep : undefined, + clusterMode: clusterMode || undefined, + sourceKinds: sourceKinds || undefined, + aggregation: aggregation || undefined, + aggregationWeights: aggregationWeights || undefined, + }; +} + +const args = parseArgs(process.argv.slice(2)); + +const service = new GHCrawlService(); +try { + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.backend, + k: args.k, + minScore: args.threshold, + candidateK: args.candidateK, + efSearch: args.efSearch, + maxClusterSize: args.maxClusterSize, + refineStep: args.refineStep, + clusterMode: args.clusterMode, + sourceKinds: args.sourceKinds, + aggregation: args.aggregation, + aggregationWeights: args.aggregationWeights, + onProgress: (message) => process.stderr.write(`${message}\n`), + }); + + const totalThreads = result.threads; + const soloClusters = result.clusterSizes.soloClusters; + const multiMemberClusters = result.clusters - soloClusters; + const threadsInMulti = totalThreads - soloClusters; + const multiMemberPct = totalThreads > 0 ? threadsInMulti / totalThreads : 0; + + const metrics = { + // Primary metric + multi_member_pct: Math.round(multiMemberPct * 10000) / 100, + + // Gate metrics + edge_count: result.edges, + cluster_count: result.clusters, + solo_clusters: soloClusters, + multi_member_clusters: multiMemberClusters, + threads_in_multi: threadsInMulti, + total_threads: totalThreads, + max_cluster_size: result.clusterSizes.maxClusterSize, + + // Diagnostics + solo_pct: Math.round((soloClusters / Math.max(result.clusters, 1)) * 10000) / 100, + avg_multi_size: multiMemberClusters > 0 + ? 
Math.round((threadsInMulti / multiMemberClusters) * 100) / 100 + : 0, + + // Timing + duration_ms: result.durationMs, + total_duration_ms: result.totalDurationMs, + load_ms: result.loadMs, + setup_ms: result.setupMs, + index_build_ms: result.indexBuildMs, + query_ms: result.queryMs, + cluster_build_ms: result.clusterBuildMs, + + // Params used + params: { + backend: result.backend, + k: args.k ?? 6, + min_score: args.threshold ?? 0.82, + candidate_k: result.candidateK, + ef_search: args.efSearch ?? null, + max_cluster_size: args.maxClusterSize ?? null, + refine_step: args.refineStep ?? null, + cluster_mode: args.clusterMode ?? null, + source_kinds: args.sourceKinds ?? null, + aggregation: args.aggregation ?? 'max', + aggregation_weights: args.aggregationWeights ?? null, + }, + + // Size distribution (top 20) + top_cluster_sizes: result.clusterSizes.topClusterSizes.slice(0, 20), + histogram: result.clusterSizes.histogram, + }; + + process.stdout.write(JSON.stringify(metrics, null, 2) + '\n'); +} finally { + service.close(); +} diff --git a/scripts/cluster-perf-large-compare.mjs b/scripts/cluster-perf-large-compare.mjs new file mode 100644 index 0000000..20fdf39 --- /dev/null +++ b/scripts/cluster-perf-large-compare.mjs @@ -0,0 +1,115 @@ +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { execFileSync } from 'node:child_process'; + +// Resolve via fileURLToPath (as the sibling scripts do): URL.pathname stays percent-encoded and is not a valid path on Windows. +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const apiCoreRoot = path.join(repoRoot, 'packages', 'api-core'); +const perfConfigPath = path.join(apiCoreRoot, 'src', 'cluster', 'perf-large.json'); +const perfEntryPath = path.join(apiCoreRoot, 'dist', 'cluster', 'perf.integration.js'); + +function formatDurationMs(durationMs) { + if (!Number.isFinite(durationMs)) return 'n/a'; + if (durationMs < 1000) return `${durationMs.toFixed(1)} ms`; + const totalSeconds = durationMs / 1000; + if (totalSeconds < 60) return `${totalSeconds.toFixed(2)} s`; + const minutes = 
Math.floor(totalSeconds / 60); + const seconds = totalSeconds - minutes * 60; + return `${minutes}m ${seconds.toFixed(1)}s`; +} + +function formatPercent(value) { + const sign = value > 0 ? '+' : ''; + return `${sign}${value.toFixed(1)}%`; +} + +function formatBytes(bytes) { + if (!Number.isFinite(bytes)) return 'n/a'; + if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KiB`; + return `${(bytes / (1024 * 1024)).toFixed(1)} MiB`; +} + +function runPerf({ backend, outputPath }) { + const env = { + ...process.env, + GHCRAWL_CLUSTER_PERF_BOOTSTRAP: '1', + GHCRAWL_CLUSTER_PERF_IGNORE_THRESHOLD: '1', + GHCRAWL_CLUSTER_PERF_CONFIG_PATH: perfConfigPath, + GHCRAWL_CLUSTER_PERF_OUTPUT_PATH: outputPath, + }; + + if (backend === 'vectorlite') { + env.GHCRAWL_CLUSTER_PERF_BACKEND = 'vectorlite'; + } else { + delete env.GHCRAWL_CLUSTER_PERF_BACKEND; + } + + execFileSync(process.execPath, [perfEntryPath], { + cwd: apiCoreRoot, + env, + stdio: 'inherit', + }); + + return JSON.parse(fs.readFileSync(outputPath, 'utf8')); +} + +function main() { + const tempRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-cluster-perf-large-')); + try { + const exactOutputPath = path.join(tempRoot, 'exact.json'); + const vectorliteOutputPaths = [1, 2, 3].map((attempt) => path.join(tempRoot, `vectorlite-${attempt}.json`)); + + const exact = runPerf({ backend: 'exact', outputPath: exactOutputPath }); + let vectorlite = null; + for (const outputPath of vectorliteOutputPaths) { + vectorlite = runPerf({ backend: 'vectorlite', outputPath }); + } + + if (!vectorlite) { + throw new Error('Vectorlite perf result was not produced.'); + } + + const exactMedianMs = exact.result.medianMs; + const vectorliteMedianMs = vectorlite.result.medianMs; + const deltaMs = vectorliteMedianMs - exactMedianMs; + const deltaPercent = exactMedianMs > 0 ? (deltaMs / exactMedianMs) * 100 : 0; + const speedup = vectorliteMedianMs > 0 ? 
exactMedianMs / vectorliteMedianMs : 0; + + const lines = [ + '## Large Cluster Perf Comparison', + '', + `- Fixture config: ${path.relative(repoRoot, perfConfigPath)}`, + `- Exact median (cluster-only): ${formatDurationMs(exactMedianMs)}`, + `- Exact median (total run): ${formatDurationMs(exact.result.totalMedianMs)}`, + `- Exact edge-build median: ${formatDurationMs(exact.result.edgeBuildMedianMs)}`, + `- Exact cluster-assembly median: ${formatDurationMs(exact.result.clusterBuildMedianMs)}`, + `- Exact median peak RSS: ${formatBytes(exact.result.medianPeakRssBytes)}`, + `- Exact median peak heap used: ${formatBytes(exact.result.medianPeakHeapUsedBytes)}`, + `- Vectorlite median (cluster-only, run 3/3): ${formatDurationMs(vectorliteMedianMs)}`, + `- Vectorlite median (total run, run 3/3): ${formatDurationMs(vectorlite.result.totalMedianMs)}`, + `- Vectorlite setup median: ${formatDurationMs(vectorlite.result.setupMedianMs)}`, + `- Vectorlite index-build median: ${formatDurationMs(vectorlite.result.indexBuildMedianMs)}`, + `- Vectorlite query median: ${formatDurationMs(vectorlite.result.queryMedianMs)}`, + `- Vectorlite cluster-assembly median: ${formatDurationMs(vectorlite.result.clusterBuildMedianMs)}`, + `- Vectorlite median peak RSS: ${formatBytes(vectorlite.result.medianPeakRssBytes)}`, + `- Vectorlite median peak heap used: ${formatBytes(vectorlite.result.medianPeakHeapUsedBytes)}`, + `- Vectorlite delta vs exact: ${formatDurationMs(deltaMs)} (${formatPercent(deltaPercent)})`, + `- Speedup: ${speedup.toFixed(2)}x`, + '', + '### Exact Summary', + '', + exact.summary.trim(), + '', + '### Vectorlite Summary (run 3/3)', + '', + vectorlite.summary.trim(), + '', + ]; + + process.stdout.write(`${lines.join('\n')}\n`); + } finally { + fs.rmSync(tempRoot, { recursive: true, force: true }); + } +} + +main(); diff --git a/scripts/cluster-perf-real-compare.mjs b/scripts/cluster-perf-real-compare.mjs new file mode 100644 index 0000000..d833cc3 --- /dev/null +++ 
b/scripts/cluster-perf-real-compare.mjs @@ -0,0 +1,320 @@ +import { spawn } from 'node:child_process'; +import path from 'node:path'; +import readline from 'node:readline'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); + +const { GHCrawlService } = await import(serviceModulePath); + +function formatDurationMs(durationMs) { + if (!Number.isFinite(durationMs)) return 'n/a'; + if (durationMs < 1000) return `${durationMs.toFixed(1)} ms`; + const totalSeconds = durationMs / 1000; + if (totalSeconds < 60) return `${totalSeconds.toFixed(2)} s`; + const minutes = Math.floor(totalSeconds / 60); + const seconds = totalSeconds - minutes * 60; + return `${minutes}m ${seconds.toFixed(1)}s`; +} + +function formatBytes(bytes) { + if (!Number.isFinite(bytes)) return 'n/a'; + const absoluteBytes = Math.abs(bytes); + const sign = bytes < 0 ? '-' : ''; + if (absoluteBytes < 1024 * 1024) { + return `${sign}${(absoluteBytes / 1024).toFixed(1)} KiB`; + } + return `${sign}${(absoluteBytes / (1024 * 1024)).toFixed(1)} MiB`; +} + +function formatPercent(value) { + const sign = value > 0 ? '+' : ''; + return `${sign}${value.toFixed(1)}%`; +} + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let k; + let threshold; + let candidateK; + let childBackend = null; + let backend = 'both'; + let maxOldSpaceSizeMb; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--repo') { + repo = argv[index + 1] ?? 
repo; + index += 1; + continue; + } + if (token === '--k') { + k = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--threshold') { + threshold = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--candidate-k') { + candidateK = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--child-backend') { + childBackend = argv[index + 1] ?? null; + index += 1; + continue; + } + if (token === '--backend') { + backend = argv[index + 1] ?? backend; + index += 1; + continue; + } + if (token === '--max-old-space-size') { + maxOldSpaceSizeMb = Number(argv[index + 1]); + index += 1; + continue; + } + if (!token.startsWith('--')) { + repo = token; + } + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) { + throw new Error(`Expected owner/repo, received: ${repo}`); + } + + return { + owner, + repo: name, + fullName: `${owner}/${name}`, + k: Number.isFinite(k) ? k : undefined, + threshold: Number.isFinite(threshold) ? threshold : undefined, + candidateK: Number.isFinite(candidateK) ? candidateK : undefined, + childBackend, + backend, + maxOldSpaceSizeMb: Number.isFinite(maxOldSpaceSizeMb) ? maxOldSpaceSizeMb : undefined, + }; +} + +function getRepoStats(service, fullName) { + const repoRow = service.db + .prepare('select id, full_name from repositories where full_name = ?') + .get(fullName); + if (!repoRow) { + throw new Error(`Repository not found in local DB: ${fullName}`); + } + + const openThreadCount = service.db + .prepare( + `select count(*) as count + from threads + where repo_id = ? + and state = 'open' + and closed_at_local is null`, + ) + .get(repoRow.id).count; + + const embeddingCounts = service.db + .prepare( + `select e.source_kind as sourceKind, count(*) as count + from document_embeddings e + join threads t on t.id = e.thread_id + where t.repo_id = ? + and t.state = 'open' + and t.closed_at_local is null + and e.model = ? 
+ group by e.source_kind + order by e.source_kind asc`, + ) + .all(repoRow.id, service.config.embedModel); + + return { + repoId: repoRow.id, + openThreadCount, + embeddingCounts, + }; +} + +function buildReportLines(label, result) { + return [ + `### ${label}`, + '', + `- Cluster-only duration: ${formatDurationMs(result.durationMs)}`, + `- Total duration: ${formatDurationMs(result.totalDurationMs)}`, + `- Load stage: ${formatDurationMs(result.loadMs)}`, + `- Temp DB setup: ${formatDurationMs(result.setupMs)}`, + `- Exact edge-build: ${formatDurationMs(result.edgeBuildMs)}`, + `- Vector index-build: ${formatDurationMs(result.indexBuildMs)}`, + `- Vector query: ${formatDurationMs(result.queryMs)}`, + `- Cluster assembly: ${formatDurationMs(result.clusterBuildMs)}`, + `- Edges: ${result.edges}`, + `- Clusters: ${result.clusters}`, + `- Threads: ${result.threads}`, + `- Source kinds: ${result.sourceKinds}`, + `- Candidate K: ${result.candidateK}`, + `- Peak RSS: ${formatBytes(result.memory.peakRssBytes)}`, + `- Peak heap used: ${formatBytes(result.memory.peakHeapUsedBytes)}`, + '', + ]; +} + +function buildDeltaLines(exactResult, vectorliteResult) { + const clusterDeltaMs = vectorliteResult.durationMs - exactResult.durationMs; + const clusterDeltaPercent = exactResult.durationMs > 0 ? (clusterDeltaMs / exactResult.durationMs) * 100 : 0; + const totalDeltaMs = vectorliteResult.totalDurationMs - exactResult.totalDurationMs; + const totalDeltaPercent = exactResult.totalDurationMs > 0 ? 
(totalDeltaMs / exactResult.totalDurationMs) * 100 : 0; + const peakRssDelta = vectorliteResult.memory.peakRssBytes - exactResult.memory.peakRssBytes; + const peakHeapDelta = vectorliteResult.memory.peakHeapUsedBytes - exactResult.memory.peakHeapUsedBytes; + + return [ + '### Delta', + '', + `- Cluster-only delta vs exact: ${formatDurationMs(clusterDeltaMs)} (${formatPercent(clusterDeltaPercent)})`, + `- Total duration delta vs exact: ${formatDurationMs(totalDeltaMs)} (${formatPercent(totalDeltaPercent)})`, + `- Peak RSS delta vs exact: ${formatBytes(peakRssDelta)}`, + `- Peak heap used delta vs exact: ${formatBytes(peakHeapDelta)}`, + '', + ]; +} + +async function runChild(args) { + const service = new GHCrawlService(); + try { + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.childBackend, + k: args.k, + minScore: args.threshold, + candidateK: args.candidateK, + onProgress: (message) => process.stdout.write(`${message}\n`), + }); + process.stdout.write(`__GHCRAWL_RESULT__${JSON.stringify(result)}\n`); + } finally { + service.close(); + } +} + +async function runBackend(backend, args) { + return await new Promise((resolve, reject) => { + const childArgs = ['--expose-gc']; + if (args.maxOldSpaceSizeMb !== undefined) { + childArgs.push(`--max-old-space-size=${args.maxOldSpaceSizeMb}`); + } + childArgs.push( + path.join(repoRoot, 'scripts', 'cluster-perf-real-compare.mjs'), + `${args.fullName}`, + '--child-backend', + backend, + ); + if (args.k !== undefined) { + childArgs.push('--k', String(args.k)); + } + if (args.threshold !== undefined) { + childArgs.push('--threshold', String(args.threshold)); + } + if (args.candidateK !== undefined) { + childArgs.push('--candidate-k', String(args.candidateK)); + } + + const child = spawn(process.execPath, childArgs, { + cwd: repoRoot, + env: process.env, + stdio: ['ignore', 'pipe', 'pipe'], + }); + + let result = null; + const pipeStream = (stream, label) => { + const rl = 
readline.createInterface({ input: stream }); + rl.on('line', (line) => { + if (label === 'stdout' && line.startsWith('__GHCRAWL_RESULT__')) { + result = JSON.parse(line.slice('__GHCRAWL_RESULT__'.length)); + return; + } + process.stdout.write(`[${backend}] ${line}\n`); + }); + }; + + pipeStream(child.stdout, 'stdout'); + pipeStream(child.stderr, 'stderr'); + + child.on('error', reject); + child.on('close', (code, signal) => { + if (code !== 0) { + const detail = signal ? `signal ${signal}` : `code ${code}`; + reject(new Error(`${backend} benchmark exited with ${detail}`)); + return; + } + if (!result) { + reject(new Error(`${backend} benchmark did not emit a result payload`)); + return; + } + resolve(result); + }); + }); +} + +async function runParent(args) { + const service = new GHCrawlService(); + let stats; + let dbPath; + let embedModel; + try { + stats = getRepoStats(service, args.fullName); + dbPath = service.config.dbPath; + embedModel = service.config.embedModel; + } finally { + service.close(); + } + + const lines = [ + '## Real Cluster Perf Comparison', + '', + `- Repo: ${args.fullName}`, + `- Config DB: ${dbPath}`, + `- Embed model: ${embedModel}`, + `- Open threads: ${stats.openThreadCount}`, + `- Embedding counts: ${stats.embeddingCounts.map((row) => `${row.sourceKind}=${row.count}`).join(', ') || 'none'}`, + `- Parameters: k=${args.k ?? 'default'} threshold=${args.threshold ?? 'default'} candidateK=${args.candidateK ?? 'default'}`, + `- Requested backend(s): ${args.backend}`, + `- Child max old space size: ${args.maxOldSpaceSizeMb ?? 
'default'}`, + '', + ]; + + let exactResult = null; + let vectorliteResult = null; + + if (args.backend === 'both' || args.backend === 'exact') { + process.stdout.write(`[exact] starting real-db cluster experiment for ${args.fullName}\n`); + exactResult = await runBackend('exact', args); + lines.push(...buildReportLines('Exact', exactResult)); + if (args.backend === 'both') { + process.stdout.write(`\n${lines.join('\n')}\n`); + } + } + + if (args.backend === 'both' || args.backend === 'vectorlite') { + process.stdout.write(`[vectorlite] starting real-db cluster experiment for ${args.fullName}\n`); + vectorliteResult = await runBackend('vectorlite', args); + lines.push(...buildReportLines('Vectorlite', vectorliteResult)); + if (exactResult) { + lines.push(...buildDeltaLines(exactResult, vectorliteResult)); + } + } + + process.stdout.write(`\n${lines.join('\n')}`); +} + +const args = parseArgs(process.argv.slice(2)); +if (args.childBackend === 'exact' || args.childBackend === 'vectorlite') { + await runChild(args); +} else { + await runParent(args); +} diff --git a/scripts/cluster-population-compare.mjs b/scripts/cluster-population-compare.mjs new file mode 100644 index 0000000..82435c4 --- /dev/null +++ b/scripts/cluster-population-compare.mjs @@ -0,0 +1,280 @@ +import { spawn } from 'node:child_process'; +import path from 'node:path'; +import readline from 'node:readline'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); + +const { GHCrawlService } = await import(serviceModulePath); + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let k; + let threshold; + let candidateK; + let childBackend = null; + let top = 20; + let maxSize = 20; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--repo') { + repo 
= argv[index + 1] ?? repo; + index += 1; + continue; + } + if (token === '--k') { + k = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--threshold') { + threshold = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--candidate-k') { + candidateK = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--child-backend') { + childBackend = argv[index + 1] ?? null; + index += 1; + continue; + } + if (token === '--top') { + top = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--max-size') { + maxSize = Number(argv[index + 1]); + index += 1; + continue; + } + if (!token.startsWith('--')) { + repo = token; + } + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) { + throw new Error(`Expected owner/repo, received: ${repo}`); + } + + return { + owner, + repo: name, + fullName: `${owner}/${name}`, + k: Number.isFinite(k) ? k : undefined, + threshold: Number.isFinite(threshold) ? threshold : undefined, + candidateK: Number.isFinite(candidateK) ? candidateK : undefined, + childBackend, + top: Number.isFinite(top) ? Math.max(1, top) : 20, + maxSize: Number.isFinite(maxSize) ? Math.max(1, maxSize) : 20, + }; +} + +function formatPercent(value) { + return `${(value * 100).toFixed(1)}%`; +} + +function countThreadsRepresented(histogram) { + return histogram.reduce((sum, bucket) => sum + bucket.size * bucket.count, 0); +} + +function histogramToMap(histogram) { + return new Map(histogram.map((bucket) => [bucket.size, bucket.count])); +} + +function formatDelta(value) { + return value > 0 ? `+${value}` : String(value); +} + +function repeat(character, count) { + return count > 0 ? 
character.repeat(count) : ''; +} + +function buildBar(count, maxCount, width) { + if (maxCount <= 0) return ''; + const scaled = Math.round((count / maxCount) * width); + return repeat('#', scaled); +} + +async function runChild(args) { + const service = new GHCrawlService(); + try { + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.childBackend, + k: args.k, + minScore: args.threshold, + candidateK: args.candidateK, + onProgress: (message) => process.stdout.write(`${message}\n`), + }); + process.stdout.write(`__GHCRAWL_RESULT__${JSON.stringify(result)}\n`); + } finally { + service.close(); + } +} + +async function runBackend(backend, args) { + return await new Promise((resolve, reject) => { + const childArgs = [ + '--expose-gc', + path.join(repoRoot, 'scripts', 'cluster-population-compare.mjs'), + args.fullName, + '--child-backend', + backend, + '--top', + String(args.top), + '--max-size', + String(args.maxSize), + ]; + if (args.k !== undefined) { + childArgs.push('--k', String(args.k)); + } + if (args.threshold !== undefined) { + childArgs.push('--threshold', String(args.threshold)); + } + if (args.candidateK !== undefined) { + childArgs.push('--candidate-k', String(args.candidateK)); + } + + const child = spawn(process.execPath, childArgs, { + cwd: repoRoot, + env: process.env, + stdio: ['ignore', 'pipe', 'pipe'], + }); + + let result = null; + const pipeStream = (stream, label) => { + const rl = readline.createInterface({ input: stream }); + rl.on('line', (line) => { + if (label === 'stdout' && line.startsWith('__GHCRAWL_RESULT__')) { + result = JSON.parse(line.slice('__GHCRAWL_RESULT__'.length)); + return; + } + process.stdout.write(`[${backend}] ${line}\n`); + }); + }; + + pipeStream(child.stdout, 'stdout'); + pipeStream(child.stderr, 'stderr'); + + child.on('error', reject); + child.on('close', (code, signal) => { + if (code !== 0) { + const detail = signal ? 
`signal ${signal}` : `code ${code}`; + reject(new Error(`${backend} comparison exited with ${detail}`)); + return; + } + if (!result) { + reject(new Error(`${backend} comparison did not emit a result payload`)); + return; + } + resolve(result); + }); + }); +} + +function buildSummaryLines(args, exactResult, vectorliteResult) { + const exactHistogram = exactResult.clusterSizes.histogram; + const vectorHistogram = vectorliteResult.clusterSizes.histogram; + const exactThreadsRepresented = countThreadsRepresented(exactHistogram); + const vectorThreadsRepresented = countThreadsRepresented(vectorHistogram); + + return [ + '## Cluster Population Comparison', + '', + `- Repo: ${exactResult.repository.fullName}`, + `- Parameters: k=${args.k ?? 'default'} threshold=${args.threshold ?? 'default'} candidateK=${args.candidateK ?? 'default'}`, + `- Exact clusters: ${exactResult.clusters}`, + `- Vectorlite clusters: ${vectorliteResult.clusters}`, + `- Exact solo clusters: ${exactResult.clusterSizes.soloClusters} (${formatPercent(exactResult.clusterSizes.soloClusters / Math.max(exactResult.clusters, 1))})`, + `- Vectorlite solo clusters: ${vectorliteResult.clusterSizes.soloClusters} (${formatPercent(vectorliteResult.clusterSizes.soloClusters / Math.max(vectorliteResult.clusters, 1))})`, + `- Exact max cluster size: ${exactResult.clusterSizes.maxClusterSize}`, + `- Vectorlite max cluster size: ${vectorliteResult.clusterSizes.maxClusterSize}`, + `- Exact threads represented: ${exactThreadsRepresented}`, + `- Vectorlite threads represented: ${vectorThreadsRepresented}`, + '', + ]; +} + +function buildTopSizesLines(exactResult, vectorliteResult, topCount) { + const exactTop = exactResult.clusterSizes.topClusterSizes.slice(0, topCount); + const vectorTop = vectorliteResult.clusterSizes.topClusterSizes.slice(0, topCount); + const lines = ['## Largest Cluster Sizes', '', 'rank exact vectorlite delta', '---- ----- ---------- -----']; + + for (let index = 0; index < topCount; index += 1) { 
+ const exactSize = exactTop[index] ?? 0; + const vectorSize = vectorTop[index] ?? 0; + lines.push( + `${String(index + 1).padStart(4)} ${String(exactSize).padStart(5)} ${String(vectorSize).padStart(10)} ${formatDelta(vectorSize - exactSize).padStart(5)}`, + ); + } + + lines.push(''); + return lines; +} + +function buildHistogramLines(exactResult, vectorliteResult, maxSize) { + const exactMap = histogramToMap(exactResult.clusterSizes.histogram); + const vectorMap = histogramToMap(vectorliteResult.clusterSizes.histogram); + const exactOverflow = exactResult.clusterSizes.histogram + .filter((bucket) => bucket.size > maxSize) + .reduce((sum, bucket) => sum + bucket.count, 0); + const vectorOverflow = vectorliteResult.clusterSizes.histogram + .filter((bucket) => bucket.size > maxSize) + .reduce((sum, bucket) => sum + bucket.count, 0); + + let maxCount = 0; + for (let size = 1; size <= maxSize; size += 1) { + maxCount = Math.max(maxCount, exactMap.get(size) ?? 0, vectorMap.get(size) ?? 0); + } + maxCount = Math.max(maxCount, exactOverflow, vectorOverflow); + + const lines = ['## Histogram By Cluster Size', '', 'size exact vectorlite delta bars', '---- ----- ---------- ----- ----']; + for (let size = 1; size <= maxSize; size += 1) { + const exactCount = exactMap.get(size) ?? 0; + const vectorCount = vectorMap.get(size) ?? 
0; + const exactBar = buildBar(exactCount, maxCount, 12); + const vectorBar = buildBar(vectorCount, maxCount, 12); + lines.push( + `${String(size).padStart(4)} ${String(exactCount).padStart(5)} ${String(vectorCount).padStart(10)} ${formatDelta(vectorCount - exactCount).padStart(5)} E:${exactBar.padEnd(12)} V:${vectorBar.padEnd(12)}`, + ); + } + + lines.push( + `${`${maxSize}+`.padStart(4)} ${String(exactOverflow).padStart(5)} ${String(vectorOverflow).padStart(10)} ${formatDelta(vectorOverflow - exactOverflow).padStart(5)} E:${buildBar(exactOverflow, maxCount, 12).padEnd(12)} V:${buildBar(vectorOverflow, maxCount, 12).padEnd(12)}`, + ); + lines.push(''); + return lines; +} + +async function runParent(args) { + process.stdout.write(`[exact] starting cluster population comparison for ${args.fullName}\n`); + const exactResult = await runBackend('exact', args); + + process.stdout.write(`[vectorlite] starting cluster population comparison for ${args.fullName}\n`); + const vectorliteResult = await runBackend('vectorlite', args); + + const lines = [ + ...buildSummaryLines(args, exactResult, vectorliteResult), + ...buildTopSizesLines(exactResult, vectorliteResult, args.top), + ...buildHistogramLines(exactResult, vectorliteResult, args.maxSize), + ]; + + process.stdout.write(`\n${lines.join('\n')}`); +} + +const args = parseArgs(process.argv.slice(2)); +if (args.childBackend === 'exact' || args.childBackend === 'vectorlite') { + await runChild(args); +} else { + await runParent(args); +} diff --git a/scripts/cluster-refine-component.mjs b/scripts/cluster-refine-component.mjs new file mode 100644 index 0000000..eb1737a --- /dev/null +++ b/scripts/cluster-refine-component.mjs @@ -0,0 +1,247 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); +const buildModulePath = 
path.join(repoRoot, 'packages', 'api-core', 'dist', 'cluster', 'build.js'); +const exactEdgesModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'cluster', 'exact-edges.js'); + +const { GHCrawlService } = await import(serviceModulePath); +const { buildClusters } = await import(buildModulePath); +const { buildSourceKindEdges } = await import(exactEdgesModulePath); + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let clusterRank = 1; + let backend = 'vectorlite'; + let k; + let threshold; + let candidateK; + let efSearch; + let topSubclusters = 10; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--repo') { + repo = argv[index + 1] ?? repo; + index += 1; + continue; + } + if (token === '--cluster-rank') { + clusterRank = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--backend') { + backend = argv[index + 1] ?? backend; + index += 1; + continue; + } + if (token === '--k') { + k = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--threshold') { + threshold = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--candidate-k') { + candidateK = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--ef-search') { + efSearch = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--top-subclusters') { + topSubclusters = Number(argv[index + 1]); + index += 1; + continue; + } + if (!token.startsWith('--')) { + repo = token; + } + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) { + throw new Error(`Expected owner/repo, received: ${repo}`); + } + + return { + owner, + repo: name, + fullName: `${owner}/${name}`, + clusterRank: Number.isFinite(clusterRank) ? Math.max(1, clusterRank) : 1, + backend: backend === 'exact' ? 'exact' : 'vectorlite', + k: Number.isFinite(k) ? k : undefined, + threshold: Number.isFinite(threshold) ? 
threshold : undefined, + candidateK: Number.isFinite(candidateK) ? candidateK : undefined, + efSearch: Number.isFinite(efSearch) ? efSearch : undefined, + topSubclusters: Number.isFinite(topSubclusters) ? Math.max(1, topSubclusters) : 10, + }; +} + +function edgeKey(leftThreadId, rightThreadId) { + const [left, right] = leftThreadId < rightThreadId ? [leftThreadId, rightThreadId] : [rightThreadId, leftThreadId]; + return `${left}:${right}`; +} + +function mergeSourceKindEdges(aggregated, edges) { + for (const edge of edges) { + const key = edgeKey(edge.leftThreadId, edge.rightThreadId); + const existing = aggregated.get(key); + if (existing) { + existing.score = Math.max(existing.score, edge.score); + continue; + } + aggregated.set(key, { + leftThreadId: edge.leftThreadId, + rightThreadId: edge.rightThreadId, + score: edge.score, + }); + } +} + +function loadThreadMeta(service, ids) { + const placeholders = ids.map(() => '?').join(', '); + const rows = service.db + .prepare( + `select id, number, kind, title + from threads + where id in (${placeholders})`, + ) + .all(...ids); + return new Map(rows.map((row) => [row.id, row])); +} + +function normalizeEmbedding(values) { + let normSquared = 0; + for (const value of values) { + normSquared += value * value; + } + const norm = Math.sqrt(normSquared); + if (norm === 0) { + return values.map(() => 0); + } + return values.map((value) => value / norm); +} + +function normalizeRows(rows) { + return rows.map((row) => ({ + id: row.id, + normalizedEmbedding: normalizeEmbedding(JSON.parse(row.embedding_json)), + })); +} + +function describeThread(threadId, metaById) { + const meta = metaById.get(threadId); + if (!meta) { + return `thread:${threadId}`; + } + const kind = meta.kind === 'pull_request' ? 
'PR' : 'Issue'; + return `${kind} #${meta.number} ${meta.title}`; +} + +const args = parseArgs(process.argv.slice(2)); +const service = new GHCrawlService(); + +try { + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.backend, + k: args.k, + minScore: args.threshold, + candidateK: args.candidateK, + efSearch: args.efSearch, + includeClusters: true, + onProgress: (message) => process.stdout.write(`${message}\n`), + }); + + const rankedClusters = [...(result.clustersDetail ?? [])].sort( + (left, right) => right.memberThreadIds.length - left.memberThreadIds.length || left.representativeThreadId - right.representativeThreadId, + ); + const selectedCluster = rankedClusters[args.clusterRank - 1]; + if (!selectedCluster) { + throw new Error(`Cluster rank ${args.clusterRank} not found`); + } + + const repository = service.requireRepository(args.owner, args.repo); + const ids = [...selectedCluster.memberThreadIds]; + const metaById = loadThreadMeta(service, ids); + const sourceKinds = service.db + .prepare( + `select distinct e.source_kind as sourceKind + from document_embeddings e + join threads t on t.id = e.thread_id + where t.repo_id = ? + and t.id in (${ids.map(() => '?').join(', ')}) + and e.model = ? + order by e.source_kind asc`, + ) + .all(repository.id, ...ids, service.config.embedModel) + .map((row) => row.sourceKind); + + const aggregated = new Map(); + for (const sourceKind of sourceKinds) { + const rows = service.db + .prepare( + `select t.id, e.embedding_json + from document_embeddings e + join threads t on t.id = e.thread_id + where t.repo_id = ? + and t.id in (${ids.map(() => '?').join(', ')}) + and e.model = ? + and e.source_kind = ?`, + ) + .all(repository.id, ...ids, service.config.embedModel, sourceKind); + const normalizedRows = normalizeRows(rows); + const edges = buildSourceKindEdges(normalizedRows, { + limit: args.k ?? 6, + minScore: args.threshold ?? 
0.82, + }); + mergeSourceKindEdges(aggregated, edges); + } + + const refinedClusters = buildClusters( + ids.map((threadId) => { + const meta = metaById.get(threadId); + return { + threadId, + number: meta?.number ?? threadId, + title: meta?.title ?? '', + }; + }), + Array.from(aggregated.values()), + ); + + const lines = [ + '## Refined Cluster', + '', + `- Repo: ${args.fullName}`, + `- Source backend cluster: ${args.backend}`, + `- Source cluster rank: ${args.clusterRank}`, + `- Source cluster size: ${selectedCluster.memberThreadIds.length}`, + `- Representative: ${describeThread(selectedCluster.representativeThreadId, metaById)}`, + `- Exact refined subclusters: ${refinedClusters.length}`, + '', + '### Refined Sizes', + '', + ]; + + for (const [index, cluster] of refinedClusters.slice(0, args.topSubclusters).entries()) { + lines.push( + `- #${index + 1} size=${cluster.members.length} representative=${describeThread(cluster.representativeThreadId, metaById)}`, + ); + } + + process.stdout.write(`\n${lines.join('\n')}\n`); +} finally { + service.close(); +} diff --git a/scripts/cluster-topology-compare.mjs b/scripts/cluster-topology-compare.mjs new file mode 100644 index 0000000..c5080b6 --- /dev/null +++ b/scripts/cluster-topology-compare.mjs @@ -0,0 +1,391 @@ +import { spawn } from 'node:child_process'; +import path from 'node:path'; +import readline from 'node:readline'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); + +const { GHCrawlService } = await import(serviceModulePath); + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let k; + let threshold; + let candidateK; + let childBackend = null; + let top = 5; + let sampleMembers = 12; + let efSearch; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token 
=== '--repo') { + repo = argv[index + 1] ?? repo; + index += 1; + continue; + } + if (token === '--k') { + k = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--threshold') { + threshold = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--candidate-k') { + candidateK = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--ef-search') { + efSearch = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--child-backend') { + childBackend = argv[index + 1] ?? null; + index += 1; + continue; + } + if (token === '--top') { + top = Number(argv[index + 1]); + index += 1; + continue; + } + if (token === '--sample-members') { + sampleMembers = Number(argv[index + 1]); + index += 1; + continue; + } + if (!token.startsWith('--')) { + repo = token; + } + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) { + throw new Error(`Expected owner/repo, received: ${repo}`); + } + + return { + owner, + repo: name, + fullName: `${owner}/${name}`, + k: Number.isFinite(k) ? k : undefined, + threshold: Number.isFinite(threshold) ? threshold : undefined, + candidateK: Number.isFinite(candidateK) ? candidateK : undefined, + efSearch: Number.isFinite(efSearch) ? efSearch : undefined, + childBackend, + top: Number.isFinite(top) ? Math.max(1, top) : 5, + sampleMembers: Number.isFinite(sampleMembers) ? 
Math.max(1, sampleMembers) : 12, + }; +} + +async function runChild(args) { + const service = new GHCrawlService(); + try { + const result = service.clusterExperiment({ + owner: args.owner, + repo: args.repo, + backend: args.childBackend, + k: args.k, + minScore: args.threshold, + candidateK: args.candidateK, + efSearch: args.efSearch, + includeClusters: true, + onProgress: (message) => process.stdout.write(`${message}\n`), + }); + process.stdout.write(`__GHCRAWL_RESULT__${JSON.stringify(result)}\n`); + } finally { + service.close(); + } +} + +async function runBackend(backend, args) { + return await new Promise((resolve, reject) => { + const childArgs = [ + '--expose-gc', + path.join(repoRoot, 'scripts', 'cluster-topology-compare.mjs'), + args.fullName, + '--child-backend', + backend, + '--top', + String(args.top), + '--sample-members', + String(args.sampleMembers), + ]; + if (args.k !== undefined) { + childArgs.push('--k', String(args.k)); + } + if (args.threshold !== undefined) { + childArgs.push('--threshold', String(args.threshold)); + } + if (args.candidateK !== undefined) { + childArgs.push('--candidate-k', String(args.candidateK)); + } + if (args.efSearch !== undefined) { + childArgs.push('--ef-search', String(args.efSearch)); + } + + const child = spawn(process.execPath, childArgs, { + cwd: repoRoot, + env: process.env, + stdio: ['ignore', 'pipe', 'pipe'], + }); + + let result = null; + const pipeStream = (stream, label) => { + const rl = readline.createInterface({ input: stream }); + rl.on('line', (line) => { + if (label === 'stdout' && line.startsWith('__GHCRAWL_RESULT__')) { + result = JSON.parse(line.slice('__GHCRAWL_RESULT__'.length)); + return; + } + process.stdout.write(`[${backend}] ${line}\n`); + }); + }; + + pipeStream(child.stdout, 'stdout'); + pipeStream(child.stderr, 'stderr'); + + child.on('error', reject); + child.on('close', (code, signal) => { + if (code !== 0) { + const detail = signal ? 
`signal ${signal}` : `code ${code}`; + reject(new Error(`${backend} topology comparison exited with ${detail}`)); + return; + } + if (!result) { + reject(new Error(`${backend} topology comparison did not emit a result payload`)); + return; + } + resolve(result); + }); + }); +} + +function sortClusters(clusters) { + return [...clusters].sort((left, right) => { + const sizeDelta = right.memberThreadIds.length - left.memberThreadIds.length; + if (sizeDelta !== 0) return sizeDelta; + return left.representativeThreadId - right.representativeThreadId; + }); +} + +function buildClusterIndex(clusters) { + return clusters.map((cluster, index) => ({ + rank: index + 1, + representativeThreadId: cluster.representativeThreadId, + memberThreadIds: cluster.memberThreadIds, + memberSet: new Set(cluster.memberThreadIds), + size: cluster.memberThreadIds.length, + })); +} + +function findContributors(targetCluster, sourceClusters) { + const contributors = []; + for (const sourceCluster of sourceClusters) { + let overlap = 0; + for (const threadId of targetCluster.memberThreadIds) { + if (sourceCluster.memberSet.has(threadId)) { + overlap += 1; + } + } + if (overlap > 0) { + contributors.push({ + rank: sourceCluster.rank, + representativeThreadId: sourceCluster.representativeThreadId, + size: sourceCluster.size, + overlap, + }); + } + } + contributors.sort((left, right) => right.overlap - left.overlap || left.rank - right.rank); + return contributors; +} + +function collectSampleIds(vectorTop, exactTop, matches, sampleMembers) { + const ids = new Set(); + for (const cluster of [...vectorTop, ...exactTop]) { + ids.add(cluster.representativeThreadId); + } + for (const match of matches) { + if (!match.bestContributor) { + continue; + } + ids.add(match.bestContributor.representativeThreadId); + for (const contributor of match.contributors.slice(0, 5)) { + ids.add(contributor.representativeThreadId); + } + for (const threadId of match.vectorOnly.slice(0, sampleMembers)) { + 
ids.add(threadId); + } + for (const threadId of match.bestExactOnly.slice(0, sampleMembers)) { + ids.add(threadId); + } + } + return [...ids]; +} + +function fetchThreadMeta(ids) { + if (ids.length === 0) { + return new Map(); + } + + const service = new GHCrawlService(); + try { + const placeholders = ids.map(() => '?').join(', '); + const rows = service.db + .prepare( + `select id, number, kind, title + from threads + where id in (${placeholders})`, + ) + .all(...ids); + return new Map(rows.map((row) => [row.id, row])); + } finally { + service.close(); + } +} + +function describeThread(threadId, metaById) { + const meta = metaById.get(threadId); + if (!meta) { + return `thread:${threadId}`; + } + const kind = meta.kind === 'pull_request' ? 'PR' : 'Issue'; + return `${kind} #${meta.number} ${meta.title}`; +} + +function formatContributor(contributor, targetSize) { + const coverage = ((contributor.overlap / Math.max(targetSize, 1)) * 100).toFixed(1); + return `exact #${contributor.rank} size=${contributor.size} overlap=${contributor.overlap} (${coverage}% of vector cluster)`; +} + +function buildSummaryTable(exactTop, vectorTop, matches) { + const lines = ['## Top Cluster Size Comparison', '', 'rank exact vectorlite best exact overlap', '---- ----- ---------- ------------------']; + for (let index = 0; index < Math.max(exactTop.length, vectorTop.length); index += 1) { + const exactSize = exactTop[index]?.size ?? 0; + const vectorSize = vectorTop[index]?.size ?? 0; + const overlap = matches[index]?.bestContributor?.overlap ?? 
0; + lines.push( + `${String(index + 1).padStart(4)} ${String(exactSize).padStart(5)} ${String(vectorSize).padStart(10)} ${String(overlap).padStart(18)}`, + ); + } + lines.push(''); + return lines; +} + +function buildDetailLines(vectorTop, matches, metaById, sampleMembers) { + const lines = ['## Largest Vectorlite Clusters Vs Exact', '']; + + for (let index = 0; index < vectorTop.length; index += 1) { + const vectorCluster = vectorTop[index]; + const match = matches[index]; + const representative = describeThread(vectorCluster.representativeThreadId, metaById); + + lines.push( + `### Vectorlite #${index + 1} size=${vectorCluster.size} representative=${representative}`, + ); + + if (!match.bestContributor) { + lines.push('- No overlapping exact cluster found.'); + lines.push(''); + continue; + } + + const contributorSummary = match.contributors + .slice(0, 5) + .map((contributor) => formatContributor(contributor, vectorCluster.size)) + .join('; '); + lines.push(`- Top exact contributors: ${contributorSummary}`); + + const bestRepresentative = describeThread(match.bestContributor.representativeThreadId, metaById); + lines.push(`- Best exact representative: ${bestRepresentative}`); + lines.push( + `- Members only in vectorlite vs best exact: ${match.vectorOnly.length}; members only in best exact vs vectorlite: ${match.bestExactOnly.length}`, + ); + + if (match.vectorOnly.length > 0) { + lines.push( + `- Sample vectorlite-only members: ${match.vectorOnly + .slice(0, sampleMembers) + .map((threadId) => describeThread(threadId, metaById)) + .join(' | ')}`, + ); + } + + if (match.bestExactOnly.length > 0) { + lines.push( + `- Sample exact-only members: ${match.bestExactOnly + .slice(0, sampleMembers) + .map((threadId) => describeThread(threadId, metaById)) + .join(' | ')}`, + ); + } + + lines.push(''); + } + + return lines; +} + +async function runParent(args) { + process.stdout.write(`[exact] starting topology comparison for ${args.fullName}\n`); + const exactResult = 
await runBackend('exact', args); + + process.stdout.write(`[vectorlite] starting topology comparison for ${args.fullName}\n`); + const vectorliteResult = await runBackend('vectorlite', args); + + const exactClusters = buildClusterIndex(sortClusters(exactResult.clustersDetail ?? [])); + const vectorClusters = buildClusterIndex(sortClusters(vectorliteResult.clustersDetail ?? [])); + const exactTop = exactClusters.slice(0, args.top); + const vectorTop = vectorClusters.slice(0, args.top); + + const matches = vectorTop.map((vectorCluster) => { + const contributors = findContributors(vectorCluster, exactClusters); + const bestContributor = contributors[0] ?? null; + const bestExactSet = bestContributor + ? exactClusters[bestContributor.rank - 1].memberSet + : new Set(); + const vectorOnly = vectorCluster.memberThreadIds.filter((threadId) => !bestExactSet.has(threadId)); + const bestExactOnly = bestContributor + ? exactClusters[bestContributor.rank - 1].memberThreadIds.filter((threadId) => !vectorCluster.memberSet.has(threadId)) + : []; + + return { + contributors, + bestContributor, + vectorOnly, + bestExactOnly, + }; + }); + + const metaById = fetchThreadMeta(collectSampleIds(vectorTop, exactTop, matches, args.sampleMembers)); + + const lines = [ + '## Cluster Topology Comparison', + '', + `- Repo: ${args.fullName}`, + `- Parameters: k=${args.k ?? 'default'} threshold=${args.threshold ?? 'default'} candidateK=${args.candidateK ?? 'default'}`, + `- Vectorlite efSearch: ${args.efSearch ?? 
'default(10)'}`, + `- Exact clusters: ${exactResult.clusters}`, + `- Vectorlite clusters: ${vectorliteResult.clusters}`, + '', + ...buildSummaryTable(exactTop, vectorTop, matches), + ...buildDetailLines(vectorTop, matches, metaById, args.sampleMembers), + ]; + + process.stdout.write(`\n${lines.join('\n')}`); +} + +const args = parseArgs(process.argv.slice(2)); +if (args.childBackend === 'exact' || args.childBackend === 'vectorlite') { + await runChild(args); +} else { + await runParent(args); +} diff --git a/scripts/op-run.mjs b/scripts/op-run.mjs index d5ae137..6430200 100644 --- a/scripts/op-run.mjs +++ b/scripts/op-run.mjs @@ -115,6 +115,15 @@ function main(argv = process.argv.slice(2)) { return; } + if (mode === 'run') { + const args = rest[0] === '--' ? rest.slice(1) : rest; + if (args.length === 0) { + throw new Error('Missing command. Example: node scripts/op-run.mjs run -- node scripts/my-script.mjs'); + } + runWithEnv(args[0], args.slice(1)); + return; + } + throw new Error(`Unknown mode: ${mode}`); } diff --git a/scripts/run-all-prompt-experiments.mjs b/scripts/run-all-prompt-experiments.mjs new file mode 100644 index 0000000..a532f5a --- /dev/null +++ b/scripts/run-all-prompt-experiments.mjs @@ -0,0 +1,69 @@ +#!/usr/bin/env node +/** + * Run all prompt experiments sequentially. 
+ * Usage: node scripts/op-run.mjs run -- node scripts/run-all-prompt-experiments.mjs + */ +import fs from 'node:fs'; +import path from 'node:path'; +import { execFileSync } from 'node:child_process'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const promptDir = path.join(repoRoot, '.context', 'compound-engineering', 'ce-optimize', 'summary-prompt', 'prompts'); +const resultsDir = path.join(repoRoot, '.context', 'compound-engineering', 'ce-optimize', 'summary-prompt', 'results'); + +const promptFiles = fs.readdirSync(promptDir) + .filter(f => f.endsWith('.txt')) + .sort(); + +// Check which experiments already have results +const existing = new Set( + fs.existsSync(resultsDir) + ? fs.readdirSync(resultsDir).filter(f => f.endsWith('.json')).map(f => f.replace('.json', '')) + : [] +); + +const summaryTable = []; + +for (const file of promptFiles) { + const experimentId = file.replace('.txt', ''); + + if (existing.has(experimentId)) { + // Load existing result + const result = JSON.parse(fs.readFileSync(path.join(resultsDir, `${experimentId}.json`), 'utf8')); + const scored = result.results.filter(r => r.judge?.score != null); + if (scored.length >= 30) { + process.stderr.write(`[SKIP] ${experimentId} — already has ${scored.length} scored results\n`); + summaryTable.push({ experiment_id: experimentId, ...result.aggregate, status: 'cached' }); + continue; + } + process.stderr.write(`[RERUN] ${experimentId} — only ${scored.length} scored results, rerunning\n`); + } + + process.stderr.write(`\n=== Running ${experimentId} ===\n`); + const promptPath = path.join(promptDir, file); + + try { + const stdout = execFileSync('node', [ + path.join(repoRoot, 'scripts', 'summarize-prompt-experiment.mjs'), + 'openclaw/openclaw', + '--prompt-file', promptPath, + '--experiment-id', experimentId, + ], { + cwd: repoRoot, + encoding: 'utf8', + stdio: ['ignore', 'pipe', 'inherit'], + timeout: 1_800_000, 
+ env: process.env, + }); + + const result = JSON.parse(stdout.trim()); + summaryTable.push({ ...result, status: 'completed' }); + } catch (error) { + process.stderr.write(`[ERROR] ${experimentId}: ${error.message}\n`); + summaryTable.push({ experiment_id: experimentId, status: 'error', error: error.message }); + } +} + +process.stderr.write('\n\n=== SUMMARY TABLE ===\n'); +process.stdout.write(JSON.stringify(summaryTable, null, 2) + '\n'); diff --git a/scripts/run-cluster-experiments.mjs b/scripts/run-cluster-experiments.mjs new file mode 100644 index 0000000..9bbf8c2 --- /dev/null +++ b/scripts/run-cluster-experiments.mjs @@ -0,0 +1,99 @@ +#!/usr/bin/env node +/** + * Run all clustering experiments sequentially. + * Usage: node scripts/op-run.mjs run -- node scripts/run-cluster-experiments.mjs + */ +import fs from 'node:fs'; +import path from 'node:path'; +import { execFileSync } from 'node:child_process'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const resultsDir = path.join(repoRoot, '.context', 'compound-engineering', 'ce-optimize', 'embedding-clustering', 'results'); + +const EXPERIMENTS = [ + // Baseline: all 3 source kinds, max aggregation + { id: 'baseline-all-max', args: ['--aggregation', 'max'] }, + + // Source selection experiments + { id: 'source-dedupe-only', args: ['--source-kinds', 'dedupe_summary', '--aggregation', 'max'] }, + { id: 'source-title-dedupe', args: ['--source-kinds', 'title,dedupe_summary', '--aggregation', 'max'] }, + { id: 'source-body-dedupe', args: ['--source-kinds', 'body,dedupe_summary', '--aggregation', 'max'] }, + + // Aggregation method experiments (all 3 source kinds) + { id: 'agg-mean', args: ['--aggregation', 'mean'] }, + { id: 'agg-weighted', args: ['--aggregation', 'weighted'] }, + { id: 'agg-weighted-heavy-summary', args: ['--aggregation', 'weighted', '--weights', '{"dedupe_summary":0.7,"title":0.2,"body":0.1}'] }, + { id: 
'agg-min-of-2', args: ['--aggregation', 'min-of-2'] }, + { id: 'agg-boost', args: ['--aggregation', 'boost'] }, + + // Parameter tuning (using dedupe_summary only, which is likely the cleanest signal) + { id: 'param-low-threshold', args: ['--source-kinds', 'dedupe_summary', '--threshold', '0.75'] }, + { id: 'param-high-threshold', args: ['--source-kinds', 'dedupe_summary', '--threshold', '0.88'] }, + { id: 'param-more-neighbors', args: ['--source-kinds', 'dedupe_summary', '--k', '12'] }, + { id: 'param-large-clusters', args: ['--source-kinds', 'dedupe_summary', '--max-cluster-size', '400'] }, + + // Best combos (will add based on early results) + { id: 'combo-dedupe-weighted-low', args: ['--source-kinds', 'title,dedupe_summary', '--aggregation', 'weighted', '--threshold', '0.78'] }, + { id: 'combo-all-boost-low', args: ['--aggregation', 'boost', '--threshold', '0.78'] }, +]; + +// Check which experiments already have results +const existing = new Set( + fs.existsSync(resultsDir) + ? fs.readdirSync(resultsDir).filter(f => f.endsWith('.json')).map(f => f.replace('.json', '')) + : [] +); + +const summaryTable = []; +const commonArgs = [ + 'openclaw/openclaw', + '--cluster-mode', 'bounded', + // --max-cluster-size is appended per experiment below (default 200, unless the experiment overrides it) +]; + +for (const experiment of EXPERIMENTS) { + if (existing.has(experiment.id)) { + try { + const result = JSON.parse(fs.readFileSync(path.join(resultsDir, `${experiment.id}.json`), 'utf8')); + if (result.judge?.mean_score != null) { + process.stderr.write(`[SKIP] ${experiment.id} — already has judge results\n`); + summaryTable.push({ experiment_id: experiment.id, ...result.metrics, ...result.judge, status: 'cached' }); + continue; + } + } catch { /* rerun */ } + } + + process.stderr.write(`\n=== Running ${experiment.id} ===\n`); + + // Override max-cluster-size if the experiment specifies it + const expArgs = [...experiment.args]; + const hasMaxCluster = expArgs.includes('--max-cluster-size'); + + try { + const allArgs = [ + path.join(repoRoot, 
'scripts', 'cluster-judge-experiment.mjs'), + ...commonArgs, + '--experiment-id', experiment.id, + ...(hasMaxCluster ? [] : ['--max-cluster-size', '200']), + ...expArgs, + ]; + + const stdout = execFileSync('node', allArgs, { + cwd: repoRoot, + encoding: 'utf8', + stdio: ['ignore', 'pipe', 'inherit'], + timeout: 600_000, + env: process.env, + }); + + const result = JSON.parse(stdout.trim()); + summaryTable.push({ ...result, status: 'completed' }); + } catch (error) { + process.stderr.write(`[ERROR] ${experiment.id}: ${error.message}\n`); + summaryTable.push({ experiment_id: experiment.id, status: 'error', error: error.message }); + } +} + +process.stderr.write('\n\n=== SUMMARY TABLE ===\n'); +process.stdout.write(JSON.stringify(summaryTable, null, 2) + '\n'); diff --git a/scripts/summarize-prompt-experiment.mjs b/scripts/summarize-prompt-experiment.mjs new file mode 100644 index 0000000..fa27ec5 --- /dev/null +++ b/scripts/summarize-prompt-experiment.mjs @@ -0,0 +1,275 @@ +/** + * Run a summarization prompt experiment against the 40 test threads. + * Summarizes each thread, then judges the summary quality. + * + * Usage: + * node scripts/summarize-prompt-experiment.mjs \ + * --prompt-file prompts/v1.txt \ + * --experiment-id baseline \ + * --output-dir .context/compound-engineering/ce-optimize/summary-prompt/results + * + * Outputs: JSON file per experiment with all summaries and judge scores. + * Requires OPENAI_API_KEY in environment. 
+ */ +import fs from 'node:fs'; +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); + +const { GHCrawlService } = await import(serviceModulePath); + +const TEST_THREAD_IDS = [ + // Issues + 15126, 8920, 19616, 16324, 10106, 14855, 18179, 2538, 9401, 9156, + 18848, 14856, 14863, 18847, 5022, 14862, 14859, 14142, 14861, 21902, + // PRs + 22366, 17692, 20932, 13791, 4208, 9553, 8969, 17568, 4129, 21735, + 2463, 5418, 5796, 766, 17924, 5712, 21769, 8098, 539, 5565, +]; + +const DEFAULT_PROMPT = 'Summarize this GitHub issue or pull request thread. Return concise JSON only with keys problem_summary, solution_summary, maintainer_signal_summary, dedupe_summary. Each field should be plain text, no markdown, and usually 1-3 sentences.'; + +const JUDGE_PROMPT = `You are evaluating the quality of a dedupe_summary generated from a GitHub issue or pull request. The dedupe_summary will be embedded and used for clustering similar issues together. + +A good dedupe_summary: +- Captures the CORE problem or change in 1-3 sentences +- Strips away template boilerplate, checklists, testing instructions, deployment notes +- Focuses on WHAT the issue/PR is about, not HOW it was found or tested +- Uses specific technical terms that would match similar issues (e.g., "Discord REST API proxy" not "network issue") +- Avoids generic phrases that could match unrelated issues +- Does NOT include version numbers, dates, or reproduction steps (these don't help deduplication) + +Rate the dedupe_summary on a 1-5 scale: +- 5: Perfectly captures the core issue/change. Would cluster correctly with similar items. No noise. +- 4: Good signal, minor noise or missing detail that wouldn't hurt clustering much. +- 3: Adequate but includes some noise (testing details, template artifacts) or misses a key aspect. 
+- 2: Weak — too generic, too verbose, or includes significant noise that could cause false matches. +- 1: Poor — misses the point, is mostly boilerplate, or would cluster incorrectly. + +Also report: +- has_boilerplate: boolean — does the summary contain template artifacts, checklists, or testing notes? +- signal_density: 1-5 — how much of the summary is useful signal vs noise? +- would_cluster_correctly: boolean — given the title, would this summary help find duplicates? + +Return JSON only with keys: score, has_boilerplate, signal_density, would_cluster_correctly, reasoning (1 sentence).`; + +function parseArgs(argv) { + let repo = 'openclaw/openclaw'; + let promptFile = null; + let promptText = null; + let experimentId = 'unnamed'; + let outputDir = '.context/compound-engineering/ce-optimize/summary-prompt/results'; + let model = null; + let judgeModel = null; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--prompt-file') { promptFile = argv[++index]; continue; } + if (token === '--prompt') { promptText = argv[++index]; continue; } + if (token === '--experiment-id') { experimentId = argv[++index]; continue; } + if (token === '--output-dir') { outputDir = argv[++index]; continue; } + if (token === '--model') { model = argv[++index]; continue; } + if (token === '--judge-model') { judgeModel = argv[++index]; continue; } + if (!token.startsWith('--')) repo = token; + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) throw new Error(`Expected owner/repo, received: ${repo}`); + + let systemPrompt = DEFAULT_PROMPT; + if (promptFile) { + systemPrompt = fs.readFileSync(path.resolve(promptFile), 'utf8').trim(); + } else if (promptText) { + systemPrompt = promptText; + } + + return { owner, repo: name, systemPrompt, experimentId, outputDir, model, judgeModel }; +} + +async function summarizeThread(client, model, systemPrompt, summaryInput, format, summarySchema) { + for 
(const [attemptIndex, maxOutputTokens] of [500, 900, 1400].entries()) { + try { + const response = await client.responses.create({ + model, + input: [ + { role: 'system', content: [{ type: 'input_text', text: systemPrompt }] }, + { role: 'user', content: [{ type: 'input_text', text: summaryInput }] }, + ], + text: { format, verbosity: 'low' }, + max_output_tokens: maxOutputTokens, + }); + const parsed = summarySchema.parse(JSON.parse(response.output_text ?? '')); + return { + summary: parsed, + usage: response.usage ? { + input_tokens: response.usage.input_tokens, + output_tokens: response.usage.output_tokens, + } : null, + }; + } catch (error) { + if (attemptIndex === 2) throw error; + } + } +} + +async function judgeResult(client, judgeModel, title, body, dedupeSummary, judgeSchema) { + const judgeInput = `Title: ${title}\n\nOriginal body (first 1000 chars): ${(body ?? '').slice(0, 1000)}\n\ndedupe_summary to evaluate: ${dedupeSummary}`; + + const response = await client.responses.create({ + model: judgeModel, + input: [ + { role: 'system', content: [{ type: 'input_text', text: JUDGE_PROMPT }] }, + { role: 'user', content: [{ type: 'input_text', text: judgeInput }] }, + ], + text: { + format: { type: 'json_schema', name: 'judge_result', strict: true, schema: { + type: 'object', + properties: { + score: { type: 'integer' }, + has_boilerplate: { type: 'boolean' }, + signal_density: { type: 'integer' }, + would_cluster_correctly: { type: 'boolean' }, + reasoning: { type: 'string' }, + }, + required: ['score', 'has_boilerplate', 'signal_density', 'would_cluster_correctly', 'reasoning'], + additionalProperties: false, + }}, + }, + max_output_tokens: 800, + }); + + const text = response.output_text ?? 
'{}'; + try { + return JSON.parse(text); + } catch { + process.stderr.write(` judge parse error, raw: ${text.slice(0, 200)}\n`); + return { score: null, has_boilerplate: null, signal_density: null, would_cluster_correctly: null, reasoning: `parse error: ${text.slice(0, 100)}` }; + } +} + +const args = parseArgs(process.argv.slice(2)); + +import { createRequire } from 'node:module'; +const apiCoreRequire = createRequire(path.join(repoRoot, 'packages', 'api-core', 'package.json')); +const { default: OpenAI } = await import(apiCoreRequire.resolve('openai')); +const { zodTextFormat } = await import(apiCoreRequire.resolve('openai/helpers/zod')); +const { z } = await import(apiCoreRequire.resolve('zod')); + +const apiKey = process.env.OPENAI_API_KEY; +if (!apiKey) throw new Error('OPENAI_API_KEY not set'); + +const client = new OpenAI({ apiKey }); +const service = new GHCrawlService(); +const summarySchema = z.object({ + problem_summary: z.string(), + solution_summary: z.string(), + maintainer_signal_summary: z.string(), + dedupe_summary: z.string(), +}); +const format = zodTextFormat(summarySchema, 'ghcrawl_thread_summary'); + +try { + const repository = service.requireRepository(args.owner, args.repo); + const model = args.model ?? service.config.summaryModel; + const judgeModel = args.judgeModel ?? 'gpt-5-mini'; + + const results = []; + let totalInputTokens = 0; + let totalOutputTokens = 0; + + for (const [index, threadId] of TEST_THREAD_IDS.entries()) { + const thread = service.db.prepare( + 'SELECT id, number, kind, title, body, labels_json FROM threads WHERE id = ?' + ).get(threadId); + + if (!thread) { + process.stderr.write(`[${index + 1}/${TEST_THREAD_IDS.length}] thread ${threadId} not found, skipping\n`); + continue; + } + + const body = (thread.body ?? 
'').replace(/\r/g, '\n').replace(/\s+/g, ' ').trim(); + const title = thread.title.replace(/\r/g, '\n').replace(/\s+/g, ' ').trim(); + const labels = JSON.parse(thread.labels_json || '[]'); + const parts = [`title: ${title}`]; + if (body) parts.push(`body: ${body}`); + if (labels.length > 0) parts.push(`labels: ${labels.join(', ')}`); + const summaryInput = parts.join('\n\n'); + + process.stderr.write(`[${index + 1}/${TEST_THREAD_IDS.length}] #${thread.number} (${thread.kind}) summarizing...\n`); + + try { + const summaryResult = await summarizeThread(client, model, args.systemPrompt, summaryInput, format, summarySchema); + if (summaryResult.usage) { + totalInputTokens += summaryResult.usage.input_tokens; + totalOutputTokens += summaryResult.usage.output_tokens; + } + + process.stderr.write(`[${index + 1}/${TEST_THREAD_IDS.length}] #${thread.number} judging...\n`); + const judgeResult_ = await judgeResult(client, judgeModel, title, thread.body, summaryResult.summary.dedupe_summary); + + results.push({ + thread_id: threadId, + number: thread.number, + kind: thread.kind, + title: thread.title, + summary: summaryResult.summary, + judge: judgeResult_, + usage: summaryResult.usage, + }); + } catch (error) { + process.stderr.write(`[${index + 1}/${TEST_THREAD_IDS.length}] #${thread.number} ERROR: ${error.message}\n`); + results.push({ + thread_id: threadId, + number: thread.number, + kind: thread.kind, + title: thread.title, + error: error.message, + }); + } + } + + // Aggregate scores + const scored = results.filter(r => r.judge?.score != null); + const avgScore = scored.length > 0 ? scored.reduce((s, r) => s + r.judge.score, 0) / scored.length : 0; + const avgSignalDensity = scored.length > 0 ? 
scored.reduce((s, r) => s + r.judge.signal_density, 0) / scored.length : 0; + const boilerplateCount = scored.filter(r => r.judge.has_boilerplate).length; + const wouldClusterCount = scored.filter(r => r.judge.would_cluster_correctly).length; + + const experiment = { + experiment_id: args.experimentId, + model, + judge_model: judgeModel, + system_prompt: args.systemPrompt, + timestamp: new Date().toISOString(), + aggregate: { + avg_score: Math.round(avgScore * 100) / 100, + avg_signal_density: Math.round(avgSignalDensity * 100) / 100, + boilerplate_count: boilerplateCount, + boilerplate_pct: Math.round((boilerplateCount / Math.max(scored.length, 1)) * 100), + would_cluster_correctly_pct: Math.round((wouldClusterCount / Math.max(scored.length, 1)) * 100), + total_scored: scored.length, + total_errors: results.length - scored.length, + total_input_tokens: totalInputTokens, + total_output_tokens: totalOutputTokens, + }, + results, + }; + + // Save to disk + fs.mkdirSync(path.resolve(args.outputDir), { recursive: true }); + const outputPath = path.resolve(args.outputDir, `${args.experimentId}.json`); + fs.writeFileSync(outputPath, JSON.stringify(experiment, null, 2)); + process.stderr.write(`\nResults saved to ${outputPath}\n`); + + // Print summary to stdout + process.stdout.write(JSON.stringify({ + experiment_id: args.experimentId, + ...experiment.aggregate, + prompt_preview: args.systemPrompt.slice(0, 120), + }, null, 2) + '\n'); +} finally { + service.close(); +} diff --git a/scripts/summarize-single.mjs b/scripts/summarize-single.mjs new file mode 100644 index 0000000..4f94bbc --- /dev/null +++ b/scripts/summarize-single.mjs @@ -0,0 +1,145 @@ +/** + * Summarize a single thread with an optional system prompt override. + * Outputs the summary JSON to stdout. Does NOT save to DB. 
+ * + * Usage: + * node scripts/summarize-single.mjs <owner/repo> <thread-number> [--prompt-file <path>] + * node scripts/summarize-single.mjs <owner/repo> <thread-number> [--prompt "<text>"] + * + * Requires OPENAI_API_KEY in environment (use pnpm op:shell or op:exec). + */ +import fs from 'node:fs'; +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const repoRoot = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..'); +const serviceModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'service.js'); +const providerModulePath = path.join(repoRoot, 'packages', 'api-core', 'dist', 'openai', 'provider.js'); + +const { GHCrawlService } = await import(serviceModulePath); +const { OpenAiProvider } = await import(providerModulePath); + +function parseArgs(argv) { + let repo = null; + let threadNumber = null; + let promptFile = null; + let promptText = null; + let model = null; + + for (let index = 0; index < argv.length; index += 1) { + const token = argv[index]; + if (!token) continue; + if (token === '--prompt-file') { promptFile = argv[++index]; continue; } + if (token === '--prompt') { promptText = argv[++index]; continue; } + if (token === '--model') { model = argv[++index]; continue; } + if (!token.startsWith('--')) { + if (!repo) { repo = token; continue; } + if (!threadNumber) { threadNumber = Number(token); continue; } + } + } + + if (!repo || !threadNumber) { + throw new Error('Usage: summarize-single.mjs <owner/repo> <thread-number> [--prompt-file <path>] [--prompt "<text>"] [--model <model>]'); + } + + const [owner, name] = repo.split('/'); + if (!owner || !name) throw new Error(`Expected owner/repo, received: ${repo}`); + + let systemPrompt = null; + if (promptFile) { + systemPrompt = fs.readFileSync(promptFile, 'utf8').trim(); + } else if (promptText) { + systemPrompt = promptText; + } + + return { owner, repo: name, threadNumber, systemPrompt, model }; +} + +const args = parseArgs(process.argv.slice(2)); + +const service = new GHCrawlService(); +try { + const repository = service.requireRepository(args.owner, args.repo); +
+ // Load thread data + const thread = service.db.prepare( + 'SELECT id, number, title, body, labels_json FROM threads WHERE repo_id = ? AND number = ?' + ).get(repository.id, args.threadNumber); + + if (!thread) { + throw new Error(`Thread #${args.threadNumber} not found in ${args.owner}/${args.repo}`); + } + + // Build summary input (same as service.buildSummarySource but accessible here) + const body = (thread.body ?? '').replace(/\r/g, '\n').replace(/\s+/g, ' ').trim(); + const title = thread.title.replace(/\r/g, '\n').replace(/\s+/g, ' ').trim(); + const labels = JSON.parse(thread.labels_json || '[]'); + + const parts = [`title: ${title}`]; + if (body) parts.push(`body: ${body}`); + if (labels.length > 0) parts.push(`labels: ${labels.join(', ')}`); + const summaryInput = parts.join('\n\n'); + + // Default system prompt (matches current production prompt) + const defaultPrompt = 'Summarize this GitHub issue or pull request thread. Return concise JSON only with keys problem_summary, solution_summary, maintainer_signal_summary, dedupe_summary. Each field should be plain text, no markdown, and usually 1-3 sentences.'; + + const systemPrompt = args.systemPrompt ?? defaultPrompt; + const model = args.model ?? service.config.summaryModel; + + // Call OpenAI directly with optional prompt override + const apiKey = process.env.OPENAI_API_KEY; + if (!apiKey) { + throw new Error('OPENAI_API_KEY not set. 
Use pnpm op:shell or set the env var.'); + } + + const { default: OpenAI } = await import('openai'); + const { zodTextFormat } = await import('openai/helpers/zod'); + const { z } = await import('zod'); + + const summarySchema = z.object({ + problem_summary: z.string(), + solution_summary: z.string(), + maintainer_signal_summary: z.string(), + dedupe_summary: z.string(), + }); + + const client = new OpenAI({ apiKey }); + const format = zodTextFormat(summarySchema, 'ghcrawl_thread_summary'); + + const response = await client.responses.create({ + model, + input: [ + { + role: 'system', + content: [{ type: 'input_text', text: systemPrompt }], + }, + { + role: 'user', + content: [{ type: 'input_text', text: summaryInput }], + }, + ], + text: { format, verbosity: 'low' }, + max_output_tokens: 900, + }); + + const parsed = summarySchema.parse(JSON.parse(response.output_text ?? '')); + + const result = { + thread_number: args.threadNumber, + thread_id: thread.id, + title: thread.title, + model, + system_prompt_preview: systemPrompt.slice(0, 100) + (systemPrompt.length > 100 ? '...' : ''), + input_length: summaryInput.length, + summary: parsed, + usage: response.usage ? { + input_tokens: response.usage.input_tokens, + output_tokens: response.usage.output_tokens, + total_tokens: response.usage.total_tokens, + } : null, + }; + + process.stdout.write(JSON.stringify(result, null, 2) + '\n'); +} finally { + service.close(); +} diff --git a/skills/ghcrawl/SKILL.md b/skills/ghcrawl/SKILL.md index 58526f4..38c44d6 100644 --- a/skills/ghcrawl/SKILL.md +++ b/skills/ghcrawl/SKILL.md @@ -21,6 +21,14 @@ In default mode, do not treat missing credentials as a problem unless the user e Even in API-enabled mode, never run `sync`, `embed`, `cluster`, or `refresh` unless the user explicitly asks for that work. Those commands can take a long time, consume paid API usage, and trigger rate limiting if used too often. 
+Current pipeline defaults to keep in mind: + +- persistent semantic search and clustering use a `vectorlite` sidecar index +- the default summary model is `gpt-5-mini` +- the default embedding basis is `title_original`, so `refresh` does not summarize unless the user explicitly switches to `title_summary` +- changing summary model or embedding basis with `ghcrawl configure` makes the next refresh rebuild vectors and clusters +- opting into `title_summary` can materially improve clustering quality, but it adds OpenAI cost; on `openclaw/openclaw` it improved non-solo cluster membership by about 50% + Also never run `close-thread` or `close-cluster` unless the user explicitly asks you to mark a local thread or cluster closed. ## When to use this skill @@ -59,6 +67,7 @@ ghcrawl threads owner/repo --numbers 42,43,44 --json ghcrawl author owner/repo --login lqquan --json ghcrawl search owner/repo --query "download stalls" --mode hybrid --json ghcrawl neighbors owner/repo --number 42 --limit 10 --json +ghcrawl configure --json ``` These operate on the existing local SQLite dataset. @@ -74,6 +83,8 @@ If the user explicitly wants to inspect those records, add `--include-closed`. Use `threads --numbers 12345` when you need to find the cluster for one specific issue/PR number. The returned thread record includes `clusterId`. If it is non-null, follow with `cluster-detail --id <clusterId>`. +Use `configure --json` when you need to confirm the currently selected summary model or embedding basis before suggesting an expensive refresh. + Use `threads --numbers ...` when you need a batch of specific issue/PR records. Do not pay the CLI startup cost 10 times for 10 separate single-thread lookups. Use `author --login ...` when you need one author's open threads and their strongest stored same-author similarity matches in one call. @@ -134,8 +145,9 @@ ghcrawl refresh owner/repo This runs, in fixed order: 1. GitHub sync/reconcile -2. embed refresh -3. cluster rebuild +2. summarize-if-needed +3.
embed refresh +4. cluster rebuild You may skip steps only when the user explicitly wants that or the freshness state makes it unnecessary: diff --git a/skills/ghcrawl/references/protocol.md b/skills/ghcrawl/references/protocol.md index 9310032..582a4f4 100644 --- a/skills/ghcrawl/references/protocol.md +++ b/skills/ghcrawl/references/protocol.md @@ -24,6 +24,16 @@ Do not call this automatically on every skill invocation. Use it when: If the user asked only for read-only analysis, missing auth is not itself a blocker. Work from the existing local dataset through the CLI. +### `ghcrawl configure --json` + +Shows the current persisted summary model, embedding basis, vector backend, and the built-in one-time summary cost estimate. + +Use this when: + +- you need to confirm whether summaries are using `gpt-5-mini` or `gpt-5.4-mini` +- you need to confirm whether embeddings are built from `title_original` or `title_summary` +- you want to estimate whether a first refresh after a config change will be expensive + ### `ghcrawl threads owner/repo --numbers --json` Bulk read path for specific issue/PR numbers from the local DB. @@ -63,8 +73,9 @@ Useful flags: Runs the staged pipeline in fixed order: 1. GitHub sync/reconcile -2. embeddings -3. clusters +2. summarize-if-needed +3. embeddings +4. clusters Optional skips: @@ -173,6 +184,7 @@ If `ghcrawl` is not installed globally: ```bash pnpm --filter ghcrawl cli doctor --json +pnpm --filter ghcrawl cli configure --json pnpm --filter ghcrawl cli threads owner/repo --numbers 12345 --json pnpm --filter ghcrawl cli threads owner/repo --numbers 42,43,44 --json pnpm --filter ghcrawl cli threads owner/repo --numbers 42,43,44 --include-closed --json