pwrdrvr · huntharo · Apr 3, 2026 · Mar 12, 2026 · Mar 12, 2026 · Mar 12, 2026
diff --git a/.gitignore b/.gitignore
@@ -13,3 +13,4 @@ packages/*/src/**/*.d.ts.map
 **/__pycache__/
 **/*.pyc
 **/.DS_Store
+.context/
diff --git a/README.md b/README.md
@@ -34,6 +34,7 @@ GitHub is required to crawl issue and PR data. OpenAI is required for embeddings
 
 ```bash
 ghcrawl init
+ghcrawl configure
 ghcrawl doctor
 ghcrawl refresh owner/repo
 ghcrawl tui owner/repo
@@ -44,19 +45,41 @@ ghcrawl tui owner/repo
 - save plaintext keys in `~/.config/ghcrawl/config.json`
 - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file
 
-`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI.
+`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items only when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI.
+
+## One-Time Migration
+
+Upgrading to this release changes the local vector and cluster pipeline:
+
+- vectors now use a persistent `vectorlite` sidecar index
+- the active vector is one vector per open thread
+- old multi-row `document_embeddings` are removed after the first successful rebuild
+
+For an existing repo, the one-time migration command is:
+
+```bash
+ghcrawl refresh owner/repo
+```
+
+Important notes:
+
+- `refresh` performs the migration; plain `sync` does not
+- with the default `title_original` basis, the migration rebuilds vectors and clusters without running LLM summaries
+- if you switch to `title_summary`, `refresh` also runs the summarize step before embedding
+- after the first successful migration refresh, ghcrawl removes legacy embeddings, compacts the local DB, and rebuilds clusters from the current vectors
 
 ## Typical Commands
 
 ```bash
+ghcrawl configure
 ghcrawl doctor
 ghcrawl refresh owner/repo
 ghcrawl tui owner/repo
 ```
 
 `refresh`, `sync`, and `embed` call remote services and should be run intentionally.
 
-`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`.
+`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant.
 
 `clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path.
 
@@ -72,6 +95,7 @@ ghcrawl refresh --help
 For agent-facing and script-facing commands, prefer explicit machine mode:
 
 ```bash
+ghcrawl configure --json
 ghcrawl doctor --json
 ghcrawl threads owner/repo --numbers 42,43,44 --json
 ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json
@@ -118,11 +142,12 @@ If you need tighter control, you can run the three stages yourself:
 
 ```bash
 ghcrawl sync owner/repo     # pull the latest open issues and pull requests from GitHub
-ghcrawl embed owner/repo    # generate or refresh OpenAI embeddings for changed items
+ghcrawl summarize owner/repo  # optional explicit summary refresh when using title_summary
+ghcrawl embed owner/repo    # generate or refresh the single active vector per thread
 ghcrawl cluster owner/repo  # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo)
 ```
 
-Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you.
+Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you. With the default `title_original` basis, `refresh` does not summarize unless you run `summarize` explicitly.
 
 ## Init And Doctor
 
@@ -158,8 +183,29 @@ GitHub token guidance:
 - local DB path wiring
 - GitHub token presence, token-shape validation, and a live auth smoke check
 - OpenAI key presence, key-shape validation, and a live auth smoke check
+- `vectorlite` runtime readiness
 - if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly
 
+## Configure
+
+Use `configure` to inspect or change the active summary model and embedding basis:
+
+```bash
+ghcrawl configure
+ghcrawl configure --summary-model gpt-5.4-mini
+ghcrawl configure --embedding-basis title_original
+```
+
+Current defaults:
+
+- summary model: `gpt-5-mini`
+- embedding basis: `title_original` (`title + original body`)
+- vector backend: `vectorlite`
+
+Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo.
+
+If you opt into `title_summary`, ghcrawl summarizes before embedding and uses `title + dedupe summary` as the active vector text. On `openclaw/openclaw`, that improved non-solo cluster membership by about 50% versus `title_original`, but it adds OpenAI spend. A first summarize of roughly `18k` open issues and PRs in that repo typically costs about `$15-$30` with `gpt-5-mini`; later refreshes are usually much cheaper because only changed items need summaries.
+
 ### 1Password CLI Example
 
 If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly:
@@ -214,10 +260,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau
 
 ## Cost To Operate
 
-The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
+The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
 
 On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee.
 
+For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release:
+
+- `gpt-5-mini`: about **$12 USD** one time
+- `gpt-5.4-mini`: about **$30 USD** one time
+
+`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend.
+
 This screenshot is the reference point for that estimate:
 
 ![OpenAI embeddings cost for a 12k-issue repo](./docs/images/openai-embeddings-12k-issue-repo.png)
@@ -265,15 +318,16 @@ The agent and build contract for this repo lives in [SPEC.md](./SPEC.md).
 - a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo
 - `sync` is metadata-only by default
 - `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context
-- `embed` defaults to `text-embedding-3-large`
-- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present
-- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding
+- `embed` defaults to `text-embedding-3-large` with `dimensions=1024`
+- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index
+- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding
+- the default embedding basis is `title + original body`; use `ghcrawl configure --embedding-basis title_summary` if you want to summarize before embedding
 - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo`
 - `sync --limit <count>` is the best smoke-test path on a busy repository
 - `tui` remembers sort order and min cluster size per repository in the persisted config file
 - the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them
 - on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right
-- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+`
+- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+`
 
 ## Responsibility Attestation
 

diff --git a/apps/cli/README.md b/apps/cli/README.md
@@ -36,6 +36,7 @@ GitHub is required to crawl issue and PR data. OpenAI is required for embeddings
 
 ```bash
 ghcrawl init
+ghcrawl configure
 ghcrawl doctor
 ghcrawl refresh owner/repo
 ghcrawl tui owner/repo
@@ -46,19 +47,20 @@ ghcrawl tui owner/repo
 - save plaintext keys in `~/.config/ghcrawl/config.json`
 - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file
 
-`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI.
+`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI.
 
 ## Typical Commands
 
 ```bash
+ghcrawl configure
 ghcrawl doctor
 ghcrawl refresh owner/repo
 ghcrawl tui owner/repo
 ```
 
 `refresh`, `sync`, and `embed` call remote services and should be run intentionally.
 
-`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`.
+`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant.
 
 `clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path.
 
@@ -74,6 +76,7 @@ ghcrawl refresh --help
 For agent-facing and script-facing commands, prefer explicit machine mode:
 
 ```bash
+ghcrawl configure --json
 ghcrawl doctor --json
 ghcrawl threads owner/repo --numbers 42,43,44 --json
 ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json
@@ -120,11 +123,12 @@ If you need tighter control, you can run the three stages yourself:
 
 ```bash
 ghcrawl sync owner/repo     # pull the latest open issues and pull requests from GitHub
-ghcrawl embed owner/repo    # generate or refresh OpenAI embeddings for changed items
+ghcrawl summarize owner/repo  # optional explicit summary refresh when using title_summary
+ghcrawl embed owner/repo    # generate or refresh the single active vector per thread
 ghcrawl cluster owner/repo  # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo)
 ```
 
-Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you.
+Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you.
 
 ## Init And Doctor
 
@@ -160,8 +164,27 @@ GitHub token guidance:
 - local DB path wiring
 - GitHub token presence, token-shape validation, and a live auth smoke check
 - OpenAI key presence, key-shape validation, and a live auth smoke check
+- `vectorlite` runtime readiness
 - if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly
 
+## Configure
+
+Use `configure` to inspect or change the active summary model and embedding basis:
+
+```bash
+ghcrawl configure
+ghcrawl configure --summary-model gpt-5.4-mini
+ghcrawl configure --embedding-basis title_original
+```
+
+Current defaults:
+
+- summary model: `gpt-5-mini`
+- embedding basis: `title_summary` (`title + dedupe summary`)
+- vector backend: `vectorlite`
+
+Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo.
+
 ### 1Password CLI Example
 
 If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly:
@@ -216,10 +239,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau
 
 ## Cost To Operate
 
-The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
+The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
 
 On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee.
 
+For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release:
+
+- `gpt-5-mini`: about **$12 USD** one time
+- `gpt-5.4-mini`: about **$30 USD** one time
+
+`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend.
+
 This screenshot is the reference point for that estimate:
 
 ![OpenAI embeddings cost for a 12k-issue repo](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/openai-embeddings-12k-issue-repo.png)
@@ -267,15 +297,16 @@ The agent and build contract for this repo lives in [SPEC.md](https://github.com
 - a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo
 - `sync` is metadata-only by default
 - `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context
-- `embed` defaults to `text-embedding-3-large`
-- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present
-- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding
+- `embed` defaults to `text-embedding-3-large` with `dimensions=1024`
+- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index
+- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding
+- the default embedding basis is `title + dedupe summary`; use `ghcrawl configure` to switch to `title + original body`
 - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo`
 - `sync --limit <count>` is the best smoke-test path on a busy repository
 - `tui` remembers sort order and min cluster size per repository in the persisted config file
 - the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them
 - on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right
-- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+`
+- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+`
 
 ## Responsibility Attestation
 

diff --git a/apps/cli/src/main.test.ts b/apps/cli/src/main.test.ts
@@ -5,7 +5,7 @@ import os from 'node:os';
 import path from 'node:path';
 import { fileURLToPath } from 'node:url';
 
-import { GHCrawlService } from '@ghcrawl/api-core';
+import { GHCrawlService, readPersistedConfig } from '@ghcrawl/api-core';
 import { formatDoctorReport, formatLogLine, getExitCode, parseOwnerRepo, parseRepoFlags, resolveSinceValue, run, runCli } from './main.js';
 
 function createWritableCapture(isTTY?: boolean) {
@@ -39,6 +39,7 @@ function makeRunContext(): { env: NodeJS.ProcessEnv; cwd: string; cleanup: () =>
 const publicCommands = [
   'init',
   'doctor',
+  'configure',
   'version',
   'sync',
   'refresh',
@@ -171,6 +172,38 @@ test('run prints json doctor output when explicitly requested', async () => {
   assert.match(stdout.read(), /"github"/);
 });
 
+test('configure prints current persisted settings and cost estimates', async () => {
+  const stdout = createWritableCapture(true);
+  const context = makeRunContext();
+
+  try {
+    await run(['configure'], stdout.stream, { env: context.env, cwd: context.cwd });
+  } finally {
+    context.cleanup();
+  }
+
+  assert.match(stdout.read(), /ghcrawl configure/);
+  assert.match(stdout.read(), /summary model: gpt-5-mini/);
+  assert.match(stdout.read(), /embedding basis: title_original/);
+  assert.match(stdout.read(), /gpt-5\.4-mini: ~\$30 USD/);
+});
+
+test('configure persists summary model changes', async () => {
+  const stdout = createWritableCapture();
+  const context = makeRunContext();
+
+  try {
+    await run(['configure', '--summary-model', 'gpt-5.4-mini', '--json'], stdout.stream, {
+      env: context.env,
+      cwd: context.cwd,
+    });
+    const persisted = readPersistedConfig({ env: context.env, cwd: context.cwd });
+    assert.equal(persisted.data.summaryModel, 'gpt-5.4-mini');
+  } finally {
+    context.cleanup();
+  }
+});
+
 test('unknown command exits with code 2 and a top-level help hint', async () => {
   const stderr = createWritableCapture();
   const code = await runCli(['wat'], { stderr: stderr.stream });
@@ -410,6 +443,11 @@ test('formatDoctorReport renders a human-readable health summary', () => {
       authOk: false,
       error: 'missing',
     },
+    vectorlite: {
+      configured: true,
+      runtimeOk: true,
+      error: null,
+    },
   });
 
   assert.match(rendered, /config path: \/tmp\/config\.json/);