Merged
34 commits
54eb745
feat: add vectorlite cluster experiment
huntharo Mar 12, 2026
621166c
ci: compare cluster perf backends in pull requests
huntharo Mar 12, 2026
780762d
ci: report post-index vectorlite perf timings
huntharo Mar 12, 2026
9968cbe
fix: repair cluster perf ci checks
huntharo Mar 12, 2026
dc40c8d
ci: warm vectorlite perf before reporting
huntharo Mar 12, 2026
7f30965
test: add large cluster perf comparison harness
huntharo Mar 12, 2026
8c7ae7e
ci: use large fixture for cluster perf
huntharo Mar 12, 2026
aa368f7
ci: reuse current perf harness for baseline worktree
huntharo Mar 12, 2026
a3800f4
fix: restore vectorlite build after rebase
huntharo Mar 29, 2026
a9fda34
fix: measure cluster perf samples from run duration
huntharo Mar 29, 2026
ec4ead0
feat: break down cluster perf experiment metrics
huntharo Mar 29, 2026
a5bc24a
feat: add real-db cluster perf benchmark
huntharo Mar 29, 2026
0a26e31
fix: stream vectorlite cluster experiment inputs
huntharo Mar 29, 2026
54c7e50
feat: compare cluster population distributions
huntharo Mar 29, 2026
b91a522
feat: inspect cluster topology differences
huntharo Mar 29, 2026
82fc6b9
feat: refine oversized vector clusters
huntharo Mar 29, 2026
a14a3d5
feat: optimize summary prompt and add concurrent summarization pipeline
huntharo Mar 30, 2026
cb58005
feat: add source-kind selection and score aggregation to cluster expe…
huntharo Mar 31, 2026
9b1f9fc
feat: add LLM-as-judge cluster experiment runner
huntharo Mar 31, 2026
4dd92e7
fix: improve cluster judge reliability and add backend arg
huntharo Mar 31, 2026
ffd31f5
fix: handle missing clusterExperiment in CI base worktree
huntharo Mar 31, 2026
007b857
fix: correct parseRepoFlags arity and writeProgress signature in clus…
huntharo Apr 1, 2026
52009b3
docs: add clustering optimization results and recommendations
huntharo Apr 1, 2026
e1f8b76
docs: add vectorlite migration release brainstorm
huntharo Apr 1, 2026
63f2ab4
docs: add persistent vectorlite migration plan
huntharo Apr 1, 2026
2c0e2a8
feat: migrate core pipeline to persistent vectorlite vectors
huntharo Apr 1, 2026
57b4918
feat: add vector migration operator controls
huntharo Apr 1, 2026
2e334ce
fix: compact migrated vectors and tune cluster recall
huntharo Apr 2, 2026
3a9db81
fix: promote llm summary in tui detail pane
huntharo Apr 2, 2026
85909d2
feat: default vector refreshes to original issue text
huntharo Apr 3, 2026
013029d
chore: remove .context experiment artifacts
huntharo Apr 3, 2026
1fc1566
fix: prune stale vectorlite state
huntharo Apr 3, 2026
27672e7
ci: remove experimental cluster perf workflow
huntharo Apr 3, 2026
3ffa673
docs: add one-time vector migration notes
huntharo Apr 3, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ packages/*/src/**/*.d.ts.map
**/__pycache__/
**/*.pyc
**/.DS_Store
.context/
72 changes: 63 additions & 9 deletions README.md
@@ -34,6 +34,7 @@ GitHub is required to crawl issue and PR data. OpenAI is required for embeddings

```bash
ghcrawl init
ghcrawl configure
ghcrawl doctor
ghcrawl refresh owner/repo
ghcrawl tui owner/repo
@@ -44,19 +45,41 @@ ghcrawl tui owner/repo
- save plaintext keys in `~/.config/ghcrawl/config.json`
- or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file

`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI.
`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items only when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI.

## One-Time Migration

Upgrading to this release changes the local vector and cluster pipeline:

- vectors now use a persistent `vectorlite` sidecar index
- the active vector is one vector per open thread
- old multi-row `document_embeddings` are removed after the first successful rebuild

For an existing repo, the one-time migration command is:

```bash
ghcrawl refresh owner/repo
```

Important notes:

- `refresh` performs the migration; plain `sync` does not
- with the default `title_original` basis, the migration rebuilds vectors and clusters without running LLM summaries
- if you switch to `title_summary`, `refresh` also runs the summarize step before embedding
- after the first successful migration refresh, ghcrawl removes legacy embeddings, compacts the local DB, and rebuilds clusters from the current vectors

## Typical Commands

```bash
ghcrawl configure
ghcrawl doctor
ghcrawl refresh owner/repo
ghcrawl tui owner/repo
```

`refresh`, `sync`, and `embed` call remote services and should be run intentionally.

`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`.
`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant.

`clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path.

@@ -72,6 +95,7 @@ ghcrawl refresh --help
For agent-facing and script-facing commands, prefer explicit machine mode:

```bash
ghcrawl configure --json
ghcrawl doctor --json
ghcrawl threads owner/repo --numbers 42,43,44 --json
ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json
@@ -118,11 +142,12 @@ If you need tighter control, you can run the three stages yourself:

```bash
ghcrawl sync owner/repo # pull the latest open issues and pull requests from GitHub
ghcrawl embed owner/repo # generate or refresh OpenAI embeddings for changed items
ghcrawl summarize owner/repo # optional explicit summary refresh when using title_summary
ghcrawl embed owner/repo # generate or refresh the single active vector per thread
ghcrawl cluster owner/repo # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo)
```

Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you.
Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you. With the default `title_original` basis, `refresh` does not summarize unless you run `summarize` explicitly.
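
The stage ordering described above can be sketched as a small planning function. This is an illustration only, not ghcrawl's actual implementation: `BasisName`, `StageName`, and `planRefreshStages` are hypothetical names, and the only behavior taken from this README is that the summarize stage runs before embed when the basis is `title_summary`.

```typescript
// Hypothetical types and function, for illustration only.
type BasisName = "title_original" | "title_summary";
type StageName = "sync" | "summarize" | "embed" | "cluster";

// Returns the stages a refresh would run, in order, for a given embedding basis.
function planRefreshStages(basis: BasisName): StageName[] {
  const stages: StageName[] = ["sync"];
  // Summaries are only needed when the vector text is built from them.
  if (basis === "title_summary") {
    stages.push("summarize");
  }
  stages.push("embed", "cluster");
  return stages;
}

console.log(planRefreshStages("title_original").join(" -> "));
console.log(planRefreshStages("title_summary").join(" -> "));
```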

## Init And Doctor

@@ -158,8 +183,29 @@ GitHub token guidance:
- local DB path wiring
- GitHub token presence, token-shape validation, and a live auth smoke check
- OpenAI key presence, key-shape validation, and a live auth smoke check
- `vectorlite` runtime readiness
- if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly

## Configure

Use `configure` to inspect or change the active summary model and embedding basis:

```bash
ghcrawl configure
ghcrawl configure --summary-model gpt-5.4-mini
ghcrawl configure --embedding-basis title_original
```

Current defaults:

- summary model: `gpt-5-mini`
- embedding basis: `title_original` (`title + original body`)
- vector backend: `vectorlite`

Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo.

If you opt into `title_summary`, ghcrawl summarizes before embedding and uses `title + dedupe summary` as the active vector text. On `openclaw/openclaw`, that improved non-solo cluster membership by about 50% versus `title_original`, but it adds OpenAI spend. A first summarize of roughly `18k` open issues and PRs in that repo typically costs about `$15-$30` with `gpt-5-mini`; later refreshes are usually much cheaper because only changed items need summaries.

### 1Password CLI Example

If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly:
@@ -214,10 +260,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau

## Cost To Operate

The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).

On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee.

For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release:

- `gpt-5-mini`: about **$12 USD** one time
- `gpt-5.4-mini`: about **$30 USD** one time

`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend.

This screenshot is the reference point for that estimate:

![OpenAI embeddings cost for a 12k-issue repo](./docs/images/openai-embeddings-12k-issue-repo.png)
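
For a repo of a different size, the one-time summary estimates above can be scaled as a rough first guess. The sketch below assumes cost grows linearly with the number of issues and PRs, which is a simplification this README does not claim. `estimateSummaryCostUsd` and both constants are hypothetical names; the only inputs taken from the text are the `~20k` reference size and the two dollar figures.

```typescript
// Reference points copied from the estimates above; linear scaling is an assumption.
const REFERENCE_ITEM_COUNT = 20_000;
const REFERENCE_COST_USD: Record<string, number> = {
  "gpt-5-mini": 12,
  "gpt-5.4-mini": 30,
};

// Hypothetical helper: scales the reference estimate to another repo size.
function estimateSummaryCostUsd(model: string, itemCount: number): number {
  const referenceCost = REFERENCE_COST_USD[model];
  if (referenceCost === undefined) {
    throw new Error(`no reference estimate for model: ${model}`);
  }
  if (!Number.isFinite(itemCount) || itemCount < 0) {
    throw new Error("itemCount must be a non-negative number");
  }
  return (referenceCost * itemCount) / REFERENCE_ITEM_COUNT;
}

console.log(estimateSummaryCostUsd("gpt-5-mini", 10_000));
```

Treat the result as an order-of-magnitude figure: real spend depends on issue length and current pricing, and `ghcrawl configure` remains the source of the operator estimates.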
@@ -265,15 +318,16 @@ The agent and build contract for this repo lives in [SPEC.md](./SPEC.md).
- a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo
- `sync` is metadata-only by default
- `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context
- `embed` defaults to `text-embedding-3-large`
- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present
- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding
- `embed` defaults to `text-embedding-3-large` with `dimensions=1024`
- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index
- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding
- the default embedding basis is `title + original body`; use `ghcrawl configure --embedding-basis title_summary` if you want to summarize before embedding
- `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo`
- `sync --limit <count>` is the best smoke-test path on a busy repository
- `tui` remembers sort order and min cluster size per repository in the persisted config file
- the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them
- on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right
- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+`
- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+`
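
The per-thread input hash mentioned in the `embed` notes above can be sketched as follows. This is a minimal illustration, not ghcrawl's code: `ThreadText`, `buildEmbeddingInput`, `hashEmbeddingInput`, and `needsEmbedding` are hypothetical names, and the choice of SHA-256 and the `title + blank line + body` layout are assumptions. The idea is that a thread is only resubmitted for embedding when the hash of its current input differs from the stored one.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for the text that feeds the single active vector.
interface ThreadText {
  title: string;
  body: string;
}

// Assumed layout for the title_original basis: title, blank line, body.
function buildEmbeddingInput(thread: ThreadText): string {
  return `${thread.title}\n\n${thread.body}`;
}

// SHA-256 is an assumption; any stable content hash would serve the same purpose.
function hashEmbeddingInput(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// True when the thread has no stored hash yet or its text has changed.
function needsEmbedding(thread: ThreadText, storedHash: string | null): boolean {
  return hashEmbeddingInput(buildEmbeddingInput(thread)) !== storedHash;
}

const thread: ThreadText = { title: "Crash on startup", body: "Steps to reproduce..." };
const stored = hashEmbeddingInput(buildEmbeddingInput(thread));
console.log(needsEmbedding(thread, stored)); // unchanged text
console.log(needsEmbedding({ ...thread, body: "Edited body" }, stored)); // changed text
```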

## Responsibility Attestation

49 changes: 40 additions & 9 deletions apps/cli/README.md
@@ -36,6 +36,7 @@ GitHub is required to crawl issue and PR data. OpenAI is required for embeddings

```bash
ghcrawl init
ghcrawl configure
ghcrawl doctor
ghcrawl refresh owner/repo
ghcrawl tui owner/repo
@@ -46,19 +47,20 @@ ghcrawl tui owner/repo
- save plaintext keys in `~/.config/ghcrawl/config.json`
- or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file

`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI.
`ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, summarizes changed items when the active embedding basis depends on summaries, refreshes vectors, and rebuilds the clusters you browse in the TUI.

## Typical Commands

```bash
ghcrawl configure
ghcrawl doctor
ghcrawl refresh owner/repo
ghcrawl tui owner/repo
```

`refresh`, `sync`, and `embed` call remote services and should be run intentionally.

`cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`.
`cluster` does not call remote services, but it is still time consuming. It now uses a persistent `vectorlite` index instead of exact in-memory scans, so large-repo rebuilds are materially faster, but still not instant.

`clusters` explores the clusters already stored in the local SQLite database and is expected to be the fast, read-only inspection path.

@@ -74,6 +76,7 @@ ghcrawl refresh --help
For agent-facing and script-facing commands, prefer explicit machine mode:

```bash
ghcrawl configure --json
ghcrawl doctor --json
ghcrawl threads owner/repo --numbers 42,43,44 --json
ghcrawl clusters owner/repo --min-size 10 --limit 20 --sort recent --json
@@ -120,11 +123,12 @@ If you need tighter control, you can run the three stages yourself:

```bash
ghcrawl sync owner/repo # pull the latest open issues and pull requests from GitHub
ghcrawl embed owner/repo # generate or refresh OpenAI embeddings for changed items
ghcrawl summarize owner/repo # optional explicit summary refresh when using title_summary
ghcrawl embed owner/repo # generate or refresh the single active vector per thread
ghcrawl cluster owner/repo # rebuild local related-work clusters from the current vectors (local-only, but can take ~10 minutes on a ~12k issue/PR repo)
```

Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you.
Run them in that order. If your embedding basis is `title_summary`, `refresh` automatically inserts the summarize stage before embed for you.

## Init And Doctor

@@ -160,8 +164,27 @@ GitHub token guidance:
- local DB path wiring
- GitHub token presence, token-shape validation, and a live auth smoke check
- OpenAI key presence, key-shape validation, and a live auth smoke check
- `vectorlite` runtime readiness
- if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly

## Configure

Use `configure` to inspect or change the active summary model and embedding basis:

```bash
ghcrawl configure
ghcrawl configure --summary-model gpt-5.4-mini
ghcrawl configure --embedding-basis title_original
```

Current defaults:

- summary model: `gpt-5-mini`
- embedding basis: `title_original` (`title + original body`)
- vector backend: `vectorlite`

Changing the summary model or embedding basis makes the next `refresh` rebuild vectors and clusters for that repo.
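
One way to picture that rebuild trigger is a comparison between the settings that are active now and the settings the current vectors were built with. The sketch below is hypothetical: `VectorSettings` and `needsVectorRebuild` are illustrative names, and treating a repo with no recorded settings as needing a rebuild is an assumption rather than documented behavior.

```typescript
// Hypothetical record of the settings a repo's vectors were built with.
interface VectorSettings {
  summaryModel: string;
  embeddingBasis: string;
}

// True when the next refresh should rebuild vectors and clusters.
function needsVectorRebuild(active: VectorSettings, lastBuilt: VectorSettings | null): boolean {
  // Assumption: a repo with no recorded build settings is rebuilt.
  if (lastBuilt === null) {
    return true;
  }
  return (
    active.summaryModel !== lastBuilt.summaryModel ||
    active.embeddingBasis !== lastBuilt.embeddingBasis
  );
}

const built: VectorSettings = { summaryModel: "gpt-5-mini", embeddingBasis: "title_original" };
console.log(needsVectorRebuild(built, built)); // same settings
console.log(needsVectorRebuild({ ...built, summaryModel: "gpt-5.4-mini" }, built)); // model changed
```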

### 1Password CLI Example

If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly:
@@ -216,10 +239,17 @@ Use `close-cluster` when you want to locally suppress a whole cluster from defau

## Cost To Operate

The main variable cost is OpenAI embeddings. Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).
The main variable costs are summarization and embeddings. Embedding pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings).

On a real local run against roughly `12k` issues plus about `1.2x` related PR and issue inputs, [`text-embedding-3-large`](https://developers.openai.com/api/docs/pricing#embeddings) came out to about **$0.65 USD** total to embed the repo. Treat that as an approximate data point for something like `~14k` issue and PR inputs, not a hard guarantee.

For one-time summary migration planning on a repo around the size of `openclaw/openclaw` (`~20k` issues and PRs), `ghcrawl configure` reports these operator estimates using the April 1, 2026 USD pricing assumptions for this release:

- `gpt-5-mini`: about **$12 USD** one time
- `gpt-5.4-mini`: about **$30 USD** one time

`gpt-5-mini` is the default to keep that migration cost lower. `gpt-5.4-mini` is available when you want higher-quality summaries and are comfortable with the higher one-time spend.

This screenshot is the reference point for that estimate:

![OpenAI embeddings cost for a 12k-issue repo](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/openai-embeddings-12k-issue-repo.png)
@@ -267,15 +297,16 @@ The agent and build contract for this repo lives in [SPEC.md](https://github.com
- a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo
- `sync` is metadata-only by default
- `sync --include-comments` enables issue comments, PR reviews, and review comments for deeper context
- `embed` defaults to `text-embedding-3-large`
- `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present
- `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding
- `embed` defaults to `text-embedding-3-large` with `dimensions=1024`
- `embed` maintains one active vector per thread, stored in a persistent `vectorlite` sidecar index
- `embed` stores an input hash per thread and will not resubmit unchanged text for re-embedding
- the default embedding basis is `title + original body`; use `ghcrawl configure --embedding-basis title_summary` if you want to summarize before embedding
- `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo`
- `sync --limit <count>` is the best smoke-test path on a busy repository
- `tui` remembers sort order and min cluster size per repository in the persisted config file
- the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them
- on wide screens, press `l` to toggle between three columns and a wider cluster list with members/detail stacked on the right
- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+`
- if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> summarize-if-needed -> embed -> cluster and opens that repo with min cluster size `1+`
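
The `sync --since` formats listed above (ISO timestamps plus relative durations such as `15m`, `2h`, `7d`, and `1mo`) could be resolved along these lines. This is a sketch under stated assumptions, not the CLI's real `resolveSinceValue`, whose implementation is not shown here: `resolveSinceSketch` is a hypothetical name, and treating `1mo` as 30 days is a guess.

```typescript
// Milliseconds per supported unit. Treating "mo" as 30 days is an assumption.
const SINCE_UNIT_MS: Record<string, number> = {
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
  mo: 30 * 86_400_000,
};

// Hypothetical helper: turns a --since value into an absolute Date.
function resolveSinceSketch(value: string, now: Date = new Date()): Date {
  // "mo" is listed before "m" so that "1mo" is not read as minutes.
  const match = /^(\d+)(mo|m|h|d)$/.exec(value.trim());
  if (match) {
    const [, amountText = "", unit = ""] = match;
    const unitMs = SINCE_UNIT_MS[unit];
    if (unitMs === undefined) {
      throw new Error(`unsupported --since unit: ${unit}`);
    }
    return new Date(now.getTime() - Number(amountText) * unitMs);
  }
  // Anything else is treated as an ISO timestamp.
  const parsed = new Date(value);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error(`invalid --since value: ${value}`);
  }
  return parsed;
}

const reference = new Date("2026-04-03T00:00:00.000Z");
console.log(resolveSinceSketch("2h", reference).toISOString());
console.log(resolveSinceSketch("2026-01-01T00:00:00Z").toISOString());
```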

## Responsibility Attestation

40 changes: 39 additions & 1 deletion apps/cli/src/main.test.ts
@@ -5,7 +5,7 @@ import os from 'node:os';
import path from 'node:path';
import { fileURLToPath } from 'node:url';

import { GHCrawlService } from '@ghcrawl/api-core';
import { GHCrawlService, readPersistedConfig } from '@ghcrawl/api-core';
import { formatDoctorReport, formatLogLine, getExitCode, parseOwnerRepo, parseRepoFlags, resolveSinceValue, run, runCli } from './main.js';

function createWritableCapture(isTTY?: boolean) {
@@ -39,6 +39,7 @@ function makeRunContext(): { env: NodeJS.ProcessEnv; cwd: string; cleanup: () =>
const publicCommands = [
  'init',
  'doctor',
  'configure',
  'version',
  'sync',
  'refresh',
@@ -171,6 +172,38 @@ test('run prints json doctor output when explicitly requested', async () => {
assert.match(stdout.read(), /"github"/);
});

test('configure prints current persisted settings and cost estimates', async () => {
  const stdout = createWritableCapture(true);
  const context = makeRunContext();

  try {
    await run(['configure'], stdout.stream, { env: context.env, cwd: context.cwd });
  } finally {
    context.cleanup();
  }

  assert.match(stdout.read(), /ghcrawl configure/);
  assert.match(stdout.read(), /summary model: gpt-5-mini/);
  assert.match(stdout.read(), /embedding basis: title_original/);
  assert.match(stdout.read(), /gpt-5\.4-mini: ~\$30 USD/);
});

test('configure persists summary model changes', async () => {
  const stdout = createWritableCapture();
  const context = makeRunContext();

  try {
    await run(['configure', '--summary-model', 'gpt-5.4-mini', '--json'], stdout.stream, {
      env: context.env,
      cwd: context.cwd,
    });
    const persisted = readPersistedConfig({ env: context.env, cwd: context.cwd });
    assert.equal(persisted.data.summaryModel, 'gpt-5.4-mini');
  } finally {
    context.cleanup();
  }
});

test('unknown command exits with code 2 and a top-level help hint', async () => {
const stderr = createWritableCapture();
const code = await runCli(['wat'], { stderr: stderr.stream });
@@ -410,6 +443,11 @@ test('formatDoctorReport renders a human-readable health summary', () => {
authOk: false,
error: 'missing',
},
vectorlite: {
configured: true,
runtimeOk: true,
error: null,
},
});

assert.match(rendered, /config path: \/tmp\/config\.json/);