feat: migrate ghcrawl to persistent vectorlite search and clustering by huntharo · Pull Request #7 · pwrdrvr/ghcrawl

huntharo · 2026-03-12T02:46:57Z

Summary

I replaced the in-memory exact vector path with a persistent vectorlite sidecar index for semantic search, neighbors, and clustering.
I moved the active pipeline to one 1024-dimension vector per open thread and wired refresh to perform the one-time migration on the first rebuild.
I kept LLM summaries optional: the default embedding basis is now title_original, while title_summary is an opt-in mode that summarizes before embedding.
I added operator-facing controls and guidance for summary model, embedding basis, migration costs, and the TUI refresh flow.

What Changed

Persistent vector backend:
- vectorlite is now the default vector backend.
- active vectors are stored per thread and backed by the persistent sidecar store.
- semantic search, neighbors, and cluster use that persistent vector store.
Migration behavior:
- the first successful rebuild via ghcrawl refresh owner/repo migrates the repo to the new vector path.
- plain sync does not perform that migration.
- after a successful rebuild, ghcrawl removes legacy document_embeddings, compacts stored vectors, checkpoints the WAL, and vacuums the DB.
Summaries and costs:
- default summary model is gpt-5-mini.
- default embedding basis is title_original to avoid surprise one-time LLM spend.
- title_summary is available as an opt-in mode through ghcrawl configure.
- README and TUI copy now explain that summarizing roughly 18k open issues/PRs in openclaw/openclaw is typically about $15-$30 one time with gpt-5-mini, and later refreshes are usually much cheaper.
TUI / UX:
- the update popup now says whether LLM summaries are enabled or disabled and explains the tradeoff.
- the detail pane now shows LLM Summary above the raw body instead of exposing the internal dedupe_summary label.
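
To make the cost guidance above concrete, here is a minimal sketch that scales the quoted figure (about $15-$30 for roughly 18k threads with gpt-5-mini) to another repo size. Linear per-thread scaling is an assumption made for illustration; the PR does not state how cost varies with thread length or count, and `estimateSummaryCost` is a hypothetical helper, not part of ghcrawl.

```typescript
// Rough one-time summary cost, scaled from the openclaw/openclaw figure quoted above.
// ASSUMPTION: cost scales linearly with the number of open threads.
interface CostRange {
  lowUsd: number;
  highUsd: number;
}

const REFERENCE_THREADS = 18_000; // "roughly 18k open issues/PRs"
const REFERENCE_COST: CostRange = { lowUsd: 15, highUsd: 30 }; // "about $15-$30"

// Hypothetical helper: not a ghcrawl API.
function estimateSummaryCost(threadCount: number): CostRange {
  const scale = threadCount / REFERENCE_THREADS;
  return {
    lowUsd: REFERENCE_COST.lowUsd * scale,
    highUsd: REFERENCE_COST.highUsd * scale,
  };
}

const exampleEstimate = estimateSummaryCost(6_000);
console.log(
  `~$${exampleEstimate.lowUsd.toFixed(2)} to ~$${exampleEstimate.highUsd.toFixed(2)} one time`,
);
```

Under that assumption the implied per-thread cost is on the order of a tenth of a cent, which is why later incremental refreshes are expected to be much cheaper than the first pass.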

Clustering Results

On the live openclaw/openclaw dataset:

- with title_original, non-solo cluster membership is about 43.7%
- with title_summary, non-solo cluster membership improves by about 50% relative to title_original

That is why summaries are presented as an opt-in quality mode rather than the default.
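
For readers unfamiliar with the metric, the sketch below shows one way to compute it. It assumes "non-solo cluster membership" means the fraction of threads that land in a cluster of size 2 or more, which the PR does not define explicitly; both function names are hypothetical.

```typescript
// ASSUMPTION: non-solo membership = threads in clusters of size >= 2, divided by all threads.
function nonSoloMembership(clusterSizes: number[]): number {
  const total = clusterSizes.reduce((sum, size) => sum + size, 0);
  if (total === 0) return 0;
  const nonSolo = clusterSizes
    .filter((size) => size >= 2)
    .reduce((sum, size) => sum + size, 0);
  return nonSolo / total;
}

// Relative change of a candidate rate against a baseline rate (0.5 means +50%).
function relativeImprovement(baseline: number, candidate: number): number {
  return (candidate - baseline) / baseline;
}

// Toy data: 8 threads in clusters of sizes 3, 2, 1, 1, 1 -> 5 of 8 are non-solo.
console.log(nonSoloMembership([3, 2, 1, 1, 1]));

// If "about 50% relative" is read multiplicatively, 43.7% would become roughly 65.6%.
console.log((0.437 * 1.5).toFixed(4));
```

The second figure is only what the two quoted numbers imply together; the PR itself does not report an absolute title_summary percentage.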

Size / Migration Results

On the migrated local openclaw/openclaw data:

- main DB dropped from about 5.2 GB to about 867 MB
- total ~/.config/ghcrawl dropped to about 965 MB
- legacy document_embeddings for the repo were removed after migration

How I Tested

pnpm --filter @ghcrawl/api-core test
pnpm --filter ghcrawl test
pnpm typecheck
pnpm --filter ghcrawl cli cluster openclaw/openclaw
pnpm --filter ghcrawl cli cluster openclaw/openclaw --threshold 0.78
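
As background for the `--threshold 0.78` run, the following is a generic sketch of threshold-based clustering: link any two vectors whose cosine similarity meets the threshold, then take connected components. This is an illustration of the general technique only. Whether ghcrawl assembles clusters exactly this way is an assumption, and the brute-force pairwise loop here stands in for whatever the vectorlite index does.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Returns clusters as arrays of input indices (singletons included).
// ASSUMPTION: clusters are connected components of the "similarity >= threshold" graph.
function clusterByThreshold(vectors: number[][], threshold: number): number[][] {
  const parent = vectors.map((_, i) => i);

  // Union-find lookup with path halving.
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };

  // Exact O(n^2) edge build; an ANN index would replace this loop at scale.
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      if (cosineSimilarity(vectors[i], vectors[j]) >= threshold) {
        const rootI = find(i);
        const rootJ = find(j);
        if (rootI !== rootJ) parent[rootI] = rootJ;
      }
    }
  }

  const groups = new Map<number, number[]>();
  vectors.forEach((_, i) => {
    const root = find(i);
    const members = groups.get(root) ?? [];
    members.push(i);
    groups.set(root, members);
  });
  return Array.from(groups.values());
}
```

Raising the threshold removes edges, so clusters can only split or stay the same; lowering it can only merge them.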

I also verified the migration behavior against the real local openclaw/openclaw database, including:

- one-time cleanup of legacy embeddings
- DB compaction
- updated cluster coverage

PR CI is green on this branch.

github-actions · 2026-03-12T14:10:13Z

Cluster Performance

Backend: exact
Timing basis: cluster-only
Status: PASS
Fixture median (cluster-only): 339.5 ms (12 samples, 3 cluster rebuilds/sample)
Fixture median (total run): 394.0 ms
Fixture median load stage: 48.5 ms
Fixture median setup stage: 0.0 ms
Fixture median exact edge-build stage: 338.5 ms
Fixture median vector index-build stage: 0.0 ms
Fixture median vector query stage: 0.0 ms
Fixture median cluster-assembly stage: 1.0 ms
Median peak RSS: 223.3 MiB
Median peak heap used: 72.3 MiB
Fixture baseline: 450.6 ms
Fixture delta: -111.1 ms (-24.7%)
Projected openclaw/openclaw duration: 6m 20.7s
Projected openclaw/openclaw baseline: 8m 25.3s
Projected delta: -124586.8 ms (-24.7%)
Regression threshold: +50.0%
Fixture shape: 512 threads x 3 source kinds
Sample durations: 341.0 ms, 345.0 ms, 338.0 ms, 338.0 ms, 340.0 ms, 337.0 ms, 341.0 ms, 339.0 ms, 345.0 ms, 338.0 ms, 337.0 ms, 347.0 ms
Suggested baseline update: {"fixtureMedianMs":339.5,"projectedOpenclawMs":380713}
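
The headline numbers in this report can be reproduced from the listed sample durations. The sketch below recomputes the fixture median and delta; the PASS/FAIL rule (fail only when the relative delta exceeds the +50.0% threshold) is an assumption about how the harness decides, inferred from the report rather than taken from its source.

```typescript
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Values copied from the report above.
const samplesMs = [341, 345, 338, 338, 340, 337, 341, 339, 345, 338, 337, 347];
const baselineMs = 450.6;
const regressionThreshold = 0.5; // +50.0%

const fixtureMedianMs = median(samplesMs);
const deltaMs = fixtureMedianMs - baselineMs;
const deltaRatio = deltaMs / baselineMs;

// ASSUMPTION: the check fails only when the slowdown exceeds the threshold.
const perfStatus = deltaRatio > regressionThreshold ? "FAIL" : "PASS";

console.log(`median ${fixtureMedianMs} ms`); // median 339.5 ms
console.log(`delta ${deltaMs.toFixed(1)} ms (${(deltaRatio * 100).toFixed(1)}%)`); // delta -111.1 ms (-24.7%)
console.log(perfStatus); // PASS
```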

Run: workflow run for 3ffa673

Tested 11 system prompt variants with LLM-as-judge on 40 threads. Winner (v5-component-focused) scores 4.97/5 vs baseline 2.65/5 with 0% boilerplate and 100% clustering correctness.

- Replace sequential summarize loop with two-stage IterableMapper pipeline (concurrency 5 for API calls, concurrency 1 for DB writes)
- Add running cost estimate and ETA to summarize progress output
- Add op-run.mjs 'run' mode for arbitrary commands with 1Password env
- Add experiment scripts for prompt optimization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
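
The two-stage shape described in this commit (several concurrent API calls feeding a single DB writer) can be sketched without any library. This is not the IterableMapper API and not the code in the PR; `summarizeAndStore`, `summarize`, and `store` are hypothetical names, and unlike a real streaming mapper this version applies no backpressure between the stages.

```typescript
// Stage 1: up to `apiConcurrency` summarize calls in flight at once.
// Stage 2: store calls are chained so only one write runs at a time.
async function summarizeAndStore<T, S>(
  items: readonly T[],
  summarize: (item: T) => Promise<S>,
  store: (item: T, summary: S) => Promise<void>,
  apiConcurrency = 5,
): Promise<void> {
  let nextIndex = 0;
  let writeChain: Promise<void> = Promise.resolve();

  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const item = items[nextIndex++];
      const summary = await summarize(item);
      // Append to the serial write chain: effective write concurrency is 1.
      writeChain = writeChain.then(() => store(item, summary));
      // Mark the chain as handled; a write error is still rethrown by the final await.
      writeChain.catch(() => undefined);
    }
  }

  const workerCount = Math.min(apiConcurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  await writeChain;
}
```

Keeping writes on a single chain avoids concurrent writers contending for the database while still letting the slower network calls overlap.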

…riments

Add --source-kinds and --aggregation params to clusterExperiment, supporting:

- Source kind filtering (title, body, dedupe_summary, any combination)
- Aggregation methods: max, mean, weighted, min-of-2, boost

Collect per-source-kind scores then finalize with chosen method, replacing the hardcoded max() aggregation. This enables experiments comparing how different embedding signals and combination strategies affect cluster quality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
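
As an illustration of "collect per-source-kind scores, then finalize", here is a minimal sketch covering three of the listed methods. The type names, the function name, and the weights are all hypothetical. The min-of-2 and boost methods are left out because the commit message does not define them.

```typescript
type SourceKind = "title" | "body" | "dedupe_summary";
type Aggregation = "max" | "mean" | "weighted";
type ScoresBySource = Partial<Record<SourceKind, number>>;

// ASSUMPTION: illustrative weights only; the PR does not list the real ones.
const EXAMPLE_WEIGHTS: Record<SourceKind, number> = {
  title: 1,
  body: 1,
  dedupe_summary: 2,
};

// Combine the similarity scores collected for one candidate pair into a single score.
function aggregateScores(
  scores: ScoresBySource,
  method: Aggregation,
  weights: Record<SourceKind, number> = EXAMPLE_WEIGHTS,
): number {
  const entries = (Object.entries(scores) as [SourceKind, number | undefined][]).filter(
    (entry): entry is [SourceKind, number] => typeof entry[1] === "number",
  );
  if (entries.length === 0) return 0;

  switch (method) {
    case "max":
      return Math.max(...entries.map(([, score]) => score));
    case "mean":
      return entries.reduce((sum, [, score]) => sum + score, 0) / entries.length;
    case "weighted": {
      const totalWeight = entries.reduce((sum, [kind]) => sum + weights[kind], 0);
      if (totalWeight === 0) return 0;
      return entries.reduce((sum, [kind, score]) => sum + weights[kind] * score, 0) / totalWeight;
    }
  }
}

const pairScores: ScoresBySource = { title: 0.9, body: 0.6, dedupe_summary: 0.8 };
console.log(aggregateScores(pairScores, "max"));
console.log(aggregateScores(pairScores, "mean"));
console.log(aggregateScores(pairScores, "weighted"));
```

Source kind filtering falls out of the same shape: passing only the selected kinds in `scores` restricts which signals can contribute.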

Creates cluster-judge-experiment.mjs that runs clusterExperiment with configurable params, samples clusters (stratified by size), judges them with an LLM for coherence scoring, and evaluates singletons for false negatives. Includes batch runner for 15 planned experiments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add --backend flag to cluster-judge-experiment.mjs (supports exact/vectorlite)
- Increase max_output_tokens for cluster judge (500→800) and singleton judge (300→500)
- Truncate large clusters to 25 representative items for judge context limits
- Add .context/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The CI perf harness copies HEAD's perf.integration.ts into the base worktree, but clusterExperiment doesn't exist on main. Add a runtime guard that falls back to clusterRepository when the method is missing. Also regenerate lockfile after rebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
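
The runtime guard described here is a plain feature check. A minimal sketch, assuming both methods take a repo name and return a promise; the real signatures in ghcrawl are not shown in this PR page, and `pickClusterMethod` and `runClusterBenchmark` are hypothetical names:

```typescript
// ASSUMPTION: simplified shapes; the real service methods likely take richer options.
interface ClusterRunResult {
  clusterCount: number;
}

interface ClusterService {
  clusterRepository(repo: string): Promise<ClusterRunResult>;
  clusterExperiment?(repo: string): Promise<ClusterRunResult>;
}

// Decide at runtime which method exists on the service built from this worktree.
function pickClusterMethod(service: object): "clusterExperiment" | "clusterRepository" {
  const candidate = (service as Record<string, unknown>)["clusterExperiment"];
  return typeof candidate === "function" ? "clusterExperiment" : "clusterRepository";
}

async function runClusterBenchmark(service: ClusterService, repo: string): Promise<ClusterRunResult> {
  if (pickClusterMethod(service) === "clusterExperiment" && service.clusterExperiment) {
    return service.clusterExperiment(repo);
  }
  // Base worktree (main) predates clusterExperiment, so fall back.
  return service.clusterRepository(repo);
}
```

Checking for the method at call time lets the same harness file run unchanged against both the HEAD and base worktrees.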

Documents the embedding clustering optimization experiments comparing exact kNN vs vectorlite HNSW, source kind selection, and aggregation strategies. Records why source-dedupe-only was chosen as the recommended configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

huntharo force-pushed the codex/vectorlite branch from b28a24a to 51a6ab7 Compare March 12, 2026 13:45

huntharo mentioned this pull request Mar 12, 2026

fix: grant PR comment workflow pull-request scope #9

Merged

huntharo force-pushed the codex/vectorlite branch from 51a6ab7 to 7f7fd0c Compare March 12, 2026 14:09

huntharo added this to ghcrawl Mar 19, 2026

huntharo moved this to In Review in ghcrawl Mar 19, 2026

huntharo force-pushed the codex/vectorlite branch 2 times, most recently from ac9e920 to d6ccc80 Compare March 29, 2026 00:54

huntharo and others added 21 commits March 31, 2026 15:50

feat: add vectorlite cluster experiment

54eb745

ci: compare cluster perf backends in pull requests

621166c

ci: report post-index vectorlite perf timings

780762d

fix: repair cluster perf ci checks

9968cbe

ci: warm vectorlite perf before reporting

dc40c8d

test: add large cluster perf comparison harness

7f30965

ci: use large fixture for cluster perf

8c7ae7e

ci: reuse current perf harness for baseline worktree

aa368f7

fix: restore vectorlite build after rebase

a3800f4

fix: measure cluster perf samples from run duration

a9fda34

feat: break down cluster perf experiment metrics

ec4ead0

feat: add real-db cluster perf benchmark

a5bc24a

fix: stream vectorlite cluster experiment inputs

0a26e31

feat: compare cluster population distributions

54c7e50

feat: inspect cluster topology differences

b91a522

feat: refine oversized vector clusters

82fc6b9

huntharo force-pushed the codex/vectorlite branch from 52f0be6 to ffd31f5 Compare March 31, 2026 19:52

huntharo and others added 9 commits March 31, 2026 23:16

fix: correct parseRepoFlags arity and writeProgress signature in clus…

007b857

…ter-experiment CLI Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: add vectorlite migration release brainstorm

e1f8b76

docs: add persistent vectorlite migration plan

63f2ab4

feat: migrate core pipeline to persistent vectorlite vectors

2c0e2a8

feat: add vector migration operator controls

57b4918

fix: compact migrated vectors and tune cluster recall

2e334ce

fix: promote llm summary in tui detail pane

3a9db81

feat: default vector refreshes to original issue text

85909d2

huntharo changed the title ~~feat: add vectorlite cluster experiment~~ feat: add vectorlite for clustering / semantic search / LLM summaries option Apr 3, 2026

huntharo added 4 commits April 3, 2026 15:01

chore: remove .context experiment artifacts

013029d

fix: prune stale vectorlite state

1fc1566

ci: remove experimental cluster perf workflow

27672e7

docs: add one-time vector migration notes

3ffa673

huntharo changed the title ~~feat: add vectorlite for clustering / semantic search / LLM summaries option~~ feat: migrate ghcrawl to persistent vectorlite search and clustering Apr 3, 2026

huntharo merged commit 4f557cb into main Apr 3, 2026
8 checks passed

huntharo deleted the codex/vectorlite branch April 4, 2026 01:04

huntharo mentioned this pull request Apr 4, 2026

feat: default clustering to vectorlite #42

Closed

huntharo merged 34 commits into main from codex/vectorlite
