feat: migrate ghcrawl to persistent vectorlite search and clustering #7

Merged
huntharo merged 34 commits into main from codex/vectorlite
Apr 3, 2026

huntharo (Contributor) commented Mar 12, 2026

Summary

  • I replaced the in-memory exact vector path with a persistent vectorlite sidecar index for semantic search, neighbors, and clustering.
  • I moved the active pipeline to one 1024-dimension vector per open thread and wired refresh to perform the one-time migration on the first rebuild.
  • I kept LLM summaries optional: the default embedding basis is now title_original, while title_summary is an opt-in mode that summarizes before embedding.
  • I added operator-facing controls and guidance for summary model, embedding basis, migration costs, and the TUI refresh flow.

What Changed

  • Persistent vector backend:
    • vectorlite is now the default vector backend.
    • active vectors are stored per thread and backed by the persistent sidecar store.
    • semantic search, neighbors, and cluster use that persistent vector store.
  • Migration behavior:
    • the first successful ghcrawl refresh owner/repo rebuild migrates the repo to the new vector path.
    • plain sync does not perform that migration.
    • after a successful rebuild, ghcrawl removes legacy document_embeddings, compacts stored vectors, checkpoints the WAL, and vacuums the DB.
  • Summaries and costs:
    • default summary model is gpt-5-mini.
    • default embedding basis is title_original to avoid surprise one-time LLM spend.
    • title_summary is available as an opt-in mode through ghcrawl configure.
    • README and TUI copy now explain that summarizing roughly 18k open issues/PRs in openclaw/openclaw is typically about $15-$30 one time with gpt-5-mini, and later refreshes are usually much cheaper.
  • TUI / UX:
    • the update popup now says whether LLM summaries are enabled or disabled and explains the tradeoff.
    • the detail pane now shows LLM Summary above the raw body instead of exposing the internal dedupe_summary label.
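
The cost guidance above implies a rough per-thread figure. A minimal sketch of that arithmetic, assuming cost scales linearly with the number of open threads (an assumption, not something measured in this PR); `estimateSummaryCost` is a hypothetical helper and is not part of ghcrawl:

```typescript
// Reference point quoted in this PR: ~18k open threads cost about $15-$30 one time.
const REFERENCE_THREADS = 18_000;
const REFERENCE_COST_USD = { low: 15, high: 30 };

// Hypothetical helper: linearly scale the quoted range to another repo size.
// Linear scaling is an assumption; real cost depends on thread length and model pricing.
function estimateSummaryCost(openThreads: number): { low: number; high: number } {
  const scale = openThreads / REFERENCE_THREADS;
  return {
    low: REFERENCE_COST_USD.low * scale,
    high: REFERENCE_COST_USD.high * scale,
  };
}

// Implied per-thread range: roughly $0.0008 to $0.0017.
const perThread = estimateSummaryCost(1);
// A repo one tenth the size would land near $1.50 to $3 under this assumption.
const small = estimateSummaryCost(1_800);
```

This only covers the one-time summarization pass; later refreshes are expected to be much cheaper, as noted above.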

Clustering Results

On the live openclaw/openclaw dataset:

  • with title_original, non-solo cluster membership is about 43.7%
  • with title_summary, non-solo cluster membership improves by about 50% relative to title_original

That is why summaries are presented as an opt-in quality mode rather than the default.
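
For readers unfamiliar with the metric, here is a small sketch of how non-solo cluster membership and a relative improvement can be computed. It assumes "non-solo" means a thread shares its cluster with at least one other thread; the function names and the toy data are illustrative and not taken from ghcrawl:

```typescript
// Fraction of threads that sit in a cluster of size >= 2.
function nonSoloMembership(clusterSizes: number[]): number {
  const total = clusterSizes.reduce((sum, size) => sum + size, 0);
  if (total === 0) return 0;
  const nonSolo = clusterSizes
    .filter((size) => size >= 2)
    .reduce((sum, size) => sum + size, 0);
  return nonSolo / total;
}

// Relative improvement of `after` over `before`: (after - before) / before.
function relativeImprovement(before: number, after: number): number {
  return (after - before) / before;
}

// Toy data, not from the openclaw/openclaw run: 10 threads, 5 of them in non-solo clusters.
const toySizes = [3, 2, 1, 1, 1, 1, 1];
const toyMembership = nonSoloMembership(toySizes); // 0.5
```

Read as a relative gain, 50% over 43.7% would put title_summary somewhere near 65% to 66% non-solo membership; the absolute figure is not stated here.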

Size / Migration Results

On the migrated local openclaw/openclaw data:

  • main DB dropped from about 5.2 GB to about 867 MB
  • total ~/.config/ghcrawl dropped to about 965 MB
  • legacy document_embeddings for the repo were removed after migration

How I Tested

  • pnpm --filter @ghcrawl/api-core test
  • pnpm --filter ghcrawl test
  • pnpm typecheck
  • pnpm --filter ghcrawl cli cluster openclaw/openclaw
  • pnpm --filter ghcrawl cli cluster openclaw/openclaw --threshold 0.78

I also verified the migration behavior against the real local openclaw/openclaw database, including:

  • one-time cleanup of legacy embeddings
  • DB compaction
  • updated cluster coverage

PR CI is also green on this branch.

github-actions (Bot) commented Mar 12, 2026

Cluster Performance

  • Backend: exact
  • Timing basis: cluster-only
  • Status: PASS
  • Fixture median (cluster-only): 339.5 ms (12 samples, 3 cluster rebuilds/sample)
  • Fixture median (total run): 394.0 ms
  • Fixture median load stage: 48.5 ms
  • Fixture median setup stage: 0.0 ms
  • Fixture median exact edge-build stage: 338.5 ms
  • Fixture median vector index-build stage: 0.0 ms
  • Fixture median vector query stage: 0.0 ms
  • Fixture median cluster-assembly stage: 1.0 ms
  • Median peak RSS: 223.3 MiB
  • Median peak heap used: 72.3 MiB
  • Fixture baseline: 450.6 ms
  • Fixture delta: -111.1 ms (-24.7%)
  • Projected openclaw/openclaw duration: 6m 20.7s
  • Projected openclaw/openclaw baseline: 8m 25.3s
  • Projected delta: -124586.8 ms (-24.7%)
  • Regression threshold: +50.0%
  • Fixture shape: 512 threads x 3 source kinds
  • Sample durations: 341.0 ms, 345.0 ms, 338.0 ms, 338.0 ms, 340.0 ms, 337.0 ms, 341.0 ms, 339.0 ms, 345.0 ms, 338.0 ms, 337.0 ms, 347.0 ms
  • Suggested baseline update: {"fixtureMedianMs":339.5,"projectedOpenclawMs":380713}
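
The headline numbers in this report can be reproduced from the sample durations. The sketch below recomputes the median and the delta against the baseline; the pass/fail rule is an assumption inferred from the "Regression threshold: +50.0%" line, not the harness's actual code:

```typescript
// Sample durations (ms) and baseline copied from the report above.
const samplesMs = [341, 345, 338, 338, 340, 337, 341, 339, 345, 338, 337, 347];
const baselineMs = 450.6;
const regressionThresholdPct = 50;

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

const fixtureMedianMs = median(samplesMs); // 339.5
const deltaPct = ((fixtureMedianMs - baselineMs) / baselineMs) * 100; // about -24.7

// Assumed rule: fail only when the run is slower than baseline by more than the threshold.
const status = deltaPct > regressionThresholdPct ? "FAIL" : "PASS";
```

The projected openclaw/openclaw figures are consistent with scaling the fixture numbers by a constant factor of roughly 1121x (380713 / 339.5), though the projection logic itself is not shown in this report.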

Run: workflow run for 3ffa673

huntharo added this to ghcrawl Mar 19, 2026
huntharo moved this to In Review in ghcrawl Mar 19, 2026
huntharo force-pushed the codex/vectorlite branch 2 times, most recently from ac9e920 to d6ccc80 March 29, 2026 00:54
huntharo and others added 21 commits March 31, 2026 15:50
Tested 11 system prompt variants with LLM-as-judge on 40 threads.
Winner (v5-component-focused) scores 4.97/5 vs baseline 2.65/5 with
0% boilerplate and 100% clustering correctness.

- Replace sequential summarize loop with two-stage IterableMapper pipeline
  (concurrency 5 for API calls, concurrency 1 for DB writes)
- Add running cost estimate and ETA to summarize progress output
- Add op-run.mjs 'run' mode for arbitrary commands with 1Password env
- Add experiment scripts for prompt optimization
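
The two-stage shape described in this commit (bounded concurrency for API calls, serialized DB writes) can be sketched generically. This is a hand-rolled stand-in and does not use or reproduce the IterableMapper API; `summarizeThread` and `saveSummary` are hypothetical placeholders:

```typescript
// Stage 1 runs up to `concurrency` workers at once; stage 2 (onResult) runs one at a time.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>,
  onResult: (result: R) => Promise<void>,
): Promise<void> {
  let next = 0;
  const writeErrors: unknown[] = [];
  // Chaining every write on one promise keeps writes strictly sequential.
  let writeChain: Promise<void> = Promise.resolve();

  async function runner(): Promise<void> {
    while (next < items.length) {
      const item = items[next++];
      const result = await worker(item);
      writeChain = writeChain
        .then(() => onResult(result))
        .catch((err) => {
          writeErrors.push(err); // remember the failure, surface it at the end
        });
    }
  }

  const runners = Array.from({ length: Math.min(concurrency, items.length) }, () => runner());
  await Promise.all(runners);
  await writeChain;
  if (writeErrors.length > 0) throw writeErrors[0];
}

// Hypothetical stand-ins for the summarize call and the DB write.
async function summarizeThread(id: number): Promise<string> {
  return `summary-${id}`;
}
const written: string[] = [];
async function saveSummary(summary: string): Promise<void> {
  written.push(summary);
}

const pipelineDone = mapWithConcurrency([1, 2, 3, 4, 5, 6, 7], 5, summarizeThread, saveSummary);
```

Keeping the write stage at concurrency 1 avoids contending writers on the SQLite database while still overlapping the slower API calls.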

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…riments

Add --source-kinds and --aggregation params to clusterExperiment, supporting:
- Source kind filtering (title, body, dedupe_summary, any combination)
- Aggregation methods: max, mean, weighted, min-of-2, boost

Collect per-source-kind scores then finalize with chosen method, replacing
the hardcoded max() aggregation. This enables experiments comparing how
different embedding signals and combination strategies affect cluster quality.
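
A rough sketch of the "collect per-source-kind scores, then finalize" idea follows. Only max, mean, and a caller-weighted mean are shown; the exact semantics of min-of-2 and boost are not spelled out in this commit message, so they are left out rather than guessed. The types and `finalizeScore` are illustrative names, not the real clusterExperiment code:

```typescript
type SourceKind = "title" | "body" | "dedupe_summary";
type Aggregation = "max" | "mean" | "weighted";
type Scores = Partial<Record<SourceKind, number>>;

// Combine the similarity scores collected for one candidate pair into a single score.
function finalizeScore(scores: Scores, method: Aggregation, weights: Scores = {}): number {
  const entries = Object.entries(scores) as [SourceKind, number][];
  if (entries.length === 0) return 0;
  switch (method) {
    case "max":
      return Math.max(...entries.map(([, score]) => score));
    case "mean":
      return entries.reduce((sum, [, score]) => sum + score, 0) / entries.length;
    case "weighted": {
      // Weights are caller-supplied (hypothetical); a missing weight defaults to 1.
      let weightedSum = 0;
      let totalWeight = 0;
      for (const [kind, score] of entries) {
        const weight = weights[kind] ?? 1;
        weightedSum += weight * score;
        totalWeight += weight;
      }
      return totalWeight === 0 ? 0 : weightedSum / totalWeight;
    }
  }
}

// Toy scores for one pair of threads (made up for illustration).
const pairScores: Scores = { title: 0.9, body: 0.6, dedupe_summary: 0.75 };
const byMax = finalizeScore(pairScores, "max"); // 0.9
const byMean = finalizeScore(pairScores, "mean"); // about 0.75
```

With max, a single strong signal (here the title) decides the edge; mean and weighted variants require agreement across source kinds, which is the kind of tradeoff these experiments are meant to compare.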

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Creates cluster-judge-experiment.mjs that runs clusterExperiment with
configurable params, samples clusters (stratified by size), judges them
with an LLM for coherence scoring, and evaluates singletons for false
negatives. Includes batch runner for 15 planned experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add --backend flag to cluster-judge-experiment.mjs (supports exact/vectorlite)
- Increase max_output_tokens for cluster judge (500→800) and singleton judge (300→500)
- Truncate large clusters to 25 representative items for judge context limits
- Add .context/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The CI perf harness copies HEAD's perf.integration.ts into the base
worktree, but clusterExperiment doesn't exist on main. Add a runtime
guard that falls back to clusterRepository when the method is missing.
Also regenerate lockfile after rebase.
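
The runtime guard amounts to a feature check before the call. A minimal sketch, assuming a simplified service shape; the `ClusterService` interface and return types here are hypothetical, and only the method names clusterExperiment and clusterRepository come from this commit:

```typescript
// Hypothetical minimal shape; the real service interface is not shown in this PR.
interface ClusterService {
  clusterRepository(repo: string): string;
  clusterExperiment?: (repo: string) => string;
}

// Prefer clusterExperiment when the checked-out code has it, else fall back.
function runCluster(service: ClusterService, repo: string): string {
  if (typeof service.clusterExperiment === "function") {
    return service.clusterExperiment(repo);
  }
  return service.clusterRepository(repo);
}

// Simulates the base worktree (main), where clusterExperiment does not exist.
const baseService: ClusterService = {
  clusterRepository: (repo) => `repository:${repo}`,
};
// Simulates HEAD, where both methods exist.
const headService: ClusterService = {
  clusterRepository: (repo) => `repository:${repo}`,
  clusterExperiment: (repo) => `experiment:${repo}`,
};
```

This lets the same perf.integration.ts run unchanged in both the HEAD and base worktrees.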

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

huntharo and others added 9 commits March 31, 2026 23:16
…ter-experiment CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Documents the embedding clustering optimization experiments comparing
exact kNN vs vectorlite HNSW, source kind selection, and aggregation
strategies. Records why source-dedupe-only was chosen as the recommended
configuration.
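
For context on the comparison, exact kNN means scoring the query against every stored vector, while HNSW answers the same question approximately through a graph index. The sketch below shows only the exact side with cosine similarity on toy 3-dimensional vectors (the pipeline in this PR uses 1024 dimensions); it is an illustration and not ghcrawl's implementation:

```typescript
interface StoredVector {
  id: string;
  vector: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score everything, sort descending, keep the first k.
function exactNeighbors(
  query: number[],
  stored: StoredVector[],
  k: number,
): { id: string; score: number }[] {
  return stored
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data for illustration only.
const toyVectors: StoredVector[] = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0.9, 0.1, 0] },
  { id: "c", vector: [0, 1, 0] },
];
const top2 = exactNeighbors([1, 0, 0], toyVectors, 2); // ids "a" then "b"
```

The exact path costs one comparison per stored vector per query, which is the work an HNSW index avoids at the price of approximate results.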

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
huntharo changed the title from "feat: add vectorlite cluster experiment" to "feat: add vectorlite for clustering / semantic search / LLM summaries option" Apr 3, 2026
huntharo changed the title from "feat: add vectorlite for clustering / semantic search / LLM summaries option" to "feat: migrate ghcrawl to persistent vectorlite search and clustering" Apr 3, 2026
huntharo merged commit 4f557cb into main Apr 3, 2026
8 checks passed
huntharo deleted the codex/vectorlite branch April 4, 2026 01:04