feat: migrate ghcrawl to persistent vectorlite search and clustering #7
Merged
Tested 11 system prompt variants with LLM-as-judge on 40 threads. Winner (v5-component-focused) scores 4.97/5 vs baseline 2.65/5 with 0% boilerplate and 100% clustering correctness.

- Replace sequential summarize loop with two-stage IterableMapper pipeline (concurrency 5 for API calls, concurrency 1 for DB writes)
- Add running cost estimate and ETA to summarize progress output
- Add op-run.mjs 'run' mode for arbitrary commands with 1Password env
- Add experiment scripts for prompt optimization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
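The two-stage shape described in this commit (several API calls in flight, database writes strictly one at a time) can be sketched roughly as follows. This is an illustration only: it does not use the IterableMapper library, and `runSummarizePipeline`, `Summarize`, and `Persist` are hypothetical names that do not come from the PR.

```typescript
// Hypothetical callback types; the real summarize and write functions
// are not shown in the commit message.
type Summarize<T, S> = (thread: T) => Promise<S>;
type Persist<T, S> = (thread: T, summary: S) => Promise<void>;

async function runSummarizePipeline<T, S>(
  threads: readonly T[],
  summarize: Summarize<T, S>,
  persist: Persist<T, S>,
  apiConcurrency = 5,
): Promise<void> {
  // Tail of the write queue. Each write is chained behind the previous
  // one, so at most one write runs at a time (stage 2, concurrency 1).
  let writeTail: Promise<void> = Promise.resolve();
  let nextIndex = 0;

  async function worker(): Promise<void> {
    while (nextIndex < threads.length) {
      const thread = threads[nextIndex++];
      // Stage 1: each worker has one API call in flight, so the number
      // of concurrent calls never exceeds the number of workers.
      const summary = await summarize(thread);
      const write = writeTail.then(() => persist(thread, summary));
      // Keep the queue usable after a failed write; the failure still
      // reaches the caller through `await write` below.
      writeTail = write.catch(() => undefined);
      // Waiting here acts as backpressure: a worker does not start its
      // next API call until its own write has finished.
      await write;
    }
  }

  const workerCount = Math.min(apiConcurrency, threads.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

Serializing writes through a single promise chain is one simple way to keep a single-writer database safe while the slower network stage still overlaps its requests.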
…riments

Add --source-kinds and --aggregation params to clusterExperiment, supporting:
- Source kind filtering (title, body, dedupe_summary, any combination)
- Aggregation methods: max, mean, weighted, min-of-2, boost

Collect per-source-kind scores then finalize with chosen method, replacing the hardcoded max() aggregation. This enables experiments comparing how different embedding signals and combination strategies affect cluster quality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
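A minimal sketch of the "collect per-source-kind scores, then finalize" step might look like the following. Only `max`, `mean`, and `weighted` are shown, because the commit message does not define what `min-of-2` and `boost` compute. The function name `finalizeScore` and the weights in `ASSUMED_WEIGHTS` are hypothetical and are not taken from the PR.

```typescript
type SourceKind = "title" | "body" | "dedupe_summary";
type Aggregation = "max" | "mean" | "weighted";
// Similarity scores for one candidate pair, keyed by embedding source kind.
type Scores = Partial<Record<SourceKind, number>>;

// Assumed weights, for illustration only.
const ASSUMED_WEIGHTS: Record<SourceKind, number> = {
  title: 0.2,
  body: 0.3,
  dedupe_summary: 0.5,
};

function finalizeScore(
  scores: Scores,
  method: Aggregation,
  weights: Record<SourceKind, number> = ASSUMED_WEIGHTS,
): number {
  // Keep only the source kinds that actually produced a score.
  const entries = (Object.entries(scores) as [SourceKind, number | undefined][])
    .filter((entry): entry is [SourceKind, number] => entry[1] !== undefined);
  if (entries.length === 0) return 0;

  switch (method) {
    case "max":
      // Equivalent to the previous hardcoded max() behavior.
      return Math.max(...entries.map(([, score]) => score));
    case "mean":
      return entries.reduce((sum, [, score]) => sum + score, 0) / entries.length;
    case "weighted": {
      // Weighted average over the source kinds that are present.
      const totalWeight = entries.reduce((sum, [kind]) => sum + weights[kind], 0);
      const weightedSum = entries.reduce(
        (sum, [kind, score]) => sum + weights[kind] * score,
        0,
      );
      return weightedSum / totalWeight;
    }
  }
}
```

Filtering by source kind then amounts to only populating the selected keys of `scores` before calling `finalizeScore`.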
Creates cluster-judge-experiment.mjs that runs clusterExperiment with configurable params, samples clusters (stratified by size), judges them with an LLM for coherence scoring, and evaluates singletons for false negatives. Includes batch runner for 15 planned experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
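Sampling clusters "stratified by size" could be sketched along these lines. The size bands in `sizeBand`, the `Cluster` shape, and the `perBand` parameter are assumptions made for illustration; the actual script may bucket and sample differently.

```typescript
interface Cluster {
  id: number;
  memberIds: number[];
}

// Hypothetical size bands; the real script's cut-offs are not stated.
function sizeBand(size: number): string {
  if (size <= 2) return "pair";
  if (size <= 5) return "small";
  if (size <= 20) return "medium";
  return "large";
}

function stratifiedSample(
  clusters: readonly Cluster[],
  perBand: number,
  random: () => number = Math.random,
): Cluster[] {
  // Group multi-member clusters by band; singletons are skipped here
  // because the commit describes evaluating them separately.
  const bands = new Map<string, Cluster[]>();
  for (const cluster of clusters) {
    if (cluster.memberIds.length < 2) continue;
    const band = sizeBand(cluster.memberIds.length);
    const bucket = bands.get(band) ?? [];
    bucket.push(cluster);
    bands.set(band, bucket);
  }

  const sample: Cluster[] = [];
  bands.forEach((bucket) => {
    // Partial Fisher-Yates shuffle: draw up to `perBand` clusters
    // from this band without replacement.
    const pool = [...bucket];
    const take = Math.min(perBand, pool.length);
    for (let i = 0; i < take; i++) {
      const j = i + Math.floor(random() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      sample.push(pool[i]);
    }
  });
  return sample;
}
```

Sampling per band keeps a handful of large clusters in the judged set even when small clusters dominate the overall count.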
- Add --backend flag to cluster-judge-experiment.mjs (supports exact/vectorlite)
- Increase max_output_tokens for cluster judge (500→800) and singleton judge (300→500)
- Truncate large clusters to 25 representative items for judge context limits
- Add .context/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI perf harness copies HEAD's perf.integration.ts into the base worktree, but clusterExperiment doesn't exist on main. Add a runtime guard that falls back to clusterRepository when the method is missing. Also regenerate lockfile after rebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
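A runtime guard of the kind described here might look roughly like this. The `ClusterService` interface, the `runClustering` helper, and both method signatures are assumptions for illustration; only the method names `clusterExperiment` and `clusterRepository` come from the commit message.

```typescript
// Assumed shape of the service under test. On the base branch the
// optional method is simply absent.
interface ClusterService {
  clusterRepository(repo: string): unknown;
  clusterExperiment?: (repo: string) => unknown;
}

function runClustering(service: ClusterService, repo: string): unknown {
  // Feature-detect at runtime instead of relying on the type, because the
  // same test file runs against builds with and without the new method.
  if (typeof service.clusterExperiment === "function") {
    return service.clusterExperiment(repo);
  }
  return service.clusterRepository(repo);
}
```

Checking with `typeof` rather than comparing versions lets the same perf test run unchanged in both worktrees.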
…ter-experiment CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the embedding clustering optimization experiments comparing exact kNN vs vectorlite HNSW, source kind selection, and aggregation strategies. Records why source-dedupe-only was chosen as the recommended configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- … `vectorlite` sidecar index for semantic search, neighbors, and clustering.
- … `refresh` to perform the one-time migration on the first rebuild.
- … `title_original`, while `title_summary` is an opt-in mode that summarizes before embedding.

What Changed
- `vectorlite` is now the default vector backend.
- `search`, `neighbors`, and `cluster` use that persistent vector store.
- … `ghcrawl refresh owner/repo` rebuild migrates the repo to the new vector path.
- `sync` does not perform that migration.
- … `document_embeddings`, compacts stored vectors, checkpoints the WAL, and vacuums the DB.
- … `gpt-5-mini`.
- … `title_original` to avoid surprise one-time LLM spend.
- `title_summary` is available as an opt-in mode through `ghcrawl configure`.
- … `openclaw/openclaw` is typically about `$15-$30` one time with `gpt-5-mini`, and later refreshes are usually much cheaper.
- … `LLM Summary` above the raw body instead of exposing the internal `dedupe_summary` label.

Clustering Results
On the live `openclaw/openclaw` dataset:
- … `title_original`, non-solo cluster membership is about `43.7%`
- … `title_summary`, non-solo cluster membership improves by about `50%` relative to `title_original`

That is why summaries are presented as an opt-in quality mode rather than the default.
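For readers unfamiliar with the metric, non-solo cluster membership can be read as the share of threads that land in a cluster with at least one other thread. The sketch below assumes that definition, which the PR does not spell out; `nonSoloMembership` is a hypothetical helper and not part of ghcrawl.

```typescript
// Fraction of all threads that belong to a cluster of size 2 or more.
// `clusterSizes` holds one entry per cluster, singletons included.
function nonSoloMembership(clusterSizes: readonly number[]): number {
  let total = 0;
  let nonSolo = 0;
  for (const size of clusterSizes) {
    total += size;
    if (size > 1) nonSolo += size;
  }
  return total === 0 ? 0 : nonSolo / total;
}

// Example: two singletons plus clusters of 3 and 5 threads.
// 8 of the 10 threads share a cluster with another thread.
const exampleShare = nonSoloMembership([1, 1, 3, 5]); // 0.8
```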
Size / Migration Results
On the migrated local `openclaw/openclaw` data:
- … `5.2 GB` to about `867 MB`
- … `~/.config/ghcrawl` dropped to about `965 MB`
- … `document_embeddings` for the repo were removed after migration

How I Tested
- `pnpm --filter @ghcrawl/api-core test`
- `pnpm --filter ghcrawl test`
- `pnpm typecheck`
- `pnpm --filter ghcrawl cli cluster openclaw/openclaw`
- `pnpm --filter ghcrawl cli cluster openclaw/openclaw --threshold 0.78`

I also verified the migration behavior against the real local `openclaw/openclaw` database, including: