diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9773617..f16697c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -15,7 +15,10 @@ Useful local commands from the repo root: ```bash pnpm tui openclaw/openclaw pnpm sync openclaw/openclaw --limit 25 +pnpm seed-install openclaw/openclaw pnpm refresh openclaw/openclaw +pnpm seed-export openclaw/openclaw --output /tmp/ghcrawl-seeds +pnpm seed-audit --asset /tmp/ghcrawl-seeds/.seed.json.gz --repo openclaw/openclaw --sources title,body pnpm embed openclaw/openclaw pnpm cluster openclaw/openclaw pnpm search openclaw/openclaw --query "download stalls" @@ -44,3 +47,22 @@ This repo uses tag-driven releases from the GitHub Releases UI. - `ghcrawl` CI also runs a package smoke check on pull requests and `main` by packing the publishable packages, installing them into a temporary project, and executing the packaged CLI. + +## Seed Audit + +Before publishing a starter seed, audit the exported sidecar locally: + +```bash +pnpm seed-export openclaw/openclaw --output /tmp/ghcrawl-seeds +pnpm seed-audit --asset /tmp/ghcrawl-seeds/.seed.json.gz --repo openclaw/openclaw --sources title,body +``` + +The audit is a streaming validation pass over the compressed sidecar. It fails if: + +- the manifest points at the wrong repository +- any thread or edge row references a different repo +- unexpected keys appear in the payload +- source kinds drift outside the expected set +- manifest counts do not match the observed rows + +Use `--json` if you want a machine-readable report for release notes or future automation. diff --git a/README.md b/README.md index 19eb8c7..3bde896 100644 --- a/README.md +++ b/README.md @@ -23,19 +23,20 @@ If you are working from source or maintaining the repo, use [CONTRIBUTING.md](./ ## Requirements -Normal `ghcrawl` use needs both: +Normal `ghcrawl` use always needs: - a GitHub personal access token -- an OpenAI API key -GitHub is required to crawl issue and PR data. 
OpenAI is required for embeddings and the maintainer clustering and search workflow. If you already have a populated local DB you can still browse it without live keys, but a fresh `sync` + `embed` + `cluster` or `refresh` run needs both. +OpenAI is optional for first-run bootstrap and local browsing, but required for local summarize/embed refreshes. + +GitHub is required to crawl issue and PR data. OpenAI is required for fresh local embeddings and summaries. If you already have a populated local DB or install the OpenClaw starter sidecar, you can browse clusters without a live OpenAI key. ## Quick Start ```bash ghcrawl init ghcrawl doctor -ghcrawl refresh owner/repo +ghcrawl seed-install openclaw/openclaw ghcrawl tui owner/repo ``` @@ -43,18 +44,22 @@ ghcrawl tui owner/repo - save plaintext keys in `~/.config/ghcrawl/config.json` - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file +- and, if you already have a usable GitHub token plus a published seed asset is configured, offer starter data for `openclaw/openclaw` `ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI. +`ghcrawl seed-install openclaw/openclaw` is the low-cost bootstrap path. It runs a metadata-only sync, then imports published `title` and `body` embeddings plus derived similarity edges and rebuilds a local cluster run without needing an OpenAI key on day one. + ## Typical Commands ```bash ghcrawl doctor +ghcrawl seed-install openclaw/openclaw ghcrawl refresh owner/repo ghcrawl tui owner/repo ``` -`refresh`, `sync`, and `embed` call remote services and should be run intentionally. +`seed-install` and `sync` call GitHub. `refresh` and `embed` call GitHub plus OpenAI. Run them intentionally. `cluster` does not call remote services, but it is still time consuming. 
On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`. @@ -98,6 +103,24 @@ ghcrawl cluster owner/repo # rebuild local related-work clusters from the curre Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you. +## Starter Data For OpenClaw + +`ghcrawl` can import a published sidecar for `openclaw/openclaw`: + +```bash +ghcrawl seed-install openclaw/openclaw +``` + +That flow is intentionally narrow: + +- it currently only supports `openclaw/openclaw` +- it imports precomputed `title` and `body` embeddings plus derived similarity edges +- it rebuilds a normal local cluster run from those imported edges +- it does not overwrite your thread text, comments, summaries, or sync cursor state +- it does not make semantic query search or summary views fully local-feature-complete; for that you still need an OpenAI key and later local summarize/embed runs + +Use `--force` if you intentionally want to import starter data into an existing local repo. Use `--asset-url` to test a local or unpublished sidecar override. 
+ ## Init And Doctor First run: @@ -110,12 +133,12 @@ ghcrawl doctor `init` behavior: - prompts you to choose one of two secret-storage modes: - - `plaintext`: saves both keys to `~/.config/ghcrawl/config.json` + - `plaintext`: saves your GitHub key, and optionally your OpenAI key, to `~/.config/ghcrawl/config.json` - `1Password CLI`: stores only vault and item metadata and tells you how to run `ghcrawl` through `op` - if you choose plaintext storage, init warns that anyone who can read that file can use your keys and that resulting API charges are your responsibility - if you choose 1Password CLI mode, init tells you to create a Secure Note with concealed fields named: - `GITHUB_TOKEN` - - `OPENAI_API_KEY` + - `OPENAI_API_KEY` when you are ready to enable summarize/embed flows GitHub token guidance: @@ -131,7 +154,7 @@ GitHub token guidance: - config file presence and path - local DB path wiring - GitHub token presence, token-shape validation, and a live auth smoke check -- OpenAI key presence, key-shape validation, and a live auth smoke check +- OpenAI key presence, token-shape validation, and a live auth smoke check when configured - if init is configured for 1Password CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly ### 1Password CLI Example @@ -139,7 +162,7 @@ GitHub token guidance: If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly: - `GITHUB_TOKEN` -- `OPENAI_API_KEY` +- `OPENAI_API_KEY` (optional until you want local summarize/embed refreshes) Then add this wrapper to `~/.zshrc`: @@ -167,6 +190,7 @@ These commands are intended more for scripts, bots, and agent integrations than ghcrawl threads owner/repo --numbers 42,43,44 ghcrawl threads owner/repo --numbers 42,43,44 --include-closed ghcrawl author owner/repo --login lqquan +ghcrawl seed-install openclaw/openclaw ghcrawl close-thread owner/repo --number 42 ghcrawl close-cluster owner/repo --id 123 ghcrawl clusters 
owner/repo --min-size 10 --limit 20 @@ -235,6 +259,7 @@ The agent and build contract for this repo lives in [SPEC.md](./SPEC.md). ## Current Caveats - `serve` starts the local HTTP API only. The web UI is not built yet. +- `seed-install` currently supports only `openclaw/openclaw` - `sync` only pulls open issues and PRs. - a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo - `sync` is metadata-only by default @@ -242,6 +267,7 @@ The agent and build contract for this repo lives in [SPEC.md](./SPEC.md). - `embed` defaults to `text-embedding-3-large` - `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present - `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding +- starter sidecars currently import `title` and `body` embeddings plus derived similarity edges - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo` - `sync --limit ` is the best smoke-test path on a busy repository - `tui` remembers sort order and min cluster size per repository in the persisted config file diff --git a/apps/cli/README.md b/apps/cli/README.md index 4e3f6ce..6ff2e92 100644 --- a/apps/cli/README.md +++ b/apps/cli/README.md @@ -25,19 +25,20 @@ If you are working from source or maintaining the repo, use [CONTRIBUTING.md](ht ## Requirements -Normal `ghcrawl` use needs both: +Normal `ghcrawl` use always needs: - a GitHub personal access token -- an OpenAI API key -GitHub is required to crawl issue and PR data. OpenAI is required for embeddings and the maintainer clustering and search workflow. If you already have a populated local DB you can still browse it without live keys, but a fresh `sync` + `embed` + `cluster` or `refresh` run needs both. +OpenAI is optional for first-run bootstrap and local browsing, but required for local summarize/embed refreshes. + +GitHub is required to crawl issue and PR data. 
OpenAI is required for fresh local embeddings and summaries. If you already have a populated local DB or install the OpenClaw starter sidecar, you can browse clusters without a live OpenAI key. ## Quick Start ```bash ghcrawl init ghcrawl doctor -ghcrawl refresh owner/repo +ghcrawl seed-install openclaw/openclaw ghcrawl tui owner/repo ``` @@ -45,18 +46,22 @@ ghcrawl tui owner/repo - save plaintext keys in `~/.config/ghcrawl/config.json` - or guide you through a 1Password CLI (`op`) setup that keeps keys out of the config file +- and, if you already have a usable GitHub token plus a published seed asset is configured, offer starter data for `openclaw/openclaw` `ghcrawl refresh owner/repo` is the main pipeline command. It pulls the latest open GitHub issues and pull requests, refreshes embeddings for changed items, and rebuilds the clusters you browse in the TUI. +`ghcrawl seed-install openclaw/openclaw` is the low-cost bootstrap path. It runs a metadata-only sync, then imports published `title` and `body` embeddings plus derived similarity edges and rebuilds a local cluster run without needing an OpenAI key on day one. + ## Typical Commands ```bash ghcrawl doctor +ghcrawl seed-install openclaw/openclaw ghcrawl refresh owner/repo ghcrawl tui owner/repo ``` -`refresh`, `sync`, and `embed` call remote services and should be run intentionally. +`seed-install` and `sync` call GitHub. `refresh` and `embed` call GitHub plus OpenAI. Run them intentionally. `cluster` does not call remote services, but it is still time consuming. On a repo with roughly `12k` issues and PRs, a full cluster rebuild can take around `10 minutes`. 
@@ -70,6 +75,22 @@ ghcrawl refresh owner/repo ![ghcrawl refresh demo](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-refresh-demo.gif) +### TUI Screenshots + +| User open issue/PR list modal | Refresh modal | +| --- | --- | +| ![User open issue and PR list modal](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-tui-user-modal.png) | ![GitHub, embed, and cluster refresh modal](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-tui-refresh-modal.png) | +| Press `u` to open the current user's issue and PR list modal. | Press `g` to open the GitHub/embed/cluster refresh modal. | + +| Closed members in a cluster | Fully closed cluster | +| --- | --- | +| ![Closed cluster members grayed out](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-tui-closed-members.png) | ![Completely closed cluster grayed out](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-tui-closed-cluster.png) | +| Closed members stay visible in gray so overlap is still easy to inspect. | A cluster with no open members is grayed out as a whole until you hide closed items. | + +![Stacked TUI layout](https://raw.githubusercontent.com/pwrdrvr/ghcrawl/main/docs/images/ghcrawl-tui-layout-stacked.png) + +Press `l` on wide screens to toggle the stacked layout with the cluster list on the left and members/detail stacked on the right. + ## Controlling The Refresh Flow More Intentionally Most users should run `ghcrawl refresh owner/repo` and let it do the full pipeline in the right order. @@ -84,6 +105,24 @@ ghcrawl cluster owner/repo # rebuild local related-work clusters from the curre Run them in that order. `refresh` is just the safe convenience command that performs the same sequence for you. 
+## Starter Data For OpenClaw + +`ghcrawl` can import a published sidecar for `openclaw/openclaw`: + +```bash +ghcrawl seed-install openclaw/openclaw +``` + +That flow is intentionally narrow: + +- it currently only supports `openclaw/openclaw` +- it imports precomputed `title` and `body` embeddings plus derived similarity edges +- it rebuilds a normal local cluster run from those imported edges +- it does not overwrite your thread text, comments, summaries, or sync cursor state +- it does not make semantic query search or summary views fully local-feature-complete; for that you still need an OpenAI key and later local summarize/embed runs + +Use `--force` if you intentionally want to import starter data into an existing local repo. Use `--asset-url` to test a local or unpublished sidecar override. + ## Init And Doctor First run: @@ -96,12 +135,12 @@ ghcrawl doctor `init` behavior: - prompts you to choose one of two secret-storage modes: - - `plaintext`: saves both keys to `~/.config/ghcrawl/config.json` + - `plaintext`: saves your GitHub key, and optionally your OpenAI key, to `~/.config/ghcrawl/config.json` - `1Password CLI`: stores only vault and item metadata and tells you how to run `ghcrawl` through `op` - if you choose plaintext storage, init warns that anyone who can read that file can use your keys and that resulting API charges are your responsibility - if you choose 1Password CLI mode, init tells you to create a Secure Note with concealed fields named: - `GITHUB_TOKEN` - - `OPENAI_API_KEY` + - `OPENAI_API_KEY` when you are ready to enable summarize/embed flows GitHub token guidance: @@ -117,7 +156,7 @@ GitHub token guidance: - config file presence and path - local DB path wiring - GitHub token presence, token-shape validation, and a live auth smoke check -- OpenAI key presence, key-shape validation, and a live auth smoke check +- OpenAI key presence, token-shape validation, and a live auth smoke check when configured - if init is configured for 1Password 
CLI but you forgot to run through your `op` wrapper, doctor tells you that explicitly ### 1Password CLI Example @@ -125,7 +164,7 @@ GitHub token guidance: If you choose 1Password CLI mode, create a 1Password Secure Note with concealed fields named exactly: - `GITHUB_TOKEN` -- `OPENAI_API_KEY` +- `OPENAI_API_KEY` (optional until you want local summarize/embed refreshes) Then add this wrapper to `~/.zshrc`: @@ -151,10 +190,15 @@ These commands are intended more for scripts, bots, and agent integrations than ```bash ghcrawl threads owner/repo --numbers 42,43,44 +ghcrawl threads owner/repo --numbers 42,43,44 --include-closed ghcrawl author owner/repo --login lqquan -ghcrawl cluster owner/repo +ghcrawl seed-install openclaw/openclaw +ghcrawl close-thread owner/repo --number 42 +ghcrawl close-cluster owner/repo --id 123 ghcrawl clusters owner/repo --min-size 10 --limit 20 +ghcrawl clusters owner/repo --min-size 10 --limit 20 --include-closed ghcrawl cluster-detail owner/repo --id 123 +ghcrawl cluster-detail owner/repo --id 123 --include-closed ghcrawl search owner/repo --query "download stalls" ``` @@ -162,6 +206,12 @@ Use `threads --numbers ...` when you want several specific issue or PR records i Use `author --login ...` when you want all currently open issue/PR records from one user plus the strongest stored same-author similarity match for each item. +By default, JSON list commands filter out locally closed issues/PRs and completely closed clusters. Use `--include-closed` when you need to inspect those records too. + +Use `close-thread` when you know a local issue/PR should be treated as closed before the next GitHub sync catches up. If that was the last open item in its cluster, `ghcrawl` automatically marks the cluster closed too. + +Use `close-cluster` when you want to locally suppress a whole cluster from default JSON exploration without waiting for a rebuild. + ## Cost To Operate The main variable cost is OpenAI embeddings. 
Current model pricing is published by OpenAI here: [OpenAI API pricing](https://developers.openai.com/api/docs/pricing#embeddings). @@ -190,6 +240,7 @@ The skill is built around the stable JSON CLI surface and is intentionally conse - default mode assumes no valid API keys and stays read-only - API-backed operations only become available after `ghcrawl doctor --json` shows healthy auth - even then, `refresh`, `sync`, `embed`, and `cluster` should only run when the user explicitly asks for them +- JSON list commands hide locally closed issues/PRs and closed clusters by default unless `--include-closed` is passed ```bash ghcrawl doctor --json @@ -210,6 +261,7 @@ The agent and build contract for this repo lives in [SPEC.md](https://github.com ## Current Caveats - `serve` starts the local HTTP API only. The web UI is not built yet. +- `seed-install` currently supports only `openclaw/openclaw` - `sync` only pulls open issues and PRs. - a plain `sync owner/repo` is incremental by default after the first full completed open scan for that repo - `sync` is metadata-only by default @@ -217,9 +269,12 @@ The agent and build contract for this repo lives in [SPEC.md](https://github.com - `embed` defaults to `text-embedding-3-large` - `embed` generates separate vectors for `title` and `body`, and also uses stored summary text when present - `embed` stores an input hash per source kind and will not resubmit unchanged text for re-embedding +- starter sidecars currently import `title` and `body` embeddings plus derived similarity edges - `sync --since` accepts ISO timestamps and relative durations like `15m`, `2h`, `7d`, and `1mo` - `sync --limit ` is the best smoke-test path on a busy repository - `tui` remembers sort order and min cluster size per repository in the persisted config file +- the TUI shows locally closed threads and clusters in gray; press `x` to hide or show them +- on wide screens, press `l` to toggle between three columns and a wider cluster list with 
members/detail stacked on the right - if you add a brand-new repo from the TUI with `p`, ghcrawl runs sync -> embed -> cluster and opens that repo with min cluster size `1+` ## Responsibility Attestation diff --git a/apps/cli/src/init-wizard.test.ts b/apps/cli/src/init-wizard.test.ts index e13ce1e..ae70a0f 100644 --- a/apps/cli/src/init-wizard.test.ts +++ b/apps/cli/src/init-wizard.test.ts @@ -56,6 +56,26 @@ test('runInitWizard skips prompting when config already has both API keys', asyn assert.equal(fs.existsSync(result.configPath), true); }); +test('runInitWizard skips prompting when config already has a GitHub token only', async () => { + const home = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-init-test-')); + const env = makeTestEnv({ HOME: home }); + writePersistedConfig( + { + githubToken: 'ghp_testtoken1234567890', + }, + { env }, + ); + + const result = await runInitWizard({ + env, + prompter: makePrompter(), + isInteractive: true, + }); + + assert.equal(result.changed, false); + assert.equal(fs.existsSync(result.configPath), true); +}); + test('runInitWizard prompts for missing keys and writes the config file', async () => { const home = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-init-test-')); const env = makeTestEnv({ HOME: home }); @@ -104,6 +124,26 @@ test('runInitWizard can persist detected environment keys without prompting for assert.equal(persisted.data.openaiApiKey, 'sk-proj-envkey1234567890'); }); +test('runInitWizard can save a GitHub-only plaintext config', async () => { + const home = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-init-test-')); + const env = makeTestEnv({ HOME: home }); + + const result = await runInitWizard({ + env, + prompter: makePrompter({ + select: async () => 'plaintext', + confirm: async ({ message }) => !message.includes('OpenAI API key now'), + password: async () => 'ghp_testtoken1234567890', + }), + isInteractive: true, + }); + + assert.equal(result.changed, true); + const persisted = readPersistedConfig({ 
env }); + assert.equal(persisted.data.githubToken, 'ghp_testtoken1234567890'); + assert.equal(persisted.data.openaiApiKey, undefined); +}); + test('runInitWizard can configure 1Password CLI metadata without persisting plaintext keys', async () => { const home = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-init-test-')); const env = makeTestEnv({ HOME: home }); diff --git a/apps/cli/src/init-wizard.ts b/apps/cli/src/init-wizard.ts index 43b7187..cc6b616 100644 --- a/apps/cli/src/init-wizard.ts +++ b/apps/cli/src/init-wizard.ts @@ -91,8 +91,9 @@ export async function runInitWizard( const stored = readPersistedConfig({ cwd, env }); const hasStoredGithub = Boolean(stored.data.githubToken); + const hasStoredGithubProvider = stored.data.secretProvider === 'op' && Boolean(stored.data.opVaultName && stored.data.opItemName); const hasStoredOpenAi = Boolean(stored.data.openaiApiKey); - if (!reconfigure && hasStoredGithub && hasStoredOpenAi) { + if (!reconfigure && (hasStoredGithub || hasStoredGithubProvider)) { return { configPath: current.configPath, changed: false }; } @@ -116,7 +117,7 @@ export async function runInitWizard( '- For private repos with a classic PAT, repo is the safe fallback', '', 'OpenAI key recommendation:', - '- Standard API key for the project/account you want to bill', + '- Optional for first-run bootstrap; required later for summarize/embed refreshes', ].join('\n'), 'Setup', ); @@ -193,7 +194,19 @@ export async function runInitWizard( changed = true; } - if (reconfigure || !hasStoredOpenAi) { + const shouldConfigureOpenAi = + reconfigure || hasStoredOpenAi + ? true + : await prompter.confirm({ + message: 'Would you like to configure an OpenAI API key now? 
You can skip this and add it later.', + initialValue: false, + }); + if (isCancel(shouldConfigureOpenAi)) { + prompter.cancel('init cancelled'); + throw new Error('init cancelled'); + } + + if (shouldConfigureOpenAi && (reconfigure || !hasStoredOpenAi)) { const detectedOpenAi = env.OPENAI_API_KEY; let openaiApiKey = stored.data.openaiApiKey; let usedDetectedOpenAi = false; @@ -228,6 +241,9 @@ export async function runInitWizard( } nextConfig.openaiApiKey = openaiApiKey; changed = true; + } else if (reconfigure && !shouldConfigureOpenAi) { + nextConfig.openaiApiKey = undefined; + changed = true; } nextConfig.secretProvider = 'plaintext'; @@ -271,11 +287,11 @@ export async function runInitWizard( '', 'Add concealed fields named exactly:', '- GITHUB_TOKEN', - '- OPENAI_API_KEY', + '- OPENAI_API_KEY (optional for later summarize/embed work)', '', 'Secret refs:', `- ${opReferenceBase}/GITHUB_TOKEN`, - `- ${opReferenceBase}/OPENAI_API_KEY`, + `- ${opReferenceBase}/OPENAI_API_KEY (optional)`, ].join('\n'), '1Password Setup', ); diff --git a/apps/cli/src/main.test.ts b/apps/cli/src/main.test.ts index 63fbc39..d8693d1 100644 --- a/apps/cli/src/main.test.ts +++ b/apps/cli/src/main.test.ts @@ -19,6 +19,7 @@ test('run prints usage with no command', async () => { assert.match(output, /ghcrawl /); assert.match(output, /\n version\n/); assert.match(output, /refresh /); + assert.match(output, /seed-install /); assert.match(output, /threads /); assert.match(output, /author --login /); assert.match(output, /close-thread --number /); @@ -42,6 +43,7 @@ test('run prints usage for help flag', async () => { assert.match(output, /ghcrawl /); assert.match(output, /\n version\n/); assert.match(output, /refresh /); + assert.match(output, /seed-install /); assert.match(output, /threads /); assert.match(output, /author --login /); assert.match(output, /close-thread --number /); @@ -62,6 +64,7 @@ test('run prints advanced commands when dev mode is enabled', async () => { assert.match(output, 
/Advanced Commands:/); assert.match(output, /summarize /); assert.match(output, /purge-comments /); + assert.match(output, /seed-export --output /); }); test('run prints version for version command', async () => { @@ -162,6 +165,15 @@ test('parseRepoFlags accepts include-closed boolean flag', () => { assert.equal(parsed.values['include-closed'], true); }); +test('parseRepoFlags accepts seed-install flags', () => { + const parsed = parseRepoFlags(['openclaw/openclaw', '--force', '--asset-url', '/tmp/openclaw.seed.json.gz', '--no-sync']); + assert.equal(parsed.owner, 'openclaw'); + assert.equal(parsed.repo, 'openclaw'); + assert.equal(parsed.values.force, true); + assert.equal(parsed.values['asset-url'], '/tmp/openclaw.seed.json.gz'); + assert.equal(parsed.values['no-sync'], true); +}); + test('resolveSinceValue keeps ISO timestamps', () => { assert.equal(resolveSinceValue('2026-03-01T00:00:00Z'), '2026-03-01T00:00:00.000Z'); }); diff --git a/apps/cli/src/main.ts b/apps/cli/src/main.ts index 7d88b59..1d36705 100644 --- a/apps/cli/src/main.ts +++ b/apps/cli/src/main.ts @@ -5,7 +5,8 @@ import path from 'node:path'; import { parseArgs } from 'node:util'; import { fileURLToPath } from 'node:url'; -import { createApiServer, GHCrawlService } from '@ghcrawl/api-core'; +import { confirm, isCancel, note } from '@clack/prompts'; +import { createApiServer, getKnownSeedManifestEntry, GHCrawlService } from '@ghcrawl/api-core'; import { runInitWizard } from './init-wizard.js'; import { startTui } from './tui/app.js'; @@ -15,12 +16,14 @@ type CommandName = | 'version' | 'sync' | 'refresh' + | 'seed-install' | 'threads' | 'author' | 'close-thread' | 'close-cluster' | 'summarize' | 'purge-comments' + | 'seed-export' | 'embed' | 'cluster' | 'clusters' @@ -44,6 +47,7 @@ function usage(devMode = false): string { ' version', ' sync [--since ] [--limit ] [--include-comments] [--full-reconcile]', ' refresh [--no-sync] [--no-embed] [--no-cluster]', + ' seed-install [--force] [--asset-url ] 
[--no-sync]', ' threads [--numbers ] [--kind issue|pull_request] [--include-closed]', ' author --login [--include-closed]', ' close-thread --number ', @@ -63,7 +67,13 @@ function usage(devMode = false): string { ' clusters reads the existing local cluster data and is intended to be fast.', ]; if (devMode) { - lines.push('', 'Advanced Commands:', ' summarize [--number ] [--include-comments]', ' purge-comments [--number ]'); + lines.push( + '', + 'Advanced Commands:', + ' summarize [--number ] [--include-comments]', + ' purge-comments [--number ]', + ' seed-export --output ', + ); } return `${lines.join('\n')}\n`; } @@ -119,6 +129,9 @@ export function parseRepoFlags(args: string[]): { owner: string; repo: string; v 'no-sync': { type: 'boolean' }, 'no-embed': { type: 'boolean' }, 'no-cluster': { type: 'boolean' }, + force: { type: 'boolean' }, + 'asset-url': { type: 'string' }, + output: { type: 'string' }, }, }); @@ -286,7 +299,9 @@ export async function run(argv: string[], stdout: NodeJS.WritableStream = proces }, }); await runInitWizard({ reconfigure: parsed.values.reconfigure === true }); - stdout.write(`${JSON.stringify(getService().init(), null, 2)}\n`); + const serviceForInit = getService(); + await maybePromptStarterSeedInstall(serviceForInit); + stdout.write(`${JSON.stringify(serviceForInit.init(), null, 2)}\n`); return; } case 'doctor': { @@ -335,6 +350,20 @@ export async function run(argv: string[], stdout: NodeJS.WritableStream = proces stdout.write(`${JSON.stringify(result, null, 2)}\n`); return; } + case 'seed-install': { + const { owner, repo, values } = parseRepoFlags(rest); + const result = await getService().seedInstallRepository({ + owner, + repo, + cliVersion: CLI_VERSION, + force: values.force === true, + assetUrl: typeof values['asset-url'] === 'string' ? 
values['asset-url'] : undefined, + skipSync: values['no-sync'] === true, + onProgress: writeProgress, + }); + stdout.write(`${JSON.stringify(result, null, 2)}\n`); + return; + } case 'threads': { const { owner, repo, values } = parseRepoFlags(rest); const kind = values.kind === 'issue' || values.kind === 'pull_request' ? values.kind : undefined; @@ -411,6 +440,21 @@ export async function run(argv: string[], stdout: NodeJS.WritableStream = proces stdout.write(`${JSON.stringify(result, null, 2)}\n`); return; } + case 'seed-export': { + const { owner, repo, values } = parseRepoFlags(rest); + if (typeof values.output !== 'string' || values.output.trim().length === 0) { + throw new Error('Missing --output'); + } + const result = await getService().exportSeedSidecar({ + owner, + repo, + cliVersion: CLI_VERSION, + outputDir: values.output, + onProgress: writeProgress, + }); + stdout.write(`${JSON.stringify(result, null, 2)}\n`); + return; + } case 'embed': { const { owner, repo, values } = parseRepoFlags(rest); const result = await getService().embedRepository({ @@ -555,3 +599,35 @@ function loadCliVersion(): string { const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf8')) as { version?: unknown }; return typeof packageJson.version === 'string' ? packageJson.version : '0.0.0'; } + +async function maybePromptStarterSeedInstall(service: GHCrawlService): Promise<void> { + if (!process.stdin.isTTY || process.stdout.isTTY !== true) { + return; + } + if (!service.config.githubToken) { + return; + } + const manifest = getKnownSeedManifestEntry('openclaw', 'openclaw'); + if (!manifest || manifest.downloadUrl.includes('example.invalid')) { + return; + } + + const shouldInstall = await confirm({ + message: 'Download starter data for openclaw/openclaw now?
This runs a metadata sync, then imports the published title/body embeddings and clusters.', + initialValue: false, + }); + if (isCancel(shouldInstall) || shouldInstall !== true) { + return; + } + + await note( + 'Starter data is only intended for openclaw/openclaw and only imports embeddings plus derived cluster data. It does not replace your thread text or sync cursors.', + 'Starter Data', + ); + await service.seedInstallRepository({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: CLI_VERSION, + onProgress: writeProgress, + }); +} diff --git a/package.json b/package.json index 1c63108..3a965a8 100644 --- a/package.json +++ b/package.json @@ -20,6 +20,9 @@ "op:tui": "node ./scripts/op-run.mjs exec -- tui", "sync": "node ./apps/cli/bin/ghcrawl.js sync", "refresh": "node ./apps/cli/bin/ghcrawl.js refresh", + "seed-install": "node ./apps/cli/bin/ghcrawl.js seed-install", + "seed-export": "node ./apps/cli/bin/ghcrawl.js --dev seed-export", + "seed-audit": "node ./scripts/seed-audit.mjs", "embed": "node ./apps/cli/bin/ghcrawl.js embed", "cluster": "node ./apps/cli/bin/ghcrawl.js cluster", "clusters": "node ./apps/cli/bin/ghcrawl.js clusters", diff --git a/packages/api-core/src/index.ts b/packages/api-core/src/index.ts index 15471e0..115fc94 100644 --- a/packages/api-core/src/index.ts +++ b/packages/api-core/src/index.ts @@ -3,4 +3,5 @@ export * from './config.js'; export * from './documents/normalize.js'; export * from './search/exact.js'; export * from './cluster/build.js'; +export * from './seed/sidecar.js'; export * from './service.js'; diff --git a/packages/api-core/src/seed/sidecar.ts b/packages/api-core/src/seed/sidecar.ts new file mode 100644 index 0000000..6541464 --- /dev/null +++ b/packages/api-core/src/seed/sidecar.ts @@ -0,0 +1,243 @@ +import crypto from 'node:crypto'; +import { once } from 'node:events'; +import fs from 'node:fs'; +import { Readable } from 'node:stream'; +import { finished } from 'node:stream/promises'; +import { createInterface } 
from 'node:readline'; +import { createGunzip, createGzip, gzipSync } from 'node:zlib'; + +import { z } from 'zod'; + +const seedThreadKindSchema = z.enum(['issue', 'pull_request']); +export const seedEmbeddingSourceKindSchema = z.enum(['title', 'body', 'dedupe_summary']); + +export const seedSidecarSchemaVersion = 1; +export const seedSidecarFormat = 'ghcrawl-seed-sidecar-gzip-v1'; + +const semverSchema = z.string().regex(/^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/); +const compatibleRangeSchema = z.string().regex(/^>=\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)? <\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/); + +export const seedThreadIdentitySchema = z.object({ + owner: z.string().min(1), + repo: z.string().min(1), + kind: seedThreadKindSchema, + number: z.number().int().positive(), + githubId: z.string().min(1), +}); + +export const seedThreadSidecarRowSchema = seedThreadIdentitySchema.extend({ + threadContentHash: z.string().min(1), + sourceKind: seedEmbeddingSourceKindSchema, + embeddingModel: z.string().min(1), + dimensions: z.number().int().positive(), + embedding: z.array(z.number()), +}); + +export const seedEdgeSidecarRowSchema = z.object({ + left: seedThreadIdentitySchema, + right: seedThreadIdentitySchema, + score: z.number(), + sources: z.array(seedEmbeddingSourceKindSchema).default(['title']), +}); + +export const seedSidecarManifestSchema = z.object({ + schemaVersion: z.literal(seedSidecarSchemaVersion), + format: z.literal(seedSidecarFormat), + snapshotId: z.string().min(1), + createdAt: z.string().datetime(), + compatibleCli: compatibleRangeSchema, + owner: z.string().min(1), + repo: z.string().min(1), + fullName: z.string().min(1), + embedModel: z.string().min(1), + sourceKinds: z.array(seedEmbeddingSourceKindSchema).min(1), + cluster: z.object({ + k: z.number().int().positive(), + minScore: z.number(), + }), + threadCount: z.number().int().nonnegative(), + embeddingCount: z.number().int().nonnegative(), + edgeCount: z.number().int().nonnegative(), +}); + +export 
const seedSidecarArchiveSchema = z.object({ + manifest: seedSidecarManifestSchema, + threads: z.array(seedThreadSidecarRowSchema), + edges: z.array(seedEdgeSidecarRowSchema), +}); + +export const knownSeedManifestEntrySchema = seedSidecarManifestSchema.extend({ + downloadUrl: z.string().min(1), + sha256: z.string().regex(/^[a-f0-9]{64}$/i), +}); + +export type SeedThreadIdentity = z.infer<typeof seedThreadIdentitySchema>; +export type SeedThreadSidecarRow = z.infer<typeof seedThreadSidecarRowSchema>; +export type SeedEdgeSidecarRow = z.infer<typeof seedEdgeSidecarRowSchema>; +export type SeedSidecarManifest = z.infer<typeof seedSidecarManifestSchema>; +export type SeedSidecarArchive = z.infer<typeof seedSidecarArchiveSchema>; +export type KnownSeedManifestEntry = z.infer<typeof knownSeedManifestEntrySchema>; +export type SeedSidecarArchiveWriterInput = { + manifest: SeedSidecarManifest; + threads: Iterable<SeedThreadSidecarRow>; + edges: Iterable<SeedEdgeSidecarRow>; +}; + +const knownSeedManifest: Record<string, KnownSeedManifestEntry> = { + 'openclaw/openclaw': { + schemaVersion: seedSidecarSchemaVersion, + format: seedSidecarFormat, + snapshotId: 'replace-with-real-openclaw-seed', + createdAt: '2026-03-12T00:00:00.000Z', + compatibleCli: '>=0.0.0 <1.0.0', + owner: 'openclaw', + repo: 'openclaw', + fullName: 'openclaw/openclaw', + embedModel: 'text-embedding-3-large', + sourceKinds: ['title', 'body'], + cluster: { + k: 6, + minScore: 0.82, + }, + threadCount: 0, + embeddingCount: 0, + edgeCount: 0, + downloadUrl: 'https://example.invalid/replace-with-real-openclaw-seed.seed.json.gz', + sha256: '0000000000000000000000000000000000000000000000000000000000000000', + }, +}; + +export function getKnownSeedManifestEntry(owner: string, repo: string): KnownSeedManifestEntry | null { + const entry = knownSeedManifest[`${owner}/${repo}`]; + return entry ? 
knownSeedManifestEntrySchema.parse(entry) : null; +} + +export function serializeSeedSidecarArchive(value: SeedSidecarArchive): Buffer { + const archive = seedSidecarArchiveSchema.parse(value); + const lines = [ + JSON.stringify({ kind: 'manifest', payload: archive.manifest }), + ...archive.threads.map((row) => JSON.stringify({ kind: 'thread', payload: row })), + ...archive.edges.map((row) => JSON.stringify({ kind: 'edge', payload: row })), + ]; + return gzipSync(Buffer.from(lines.join('\n'), 'utf8')); +} + +export async function parseSeedSidecarArchive(buffer: Buffer): Promise<SeedSidecarArchive> { + const archive: Partial<SeedSidecarArchive> & { + threads: SeedThreadSidecarRow[]; + edges: SeedEdgeSidecarRow[]; + } = { + threads: [], + edges: [], + }; + const input = Readable.from(buffer).pipe(createGunzip()); + const reader = createInterface({ input, crlfDelay: Infinity }); + + for await (const line of reader) { + const trimmed = line.trim(); + if (!trimmed) continue; + const record = JSON.parse(trimmed) as { kind?: unknown; payload?: unknown }; + if (record.kind === 'manifest') { + archive.manifest = seedSidecarManifestSchema.parse(record.payload); + continue; + } + if (record.kind === 'thread') { + archive.threads.push(seedThreadSidecarRowSchema.parse(record.payload)); + continue; + } + if (record.kind === 'edge') { + archive.edges.push(seedEdgeSidecarRowSchema.parse(record.payload)); + continue; + } + throw new Error(`Unknown seed sidecar record kind: ${String(record.kind)}`); + } + + return seedSidecarArchiveSchema.parse(archive); +} + +export async function writeSeedSidecarArchive( + outputPath: string, + value: SeedSidecarArchive | SeedSidecarArchiveWriterInput, +): Promise<{ sha256: string }> { + const manifest = seedSidecarManifestSchema.parse(value.manifest); + const gzip = createGzip(); + const output = fs.createWriteStream(outputPath); + const hash = crypto.createHash('sha256'); + gzip.on('data', (chunk) => hash.update(chunk)); + gzip.pipe(output); + + await writeGzipLine(gzip, 
JSON.stringify({ kind: 'manifest', payload: manifest })); + for (const row of value.threads) { + await writeGzipLine(gzip, JSON.stringify({ kind: 'thread', payload: seedThreadSidecarRowSchema.parse(row) })); + } + for (const row of value.edges) { + await writeGzipLine(gzip, JSON.stringify({ kind: 'edge', payload: seedEdgeSidecarRowSchema.parse(row) })); + } + gzip.end(); + + await finished(output); + return { sha256: hash.digest('hex') }; +} + +async function writeGzipLine(stream: ReturnType<typeof createGzip>, line: string): Promise<void> { + if (stream.write(`${line}\n`)) { + return; + } + await once(stream, 'drain'); +} + +export function sha256Hex(buffer: Uint8Array): string { + return crypto.createHash('sha256').update(buffer).digest('hex'); +} + +export async function readSeedAsset(assetUrl: string): Promise<Buffer> { + if (/^https?:\/\//i.test(assetUrl)) { + const response = await fetch(assetUrl); + if (!response.ok) { + throw new Error(`Failed to download seed asset: ${response.status} ${response.statusText}`); + } + return Buffer.from(await response.arrayBuffer()); + } + + if (assetUrl.startsWith('file://')) { + return fs.readFileSync(new URL(assetUrl)); + } + + return fs.readFileSync(assetUrl); +} + +export function isCliVersionCompatible(version: string, range: string): boolean { + const parsedVersion = parseSemver(semverSchema.parse(version)); + const [minimum, maximum] = compatibleRangeSchema + .parse(range) + .split(' ') + .map((part) => parseSemver(part.replace(/^(>=|<)/, ''))); + return compareSemver(parsedVersion, minimum) >= 0 && compareSemver(parsedVersion, maximum) < 0; +} + +function parseSemver(value: string): [number, number, number, string] { + const match = semverSchema.parse(value).match(/^(\d+)\.(\d+)\.(\d+)(?:([-+].+))?$/); + if (!match) { + throw new Error(`Invalid semver: ${value}`); + } + return [Number(match[1]), Number(match[2]), Number(match[3]), match[4] ?? 
'']; +} + +function compareSemver(left: [number, number, number, string], right: [number, number, number, string]): number { + const majorDelta = left[0] - right[0]; + if (majorDelta !== 0) return majorDelta; + const minorDelta = left[1] - right[1]; + if (minorDelta !== 0) return minorDelta; + const patchDelta = left[2] - right[2]; + if (patchDelta !== 0) return patchDelta; + if (left[3] === right[3]) { + return 0; + } + if (!left[3]) { + return 1; + } + if (!right[3]) { + return -1; + } + return left[3].localeCompare(right[3]); +} diff --git a/packages/api-core/src/service.test.ts b/packages/api-core/src/service.test.ts index 4087df8..888ebb8 100644 --- a/packages/api-core/src/service.test.ts +++ b/packages/api-core/src/service.test.ts @@ -1,6 +1,10 @@ import test from 'node:test'; import assert from 'node:assert/strict'; +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { parseSeedSidecarArchive } from './seed/sidecar.js'; import { GHCrawlService } from './service.js'; function makeTestConfig(overrides: Partial = {}): GHCrawlService['config'] { @@ -38,6 +42,65 @@ function makeTestService( }); } +function makeSeedGitHubStub(): NonNullable { + return { + checkAuth: async () => undefined, + getRepo: async () => ({ id: 1, full_name: 'openclaw/openclaw' }), + listRepositoryIssues: async () => [ + { + id: 100, + number: 42, + state: 'open', + title: 'Downloader hangs', + body: 'The transfer never finishes.', + html_url: 'https://github.com/openclaw/openclaw/issues/42', + labels: [{ name: 'bug' }], + assignees: [], + user: { login: 'alice', type: 'User' }, + }, + { + id: 101, + number: 43, + state: 'open', + title: 'Downloader PR', + body: 'Implements a fix.', + html_url: 'https://github.com/openclaw/openclaw/pull/43', + labels: [{ name: 'bug' }], + assignees: [], + pull_request: { url: 'https://api.github.com/repos/openclaw/openclaw/pulls/43' }, + user: { login: 'bob', type: 'User' }, + }, + ], + getIssue: async (_owner, 
_repo, number) => ({ + id: 100, + number, + state: 'open', + title: 'Downloader hangs', + body: 'The transfer never finishes.', + html_url: `https://github.com/openclaw/openclaw/issues/${number}`, + labels: [{ name: 'bug' }], + assignees: [], + user: { login: 'alice', type: 'User' }, + updated_at: '2026-03-09T00:00:00Z', + }), + getPull: async (_owner, _repo, number) => ({ + id: 101, + number, + state: 'open', + title: 'Downloader PR', + body: 'Implements a fix.', + html_url: `https://github.com/openclaw/openclaw/pull/${number}`, + labels: [{ name: 'bug' }], + assignees: [], + user: { login: 'bob', type: 'User' }, + updated_at: '2026-03-09T00:00:00Z', + }), + listIssueComments: async () => [], + listPullReviews: async () => [], + listPullReviewComments: async () => [], + }; +} + test('doctor reports config path and successful auth smoke checks', async () => { let githubChecked = 0; let openAiChecked = 0; @@ -138,6 +201,257 @@ test('doctor explains when secrets are expected from 1Password CLI env injection } }); +test('seed export and install round-trip imports embeddings and rebuilds a cluster run without mutating sync state', async () => { + const github = makeSeedGitHubStub(); + const sourceService = makeTestService(github); + const targetService = makeTestService(github); + const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-seed-export-')); + + try { + await sourceService.syncRepository({ owner: 'openclaw', repo: 'openclaw' }); + const sourceThreads = sourceService.db + .prepare('select id, number from threads order by number asc') + .all() as Array<{ id: number; number: number }>; + const now = '2026-03-09T12:00:00Z'; + const insertEmbedding = sourceService.db.prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertEmbedding.run(sourceThreads[0]?.id, 'title', 'text-embedding-3-large', 2, 'seed-hash-42-title', 
'[1,0]', now, now); + insertEmbedding.run(sourceThreads[0]?.id, 'body', 'text-embedding-3-large', 2, 'seed-hash-42-body', '[0.8,0.2]', now, now); + insertEmbedding.run(sourceThreads[1]?.id, 'title', 'text-embedding-3-large', 2, 'seed-hash-43-title', '[0.9,0.1]', now, now); + insertEmbedding.run(sourceThreads[1]?.id, 'body', 'text-embedding-3-large', 2, 'seed-hash-43-body', '[0.7,0.3]', now, now); + sourceService.db + .prepare(`insert into cluster_runs (id, repo_id, scope, status, started_at, finished_at) values (?, ?, ?, ?, ?, ?)`) + .run(1, 1, 'openclaw/openclaw', 'completed', now, now); + sourceService.db + .prepare( + `insert into similarity_edges (repo_id, cluster_run_id, left_thread_id, right_thread_id, method, score, explanation_json, created_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 1, sourceThreads[0]?.id, sourceThreads[1]?.id, 'seed', 0.95, '{"sources":["title","body"]}', now); + + const exported = await sourceService.exportSeedSidecar({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + outputDir, + onProgress: () => undefined, + }); + + await targetService.syncRepository({ owner: 'openclaw', repo: 'openclaw' }); + const syncStateBefore = targetService.db + .prepare('select updated_at from repo_sync_state where repo_id = 1') + .get() as { updated_at: string } | undefined; + + const installed = await targetService.seedInstallRepository({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + assetUrl: exported.outputPath, + skipSync: true, + force: true, + onProgress: () => undefined, + }); + + assert.equal(installed.importedThreads, 2); + assert.equal(installed.importedEmbeddings, 4); + assert.equal(installed.importedEdges, 1); + assert.equal(installed.skippedThreads, 0); + assert.equal(installed.skippedEdges, 0); + assert.equal(installed.clusters, 1); + + const embeddingRows = targetService.db + .prepare(`select source_kind, dimensions from document_embeddings where model = ? 
order by thread_id asc, source_kind asc`) + .all('text-embedding-3-large') as Array<{ source_kind: string; dimensions: number }>; + assert.deepEqual( + embeddingRows.map((row) => row.source_kind), + ['body', 'title', 'body', 'title'], + ); + assert.deepEqual( + embeddingRows.map((row) => row.dimensions), + [2, 2, 2, 2], + ); + + const clusterRunRows = targetService.db + .prepare(`select id, scope, status from cluster_runs order by id asc`) + .all() as Array<{ id: number; scope: string; status: string }>; + assert.equal(clusterRunRows.length, 1); + assert.equal(clusterRunRows[0]?.id, installed.clusterRunId); + assert.equal(clusterRunRows[0]?.scope, `seed:${exported.snapshotId}`); + assert.equal(clusterRunRows[0]?.status, 'completed'); + const edgeRows = targetService.db + .prepare('select score from similarity_edges where cluster_run_id = ?') + .all(installed.clusterRunId) as Array<{ score: number }>; + assert.equal(edgeRows.length, 1); + assert.equal(edgeRows[0]?.score, 0.95); + + const clusterMemberRows = targetService.db + .prepare( + `select cm.thread_id + from cluster_members cm + join clusters c on c.id = cm.cluster_id + where c.cluster_run_id = ? 
+ order by cm.thread_id asc`, + ) + .all(installed.clusterRunId) as Array<{ thread_id: number }>; + assert.equal(clusterMemberRows.length, 2); + + const syncStateAfter = targetService.db + .prepare('select updated_at from repo_sync_state where repo_id = 1') + .get() as { updated_at: string } | undefined; + assert.equal(syncStateAfter?.updated_at, syncStateBefore?.updated_at); + } finally { + sourceService.close(); + targetService.close(); + fs.rmSync(outputDir, { recursive: true, force: true }); + } +}); + +test('seed install rejects an existing repo without force and skips stale threads and edges when forced', async () => { + const github = makeSeedGitHubStub(); + const sourceService = makeTestService(github); + const targetService = makeTestService(github); + const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-seed-export-')); + + try { + await sourceService.syncRepository({ owner: 'openclaw', repo: 'openclaw' }); + const sourceThreads = sourceService.db + .prepare('select id, number from threads order by number asc') + .all() as Array<{ id: number; number: number }>; + const now = '2026-03-09T12:00:00Z'; + sourceService.db + .prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(sourceThreads[0]?.id, 'title', 'text-embedding-3-large', 2, 'seed-hash-42-title', '[1,0]', now, now); + sourceService.db + .prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(sourceThreads[0]?.id, 'body', 'text-embedding-3-large', 2, 'seed-hash-42-body', '[0.8,0.2]', now, now); + sourceService.db + .prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(sourceThreads[1]?.id, 
'title', 'text-embedding-3-large', 2, 'seed-hash-43-title', '[0.9,0.1]', now, now); + sourceService.db + .prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(sourceThreads[1]?.id, 'body', 'text-embedding-3-large', 2, 'seed-hash-43-body', '[0.7,0.3]', now, now); + sourceService.db + .prepare(`insert into cluster_runs (id, repo_id, scope, status, started_at, finished_at) values (?, ?, ?, ?, ?, ?)`) + .run(1, 1, 'openclaw/openclaw', 'completed', now, now); + sourceService.db + .prepare( + `insert into similarity_edges (repo_id, cluster_run_id, left_thread_id, right_thread_id, method, score, explanation_json, created_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ) + .run(1, 1, sourceThreads[0]?.id, sourceThreads[1]?.id, 'seed', 0.95, '{"sources":["title","body"]}', now); + const exported = await sourceService.exportSeedSidecar({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + outputDir, + onProgress: () => undefined, + }); + + await targetService.syncRepository({ owner: 'openclaw', repo: 'openclaw' }); + await assert.rejects( + targetService.seedInstallRepository({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + assetUrl: exported.outputPath, + skipSync: true, + onProgress: () => undefined, + }), + /already has local records/, + ); + + targetService.db.prepare(`update threads set content_hash = 'mismatch' where number = 43`).run(); + const installed = await targetService.seedInstallRepository({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + assetUrl: exported.outputPath, + skipSync: true, + force: true, + onProgress: () => undefined, + }); + + assert.equal(installed.importedThreads, 1); + assert.equal(installed.importedEmbeddings, 2); + assert.equal(installed.skippedThreads, 2); + assert.equal(installed.importedEdges, 0); + assert.equal(installed.skippedEdges, 1); + 
assert.equal(installed.clusters, 1); + + const importedEmbeddingRows = targetService.db + .prepare(`select count(*) as count from document_embeddings where model = ? and source_kind in ('title', 'body')`) + .get('text-embedding-3-large') as { count: number }; + assert.equal(importedEmbeddingRows.count, 2); + } finally { + sourceService.close(); + targetService.close(); + fs.rmSync(outputDir, { recursive: true, force: true }); + } +}); + +test('seed export excludes dedupe-only embeddings and edge sources from the sidecar', async () => { + const github = makeSeedGitHubStub(); + const service = makeTestService(github); + const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ghcrawl-seed-export-')); + + try { + await service.syncRepository({ owner: 'openclaw', repo: 'openclaw' }); + const sourceThreads = service.db + .prepare('select id, number from threads order by number asc') + .all() as Array<{ id: number; number: number }>; + const now = '2026-03-09T12:00:00Z'; + const insertEmbedding = service.db.prepare( + `insert into document_embeddings (thread_id, source_kind, model, dimensions, content_hash, embedding_json, created_at, updated_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertEmbedding.run(sourceThreads[0]?.id, 'title', 'text-embedding-3-large', 2, 'seed-hash-42-title', '[1,0]', now, now); + insertEmbedding.run(sourceThreads[0]?.id, 'dedupe_summary', 'text-embedding-3-large', 2, 'seed-hash-42-dedupe', '[0.2,0.8]', now, now); + service.db + .prepare(`insert into cluster_runs (id, repo_id, scope, status, started_at, finished_at) values (?, ?, ?, ?, ?, ?)`) + .run(1, 1, 'openclaw/openclaw', 'completed', now, now); + const insertEdge = service.db.prepare( + `insert into similarity_edges (repo_id, cluster_run_id, left_thread_id, right_thread_id, method, score, explanation_json, created_at) + values (?, ?, ?, ?, ?, ?, ?, ?)`, + ); + insertEdge.run(1, 1, sourceThreads[0]?.id, sourceThreads[1]?.id, 'seed', 0.95, '{"sources":["dedupe_summary"]}', now); + 
insertEdge.run(1, 1, sourceThreads[1]?.id, sourceThreads[0]?.id, 'seed', 0.91, '{"sources":["title","dedupe_summary"]}', now); + + const exported = await service.exportSeedSidecar({ + owner: 'openclaw', + repo: 'openclaw', + cliVersion: '0.0.0', + outputDir, + onProgress: () => undefined, + }); + const archive = await parseSeedSidecarArchive(fs.readFileSync(exported.outputPath)); + + assert.equal(exported.threads, 1); + assert.equal(exported.edges, 1); + assert.deepEqual(archive.manifest.sourceKinds, ['title']); + assert.equal(archive.manifest.embeddingCount, 1); + assert.equal(archive.manifest.edgeCount, 1); + assert.deepEqual(archive.threads.map((row) => row.sourceKind), ['title']); + assert.deepEqual(archive.edges.map((row) => row.sources), [['title']]); + } finally { + service.close(); + fs.rmSync(outputDir, { recursive: true, force: true }); + } +}); + test('syncRepository defaults to metadata-only mode, preserves thread kind, and tracks first/last pull timestamps', async () => { const messages: string[] = []; let listIssueCommentCalls = 0; diff --git a/packages/api-core/src/service.ts b/packages/api-core/src/service.ts index b03674f..ed7a44a 100644 --- a/packages/api-core/src/service.ts +++ b/packages/api-core/src/service.ts @@ -1,7 +1,8 @@ import http from 'node:http'; import crypto from 'node:crypto'; -import { existsSync } from 'node:fs'; +import { existsSync, mkdirSync, writeFileSync } from 'node:fs'; import os from 'node:os'; +import path from 'node:path'; import { fileURLToPath } from 'node:url'; import { Worker } from 'node:worker_threads'; @@ -62,6 +63,20 @@ import { openDb, type SqliteDatabase } from './db/sqlite.js'; import { buildCanonicalDocument, isBotLikeAuthor } from './documents/normalize.js'; import { makeGitHubClient, type GitHubClient } from './github/client.js'; import { OpenAiProvider, type AiProvider } from './openai/provider.js'; +import { + getKnownSeedManifestEntry, + isCliVersionCompatible, + parseSeedSidecarArchive, + readSeedAsset, 
+ seedEmbeddingSourceKindSchema, + sha256Hex, + writeSeedSidecarArchive, + type KnownSeedManifestEntry, + type SeedEdgeSidecarRow, + type SeedSidecarArchive, + type SeedSidecarArchiveWriterInput, + type SeedThreadSidecarRow, +} from './seed/sidecar.js'; import { cosineSimilarity, normalizeEmbedding, rankNearestNeighbors } from './search/exact.js'; type RunTable = 'sync_runs' | 'summary_runs' | 'embedding_runs' | 'cluster_runs'; @@ -245,6 +260,29 @@ export type DoctorResult = { }; }; +export type SeedInstallResult = { + repository: RepositoryDto; + snapshotId: string; + assetUrl: string; + synced: boolean; + importedThreads: number; + importedEmbeddings: number; + skippedThreads: number; + importedEdges: number; + skippedEdges: number; + clusters: number; + clusterRunId: number; +}; + +export type SeedExportResult = { + repository: RepositoryDto; + snapshotId: string; + outputPath: string; + sha256: string; + threads: number; + edges: number; +}; + type SyncOptions = { owner: string; repo: string; @@ -1129,6 +1167,304 @@ export class GHCrawlService { } } + async seedInstallRepository(params: { + owner: string; + repo: string; + cliVersion: string; + force?: boolean; + assetUrl?: string; + skipSync?: boolean; + onProgress?: (message: string) => void; + }): Promise<SeedInstallResult> { + const knownManifest = params.assetUrl ? null : getKnownSeedManifestEntry(params.owner, params.repo); + if (!knownManifest && !params.assetUrl) { + throw new Error(`No known seed is configured for ${params.owner}/${params.repo}. Use --asset-url to install an override.`); + } + if (knownManifest && !isCliVersionCompatible(params.cliVersion, knownManifest.compatibleCli)) { + throw new Error( + `Seed ${knownManifest.snapshotId} is not compatible with ghcrawl ${params.cliVersion}. 
Expected ${knownManifest.compatibleCli}.`, + ); + } + + const existingRepository = this.findRepository(params.owner, params.repo); + if (!params.force && existingRepository && this.repositoryHasThreadRows(existingRepository.id)) { + throw new Error( + `Repository ${existingRepository.fullName} already has local records. Re-run with --force to import starter data into an existing repo.`, + ); + } + + let synced = false; + if (params.skipSync !== true) { + params.onProgress?.(`[seed] syncing ${params.owner}/${params.repo} metadata before starter import`); + await this.syncRepository({ + owner: params.owner, + repo: params.repo, + onProgress: (message) => params.onProgress?.(message.replace(/^\[sync\]/, '[seed/sync]')), + }); + synced = true; + } + + const repository = this.requireRepository(params.owner, params.repo); + if (!this.repositoryHasThreadRows(repository.id)) { + throw new Error(`Repository ${repository.fullName} has no local thread metadata. Run sync first or omit --no-sync.`); + } + + const assetUrl = params.assetUrl ?? knownManifest?.downloadUrl; + if (!assetUrl) { + throw new Error(`Seed ${repository.fullName} does not have a configured download URL yet.`); + } + params.onProgress?.(`[seed] downloading starter asset ${assetUrl}`); + const asset = await readSeedAsset(assetUrl); + const assetSha = sha256Hex(asset); + if (knownManifest && assetSha !== knownManifest.sha256) { + throw new Error(`Starter asset checksum mismatch for ${repository.fullName}. 
Expected ${knownManifest.sha256}, received ${assetSha}.`); + } + + const archive = await parseSeedSidecarArchive(asset); + this.validateInstalledSeedManifest(repository, params.cliVersion, archive.manifest, knownManifest); + + params.onProgress?.( + `[seed] archive snapshot=${archive.manifest.snapshotId} threads=${archive.threads.length} edges=${archive.edges.length}`, + ); + + const resolvedThreads = this.resolveSeedThreads(repository.id, archive.threads); + const importedEmbeddings = resolvedThreads.matched.length; + const importedThreads = resolvedThreads.matchedThreadIds.size; + const skippedThreads = resolvedThreads.skipped; + if (importedEmbeddings === 0) { + throw new Error(`Starter asset ${archive.manifest.snapshotId} did not match any current local threads for ${repository.fullName}.`); + } + + for (const row of resolvedThreads.matched) { + this.upsertEmbedding( + row.localThreadId, + row.sidecar.sourceKind, + this.seedImportedEmbeddingContentHash(archive.manifest.snapshotId, row.sidecar.sourceKind, row.sidecar.threadContentHash), + row.sidecar.embedding, + ); + } + + const resolvedEdges = this.resolveSeedEdges(resolvedThreads.byIdentity, archive.edges); + const runId = this.startRun('cluster_runs', repository.id, `seed:${archive.manifest.snapshotId}`); + + try { + const aggregatedEdges = new Map }>(); + for (const edge of resolvedEdges.matched) { + aggregatedEdges.set(this.edgeKey(edge.leftThreadId, edge.rightThreadId), { + leftThreadId: edge.leftThreadId, + rightThreadId: edge.rightThreadId, + score: edge.score, + sourceKinds: new Set(edge.sourceKinds), + }); + } + + const nodes = Array.from(resolvedThreads.matchedThreads.values()).map((row) => ({ + threadId: row.localThreadId, + number: row.number, + title: row.title, + })); + const clusters = buildClusters( + nodes, + Array.from(aggregatedEdges.values()).map((edge) => ({ + leftThreadId: edge.leftThreadId, + rightThreadId: edge.rightThreadId, + score: edge.score, + })), + ); + + 
this.persistClusterRun(repository.id, runId, aggregatedEdges, clusters); + this.pruneOldClusterRuns(repository.id, runId); + this.finishRun('cluster_runs', runId, 'completed', { + source: 'seed', + snapshotId: archive.manifest.snapshotId, + importedThreads, + importedEdges: resolvedEdges.matched.length, + skippedThreads, + skippedEdges: resolvedEdges.skipped, + }); + + params.onProgress?.( + `[seed] imported threads=${importedThreads} skipped_threads=${skippedThreads} edges=${resolvedEdges.matched.length} skipped_edges=${resolvedEdges.skipped} clusters=${clusters.length}`, + ); + + return { + repository, + snapshotId: archive.manifest.snapshotId, + assetUrl, + synced, + importedThreads, + importedEmbeddings, + skippedThreads, + importedEdges: resolvedEdges.matched.length, + skippedEdges: resolvedEdges.skipped, + clusters: clusters.length, + clusterRunId: runId, + }; + } catch (error) { + this.finishRun('cluster_runs', runId, 'failed', null, error); + throw error; + } + } + + async exportSeedSidecar(params: { + owner: string; + repo: string; + cliVersion: string; + outputDir: string; + snapshotId?: string; + onProgress?: (message: string) => void; + }): Promise<SeedExportResult> { + const repository = this.requireRepository(params.owner, params.repo); + const latestRun = this.getLatestClusterRun(repository.id); + if (!latestRun) { + throw new Error(`Repository ${repository.fullName} does not have a completed cluster run to export.`); + } + + const embeddingWhereSql = `from threads t + join document_embeddings e on e.thread_id = t.id + where t.repo_id = ? + and t.state = 'open' + and t.closed_at_local is null + and e.model = ? 
+ and e.source_kind != 'dedupe_summary'`; + const threadStatement = this.db.prepare( + `select + t.number, + t.kind, + t.github_id, + t.content_hash, + e.source_kind, + e.dimensions, + e.embedding_json + ${embeddingWhereSql} + order by t.number asc, e.source_kind asc`, + ); + const threadCountRow = this.db + .prepare( + `select count(*) as embedding_count, count(distinct t.id) as thread_count + ${embeddingWhereSql}`, + ) + .get(repository.id, this.config.embedModel) as { + embedding_count: number; + thread_count: number; + }; + if (threadCountRow.embedding_count === 0) { + throw new Error(`Repository ${repository.fullName} does not have any non-dedupe embeddings to export.`); + } + + const sourceKinds = this.db + .prepare( + `select distinct e.source_kind + ${embeddingWhereSql} + order by e.source_kind asc`, + ) + .all(repository.id, this.config.embedModel) + .map((row: unknown) => (row as { source_kind: EmbeddingSourceKind }).source_kind); + const edgeStatement = this.db + .prepare( + `select + left_t.number as left_number, + left_t.kind as left_kind, + left_t.github_id as left_github_id, + right_t.number as right_number, + right_t.kind as right_kind, + right_t.github_id as right_github_id, + se.score, + se.explanation_json + from similarity_edges se + join threads left_t on left_t.id = se.left_thread_id + join threads right_t on right_t.id = se.right_thread_id + where se.repo_id = ? + and se.cluster_run_id = ? 
+ order by left_t.number asc, right_t.number asc`, + ); + let edgeCount = 0; + for (const row of edgeStatement.iterate(repository.id, latestRun.id) as Iterable<{ + left_number: number; + left_kind: 'issue' | 'pull_request'; + left_github_id: string; + right_number: number; + right_kind: 'issue' | 'pull_request'; + right_github_id: string; + score: number; + explanation_json: string; + }>) { + const sources = this.parseSeedExportEdgeSources(row.explanation_json); + if (sources.length > 0) { + edgeCount += 1; + } + } + + const snapshotId = + params.snapshotId ?? + `${params.owner}-${params.repo}-${nowIso().replace(/[:.]/g, '-').replace(/Z$/, 'Z')}`; + const archiveManifest: SeedSidecarArchiveWriterInput = { + manifest: { + schemaVersion: 1, + format: 'ghcrawl-seed-sidecar-gzip-v1', + snapshotId, + createdAt: nowIso(), + compatibleCli: this.cliCompatibilityRange(params.cliVersion), + owner: params.owner, + repo: params.repo, + fullName: repository.fullName, + embedModel: this.config.embedModel, + sourceKinds, + cluster: { + k: 6, + minScore: 0.82, + }, + threadCount: threadCountRow.thread_count, + embeddingCount: threadCountRow.embedding_count, + edgeCount, + }, + threads: this.iterExportSeedThreads( + threadStatement.iterate(repository.id, this.config.embedModel) as Iterable<{ + number: number; + kind: 'issue' | 'pull_request'; + github_id: string; + content_hash: string; + source_kind: EmbeddingSourceKind; + dimensions: number; + embedding_json: string; + }>, + params.owner, + params.repo, + ), + edges: this.iterExportSeedEdges( + edgeStatement.iterate(repository.id, latestRun.id) as Iterable<{ + left_number: number; + left_kind: 'issue' | 'pull_request'; + left_github_id: string; + right_number: number; + right_kind: 'issue' | 'pull_request'; + right_github_id: string; + score: number; + explanation_json: string; + }>, + params.owner, + params.repo, + ), + }; + + mkdirSync(params.outputDir, { recursive: true }); + const outputPath = path.join(params.outputDir, 
`${snapshotId}.seed.json.gz`); + const { sha256 } = await writeSeedSidecarArchive(outputPath, archiveManifest); + writeFileSync(`${outputPath}.sha256`, `${sha256} ${path.basename(outputPath)}\n`); + params.onProgress?.( + `[seed-export] wrote ${threadCountRow.thread_count} threads, ${threadCountRow.embedding_count} embeddings, and ${edgeCount} edges to ${outputPath}`, + ); + + return { + repository, + snapshotId, + outputPath, + sha256, + threads: threadCountRow.thread_count, + edges: edgeCount, + }; + } + async searchRepository(params: { owner: string; repo: string; @@ -2185,13 +2521,236 @@ export class GHCrawlService { return this.github as GitHubClient; } - private requireRepository(owner: string, repo: string): RepositoryDto { + private findRepository(owner: string, repo: string): RepositoryDto | null { const fullName = `${owner}/${repo}`; const row = this.db.prepare('select * from repositories where full_name = ? limit 1').get(fullName) as Record | undefined; + return row ? repositoryToDto(row) : null; + } + + private requireRepository(owner: string, repo: string): RepositoryDto { + const row = this.findRepository(owner, repo); if (!row) { - throw new Error(`Repository ${fullName} not found. Run sync first.`); + throw new Error(`Repository ${owner}/${repo} not found. 
Run sync first.`); + } + return row; + } + + private repositoryHasThreadRows(repoId: number): boolean { + const row = this.db.prepare('select count(*) as count from threads where repo_id = ?').get(repoId) as { count: number }; + return row.count > 0; + } + + private validateInstalledSeedManifest( + repository: RepositoryDto, + cliVersion: string, + manifest: SeedSidecarArchive['manifest'], + knownManifest: KnownSeedManifestEntry | null, + ): void { + if (manifest.fullName !== repository.fullName) { + throw new Error(`Starter asset ${manifest.snapshotId} is for ${manifest.fullName}, not ${repository.fullName}.`); + } + if (!isCliVersionCompatible(cliVersion, manifest.compatibleCli)) { + throw new Error( + `Starter asset ${manifest.snapshotId} is not compatible with ghcrawl ${cliVersion}. Expected ${manifest.compatibleCli}.`, + ); + } + if (knownManifest && knownManifest.snapshotId !== manifest.snapshotId) { + throw new Error( + `Starter asset snapshot mismatch for ${repository.fullName}. Expected ${knownManifest.snapshotId}, received ${manifest.snapshotId}.`, + ); + } + if (manifest.embedModel !== this.config.embedModel) { + throw new Error( + `Starter asset ${manifest.snapshotId} expects embed model ${manifest.embedModel}, but this install is configured for ${this.config.embedModel}.`, + ); + } + } + + private resolveSeedThreads( + repoId: number, + sidecarThreads: SeedThreadSidecarRow[], + ): { + matched: Array<{ localThreadId: number; number: number; title: string; sidecar: SeedThreadSidecarRow }>; + matchedThreadIds: Set; + matchedThreads: Map; + byIdentity: Map; + skipped: number; + } { + const localRows = this.db + .prepare( + `select id, number, kind, title, github_id, content_hash + from threads + where repo_id = ? 
+ and state = 'open' + and closed_at_local is null`, + ) + .all(repoId) as Array<{ + id: number; + number: number; + kind: 'issue' | 'pull_request'; + title: string; + github_id: string; + content_hash: string; + }>; + const localByIdentity = new Map(localRows.map((row) => [this.seedIdentityKey(row.kind, row.number), row])); + const matched: Array<{ localThreadId: number; number: number; title: string; sidecar: SeedThreadSidecarRow }> = []; + const matchedThreadIds = new Set(); + const matchedThreads = new Map(); + const byIdentity = new Map(); + let skipped = 0; + + for (const sidecar of sidecarThreads) { + const local = localByIdentity.get(this.seedIdentityKey(sidecar.kind, sidecar.number)); + if (!local || local.github_id !== sidecar.githubId || local.content_hash !== sidecar.threadContentHash) { + skipped += 1; + continue; + } + const resolved = { + localThreadId: local.id, + number: local.number, + title: local.title, + sidecar, + }; + matched.push(resolved); + matchedThreadIds.add(local.id); + matchedThreads.set(local.id, { localThreadId: local.id, number: local.number, title: local.title }); + byIdentity.set(this.seedIdentityKey(sidecar.kind, sidecar.number), resolved); + } + + return { matched, matchedThreadIds, matchedThreads, byIdentity, skipped }; + } + + private *iterExportSeedThreads( + rows: Iterable<{ + number: number; + kind: 'issue' | 'pull_request'; + github_id: string; + content_hash: string; + source_kind: EmbeddingSourceKind; + dimensions: number; + embedding_json: string; + }>, + owner: string, + repo: string, + ): Iterable { + for (const row of rows) { + yield { + owner, + repo, + kind: row.kind, + number: row.number, + githubId: row.github_id, + threadContentHash: row.content_hash, + sourceKind: row.source_kind, + embeddingModel: this.config.embedModel, + dimensions: row.dimensions, + embedding: JSON.parse(row.embedding_json) as number[], + }; + } + } + + private *iterExportSeedEdges( + rows: Iterable<{ + left_number: number; + left_kind: 
'issue' | 'pull_request'; + left_github_id: string; + right_number: number; + right_kind: 'issue' | 'pull_request'; + right_github_id: string; + score: number; + explanation_json: string; + }>, + owner: string, + repo: string, + ): Iterable { + for (const row of rows) { + const sources = this.parseSeedExportEdgeSources(row.explanation_json); + if (sources.length === 0) { + continue; + } + yield { + left: { + owner, + repo, + kind: row.left_kind, + number: row.left_number, + githubId: row.left_github_id, + }, + right: { + owner, + repo, + kind: row.right_kind, + number: row.right_number, + githubId: row.right_github_id, + }, + score: row.score, + sources, + }; + } + } + + private resolveSeedEdges( + resolvedThreads: Map, + sidecarEdges: SeedEdgeSidecarRow[], + ): { + matched: Array<{ leftThreadId: number; rightThreadId: number; score: number; sourceKinds: EmbeddingSourceKind[] }>; + skipped: number; + } { + const matched: Array<{ leftThreadId: number; rightThreadId: number; score: number; sourceKinds: EmbeddingSourceKind[] }> = []; + let skipped = 0; + + for (const edge of sidecarEdges) { + const left = resolvedThreads.get(this.seedIdentityKey(edge.left.kind, edge.left.number)); + const right = resolvedThreads.get(this.seedIdentityKey(edge.right.kind, edge.right.number)); + if (!left || !right || left.localThreadId === right.localThreadId) { + skipped += 1; + continue; + } + matched.push({ + leftThreadId: left.localThreadId, + rightThreadId: right.localThreadId, + score: edge.score, + sourceKinds: (() => { + const sources = edge.sources.filter((value): value is EmbeddingSourceKind => seedEmbeddingSourceKindSchema.safeParse(value).success); + return sources.length > 0 ? 
sources : ['title']; + })(), + }); + } + + return { matched, skipped }; + } + + private seedIdentityKey(kind: 'issue' | 'pull_request', number: number): string { + return `${kind}:${number}`; + } + + private seedImportedEmbeddingContentHash(snapshotId: string, sourceKind: EmbeddingSourceKind, threadContentHash: string): string { + return stableContentHash(`seed-embedding:${snapshotId}:${sourceKind}\n${threadContentHash}`); + } + + private parseSeedEdgeSources(explanationJson: string | null): EmbeddingSourceKind[] { + if (!explanationJson) { + return ['title']; } - return repositoryToDto(row); + try { + const parsed = JSON.parse(explanationJson) as { sources?: unknown }; + const sourceKinds = Array.isArray(parsed.sources) + ? parsed.sources.filter((value): value is EmbeddingSourceKind => seedEmbeddingSourceKindSchema.safeParse(value).success) + : []; + return sourceKinds.length > 0 ? sourceKinds : ['title']; + } catch { + return ['title']; + } + } + + private parseSeedExportEdgeSources(explanationJson: string | null): EmbeddingSourceKind[] { + return this.parseSeedEdgeSources(explanationJson).filter((sourceKind) => sourceKind !== 'dedupe_summary'); + } + + private cliCompatibilityRange(cliVersion: string): string { + const match = cliVersion.match(/^(\d+)\./); + const major = match ? 
Number(match[1]) : 0; + return `>=${cliVersion} <${major + 1}.0.0`; } private upsertRepository(owner: string, repo: string, payload: Record): number { diff --git a/scripts/op-run.mjs b/scripts/op-run.mjs index d5ae137..62208a2 100644 --- a/scripts/op-run.mjs +++ b/scripts/op-run.mjs @@ -47,6 +47,14 @@ function readSecret(reference) { }).trim(); } +function tryReadSecret(reference) { + try { + return readSecret(reference); + } catch { + return undefined; + } +} + function loadOpEnv(env = process.env) { const configPath = getConfigPath(env); const config = readConfig(configPath); @@ -54,7 +62,7 @@ function loadOpEnv(env = process.env) { return { ...env, GITHUB_TOKEN: readSecret(`op://${vaultName}/${itemName}/GITHUB_TOKEN`), - OPENAI_API_KEY: readSecret(`op://${vaultName}/${itemName}/OPENAI_API_KEY`), + OPENAI_API_KEY: tryReadSecret(`op://${vaultName}/${itemName}/OPENAI_API_KEY`), }; } diff --git a/scripts/seed-audit.mjs b/scripts/seed-audit.mjs new file mode 100644 index 0000000..ae57f2f --- /dev/null +++ b/scripts/seed-audit.mjs @@ -0,0 +1,458 @@ +import crypto from 'node:crypto'; +import fs from 'node:fs'; +import { createInterface } from 'node:readline'; +import { Readable } from 'node:stream'; +import { createGunzip } from 'node:zlib'; +import { parseArgs } from 'node:util'; + +const allowedRecordKinds = new Set(['manifest', 'thread', 'edge']); +const allowedThreadKinds = new Set(['issue', 'pull_request']); +const allowedSourceKinds = new Set(['title', 'body', 'dedupe_summary']); +const allowedManifestKeys = new Set([ + 'schemaVersion', + 'format', + 'snapshotId', + 'createdAt', + 'compatibleCli', + 'owner', + 'repo', + 'fullName', + 'embedModel', + 'sourceKinds', + 'cluster', + 'threadCount', + 'embeddingCount', + 'edgeCount', +]); +const allowedThreadKeys = new Set([ + 'owner', + 'repo', + 'kind', + 'number', + 'githubId', + 'threadContentHash', + 'sourceKind', + 'embeddingModel', + 'dimensions', + 'embedding', +]); +const allowedEdgeKeys = new Set(['left', 
'right', 'score', 'sources']); +const allowedIdentityKeys = new Set(['owner', 'repo', 'kind', 'number', 'githubId']); +const maxUniqueIssueCount = 200; + +async function main() { + const options = parseCli(process.argv.slice(2)); + const expectedFullName = `${options.owner}/${options.repo}`; + const assetBuffer = await readAsset(options.asset); + const sha256 = crypto.createHash('sha256').update(assetBuffer).digest('hex'); + const input = Readable.from(assetBuffer).pipe(createGunzip()); + const reader = createInterface({ input, crlfDelay: Infinity }); + + const issueTracker = { counts: new Map(), order: [], omitted: 0 }; + const threadSourceCounts = new Map(); + const edgeSourceCounts = new Map(); + const uniqueThreadIdentities = new Set(); + const uniqueThreadSources = new Set(); + let manifest = null; + let threadRowCount = 0; + let edgeRowCount = 0; + + for await (const line of reader) { + const trimmed = line.trim(); + if (!trimmed) { + continue; + } + + let record; + try { + record = JSON.parse(trimmed); + } catch (error) { + throw new Error(`Invalid JSON record: ${error instanceof Error ? 
error.message : String(error)}`); + } + + if (!record || typeof record !== 'object' || Array.isArray(record)) { + recordIssue(issueTracker, 'Record is not an object.'); + continue; + } + if (!allowedRecordKinds.has(record.kind)) { + recordIssue(issueTracker, `Unexpected record kind: ${String(record.kind)}`); + continue; + } + + if (record.kind === 'manifest') { + if (manifest) { + recordIssue(issueTracker, 'Archive contains more than one manifest record.'); + continue; + } + manifest = validateManifest(record.payload, expectedFullName, options.expectedSources, issueTracker); + continue; + } + + if (!manifest) { + recordIssue(issueTracker, `Encountered ${record.kind} record before manifest.`); + continue; + } + + if (record.kind === 'thread') { + const row = validateThreadRow(record.payload, expectedFullName, options.expectedSources, issueTracker); + if (!row) { + continue; + } + threadRowCount += 1; + const identityKey = `${row.kind}:${row.number}`; + uniqueThreadIdentities.add(identityKey); + uniqueThreadSources.add(`${identityKey}:${row.sourceKind}`); + increment(threadSourceCounts, row.sourceKind); + continue; + } + + const row = validateEdgeRow(record.payload, expectedFullName, options.expectedSources, issueTracker); + if (!row) { + continue; + } + edgeRowCount += 1; + for (const sourceKind of row.sources) { + increment(edgeSourceCounts, sourceKind); + } + } + + if (!manifest) { + recordIssue(issueTracker, 'Archive did not contain a manifest record.'); + } else { + if (manifest.threadCount !== uniqueThreadIdentities.size) { + recordIssue( + issueTracker, + `Manifest threadCount=${manifest.threadCount} does not match unique thread identities=${uniqueThreadIdentities.size}.`, + ); + } + if (manifest.embeddingCount !== threadRowCount) { + recordIssue(issueTracker, `Manifest embeddingCount=${manifest.embeddingCount} does not match thread rows=${threadRowCount}.`); + } + if (manifest.edgeCount !== edgeRowCount) { + recordIssue(issueTracker, `Manifest 
edgeCount=${manifest.edgeCount} does not match edge rows=${edgeRowCount}.`); + } + const manifestSources = normalizeSourceKinds(manifest.sourceKinds); + const seenThreadSources = [...threadSourceCounts.keys()].sort(); + if (manifestSources.join(',') !== seenThreadSources.join(',')) { + recordIssue( + issueTracker, + `Manifest sourceKinds=${manifestSources.join(',')} do not match thread row source kinds=${seenThreadSources.join(',')}.`, + ); + } + } + + const report = { + ok: issueTracker.counts.size === 0 && issueTracker.omitted === 0, + asset: options.asset, + sha256, + expectedRepo: expectedFullName, + manifest: manifest + ? { + snapshotId: manifest.snapshotId, + schemaVersion: manifest.schemaVersion, + format: manifest.format, + compatibleCli: manifest.compatibleCli, + embedModel: manifest.embedModel, + sourceKinds: normalizeSourceKinds(manifest.sourceKinds), + threadCount: manifest.threadCount, + embeddingCount: manifest.embeddingCount, + edgeCount: manifest.edgeCount, + } + : null, + observed: { + uniqueThreadCount: uniqueThreadIdentities.size, + embeddingRowCount: threadRowCount, + edgeRowCount, + threadSourceCounts: objectFromMap(threadSourceCounts), + edgeSourceCounts: objectFromMap(edgeSourceCounts), + uniqueThreadSourceCount: uniqueThreadSources.size, + }, + issues: issueEntries(issueTracker), + omittedIssueCount: issueTracker.omitted, + }; + + if (options.json) { + process.stdout.write(`${JSON.stringify(report, null, 2)}\n`); + } else { + writeTextReport(report); + } + + if (!report.ok) { + process.exitCode = 1; + } +} + +function parseCli(argv) { + const parsed = parseArgs({ + args: argv, + options: { + asset: { type: 'string' }, + repo: { type: 'string' }, + sources: { type: 'string' }, + json: { type: 'boolean' }, + }, + allowPositionals: true, + }); + + const asset = parsed.values.asset ?? 
parsed.positionals[0]; + if (typeof asset !== 'string' || asset.trim().length === 0) { + throw new Error('Missing --asset '); + } + const repoValue = parsed.values.repo ?? 'openclaw/openclaw'; + if (typeof repoValue !== 'string' || !repoValue.includes('/')) { + throw new Error(`Invalid --repo value: ${String(repoValue)}`); + } + const [owner, repo] = repoValue.split('/'); + if (!owner || !repo) { + throw new Error(`Invalid --repo value: ${repoValue}`); + } + + const expectedSources = parsed.values.sources ? normalizeSourceKinds(parsed.values.sources.split(',')) : null; + return { + asset, + owner, + repo, + expectedSources, + json: parsed.values.json === true, + }; +} + +async function readAsset(asset) { + if (/^https?:\/\//i.test(asset)) { + const response = await fetch(asset); + if (!response.ok) { + throw new Error(`Failed to download seed asset: ${response.status} ${response.statusText}`); + } + return Buffer.from(await response.arrayBuffer()); + } + + if (asset.startsWith('file://')) { + return fs.readFileSync(new URL(asset)); + } + + return fs.readFileSync(asset); +} + +function validateManifest(value, expectedFullName, expectedSources, issueTracker) { + if (!isPlainObject(value)) { + recordIssue(issueTracker, 'Manifest payload is not an object.'); + return null; + } + rejectUnknownKeys('manifest', value, allowedManifestKeys, issueTracker); + const [expectedOwner, expectedRepo] = expectedFullName.split('/'); + if (value.owner !== expectedOwner || value.repo !== expectedRepo || value.fullName !== expectedFullName) { + recordIssue(issueTracker, `Manifest targets ${String(value.fullName)} instead of ${expectedFullName}.`); + } + if (!Array.isArray(value.sourceKinds) || value.sourceKinds.length === 0) { + recordIssue(issueTracker, 'Manifest sourceKinds is missing or empty.'); + } else { + for (const sourceKind of value.sourceKinds) { + if (!allowedSourceKinds.has(sourceKind)) { + recordIssue(issueTracker, `Manifest contains unsupported source kind: 
${String(sourceKind)}.`); + } + } + } + if (expectedSources) { + const manifestSources = normalizeSourceKinds(Array.isArray(value.sourceKinds) ? value.sourceKinds : []); + if (manifestSources.join(',') !== expectedSources.join(',')) { + recordIssue( + issueTracker, + `Manifest sourceKinds=${manifestSources.join(',')} do not match expected sources=${expectedSources.join(',')}.`, + ); + } + } + if (!isPlainObject(value.cluster)) { + recordIssue(issueTracker, 'Manifest cluster metadata is missing.'); + } + return value; +} + +function validateThreadRow(value, expectedFullName, expectedSources, issueTracker) { + if (!isPlainObject(value)) { + recordIssue(issueTracker, 'Thread payload is not an object.'); + return null; + } + rejectUnknownKeys('thread', value, allowedThreadKeys, issueTracker); + validateIdentity(value, 'thread', expectedFullName, issueTracker, false); + if (!allowedSourceKinds.has(value.sourceKind)) { + recordIssue(issueTracker, `Thread row has unsupported source kind: ${String(value.sourceKind)}.`); + } + if (expectedSources && !expectedSources.includes(value.sourceKind)) { + recordIssue( + issueTracker, + `Thread row has source kind ${String(value.sourceKind)} outside expected sources ${expectedSources.join(',')}.`, + ); + } + if (!Array.isArray(value.embedding) || value.embedding.length === 0 || !value.embedding.every((item) => typeof item === 'number')) { + recordIssue(issueTracker, `Thread ${String(value.kind)}#${String(value.number)} has an invalid embedding payload.`); + } + if (!Number.isInteger(value.dimensions) || value.dimensions <= 0) { + recordIssue(issueTracker, `Thread ${String(value.kind)}#${String(value.number)} has invalid dimensions=${String(value.dimensions)}.`); + } + if (Array.isArray(value.embedding) && Number.isInteger(value.dimensions) && value.embedding.length !== value.dimensions) { + recordIssue( + issueTracker, + `Thread ${String(value.kind)}#${String(value.number)} dimensions=${value.dimensions} does not match embedding 
length=${value.embedding.length}.`, + ); + } + return value; +} + +function validateEdgeRow(value, expectedFullName, expectedSources, issueTracker) { + if (!isPlainObject(value)) { + recordIssue(issueTracker, 'Edge payload is not an object.'); + return null; + } + rejectUnknownKeys('edge', value, allowedEdgeKeys, issueTracker); + validateIdentity(value.left, 'edge.left', expectedFullName, issueTracker); + validateIdentity(value.right, 'edge.right', expectedFullName, issueTracker); + if (!Array.isArray(value.sources) || value.sources.length === 0) { + recordIssue(issueTracker, 'Edge row has no sources.'); + } else { + for (const sourceKind of value.sources) { + if (!allowedSourceKinds.has(sourceKind)) { + recordIssue(issueTracker, `Edge row has unsupported source kind: ${String(sourceKind)}.`); + continue; + } + if (expectedSources && !expectedSources.includes(sourceKind)) { + recordIssue( + issueTracker, + `Edge row has source kind ${String(sourceKind)} outside expected sources ${expectedSources.join(',')}.`, + ); + } + } + } + if (typeof value.score !== 'number' || !Number.isFinite(value.score)) { + recordIssue(issueTracker, `Edge row has invalid score=${String(value.score)}.`); + } + return value; +} + +function validateIdentity(value, label, expectedFullName, issueTracker, rejectUnknown = true) { + if (!isPlainObject(value)) { + recordIssue(issueTracker, `${label} is not an object.`); + return; + } + if (rejectUnknown) { + rejectUnknownKeys(label, value, allowedIdentityKeys, issueTracker); + } + const [expectedOwner, expectedRepo] = expectedFullName.split('/'); + if (value.owner !== expectedOwner || value.repo !== expectedRepo) { + recordIssue(issueTracker, `${label} points to ${String(value.owner)}/${String(value.repo)} instead of ${expectedFullName}.`); + } + if (!allowedThreadKinds.has(value.kind)) { + recordIssue(issueTracker, `${label} has invalid kind=${String(value.kind)}.`); + } + if (!Number.isInteger(value.number) || value.number <= 0) { + 
recordIssue(issueTracker, `${label} has invalid number=${String(value.number)}.`); + } + if (typeof value.githubId !== 'string' || value.githubId.length === 0) { + recordIssue(issueTracker, `${label} has invalid githubId.`); + } +} + +function rejectUnknownKeys(label, value, allowedKeys, issueTracker) { + for (const key of Object.keys(value)) { + if (!allowedKeys.has(key)) { + recordIssue(issueTracker, `${label} contains unexpected key ${key}.`); + } + } +} + +function recordIssue(issueTracker, issue) { + const current = issueTracker.counts.get(issue); + if (typeof current === 'number') { + issueTracker.counts.set(issue, current + 1); + return; + } + if (issueTracker.order.length < maxUniqueIssueCount) { + issueTracker.order.push(issue); + issueTracker.counts.set(issue, 1); + return; + } + issueTracker.omitted += 1; +} + +function issueEntries(issueTracker) { + return issueTracker.order.map((issue) => { + const count = issueTracker.counts.get(issue) ?? 1; + return count === 1 ? issue : `${issue} (${count} occurrences)`; + }); +} + +function isPlainObject(value) { + return typeof value === 'object' && value !== null && !Array.isArray(value); +} + +function normalizeSourceKinds(sourceKinds) { + return [...new Set(sourceKinds)].sort(); +} + +function increment(map, key) { + map.set(key, (map.get(key) ?? 0) + 1); +} + +function objectFromMap(map) { + return Object.fromEntries([...map.entries()].sort(([left], [right]) => left.localeCompare(right))); +} + +function writeTextReport(report) { + const lines = [ + 'seed audit', + `ok: ${report.ok ? 
'yes' : 'no'}`, + `asset: ${report.asset}`, + `sha256: ${report.sha256}`, + `expected repo: ${report.expectedRepo}`, + ]; + + if (report.manifest) { + lines.push( + '', + 'manifest', + ` snapshot: ${report.manifest.snapshotId}`, + ` schema: ${report.manifest.schemaVersion}`, + ` format: ${report.manifest.format}`, + ` compatible cli: ${report.manifest.compatibleCli}`, + ` embed model: ${report.manifest.embedModel}`, + ` source kinds: ${report.manifest.sourceKinds.join(', ')}`, + ` thread count: ${report.manifest.threadCount}`, + ` embedding count: ${report.manifest.embeddingCount}`, + ` edge count: ${report.manifest.edgeCount}`, + ); + } + + lines.push( + '', + 'observed', + ` unique threads: ${report.observed.uniqueThreadCount}`, + ` embedding rows: ${report.observed.embeddingRowCount}`, + ` edge rows: ${report.observed.edgeRowCount}`, + ` thread sources: ${formatCounts(report.observed.threadSourceCounts)}`, + ` edge sources: ${formatCounts(report.observed.edgeSourceCounts)}`, + ); + + if (report.issues.length > 0 || report.omittedIssueCount > 0) { + lines.push('', 'issues'); + for (const issue of report.issues) { + lines.push(` - ${issue}`); + } + if (report.omittedIssueCount > 0) { + lines.push(` - ... ${report.omittedIssueCount} additional issues omitted`); + } + } + + process.stdout.write(`${lines.join('\n')}\n`); +} + +function formatCounts(value) { + const entries = Object.entries(value); + if (entries.length === 0) { + return 'none'; + } + return entries.map(([key, count]) => `${key}=${count}`).join(', '); +} + +main().catch((error) => { + process.stderr.write(`${error instanceof Error ? error.message : String(error)}\n`); + process.exitCode = 1; +});
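Note on the issue tracker in `scripts/seed-audit.mjs`: `recordIssue` deduplicates identical messages and stops retaining new distinct messages once `maxUniqueIssueCount` is reached, counting the overflow in `omitted`. The sketch below isolates that behavior so it can be checked on its own. The `createIssueTracker` helper and the `maxUnique` parameter are assumptions added for illustration; the script itself builds the tracker inline and uses a module-level constant of 200.

```javascript
// Minimal sketch of the audit's issue tracker, assuming the same shape as in
// seed-audit.mjs: { counts: Map, order: string[], omitted: number }.
// createIssueTracker and the maxUnique parameter are hypothetical helpers.
function createIssueTracker() {
  return { counts: new Map(), order: [], omitted: 0 };
}

function recordIssue(tracker, issue, maxUnique = 200) {
  const current = tracker.counts.get(issue);
  if (typeof current === 'number') {
    // Already-seen message: only bump its occurrence count.
    tracker.counts.set(issue, current + 1);
    return;
  }
  if (tracker.order.length < maxUnique) {
    // New message and still under the cap: remember it in first-seen order.
    tracker.order.push(issue);
    tracker.counts.set(issue, 1);
    return;
  }
  // New message past the cap: the text is dropped, only the overflow is counted.
  tracker.omitted += 1;
}

function issueEntries(tracker) {
  return tracker.order.map((issue) => {
    const count = tracker.counts.get(issue) ?? 1;
    return count === 1 ? issue : `${issue} (${count} occurrences)`;
  });
}

// Usage with a cap of 2 distinct messages (the real script uses 200).
const tracker = createIssueTracker();
for (const message of ['bad kind', 'bad kind', 'bad score', 'bad number']) {
  recordIssue(tracker, message, 2);
}
const entries = issueEntries(tracker);
// Same pass/fail rule as report.ok in the script.
const ok = tracker.counts.size === 0 && tracker.omitted === 0;
console.log(entries, tracker.omitted, ok);
```

Because `report.ok` also checks `omitted`, an archive whose only problems fall past the cap still fails the audit, even though those messages are not listed individually.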