From 8e3e911a0b3856307c7a71bcf24421d0a4c3d30a Mon Sep 17 00:00:00 2001 From: carlos-alm <127798846+carlos-alm@users.noreply.github.com> Date: Tue, 21 Apr 2026 01:04:06 -0600 Subject: [PATCH 1/3] docs: snapshot & parallel-worktree incremental-build audit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Research pass comparing codegraph to omnigraph.dev. Documents three filed correctness bugs (#995 snapshot TOCTOU, #996 unlocked journal appends, #997 watcher never updates journal header) and records why a content-addressed parse cache was considered and rejected: parsing is ~10% of full-build time, so the cache would save ~10% on first- build-in-new-worktree and near-zero on incremental/watch. Also corrects the initial "structural failure in multi-worktree incremental updates" framing — each linked worktree gets its own DB (verified), and incremental builds within a worktree work correctly. The real gap is that first-build-in-new-worktree doesn't amortize work other worktrees already did, which is an optimization opportunity, not a structural failure. --- docs/reports/snapshot-and-worktree-audit.md | 192 ++++++++++++++++++++ 1 file changed, 192 insertions(+) create mode 100644 docs/reports/snapshot-and-worktree-audit.md diff --git a/docs/reports/snapshot-and-worktree-audit.md b/docs/reports/snapshot-and-worktree-audit.md new file mode 100644 index 000000000..637d71e67 --- /dev/null +++ b/docs/reports/snapshot-and-worktree-audit.md @@ -0,0 +1,192 @@ +# Snapshot & parallel-worktree incremental-build audit + +**Date:** 2026-04-21 +**Framing:** comparison pass against [omnigraph.dev](https://www.omnigraph.dev/), focused on our snapshot design and incremental-build behavior when multiple Claude Code worktrees run against the same repo in parallel. +**Outcome:** 3 correctness bugs filed (#995, #996, #997). One proposed optimization (content-addressed parse cache) evaluated and rejected against the current performance profile. No architectural rewrite justified. + +--- + +## 0. TL;DR + +- **Each worktree does get its own `.codegraph/graph.db`** — `git rev-parse --show-toplevel` returns the *worktree* path in linked worktrees. Verified locally. So "two worktrees corrupt each other's DB" is not a real risk. +- **The framing of "structural failure in incremental updates" was overblown.** Incremental builds within a single worktree work correctly. What's actually imperfect: concurrent writers to the same worktree's journal/snapshot files, and first-build-in-new-worktree doesn't amortize work other worktrees already did. +- **3 real bugs found and filed** — all about non-atomic file operations: + - #995 — snapshot save TOCTOU race + - #996 — journal append isn't locked + - #997 — journal header lags appended entries during watch → silent performance cliff +- **Omnigraph's "commits are snapshots, branches are references" model does not fit us** — it couples graph freshness to git commits, which breaks the "graph tracks your working tree live" invariant that is codegraph's whole point. +- **Content-addressed parse cache (R1 below)** was the most promising omnigraph-inspired optimization, but parsing is ~10% of build time. A cache would save ~10% on first-build-in-new-worktree and near-zero on incremental/watch. **Not worth the engineering cost.** Rejected. +- **Honest scope limit:** omnigraph's public docs are thin on internals. Claims about their concurrency model and merge semantics come from marketing pages, not source. Calibrate the "what to learn" findings accordingly. + +--- + +## 1. What I verified vs. what I inferred + +| Claim | Source | Confidence | +|---|---|---| +| Linked worktree gets its own DB | Ran `git rev-parse --show-toplevel` in worktree → returned worktree path | **Verified** | +| Worktree and main each have independent ~30 MB `.codegraph/` dirs | `ls -la` in both locations | **Verified** | +| Advisory lock warns but never blocks | `src/db/connection.ts:112-130` — reads PID, calls `warn()`, writes own PID, proceeds | **Verified** | +| Snapshots are `VACUUM INTO` to `.codegraph/snapshots/.db` | `src/features/snapshot.ts:18-60` | **Verified** | +| Snapshots have no git identity | Name regex `^[a-zA-Z0-9_-]+$`; no git calls in the file | **Verified** | +| Journal header + append are separate writes, with no coordination | `src/domain/graph/journal.ts:62-83` vs `85-105` | **Verified** | +| Parsing is ~10% of full-build time | User-provided benchmark | **Verified (external)** | +| Omnigraph uses Lance + Arrow + DataFusion + Cedar | Omnigraph landing page | **Per marketing** | +| Omnigraph branches are copy-on-write, "no locks" | Omnigraph landing page | **Per marketing — no source dive** | +| Omnigraph merge conflict semantics | — | **Not documented publicly** | + +--- + +## 2. What's actually in our code + +### 2.1 Snapshot (`src/features/snapshot.ts`) + +```text +snapshotSave(name) → VACUUM INTO .codegraph/snapshots/.db (lines 27-61) +snapshotRestore(name) → copyFileSync over graph.db + wipe WAL/SHM (lines 67-88) +snapshotList/Delete → flat dir listing (lines 97-131) +``` + +Real correctness issue: `existsSync` → `unlinkSync` → `VACUUM INTO` is a three-step TOCTOU race. Filed as **#995**. + +Separately, snapshot name is free-form — no HEAD/branch/dirty-state binding. Two worktrees running `snapshot save main` from different commits silently overwrite each other with `--force`. This is a UX gap, not a correctness bug. + +### 2.2 Change journal (`src/domain/graph/journal.ts`) + +Format: +``` +# codegraph-journal v1 +path/to/changed-file.ts +DELETED path/to/gone.ts +``` + +Two real issues: + +- **#996** — `appendJournalEntries()` uses unlocked `fs.appendFileSync`. Concurrent writers (e.g. watcher + manual build) can interleave lines, producing corrupt entries. +- **#997** — Watcher appends entries but never updates the header timestamp; only builder's finalize does. Result: after a watch session, journal entries are newer than the header, Tier 0 journal-check bails, next build falls through to the expensive Tier 1/2 scan. Silent performance cliff. + +### 2.3 Advisory "lock" (`src/db/connection.ts:112-130`) + +```ts +if (pid && pid !== process.pid && isProcessAlive(pid)) { + warn(`Another process (PID ${pid}) may be using this database. Proceeding with caution.`); +} +// ... then writes own PID anyway +``` + +Not a lock — a log line. SQLite's WAL `busy_timeout = 5000` (`connection.ts:169`) provides actual DB-level serialization. The informational lock file doesn't cover surrounding file operations (journal, snapshot directory). Discussed in #996. + +### 2.4 Per-worktree isolation is correct — but each worktree starts cold + +- Every linked worktree has its own `.codegraph/` (verified). DB isolation is fine. +- But every new worktree re-parses every file, re-resolves every import, re-hashes, and re-embeds — even though most files are byte-identical to another worktree's parse output. +- For a 3-file branch off main, first build costs the same as indexing from scratch. +- This is an amortization gap, not a structural failure. It is the thing omnigraph's "copy-on-write branches" would address if ported — see §3. + +### 2.5 Registry (`src/infrastructure/registry.ts`) + +Keyed by `path.basename(absRoot)` with `-2`/`-3` auto-suffix collision handling. Two worktrees register as two unrelated "repos" rather than one repo with two worktrees. UX imperfection, not a bug. + +--- + +## 3. What omnigraph does differently + +Sourced from [omnigraph.dev](https://www.omnigraph.dev/) and [ModernRelay/omnigraph README](https://github.com/ModernRelay/omnigraph). Deep internals not publicly documented; quotations below are from their marketing pages. + +### 3.1 Storage model + +- **Lance** immutable versioned columnar format — every write produces a new version, old versions stay addressable. +- **Arrow** for in-memory columnar execution. +- **DataFusion** for query planning. +- Net effect: snapshots, versioning, time-travel are the storage format, not features layered on top. + +### 3.2 Git-style semantics + +```bash +omnigraph branch create --from main feature-x ./repo.omni +omnigraph branch merge feature-x --into main ./repo.omni +``` + +- Branches are references to versions, not disk copies ("copy-on-write branching eliminates locks," per marketing — mechanism not public). +- Every mutation is a commit. +- Merge exists as a first-class operation (conflict semantics not documented). + +### 3.3 Why the model doesn't fit us + +Codegraph's core value prop is **"the graph tracks your on-disk working tree in real time, committed or not."** You edit a file, save, and `codegraph audit --quick ` reflects the change. You can run `fn-impact` on uncommitted code. + +Omnigraph's model couples graph state to commits. To port it faithfully, we'd have to either: +- Auto-commit on every save (bad — pollutes git history), or +- Accept a gap between disk and graph state (defeats dogfooding). + +So the full model is off the table. The question became: **can we take a single useful piece** without breaking the live-dev invariant? See next section. + +--- + +## 4. Proposed R1 — and why it's rejected + +### The idea + +Content-addressed parse cache at `~/.codegraph/cache/symbols/.json`: + +1. Hash each file. +2. Look up `sha256 + parser_version` in cache. +3. Hit → skip tree-sitter, reuse cached symbols/edges. +4. Miss → parse, extract, store in cache. + +Key property: cache key is content hash of on-disk bytes, **not** a commit SHA. This preserves the "graph tracks working tree live" invariant — dirty files cache just as well as committed ones. No git coupling. + +### Why I initially ranked it #1 + +The "first-build-in-new-worktree is cold" problem is real, and tree-sitter parses are content-pure (same bytes → same AST), so the cache is a correct-by-construction optimization with no invalidation complexity. + +### Why it's rejected + +**Parsing is ~10% of full-build time** (user-provided benchmark). + +- First-build-in-new-worktree: cache saves ~10%. Real but modest. +- Incremental builds: cache saves ~0% — incremental already skips unchanged files, so the parses it runs are genuine cache misses too. +- Watch mode: ~0% — same reason. + +A content-addressed cache needs: directory layout, hash scheme with parser-version keying, LRU/TTL for size management, cross-platform path handling, test coverage. Call it a few hundred LOC plus ongoing ops concerns. Shipping that for ≤10% on one narrow scenario is a bad trade. + +### If first-build-in-new-worktree latency is the actual pain point + +The remaining 90% lives in: import resolution (`domain/graph/resolve.ts`, 6-level priority resolver), DB writes (batch size, transaction boundaries, indexes), and embeddings (if enabled, often dominates). Profile one of those and there's a real target. Separate investigation from this audit. + +--- + +## 5. Recommendations + +| ID | Action | Status | +|---|---|---| +| **A1** | Fix #995 — snapshot save TOCTOU race. Use `.db.tmp-` + atomic rename, or a real file lock. | **Do** | +| **A2** | Fix #996 — lock journal mutations (or replace with a per-line append-only log). | **Do** | +| **A3** | Fix #997 — watcher updates journal header in the same critical section as entry appends. | **Do** | +| ~~R1~~ | ~~Content-addressed parse cache.~~ | **Rejected** — 10% parse share | +| R3 | Bind snapshot names to git identity + metadata + GC. | Nice-to-have. Not urgent. | +| R4 | Share more than parses across worktrees (resolved edges, embeddings) by content hash. | Speculative — conditional on verifying content-purity. Not recommended now. | +| R5 | Replace SQLite-as-materialized-graph with append-only versioned store. | Long-horizon, out of scope. | + +--- + +## 6. Honest gaps + +- I did not read omnigraph's source. Their concurrency model, merge semantics, and on-disk layout are guessed from marketing pages. +- The 10% parse-time figure is one data point, not a profile. Embedding-enabled builds will skew differently. +- I did not audit whether resolved edges are content-pure (R4 hinges on this). Probably not — resolution depends on sibling files. +- The 3 filed bugs are individually small. Together they matter because the "multi-worktree + watcher + manual build" workflow is the documented use case (CLAUDE.md "Parallel Sessions"), and that's exactly when the races fire. + +--- + +## Sources + +- [Omnigraph landing page](https://www.omnigraph.dev/) +- [Omnigraph docs index](https://www.omnigraph.dev/docs) +- [ModernRelay/omnigraph on GitHub](https://github.com/ModernRelay/omnigraph) +- Filed issues: [#995](https://github.com/optave/ops-codegraph-tool/issues/995), [#996](https://github.com/optave/ops-codegraph-tool/issues/996), [#997](https://github.com/optave/ops-codegraph-tool/issues/997) +- Local code: + - `src/features/snapshot.ts:18-131` + - `src/domain/graph/journal.ts:5-105` + - `src/db/connection.ts:102-141, 160-225, 262-299` + - `.claude/hooks/enrich-context.sh:32-34` From dcf37051d69e0579fe68545c7b8ba45085be18ba Mon Sep 17 00:00:00 2001 From: carlos-alm <127798846+carlos-alm@users.noreply.github.com> Date: Tue, 21 Apr 2026 01:14:33 -0600 Subject: [PATCH 2/3] docs: renumber recommendations to close R2 gap (#999) --- docs/reports/snapshot-and-worktree-audit.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/reports/snapshot-and-worktree-audit.md b/docs/reports/snapshot-and-worktree-audit.md index 637d71e67..58dd0f8e5 100644 --- a/docs/reports/snapshot-and-worktree-audit.md +++ b/docs/reports/snapshot-and-worktree-audit.md @@ -15,7 +15,7 @@ - #996 — journal append isn't locked - #997 — journal header lags appended entries during watch → silent performance cliff - **Omnigraph's "commits are snapshots, branches are references" model does not fit us** — it couples graph freshness to git commits, which breaks the "graph tracks your working tree live" invariant that is codegraph's whole point. -- **Content-addressed parse cache (R1 below)** was the most promising omnigraph-inspired optimization, but parsing is ~10% of build time. A cache would save ~10% on first-build-in-new-worktree and near-zero on incremental/watch. **Not worth the engineering cost.** Rejected. +- **Content-addressed parse cache (R1 in §4)** was the most promising omnigraph-inspired optimization, but parsing is ~10% of build time. A cache would save ~10% on first-build-in-new-worktree and near-zero on incremental/watch. **Not worth the engineering cost.** Rejected. - **Honest scope limit:** omnigraph's public docs are thin on internals. Claims about their concurrency model and merge semantics come from marketing pages, not source. Calibrate the "what to learn" findings accordingly. --- @@ -163,10 +163,10 @@ The remaining 90% lives in: import resolution (`domain/graph/resolve.ts`, 6-leve | **A1** | Fix #995 — snapshot save TOCTOU race. Use `.db.tmp-` + atomic rename, or a real file lock. | **Do** | | **A2** | Fix #996 — lock journal mutations (or replace with a per-line append-only log). | **Do** | | **A3** | Fix #997 — watcher updates journal header in the same critical section as entry appends. | **Do** | -| ~~R1~~ | ~~Content-addressed parse cache.~~ | **Rejected** — 10% parse share | -| R3 | Bind snapshot names to git identity + metadata + GC. | Nice-to-have. Not urgent. | -| R4 | Share more than parses across worktrees (resolved edges, embeddings) by content hash. | Speculative — conditional on verifying content-purity. Not recommended now. | -| R5 | Replace SQLite-as-materialized-graph with append-only versioned store. | Long-horizon, out of scope. | +| ~~R1~~ | ~~Content-addressed parse cache.~~ (evaluated and rejected in §4) | **Rejected** — 10% parse share | +| R2 | Bind snapshot names to git identity + metadata + GC. | Nice-to-have. Not urgent. | +| R3 | Share more than parses across worktrees (resolved edges, embeddings) by content hash. | Speculative — conditional on verifying content-purity. Not recommended now. | +| R4 | Replace SQLite-as-materialized-graph with append-only versioned store. | Long-horizon, out of scope. | --- @@ -174,7 +174,7 @@ The remaining 90% lives in: import resolution (`domain/graph/resolve.ts`, 6-leve - I did not read omnigraph's source. Their concurrency model, merge semantics, and on-disk layout are guessed from marketing pages. - The 10% parse-time figure is one data point, not a profile. Embedding-enabled builds will skew differently. -- I did not audit whether resolved edges are content-pure (R4 hinges on this). Probably not — resolution depends on sibling files. +- I did not audit whether resolved edges are content-pure (R3 hinges on this). Probably not — resolution depends on sibling files. - The 3 filed bugs are individually small. Together they matter because the "multi-worktree + watcher + manual build" workflow is the documented use case (CLAUDE.md "Parallel Sessions"), and that's exactly when the races fire. --- From e1fbd6029ff5ad4a75e82469eae7c0296e33ed87 Mon Sep 17 00:00:00 2001 From: carlos-alm <127798846+carlos-alm@users.noreply.github.com> Date: Wed, 22 Apr 2026 00:25:28 -0600 Subject: [PATCH 3/3] =?UTF-8?q?docs:=20anchor=20enrich-context.sh=20refere?= =?UTF-8?q?nce=20in=20=C2=A72.4=20(#999)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/reports/snapshot-and-worktree-audit.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reports/snapshot-and-worktree-audit.md b/docs/reports/snapshot-and-worktree-audit.md index 58dd0f8e5..de435fab7 100644 --- a/docs/reports/snapshot-and-worktree-audit.md +++ b/docs/reports/snapshot-and-worktree-audit.md @@ -78,7 +78,7 @@ Not a lock — a log line. SQLite's WAL `busy_timeout = 5000` (`connection.ts:16 ### 2.4 Per-worktree isolation is correct — but each worktree starts cold -- Every linked worktree has its own `.codegraph/` (verified). DB isolation is fine. +- Every linked worktree has its own `.codegraph/` (verified). DB isolation is fine. Our own context-enrichment hook already relies on this — `.claude/hooks/enrich-context.sh:32-34` derives the DB path from `git rev-parse --show-toplevel`, so each worktree's hook naturally reads that worktree's DB. - But every new worktree re-parses every file, re-resolves every import, re-hashes, and re-embeds — even though most files are byte-identical to another worktree's parse output. - For a 3-file branch off main, first build costs the same as indexing from scratch. - This is an amortization gap, not a structural failure. It is the thing omnigraph's "copy-on-write branches" would address if ported — see §3.