perf(native): batch-load file/symbol IDs in edges phase (#1013) #1028

carlos-alm merged 3 commits into main
Conversation
Replaces per-import and per-file `conn.query_row` calls in `build_import_edges` and `build_and_insert_call_edges` with one-shot HashMap pre-loads. Each `query_row` ran a fresh sqlite3_prepare/step/finalize cycle; on a ~470-file repo this paid ~2.3k cycles in the import-edge stage alone, dominating the edges phase.

Also chunks the import-edge insert into multi-row `VALUES` batches (199 rows × 5 params), mirroring `edges_db::do_insert_edges`, to remove the per-row prepared-statement bind/step/reset overhead.

Self-build benchmark (codegraph on itself, 743 files):

- edges: 124.5 ms native vs 193 ms wasm (was 310 ms / 179 ms in 3.9.5)
- roles: 73.7 ms native vs 70 ms wasm (was 269 ms / 62 ms in 3.9.5)

Roles converged as a side effect — its prior tail was driven by shared SQLite cache pressure during the same build session.

Closes #1013 (docs check acknowledged: internal perf fix, no API/language/feature surface changes)
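The multi-row `VALUES` batching described above can be sketched stdlib-only. This is a hypothetical illustration of how the numbered-placeholder SQL for one 199-row × 5-param chunk might be assembled; `build_insert_sql` and the column list are stand-ins, not the PR's actual code.

```rust
/// Build the SQL for one multi-row INSERT chunk: `rows` rows of
/// `columns.len()` numbered placeholders each (?1..?N), mirroring the
/// 199-row x 5-param batches described above. Hypothetical helper.
fn build_insert_sql(table: &str, columns: &[&str], rows: usize) -> String {
    let cols = columns.len();
    let values: Vec<String> = (0..rows)
        .map(|r| {
            // Placeholders are numbered consecutively across all rows.
            let params: Vec<String> = (0..cols)
                .map(|c| format!("?{}", r * cols + c + 1))
                .collect();
            format!("({})", params.join(","))
        })
        .collect();
    format!(
        "INSERT OR IGNORE INTO {} ({}) VALUES {}",
        table,
        columns.join(","),
        values.join(",")
    )
}

fn main() {
    let sql = build_insert_sql(
        "edges",
        &["source_id", "target_id", "kind", "line", "meta"],
        199,
    );
    // 199 rows x 5 params = 995 numbered placeholders, ?1 through ?995.
    assert!(sql.ends_with("(?991,?992,?993,?994,?995)"));
}
```

A single prepared statement per chunk then binds all 995 values and executes once, instead of paying a bind/step/reset cycle per row.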
**Claude finished @carlos-alm's task**

### Performance Optimization PR Review

**Root Cause Analysis**

Problem being solved: SQLite performance bottleneck in the native engine's edges phase, where individual per-row `query_row` calls each paid a fresh prepare/step/finalize cycle.

Why this approach: the fix uses classic database optimization patterns — batch pre-loads into HashMaps and chunked multi-row inserts.

Risk assessment: Low risk. Changes are contained to SQLite interaction patterns without altering logic, and mirror existing proven patterns from `edges_db::do_insert_edges`.

**Backlog Compliance**
**Technical Analysis**

Code Quality: ⭐⭐⭐⭐⭐. The implementation is excellent:

`build_pipeline.rs:1084-1101`: File node ID batch loading

```rust
let file_node_ids: HashMap<String, u32> = {
    let mut map = HashMap::new();
    if let Ok(mut stmt) = conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0") {
        // Bulk load all file nodes in one query
    }
    map
};
```

`import_edges.rs:159-198`: Symbol node ID batch loading

`import_edges.rs:328-379`: Chunked multi-row inserts
Performance Claims: ⭐⭐⭐⭐⭐. Benchmarks show dramatic improvements (edges: 310 ms → 124.5 ms native; roles: 269 ms → 73.7 ms).

The "roles convergence as side effect" explanation makes technical sense: reduced SQLite cache pressure during the build improves subsequent operations.

Test Coverage: ⭐⭐⭐☆☆

Error Handling: ⭐⭐⭐⭐☆. Good defensive programming.
**Critical Concerns**

None identified. This is a well-executed performance optimization.

**Minor Observations**

**Final Recommendation**

This PR successfully makes the native engine faster than WASM for the edges phase, directly supporting the project's performance goals.
**Greptile Summary**

This PR replaces per-row `conn.query_row` lookups with batched HashMap pre-loads and chunked multi-row inserts.

**Confidence Score: 5/5**

Safe to merge — all previously flagged issues are resolved and the new batch logic is correct. No P0 or P1 findings. The pre-load queries are equivalent to the old per-row lookups, the chunked symbol lookup faithfully replicates barrel-resolution logic, numbered `?N` parameters align correctly with `raw_bind_parameter` indices, and `INSERT OR IGNORE` semantics mean constraint conflicts within a chunk are still silently skipped at the SQL level. All concerns raised in prior review threads are addressed in the head commit. No files require special attention.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant BP as build_pipeline
    participant IE as import_edges
    participant DB as SQLite
    Note over BP,DB: Old path (per-row)
    loop for each file in file_symbols
        BP->>DB: query_row (prepare/step/finalize)
        DB-->>BP: file_node_id
    end
    loop for each edge
        BP->>DB: INSERT one row (prepare/bind/step/reset)
    end
    Note over BP,DB: New path (batch)
    BP->>DB: SELECT file,id WHERE kind='file' AND name=file
    DB-->>BP: all file node IDs → HashMap
    IE->>IE: collect_type_only_lookup_pairs()
    IE->>DB: SELECT name,file,id WHERE (name,file) IN (...) [chunked, ≤332 pairs]
    DB-->>IE: symbol node IDs → HashMap
    loop build edges (HashMap lookups only, no DB round-trips)
        IE->>IE: file_node_ids.get() / symbol_node_ids.get()
    end
    loop for each 199-row chunk
        IE->>DB: INSERT OR IGNORE ... VALUES (?1,…,?995) [raw_bind + raw_execute]
    end
```

Reviews (3): Last reviewed commit: "fix(native): surface bind/execute errors..."
```rust
fn load_symbol_node_ids(conn: &Connection) -> HashMap<(String, String), i64> {
    let mut map = HashMap::new();
    if let Ok(mut stmt) =
        conn.prepare("SELECT name, file, id FROM nodes WHERE kind != 'file'")
    {
        if let Ok(rows) = stmt.query_map([], |row| {
            Ok((
                row.get::<_, String>(0)?,
                row.get::<_, String>(1)?,
                row.get::<_, i64>(2)?,
            ))
        }) {
            for r in rows.flatten() {
                map.entry((r.0, r.1)).or_insert(r.2);
            }
        }
    }
    map
}
```
**Unbounded full-table scan into memory for symbol nodes**

`SELECT name, file, id FROM nodes WHERE kind != 'file'` has no row limit and loads every non-file symbol into a single HashMap. For large monorepos (100k+ symbols) this can easily push several hundred MB of resident RAM in a single build pass. The old `get_symbol_node_id` was lazy — it only hit SQLite for the specific `(name, file)` pairs that were actually needed, and only when type-only imports existed.

Consider capping with a two-phase approach: collect the distinct `(name, file)` pairs needed while iterating `ctx.file_symbols`, then issue a single `WHERE (name, file) IN (…)` query for only those pairs. This keeps the one-round-trip benefit without pulling the entire nodes table into memory.
Fixed in 1081b64. Replaced the unbounded `SELECT name, file, id FROM nodes WHERE kind != 'file'` with a chunked `(name, file) IN (...)` query. `build_import_edges` now walks type-only imports up front (resolving barrels the same way the edge loop does) to collect the distinct `(name, file)` pairs that will actually be looked up, then `load_symbol_node_ids` issues those in 332-pair chunks (664 binds, well under `SQLITE_MAX_VARIABLE_NUMBER`'s legacy 999). The full-table-scan path is gone — even on 100k+-symbol monorepos, only the small slice referenced by type-only imports is hit, while preserving the one-round-trip win.
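The chunking arithmetic in that fix can be sketched stdlib-only. `pair_lookup_sql` below is a hypothetical helper; the 332-pair cap and the `(name, file) IN (...)` shape come from the fix description, while the exact SQL text is illustrative.

```rust
/// Build one chunk's pair-lookup SQL using SQLite row values:
/// (name, file) IN (VALUES (?1,?2),(?3,?4),...). Hypothetical sketch.
fn pair_lookup_sql(pair_count: usize) -> String {
    let tuples: Vec<String> = (0..pair_count)
        .map(|i| format!("(?{},?{})", 2 * i + 1, 2 * i + 2))
        .collect();
    format!(
        "SELECT name, file, id FROM nodes WHERE (name, file) IN (VALUES {})",
        tuples.join(",")
    )
}

fn main() {
    const PAIR_CHUNK: usize = 332; // 664 binds per statement, under the legacy 999 cap
    let pairs = 1000; // say, 1000 distinct (name, file) pairs to resolve
    let chunks = (pairs + PAIR_CHUNK - 1) / PAIR_CHUNK; // ceiling division
    assert_eq!(chunks, 4); // 332 + 332 + 332 + 4
    assert!(pair_lookup_sql(PAIR_CHUNK).ends_with("(?663,?664))"));
}
```

Each chunk is one round trip binding at most 664 parameters, so even a large pair set resolves in a handful of statements.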
```rust
if let Ok(mut stmt) =
    conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0")
{
    if let Ok(rows) = stmt.query_map([], |row| {
        Ok((row.get::<_, String>(0)?, row.get::<_, i64>(1)? as u32))
    }) {
        for r in rows.flatten() {
            map.insert(r.0, r.1);
        }
    }
}
map
};
```
**Query drops `name = ?` filter; semantics diverge from old lookup**

The original query was:

`SELECT id FROM nodes WHERE name = ? AND kind = 'file' AND file = ? AND line = 0`

Both bind parameters were `rel_path`, so only nodes where `name = file = rel_path` were matched.

The new bulk query is `SELECT file, id … WHERE kind = 'file' AND line = 0` — the `name` column is no longer checked, and any collision where two file-kind nodes share the same `file` value (different `name`) would silently overwrite an earlier map entry. The same pattern is repeated in `import_edges.rs::load_file_node_ids`. While `name = file` is the overwhelming convention for file nodes, the prior guard was explicit and free; dropping it is a silent behavioral change worth documenting or re-adding as a `WHERE name = file` predicate if the column is guaranteed redundant.
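The overwrite hazard described here can be shown with a toy, stand-alone sketch. `load_file_ids` and the rows are hypothetical; this is not the PR's code, just the map-keying behavior in isolation.

```rust
use std::collections::HashMap;

/// Toy stand-in for the bulk file-node pre-load: keyed by `file`,
/// optionally enforcing the `name = file` guard under discussion.
fn load_file_ids(
    rows: &[(&'static str, &'static str, u32)], // (name, file, id)
    guard_name_eq_file: bool,
) -> HashMap<&'static str, u32> {
    let mut map = HashMap::new();
    for &(name, file, id) in rows {
        if !guard_name_eq_file || name == file {
            map.insert(file, id); // later rows overwrite earlier ones
        }
    }
    map
}

fn main() {
    let rows = [
        ("src/a.ts", "src/a.ts", 1), // name == file: the convention
        ("ghost", "src/a.ts", 2),    // unrelated row, same file value
    ];
    // Without the guard, the later row silently overwrites id 1...
    assert_eq!(load_file_ids(&rows, false)["src/a.ts"], 2);
    // ...with it, only the conventional row survives.
    assert_eq!(load_file_ids(&rows, true)["src/a.ts"], 1);
}
```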
Fixed in 1081b64. Added `AND name = file` to both the `load_file_node_ids` query in `import_edges.rs` and the parallel pre-load in `build_and_insert_call_edges` in `build_pipeline.rs`. This restores the explicit guard from the legacy per-row lookup (`WHERE name = ? AND file = ?` with both binds set to `rel_path`), so an unrelated file-kind row sharing the same `file` value but different `name` can no longer silently overwrite the map entry.
**Codegraph Impact Analysis**

7 functions changed → 3 callers affected across 2 files
Address two P2 review concerns from greptile on #1013:

1. **Bounded symbol-node lookup.** `load_symbol_node_ids` previously did an unbounded `SELECT name, file, id FROM nodes WHERE kind != 'file'`, loading every non-file symbol into memory. On 100k+-symbol monorepos this could push hundreds of MB. Now we walk type-only imports up front to collect the distinct `(name, file)` pairs we'll actually need, then issue a chunked `(name, file) IN (...)` query (332 pairs per chunk × 2 binds = 664, safely under `SQLITE_MAX_VARIABLE_NUMBER`). The full-scan path is gone; only the symbols referenced by type-only imports are hit, preserving the one-round-trip win without the memory blow-up.

2. **Restore `name = file` guard on file-node lookups.** The original per-row query bound `rel_path` to both `name = ?` and `file = ?`, matching only nodes where the two columns agreed. The bulk query keyed on `file` alone, so an unrelated file-kind row sharing the same `file` value (different `name`) could silently overwrite the map entry. Add `AND name = file` to both `load_file_node_ids` and the parallel pre-load in `build_and_insert_call_edges` to keep the legacy semantics explicit.
Greptile flagged two P2 issues in `insert_edges`:

1. Silent bind failures: `let _ = stmt.raw_bind_parameter(...)` silently discarded errors. A failed bind would leave that position unbound (NULL), causing `raw_execute()` to insert ghost edge rows with NULL `source_id`/`target_id`. Bind/execute now run inside a fallible `insert_edge_chunk` helper; failures emit a stderr warning and the chunk is skipped instead of producing partial rows.

2. `prepare_cached` mismatched with dynamic SQL: the SQL string varies with chunk length, so trailing partial chunks were always cache misses. Switched to plain `tx.prepare(&sql)` to match intent.

No behavioural change for the success path. `cargo test -p codegraph-core --lib`: 181 passed.
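The fallible-helper pattern from point 1 can be sketched without SQLite: a toy `Stmt` stands in for a prepared statement, and `insert_edge_chunk` propagates the first bind error instead of discarding it. All names here are illustrative stand-ins, not the PR's actual types.

```rust
/// Toy prepared statement: one slot per SQL parameter.
struct Stmt {
    bound: Vec<Option<i64>>,
}

impl Stmt {
    fn raw_bind_parameter(&mut self, one_based: usize, v: i64) -> Result<(), String> {
        if one_based == 0 || one_based > self.bound.len() {
            return Err(format!("bind index {} out of range", one_based));
        }
        self.bound[one_based - 1] = Some(v);
        Ok(())
    }

    fn raw_execute(&mut self) -> Result<usize, String> {
        // An unbound slot would become NULL in real SQLite; reject it here.
        if self.bound.iter().any(|b| b.is_none()) {
            return Err("unbound parameter".into());
        }
        Ok(self.bound.len())
    }
}

/// Bind every value, then execute; any error aborts the whole chunk.
fn insert_edge_chunk(stmt: &mut Stmt, values: &[i64]) -> Result<usize, String> {
    for (i, v) in values.iter().enumerate() {
        stmt.raw_bind_parameter(i + 1, *v)?; // was: let _ = ...
    }
    stmt.raw_execute()
}

fn main() {
    let mut ok = Stmt { bound: vec![None; 3] };
    assert_eq!(insert_edge_chunk(&mut ok, &[1, 2, 3]), Ok(3));

    // A bad bind now surfaces as an error instead of leaving a NULL slot.
    let mut bad = Stmt { bound: vec![None; 2] };
    assert!(insert_edge_chunk(&mut bad, &[1, 2, 3]).is_err());
}
```

The caller can log the returned error and skip the chunk, which matches the "warn and skip, no partial rows" behavior the commit describes.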
Addressed both P2 notes from the re-review summary in cb492ef.

`cargo check` + `cargo test -p codegraph-core --lib` (181 passed). Biome lint clean.
Summary
Closes #1013.
- Replaces per-import and per-file `conn.query_row` lookups in `build_import_edges` and `build_and_insert_call_edges` with one-shot HashMap pre-loads. Each `query_row` ran a fresh `sqlite3_prepare/step/finalize` cycle; on the codegraph self-build (~470 files) this paid ~2.3k cycles in the import-edge stage alone, dominating the edges phase.
- Chunks the import-edge insert into multi-row `VALUES` batches (199 rows × 5 params), mirroring `edges_db::do_insert_edges`, to remove the per-row prepared-statement bind/step/reset overhead.

Benchmark (codegraph self-build, 743 files)
| phase | native (this PR) | wasm | native (3.9.5) | wasm (3.9.5) |
|-------|------------------|------|----------------|--------------|
| edges | 124.5 ms | 193 ms | 310 ms | 179 ms |
| roles | 73.7 ms | 70 ms | 269 ms | 62 ms |

Both meet the issue's ≤1.2× WASM acceptance criterion. Native edges is now faster than WASM.
Test plan
- `cargo build --release -p codegraph-core` — clean (5 pre-existing warnings)
- `cargo test --release` — 181/181 passing (incl. `import_edges::tests::barrel_detection`)
- `npm run benchmark` — confirms the edges/roles ratios above
- `codegraph build .` + `codegraph stats` on a fresh copy — same node/edge counts and role classification as before (1 changed, 0 removed, full pipeline runs cleanly)
- `npx vitest run` is currently broken in this environment due to a pre-existing Node 24 incompatibility (`--strip-types` is not allowed in `NODE_OPTIONS` from `vitest.config.ts`); CI runs Node 22, where this works. Functional verification was done via the benchmark + manual build instead.