perf(native): batch-load file/symbol IDs in edges phase (#1013) #1028

carlos-alm merged 3 commits into main
Conversation
Replaces per-import and per-file `conn.query_row` calls in `build_import_edges` and `build_and_insert_call_edges` with one-shot HashMap pre-loads. Each `query_row` ran a fresh sqlite3_prepare/step/finalize cycle; on a ~470-file repo this paid ~2.3k cycles in the import-edge stage alone, dominating the edges phase.

Also chunks the import-edge insert into multi-row `VALUES` batches (199 rows × 5 params), mirroring `edges_db::do_insert_edges`, to remove the per-row prepared-statement bind/step/reset overhead.

Self-build benchmark (codegraph on itself, 743 files):

- edges: 124.5 ms native vs 193 ms wasm (was 310 ms / 179 ms in 3.9.5)
- roles: 73.7 ms native vs 70 ms wasm (was 269 ms / 62 ms in 3.9.5)

Roles converged as a side effect — its prior tail was driven by shared SQLite cache pressure during the same build session.

Closes #1013 (docs check acknowledged: internal perf fix, no API/language/feature surface changes)
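The multi-row `VALUES` batching described above can be sketched stdlib-only. This is a hypothetical illustration of how the numbered-placeholder SQL for one 199-row × 5-param chunk might be assembled; `build_insert_sql` and the column list are stand-ins, not the PR's actual code.

```rust
/// Build the SQL for one multi-row INSERT chunk: `rows` rows of
/// `columns.len()` numbered placeholders each (?1..?N), mirroring the
/// 199-row x 5-param batches described above. Hypothetical helper.
fn build_insert_sql(table: &str, columns: &[&str], rows: usize) -> String {
    let cols = columns.len();
    let values: Vec<String> = (0..rows)
        .map(|r| {
            // Placeholders are numbered consecutively across all rows.
            let params: Vec<String> = (0..cols)
                .map(|c| format!("?{}", r * cols + c + 1))
                .collect();
            format!("({})", params.join(","))
        })
        .collect();
    format!(
        "INSERT OR IGNORE INTO {} ({}) VALUES {}",
        table,
        columns.join(","),
        values.join(",")
    )
}

fn main() {
    let sql = build_insert_sql(
        "edges",
        &["source_id", "target_id", "kind", "line", "meta"],
        199,
    );
    // 199 rows x 5 params = 995 numbered placeholders, ?1 through ?995.
    assert!(sql.ends_with("(?991,?992,?993,?994,?995)"));
}
```

A single prepared statement per chunk then binds all 995 values and executes once, instead of paying a bind/step/reset cycle per row.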
**Claude finished @carlos-alm's task**

### Performance Optimization PR Review

**Root Cause Analysis**

Problem being solved: SQLite performance bottleneck in the native engine's edges phase, where individual per-row `query_row` calls each paid a fresh prepare/step/finalize cycle.

Why this approach: the fix uses classic database optimization patterns — batch pre-loads into HashMaps and chunked multi-row inserts.

Risk assessment: Low risk. Changes are contained to SQLite interaction patterns without altering logic, and mirror existing proven patterns from `edges_db::do_insert_edges`.

**Backlog Compliance**
**Technical Analysis**

Code Quality: ⭐⭐⭐⭐⭐. The implementation is excellent:

`build_pipeline.rs:1084-1101`: File node ID batch loading

```rust
let file_node_ids: HashMap<String, u32> = {
    let mut map = HashMap::new();
    if let Ok(mut stmt) = conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0") {
        // Bulk load all file nodes in one query
    }
    map
};
```

`import_edges.rs:159-198`: Symbol node ID batch loading

`import_edges.rs:328-379`: Chunked multi-row inserts
Performance Claims: ⭐⭐⭐⭐⭐. Benchmarks show dramatic improvements (edges: 310 ms → 124.5 ms native; roles: 269 ms → 73.7 ms).

The "roles convergence as side effect" explanation makes technical sense: reduced SQLite cache pressure during the build improves subsequent operations.

Test Coverage: ⭐⭐⭐☆☆

Error Handling: ⭐⭐⭐⭐☆. Good defensive programming.
**Critical Concerns**

None identified. This is a well-executed performance optimization.

**Minor Observations**

**Final Recommendation**

This PR successfully makes the native engine faster than WASM for the edges phase, directly supporting the project's performance goals.
**Greptile Summary**

This PR replaces per-row `conn.query_row` lookups with batched HashMap pre-loads and chunked multi-row inserts.

**Confidence Score: 5/5**

Safe to merge — all previously flagged issues are resolved and the new batch logic is correct. No P0 or P1 findings. The pre-load queries are equivalent to the old per-row lookups, the chunked symbol lookup faithfully replicates barrel-resolution logic, numbered `?N` parameters align correctly with `raw_bind_parameter` indices, and `INSERT OR IGNORE` semantics mean constraint conflicts within a chunk are still silently skipped at the SQL level. All concerns raised in prior review threads are addressed in the head commit. No files require special attention.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant BP as build_pipeline
    participant IE as import_edges
    participant DB as SQLite
    Note over BP,DB: Old path (per-row)
    loop for each file in file_symbols
        BP->>DB: query_row (prepare/step/finalize)
        DB-->>BP: file_node_id
    end
    loop for each edge
        BP->>DB: INSERT one row (prepare/bind/step/reset)
    end
    Note over BP,DB: New path (batch)
    BP->>DB: SELECT file,id WHERE kind='file' AND name=file
    DB-->>BP: all file node IDs → HashMap
    IE->>IE: collect_type_only_lookup_pairs()
    IE->>DB: SELECT name,file,id WHERE (name,file) IN (...) [chunked, ≤332 pairs]
    DB-->>IE: symbol node IDs → HashMap
    loop build edges (HashMap lookups only, no DB round-trips)
        IE->>IE: file_node_ids.get() / symbol_node_ids.get()
    end
    loop for each 199-row chunk
        IE->>DB: INSERT OR IGNORE ... VALUES (?1,…,?995) [raw_bind + raw_execute]
    end
```

Reviews (3): Last reviewed commit: "fix(native): surface bind/execute errors..."
```rust
fn load_symbol_node_ids(conn: &Connection) -> HashMap<(String, String), i64> {
    let mut map = HashMap::new();
    if let Ok(mut stmt) =
        conn.prepare("SELECT name, file, id FROM nodes WHERE kind != 'file'")
    {
        if let Ok(rows) = stmt.query_map([], |row| {
            Ok((
                row.get::<_, String>(0)?,
                row.get::<_, String>(1)?,
                row.get::<_, i64>(2)?,
            ))
        }) {
            for r in rows.flatten() {
                map.entry((r.0, r.1)).or_insert(r.2);
            }
        }
    }
    map
}
```
**Unbounded full-table scan into memory for symbol nodes**

`SELECT name, file, id FROM nodes WHERE kind != 'file'` has no row limit and loads every non-file symbol into a single HashMap. For large monorepos (100k+ symbols) this can easily push several hundred MB of resident RAM in a single build pass. The old `get_symbol_node_id` was lazy — it only hit SQLite for the specific `(name, file)` pairs that were actually needed, and only when type-only imports existed.

Consider capping with a two-phase approach: collect the distinct `(name, file)` pairs needed while iterating `ctx.file_symbols`, then issue a single `WHERE (name, file) IN (…)` query for only those pairs. This keeps the one-round-trip benefit without pulling the entire nodes table into memory.
Fixed in 1081b64. Replaced the unbounded `SELECT name, file, id FROM nodes WHERE kind != 'file'` with a chunked `(name, file) IN (...)` query. `build_import_edges` now walks type-only imports up front (resolving barrels the same way the edge loop does) to collect the distinct `(name, file)` pairs that will actually be looked up, then `load_symbol_node_ids` issues those in 332-pair chunks (664 binds, well under `SQLITE_MAX_VARIABLE_NUMBER`'s legacy 999). The full-table-scan path is gone — even on 100k+-symbol monorepos, only the small slice referenced by type-only imports is hit, while preserving the one-round-trip win.
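The chunking arithmetic in that fix can be sketched stdlib-only. `pair_lookup_sql` below is a hypothetical helper; the 332-pair cap and the `(name, file) IN (...)` shape come from the fix description, while the exact SQL text is illustrative.

```rust
/// Build one chunk's pair-lookup SQL using SQLite row values:
/// (name, file) IN (VALUES (?1,?2),(?3,?4),...). Hypothetical sketch.
fn pair_lookup_sql(pair_count: usize) -> String {
    let tuples: Vec<String> = (0..pair_count)
        .map(|i| format!("(?{},?{})", 2 * i + 1, 2 * i + 2))
        .collect();
    format!(
        "SELECT name, file, id FROM nodes WHERE (name, file) IN (VALUES {})",
        tuples.join(",")
    )
}

fn main() {
    const PAIR_CHUNK: usize = 332; // 664 binds per statement, under the legacy 999 cap
    let pairs = 1000; // say, 1000 distinct (name, file) pairs to resolve
    let chunks = (pairs + PAIR_CHUNK - 1) / PAIR_CHUNK; // ceiling division
    assert_eq!(chunks, 4); // 332 + 332 + 332 + 4
    assert!(pair_lookup_sql(PAIR_CHUNK).ends_with("(?663,?664))"));
}
```

Each chunk is one round trip binding at most 664 parameters, so even a large pair set resolves in a handful of statements.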
```rust
if let Ok(mut stmt) =
    conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0")
{
    if let Ok(rows) = stmt.query_map([], |row| {
        Ok((row.get::<_, String>(0)?, row.get::<_, i64>(1)? as u32))
    }) {
        for r in rows.flatten() {
            map.insert(r.0, r.1);
        }
    }
}
map
};
```
**Query drops `name = ?` filter; semantics diverge from old lookup**

The original query was:

`SELECT id FROM nodes WHERE name = ? AND kind = 'file' AND file = ? AND line = 0`

Both bind parameters were `rel_path`, so only nodes where `name = file = rel_path` were matched.

The new bulk query is `SELECT file, id … WHERE kind = 'file' AND line = 0` — the `name` column is no longer checked, and any collision where two file-kind nodes share the same `file` value (different `name`) would silently overwrite an earlier map entry. The same pattern is repeated in `import_edges.rs::load_file_node_ids`. While `name = file` is the overwhelming convention for file nodes, the prior guard was explicit and free; dropping it is a silent behavioral change worth documenting or re-adding as a `WHERE name = file` predicate if the column is guaranteed redundant.
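The overwrite hazard described here can be shown with a toy, stand-alone sketch. `load_file_ids` and the rows are hypothetical; this is not the PR's code, just the map-keying behavior in isolation.

```rust
use std::collections::HashMap;

/// Toy stand-in for the bulk file-node pre-load: keyed by `file`,
/// optionally enforcing the `name = file` guard under discussion.
fn load_file_ids(
    rows: &[(&'static str, &'static str, u32)], // (name, file, id)
    guard_name_eq_file: bool,
) -> HashMap<&'static str, u32> {
    let mut map = HashMap::new();
    for &(name, file, id) in rows {
        if !guard_name_eq_file || name == file {
            map.insert(file, id); // later rows overwrite earlier ones
        }
    }
    map
}

fn main() {
    let rows = [
        ("src/a.ts", "src/a.ts", 1), // name == file: the convention
        ("ghost", "src/a.ts", 2),    // unrelated row, same file value
    ];
    // Without the guard, the later row silently overwrites id 1...
    assert_eq!(load_file_ids(&rows, false)["src/a.ts"], 2);
    // ...with it, only the conventional row survives.
    assert_eq!(load_file_ids(&rows, true)["src/a.ts"], 1);
}
```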
Fixed in 1081b64. Added `AND name = file` to both the `load_file_node_ids` query in `import_edges.rs` and the parallel pre-load in `build_and_insert_call_edges` in `build_pipeline.rs`. This restores the explicit guard from the legacy per-row lookup (`WHERE name = ? AND file = ?` with both binds set to `rel_path`), so an unrelated file-kind row sharing the same `file` value but different `name` can no longer silently overwrite the map entry.
**Codegraph Impact Analysis**

7 functions changed → 3 callers affected across 2 files
Address two P2 review concerns from greptile on #1013:

1. **Bounded symbol-node lookup.** `load_symbol_node_ids` previously did an unbounded `SELECT name, file, id FROM nodes WHERE kind != 'file'`, loading every non-file symbol into memory. On 100k+-symbol monorepos this could push hundreds of MB. Now we walk type-only imports up front to collect the distinct `(name, file)` pairs we'll actually need, then issue a chunked `(name, file) IN (...)` query (332 pairs per chunk × 2 binds = 664, safely under `SQLITE_MAX_VARIABLE_NUMBER`). The full-scan path is gone; only the symbols referenced by type-only imports are hit, preserving the one-round-trip win without the memory blow-up.

2. **Restore `name = file` guard on file-node lookups.** The original per-row query bound `rel_path` to both `name = ?` and `file = ?`, matching only nodes where the two columns agreed. The bulk query keyed on `file` alone, so an unrelated file-kind row sharing the same `file` value (different `name`) could silently overwrite the map entry. Add `AND name = file` to both `load_file_node_ids` and the parallel pre-load in `build_and_insert_call_edges` to keep the legacy semantics explicit.
Greptile flagged two P2 issues in `insert_edges`:

1. Silent bind failures: `let _ = stmt.raw_bind_parameter(...)` silently discarded errors. A failed bind would leave that position unbound (NULL), causing `raw_execute()` to insert ghost edge rows with NULL `source_id`/`target_id`. Bind/execute now run inside a fallible `insert_edge_chunk` helper; failures emit a stderr warning and the chunk is skipped instead of producing partial rows.

2. `prepare_cached` mismatched with dynamic SQL: the SQL string varies with chunk length, so trailing partial chunks were always cache misses. Switched to plain `tx.prepare(&sql)` to match intent.

No behavioural change for the success path. `cargo test -p codegraph-core --lib`: 181 passed.
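The fallible-helper pattern from point 1 can be sketched without SQLite: a toy `Stmt` stands in for a prepared statement, and `insert_edge_chunk` propagates the first bind error instead of discarding it. All names here are illustrative stand-ins, not the PR's actual types.

```rust
/// Toy prepared statement: one slot per SQL parameter.
struct Stmt {
    bound: Vec<Option<i64>>,
}

impl Stmt {
    fn raw_bind_parameter(&mut self, one_based: usize, v: i64) -> Result<(), String> {
        if one_based == 0 || one_based > self.bound.len() {
            return Err(format!("bind index {} out of range", one_based));
        }
        self.bound[one_based - 1] = Some(v);
        Ok(())
    }

    fn raw_execute(&mut self) -> Result<usize, String> {
        // An unbound slot would become NULL in real SQLite; reject it here.
        if self.bound.iter().any(|b| b.is_none()) {
            return Err("unbound parameter".into());
        }
        Ok(self.bound.len())
    }
}

/// Bind every value, then execute; any error aborts the whole chunk.
fn insert_edge_chunk(stmt: &mut Stmt, values: &[i64]) -> Result<usize, String> {
    for (i, v) in values.iter().enumerate() {
        stmt.raw_bind_parameter(i + 1, *v)?; // was: let _ = ...
    }
    stmt.raw_execute()
}

fn main() {
    let mut ok = Stmt { bound: vec![None; 3] };
    assert_eq!(insert_edge_chunk(&mut ok, &[1, 2, 3]), Ok(3));

    // A bad bind now surfaces as an error instead of leaving a NULL slot.
    let mut bad = Stmt { bound: vec![None; 2] };
    assert!(insert_edge_chunk(&mut bad, &[1, 2, 3]).is_err());
}
```

The caller can log the returned error and skip the chunk, which matches the "warn and skip, no partial rows" behavior the commit describes.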
Addressed both P2 notes from the re-review summary in cb492ef.

`cargo check` + `cargo test -p codegraph-core --lib` (181 passed). Biome lint clean.
Summary
Closes #1013.
- Replaces per-import and per-file `conn.query_row` lookups in `build_import_edges` and `build_and_insert_call_edges` with one-shot HashMap pre-loads. Each `query_row` ran a fresh `sqlite3_prepare/step/finalize` cycle; on the codegraph self-build (~470 files) this paid ~2.3k cycles in the import-edge stage alone, dominating the edges phase.
- Chunks the import-edge insert into multi-row `VALUES` batches (199 rows × 5 params), mirroring `edges_db::do_insert_edges`, to remove the per-row prepared-statement bind/step/reset overhead.

Benchmark (codegraph self-build, 743 files)
| phase | native (this PR) | wasm | native (3.9.5) | wasm (3.9.5) |
|-------|------------------|------|----------------|--------------|
| edges | 124.5 ms | 193 ms | 310 ms | 179 ms |
| roles | 73.7 ms | 70 ms | 269 ms | 62 ms |

Both meet the issue's ≤1.2× WASM acceptance criterion. Native edges is now faster than WASM.
Test plan
- `cargo build --release -p codegraph-core` — clean (5 pre-existing warnings)
- `cargo test --release` — 181/181 passing (incl. `import_edges::tests::barrel_detection`)
- `npm run benchmark` — confirms the edges/roles ratios above
- `codegraph build .` + `codegraph stats` on a fresh copy — same node/edge counts and role classification as before (1 changed, 0 removed, full pipeline runs cleanly)
- `npx vitest run` is currently broken in this environment due to a pre-existing Node 24 incompatibility (`--strip-types` is not allowed in `NODE_OPTIONS` from `vitest.config.ts`); CI runs Node 22, where this works. Functional verification was done via the benchmark + manual build instead.