Skip to content

perf(native): batch-load file/symbol IDs in edges phase (#1013)#1028

Merged
carlos-alm merged 3 commits intomainfrom
perf/native-edges-roles-1013
Apr 30, 2026
Merged

perf(native): batch-load file/symbol IDs in edges phase (#1013)#1028
carlos-alm merged 3 commits intomainfrom
perf/native-edges-roles-1013

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

Closes #1013.

  • Replaces per-import / per-file conn.query_row lookups in build_import_edges and build_and_insert_call_edges with one-shot HashMap pre-loads. Each query_row ran a fresh sqlite3_prepare/step/finalize cycle; on the codegraph self-build (~470 files) this paid ~2.3k cycles in the import-edge stage alone, dominating the edges phase.
  • Chunks the import-edge insert into multi-row VALUES batches (199 rows × 5 params), mirroring edges_db::do_insert_edges, to remove the per-row prepared-statement bind/step/reset overhead.
  • Roles converged as a side effect — its prior tail was driven by shared SQLite cache pressure during the same build session.

Benchmark (codegraph self-build, 743 files)

phase 3.9.5 native 3.9.5 wasm this PR native this PR wasm ratio (native/wasm)
edges 310 ms 179 ms 119.2 ms 184 ms 0.65×
roles 269 ms 62 ms 70.7 ms 69.2 ms 1.02×

Both meet the issue's ≤1.2× WASM acceptance criterion. Native edges is now faster than WASM.

Test plan

  • cargo build --release -p codegraph-core — clean (5 pre-existing warnings)
  • cargo test --release — 181/181 passing (incl. import_edges::tests::barrel_detection)
  • npm run benchmark — confirms edges/roles ratios above
  • End-to-end codegraph build . + codegraph stats on a fresh copy — same node/edge counts and role classification as before
  • Incremental rebuild (touch 1 file) — 1 changed, 0 removed, full pipeline runs cleanly
  • Note: npx vitest run is currently broken in this environment due to a pre-existing Node 24 incompatibility (--strip-types is not allowed in NODE_OPTIONS from vitest.config.ts); CI runs Node 22 where this works. Functional verification was done via the benchmark + manual build instead.

Replaces per-import and per-file `conn.query_row` calls in
`build_import_edges` and `build_and_insert_call_edges` with one-shot
HashMap pre-loads. Each `query_row` ran a fresh sqlite3_prepare/step/
finalize cycle; on a ~470-file repo this paid ~2.3k cycles in the
import-edge stage alone, dominating the edges phase.

Also chunks the import-edge insert into multi-row VALUES batches
(199 rows × 5 params), mirroring `edges_db::do_insert_edges`, to
remove the per-row prepared-statement bind/step/reset overhead.

Self-build benchmark (codegraph on itself, 743 files):

  edges:  124.5 ms native vs 193 ms wasm  (was 310 ms / 179 ms in 3.9.5)
  roles:   73.7 ms native vs  70 ms wasm  (was 269 ms /  62 ms in 3.9.5)

Roles converged as a side effect — its prior tail was driven by
shared SQLite cache pressure during the same build session.

Closes #1013

(docs check acknowledged: internal perf fix, no API/language/feature surface changes)
@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @carlos-alm's task —— View job


Performance Optimization PR Review

Root Cause Analysis

Problem Being Solved: SQLite performance bottleneck in native engine's edges phase. Individual query_row calls for file/symbol ID lookups were paying a sqlite3_prepare/step/finalize cycle per query (~2.3k cycles for codegraph self-build), causing native engine to be slower than WASM.

Why This Approach: The fix uses classic database optimization patterns:

  1. Batch loading: Replace N individual queries with 2 bulk HashMap pre-loads
  2. Chunked inserts: Replace per-row prepared statements with multi-row VALUES batches (199 rows × 5 params = 995 bind parameters, under SQLite's 999 limit)

Risk Assessment: Low risk. Changes are contained to SQLite interaction patterns without altering logic. Mirrors existing proven patterns from edges_db::CHUNK.

Backlog Compliance

  • Zero-dep: ✓ — Uses existing rusqlite dependency, no new runtime deps
  • Foundation-aligned: ✓ — Pure performance optimization, maintains zero-infrastructure principle
  • Problem-fit: 4/5 — Directly addresses AI agent performance by making native engine faster, reducing build times
  • Breaking: No — Internal optimization, no API/output changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Technical Analysis

Code Quality: ⭐⭐⭐⭐⭐

The implementation is excellent:

build_pipeline.rs:1084-1101: File node ID batch loading

let file_node_ids: HashMap<String, u32> = {
    let mut map = HashMap::new();
    if let Ok(mut stmt) = conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0") {
        // Bulk load all file nodes in one query
    }
    map
};

import_edges.rs:159-198: Symbol node ID batch loading

  • Smart optimization: only loads symbol map when type-only imports exist
  • Preserves first-row semantics of legacy LIMIT 1 queries via map.entry().or_insert()

import_edges.rs:328-379: Chunked multi-row inserts

  • Uses proven 199-row chunk size (matching edges_db::CHUNK)
  • Proper transaction handling and error recovery
  • Dynamic SQL generation with parameterized placeholders

Performance Claims: ⭐⭐⭐⭐⭐

Benchmarks show dramatic improvements:

  • Edges: 310ms → 119ms (2.6x faster)
  • Roles: 269ms → 71ms (3.8x faster)
  • Native now outperforms WASM for edges phase

The "roles convergence as side effect" explanation makes technical sense - reduced SQLite cache pressure during build improves subsequent operations.

Test Coverage: ⭐⭐⭐☆☆

  • ✅ Existing barrel_detection() test covers core logic
  • ✅ End-to-end verification via benchmark suite
  • ⚠️ No specific unit tests for the new batch loading functions
  • ✅ Claims manual verification of node/edge counts remain identical

Error Handling: ⭐⭐⭐⭐☆

Good defensive programming:

  • Graceful fallback on SQLite errors (returns empty collections)
  • Proper transaction rollback on chunk insert failures
  • Maintains data consistency under failure scenarios

Critical Concerns

None identified. This is a well-executed performance optimization that:

  • Uses proven SQLite optimization patterns
  • Maintains behavioral equivalence
  • Has appropriate error handling
  • Shows measurable performance gains

Minor Observations

  1. Chunking constant: Good choice to mirror edges_db::CHUNK for consistency
  2. Type-only optimization: Smart to conditionally load symbol map only when needed
  3. Windows path handling: Proper normalization to forward slashes for cache key consistency

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: Textbook database performance optimization. Addresses real bottleneck, uses proven patterns, maintains correctness, shows significant measurable improvements. Code quality is excellent with proper error handling.

This PR successfully makes the native engine faster than WASM for edges phase, directly supporting the project's performance goals.


@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR replaces per-row query_row SQLite lookups in the edges phase with one-shot HashMap pre-loads and switches from single-row prepared statements to multi-row VALUES batch inserts. The previously-discussed issues (unbounded full-table scan, missing name = file predicate) were addressed in the head commit. The implementation is correct and the benchmark numbers are compelling — native edges now runs faster than WASM.

Confidence Score: 5/5

Safe to merge — all previously flagged issues are resolved and the new batch logic is correct.

No P0 or P1 findings. The pre-load queries are equivalent to the old per-row lookups, the chunked symbol lookup faithfully replicates barrel-resolution logic, numbered ?N parameters align correctly with raw_bind_parameter indices, and INSERT OR IGNORE semantics mean constraint conflicts within a chunk are still silently skipped at the SQL level. All concerns raised in prior review threads are addressed in the head commit.

No files require special attention.

Important Files Changed

Filename Overview
crates/codegraph-core/src/import_edges.rs Replaces lazy per-call query_row lookups with load_file_node_ids (full scan, guarded by name = file) and a chunked load_symbol_node_ids for type-only imports; adds insert_edge_chunk multi-row batch inserts (199 rows × 5 params). Logic is equivalent to the prior single-row path and the pre-computation in collect_type_only_lookup_pairs faithfully mirrors the barrel-resolution logic in build_import_edges.
crates/codegraph-core/src/build_pipeline.rs Adds a single SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0 AND name = file pre-load into HashMap<String, u32> before the file_symbols loop, replacing one query_row call per file. The u32 type is consistent with FileEdgeInput.file_node_id: u32.

Sequence Diagram

sequenceDiagram
    participant BP as build_pipeline
    participant IE as import_edges
    participant DB as SQLite

    Note over BP,DB: Old path (per-row)
    loop for each file in file_symbols
        BP->>DB: query_row (prepare/step/finalize)
        DB-->>BP: file_node_id
    end
    loop for each edge
        BP->>DB: INSERT one row (prepare/bind/step/reset)
    end

    Note over BP,DB: New path (batch)
    BP->>DB: SELECT file,id WHERE kind='file' AND name=file
    DB-->>BP: all file node IDs → HashMap
    IE->>IE: collect_type_only_lookup_pairs()
    IE->>DB: SELECT name,file,id WHERE (name,file) IN (...) [chunked, ≤332 pairs]
    DB-->>IE: symbol node IDs → HashMap
    loop build edges (HashMap lookups only, no DB round-trips)
        IE->>IE: file_node_ids.get() / symbol_node_ids.get()
    end
    loop for each 199-row chunk
        IE->>DB: INSERT OR IGNORE ... VALUES (?1,…,?995) [raw_bind + raw_execute]
    end
Loading

Reviews (3): Last reviewed commit: "fix(native): surface bind/execute errors..." | Re-trigger Greptile

Comment on lines 180 to 198
fn load_symbol_node_ids(conn: &Connection) -> HashMap<(String, String), i64> {
let mut map = HashMap::new();
if let Ok(mut stmt) =
conn.prepare("SELECT name, file, id FROM nodes WHERE kind != 'file'")
{
if let Ok(rows) = stmt.query_map([], |row| {
Ok((
row.get::<_, String>(0)?,
row.get::<_, String>(1)?,
row.get::<_, i64>(2)?,
))
}) {
for r in rows.flatten() {
map.entry((r.0, r.1)).or_insert(r.2);
}
}
}
map
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unbounded full-table scan into memory for symbol nodes

SELECT name, file, id FROM nodes WHERE kind != 'file' has no row limit and loads every non-file symbol into a single HashMap. For large monorepos (100k+ symbols) this can easily push several hundred MB of resident RAM in a single build pass. The old get_symbol_node_id was lazy — it only hit SQLite for the specific (name, file) pairs that were actually needed, and only when type-only imports existed.

Consider capping with a two-phase approach: collect the distinct (name, file) pairs needed while iterating ctx.file_symbols, then issue a single WHERE (name, file) IN (…) query for only those pairs. This keeps the one-round-trip benefit without pulling the entire nodes table into memory.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1081b64. Replaced the unbounded SELECT name, file, id FROM nodes WHERE kind != 'file' with a chunked (name, file) IN (...) query. build_import_edges now walks type-only imports up front (resolving barrels the same way the edge loop does) to collect the distinct (name, file) pairs that will actually be looked up, then load_symbol_node_ids issues those in 332-pair chunks (664 binds, well under SQLITE_MAX_VARIABLE_NUMBER's legacy 999). The full-table scan path is gone — even on 100k+-symbol monorepos, only the small slice referenced by type-only imports is hit, while preserving the one-round-trip win.

Comment on lines +1089 to +1101
if let Ok(mut stmt) =
conn.prepare("SELECT file, id FROM nodes WHERE kind = 'file' AND line = 0")
{
if let Ok(rows) = stmt.query_map([], |row| {
Ok((row.get::<_, String>(0)?, row.get::<_, i64>(1)? as u32))
}) {
for r in rows.flatten() {
map.insert(r.0, r.1);
}
}
}
map
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Query drops name = ? filter; semantics diverge from old lookup

The original query was:

SELECT id FROM nodes WHERE name = ? AND kind = 'file' AND file = ? AND line = 0

Both bind parameters were rel_path, so only nodes where name = file = rel_path were matched.

The new bulk query is SELECT file, id … WHERE kind = 'file' AND line = 0 — the name column is no longer checked, and any collision where two file-kind nodes share the same file value (different name) would silently overwrite an earlier map entry. The same pattern is repeated in import_edges.rs::load_file_node_ids. While name = file is the overwhelming convention for file nodes, the prior guard was explicit and free; dropping it is a silent behavioral change worth documenting or re-adding as a WHERE name = file predicate if the column is guaranteed redundant.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1081b64. Added AND name = file to both the load_file_node_ids query in import_edges.rs and the parallel pre-load in build_and_insert_call_edges in build_pipeline.rs. This restores the explicit guard from the legacy per-row lookup (WHERE name = ? AND file = ? with both binds set to rel_path), so an unrelated file-kind row sharing the same file value but different name can no longer silently overwrite the map entry.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 30, 2026

Codegraph Impact Analysis

7 functions changed3 callers affected across 2 files

  • build_and_insert_call_edges in crates/codegraph-core/src/build_pipeline.rs:912 (1 transitive callers)
  • load_file_node_ids in crates/codegraph-core/src/import_edges.rs:165 (1 transitive callers)
  • load_symbol_node_ids in crates/codegraph-core/src/import_edges.rs:190 (1 transitive callers)
  • collect_type_only_lookup_pairs in crates/codegraph-core/src/import_edges.rs:242 (1 transitive callers)
  • build_import_edges in crates/codegraph-core/src/import_edges.rs:279 (0 transitive callers)
  • insert_edges in crates/codegraph-core/src/import_edges.rs:412 (0 transitive callers)
  • insert_edge_chunk in crates/codegraph-core/src/import_edges.rs:443 (1 transitive callers)

)

Address two P2 review concerns from greptile on #1013:

1. **Bounded symbol-node lookup.** `load_symbol_node_ids` previously did
   an unbounded `SELECT name, file, id FROM nodes WHERE kind != 'file'`,
   loading every non-file symbol into memory. On 100k+-symbol monorepos
   this could push hundreds of MB. Now we walk type-only imports up
   front to collect the distinct `(name, file)` pairs we'll actually
   need, then issue a chunked `(name, file) IN (...)` query (332 pairs
   per chunk × 2 binds = 664, safely under `SQLITE_MAX_VARIABLE_NUMBER`).
   The full-scan path is gone; only the symbols referenced by type-only
   imports are hit, preserving the one-round-trip win without the
   memory blow-up.

2. **Restore `name = file` guard on file-node lookups.** The original
   per-row query bound `rel_path` to both `name = ?` and `file = ?`,
   matching only nodes where the two columns agreed. The bulk query
   keyed on `file` alone, so an unrelated file-kind row sharing the
   same `file` value (different `name`) could silently overwrite the
   map entry. Add `AND name = file` to both `load_file_node_ids` and
   the parallel pre-load in `build_and_insert_call_edges` to keep the
   legacy semantics explicit.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

Greptile flagged two P2 issues in `insert_edges`:

1. Silent bind failures: `let _ = stmt.raw_bind_parameter(...)` silently
   discarded errors. A failed bind would leave that position unbound
   (NULL), causing `raw_execute()` to insert ghost edge rows with NULL
   `source_id`/`target_id`. Bind/execute now run inside a fallible
   `insert_edge_chunk` helper; failures emit a stderr warning and the
   chunk is skipped instead of producing partial rows.

2. `prepare_cached` mismatched with dynamic SQL: the SQL string varies
   with chunk length, so trailing partial chunks were always cache
   misses. Switched to plain `tx.prepare(&sql)` to match intent.

No behavioural change for the success path. cargo test
-p codegraph-core --lib: 181 passed.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed both P2 notes from the re-review summary in cb492ef:

  1. Silent bind failures (import_edges.rs:435-447) — Replaced let _ = stmt.raw_bind_parameter(...) with proper ? propagation inside a new insert_edge_chunk helper. Bind/execute errors now emit a stderr warning ([codegraph] insert_edges: skipped chunk of N rows due to error: ...) and the affected chunk is skipped, instead of producing rows with NULL source_id/target_id columns.

  2. prepare_cached mismatched with dynamic SQL (import_edges.rs:435) — Switched to plain tx.prepare(&sql). As you noted, the SQL string varies with chunk length so the trailing partial chunk was always a fresh prepare anyway; using prepare for all chunks makes the intent explicit and avoids LRU churn.

cargo check + cargo test -p codegraph-core --lib (181 passed). Biome lint clean.

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit fea7b5e into main Apr 30, 2026
27 checks passed
@carlos-alm carlos-alm deleted the perf/native-edges-roles-1013 branch April 30, 2026 05:34
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 30, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(native): full-build edges +73% and roles +334% slower than WASM on 3.9.5 despite producing fewer outputs

1 participant