Skip to content

fix(native): persist file_hashes for dropped/symbol-less files#1069

Merged
carlos-alm merged 3 commits into
mainfrom
fix/1068-file-hashes-collected-files
May 6, 2026
Merged

fix(native): persist file_hashes for dropped/symbol-less files#1069
carlos-alm merged 3 commits into
mainfrom
fix/1068-file-hashes-collected-files

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Iterate filesToParse in buildFileHashes / updateFileHashes so files with zero symbols (empty, parser no-op, optional-language grammar unavailable) still get a file_hashes row.
  • Persist file_hashes rows in backfillNativeDroppedFiles for every file the Rust orchestrator dropped (e.g. .clj when no Rust extractor exists).
  • Run the engine-parity backfill on every successful orchestrator pass, not just full builds — the Rust orchestrator's narrower file_collector deletes nodes + file_hashes rows for files outside its SUPPORTED_EXTENSIONS on every incremental run, and the JS-side fast-skip pre-flight (perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0 #1054) then permanently rejects on "collected file missing from file_hashes". Restructured the function to do the cheap missing-file check before the proxy→better-sqlite3 handoff so no-op rebuilds remain fast.

Closes #1068

Test plan

  • CI green
  • Manual E2E (mixed a.js + main.clj repo, cfg: false, dataflow: false):
  • New unit test tests/builder/insert-nodes.test.ts (6 cases) covers buildFileHashes directly: symbol-less files, precomputed hash, _reverseDepOnly, disk fallback, metadata-only updates, deduplication.

Notes

  • Vitest runs are blocked locally on Node 24 by a pre-existing vitest.config.ts issue (--strip-types is not allowed in NODE_OPTIONS); behavior was verified via build + lint + TypeScript + standalone E2E against the built dist/.

The JS-side fast-skip pre-flight (#1054) was permanently rejecting on
repos containing optional-language files (e.g. .clj) because their rows
were missing from file_hashes:

- buildFileHashes / updateFileHashes iterated allSymbols, so files with
  zero symbols (empty, parser no-op, grammar-missing optional language)
  never got a hash row. Iterate filesToParse instead.
- backfillNativeDroppedFiles now writes file_hashes rows for every file
  the Rust orchestrator dropped (e.g. .clj when no Rust extractor
  exists), so the fast-skip pre-flight can match them on rebuild.
- Backfill runs on every successful orchestrator pass, not only on full
  builds. The Rust orchestrator's narrower file_collector treats files
  outside SUPPORTED_EXTENSIONS as 'removed' and deletes their nodes +
  file_hashes rows on every incremental run; backfill repairs that.
  Restructured the function to do the cheap missing-file check before
  the proxy->better-sqlite3 handoff so no-op rebuilds remain cheap.
@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Claude finished @carlos-alm's task —— View job


PR Review Complete

Reviewing PR #1069: "fix(native): persist file_hashes for dropped/symbol-less files"

Todo List:

  • Phase 1: Root cause analysis - understand the core problem
  • Phase 2: Check for configuration changes that might silence errors
  • Phase 3: Verify test integrity and coverage
  • Phase 4: Code quality review
  • Phase 5: Backlog compliance assessment
  • Provide structured final assessment

Root Cause Analysis

Problem Being Solved: Files with zero symbols (empty files, parser no-ops, or files for languages without Rust extractors) weren't getting file_hashes entries, causing the JS-side fast-skip pre-flight (#1054) to permanently reject with "collected file missing from file_hashes", forcing expensive full pipeline runs (~2s) on every no-op rebuild.

Why This Approach: The fix correctly addresses the root cause by ensuring buildFileHashes iterates over filesToParse (every collected file) rather than just parsed symbols, and ensures backfillNativeDroppedFiles runs on every successful orchestrator pass to persist hashes for dropped files.

Risk Assessment: Low risk. This is a pure addition - files that previously wouldn't get hash entries now do, which only improves fast-skip accuracy. No existing behavior changes.

Backlog Compliance

  • Zero-dep: ✓ — No new dependencies, builds on existing SQLite/tree-sitter infrastructure
  • Foundation-aligned: ✓ — Directly serves P1 (always-current graph) by ensuring incremental builds work correctly
  • Problem-fit: 4 — Prevents performance regression (2s → 20ms) that directly impacts AI agent experience
  • Breaking: No — Purely additive, no API or behavior changes
  • Tier: 1

Code Quality Analysis

Architecture: Clean separation of concerns. buildFileHashes handles the core logic, updateFileHashes is the JS fallback, and backfillNativeDroppedFiles is the engine-parity correction. All functions properly iterate filesToParse rather than just symbols.

Error Handling: Robust defensive programming:

  • readFileSafe failures are logged and skipped (lines 141, 378, 899)
  • File stat failures gracefully default to 0 (lines 130-133, 366-369)
  • DB table missing handled gracefully (lines 435, 912)

Performance: Intelligent optimization - the missing-file check happens before expensive DB handoff, so no-op rebuilds remain fast (lines 760-789).

Testing: Excellent coverage - 6 unit tests specifically for buildFileHashes covering edge cases including symbol-less files, precomputed hashes, _reverseDepOnly logic, disk fallback, metadata updates, and deduplication.

SQL Safety: All queries use parameterized statements. The UPSERT pattern is correct (INSERT OR REPLACE).

Minor Observations:

Critical Concerns

None identified. This is a well-engineered fix that addresses a real performance regression with proper testing and error handling.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Solves a legitimate performance issue (2s → 20ms) with clean implementation, comprehensive testing, and zero risk of regression. The approach correctly fixes the root cause rather than working around symptoms. The only reason this isn't 5 stars is that it's a bug fix rather than a new feature - solid engineering work that should merge.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR fixes a persistent fast-skip regression (#1068) where files that produce zero symbols (empty files, optional-language extensions with no Rust extractor, parsers that no-op) were never written to file_hashes, causing the JS-side pre-flight check to reject every subsequent no-op rebuild and fall back to the full native pipeline.

  • insert-nodes.ts: buildFileHashes / updateFileHashes now iterate filesToParse (every collected file) instead of allSymbols (only files with ≥1 symbol), so symbol-less files always get a file_hashes row.
  • pipeline.ts: backfillNativeDroppedFiles now runs on every successful orchestrator pass (not just full builds), writes file_hashes rows for every file it backfills, and evaluates the missing-file set before the expensive proxy→better-sqlite3 handoff, keeping no-op incremental builds cheap (1 FS scan + 2 DB queries). The check now requires a file to appear in both nodes and file_hashes to be considered present, addressing an earlier edge-case divergence between the two tables.

Confidence Score: 5/5

Safe to merge — the changes are targeted, well-commented, and backed by unit tests that directly exercise the corrected code path.

The core logic change (iterate filesToParse instead of allSymbols) is straightforward and the new test suite validates all six branches in buildFileHashes. The pipeline restructuring keeps the expensive proxy handoff behind the fast early-return, so no-op incremental builds pay only a filesystem scan and two DB queries. No regressions were identified in the data-flow or DB write ordering.

No files require special attention.

Important Files Changed

Filename Overview
src/domain/graph/builder/stages/insert-nodes.ts Core fix: buildFileHashes and updateFileHashes now iterate filesToParse instead of allSymbols, ensuring symbol-less files get a file_hashes row. Deduplication via seen set added correctly. buildFileHashes exported for the new unit tests.
src/domain/graph/builder/pipeline.ts Backfill now runs every orchestrator pass; missing-file computation moved before proxy handoff for no-op fast-return; dual-table check (nodes + file_hashes) tightens completeness signal; file_hashes upsert block added for all backfilled files.
tests/builder/insert-nodes.test.ts New unit tests: 6 cases covering symbol-less files, precomputed hash, _reverseDepOnly, disk fallback, metadata-only updates, and deduplication. Solid coverage of the changed buildFileHashes logic.

Sequence Diagram

sequenceDiagram
    participant Orch as Rust Orchestrator
    participant BN as backfillNativeDroppedFiles
    participant FS as Filesystem
    participant DB as SQLite (nodes / file_hashes)
    participant WASM as WASM Parser

    Orch->>DB: deletes nodes+file_hashes for non-SUPPORTED_EXTENSIONS files
    Orch->>BN: (every orchestrator pass, incl. incremental)
    BN->>FS: collectFilesUtil to expected set
    BN->>DB: SELECT DISTINCT file FROM nodes WHERE kind='file'
    BN->>DB: SELECT DISTINCT file FROM file_hashes
    BN->>BN: missing = expected minus nodes intersect file_hashes
    alt missingAbs.length == 0
        BN-->>Orch: early return (no proxy handoff)
    else files missing
        BN->>BN: close NativeDb, open better-sqlite3
        BN->>WASM: parseFilesWasmForBackfill(missingAbs)
        WASM-->>BN: wasmResults (symbols per file)
        BN->>DB: INSERT OR IGNORE nodes (file + symbol rows)
        BN->>DB: INSERT OR REPLACE file_hashes for every missingRel
        Note over BN,DB: Iterates missingRel not wasmResults so symbol-less files still get a hash row
    end
Loading

Reviews (2): Last reviewed commit: "fix(native): treat file_hashes gap as mi..." | Re-trigger Greptile

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Codegraph Impact Analysis

6 functions changed9 callers affected across 6 files

  • tryNativeOrchestrator in src/domain/graph/builder/pipeline.ts:597 (5 transitive callers)
  • backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:759 (4 transitive callers)
  • buildFileHashes in src/domain/graph/builder/stages/insert-nodes.ts:105 (3 transitive callers)
  • tryNativeInsert in src/domain/graph/builder/stages/insert-nodes.ts:165 (3 transitive callers)
  • updateFileHashes in src/domain/graph/builder/stages/insert-nodes.ts:338 (3 transitive callers)
  • insertNodes in src/domain/graph/builder/stages/insert-nodes.ts:400 (5 transitive callers)

Greptile feedback: backfillNativeDroppedFiles read 'nodes WHERE kind=file'
to decide what's missing, but the fast-skip pre-flight (#1054) rejects on
'file_hashes' gaps. If a DB has a node row but no file_hashes row (e.g.
state written by pre-#1068 code), the early-return triggers and the gap
persists across rebuilds.

Also query file_hashes and treat any expected file absent from EITHER
table as missing. The existing upsert path repairs both rows. The
file_hashes read is wrapped in try/catch so legacy DBs without the table
fall through to the existing recovery path.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed Greptile's feedback (P2 'Early-return signal diverges from fast-skip signal'):

  • backfillNativeDroppedFiles now reads BOTH nodes and file_hashes and treats any expected file absent from EITHER table as missing.
  • Closes the divergence gap where a file present in nodes but missing from file_hashes would early-return without repairing the hash row, leaving the fast-skip pre-flight (perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0 #1054) permanently rejecting.
  • The existing INSERT OR REPLACE upsert path now repairs both rows. The new file_hashes SELECT is wrapped in try/catch so legacy DBs without the table fall through gracefully.

Commit: 9b2aa25

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 8fcdb86 into main May 6, 2026
13 checks passed
@carlos-alm carlos-alm deleted the fix/1068-file-hashes-collected-files branch May 6, 2026 03:32
@github-actions github-actions Bot locked and limited conversation to collaborators May 6, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fast-skip: file_hashes drops files when grammar isn't installed (breaks fast-skip permanently)

1 participant