fix(native): persist file_hashes for dropped/symbol-less files #1069
The JS-side fast-skip pre-flight (#1054) was permanently rejecting on repos containing optional-language files (e.g. .clj) because their rows were missing from file_hashes:
- buildFileHashes / updateFileHashes iterated allSymbols, so files with zero symbols (empty, parser no-op, grammar-missing optional language) never got a hash row. Iterate filesToParse instead.
- backfillNativeDroppedFiles now writes file_hashes rows for every file the Rust orchestrator dropped (e.g. .clj when no Rust extractor exists), so the fast-skip pre-flight can match them on rebuild.
- Backfill runs on every successful orchestrator pass, not only on full builds. The Rust orchestrator's narrower file_collector treats files outside SUPPORTED_EXTENSIONS as 'removed' and deletes their nodes + file_hashes rows on every incremental run; backfill repairs that.
Restructured the function to do the cheap missing-file check before the proxy->better-sqlite3 handoff so no-op rebuilds remain cheap.
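The first point can be illustrated with a minimal sketch. It is not the project's buildFileHashes: the FileToParse and HashRow shapes, the buildHashRows name, and the FNV-1a stand-in hash are all assumptions made for the example. It only shows why iterating the collected file list, rather than the symbol map, gives a symbol-less file such as main.clj a hash row.

```typescript
// Hypothetical shapes; the real builder types are not shown in this PR.
interface FileToParse {
  relPath: string;
  content: string;
}

interface HashRow {
  file: string;
  hash: string;
}

// Build one hash row per collected file. Iterating the file list (not the
// symbol map) means a file with zero symbols still gets a row.
function buildHashRows(
  filesToParse: FileToParse[],
  hashContent: (content: string) => string,
): HashRow[] {
  const seen = new Set<string>();
  const rows: HashRow[] = [];
  for (const f of filesToParse) {
    if (seen.has(f.relPath)) continue; // skip repeated paths
    seen.add(f.relPath);
    rows.push({ file: f.relPath, hash: hashContent(f.content) });
  }
  return rows;
}

// Stand-in hash for the example only (FNV-1a style over UTF-16 code units);
// this is NOT the hash the project uses.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Symbols exist only for a.js; main.clj parsed to nothing.
const allSymbols = new Map<string, string[]>([["a.js", ["foo"]]]);
const files: FileToParse[] = [
  { relPath: "a.js", content: "function foo() {}" },
  { relPath: "main.clj", content: "(ns main)" },
];

// Old behaviour: only files that appear in the symbol map are covered.
const oldCoverage = Array.from(allSymbols.keys());
// New behaviour: every collected file is covered.
const rows = buildHashRows(files, fnv1a);
const hashedFiles = rows.map((r) => r.file);
```

Here oldCoverage contains only a.js, while hashedFiles contains both a.js and main.clj, which is the row the fast-skip pre-flight needs in order to match on rebuild.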
Claude finished @carlos-alm's task
PR Review Complete
Reviewing PR #1069: "fix(native): persist file_hashes for dropped/symbol-less files"
Root Cause Analysis
Problem Being Solved: Files with zero symbols (empty files, parser no-ops, or files for languages without Rust extractors) weren't getting […]
Why This Approach: The fix correctly addresses the root cause by ensuring […]
Risk Assessment: Low risk. This is a pure addition: files that previously wouldn't get hash entries now do, which only improves fast-skip accuracy. No existing behavior changes.
Backlog Compliance
Code Quality Analysis
Architecture: Clean separation of concerns.
Error Handling: Robust defensive programming: […]
Performance: Intelligent optimization. The missing-file check happens before the expensive DB handoff, so no-op rebuilds remain fast (lines 760-789).
Testing: Excellent coverage, with 6 unit tests specifically for […]
SQL Safety: All queries use parameterized statements. The […]
Minor Observations:
Critical Concerns
None identified. This is a well-engineered fix that addresses a real performance regression with proper testing and error handling.
Final Recommendation
Greptile Summary
This PR fixes a persistent fast-skip regression (#1068) where files that produce zero symbols (empty files, optional-language extensions with no Rust extractor, parsers that no-op) were never written to […]
Confidence Score: 5/5
Safe to merge: the changes are targeted, well-commented, and backed by unit tests that directly exercise the corrected code path.
The core logic change (iterate […]
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Orch as Rust Orchestrator
participant BN as backfillNativeDroppedFiles
participant FS as Filesystem
participant DB as SQLite (nodes / file_hashes)
participant WASM as WASM Parser
Orch->>DB: deletes nodes+file_hashes for non-SUPPORTED_EXTENSIONS files
Orch->>BN: (every orchestrator pass, incl. incremental)
BN->>FS: collectFilesUtil to expected set
BN->>DB: SELECT DISTINCT file FROM nodes WHERE kind='file'
BN->>DB: SELECT DISTINCT file FROM file_hashes
BN->>BN: missing = expected minus (nodes intersect file_hashes)
alt missingAbs.length == 0
BN-->>Orch: early return (no proxy handoff)
else files missing
BN->>BN: close NativeDb, open better-sqlite3
BN->>WASM: parseFilesWasmForBackfill(missingAbs)
WASM-->>BN: wasmResults (symbols per file)
BN->>DB: INSERT OR IGNORE nodes (file + symbol rows)
BN->>DB: INSERT OR REPLACE file_hashes for every missingRel
Note over BN,DB: Iterates missingRel not wasmResults so symbol-less files still get a hash row
end
Reviews (2): Last reviewed commit: "fix(native): treat file_hashes gap as mi..."
Codegraph Impact Analysis
6 functions changed → 9 callers affected across 6 files
Greptile feedback: backfillNativeDroppedFiles read 'nodes WHERE kind=file' to decide what's missing, but the fast-skip pre-flight (#1054) rejects on 'file_hashes' gaps. If a DB has a node row but no file_hashes row (e.g. state written by pre-#1068 code), the early-return triggers and the gap persists across rebuilds. Also query file_hashes and treat any expected file absent from EITHER table as missing. The existing upsert path repairs both rows. The file_hashes read is wrapped in try/catch so legacy DBs without the table fall through to the existing recovery path.
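The either-table rule described in this commit message can be sketched as follows. This is an illustration under assumptions, not the code in backfillNativeDroppedFiles: the function names are hypothetical, the table reads are replaced by plain callbacks, and returning an empty set when the file_hashes read throws is one possible fallback (the commit only says legacy DBs fall through to the existing recovery path).

```typescript
// A file counts as missing if it is absent from EITHER table, i.e.
// missing = expected minus (nodes intersect file_hashes).
function findMissingFiles(
  expected: string[],
  nodeFiles: Set<string>,
  hashFiles: Set<string>,
): string[] {
  return expected.filter((f) => !(nodeFiles.has(f) && hashFiles.has(f)));
}

// Reading file_hashes may throw on a legacy DB without the table.
// ASSUMPTION: fall back to an empty set, which marks every expected file
// as missing so the repair path runs. The real fallback may differ.
function readHashFilesSafe(read: () => string[]): Set<string> {
  try {
    return new Set(read());
  } catch {
    return new Set<string>();
  }
}

const expected = ["a.js", "main.clj", "b.js"];
const nodeFiles = new Set(["a.js", "main.clj"]);

// main.clj has a node row but no hash row; b.js has a hash row but no node row.
const hashFiles = readHashFilesSafe(() => ["a.js", "b.js"]);
const missing = findMissingFiles(expected, nodeFiles, hashFiles);

// Legacy DB: the read throws, so nothing is treated as already hashed.
const legacyHashFiles = readHashFilesSafe(() => {
  throw new Error("no such table: file_hashes");
});
const missingOnLegacy = findMissingFiles(expected, nodeFiles, legacyHashFiles);
```

With these inputs, missing is ["main.clj", "b.js"]. Checking only the nodes table would have reported b.js alone, and the main.clj gap (node row present, hash row absent) would have survived every rebuild.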
Addressed Greptile's feedback (P2 'Early-return signal diverges from fast-skip signal'):
Commit: 9b2aa25
Summary
- Iterate filesToParse in buildFileHashes / updateFileHashes so files with zero symbols (empty, parser no-op, optional-language grammar unavailable) still get a file_hashes row.
- Write file_hashes rows in backfillNativeDroppedFiles for every file the Rust orchestrator dropped (e.g. .clj when no Rust extractor exists).
- Run backfill on every successful orchestrator pass, not only on full builds. The Rust orchestrator deletes nodes + file_hashes rows for files outside its SUPPORTED_EXTENSIONS on every incremental run, and the JS-side fast-skip pre-flight (#1054, "perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0") then permanently rejects on "collected file missing from file_hashes". Restructured the function to do the cheap missing-file check before the proxy→better-sqlite3 handoff so no-op rebuilds remain fast.
Closes #1068
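The ordering in the last summary point (cheap check first, expensive handoff only when needed) might look roughly like this. The BackfillSteps interface and backfillIfNeeded function are hypothetical names for the example, and both steps are stubbed with callbacks rather than real database work.

```typescript
// Hypothetical dependency bundle; these names do not come from the PR.
interface BackfillSteps {
  // Cheap: compare the collected file list against the two tables.
  findMissing: () => string[];
  // Expensive: hand the DB over, parse the missing files, upsert rows.
  handoffAndRepair: (missing: string[]) => void;
}

// Returns how many files were repaired. The expensive step is skipped
// entirely when nothing is missing, so a no-op rebuild stays cheap.
function backfillIfNeeded(steps: BackfillSteps): number {
  const missing = steps.findMissing();
  if (missing.length === 0) return 0;
  steps.handoffAndRepair(missing);
  return missing.length;
}

// Count how often the expensive step actually runs.
let handoffs = 0;
const countHandoff = (_missing: string[]): void => {
  handoffs += 1;
};

const repairedOnNoop = backfillIfNeeded({
  findMissing: () => [],
  handoffAndRepair: countHandoff,
});
const handoffsAfterNoop = handoffs;

const repairedOnGap = backfillIfNeeded({
  findMissing: () => ["main.clj"],
  handoffAndRepair: countHandoff,
});
const handoffsAfterGap = handoffs;
```

In this sketch the no-op pass performs zero handoffs and the pass with one missing file performs exactly one, which is the property the restructuring aims to preserve now that backfill runs on every orchestrator pass.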
Test plan
- […] (a.js + main.clj repo, cfg: false, dataflow: false): file_hashes includes both a.js and main.clj.
- […] a.js: orchestrator runs, picks up new symbol, main.clj row still preserved.
- tests/builder/insert-nodes.test.ts (6 cases) covers buildFileHashes directly: symbol-less files, precomputed hash, _reverseDepOnly, disk fallback, metadata-only updates, deduplication.
Notes
- […] vitest.config.ts issue (--strip-types is not allowed in NODE_OPTIONS); behavior was verified via build + lint + TypeScript + standalone E2E against the built dist/.