fix(parity): align WASM and native ast_nodes extraction (#1016)
* fix(parity): align WASM and native ast_nodes extraction (#1010)
Resolves #1010. Three independent divergences were causing the native
engine to emit ~7,200 excess `string` AST nodes vs WASM on self-build:
1. Language coverage gap (~6,653 rows) — WASM's AST_TYPE_MAPS registered
only javascript/typescript/tsx; native emitted ast_nodes for 19 more
languages via `walk_ast_nodes_with_config`. Mirrored every
LangAstConfig from `helpers.rs` into WASM as `AST_TYPE_MAPS` +
`AST_STRING_CONFIGS` entries and threaded a per-language
`stopRecurseKinds` set through `createAstStoreVisitor`.
2. WASM `await` skipChildren (~500 rows) — the visitor returned
skipChildren for `await_expression`, so string/call children of
`await import('x')` / `await fn('y')` were never walked. Native's
javascript.rs explicitly recurses. Removed `await` from the
skipChildren filter.
3. UTF-8 byte-length check in native (~40 rows) —
`crates/codegraph-core/src/extractors/javascript.rs` gated string
emission on `content.len() < 2` (UTF-8 byte count), so any single
non-ASCII glyph like `─` (3 bytes) passed the check and was emitted. Changed to
`content.chars().count() < 2` for parity with helpers.rs and JS
`.length`. WASM's filter uses code-point count (`[...s].length`).
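The length mismatch in item 3 can be reproduced outside either engine. The sketch below is illustrative only: the helper names are hypothetical and not taken from the codebase; only the byte-count versus code-point-count distinction and the `< 2` threshold come from the description above.

```typescript
// Hypothetical helpers (not from the codebase) contrasting the two length measures.
const utf8ByteLength = (s: string): number => new TextEncoder().encode(s).length;
const codePointLength = (s: string): number => [...s].length;

// Stand-ins for the two predicates: `true` means "too short, do not emit".
const skippedByByteCount = (s: string): boolean => utf8ByteLength(s) < 2;
const skippedByCodePointCount = (s: string): boolean => codePointLength(s) < 2;

const glyph = "─"; // U+2500: one code point, three UTF-8 bytes

console.log(utf8ByteLength(glyph), codePointLength(glyph));
console.log(skippedByByteCount(glyph), skippedByCodePointCount(glyph));
```

Under the byte-count predicate the glyph is not skipped, whereas the code-point predicate skips it, which is the class of rows the fix targets.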
Measured parity after fix on 775 shared files (excluding files edited
in this PR): 37,605 (WASM) vs 37,649 (native) = 0.12 % delta. Every
kind except `string` is at 0 delta; the remaining 44-row string gap is
the UTF-8 length fix, which takes effect only after the next native binary rebuild.
New parity test in tests/engines/ast-parity.test.ts asserts ≤1 row
divergence between engines for six languages (js, ts, python, rust,
go, java).
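A parity assertion of this kind could be sketched as follows. This is not the code in tests/engines/ast-parity.test.ts: the `AstRow` shape and both function names are assumptions, and whether the real test sums divergence across kinds or checks each kind separately is not stated above.

```typescript
// Hypothetical row shape; the real ast_nodes columns are not shown in this commit.
type AstRow = { kind: string };

// Count rows per AST kind (e.g. "string", "await").
function countByKind(rows: AstRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) counts.set(row.kind, (counts.get(row.kind) ?? 0) + 1);
  return counts;
}

// Assumed metric: total absolute difference in per-kind counts between engines.
function rowDivergence(wasm: AstRow[], native: AstRow[]): number {
  const a = countByKind(wasm);
  const b = countByKind(native);
  const kinds = new Set([...a.keys(), ...b.keys()]);
  let diff = 0;
  for (const kind of kinds) diff += Math.abs((a.get(kind) ?? 0) - (b.get(kind) ?? 0));
  return diff;
}
```

A test would then run both engines on the same fixture and assert `rowDivergence(wasmRows, nativeRows) <= 1`.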
* perf: hoist SKIP_KEYWORDS set to module scope (#1016)
Avoid reallocating a Set on every extractChildExpressionText call —
the contents are stateless and this function runs per throw/await
node during AST-store extraction.
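The hoisting pattern looks roughly like this. The keyword list and the body of the function are placeholders, since the real `SKIP_KEYWORDS` contents and `extractChildExpressionText` logic are not shown in this commit; only the move of the `Set` to module scope is taken from the message.

```typescript
// Built once at module load. The keyword list here is a placeholder.
const SKIP_KEYWORDS: ReadonlySet<string> = new Set(["throw", "await", "new"]);

// Hypothetical stand-in for extractChildExpressionText: returns the first
// child text that is not a skipped keyword, reusing the shared Set.
function firstNonKeywordText(childTexts: string[]): string | undefined {
  return childTexts.find((text) => !SKIP_KEYWORDS.has(text));
}
```

Because the set is never mutated, sharing a single instance across calls is safe, and typing it as `ReadonlySet` makes that intent explicit.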
Impact: 1 functions changed, 7 affected
* test: add ast_nodes parity fixtures for 15 more languages (#1016)
PR #1016 added AST_TYPE_MAPS entries for 16 languages beyond js/ts/
python/rust/go/java but PARITY_FIXTURES only covered the original 6,
leaving silent-divergence risk for languages with distinct string
node types (encapsed_string, sigil, etc.). Adds minimal fixtures for
csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua,
dart, zig, haskell, ocaml — each exercises a string literal plus at
least one other kind from its AST_TYPE_MAP. ocaml-interface (.mli) is
already covered because it reuses the ocaml map. Tests return early when a
grammar is locally unavailable; CI has all grammars.
CLAUDE.md: 1 addition, 1 deletion
@@ -161,7 +161,7 @@ Source is TypeScript in `src/`, compiled via `tsup`. The Rust native engine live
**Configuration:** All tunable behavioral constants live in `DEFAULTS` in `src/infrastructure/config.ts`, grouped by concern (`analysis`, `risk`, `search`, `display`, `community`, `structure`, `mcp`, `check`, `coChange`, `manifesto`). Users override via `.codegraphrc.json` — `mergeConfig` deep-merges recursively so partial overrides preserve sibling keys. Env vars override LLM settings (`CODEGRAPH_LLM_*`). When adding new behavioral constants, **always add them to `DEFAULTS`** and wire them through config — never introduce new hardcoded magic numbers in individual modules. Category F values (safety boundaries, standard formulas, platform concerns) are the only exception.
-**Database:** SQLite at `.codegraph/graph.db` with tables: `nodes`, `edges`, `metadata`, `embeddings`, `function_complexity`
+**Database:** SQLite at `.codegraph/graph.db` with tables: `nodes`, `edges`, `metadata`, `embeddings`, `function_complexity`, `ast_nodes` (stored `new`/`throw`/`await`/`string`/`regex` literals queryable via `codegraph ast`). Both engines must extract `ast_nodes` for every language they parse — per-language node-type maps live in `src/ast-analysis/rules/index.ts` (`AST_TYPE_MAPS`, `AST_STRING_CONFIGS`) and mirror the native `LangAstConfig` constants in `crates/codegraph-core/src/extractors/helpers.rs`. Adding a new language requires a matching entry in both.