Commit a865127

fix(parity): align WASM and native ast_nodes extraction (#1016)
* fix(parity): align WASM and native ast_nodes extraction (#1010)

Resolves #1010. Three independent divergences were causing the native engine to emit ~7,200 excess `string` AST nodes vs WASM on self-build:

1. Language coverage gap (~6,653 rows): WASM's AST_TYPE_MAPS registered only javascript/typescript/tsx; native emitted ast_nodes for 19 more languages via `walk_ast_nodes_with_config`. Mirrored every LangAstConfig from `helpers.rs` into WASM as `AST_TYPE_MAPS` + `AST_STRING_CONFIGS` entries and threaded a per-language `stopRecurseKinds` set through `createAstStoreVisitor`.

2. WASM `await` skipChildren (~500 rows): the visitor returned skipChildren for `await_expression`, so string/call children of `await import('x')` / `await fn('y')` were never walked. Native's javascript.rs explicitly recurses. Removed `await` from the skipChildren filter.

3. UTF-8 byte-length check in native (~40 rows): `crates/codegraph-core/src/extractors/javascript.rs` gated string emission on `content.len() < 2` (UTF-8 byte count). Any single non-ASCII glyph like `─` (3 bytes) was emitted. Changed to `content.chars().count() < 2` for parity with helpers.rs and JS `.length`. WASM's filter uses code-point count (`[...s].length`).

Measured parity after fix on 775 shared files (excluding files edited in this PR): 37,605 (WASM) vs 37,649 (native) = 0.12 % delta. Every kind except `string` is at 0 delta; the remaining 44-row string gap is the UTF fix still waiting on the next native binary rebuild.

New parity test in tests/engines/ast-parity.test.ts asserts ≤1 row divergence between engines for six languages (js, ts, python, rust, go, java).

* perf: hoist SKIP_KEYWORDS set to module scope (#1016)

Avoid reallocating a Set on every extractChildExpressionText call: the contents are stateless and this function runs per throw/await node during AST-store extraction.

Impact: 1 functions changed, 7 affected

* test: add ast_nodes parity fixtures for 15 more languages (#1016)

PR #1016 added AST_TYPE_MAPS entries for 16 languages beyond js/ts/python/rust/go/java but PARITY_FIXTURES only covered the original 6, leaving silent-divergence risk for languages with distinct string node types (encapsed_string, sigil, etc.).

Adds minimal fixtures for csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua, dart, zig, haskell, ocaml. Each exercises a string literal plus at least one other kind from its AST_TYPE_MAP. ocaml-interface (.mli) is already covered by reusing the ocaml map. Tests return early when a grammar is locally unavailable; CI has all grammars.
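The 0.12 % figure in the message follows directly from the two row counts it reports. A minimal sketch of that arithmetic is below; `relativeDeltaPercent` is a hypothetical helper name, and using the WASM count as the baseline is an assumption (the message does not say which count is the denominator, and either choice rounds to the same value).

```typescript
// Hypothetical helper (not from the repository): relative difference between
// two engines' row counts, as a percentage of the WASM count.
// Assumption: WASM is the baseline; the commit message does not specify.
function relativeDeltaPercent(wasmRows: number, nativeRows: number): number {
  return (Math.abs(nativeRows - wasmRows) / wasmRows) * 100;
}

const wasmRows = 37605; // from the commit message
const nativeRows = 37649; // from the commit message

const gap = nativeRows - wasmRows; // the remaining `string` row gap
const delta = relativeDeltaPercent(wasmRows, nativeRows);

console.log(gap); // 44
console.log(delta.toFixed(2)); // 0.12
```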
1 parent cc2a7e7 commit a865127

15 files changed

Lines changed: 702 additions & 58 deletions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ Source is TypeScript in `src/`, compiled via `tsup`. The Rust native engine live
 
 **Configuration:** All tunable behavioral constants live in `DEFAULTS` in `src/infrastructure/config.ts`, grouped by concern (`analysis`, `risk`, `search`, `display`, `community`, `structure`, `mcp`, `check`, `coChange`, `manifesto`). Users override via `.codegraphrc.json`; `mergeConfig` deep-merges recursively so partial overrides preserve sibling keys. Env vars override LLM settings (`CODEGRAPH_LLM_*`). When adding new behavioral constants, **always add them to `DEFAULTS`** and wire them through config — never introduce new hardcoded magic numbers in individual modules. Category F values (safety boundaries, standard formulas, platform concerns) are the only exception.
 
-**Database:** SQLite at `.codegraph/graph.db` with tables: `nodes`, `edges`, `metadata`, `embeddings`, `function_complexity`
+**Database:** SQLite at `.codegraph/graph.db` with tables: `nodes`, `edges`, `metadata`, `embeddings`, `function_complexity`, `ast_nodes` (stored `new`/`throw`/`await`/`string`/`regex` literals queryable via `codegraph ast`). Both engines must extract `ast_nodes` for every language they parse — per-language node-type maps live in `src/ast-analysis/rules/index.ts` (`AST_TYPE_MAPS`, `AST_STRING_CONFIGS`) and mirror the native `LangAstConfig` constants in `crates/codegraph-core/src/extractors/helpers.rs`. Adding a new language requires a matching entry in both.
 
 ## Test Structure
 
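The "matching entry in both" rule above is about keeping TypeScript and Rust in step, which a TypeScript-only check cannot verify. A narrower version of the same discipline can be sketched on the TypeScript side alone: every language key in `AST_TYPE_MAPS` should also appear in `AST_STRING_CONFIGS`. The helper `missingKeys` and the abbreviated stand-in maps below are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helper (not from the repository): keys present in `a` but
// absent from `b`.
function missingKeys<A, B>(a: Map<string, A>, b: Map<string, B>): string[] {
  return Array.from(a.keys()).filter((key) => !b.has(key));
}

// Abbreviated stand-ins for AST_TYPE_MAPS and AST_STRING_CONFIGS; the real
// maps hold many more languages.
const typeMaps = new Map<string, Record<string, string>>([
  ['go', { interpreted_string_literal: 'string', raw_string_literal: 'string' }],
  ['lua', { string: 'string' }],
]);
const stringConfigs = new Map<string, { quoteChars: string; stringPrefixes: string }>([
  ['go', { quoteChars: '"`', stringPrefixes: '' }],
]);

// 'lua' has a type map but no string config in this toy data.
const missing = missingKeys(typeMaps, stringConfigs);
console.log(missing); // [ 'lua' ]
```

A check like this could run in a unit test, so that a language added to one map without the other fails fast.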

crates/codegraph-core/src/extractors/javascript.rs

Lines changed: 6 additions & 1 deletion
@@ -551,7 +551,12 @@ fn walk_ast_nodes_depth(node: &Node, source: &[u8], ast_nodes: &mut Vec<AstNode>
             let content = raw
                 .trim_start_matches(|c| c == '\'' || c == '"' || c == '`')
                 .trim_end_matches(|c| c == '\'' || c == '"' || c == '`');
-            if content.len() < 2 {
+            // Count Unicode code points, not UTF-8 bytes, so the filter matches
+            // helpers.rs `build_string_node` and the WASM visitor — a single non-
+            // ASCII glyph like `─` (3 bytes / 1 code point) must be treated as one
+            // character, otherwise we emit "excess" string nodes the WASM engine
+            // skips (see parity issue #1010).
+            if content.chars().count() < 2 {
                 // Still recurse children (template_string may have nested expressions)
                 for i in 0..node.child_count() {
                     if let Some(child) = node.child(i) {
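The difference between the byte-length and code-point filters can be reproduced on the TypeScript side. The sketch below uses two hypothetical helpers, `utf8ByteLength` and `codePointCount`, as rough analogues of Rust's `str::len()` and `chars().count()`; neither name comes from the repository. It also shows that JS `.length` counts UTF-16 code units, which matches the code-point count for `─` but not for characters outside the Basic Multilingual Plane.

```typescript
// Hypothetical helper: UTF-8 byte length, roughly what Rust's `str::len()` returns.
function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

// Hypothetical helper: code-point count, equivalent to `[...s].length` and
// roughly what Rust's `chars().count()` returns for valid text.
function codePointCount(s: string): number {
  return Array.from(s).length;
}

const glyph = '─'; // U+2500, the example used in the commit
console.log(utf8ByteLength(glyph)); // 3 -> passes a `len() < 2` filter, so it was emitted
console.log(codePointCount(glyph)); // 1 -> filtered out by a code-point check
console.log(glyph.length); // 1 (UTF-16 code units)

const emoji = '😀'; // U+1F600, outside the Basic Multilingual Plane
console.log(emoji.length); // 2 (a surrogate pair)
console.log(codePointCount(emoji)); // 1
```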

src/ast-analysis/engine.ts

Lines changed: 11 additions & 1 deletion
@@ -43,7 +43,9 @@ import type {
 } from '../types.js';
 import { computeLOCMetrics, computeMaintainabilityIndex } from './metrics.js';
 import {
+  AST_STRING_CONFIGS,
   AST_TYPE_MAPS,
+  astStopRecurseKinds,
   CFG_RULES,
   COMPLEXITY_RULES,
   DATAFLOW_RULES,
@@ -458,7 +460,15 @@ function setupAstVisitor(
   for (const row of bulkNodeIdsByFile(db, relPath)) {
     nodeIdMap.set(`${row.name}|${row.kind}|${row.line}`, row.id);
   }
-  return createAstStoreVisitor(astTypeMap, symbols.definitions || [], relPath, nodeIdMap);
+  const stringConfig = AST_STRING_CONFIGS.get(langId);
+  return createAstStoreVisitor(
+    astTypeMap,
+    symbols.definitions || [],
+    relPath,
+    nodeIdMap,
+    stringConfig,
+    astStopRecurseKinds(langId),
+  );
 }
 
 /** Set up complexity visitor if any definitions need WASM complexity analysis. */
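The visitor's use of `stringConfig` is not part of this diff. As a rough illustration of what stripping prefixes and quote characters could look like, here is a minimal sketch. `stripStringDelimiters` is a hypothetical function, and the real visitor may handle edge cases (raw-string hashes, multi-character delimiters) differently.

```typescript
// Same shape as the AstStringConfig interface added in rules/index.ts.
interface AstStringConfig {
  quoteChars: string;
  stringPrefixes: string;
}

// Hypothetical helper (not the repository's implementation): drop leading
// prefix characters, then leading and trailing quote characters.
function stripStringDelimiters(raw: string, cfg: AstStringConfig): string {
  let start = 0;
  // Skip language-specific prefixes such as Python's `r`, `b`, `f`.
  while (start < raw.length && cfg.stringPrefixes.includes(raw.charAt(start))) {
    start++;
  }
  // Skip opening quote characters.
  while (start < raw.length && cfg.quoteChars.includes(raw.charAt(start))) {
    start++;
  }
  // Skip closing quote characters.
  let end = raw.length;
  while (end > start && cfg.quoteChars.includes(raw.charAt(end - 1))) {
    end--;
  }
  return raw.slice(start, end);
}

// Config values copied from PY_STRING_CONFIG and JS_STRING_CONFIG in the diff.
const pyConfig: AstStringConfig = { quoteChars: '\'"', stringPrefixes: 'rbfuRBFU' };
const jsConfig: AstStringConfig = { quoteChars: '\'"`', stringPrefixes: '' };

console.log(stripStringDelimiters('rb"abc"', pyConfig)); // abc
console.log(stripStringDelimiters('"""doc"""', pyConfig)); // doc
console.log(stripStringDelimiters('`tpl`', jsConfig)); // tpl
```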

src/ast-analysis/rules/csharp.ts

Lines changed: 8 additions & 1 deletion
@@ -200,4 +200,11 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  object_creation_expression: 'new',
+  throw_statement: 'throw',
+  throw_expression: 'throw',
+  await_expression: 'await',
+  string_literal: 'string',
+  verbatim_string_literal: 'string',
+};

src/ast-analysis/rules/go.ts

Lines changed: 4 additions & 1 deletion
@@ -181,4 +181,7 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  interpreted_string_literal: 'string',
+  raw_string_literal: 'string',
+};

src/ast-analysis/rules/index.ts

Lines changed: 181 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -73,10 +73,187 @@ export const DATAFLOW_RULES: Map<string, DataflowRulesConfig> = new Map([
7373
['ruby', ruby.dataflow],
7474
]);
7575

76-
// ─── AST Type Maps ───────────────────────────────────────────────────────
76+
// ─── AST Node Type Maps ──────────────────────────────────────────────────
77+
//
78+
// These mirror the per-language `LangAstConfig` constants in the native Rust
79+
// engine (`crates/codegraph-core/src/extractors/helpers.rs`). WASM and native
80+
// must agree on which tree-sitter node types to emit as `ast_nodes` rows.
81+
// Languages without a dedicated rules/*.ts file have their maps inlined here.
82+
83+
const JS_AST_TYPES = javascript.astTypes as Record<string, string>;
84+
const PY_AST_TYPES = python.astTypes as Record<string, string>;
85+
const GO_AST_TYPES = go.astTypes as Record<string, string>;
86+
const RS_AST_TYPES = rust.astTypes as Record<string, string>;
87+
const JAVA_AST_TYPES = java.astTypes as Record<string, string>;
88+
const CS_AST_TYPES = csharp.astTypes as Record<string, string>;
89+
const RB_AST_TYPES = ruby.astTypes as Record<string, string>;
90+
const PHP_AST_TYPES = php.astTypes as Record<string, string>;
91+
92+
const C_AST_TYPES: Record<string, string> = {
93+
string_literal: 'string',
94+
};
95+
96+
const CPP_AST_TYPES: Record<string, string> = {
97+
new_expression: 'new',
98+
throw_statement: 'throw',
99+
co_await_expression: 'await',
100+
string_literal: 'string',
101+
raw_string_literal: 'string',
102+
};
103+
104+
const KOTLIN_AST_TYPES: Record<string, string> = {
105+
throw_expression: 'throw',
106+
string_literal: 'string',
107+
};
108+
109+
const SWIFT_AST_TYPES: Record<string, string> = {
110+
throw_statement: 'throw',
111+
await_expression: 'await',
112+
string_literal: 'string',
113+
};
114+
115+
const SCALA_AST_TYPES: Record<string, string> = {
116+
object_creation_expression: 'new',
117+
throw_expression: 'throw',
118+
string_literal: 'string',
119+
};
120+
121+
const BASH_AST_TYPES: Record<string, string> = {
122+
string: 'string',
123+
expansion: 'string',
124+
};
125+
126+
const ELIXIR_AST_TYPES: Record<string, string> = {
127+
string: 'string',
128+
sigil: 'regex',
129+
};
130+
131+
const LUA_AST_TYPES: Record<string, string> = {
132+
string: 'string',
133+
};
134+
135+
const DART_AST_TYPES: Record<string, string> = {
136+
new_expression: 'new',
137+
constructor_invocation: 'new',
138+
throw_expression: 'throw',
139+
await_expression: 'await',
140+
string_literal: 'string',
141+
};
142+
143+
const ZIG_AST_TYPES: Record<string, string> = {
144+
string_literal: 'string',
145+
};
146+
147+
const HASKELL_AST_TYPES: Record<string, string> = {
148+
string: 'string',
149+
char: 'string',
150+
};
151+
152+
const OCAML_AST_TYPES: Record<string, string> = {
153+
string: 'string',
154+
};
77155

78156
export const AST_TYPE_MAPS: Map<string, Record<string, string>> = new Map([
79-
['javascript', javascript.astTypes as Record<string, string>],
80-
['typescript', javascript.astTypes as Record<string, string>],
81-
['tsx', javascript.astTypes as Record<string, string>],
157+
['javascript', JS_AST_TYPES],
158+
['typescript', JS_AST_TYPES],
159+
['tsx', JS_AST_TYPES],
160+
['python', PY_AST_TYPES],
161+
['go', GO_AST_TYPES],
162+
['rust', RS_AST_TYPES],
163+
['java', JAVA_AST_TYPES],
164+
['csharp', CS_AST_TYPES],
165+
['ruby', RB_AST_TYPES],
166+
['php', PHP_AST_TYPES],
167+
['c', C_AST_TYPES],
168+
['cpp', CPP_AST_TYPES],
169+
['kotlin', KOTLIN_AST_TYPES],
170+
['swift', SWIFT_AST_TYPES],
171+
['scala', SCALA_AST_TYPES],
172+
['bash', BASH_AST_TYPES],
173+
['elixir', ELIXIR_AST_TYPES],
174+
['lua', LUA_AST_TYPES],
175+
['dart', DART_AST_TYPES],
176+
['zig', ZIG_AST_TYPES],
177+
['haskell', HASKELL_AST_TYPES],
178+
['ocaml', OCAML_AST_TYPES],
179+
['ocaml-interface', OCAML_AST_TYPES],
180+
]);
181+
182+
// ─── Per-language string-extraction config ───────────────────────────────
183+
//
184+
// Mirrors `quote_chars` + `string_prefixes` in the native `LangAstConfig`.
185+
// Used by the AST-store visitor to strip quote characters and language-
186+
// specific prefix sigils (Python `r"..."`, C# verbatim `@"..."`, Rust raw
187+
// `r#"..."#`, etc.) when computing string content for the `name` column.
188+
189+
export interface AstStringConfig {
190+
quoteChars: string;
191+
stringPrefixes: string;
192+
}
193+
194+
const JS_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"`', stringPrefixes: '' };
195+
const PY_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"', stringPrefixes: 'rbfuRBFU' };
196+
const GO_STRING_CONFIG: AstStringConfig = { quoteChars: '"`', stringPrefixes: '' };
197+
const RS_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
198+
const JAVA_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
199+
const CS_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
200+
const RB_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"', stringPrefixes: '' };
201+
const PHP_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"', stringPrefixes: '' };
202+
const C_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
203+
const CPP_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: 'LuUR' };
204+
const KOTLIN_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
205+
const SWIFT_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
206+
const SCALA_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
207+
const BASH_STRING_CONFIG: AstStringConfig = { quoteChars: '"\'', stringPrefixes: '' };
208+
const ELIXIR_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
209+
const LUA_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"', stringPrefixes: '' };
210+
const DART_STRING_CONFIG: AstStringConfig = { quoteChars: '\'"', stringPrefixes: '' };
211+
const ZIG_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
212+
const HASKELL_STRING_CONFIG: AstStringConfig = { quoteChars: '"\'', stringPrefixes: '' };
213+
const OCAML_STRING_CONFIG: AstStringConfig = { quoteChars: '"', stringPrefixes: '' };
214+
215+
export const AST_STRING_CONFIGS: Map<string, AstStringConfig> = new Map([
216+
['javascript', JS_STRING_CONFIG],
217+
['typescript', JS_STRING_CONFIG],
218+
['tsx', JS_STRING_CONFIG],
219+
['python', PY_STRING_CONFIG],
220+
['go', GO_STRING_CONFIG],
221+
['rust', RS_STRING_CONFIG],
222+
['java', JAVA_STRING_CONFIG],
223+
['csharp', CS_STRING_CONFIG],
224+
['ruby', RB_STRING_CONFIG],
225+
['php', PHP_STRING_CONFIG],
226+
['c', C_STRING_CONFIG],
227+
['cpp', CPP_STRING_CONFIG],
228+
['kotlin', KOTLIN_STRING_CONFIG],
229+
['swift', SWIFT_STRING_CONFIG],
230+
['scala', SCALA_STRING_CONFIG],
231+
['bash', BASH_STRING_CONFIG],
232+
['elixir', ELIXIR_STRING_CONFIG],
233+
['lua', LUA_STRING_CONFIG],
234+
['dart', DART_STRING_CONFIG],
235+
['zig', ZIG_STRING_CONFIG],
236+
['haskell', HASKELL_STRING_CONFIG],
237+
['ocaml', OCAML_STRING_CONFIG],
238+
['ocaml-interface', OCAML_STRING_CONFIG],
82239
]);
240+
241+
// ─── Per-language "stop-after-collect" kinds ─────────────────────────────
242+
//
243+
// Mirrors the subtle difference between the native JS walker
244+
// (`extractors/javascript.rs::walk_ast_nodes_depth`) — which *returns* after
245+
// collecting `new_expression` and `throw_statement` to avoid double-counting
246+
// the wrapped expression — and the generic walker (`helpers.rs::walk_ast_
247+
// nodes_with_config_depth`), which always recurses. For WASM/native parity
248+
// the JS family must skip recursion on `new` and `throw`; every other
249+
// language recurses normally.
250+
251+
const JS_STOP_RECURSE: ReadonlySet<string> = new Set(['new', 'throw']);
252+
const EMPTY_STOP_RECURSE: ReadonlySet<string> = new Set();
253+
254+
export function astStopRecurseKinds(langId: string): ReadonlySet<string> {
255+
if (langId === 'javascript' || langId === 'typescript' || langId === 'tsx') {
256+
return JS_STOP_RECURSE;
257+
}
258+
return EMPTY_STOP_RECURSE;
259+
}
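The effect of the stop-recurse set is easiest to see on a toy tree. The sketch below is not the repository's visitor: `ToyNode`, `collectKinds`, and the small `toyTypes` map are hypothetical stand-ins that only model the behavior described in the comment above, where the walker stops descending once it has collected a kind listed in the set.

```typescript
// Hypothetical stand-in for a tree-sitter node.
interface ToyNode {
  type: string;
  children: ToyNode[];
}

// Hypothetical walker: collect mapped kinds in pre-order, and do not descend
// below a node whose kind is in `stopKinds`.
function collectKinds(
  node: ToyNode,
  typeMap: Record<string, string>,
  stopKinds: ReadonlySet<string>,
  out: string[] = [],
): string[] {
  const kind: string | undefined = typeMap[node.type];
  if (kind !== undefined) {
    out.push(kind);
    if (stopKinds.has(kind)) return out;
  }
  for (const child of node.children) {
    collectKinds(child, typeMap, stopKinds, out);
  }
  return out;
}

// Toy type map for illustration only; not the real javascript.astTypes.
const toyTypes: Record<string, string> = {
  new_expression: 'new',
  throw_statement: 'throw',
  string: 'string',
};

// Rough shape of `throw new Error('boom')`.
const tree: ToyNode = {
  type: 'throw_statement',
  children: [
    {
      type: 'new_expression',
      children: [{ type: 'arguments', children: [{ type: 'string', children: [] }] }],
    },
  ],
};

const jsFamily = collectKinds(tree, toyTypes, new Set(['new', 'throw']));
const generic = collectKinds(tree, toyTypes, new Set<string>());

console.log(jsFamily); // [ 'throw' ]
console.log(generic); // [ 'throw', 'new', 'string' ]
```

With the JS-family set the walk yields one row for the statement; with the empty set it yields three, which is the kind of per-language difference the two engines have to agree on.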

src/ast-analysis/rules/java.ts

Lines changed: 5 additions & 1 deletion
@@ -174,4 +174,8 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  object_creation_expression: 'new',
+  throw_statement: 'throw',
+  string_literal: 'string',
+};

src/ast-analysis/rules/php.ts

Lines changed: 6 additions & 1 deletion
@@ -218,4 +218,9 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  object_creation_expression: 'new',
+  throw_expression: 'throw',
+  string: 'string',
+  encapsed_string: 'string',
+};

src/ast-analysis/rules/python.ts

Lines changed: 5 additions & 1 deletion
@@ -195,4 +195,8 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  raise_statement: 'throw',
+  await: 'await',
+  string: 'string',
+};

src/ast-analysis/rules/ruby.ts

Lines changed: 4 additions & 1 deletion
@@ -203,4 +203,7 @@ export const dataflow: DataflowRulesConfig = makeDataflowRules({
 
 // ─── AST Node Types ───────────────────────────────────────────────────────
 
-export const astTypes: Record<string, string> | null = null;
+export const astTypes: Record<string, string> | null = {
+  string: 'string',
+  regex: 'regex',
+};
