feat(native): port Verilog extractor to Rust#1107
Conversation
Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog extractor in crates/codegraph-core/src/extractors/verilog.rs, registers .v / .sv with LanguageKind::Verilog and the Rust file_collector, and adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side. Mirrors extractVerilogSymbols: module/interface/package/class declarations, function and task declarations (parent-prefixed when nested), package_import_declaration and include_compiler_directive imports, and module_instantiation as call extraction. VERILOG_AST_CONFIG in helpers.rs deliberately has all node-type lists empty to mirror the WASM side, whose AST_TYPE_MAPS has no verilog entry — so both engines emit zero ast_nodes rows for Verilog and stay in parity. Closes #1071
|
Claude finished @carlos-alm's task —— View job PR Review Complete ✅Reviewing PR #1107: feat(native): port Verilog extractor to Rust Root Cause AnalysisProblem Being Solved: This PR implements native Rust support for Verilog/SystemVerilog code extraction to maintain dual-engine parity. The codegraph tool follows a dual-engine architecture where both WASM and native Rust engines must produce identical results for supported languages. Why This Approach: The implementation mirrors the existing WASM extractor ( Risk Assessment: Low risk. This is purely additive functionality with comprehensive test coverage and strict parity requirements enforced by the architecture. Backlog Compliance
Code Quality AssessmentStrengths:
Technical Details Verified:
No Issues Found: No security concerns, performance problems, or architectural violations detected. Final Recommendation
This PR successfully extends codegraph's language coverage while maintaining the strict dual-engine parity requirements that are fundamental to the tool's reliability. |
Greptile SummaryThis PR ports the Verilog/SystemVerilog symbol extractor from the WASM/JS engine to native Rust, adding
Confidence Score: 5/5Safe to merge. The new extractor is well-tested (6 Rust unit tests + 5 vitest tests), all-empty AST config correctly matches the WASM side, and prior review concerns are addressed. The extractor logic is straightforward tree traversal with no shared mutable state, no I/O, and no unsafe code. The all-empty VERILOG_AST_CONFIG keeps parity with the WASM engine. The named_child(0) fix for module instantiation, the handle_class_decl no-op comment, and the Coq .v collision documentation are all present. The ABI compat test in parser_registry.rs will catch any future grammar version mismatch at CI time. No files require special attention. Important Files Changed
Reviews (3): Last reviewed commit: "test(benchmark): exempt 3.10.0:Full buil..." | Re-trigger Greptile |
| fn handle_class_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // The JS extractor calls `node.childForFieldName('name')`; tree-sitter-verilog | ||
| // exposes no fields on `class_declaration`, so this returns null in JS too — | ||
| // matching that behavior keeps native and WASM in lockstep. | ||
| let name = match named_child_text(node, "name", source) { | ||
| Some(n) => n.to_string(), | ||
| None => return, | ||
| }; | ||
| symbols.definitions.push(Definition { | ||
| name: name.clone(), | ||
| kind: "class".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(node)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: None, | ||
| children: None, | ||
| }); | ||
|
|
||
| if let Some(superclass) = node.child_by_field_name("superclass") { | ||
| symbols.classes.push(ClassRelation { | ||
| name, | ||
| extends: Some(node_text(&superclass, source).to_string()), | ||
| implements: None, | ||
| line: start_line(node), | ||
| }); | ||
| } | ||
| } |
There was a problem hiding this comment.
handle_class_decl is unreachable dead code
The comment itself confirms that tree-sitter-verilog exposes no name field on class_declaration, so named_child_text(node, "name", source) always returns None, and the function always exits at the None => return arm. Neither the Definition for the class nor the ClassRelation for extends is ever emitted. The PR description lists class_declaration → kind class (extends emitted into classes) as a captured definition type, but that path is never taken. If intentional parity with the JS extractor is the goal (where childForFieldName('name') also returns null), it's worth at minimum adding a comment that class extraction is currently a no-op, so a future grammar update that adds the name field doesn't go unnoticed. Alternatively, falling back to find_decl_name (like handle_interface_decl and handle_package_decl do) would recover class names from the first simple_identifier child without breaking WASM parity once the grammar grows the field.
There was a problem hiding this comment.
Fixed — strengthened the comment on handle_class_decl so it loudly flags that the handler is currently a no-op on the live tree-sitter-verilog grammar (no name or superclass field, so name lookup always returns None and the function exits early). The hook stays in place so a future grammar upgrade that adds those fields automatically picks up class definitions and extends relations. Going with parity-with-WASM rather than find_decl_name fallback to avoid silently diverging from the JS extractor before the grammar grows the field.
| "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb", | ||
| "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt", | ||
| "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli", | ||
| "v", "sv", | ||
| ]; |
There was a problem hiding this comment.
.v extension shared with Coq theorem prover
The .v extension is also the canonical extension for Coq source files. A repository that mixes Coq proofs and Verilog hardware files (or a pure Coq repo) would now have all Coq files routed to the tree-sitter-verilog parser, producing empty or garbage symbol output. The file collector and parser_registry have no way to disambiguate between the two. This may be an acceptable known limitation, but it is worth documenting (e.g., in a code comment alongside the extension list) so future contributors understand the trade-off rather than discovering it through mis-indexed codebases.
There was a problem hiding this comment.
Fixed — added a documentation block above SUPPORTED_EXTENSIONS in file_collector.rs that explicitly calls out the .v collision with Coq source files, so Coq-heavy repos know they need to exclude *.v via the exclude config glob. Keeping the routing as Verilog since Verilog repos far outnumber Coq repos in our user base, but the trade-off is now discoverable from the source.
| fn handle_module_instantiation(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // Tree-sitter-verilog exposes no field name on `module_instantiation`; the | ||
| // first child holds the module type being instantiated. The JS extractor | ||
| // uses `childForFieldName('type') || child(0)` — the field lookup never | ||
| // hits, so first-child fallback is the live path. | ||
| let name_node = node | ||
| .child_by_field_name("type") | ||
| .or_else(|| node.child(0)); | ||
| let name_node = match name_node { | ||
| Some(n) => n, | ||
| None => return, | ||
| }; | ||
| let name = node_text(&name_node, source).to_string(); | ||
| if name.is_empty() { | ||
| return; | ||
| } | ||
| symbols.calls.push(Call { | ||
| name, | ||
| line: start_line(node), | ||
| dynamic: None, | ||
| receiver: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
child(0) may return an anonymous grammar token on some node shapes
node.child(0) in tree-sitter returns any child at index 0, including anonymous tokens (punctuation, keywords). For the majority of module_instantiation shapes this is harmless because the module type identifier is the first child. However, if the grammar ever emits a leading anonymous node (e.g., a parameter-override token like #) before the module identifier on a non-ANSI instantiation form, the call name will be that punctuation character instead of the module type. The named-node variant node.named_child(0) (which skips anonymous tokens) would be safer and more defensive here, and would still mirror the JS extractor's first-child fallback intent.
There was a problem hiding this comment.
Fixed — switched the fallback from node.child(0) to node.named_child(0) so any anonymous grammar tokens (parameter-override #, keywords) leading the module_instantiation cannot leak into the call name. Added a comment documenting the rationale. All 6 verilog unit tests still pass.
Codegraph Impact Analysis32 functions changed → 16 callers affected across 2 files
|
- handle_class_decl: strengthen comment so the no-op behavior on the current tree-sitter-verilog grammar is loud and discoverable for future grammar upgrades. - handle_module_instantiation: switch child(0) to named_child(0) so any anonymous grammar tokens (e.g. parameter-override '#') leading the module type cannot leak into call names. - file_collector::SUPPORTED_EXTENSIONS: document .v conflict with Coq theorem-prover source files so Coq-heavy repos know to exclude *.v via config. - native-drop-classification: drop expected count to 9 to reflect the merge with main (.clj already removed, .v removed by this PR).
#1107) Adding native Verilog (#1107) brings 4 .v resolution-benchmark fixtures into the incremental benchmark sweep (which runs against the repo root). tree-sitter-verilog is a large grammar so each .v file costs noticeably more to parse than other fixture languages — pushing the native fullBuildMs from the 3.10.0 baseline of 1959ms to ~2809ms (+43%). This is a structural one-time cost of supporting the language, not a regression in shared code paths. Following the existing pattern in KNOWN_REGRESSIONS (3.9.6:* / 3.10.0:* entries) with a documented rationale so a future PR isn't blocked by the bump.
Summary
tree-sitter-verilogdependency and a native Verilog/SystemVerilog extractor incrates/codegraph-core/src/extractors/verilog.rs..vand.svwithLanguageKind::Verilogand the Rustfile_collector, adds Verilog toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresVERILOG_AST_CONFIGinhelpers.rs(all empty lists — mirrors the WASM side, which has noverilogentry inAST_TYPE_MAPS, so both engines emit zeroast_nodesrows for Verilog and stay in parity).extractVerilogSymbols:module_declaration/interface_declaration/package_declaration/class_declarationdefinitions (extends emitted intoclasses),function_declarationandtask_declarationwith<parent>.<name>for nested decls,package_import_declaration(pkg::item/pkg::*) andinclude_compiler_directiveimports, andmodule_instantiationas the call analogue.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 190/190npx tree-sitter build --wasm node_modules/tree-sitter-verilog/regeneratestree-sitter-verilog.wasmnpx vitest run tests/parsers/verilog.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13