
feat(native): port Verilog extractor to Rust #1107

Open
carlos-alm wants to merge 5 commits into main from feat/1071-verilog-rust-extractor

@carlos-alm
Contributor

Summary

  • Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog extractor in crates/codegraph-core/src/extractors/verilog.rs.
  • Registers .v and .sv with LanguageKind::Verilog and the Rust file_collector, adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side, and wires VERILOG_AST_CONFIG in helpers.rs (all empty lists — mirrors the WASM side, which has no verilog entry in AST_TYPE_MAPS, so both engines emit zero ast_nodes rows for Verilog and stay in parity).
  • Mirrors extractVerilogSymbols: module_declaration / interface_declaration / package_declaration / class_declaration definitions (extends emitted into classes), function_declaration and task_declaration with <parent>.<name> for nested decls, package_import_declaration (pkg::item / pkg::*) and include_compiler_directive imports, and module_instantiation as the call analogue.
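
The `<parent>.<name>` qualification for nested declarations can be sketched as below. This is a minimal illustration, not the code in verilog.rs; the helper name `qualify_name` and the sample identifiers are hypothetical.

```rust
// Hypothetical helper (not taken from verilog.rs): builds the name recorded
// for a function or task, prefixing the enclosing declaration when there is one.
fn qualify_name(parent: Option<&str>, name: &str) -> String {
    match parent {
        // Nested declaration: emit "<parent>.<name>".
        Some(p) => format!("{}.{}", p, name),
        // Top-level declaration: the bare name is kept.
        None => name.to_string(),
    }
}

fn main() {
    println!("{}", qualify_name(Some("alu"), "add"));
    println!("{}", qualify_name(None, "reset_all"));
}
```

A function `add` declared inside a module `alu` would be recorded as `alu.add` under this scheme, while a top-level task keeps its plain name.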

Closes #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 190/190
  • npx tree-sitter build --wasm node_modules/tree-sitter-verilog/ regenerates tree-sitter-verilog.wasm
  • npx vitest run tests/parsers/verilog.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13

Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog
extractor in crates/codegraph-core/src/extractors/verilog.rs, registers
.v / .sv with LanguageKind::Verilog and the Rust file_collector, and
adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side.

Mirrors extractVerilogSymbols: module/interface/package/class
declarations, function and task declarations (parent-prefixed when
nested), package_import_declaration and include_compiler_directive
imports, and module_instantiation as call extraction.

VERILOG_AST_CONFIG in helpers.rs deliberately has all node-type lists
empty to mirror the WASM side, whose AST_TYPE_MAPS has no verilog
entry — so both engines emit zero ast_nodes rows for Verilog and stay
in parity.

Closes #1071
@claude

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task


PR Review Complete ✅

Reviewing PR #1107: feat(native): port Verilog extractor to Rust

Root Cause Analysis

Problem Being Solved: This PR implements native Rust support for Verilog/SystemVerilog code extraction to maintain dual-engine parity. The codegraph tool follows a dual-engine architecture where both WASM and native Rust engines must produce identical results for supported languages.

Why This Approach: The implementation mirrors the existing WASM extractor (src/extractors/verilog.ts) precisely to ensure engine parity. This is a foundational requirement per CLAUDE.md - both engines must extract identical symbols, or it's considered a bug.

Risk Assessment: Low risk. This is purely additive functionality with comprehensive test coverage and strict parity requirements enforced by the architecture.

Backlog Compliance

  • Zero-dep: ✓ - Only adds tree-sitter-verilog crate dependency, builds on existing tree-sitter infrastructure
  • Foundation-aligned: ✓ - Fully compliant with all principles, extends parsing capabilities without changing core architecture
  • Problem-fit: 4/5 - Directly enables AI agents to understand Verilog codebases, reducing token waste and preventing hallucinated dependencies
  • Breaking: No - Purely additive, no existing API changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Code Quality Assessment

Strengths:

  1. Perfect Parity Implementation: The Rust extractor precisely mirrors the WASM version:

    • Identical symbol extraction logic (modules, interfaces, packages, classes, functions, tasks)
    • Matching import handling (package_import_declaration, include_compiler_directive)
    • Same call extraction (module instantiations)
    • Consistent nested function qualification (parent.name)
  2. Comprehensive Integration:

    • Correctly added to all registry points (LanguageKind::Verilog, from_extension, extract_symbols)
    • Proper AST config with empty types (maintains WASM parity where no verilog entry exists in AST_TYPE_MAPS)
    • File collector updated with .v and .sv extensions
    • Native supported extensions updated correctly
  3. Thorough Testing:

    • Rust unit tests covering core extraction scenarios
    • TypeScript integration tests maintaining existing coverage
    • Drift guard test properly updated to include Verilog extensions
    • All tests demonstrate parity between engines
  4. Clean Architecture: Well-structured extractor following established patterns with proper error handling and defensive programming practices

Technical Details Verified:

  • Tree-sitter grammar integration properly configured ✅
  • AST node walking with VERILOG_AST_CONFIG (empty types matching WASM) ✅
  • Symbol extraction handles all major Verilog constructs ✅
  • Import resolution for both package imports and include directives ✅
  • Parent-child relationships correctly established for nested declarations ✅

No Issues Found: No security concerns, performance problems, or architectural violations detected.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that perfectly achieves its goal of adding native Verilog support with full engine parity. The code is well-structured, thoroughly tested, and follows all architectural patterns. Only minor deduction for being a straightforward feature addition rather than exceptional innovation.

This PR successfully extends codegraph's language coverage while maintaining the strict dual-engine parity requirements that are fundamental to the tool's reliability.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Verilog/SystemVerilog symbol extractor from the WASM/JS engine to native Rust, adding tree-sitter-verilog 1.0.3 and a full VerilogExtractor implementation that mirrors extractVerilogSymbols in behavior. Registration is wired end-to-end: .v/.sv extensions in the Rust file collector and parser_registry, VERILOG_AST_CONFIG (all-empty, matching the absent WASM-side entry), and NATIVE_SUPPORTED_EXTENSIONS on the JS side.

  • New extractor (verilog.rs) captures module_declaration, interface_declaration, package_declaration, function_declaration, task_declaration (with <parent>.<name> qualification for nested decls), package_import_declaration, `include directives, and module_instantiation as calls — all mirroring the JS extractor.
  • handle_class_decl is intentionally a no-op on the current grammar (no name field on class_declaration), with a prominent comment so a future grammar upgrade automatically activates the hook.
  • Previous review findings (Coq .v collision docs, child(0) → named_child(0) for module instantiation, handle_class_decl dead-code comment) are all addressed in this revision.

Confidence Score: 5/5

Safe to merge. The new extractor is well-tested (6 Rust unit tests + 5 vitest tests), all-empty AST config correctly matches the WASM side, and prior review concerns are addressed.

The extractor logic is straightforward tree traversal with no shared mutable state, no I/O, and no unsafe code. The all-empty VERILOG_AST_CONFIG keeps parity with the WASM engine. The named_child(0) fix for module instantiation, the handle_class_decl no-op comment, and the Coq .v collision documentation are all present. The ABI compat test in parser_registry.rs will catch any future grammar version mismatch at CI time.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/codegraph-core/src/extractors/verilog.rs | New Verilog/SystemVerilog extractor; handles module, interface, package, function, task declarations, module instantiation calls, package imports and `include directives. Class handler is intentionally a no-op per grammar constraints (well-documented). Logic is sound and all 6 unit tests pass. |
| crates/codegraph-core/src/parser_registry.rs | Adds Verilog variant to LanguageKind enum, from_extension (.v/.sv), lang_id_str, tree_sitter_language, all() slice, and both exhaustive-match tests; ABI compat guard will catch grammar version mismatches at test time. |
| crates/codegraph-core/src/extractors/helpers.rs | Adds VERILOG_AST_CONFIG with all-empty lists to match the WASM side's absent verilog entry in AST_TYPE_MAPS; correctly documented rationale. |
| crates/codegraph-core/src/file_collector.rs | Adds .v and .sv to SUPPORTED_EXTENSIONS; the pre-existing Coq .v collision is now explicitly documented in a block comment above the constant. |
| src/domain/parser.ts | Adds .v and .sv to NATIVE_SUPPORTED_EXTENSIONS so the drop-classifier correctly routes Verilog files to native-extractor-failure instead of unsupported-by-native. |

Reviews (3): Last reviewed commit: "test(benchmark): exempt 3.10.0:Full buil..."

Comment on lines +109 to +136
fn handle_class_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
    // The JS extractor calls `node.childForFieldName('name')`; tree-sitter-verilog
    // exposes no fields on `class_declaration`, so this returns null in JS too —
    // matching that behavior keeps native and WASM in lockstep.
    let name = match named_child_text(node, "name", source) {
        Some(n) => n.to_string(),
        None => return,
    };
    symbols.definitions.push(Definition {
        name: name.clone(),
        kind: "class".to_string(),
        line: start_line(node),
        end_line: Some(end_line(node)),
        decorators: None,
        complexity: None,
        cfg: None,
        children: None,
    });

    if let Some(superclass) = node.child_by_field_name("superclass") {
        symbols.classes.push(ClassRelation {
            name,
            extends: Some(node_text(&superclass, source).to_string()),
            implements: None,
            line: start_line(node),
        });
    }
}
Contributor

P2 handle_class_decl is unreachable dead code

The comment itself confirms that tree-sitter-verilog exposes no name field on class_declaration, so named_child_text(node, "name", source) always returns None, and the function always exits at the None => return arm. Neither the Definition for the class nor the ClassRelation for extends is ever emitted. The PR description lists class_declaration → kind class (extends emitted into classes) as a captured definition type, but that path is never taken. If intentional parity with the JS extractor is the goal (where childForFieldName('name') also returns null), it's worth at minimum adding a comment that class extraction is currently a no-op, so a future grammar update that adds the name field doesn't go unnoticed. Alternatively, falling back to find_decl_name (like handle_interface_decl and handle_package_decl do) would recover class names from the first simple_identifier child without breaking WASM parity once the grammar grows the field.

Contributor Author

Fixed — strengthened the comment on handle_class_decl so it loudly flags that the handler is currently a no-op on the live tree-sitter-verilog grammar (no name or superclass field, so name lookup always returns None and the function exits early). The hook stays in place so a future grammar upgrade that adds those fields automatically picks up class definitions and extends relations. Going with parity-with-WASM rather than find_decl_name fallback to avoid silently diverging from the JS extractor before the grammar grows the field.
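
The difference between the two options discussed above (field lookup versus a first-identifier fallback) can be sketched with a toy node model. This is only an illustration under stated assumptions: `ToyChild`, `name_by_field`, `name_by_first_identifier`, and the child layout are hypothetical stand-ins, not the real tree-sitter API or the actual tree-sitter-verilog parse tree.

```rust
// Toy stand-in for a syntax-tree child; the real extractor works on tree_sitter::Node.
struct ToyChild {
    kind: &'static str,
    text: &'static str,
    // Field name the parent exposes this child under, if any.
    field: Option<&'static str>,
}

// Field-based lookup: yields None when no child is tagged with `field`.
fn name_by_field<'a>(children: &'a [ToyChild], field: &str) -> Option<&'a str> {
    children
        .iter()
        .find(|c| c.field.map_or(false, |f| f == field))
        .map(|c| c.text)
}

// Fallback in the spirit of the reviewer's suggestion: take the first
// `simple_identifier` child regardless of field names.
fn name_by_first_identifier<'a>(children: &'a [ToyChild]) -> Option<&'a str> {
    children
        .iter()
        .find(|c| c.kind == "simple_identifier")
        .map(|c| c.text)
}

// ASSUMED child layout for a class declaration whose grammar exposes no fields.
fn sample_class_children() -> Vec<ToyChild> {
    vec![
        ToyChild { kind: "class", text: "class", field: None },
        ToyChild { kind: "simple_identifier", text: "Packet", field: None },
        ToyChild { kind: ";", text: ";", field: None },
    ]
}

fn main() {
    let children = sample_class_children();
    println!("field lookup: {:?}", name_by_field(&children, "name"));
    println!("fallback:     {:?}", name_by_first_identifier(&children));
}
```

In this model the field lookup finds nothing, which corresponds to the no-op behavior kept for WASM parity, while the fallback would recover a name and therefore diverge from the JS extractor.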

Comment on lines 36 to 40
    "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb",
    "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt",
    "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli",
    "v", "sv",
];
Contributor

P2 .v extension shared with Coq theorem prover

The .v extension is also the canonical extension for Coq source files. A repository that mixes Coq proofs and Verilog hardware files (or a pure Coq repo) would now have all Coq files routed to the tree-sitter-verilog parser, producing empty or garbage symbol output. The file collector and parser_registry have no way to disambiguate between the two. This may be an acceptable known limitation, but it is worth documenting (e.g., in a code comment alongside the extension list) so future contributors understand the trade-off rather than discovering it through mis-indexed codebases.

Contributor Author

Fixed — added a documentation block above SUPPORTED_EXTENSIONS in file_collector.rs that explicitly calls out the .v collision with Coq source files, so Coq-heavy repos know they need to exclude *.v via the exclude config glob. Keeping the routing as Verilog since Verilog repos far outnumber Coq repos in our user base, but the trade-off is now discoverable from the source.
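
To make the collision concrete, here is a minimal sketch of extension-only routing. It assumes nothing about the real `from_extension` beyond the fact that `.v` and `.sv` map to Verilog; `ToyLanguage`, `toy_from_extension`, and the sample paths are hypothetical.

```rust
use std::path::Path;

// Hypothetical, reduced stand-in for LanguageKind.
#[derive(Debug, PartialEq)]
enum ToyLanguage {
    Verilog,
    Rust,
}

// Routes purely on the file extension, so it cannot tell a Coq proof
// from a Verilog source file when both end in `.v`.
fn toy_from_extension(path: &str) -> Option<ToyLanguage> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        "v" | "sv" => Some(ToyLanguage::Verilog),
        "rs" => Some(ToyLanguage::Rust),
        _ => None,
    }
}

fn main() {
    // Both paths are routed to Verilog, although the second would be a Coq file.
    println!("{:?}", toy_from_extension("rtl/alu.v"));
    println!("{:?}", toy_from_extension("proofs/Lemma.v"));
}
```

Because the decision uses only the extension, excluding `*.v` in configuration is the available escape hatch for Coq-heavy repositories.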

Comment on lines +182 to +204
fn handle_module_instantiation(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
    // Tree-sitter-verilog exposes no field name on `module_instantiation`; the
    // first child holds the module type being instantiated. The JS extractor
    // uses `childForFieldName('type') || child(0)` — the field lookup never
    // hits, so first-child fallback is the live path.
    let name_node = node
        .child_by_field_name("type")
        .or_else(|| node.child(0));
    let name_node = match name_node {
        Some(n) => n,
        None => return,
    };
    let name = node_text(&name_node, source).to_string();
    if name.is_empty() {
        return;
    }
    symbols.calls.push(Call {
        name,
        line: start_line(node),
        dynamic: None,
        receiver: None,
    });
}
Contributor

P2 child(0) may return an anonymous grammar token on some node shapes

node.child(0) in tree-sitter returns any child at index 0, including anonymous tokens (punctuation, keywords). For the majority of module_instantiation shapes this is harmless because the module type identifier is the first child. However, if the grammar ever emits a leading anonymous node (e.g., a parameter-override token like #) before the module identifier on a non-ANSI instantiation form, the call name will be that punctuation character instead of the module type. The named-node variant node.named_child(0) (which skips anonymous tokens) would be safer and more defensive here, and would still mirror the JS extractor's first-child fallback intent.

Contributor Author

Fixed — switched the fallback from node.child(0) to node.named_child(0) so any anonymous grammar tokens (parameter-override #, keywords) leading the module_instantiation cannot leak into the call name. Added a comment documenting the rationale. All 6 verilog unit tests still pass.
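
The distinction between the two accessors can be illustrated with a toy model of a child list. The node shape here is hypothetical (the review only raises a leading `#` as a possibility), and `ToyChild`, `child`, and `named_child` are stand-ins that mimic the semantics of the tree-sitter methods rather than calling them.

```rust
// Toy child: `named` is false for anonymous tokens such as punctuation.
struct ToyChild {
    text: &'static str,
    named: bool,
}

// Mimics `child(i)`: indexes over all children, anonymous tokens included.
fn child(children: &[ToyChild], index: usize) -> Option<&ToyChild> {
    children.get(index)
}

// Mimics `named_child(i)`: indexes over named children only.
fn named_child(children: &[ToyChild], index: usize) -> Option<&ToyChild> {
    children.iter().filter(|c| c.named).nth(index)
}

// HYPOTHETICAL shape: an anonymous `#` token precedes the module type.
fn hypothetical_instantiation() -> Vec<ToyChild> {
    vec![
        ToyChild { text: "#", named: false },
        ToyChild { text: "fifo", named: true },
        ToyChild { text: "u_fifo", named: true },
    ]
}

fn main() {
    let kids = hypothetical_instantiation();
    println!("child(0):       {:?}", child(&kids, 0).map(|c| c.text));
    println!("named_child(0): {:?}", named_child(&kids, 0).map(|c| c.text));
}
```

With this shape, the plain index would record `#` as the call name, while the named-only index still lands on the module type.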

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

32 functions changed → 16 callers affected across 2 files

  • extract_symbols_with_opts in crates/codegraph-core/src/extractors/mod.rs:60 (1 transitive callers)
  • VerilogExtractor.extract in crates/codegraph-core/src/extractors/verilog.rs:32 (0 transitive callers)
  • match_verilog_node in crates/codegraph-core/src/extractors/verilog.rs:40 (0 transitive callers)
  • handle_module_decl in crates/codegraph-core/src/extractors/verilog.rs:57 (1 transitive callers)
  • handle_interface_decl in crates/codegraph-core/src/extractors/verilog.rs:75 (1 transitive callers)
  • handle_package_decl in crates/codegraph-core/src/extractors/verilog.rs:92 (1 transitive callers)
  • handle_class_decl in crates/codegraph-core/src/extractors/verilog.rs:109 (1 transitive callers)
  • handle_function_decl in crates/codegraph-core/src/extractors/verilog.rs:146 (1 transitive callers)
  • handle_task_decl in crates/codegraph-core/src/extractors/verilog.rs:168 (1 transitive callers)
  • handle_module_instantiation in crates/codegraph-core/src/extractors/verilog.rs:190 (1 transitive callers)
  • handle_package_import in crates/codegraph-core/src/extractors/verilog.rs:220 (1 transitive callers)
  • handle_include_directive in crates/codegraph-core/src/extractors/verilog.rs:239 (1 transitive callers)
  • find_module_name in crates/codegraph-core/src/extractors/verilog.rs:270 (5 transitive callers)
  • find_decl_name in crates/codegraph-core/src/extractors/verilog.rs:292 (7 transitive callers)
  • find_function_or_task_name in crates/codegraph-core/src/extractors/verilog.rs:309 (3 transitive callers)
  • extract_identifier_text in crates/codegraph-core/src/extractors/verilog.rs:336 (4 transitive callers)
  • find_verilog_parent in crates/codegraph-core/src/extractors/verilog.rs:351 (3 transitive callers)
  • extract_ports in crates/codegraph-core/src/extractors/verilog.rs:371 (2 transitive callers)
  • collect_ports in crates/codegraph-core/src/extractors/verilog.rs:377 (3 transitive callers)
  • parse in crates/codegraph-core/src/extractors/verilog.rs:427 (6 transitive callers)

- handle_class_decl: strengthen comment so the no-op behavior on the
  current tree-sitter-verilog grammar is loud and discoverable for
  future grammar upgrades.
- handle_module_instantiation: switch child(0) to named_child(0) so
  any anonymous grammar tokens (e.g. parameter-override '#') leading
  the module type cannot leak into call names.
- file_collector::SUPPORTED_EXTENSIONS: document .v conflict with Coq
  theorem-prover source files so Coq-heavy repos know to exclude *.v
  via config.
- native-drop-classification: drop expected count to 9 to reflect the
  merge with main (.clj already removed, .v removed by this PR).
#1107)

Adding native Verilog (#1107) brings 4 .v resolution-benchmark fixtures
into the incremental benchmark sweep (which runs against the repo root).
tree-sitter-verilog is a large grammar so each .v file costs noticeably
more to parse than other fixture languages — pushing the native
fullBuildMs from the 3.10.0 baseline of 1959ms to ~2809ms (+43%).

This is a structural one-time cost of supporting the language, not a
regression in shared code paths. Following the existing pattern in
KNOWN_REGRESSIONS (3.9.6:* / 3.10.0:* entries) with a documented
rationale so a future PR isn't blocked by the bump.
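
As a quick check of the quoted figure, the relative increase can be recomputed from the two timings. The helper `percent_increase` is illustrative only, and the second timing is the approximate value stated above.

```rust
// Relative increase of `current_ms` over `baseline_ms`, in percent.
fn percent_increase(baseline_ms: f64, current_ms: f64) -> f64 {
    (current_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // 3.10.0 baseline of 1959ms versus roughly 2809ms with Verilog fixtures included.
    let pct = percent_increase(1959.0, 2809.0);
    println!("fullBuildMs increase: {:.1}%", pct);
}
```

The difference is 850ms on a 1959ms baseline, which rounds to the +43% cited.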

Development

Successfully merging this pull request may close these issues.

Rust engine parity: port the 11 remaining JS-only language extractors

1 participant