feat(native): port Erlang extractor to Rust #1103

Merged
carlos-alm merged 11 commits into main from feat/1071-erlang-rust-extractor
May 13, 2026

@carlos-alm
Contributor

Summary

  • Adds tree-sitter-erlang dependency and a native Erlang extractor in crates/codegraph-core/src/extractors/erlang.rs.
  • Registers .erl/.hrl with LanguageKind::Erlang and the Rust file_collector, removes Erlang from the WASM-only drop list, and wires ERLANG_AST_CONFIG (string literals) on both the native and JS sides.
  • Mirrors extractErlangSymbols: module declarations as module, function clauses (deduplicated per arity) as function, records as record, type aliases/opaque as type, macros as variable, plus -include/-include_lib/-import directives and module:function/local call extraction.
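
The extension registration described above can be pictured with a minimal sketch. It assumes a one-variant `LanguageKind` enum and a `language_for` helper; both are hypothetical stand-ins for illustration, not the actual definitions in parser_registry.rs or file_collector.rs.

```rust
/// Hypothetical, reduced stand-in for the registry's language enum.
/// The real enum covers many more languages.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LanguageKind {
    Erlang,
}

/// Map a file path to a language by its extension (assumed helper name).
/// Both Erlang source files (.erl) and header files (.hrl) map to Erlang.
pub fn language_for(path: &str) -> Option<LanguageKind> {
    // `extension()` returns None for files without an extension, e.g. "Makefile".
    match std::path::Path::new(path).extension()?.to_str()? {
        "erl" | "hrl" => Some(LanguageKind::Erlang),
        _ => None,
    }
}
```

Handling `.hrl` alongside `.erl` matters because records and macros are commonly declared in header files.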

Closes #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 191/191 (7 new erlang extractor tests)
  • npx tree-sitter build --wasm regenerates tree-sitter-erlang.wasm
  • npx vitest run tests/parsers/erlang.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13

Adds tree-sitter-erlang dependency and native extractor matching
the WASM-side behavior for Erlang symbol, import, and call extraction.

Part of #1071
@claude

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task


Root Cause Analysis

Problem Being Solved: This PR ports the Erlang extractor from WASM-only to native Rust implementation, achieving engine parity. Currently, Erlang files are only parsed by the WASM engine, causing performance overhead and engine inconsistencies.

Why This Approach: Following the established dual-engine architecture pattern used for all other supported languages. The native Rust implementation provides better performance and eliminates the need for WASM fallback on Erlang files.

Risk Assessment: Low risk - this is additive functionality following existing patterns. The implementation mirrors the existing WASM extractErlangSymbols function and includes comprehensive test coverage.

Backlog Compliance

  • Zero-dep: ✓ - Only adds tree-sitter-erlang = "0.16" as a parsing dependency, which follows the existing pattern for all other languages
  • Foundation-aligned: ✓ - Fully compliant with all principles. Adds local parsing capability without external dependencies or cloud services
  • Problem-fit: 4/5 - Directly improves native engine coverage and parsing performance for Erlang codebases
  • Breaking: No - Purely additive, no existing behavior changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Phase-by-Phase Analysis

Phase 2: Configuration Changes ✓

No concerning rule relaxations found. All changes are additive:

  • Cargo.toml: adds tree-sitter-erlang dependency
  • No timeout increases, test threshold reductions, or linting rule disables

Phase 3: Test Integrity ✓

Strong test coverage across both native and WASM engines:

  • Native Rust: 7 new extractor tests in erlang.rs:324-396 covering module, function, record, import, call, include, and multi-clause deduplication
  • WASM JS: 5 tests in tests/parsers/erlang.test.ts verifying equivalent extraction
  • Resolution benchmark: Hand-annotated expected edges in tests/benchmarks/resolution/fixtures/erlang/expected-edges.json
  • Author claims 191/191 cargo tests and 5/5 vitest tests passing

Phase 4: Code Quality Analysis ✅

Architecture Adherence: Perfect consistency with existing patterns

  • Follows identical structure to other language extractors (walk_tree + walk_ast_nodes_with_config)
  • Proper registration in parser_registry.rs with .erl/.hrl extensions
  • Native/WASM AST config sync: ERLANG_AST_CONFIG mirrors JS ERLANG_AST_TYPES

Symbol Extraction Coverage: Comprehensive Erlang language support

  • Modules: -module(name) → kind: "module"
  • Functions: Multi-clause deduplication prevents duplicate definitions (lines 135-142)
  • Records: Field extraction with children (lines 62-77)
  • Types: -type and -opaque declarations → kind: "type"
  • Macros: -define(NAME, value) → kind: "variable" (appropriate choice)
  • Imports: -import and -include/-include_lib directives
  • Calls: Local and remote calls (module:function syntax) with proper receiver handling

Implementation Quality:

  • Robust error handling with early returns on missing nodes
  • Memory safety through proper tree-sitter node handling
  • Legacy grammar compatibility (lines 276-287) handles different tree-sitter-erlang versions
  • Consistent use of helper functions (find_child, node_text, start_line)

Engine Parity: Excellent synchronization

  • Native ERLANG_AST_CONFIG exactly matches JS ERLANG_AST_TYPES (both: string literals only)
  • Extension registration sync: .erl/.hrl in both NATIVE_SUPPORTED_EXTENSIONS and LANGUAGE_REGISTRY
  • Test removes Erlang from WASM-only drop list (tests/parsers/native-drop-classification.test.ts:80-81)

Minor Observations:

  • Function deduplication logic (lines 135-142) is sound but could benefit from a comment explaining multi-clause Erlang functions
  • String literal handling uses double quotes only, appropriate for Erlang
  • Call extraction handles both legacy and modern tree-sitter-erlang grammar shapes

Phase 5: Final Assessment

No Critical Issues Identified

This is a well-executed language port that follows all established patterns and provides comprehensive Erlang support. The dual-engine architecture requires perfect parity between native and WASM implementations, and this PR achieves that through:

  1. Comprehensive symbol extraction mirroring the WASM implementation
  2. Proper AST configuration synchronization
  3. Thorough test coverage across both engines
  4. Clean integration with existing infrastructure

The implementation quality is high, following Rust best practices and the codebase's established patterns. Engine parity compliance is excellent.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that properly extends native engine coverage to Erlang. Follows all established patterns, includes comprehensive tests, and maintains perfect engine parity. This is exactly the kind of incremental language support addition the codebase is designed for.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the existing WASM-based Erlang extractor to a native Rust implementation (tree-sitter-erlang v0.16), bringing Erlang to parity with other natively-supported languages. It also backports several correctness fixes to the JS extractor so both engines produce consistent output.

  • Adds crates/codegraph-core/src/extractors/erlang.rs with full symbol extraction: modules, functions (deduped by name+arity), records, types/opaques, macros, includes, imports, and calls — mirroring the JS extractor's contract.
  • Registers .erl/.hrl with LanguageKind::Erlang in the file collector, parser registry, and native-supported-extensions set, removing Erlang from the WASM-only drop list.
  • Fixes shared in both engines: arity-aware deduplication replacing name-only deduplication, namedChild iteration for correct complex-pattern arity counting, childForFieldName preference for robustness, include/include_lib kind distinction, and lowercase parametric macro name extraction.

Confidence Score: 5/5

Safe to merge — the Erlang extractor is a new path that does not touch any existing extraction logic, and the JS extractor changes are correctness-only fixes validated by the full test suite.

All issues flagged in earlier review rounds have been addressed and covered with new tests. The Rust extractor mirrors the JS engine contract, infrastructure wiring follows the exact pattern of every prior language addition, and the test suite exercises all key edge cases. No remaining defects were identified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/codegraph-core/src/extractors/erlang.rs | New 573-line Rust extractor; covers module/function/record/type/macro/include/import/call extraction with arity-aware dedup, complex-pattern arity counting via named-child iteration, and field-name-preferred node lookup. All 14 unit tests pass including edge cases added in this PR. |
| src/extractors/erlang.ts | JS extractor updated to mirror Rust behavior: arity-based dedup, namedChild iteration for params, childForFieldName preference, include_lib kind distinction, macro_lhs atom-before-var preference. Changes are a strict correctness upgrade. |
| crates/codegraph-core/src/parser_registry.rs | Adds Erlang variant to LanguageKind enum, extension mappings (.erl/.hrl), language name mapping, tree-sitter language binding, and includes it in all() and the exhaustiveness test (counter bumped to 29). |
| crates/codegraph-core/src/file_collector.rs | Adds erl and hrl to SUPPORTED_EXTENSIONS so Erlang header files are collected by the native file scanner. |
| tests/parsers/erlang.test.ts | Adds 10 new test cases exercising distinct-arity dedup, complex-pattern arity, -type/-opaque, macros (parametric uppercase/lowercase), and include/include_lib kind distinction. |

Reviews (12): Last reviewed commit: "fix(extractors): mirror Rust handle_reco..."

Comment on lines +135 to +142
// Don't duplicate if we already have this function
if symbols
    .definitions
    .iter()
    .any(|d| d.name == name && d.kind == "function")
{
    return;
}
Contributor

P1 Deduplication ignores arity, silently drops overloaded functions

The guard checks only d.name == name, so when a module defines foo/1 and foo/2 as two separate fun_decl nodes, the second fun_decl is processed, hits the check, finds the already-registered "foo" entry, and returns early. Only the first-encountered arity survives in definitions. Erlang's overloading by arity is idiomatic and common, so this will silently omit real definitions in virtually every non-trivial module. The fix is to compute arity before the guard and include it in the comparison, for example by counting expr_args children upfront and matching against d.children.as_ref().map_or(0, |c| c.len()).

Contributor Author

Fixed in 4c3c13f — dedup now compares (name, arity) so all clauses (foo/1, foo/2, ...) survive. Added a test covering three arities for the same name.
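
For illustration, the (name, arity) guard can be sketched as follows. This is a minimal sketch, not the code from the commit: the `Definition` struct and `push_function` helper are hypothetical simplifications, and arity is assumed to be the number of recorded parameters.

```rust
/// Hypothetical, simplified stand-in for the extractor's definition record.
#[derive(Debug, Clone, PartialEq)]
pub struct Definition {
    pub name: String,
    pub kind: &'static str,
    /// One entry per parameter, so `params.len()` is the arity.
    pub params: Vec<String>,
}

/// Record a function unless a definition with the same name AND arity exists.
/// Returns true when the definition was added.
pub fn push_function(defs: &mut Vec<Definition>, name: &str, params: Vec<String>) -> bool {
    let arity = params.len();
    let already_seen = defs
        .iter()
        .any(|d| d.kind == "function" && d.name == name && d.params.len() == arity);
    if already_seen {
        // A later clause of the same function (same name/arity): skip it.
        return false;
    }
    defs.push(Definition {
        name: name.to_string(),
        kind: "function",
        params,
    });
    true
}
```

With this guard, a second clause of foo/1 is skipped, while foo/2 is kept as a separate definition.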

Comment on lines +170 to +182
for i in 0..args_node.child_count() {
    let child = match args_node.child(i) {
        Some(c) => c,
        None => continue,
    };
    if child.kind() == "var" || child.kind() == "atom" {
        params.push(child_def(
            node_text(&child, source).to_string(),
            "parameter",
            start_line(&child),
        ));
    }
}
Contributor

P2 Parameters only capture bare var and atom patterns; complex pattern-match arguments (tuples {ok, X}, lists [H|T], binaries <<Bin>>) are silently skipped. For Erlang functions that rely on pattern-matching to distinguish clauses, the extracted param list will be shorter than the actual arity, which will interact badly with any arity-based deduplication fix. Consider counting all non-punctuation children of expr_args to get a stable arity count independent of pattern complexity.

Suggested change
- for i in 0..args_node.child_count() {
-     let child = match args_node.child(i) {
-         Some(c) => c,
-         None => continue,
-     };
-     if child.kind() == "var" || child.kind() == "atom" {
-         params.push(child_def(
-             node_text(&child, source).to_string(),
-             "parameter",
-             start_line(&child),
-         ));
-     }
- }
+ for i in 0..args_node.child_count() {
+     let child = match args_node.child(i) {
+         Some(c) => c,
+         None => continue,
+     };
+     // Skip punctuation so every argument pattern counts as one parameter.
+     if matches!(child.kind(), "," | "(" | ")") {
+         continue;
+     }
+     let label = if child.kind() == "var" || child.kind() == "atom" {
+         node_text(&child, source).to_string()
+     } else {
+         format!("_{}", i) // placeholder for complex patterns
+     };
+     params.push(child_def(label, "parameter", start_line(&child)));
+ }

Contributor Author

Fixed in 4c3c13f — now iterates named children so every argument pattern (tuple, list, binary) counts as one parameter. Complex patterns get a positional placeholder label so arity is preserved. Added a test for {ok, X} / [H|T] arguments.
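
A minimal sketch of the one-label-per-argument idea, under stated assumptions: `ArgNode` is a hypothetical stand-in for a named child of the argument list, the kind names other than `var` and `atom` are illustrative, and the `_<index>` placeholder format is an assumption rather than the format used in the commit.

```rust
/// Hypothetical stand-in for one named child of the argument list.
pub struct ArgNode<'a> {
    pub kind: &'a str,
    pub text: &'a str,
}

/// Produce exactly one label per argument pattern so that the number of
/// labels equals the clause's arity, regardless of pattern complexity.
pub fn param_labels(args: &[ArgNode<'_>]) -> Vec<String> {
    args.iter()
        .enumerate()
        .map(|(position, arg)| match arg.kind {
            // Simple patterns keep their source text as the label.
            "var" | "atom" => arg.text.to_string(),
            // Tuples, lists, binaries, etc. get a positional placeholder.
            _ => format!("_{}", position),
        })
        .collect()
}
```

Because only named children are passed in, punctuation never inflates the count, which keeps the arity stable for the (name, arity) dedup.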

Comment on lines +37 to +42
fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
    // module_attribute: - module ( atom ) .
    let name_node = match find_child(node, "atom") {
        Some(n) => n,
        None => return,
    };
Contributor

P2 handle_module_attr uses find_child(node, "atom") which returns the first atom child. In tree-sitter-erlang v0.16 the -module(name). attribute produces a module_attribute whose first child is often the anonymous keyword token module, but if that keyword is represented as a named atom node (grammar-version dependent), the extracted name would be the literal string "module" rather than the actual module name. The existing test passes on the current grammar version, but it is fragile. Using a field name lookup — node.child_by_field_name("name") — is more robust if the grammar exposes it.

Suggested change
- fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
-     // module_attribute: - module ( atom ) .
-     let name_node = match find_child(node, "atom") {
-         Some(n) => n,
-         None => return,
-     };
+ fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
+     // module_attribute: - module ( atom ) .
+     // Prefer the named field to avoid accidentally picking up the `module`
+     // keyword itself if the grammar represents it as an atom node.
+     let name_node = node
+         .child_by_field_name("name")
+         .or_else(|| find_child(node, "atom"))
+         .filter(|n| node_text(n, source) != "module");
+     let name_node = match name_node {
+         Some(n) => n,
+         None => return,
+     };

Contributor Author

Fixed in 4c3c13f — handle_module_attr now prefers child_by_field_name("name") and only falls back to find_child(node, "atom") if the grammar doesn't expose the field. handle_function_clause and extract_params use the same approach for the function-clause name and args. The behavior matches the current grammar's tests and is robust to future grammar revisions.
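
The field-first lookup with a kind-based fallback can be sketched without the tree-sitter crate. `MiniNode` below is a hypothetical miniature node type whose two methods only imitate the shape of `child_by_field_name` and `find_child`; it is an illustration of the lookup order, not the extractor's code.

```rust
/// Hypothetical miniature syntax node used only for this illustration.
pub struct MiniNode {
    pub kind: &'static str,
    pub text: &'static str,
    /// Field name under which the parent exposes this node, if any.
    pub field: Option<&'static str>,
    pub children: Vec<MiniNode>,
}

impl MiniNode {
    /// First child exposed under the given field name.
    pub fn child_by_field_name(&self, field: &str) -> Option<&MiniNode> {
        self.children
            .iter()
            .find(|c| matches!(c.field, Some(f) if f == field))
    }

    /// First child of the given kind, in source order.
    pub fn find_child(&self, kind: &str) -> Option<&MiniNode> {
        self.children.iter().find(|c| c.kind == kind)
    }
}

/// Prefer the `name` field; fall back to the first `atom` child only when
/// the grammar does not expose that field.
pub fn module_name(attr: &MiniNode) -> Option<&'static str> {
    attr.child_by_field_name("name")
        .or_else(|| attr.find_child("atom"))
        .map(|n| n.text)
}
```

The fallback keeps the lookup working on grammar versions without the field, while the field lookup avoids returning a keyword that happens to be parsed as an atom.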

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

43 functions changed → 23 callers affected across 3 files

  • ErlangExtractor.extract in crates/codegraph-core/src/extractors/erlang.rs:9 (0 transitive callers)
  • match_erlang_node in crates/codegraph-core/src/extractors/erlang.rs:17 (0 transitive callers)
  • handle_module_attr in crates/codegraph-core/src/extractors/erlang.rs:37 (1 transitive callers)
  • handle_record_decl in crates/codegraph-core/src/extractors/erlang.rs:62 (1 transitive callers)
  • handle_type_alias in crates/codegraph-core/src/extractors/erlang.rs:103 (1 transitive callers)
  • handle_fun_decl in crates/codegraph-core/src/extractors/erlang.rs:129 (1 transitive callers)
  • handle_function_clause in crates/codegraph-core/src/extractors/erlang.rs:139 (2 transitive callers)
  • extract_params in crates/codegraph-core/src/extractors/erlang.rs:181 (3 transitive callers)
  • handle_define in crates/codegraph-core/src/extractors/erlang.rs:210 (1 transitive callers)
  • handle_include in crates/codegraph-core/src/extractors/erlang.rs:250 (1 transitive callers)
  • handle_import_attr in crates/codegraph-core/src/extractors/erlang.rs:274 (1 transitive callers)
  • handle_call in crates/codegraph-core/src/extractors/erlang.rs:305 (1 transitive callers)
  • parse_erlang in crates/codegraph-core/src/extractors/erlang.rs:364 (15 transitive callers)
  • extracts_module_declaration in crates/codegraph-core/src/extractors/erlang.rs:374 (0 transitive callers)
  • extracts_function_definition in crates/codegraph-core/src/extractors/erlang.rs:385 (0 transitive callers)
  • extracts_record_definition in crates/codegraph-core/src/extractors/erlang.rs:396 (0 transitive callers)
  • extracts_import_attribute in crates/codegraph-core/src/extractors/erlang.rs:411 (0 transitive callers)
  • extracts_function_calls in crates/codegraph-core/src/extractors/erlang.rs:421 (0 transitive callers)
  • extracts_include_directive in crates/codegraph-core/src/extractors/erlang.rs:427 (0 transitive callers)
  • extracts_include_lib_directive in crates/codegraph-core/src/extractors/erlang.rs:440 (0 transitive callers)

…1103)

- Dedupe Erlang function defs by (name, arity) so foo/1 and foo/2 are
  both kept
- Count every argument pattern (tuple, list, binary) as one parameter
  via named children, using placeholder labels for complex patterns
- Prefer the named 'name'/'args' fields for module attributes and clause
  args, falling back to the previous atom/expr_args lookups
- Add Rust and TS tests covering multi-arity overloads and complex
  pattern args
…-field fallback (#1103)

- Rust handle_record_decl now prefers child_by_field_name("name")
  before falling back to find_child(atom), matching the other Erlang
  handlers and avoiding accidental keyword pickup if the grammar
  exposes 'record' as a named atom.
- TypeScript handleTypeAlias now mirrors the Rust type_name->atom
  fallback so the two engines agree when the grammar wraps the alias
  name in a type_name node.
@carlos-alm
Contributor Author

Addressed Greptile review-summary issues from review 4425068434:

  • handle_call in crates/codegraph-core/src/extractors/erlang.rs now uses named_child(0) instead of child(0), consistent with the rest of the file and robust against grammar revisions that insert anonymous leading tokens.
  • handleTypeAlias in src/extractors/erlang.ts no longer calls findChild(node, 'type_name') twice; the result is extracted into a local. The JS handleCall was updated to namedChild(0) in lockstep with the Rust change so engine parity is preserved.

Merged in origin/main and resolved conflicts in file_collector.rs, parser_registry.rs, and native-drop-classification.test.ts (Julia/Cuda landed in main while Erlang was in flight — all three are now natively supported).

@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Addressed Greptile follow-up from review 4425068434 (last update 2026-05-13T09:56:11Z):

  • Added Rust unit tests for handle_type_alias (simple alias and -opaque) and handle_define in crates/codegraph-core/src/extractors/erlang.rs — 12 total Erlang tests now (was 9).
  • Added matching JS tests in tests/parsers/erlang.test.ts for -type, -opaque, and -define — 10 total JS Erlang tests (was 7). Both engines now lock in the type-alias / type_name fallback and macro extraction shapes.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Addressed Greptile follow-up from review 4425068434 (last update 2026-05-13T10:32:26Z):

  • handle_define (Rust) and handleDefine (JS) now prefer atom over var when extracting the macro name from macro_lhs. For -define(foo(X), X+1) the previous code returned the argument variable X instead of the macro name foo; the suggested fix is applied identically on both engines so they agree on parametric macro extraction.
  • Added regression tests for both shapes — extracts_uppercase_parametric_macro_name and extracts_lowercase_parametric_macro_name on the Rust side, and matching tests on the JS side. Rust Erlang test count is now 14; JS Erlang test count is now 12.
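
The atom-before-var preference can be sketched as below. This is a simplified illustration: it assumes the children of `macro_lhs` are available as a flat list of hypothetical `(kind, text)` pairs in source order, which may not match the real node shape, and `macro_name` is an assumed helper name.

```rust
/// Pick the macro name from the children of a parametric macro head.
/// `children` holds (kind, text) pairs in source order (assumed shape).
pub fn macro_name<'a>(children: &[(&str, &'a str)]) -> Option<&'a str> {
    // Return the text of the first child with the wanted kind, if any.
    let first_of = |wanted: &str| {
        children
            .iter()
            .find(|(kind, _)| *kind == wanted)
            .map(|(_, text)| *text)
    };
    // A lowercase macro name parses as an atom, so look for an atom first;
    // only uppercase names (parsed as var) need the var fallback.
    first_of("atom").or_else(|| first_of("var"))
}
```

For -define(foo(X), X+1) this yields foo rather than the argument X, and for an uppercase head such as FOO(X) the first var, FOO, is still returned.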

@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Addressed Greptile follow-up from review 4425068434 (last update 2026-05-13T10:35:07Z):

  • handle_include (Rust) and handleInclude (JS) now record the import kind based on the node type (pp_include → include, pp_include_lib → include_lib) so downstream consumers can pick the correct path-resolution strategy (local file vs OTP app code path). Both engines are updated symmetrically.
  • Added regression tests on both sides asserting the names payload now carries the correct kind. Rust Erlang test count is now 15; JS Erlang test count is now 14.
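
The node-type-to-kind mapping amounts to a small match. In this minimal sketch the node kind strings come from the comment above, while the `include_kind` helper name and its `Option` return type are assumptions for illustration.

```rust
/// Map a preprocessor include node kind to the recorded import kind.
/// Returns None for node kinds that are not include directives.
pub fn include_kind(node_kind: &str) -> Option<&'static str> {
    match node_kind {
        // -include("local.hrl"): resolved relative to the including file.
        "pp_include" => Some("include"),
        // -include_lib("app/include/x.hrl"): resolved via the OTP app code path.
        "pp_include_lib" => Some("include_lib"),
        _ => None,
    }
}
```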

Note: this round's Greptile review was generated against commit 4c97a2f (the test-coverage commit). Commit 16d6b33 (parametric macro_lhs fix) is already pushed and should be picked up on the next review pass.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Addressed Greptile follow-up from the last summary review (1e9971e):

  • src/extractors/erlang.ts handleRecordDecl now prefers childForFieldName('name') and falls back to findChild(node, 'atom'), mirroring the Rust handle_record_decl defensive pattern.
  • Rewrote the block comment on handle_define in crates/codegraph-core/src/extractors/erlang.rs so it scopes the atom-before-var ordering explicitly to the inner macro_lhs branch, and notes that the outer var-first ordering is correct for non-parametric macros.

Also merged origin/main (Solidity #1100) — resolved conflicts in Cargo.toml, helpers.rs, mod.rs, file_collector.rs, parser_registry.rs (Erlang and Solidity now coexist, EXPECTED_LEN bumped to 29), src/ast-analysis/rules/index.ts, src/domain/parser.ts, and tests/parsers/native-drop-classification.test.ts.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 7eb7482 into main May 13, 2026
27 checks passed
@carlos-alm carlos-alm deleted the feat/1071-erlang-rust-extractor branch May 13, 2026 12:08
@github-actions github-actions Bot locked and limited conversation to collaborators May 13, 2026

Development

Successfully merging this pull request may close these issues.

Rust engine parity: port the 11 remaining JS-only language extractors