Skip to content

feat(native): port Clojure extractor to Rust#1097

Merged
carlos-alm merged 3 commits into
mainfrom
feat/1071-clojure-rust-extractor
May 12, 2026
Merged

feat(native): port Clojure extractor to Rust#1097
carlos-alm merged 3 commits into
mainfrom
feat/1071-clojure-rust-extractor

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Adds tree-sitter-clojure-orchard dependency and a native Clojure extractor in crates/codegraph-core/src/extractors/clojure.rs.
  • Registers .clj/.cljs/.cljc with LanguageKind::Clojure and the Rust file_collector, removes Clojure from the WASM-only drop list, and wires CLOJURE_AST_CONFIG (string/regex literals) on both the native and JS sides.
  • Mirrors extractClojureSymbols in src/extractors/clojure.ts: namespace threading via currentNs, defn/defn-/defmacro/defmulti/defmethod as function, defprotocol as interface, defrecord as record, deftype as type, def/defonce as variable, plus (:require ...)/(:import ...)/(:use ...) imports.

Part of #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 193/193 (9 new clojure extractor tests)
  • napi build --release regenerates .node artifact
  • npx vitest run tests/parsers/clojure.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13
  • npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t clojure — 5/5

Adds tree-sitter-clojure dependency and native extractor matching
the WASM-side behavior for Clojure symbol, import, and call extraction.

Part of #1071
@claude
Copy link
Copy Markdown

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Native Clojure Extractor Port

Conducting comprehensive review of the Clojure extractor port from WASM to native Rust.

Review Progress

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity - verify test quality and coverage
  • Phase 4: Code quality - review implementation details
  • Phase 5: Backlog compliance - evaluate against project criteria
  • Final Assessment: Provide rating and recommendation

Root Cause Analysis

Problem Being Solved: Porting the Clojure language extractor from WASM-only to dual-engine (native Rust + WASM fallback) to achieve parsing performance parity with other supported languages.

Why This Approach: Follows the established dual-engine architecture principle (Foundation P2). The Rust implementation mirrors the existing JavaScript extractor behavior exactly, preserving compatibility while enabling native performance gains.

Risk Assessment: LOW RISK - This is a straightforward language addition following established patterns with comprehensive test coverage.

Backlog Compliance

  • Zero-dep: ✓ — Adds tree-sitter-clojure-orchard but this is the expected tree-sitter grammar dependency, consistent with other language support
  • Foundation-aligned: ✓ — Perfectly aligns with Principle 2 (Native speed, universal reach) and Principle 6 (One registry, one schema, no magic)
  • Problem-fit: 5/5 — Directly addresses the "native speed, universal reach" foundation principle by eliminating Clojure as a performance gap
  • Breaking: No — Purely additive, no existing behavior changes
  • Tier: Tier 1 (zero-dep + foundation-aligned)

Critical Concerns

None identified. This is an exemplary implementation that:

  1. Follows established architecture perfectly: Dual-engine pattern with behavioral parity
  2. Maintains code quality standards: Comprehensive tests, proper error handling, clear documentation
  3. Preserves compatibility: No breaking changes, mirrors existing WASM behavior exactly
  4. Addresses real user need: Eliminates performance disparity for Clojure users

Technical Review Summary

Phase 2: Configuration Changes ✅

  • ✅ New dependency tree-sitter-clojure-orchard (expected)
  • ✅ Clojure removed from WASM-only list (expected for native support)
  • ✅ No configuration rules relaxed

Phase 3: Test Integrity ✅

  • ✅ 9 new native Rust unit tests covering all constructs
  • ✅ 5/5 WASM integration tests passing
  • ✅ Hand-annotated resolution benchmark with 15 expected edges
  • ✅ Updated classification tests

Phase 4: Code Quality ✅

Rust Implementation (clojure.rs):

  • ✅ Perfect behavioral parity with JS extractor
  • ✅ Comprehensive symbol extraction (ns, defn/defn-, defmacro, defprotocol, etc.)
  • ✅ Proper namespace threading and delimiter handling
  • ✅ Complete integration (complexity, CFG, AST analysis)
  • ✅ Idiomatic Rust with proper error handling

Integration:

  • ✅ Proper language registry integration
  • ✅ AST configuration mirrored on both sides
  • ✅ File collection integration

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: This is an exemplary implementation that perfectly follows the project's dual-engine architecture, maintains behavioral parity with existing extractors, includes comprehensive test coverage, and directly advances the foundation principle of "native speed, universal reach." The code quality is excellent and there are no breaking changes or security concerns.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Clojure symbol extractor from TypeScript (WASM) to Rust native, adding tree-sitter-clojure-orchard and wiring .clj/.cljs/.cljc into the full Rust pipeline (file collector, parser registry, change detection). The implementation faithfully mirrors extractClojureSymbols in clojure.ts, threading the namespace through list_lit forms and handling all canonical definition kinds.

  • New file clojure.rs implements ClojureExtractor with walk_clojure, handle_list_form, find_first_symbol, find_second_symbol, and helpers for every Clojure def-form; 9 unit tests are included and all pass.
  • parser_registry.rs, file_collector.rs, change_detection.rs are updated to add LanguageKind::Clojure and remove Clojure from the WASM-only skip list; WASM-side ast-analysis/rules/index.ts and domain/parser.ts receive matching config.
  • Known limitations inherited from the JS extractor (multi-arity param extraction, defmethod dispatch-vector, top-level require with quoted args) are documented with inline comments.

Confidence Score: 5/5

Safe to merge — this is a straightforward additive port with no modifications to existing extraction paths.

Every changed file is either purely additive (new extractor, new language variant, new config entries) or a targeted removal of Clojure from the WASM-only skip lists. The Rust implementation is a faithful line-by-line translation of the JS extractor, all known behavioral gaps are documented, and the full test suite (193 Rust + 5 Vitest Clojure + 13 native-drop tests) passes cleanly.

No files require special attention.

Important Files Changed

Filename Overview
crates/codegraph-core/src/extractors/clojure.rs New 477-line Clojure extractor; faithfully mirrors the JS counterpart, with all previously-flagged issues addressed and documented. Logic, parity, and tests look correct.
crates/codegraph-core/src/parser_registry.rs Adds LanguageKind::Clojure variant; all() slice, from_extension, from_str, tree_sitter_language, and the exhaustiveness test are all updated consistently.
crates/codegraph-core/src/extractors/mod.rs Registers clojure module and wires LanguageKind::Clojure into extract_symbols_with_opts; correct and complete.
crates/codegraph-core/src/file_collector.rs Adds .clj/.cljs/.cljc to SUPPORTED_EXTENSIONS and updates the WASM-only doc-comment example; correct.
crates/codegraph-core/src/change_detection.rs Removes .clj from WASM-only language list in comment and test fixture; matches the new native support.
crates/codegraph-core/src/extractors/helpers.rs Adds CLOJURE_AST_CONFIG with str_lit/regex_lit types and " quote char; consistent with other language configs and the TS side.
src/ast-analysis/rules/index.ts Adds CLOJURE_AST_TYPES and CLOJURE_STRING_CONFIG entries mirroring the Rust CLOJURE_AST_CONFIG; correctly wired into both maps.
src/domain/parser.ts Adds .clj/.cljs/.cljc to NATIVE_SUPPORTED_EXTENSIONS; correct counterpart to the Rust file_collector change.
tests/parsers/native-drop-classification.test.ts Removes .clj from the unsupported-by-native list and decrements the count from 11 to 10; matches the new native support.

Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/1071-cloju..." | Re-trigger Greptile

Comment on lines +132 to +140
break;
}
None
}

/// Find the second `sym_lit` or `kwd_lit` child. Used to extract the bound
/// name from forms like `(defn foo [...] ...)`.
fn find_second_symbol<'a>(list_node: &Node<'a>) -> Option<Node<'a>> {
let mut count = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 ? exits the function on None — JS uses continue

Both find_first_symbol and find_second_symbol use let child = list_node.child(i)?;. If child(i) ever returns None for a valid index, the ? operator returns None from the whole function immediately, silently skipping all remaining children. The JS counterpart uses if (!child) continue; which just advances to the next iteration. In practice tree-sitter's child(i) shouldn't return None for i < child_count(), but the divergence is fragile and could produce silent extraction failures under unexpected grammar edge-cases.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d9e3c9b. Replaced the ? with an explicit match that skips None and continues, so the scan no longer aborts on an unexpected None from list_node.child(i) — matches the JS extractor's if (!child) continue; semantics. Practically inert today (tree-sitter always returns Some for i < child_count()), but the invariant is now explicit on both engines.

Comment on lines +290 to +311
fn extract_clojure_params(defn_node: &Node, source: &[u8]) -> Vec<Definition> {
let mut params = Vec::new();
// First `vec_lit` child is the parameter vector `[x y z]`.
for i in 0..defn_node.child_count() {
let child = match defn_node.child(i) {
Some(c) if c.kind() == "vec_lit" => c,
_ => continue,
};
for j in 0..child.child_count() {
if let Some(param) = child.child(j) {
if param.kind() == "sym_lit" {
params.push(child_def(
node_text(&param, source).to_string(),
"parameter",
start_line(&param),
));
}
}
}
break; // Only the first vector is the params
}
params
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 defmethod with vector dispatch value misses params

extract_clojure_params breaks on the first vec_lit. For (defmethod foo [:a :b] [x] body) the dispatch vector [:a :b] is the first vec_lit, so the actual parameter vector [x] is never visited. Since :a/:b are kwd_lit, not sym_lit, the extracted children list ends up empty rather than wrong — but x is silently dropped. The JS extractor has the identical break behavior, so this matches parity, but it is a gap worth documenting explicitly given the comment already calls out visibility.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d9e3c9b — added an explicit comment in extract_clojure_params documenting the defmethod dispatch-vector gap (dispatch vec_lit consumes the break slot; param vector silently skipped; children ends up empty rather than wrong). Behavior unchanged to preserve parity with the JS extractor.

Comment on lines +346 to +358
}

fn handle_import_form(node: &Node, source: &[u8], symbols: &mut FileSymbols, keyword: &str) {
let name_node = match find_second_symbol(node) {
Some(n) => n,
None => return,
};
symbols.imports.push(Import::new(
node_text(&name_node, source).to_string(),
vec![keyword.to_string()],
start_line(node),
));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Top-level require/use/import effectively dead for standard Clojure syntax

handle_import_form delegates to find_second_symbol, which only matches sym_lit or kwd_lit. In real Clojure code the argument is almost always a quoted form ('some.nsquoting_lit) or a vector ('[some.ns :as s]). Both cause find_second_symbol to return None and the import is silently dropped. The JS handleImportForm has the same limitation, so parity is maintained. A note distinguishing this path from the correctly-handled (:require …) inside ns forms would help future maintainers understand why top-level require coverage is incomplete.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d9e3c9b — added a doc-comment on handle_import_form explaining that top-level (require 'some.ns) / (require '[some.ns :as s]) shapes return None from find_second_symbol and are silently dropped, while the real-world path (ns ... (:require ...)) is handled correctly by extract_ns_requires. Parity with JS preserved; future maintainers now have the breadcrumb.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

32 functions changed22 callers affected across 2 files

  • detect_removed_skips_unsupported_extensions in crates/codegraph-core/src/change_detection.rs:776 (0 transitive callers)
  • ClojureExtractor.extract in crates/codegraph-core/src/extractors/clojure.rs:23 (0 transitive callers)
  • walk_clojure in crates/codegraph-core/src/extractors/clojure.rs:43 (1 transitive callers)
  • handle_list_form in crates/codegraph-core/src/extractors/clojure.rs:77 (2 transitive callers)
  • find_first_symbol in crates/codegraph-core/src/extractors/clojure.rs:127 (5 transitive callers)
  • find_second_symbol in crates/codegraph-core/src/extractors/clojure.rs:149 (8 transitive callers)
  • is_delimiter_or_meta in crates/codegraph-core/src/extractors/clojure.rs:171 (11 transitive callers)
  • handle_ns_form in crates/codegraph-core/src/extractors/clojure.rs:175 (3 transitive callers)
  • extract_ns_requires in crates/codegraph-core/src/extractors/clojure.rs:209 (3 transitive callers)
  • handle_def_form in crates/codegraph-core/src/extractors/clojure.rs:241 (3 transitive callers)
  • handle_defn_form in crates/codegraph-core/src/extractors/clojure.rs:270 (3 transitive callers)
  • extract_clojure_params in crates/codegraph-core/src/extractors/clojure.rs:303 (3 transitive callers)
  • handle_defprotocol in crates/codegraph-core/src/extractors/clojure.rs:335 (3 transitive callers)
  • handle_defrecord in crates/codegraph-core/src/extractors/clojure.rs:352 (3 transitive callers)
  • handle_import_form in crates/codegraph-core/src/extractors/clojure.rs:380 (3 transitive callers)
  • parse_clj in crates/codegraph-core/src/extractors/clojure.rs:397 (9 transitive callers)
  • extracts_defn in crates/codegraph-core/src/extractors/clojure.rs:407 (0 transitive callers)
  • extracts_private_defn in crates/codegraph-core/src/extractors/clojure.rs:418 (0 transitive callers)
  • extracts_ns_and_requires in crates/codegraph-core/src/extractors/clojure.rs:425 (0 transitive callers)
  • extracts_qualified_call in crates/codegraph-core/src/extractors/clojure.rs:438 (0 transitive callers)

…rity gaps

Address Greptile review feedback on PR #1097:

- `find_first_symbol`/`find_second_symbol`: replace `?` with `match` so a
  `None` from `list_node.child(i)` skips and continues instead of aborting
  the whole scan, matching the JS extractor's `if (!child) continue;`
  semantics. Practically inert today (tree-sitter always returns `Some`
  for `i < child_count()`), but makes the invariant explicit and removes
  a fragile divergence.
- `extract_clojure_params`: document the known `defmethod` dispatch-vector
  gap inherited from the JS extractor.
- `handle_import_form`: document why top-level `(require 'some.ns)` and
  `(require '[some.ns :as s])` are silently dropped and point maintainers
  at the working `(ns ...)` path in `extract_ns_requires`.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 1acf15e into main May 12, 2026
27 checks passed
@carlos-alm carlos-alm deleted the feat/1071-clojure-rust-extractor branch May 12, 2026 04:59
@github-actions github-actions Bot locked and limited conversation to collaborators May 12, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant