feat(native): port Objective-C extractor to Rust#1106
Conversation
Mirrors `src/extractors/objc.ts` in `crates/codegraph-core/src/extractors/objc.rs`. Adds the `tree-sitter-objc` dependency, wires `LanguageKind::ObjC` (`.m`) in the Rust `parser_registry` and `file_collector`, adds `.m` to `NATIVE_SUPPORTED_EXTENSIONS` on the JS side, and registers `OBJC_AST_TYPES` / `OBJC_AST_CONFIG` so the native and WASM engines extract identical `ast_nodes` for Objective-C source. Handles class interfaces / implementations (with `: Superclass`), protocols, instance and class method declarations/definitions with multi-part selectors assembled from leading identifiers and `method_parameter` children, C-level function declarations/definitions, `#import`/`@import` imports, and message expression call extraction.
|
Claude finished @carlos-alm's task —— View job PR Review: feat(native): port Objective-C extractor to RustTodo List:
Phase 1: Root Cause Analysis ✅ Problem Being Solved: Adding Objective-C language support to complete the dual-engine architecture. The native Rust engine lacks Objective-C extraction capabilities that exist in the WASM engine, creating a parser gap that violates the "both engines must produce identical results" principle. Why This Approach: The PR follows the established pattern for adding language support - adds tree-sitter grammar dependency, implements native Rust extractor mirroring existing WASM extractor ( Risk Assessment: Low risk - purely additive change following established patterns, comprehensive test coverage, no breaking changes to existing functionality. Phase 2: Configuration Change Detection ✅ CLEAN - No high-risk configuration changes detected:
Phase 3: Test Integrity ✅ HEALTHY - Test changes are appropriate:
Phase 4: Code Quality Assessment ✅ EXCELLENT - High-quality implementation: Architecture Consistency:
Implementation Quality:
Code Structure:
No Issues Detected: Security, performance, or maintainability concerns. Root Cause AnalysisProblem Being Solved: Engine parity violation - native Rust engine missing Objective-C support that exists in WASM engine Backlog Compliance
Critical ConcernsNone identified. This is a textbook example of proper language support addition. Final Recommendation
Excellent work - this PR exemplifies how language support should be added to codegraph. |
Codegraph Impact Analysis50 functions changed → 27 callers affected across 3 files
|
Greptile SummaryThis PR ports the Objective-C symbol extractor from a WASM/JS path to a native Rust
Confidence Score: 4/5The Rust extractor is solid and the JS fixes are correct, but three issues previously identified in src/extractors/objc.ts remain unresolved: multi-keyword method selectors are silently dropped, method parameters are never populated, and @Property members are always omitted from class children on the JS code path. The new Rust ObjCExtractor is well-implemented, thoroughly tested (9 unit tests), and correctly handles all v3 grammar node types. The JS extractor received the right fixes for message-expression selector assembly, C-call fallback, and module_import dispatch. However, buildSelector in the JS extractor still iterates keyword_selector wrapper nodes that don't exist in the v3 grammar, so any multi-keyword method like setName:age: is silently dropped from definitions on the JS path; extractMethodParams has the same structural mismatch; and collectClassMembers looks up property names via childForFieldName('name') which the grammar doesn't expose. src/extractors/objc.ts — buildSelector, extractMethodParams, and collectClassMembers (property lookup) each need to be updated to match the flat v3 grammar structure that the Rust extractor already implements correctly. Important Files Changed
Reviews (11): Last reviewed commit: "fix: resolve merge conflicts with main (..." | Re-trigger Greptile |
| "implementation_definition" => { | ||
| // Wraps a `method_definition` inside `class_implementation`. | ||
| if let Some(method) = find_child(&child, "method_definition") { | ||
| if let Some(sel) = build_selector(&method, source) { | ||
| members.push(child_def(sel, "method", start_line(&method))); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Dead-code branch:
implementation_definition inside collect_class_members
collect_class_members is only ever called from handle_class_interface, and class_interface nodes never contain implementation_definition children — those live exclusively inside class_implementation. This branch can never execute and the comment ("Wraps a method_definition inside class_implementation") even acknowledges the mismatch. If the intent was to also populate children for class implementations, collect_class_members would need to be called from handle_class_implementation as well.
There was a problem hiding this comment.
Fixed in de05222 — removed the unreachable implementation_definition branch from collect_class_members. Confirmed: collect_class_members is only invoked by handle_class_interface, so a class_implementation wrapper would never reach it. The Rust test suite still passes (9 ObjC tests).
| "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb", | ||
| "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt", | ||
| "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli", | ||
| "m", |
There was a problem hiding this comment.
.m extension collision with MATLAB/Octave
The .m extension is also the canonical extension for MATLAB and GNU Octave source files. Any MATLAB code in a repository will now be silently routed to the ObjC parser and produce garbled or empty symbol output without any error signal. If supporting MATLAB is a future goal, disambiguation will require heuristics (e.g., checking for ObjC-specific tokens like @interface vs MATLAB keywords). Worth at least documenting the known false-positive risk in a comment.
There was a problem hiding this comment.
Documented in de05222 — added a "Known extension collisions" note to SUPPORTED_EXTENSIONS in crates/codegraph-core/src/file_collector.rs calling out the .m overlap with MATLAB/Octave, the resulting silent garbled-symbol output for MATLAB files, and the disambiguation heuristics that would be needed to support MATLAB later. Disambiguation itself is out of scope for this PR — .m is unambiguously routed to the ObjC parser today.
| fn find_objc_parent_class(node: &Node, source: &[u8]) -> Option<String> { | ||
| let mut current = node.parent(); | ||
| while let Some(parent) = current { | ||
| match parent.kind() { | ||
| "class_interface" | ||
| | "class_implementation" | ||
| | "protocol_declaration" => { | ||
| let name_node = find_objc_decl_name(&parent)?; | ||
| let base = node_text(&name_node, source).to_string(); | ||
| // Categories: include `(Cat)` so methods are grouped per category. | ||
| if let Some(cat) = parent.child_by_field_name("category") { | ||
| return Some(format!("{}({})", base, node_text(&cat, source))); | ||
| } | ||
| return Some(base); | ||
| } | ||
| _ => {} | ||
| } | ||
| current = parent.parent(); | ||
| } | ||
| None |
There was a problem hiding this comment.
Category method naming diverges from the JS extractor
find_objc_parent_class includes the category suffix in the parent name (e.g. Foo(Cat).catMethod), but the JS findObjCParentClass returns only the bare class name (Foo.catMethod). The Rust behaviour is technically more precise — the JS version is arguably a bug — but the two engines will produce different definitions entries for category methods, which breaks the goal of identical symbol output stated in the PR description. Worth a deliberate decision: fix the JS side to match Rust, or document the intentional divergence.
There was a problem hiding this comment.
Aligned by updating the JS extractor in de05222 — handleClassInterface, handleClassImplementation, and findObjCParentClass now include the (Category) suffix when the grammar emits class_interface/class_implementation with a category field. Per the CLAUDE.md guidance, when engines diverge we fix the less-accurate side rather than document the gap; the qualified parent is the correct behaviour because two categories on the same class can declare same-named methods that need to stay distinguished in the symbol output.
| let module_node = node.child_by_field_name("path") | ||
| .or_else(|| find_child(node, "identifier")); |
There was a problem hiding this comment.
handle_at_import uses wrong field name "path"
The JS extractor looks up childForFieldName('module') for @import declarations, consistent with the grammar's module field on import_declaration / module_import. The Rust side tries child_by_field_name("path"), which is the field name used for #import (preprocessor includes). Since both fall back to the first identifier child, extraction still works — but the path fallback means the named-field fast-path will always miss, and the field name signals an incorrect assumption about the grammar structure.
| let module_node = node.child_by_field_name("path") | |
| .or_else(|| find_child(node, "identifier")); | |
| let module_node = node.child_by_field_name("module") | |
| .or_else(|| find_child(node, "identifier")); |
There was a problem hiding this comment.
Fixed in de05222 — switched handle_at_import to look up child_by_field_name("module") first, matching the JS extractor and the actual module_import grammar field. The identifier fallback is preserved so older grammar revisions still parse.
- Use `module` field (not `path`) for `@import` in Rust to mirror the JS extractor and match the tree-sitter-objc `module_import` grammar field. - Drop the unreachable `implementation_definition` branch from `collect_class_members` — it is only invoked from `handle_class_interface` and `class_interface` nodes do not contain `implementation_definition` children. - Qualify category methods with `(Category)` in the JS extractor so its output matches Rust for `@interface Foo (Cat)` / `@implementation Foo (Cat)` when the grammar emits `class_interface`/`class_implementation` rather than dedicated `category_interface` nodes. Two categories can declare same-named methods, so the qualified parent disambiguates the symbols. - Document the `.m` extension collision with MATLAB/Octave in the file collector since `.m` files are unconditionally routed to the ObjC parser.
| ]); | ||
| expect(totals['unsupported-by-native']).toBe(10); |
There was a problem hiding this comment.
Off-by-one assertion after removing
'src/k.m'
The diff removes 'src/k.m' from the input array (because .m is now natively supported and would no longer fall into the unsupported-by-native bucket), but the expected count was not updated. The array now contains 9 paths yet the assertion is toBe(10), so this test fails at runtime.
| ]); | |
| expect(totals['unsupported-by-native']).toBe(10); | |
| expect(totals['unsupported-by-native']).toBe(9); |
There was a problem hiding this comment.
Resolved by post-review merges with main (cfcd5a2 "fix: correct expected count after merge dropped two entries"). The current assertion is toBe(7) after subsequent merges with main removed two additional entries from the unsupported-by-native bucket. Verified locally — tests/parsers/native-drop-classification.test.ts is green.
The native-drop-classification test asserts the count of WASM-only languages. The Clojure PR (#1097) removed src/c.clj and decremented the count from 11 to 10. This PR removes src/k.m (now natively supported via ObjC). After the merge both entries are removed, so the assertion needs to be 9, not 10.
| } | ||
|
|
||
| if parts.is_empty() { | ||
| return None; | ||
| } | ||
| if has_params { | ||
| Some(format!("{}:", parts.join(":"))) | ||
| } else { | ||
| Some(parts.join(":")) | ||
| } | ||
| } | ||
|
|
||
| /// Build a message-expression selector by collecting all `identifier` | ||
| /// children annotated with the `method` field. | ||
| fn build_message_selector(message_node: &Node, source: &[u8]) -> String { | ||
| let mut parts: Vec<String> = Vec::new(); | ||
| let mut has_colon = false; | ||
| for i in 0..message_node.child_count() { | ||
| if let Some(child) = message_node.child(i) { | ||
| if let Some(field) = message_node.field_name_for_child(i as u32) { | ||
| if field == "method" { | ||
| parts.push(node_text(&child, source).to_string()); |
There was a problem hiding this comment.
Rust captures C-style calls that the JS extractor silently skips
The Rust comment on line 392 states that tree-sitter-objc lacks a function field on call_expression, so the named-field lookup always misses and the identifier-child fallback is always used. The JS handleCCallExpr in src/extractors/objc.ts also calls childForFieldName('function') but returns immediately when it is null — no fallback, no call recorded. In practice this means C-style calls like printf(...) or CGContextFillRect(...) appear in the native graph but are absent from the JS one. The stated goal of identical output between both engines is not met for this node type.
There was a problem hiding this comment.
Fixed in 7294941 — JS handleCCallExpr now falls back to the first identifier/field_expression child when the function field is missing, matching handle_c_call_expr in the Rust extractor. C-style calls like printf(...) are no longer silently dropped on the JS path. Added a regression test (extracts C-style function calls without a \function` field) in tests/parsers/objc.test.tsthat assertsprintfappears insymbols.calls`. Per CLAUDE.md ("both engines must produce identical results"), fixing the less-accurate side rather than documenting the gap.
Three engine-parity gaps Greptile flagged on the WASM side that broke the "identical output" goal: - @import statements: tree-sitter-objc v3 emits `module_import` not `import_declaration`, so the JS dispatch arm never matched and every `@import Foundation;` was silently dropped. Accept both node types. - C-style calls (printf, CGContextFillRect, …): the grammar lacks a `function` field on `call_expression`, so the named-field lookup always misses. Rust falls back to the first identifier child; JS did not, so every C call was dropped. Add the same fallback. - Message expressions: the grammar tags each keyword identifier with the `method` field rather than exposing a `selector` field, so the JS selector lookup misfired for multi-keyword selectors. Assemble the selector from `method` children with `:` joining, matching `build_message_selector` in the Rust extractor. Also expose `fieldNameForChild` on the `TreeSitterNode` type and add three JS extractor tests covering the new parity behaviour.
|
Addressed the remaining Greptile feedback in 7294941:
Per CLAUDE.md ("both engines must produce identical results"), the parity gaps were fixed on the less-accurate side (JS) rather than documented. |
Summary
tree-sitter-objcdependency and a native Objective-C extractor incrates/codegraph-core/src/extractors/objc.rs..mwithLanguageKind::ObjCand the Rustfile_collector, adds Objective-C toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresOBJC_AST_TYPES/OBJC_AST_CONFIGon both the native and JS sides so the two engines extract identicalast_nodesfor Objective-C source.extractObjCSymbols:class_interface/class_implementation(with: Superclass) asclass,@protocolasinterface, instance and class method declarations / definitions with multi-part selectors assembled from leading identifiers andmethod_parameterchildren, C-level function declarations / definitions,#import/@importimports, and message-expression call extraction.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 193/193npx tree-sitter build --wasmregeneratestree-sitter-objc.wasmnpx vitest run tests/parsers/objc.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13