feat: switch to pure-go javascript endpoint extraction using goja #1529

Open
t6harsh wants to merge 2 commits into projectdiscovery:dev from t6harsh:feat/pure-go-js-extractor

Conversation

@t6harsh t6harsh commented Feb 11, 2026

Proposed changes

This PR replaces the github.com/BishopFox/jsluice dependency (and its CGO requirement github.com/smacker/go-tree-sitter) with a pure-Go JavaScript endpoint extractor using github.com/dop251/goja's AST parser.

Impact

  • Eliminates CGO: Removes all CGO requirements, significantly simplifying cross-platform builds (especially for Windows and Linux ARM/386).
  • Cross-Platform Consistency: The jsluice parser was previously guarded by //go:build !(386 || windows), meaning Windows/386 users had reduced functionality. This PR enables full JavaScript analysis on all platforms.
  • Dependency Cleanup: Removes jsluice and go-tree-sitter from the dependency graph.

Implementation Details

  • Rewrote pkg/utils/jsluice.go to use dop251/goja/parser and dop251/goja/ast.
  • Implemented a robust AST walker that detects URLs in:
    • fetch() calls
    • XMLHttpRequest.open()
    • window.open()
    • location.href / img.src assignments
    • Object literals, arrays, and template literals
    • jQuery ($.ajax) and axios calls
  • Added a regex fallback for malformed JavaScript (graceful degradation).
  • Updated .goreleaser/*.yml to remove CGO_ENABLED=1 and cross-compiler requirements.

Proof

I have added a comprehensive test suite in pkg/utils/jsluice_test.go covering 25+ scenarios including all supported extraction patterns and edge cases.

New Tests Passing:

=== RUN   TestExtractJsluiceEndpoints
--- PASS: TestExtractJsluiceEndpoints (0.01s)
    --- PASS: TestExtractJsluiceEndpoints/fetch_call (0.00s)
    --- PASS: TestExtractJsluiceEndpoints/XMLHttpRequest_open (0.00s)
    --- PASS: TestExtractJsluiceEndpoints/window.open (0.00s)
    --- PASS: TestExtractJsluiceEndpoints/location.href_assignment (0.00s)
    --- PASS: TestExtractJsluiceEndpoints/malformed_JS_falls_back_to_regex (0.00s)
    ...

Cross-Platform Build Verification:
Builds now succeed without CGO on previously problematic platforms:

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ./cmd/katana/
CGO_ENABLED=0 GOOS=darwin GOARCH=arm64 go build ./cmd/katana/
CGO_ENABLED=0 GOOS=windows GOARCH=386 go build ./cmd/katana/

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

/claim #1367

Summary by CodeRabbit

  • New Features

    • Improved JavaScript endpoint extraction — more accurate detection across diverse JS patterns with robust fallback handling.
  • Chores

    • Updated project dependencies for maintainability.
  • Platform

    • Parser compatibility extended to all OS/architecture combinations.
  • Tests

    • Added comprehensive tests validating extraction accuracy and preprocessing behavior.

coderabbitai bot commented Feb 11, 2026

Walkthrough

Removed platform-specific parser file and build constraints; replaced external jsluice analyzer with a pure-Go AST-based JavaScript endpoint extractor using goja and added tests; updated go.mod to add goja and sourcemap and remove several jsluice-related dependencies.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Management: go.mod | Removed BishopFox/jsluice, smacker/go-tree-sitter, ditashi/jsbeautifier-go; added dop251/goja and go-sourcemap/sourcemap as indirect deps. |
| Build Constraint Removal: pkg/engine/parser/parser_generic.go, pkg/utils/jsluice_test.go | Removed `//go:build !(386 |
| Platform-Specific Deletion: pkg/engine/parser/parser_nojs.go | Deleted Windows/386-conditional file that defined public Options type and (*Parser).InitWithOptions; related conditional parser registration logic removed. |
| AST-Based JavaScript Extraction: pkg/utils/jsluice.go | Replaced external jsluice analyzer with a pure-Go AST-based extractor using goja. Added JSLuiceEndpoint type, ES6 preprocessing, AST-walking and classification functions, URL-detection heuristics, and a regex fallback when parsing fails. |
| Tests (Extraction & Helpers): pkg/utils/jsluice_test.go | Added extensive tests for endpoint extraction, URL-like detection, ES6 preprocessing, deduplication, and regex fallback; asserts endpoint values and inferred types across many JS patterns. |

Sequence Diagram

sequenceDiagram
    participant Code as "JavaScript Code"
    participant Preprocess as "Preprocess ES6"
    participant Engine as "Goja Engine"
    participant AST as "AST Walker"
    participant Classifier as "Type Classifier"
    participant Fallback as "Regex Fallback"
    participant Result as "JSLuiceEndpoint"

    Code->>Preprocess: Raw JS (may include imports/exports)
    Preprocess->>Engine: Cleaned JS source
    Engine->>AST: Parse into AST and traverse
    alt AST parse success
        AST->>Classifier: Emit URL-like strings with context
        Classifier->>Result: Endpoint + Type (fetch/xhr/axios/import/...)
    else AST parse failure
        Engine->>Fallback: Provide raw JS for regex matching
        Fallback->>Result: Endpoint + Type "regex"
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hopping through code with a twitchy nose,
I trimmed old walls where platform-flagged code froze.
I parsed wild JS with a nimble paw,
Found endpoints aplenty—oh what a haul!
A tiny rabbit cheers: new paths, new prose. 🥕🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely describes the main change: switching from jsluice to a pure-Go goja-based JavaScript endpoint extraction, which is the primary objective across all modified files. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pkg/utils/jsluice.go (2)

66-73: ES6 preprocessing regex won't match multiline imports or re-exports like export * from 'mod'.

This is acceptable since failed preprocessing → failed parse → regex fallback. Just noting that patterns like multiline destructured imports or export * from '...' / export { x } from '...' will cause fallback to regex extraction.


193-197: Method parameter defaults not walked in walkClass.

MethodDefinition.Body is a *FunctionLiteral — its ParameterList defaults aren't walked here, unlike FunctionLiteral and ArrowFunctionLiteral in walkExpression (lines 251-254, 258-261). Low likelihood of URLs in parameter defaults, but worth keeping consistent.

Proposed fix
 		case *ast.MethodDefinition:
 			walkExpression(ce.Key, emit)
 			if ce.Body != nil {
 				walkStatement(ce.Body.Body, emit)
+				if ce.Body.ParameterList != nil {
+					for _, p := range ce.Body.ParameterList.List {
+						walkExpression(p.Initializer, emit)
+					}
+				}
 			}

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@pkg/utils/jsluice.go`:
- Around line 396-400: The isXHROpen function mixes a case-insensitive suffix
check (using lower) with a case-sensitive exclusion (funcName != "window.open"),
causing names like "Window.open" to slip through; fix by comparing the exclusion
against the lowercased name (e.g., use lower != "window.open") so both the
suffix check and the "window.open" exclusion are performed case-insensitively
while keeping the existing logic in isXHROpen.
- Around line 377-394: classifyCallType currently treats any function name
containing "open" as "xhr", which misclassifies "window.open"; update
classifyCallType to check for "window.open" (e.g., strings.Contains(lower,
"window.open") or exact match) before the generic strings.Contains(lower,
"open") case and return a distinct type like "window_open" (so downstream
fmt.Sprintf("jsluice-%s", item.Type) labels it correctly); modify the switch in
classifyCallType (and keep isXHROpen logic unchanged) so "window.open" is
handled first and does not fall through to the "xhr" branch.
- Around line 84-161: The expression walker is missing handling for optional
chaining wrappers; update the walkExpression function to add cases for
*ast.Optional and *ast.OptionalChain and recursively call walkExpression on
their contained/wrapped expression(s) (i.e., unwrap the optional node and
traverse its inner expression/chain) so optional chaining like obj?.foo() or
arr?.[i] is visited; this complements the existing walkStatement and ensures
URL-like strings inside optional chains are not skipped.
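
The first two items above can be sketched together. The real bodies of isXHROpen and classifyCallType are not shown in this review, so the code below is a hypothetical reconstruction: the "ajax" and "call" labels, the order of the other cases, and the suffix test are assumptions. It only illustrates comparing against the lowercased name and handling "window.open" before the generic "open" case.

```go
package main

import (
	"fmt"
	"strings"
)

// isXHROpen reports whether funcName looks like an XHR-style ".open" call.
// Hypothetical reconstruction: both the suffix check and the window.open
// exclusion use the lowercased name, so "Window.open" is excluded too.
func isXHROpen(funcName string) bool {
	lower := strings.ToLower(funcName)
	return strings.HasSuffix(lower, ".open") && lower != "window.open"
}

// classifyCallType maps a callee name to an endpoint type. The window.open
// case comes before the generic "open" case so it never falls through to
// "xhr". The "ajax" and "call" labels are assumptions for illustration.
func classifyCallType(funcName string) string {
	lower := strings.ToLower(funcName)
	switch {
	case strings.Contains(lower, "window.open"):
		return "window_open"
	case strings.Contains(lower, "fetch"):
		return "fetch"
	case strings.Contains(lower, "axios"):
		return "axios"
	case strings.Contains(lower, "ajax"):
		return "ajax"
	case strings.Contains(lower, "open"):
		return "xhr"
	default:
		return "call"
	}
}

func main() {
	for _, name := range []string{"xhr.open", "window.open", "Window.open", "fetch"} {
		fmt.Printf("%-12s isXHROpen=%-5v label=jsluice-%s\n",
			name, isXHROpen(name), classifyCallType(name))
	}
}
```

With this ordering, "Window.open" is labeled jsluice-window_open rather than jsluice-xhr, and isXHROpen rejects it regardless of case.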
🧹 Nitpick comments (4)
pkg/utils/jsluice.go (4)

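16-17: urlLikeStringRegex misses relative paths and some protocol-relative URLs, and could over-match path segments.

The regex only matches https?://... or /... paths. It will miss:

  1. Protocol-relative URLs that contain characters outside the path class, like //cdn.example.com:8080/resource.js (plain ones such as //cdn.example.com/resource.js match only incidentally, because the path branch allows consecutive slashes)
  2. Relative paths like ./api/data or ../api/data (these are common in bundled JS)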

Additionally, the path branch /[a-zA-Z0-9_\-./]+ allows consecutive dots and slashes (e.g., /foo/..//bar), which may produce noisy results.

Suggested regex improvement
-	urlLikeStringRegex = regexp.MustCompile(`^(?:https?://[^\s'"` + "`" + `]+|/[a-zA-Z0-9_\-./]+(?:\?[^\s'"` + "`" + `]*)?)$`)
+	urlLikeStringRegex = regexp.MustCompile(`^(?:https?://[^\s'"` + "`" + `]+|//[^\s'"` + "`" + `]+|\.{0,2}/[a-zA-Z0-9_\-./]+(?:\?[^\s'"` + "`" + `]*)?)$`)
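
A quick way to compare the two patterns is to run both against a handful of sample strings. The sketch below copies the current and suggested expressions from the diff above; the variable names currentRe and suggestedRe and the sample strings are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern currently in the PR, copied from the "-" line of the diff.
	currentRe = regexp.MustCompile(`^(?:https?://[^\s'"` + "`" + `]+|/[a-zA-Z0-9_\-./]+(?:\?[^\s'"` + "`" + `]*)?)$`)
	// Pattern proposed in the "+" line of the diff.
	suggestedRe = regexp.MustCompile(`^(?:https?://[^\s'"` + "`" + `]+|//[^\s'"` + "`" + `]+|\.{0,2}/[a-zA-Z0-9_\-./]+(?:\?[^\s'"` + "`" + `]*)?)$`)
)

func main() {
	samples := []string{
		"https://example.com/a",
		"/api/v1/users?id=1",
		"./api/data",
		"../api/data",
		"//cdn.example.com/resource.js",
		"//cdn.example.com:8080/app.js",
		"hello world",
	}
	for _, s := range samples {
		fmt.Printf("%-34q current=%-5v suggested=%v\n",
			s, currentRe.MatchString(s), suggestedRe.MatchString(s))
	}
}
```

In this run the relative paths and the protocol-relative URL with a port match only the suggested pattern. The plain //cdn.example.com/resource.js already matches the current one, because the path character class includes "/".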

66-74: ES6 preprocessing regex has limitations with multi-line and complex import forms.

The regex works line-by-line ((?m)) which means multi-line imports like:

import {
  foo,
  bar
} from 'baz';

won't be fully stripped, potentially leaving } from 'baz'; which could cause a parse error. Also, export * from 'module'; and export { default } from 'module'; (re-exports) aren't matched by the export branch.

This is acceptable as a best-effort preprocessor since the regex fallback catches parse failures, but worth documenting these known limitations.
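
The line-by-line limitation is easy to reproduce in isolation. The preprocessing regex from the PR is not quoted in this review, so importLineRe below is a hypothetical stand-in with the same line-anchored ((?m)) shape. It is a sketch of the failure mode, not the PR's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// importLineRe is a hypothetical line-anchored import stripper. It requires
// "import" and "from '...'" to sit on the same line, which is the property
// that makes multi-line imports slip through.
var importLineRe = regexp.MustCompile(`(?m)^[ \t]*import\b[^\n]*?from[ \t]+['"][^'"\n]+['"];?[ \t]*$`)

// stripImports removes every single-line import statement it can match.
func stripImports(src string) string {
	return importLineRe.ReplaceAllString(src, "")
}

func main() {
	single := "import { foo } from 'baz';\nfetch('/api/items');"
	multi := "import {\n  foo,\n  bar\n} from 'baz';\nfetch('/api/items');"

	fmt.Printf("single-line: %q\n", stripImports(single))
	fmt.Printf("multi-line:  %q\n", stripImports(multi))
}
```

The single-line import is removed, while the multi-line form comes back unchanged. A parser that does not accept module syntax would then reject the file and hand it to the regex fallback.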


313-335: walkCallExpression walks the callee, which may cause unintended URL emission from nested expressions in the callee chain.

Line 315 unconditionally walks the callee via walkExpression. For simple cases (fetch, $.ajax), the callee is an Identifier or DotExpression with no URL strings. However, for computed callees like getConfig("/api/base").fetch("/api/endpoint"), the inner call's string argument "/api/base" would be emitted as a generic "string" (from checkAndEmitURL via walkExpression) rather than with proper call-context typing.

This is a minor accuracy concern, not a correctness bug — the URL is still extracted.
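
The effect can be illustrated with a toy AST. Every type and function below (Node, Call, Dot, walk, walkCall, collect, sample) is hypothetical and far simpler than goja's ast package. The sketch only mirrors the described behavior: the callee is walked unconditionally, and only the outermost recognized call gets call-context typing.

```go
package main

import "fmt"

// Toy AST nodes, hypothetical stand-ins for the real goja node types.
type Node interface{}
type Ident struct{ Name string }
type Str struct{ Value string }
type Dot struct {
	Left Node
	Name string
}
type Call struct {
	Callee Node
	Args   []Node
}

// Endpoint pairs an extracted string with the context it was found in.
type Endpoint struct{ Value, Type string }

// walk emits bare string literals with the generic "string" type.
func walk(n Node, emit func(Endpoint)) {
	switch v := n.(type) {
	case Str:
		emit(Endpoint{Value: v.Value, Type: "string"})
	case Dot:
		walk(v.Left, emit)
	case Call:
		walkCall(v, emit)
	}
}

// walkCall walks the callee first, then types direct string arguments by the
// callee name. Only "fetch" is recognized in this toy version.
func walkCall(c Call, emit func(Endpoint)) {
	walk(c.Callee, emit) // unconditional callee walk, as described above
	typ := "string"
	switch callee := c.Callee.(type) {
	case Ident:
		if callee.Name == "fetch" {
			typ = "fetch"
		}
	case Dot:
		if callee.Name == "fetch" {
			typ = "fetch"
		}
	}
	for _, a := range c.Args {
		if s, ok := a.(Str); ok {
			emit(Endpoint{Value: s.Value, Type: typ})
			continue
		}
		walk(a, emit)
	}
}

// collect gathers every endpoint emitted while walking n.
func collect(n Node) []Endpoint {
	var out []Endpoint
	walk(n, func(e Endpoint) { out = append(out, e) })
	return out
}

// sample builds the AST for getConfig("/api/base").fetch("/api/endpoint").
func sample() Node {
	inner := Call{Callee: Ident{Name: "getConfig"}, Args: []Node{Str{Value: "/api/base"}}}
	return Call{Callee: Dot{Left: inner, Name: "fetch"}, Args: []Node{Str{Value: "/api/endpoint"}}}
}

func main() {
	for _, e := range collect(sample()) {
		fmt.Printf("%s (%s)\n", e.Value, e.Type)
	}
}
```

In this toy model "/api/base" is emitted with the generic "string" type and "/api/endpoint" with "fetch", which matches the behavior described above: both URLs are found, but only the outer one carries call context.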


434-451: Redundant deduplication in regexFallbackExtract.

ExtractRelativeEndpoints (from pkg/utils/regex.go:50-67) already deduplicates results using its own unique map. The seen map in regexFallbackExtract performs the same deduplication again.

This is harmless (defensive), but if you want to keep the code minimal:

Simplified version
 func regexFallbackExtract(data string) []JSLuiceEndpoint {
 	matches := ExtractRelativeEndpoints(data)
-	seen := make(map[string]struct{})
 	var endpoints []JSLuiceEndpoint
-
 	for _, match := range matches {
-		if _, ok := seen[match]; ok {
-			continue
-		}
-		seen[match] = struct{}{}
 		endpoints = append(endpoints, JSLuiceEndpoint{
 			Endpoint: match,
 			Type:     "regex",
 		})
 	}
 	return endpoints
 }
