Add XSS context analyzer with AST-based detection #7063

odvcencio wants to merge 4 commits into projectdiscovery:dev from
Conversation
- Add XSS context analyzer for the fuzzer that detects precise injection contexts
- Implement HTML tokenizer-based detection for element text, comments, and attributes
- Add JavaScript AST parsing using gotreesitter for script string/comment/expression context classification
- Add CSS AST parsing for style value and url() function detection
- Support detection of event handlers, URL attributes, and style attributes
- Implement unique canary generation for reliable reflection detection
- Register analyzer with fuzz framework via blank import in http protocol
- Include comprehensive tests covering all supported context types
Walkthrough

Adds a new XSS context analyzer for the fuzzer: it generates deterministic canaries, injects them into requests, sends requests, parses responses (HTML/JS/CSS) with Tree-sitter to find reflections, and classifies reflection contexts. Also adds tests, a dependency, and a side-effect import to register the analyzer.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Fuzzer as Fuzzer
    participant Analyzer as XSS Analyzer
    participant HTTPClient as HTTP Client
    participant Response as HTTP Response
    participant Parser as Tree-sitter Parser
    Fuzzer->>Analyzer: Analyze(options)
    Analyzer->>Analyzer: generateCanary()
    Analyzer->>Analyzer: inject canary into request
    Analyzer->>HTTPClient: send rebuilt request
    HTTPClient->>Response: receive response
    Analyzer->>Response: read body (<=10MB)
    alt canary reflected
        Analyzer->>Parser: parse HTML (tokenizer)
        Parser-->>Analyzer: tokens / reflection positions
        Analyzer->>Analyzer: classify HTML contexts (text, comment, attr)
        Analyzer->>Parser: parse JS/CSS fragments as needed
        Parser-->>Analyzer: AST nodes for JS/CSS
        Analyzer-->>Fuzzer: (true, comma-separated contexts, nil)
    else not reflected
        Analyzer-->>Fuzzer: (false, "", nil)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/fuzz/analyzers/xss/js_context.go (1)
10-13: Unused parameter `scriptOffset`.

The `scriptOffset` parameter is documented as "not currently needed" but is still part of the function signature. Consider removing it or prefixing it with `_` to clarify it's intentionally unused.

♻️ Suggested fix

```diff
-// classifyJSContext parses JavaScript source and determines the sub-context
-// of a canary reflection. scriptOffset is the byte offset of the script content
-// within the original HTML document (used for error context, not currently needed).
-func classifyJSContext(jsSource []byte, canary string, scriptOffset uint32) XSSContext {
+// classifyJSContext parses JavaScript source and determines the sub-context
+// of a canary reflection.
+func classifyJSContext(jsSource []byte, canary string) XSSContext {
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/fuzz/analyzers/xss/js_context.go` around lines 10-13: the parameter scriptOffset on function classifyJSContext is unused; either remove it from the signature and all call sites (update callers of classifyJSContext) or mark it as intentionally unused by renaming it to _scriptOffset (or prefixing with _) in the classifyJSContext declaration and any implementations to silence linter warnings; update function comments to reflect the change and ensure references to scriptOffset inside classifyJSContext (if any) are removed or adjusted.

pkg/fuzz/analyzers/xss/html_context.go (1)

162-200: Edge case: `findQuoteContext` may match wrong attribute occurrence.

If the same attribute key appears multiple times in the HTML (e.g., multiple `value=` assignments), this function may return the quoting context of the wrong occurrence. However, given the canary's uniqueness, this is unlikely to cause incorrect results in practice.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/fuzz/analyzers/xss/html_context.go` around lines 162-200: findQuoteContext can pick the wrong occurrence when the same attribute key appears multiple times; update the search to only accept a match if the found key is a true attribute name and the specific occurrence's value matches attr.Val. Concretely, in findQuoteContext(bodyStr, attr) ensure the character before the found key is not an identifier character (so you match whole attribute names), parse from the '=' to the attribute value boundary (handling quotes) and only return the context when that parsed raw value contains attr.Val; otherwise continue searching after the end of that value (not just after '=') so you advance to the next occurrence. Use the existing symbols key, pos, rawVal and attr.Val to implement these checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/fuzz/analyzers/xss/analyzer.go`:
- Around line 50-58: The HTTP response body returned by
options.HttpClient.Do(rebuilt) is not being closed, causing a resource leak;
update the logic in analyzer.go around the call to options.HttpClient.Do (where
resp, err := options.HttpClient.Do(rebuilt) and body, err :=
io.ReadAll(resp.Body) live) to ensure resp.Body.Close() is always
called—preferably by adding a defer resp.Body.Close() immediately after
confirming err == nil from the Do call so the body is closed whether ReadAll
succeeds or fails.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (8)
- go.mod
- pkg/fuzz/analyzers/xss/analyzer.go
- pkg/fuzz/analyzers/xss/analyzer_test.go
- pkg/fuzz/analyzers/xss/context.go
- pkg/fuzz/analyzers/xss/css_context.go
- pkg/fuzz/analyzers/xss/html_context.go
- pkg/fuzz/analyzers/xss/js_context.go
- pkg/protocols/http/http.go
- Add benchmark test covering HTML text, attribute, event handler contexts
- Include script string/template, CSS value/URL, and multi-reflection scenarios
- Enable performance measurement across all XSS context detection patterns
- Add comprehensive test cases covering realistic HTML scenarios from actual web applications
- Test various XSS contexts including HTML text, attributes, URLs, CSS, scripts, and event handlers
- Validate context detection on complex patterns like SPAs, search results, error pages, and user profiles
- Ensure analyzer correctly identifies multiple reflection points within single responses
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/fuzz/analyzers/xss/analyzer_test.go (1)
152-158: Add a regression test for encoded-canary behavior.

Please add one case asserting that fully/heavily encoded canary forms are not treated as exploitable reflections in detection, so this contract stays protected.

Based on learnings: in pkg/fuzz/analyzers/xss/, detection should use exact-canary matching as phase 1, and not treat heavily encoded reflections as exploitable detection hits.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/fuzz/analyzers/xss/analyzer_test.go` around lines 152-158: add a regression test alongside TestNoReflection that verifies heavily/fully encoded canary forms are not treated as exploitable reflections: create a new test (e.g., TestEncodedCanaryNotMatched) that constructs HTML containing encoded variants of testCanary (use findReflections([]byte(...), testCanary)) and assert len(points) == 0; this ensures the detection in findReflections uses exact-canary matching and does not flag encoded canary payloads as hits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/fuzz/analyzers/xss/analyzer_test.go`:
- Around line 190-192: The test slices c1[:4] directly which can panic if the
canary is shorter; update the assertion in
pkg/fuzz/analyzers/xss/analyzer_test.go to first guard the length (e.g., assert
len(c1) >= 4 and fail the test with a clear message if not) or replace the slice
check with a safe strings.HasPrefix(c1, "gtss") call so the test fails cleanly
instead of panicking; reference the variable c1 and the existing prefix check to
locate where to change.
Neo - PR Security Review

High: 1 · Medium: 1

Findings:
- High (1): Unbounded response body read enables memory exhaustion DoS (CWE-770)
- Medium (1): Gotreesitter parser DoS from malicious HTML/JS/CSS structures (CWE-400)

Security impact, attack examples, and suggested fixes for both findings are detailed in the inline comments below.
pkg/fuzz/analyzers/xss/analyzer.go (outdated)

```go
	return false, "", errors.Wrap(err, "could not send canary request")
}

body, err := io.ReadAll(resp.Body)
```
🟠 Unbounded response body read enables memory exhaustion DoS (CWE-770) — The analyzer reads the entire HTTP response body into memory using io.ReadAll without verifying that response size limits are enforced. While nuclei has MaxBodyRead protections at the protocol layer (default 10MB), the analyzer makes its own HTTP request via options.HttpClient.Do() and directly calls ReadAll on the response body. If the HTTP client is misconfigured or lacks MaxRespBodySize enforcement, a malicious server could return gigabytes of data causing memory exhaustion.
Attack Example
Attacker's malicious server responds to the canary request with Transfer-Encoding: chunked and sends 10GB of data in chunks. The io.ReadAll call attempts to buffer all 10GB in memory, causing the nuclei process to crash with OOM.
Suggested Fix
Wrap resp.Body with io.LimitReader before calling ReadAll: body, err := io.ReadAll(io.LimitReader(resp.Body, maxBodyLimit)) where maxBodyLimit is derived from nuclei's MaxBodyRead constant or a configurable analyzer parameter. This ensures the analyzer respects size limits regardless of HTTP client configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/fuzz/analyzers/xss/analyzer.go` at line 55, replace the unbounded
`io.ReadAll(resp.Body)` call with a size-limited read: `body, err :=
io.ReadAll(io.LimitReader(resp.Body, 10*1024*1024))` to prevent memory
exhaustion from malicious servers returning gigabyte-sized responses. Import the
io package if not already imported. Consider making the 10MB limit configurable
via analyzer parameters.
pkg/fuzz/analyzers/xss/js_context.go (outdated)

```go
func classifyJSContext(jsSource []byte, canary string, scriptOffset uint32) XSSContext {
	lang := grammars.JavascriptLanguage()
	parser := gotreesitter.NewParser(lang)
	tree, err := parser.Parse(jsSource)
```
🟡 Gotreesitter parser DoS from malicious HTML/JS/CSS structures (CWE-400) — The gotreesitter parser processes untrusted HTML/JS/CSS content from HTTP responses without timeout or resource limits. The parser.Parse() calls in js_context.go:16 and css_context.go:15 can hang indefinitely or consume excessive CPU/memory when processing specially crafted deeply nested structures. Known tree-sitter issues include indefinite hangs on malicious JavaScript (tree-sitter-javascript #322) and timeout enforcement not applying during tree balancing (tree-sitter #4019).
Attack Example
Attacker returns response with <script> tag containing 50,000 levels of nested arrays: [[[[...]]]] or deeply nested function calls. The gotreesitter parser attempts to build an AST with 50,000 depth, exhausting stack space or hanging in tree balancing operations that don't respect timeouts.
Suggested Fix
Implement a timeout mechanism for parser.Parse() calls using context.WithTimeout and recover from panics: ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second); defer cancel(); use a goroutine with select on ctx.Done() to abort parsing. Alternatively, add depth/size pre-validation: reject JS/CSS content >100KB or with nesting depth >500 before parsing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/fuzz/analyzers/xss/js_context.go` around lines 14-19, add timeout
protection for gotreesitter parsing: wrap the parser.Parse(jsSource) call in a
goroutine with context.WithTimeout(5*time.Second), and return
ContextScriptExpression if the timeout is exceeded. Also add size validation
before parsing: if len(jsSource) > 100*1024, return ContextScriptExpression
immediately. Apply the same pattern to css_context.go:15.
… and resource limits

- Add 10MB response body size limit to prevent OOM on large responses
- Add panic recovery and size limits to CSS/JS tree-sitter parsers (100KB limit)
- Fix HTML attribute quote context matching for complex edge cases
- Add test coverage for encoded canary rejection and attribute boundary cases
- Add defer resp.Body.Close() to fix potential resource leak
- Optimize canary prefix check in tests using strings.HasPrefix
- Add exceedsNestingDepth guard against deeply nested parser inputs
- Improve attribute name matching with whole-word detection
🧹 Nitpick comments (1)
pkg/fuzz/analyzers/xss/html_context.go (1)
162-250: Consider edge case: multiple attributes with the same name.
`findQuoteContext` iterates through the raw HTML to find the attribute and determine its quoting style. If the same attribute name appears multiple times in the document (possibly in different tags), the search may match the wrong occurrence. This could lead to incorrect quote context classification.

The current approach is pragmatic for most real-world cases, but be aware this could cause misclassification in pathological HTML with repeated attribute names across elements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/fuzz/analyzers/xss/html_context.go` around lines 162 - 250, findQuoteContext can pick the wrong attribute instance when the same attribute name appears multiple times; update the search so after locating a candidate key (keyStart/keyEnd) you also confirm the key/value belong to the same tag and correspond to this html.Attribute: (1) ensure there is no '<' between keyStart and valueEnd (and that there is a matching '>' after valueEnd) to guarantee the pair is inside one element, and (2) when checking rawVal, verify attr.Val actually appears inside rawVal and use the position of attr.Val inside rawVal (not just strings.Contains) to tie the matched value to this key occurrence before returning ContextAttrValue*. Apply these checks in findQuoteContext where keyStart/keyEnd, pos, valueStart, valueEnd and rawVal are computed.
📒 Files selected for processing (6)
- pkg/fuzz/analyzers/xss/analyzer.go
- pkg/fuzz/analyzers/xss/analyzer_test.go
- pkg/fuzz/analyzers/xss/context.go
- pkg/fuzz/analyzers/xss/css_context.go
- pkg/fuzz/analyzers/xss/html_context.go
- pkg/fuzz/analyzers/xss/js_context.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/fuzz/analyzers/xss/analyzer_test.go
Summary
Adds an XSS context analyzer that detects the precise injection context of reflected input in HTTP responses. This enables context-aware XSS payload generation in the fuzzer.
Detected contexts include HTML element text, comments, and attribute values, event handlers (onclick/onload/etc.), URL attributes (href/src/action), JavaScript strings (double/single/template literal), JS expressions, JS comments (line/block), CSS values, CSS url(), and style attributes. Detection combines the golang.org/x/net/html tokenizer for HTML-level context with gotreesitter AST parsing for JavaScript and CSS sub-context classification.

How it works
- Generates a unique canary (`gtss` + 8 random alphanum chars) and injects it into the request
- Locates reflections in the response and sub-parses `<script>` or `<style>` blocks as needed
- Reports the detected contexts via `analyzer_details`

Why gotreesitter for JS/CSS
Previous attempts at this feature used regex or flat tokenizers, which can't distinguish between:
- `<script>var x = "REFLECTED"</script>` (string context — needs quote escape)
- `<script>var x = REFLECTED</script>` (expression context — direct injection)
- `<script>// REFLECTED</script>` (comment context — needs newline)

AST parsing with tree-sitter grammars handles these correctly via structural analysis rather than pattern matching.
Files
- pkg/fuzz/analyzers/xss/context.go
- pkg/fuzz/analyzers/xss/html_context.go
- pkg/fuzz/analyzers/xss/js_context.go
- pkg/fuzz/analyzers/xss/css_context.go
- pkg/fuzz/analyzers/xss/analyzer.go
- pkg/fuzz/analyzers/xss/analyzer_test.go
- pkg/protocols/http/http.go

Build
Requires the `grammar_set_core` build tag for gotreesitter grammars (includes HTML/JS/CSS, ~1MB):
All 31 tests pass:
Real-World Test Scenarios
`TestRealWorldReflections` validates context detection against realistic HTML from actual web applications, asserting the expected context set per scenario — combinations such as `html_text` with `attr_value_double_quoted`, `script_string_double` with `html_text`, `css_url` with `url_attribute`, `html_text` with `url_attribute`, `attr_value_double_quoted` with `event_handler`, `script_string_double` alone, and `script_template_literal` with `event_handler`.
HTML-only contexts (text, attribute, event handler) run in ~1μs. JS/CSS sub-parsing adds ~100-600μs for grammar loading — negligible for network-bound fuzzing.
Test plan

- `grammar_set_core` tag

/claim #5838
Summary by CodeRabbit
Release Notes
New Features
Tests