v1.4.0 (Full headless mode support)
When both -cwu (Chrome WebSocket URL) and -headless flags are used together, -cwu should take precedence since the user explicitly wants to connect to an existing browser session.

Without this fix, when users run:

    katana -headless -u URL -cwu ws://... -no-incognito

The -headless flag causes Katana to launch a new Chrome instance, completely ignoring the -cwu WebSocket URL. This makes authenticated crawling via existing browser sessions impossible when following the official documentation.

Changes:
- Prioritize ChromeWSUrl check in engine selection (runner.go)
- Add warning when -headless is used with -cwu (options.go)
- Ensure hybrid engine is selected when -cwu is specified

This fix maintains backward compatibility while making the documented workflow actually work as intended.

Fixes issue where -cwu flag is ignored when -headless is present.
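The precedence rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from runner.go: the `Options` struct, the `selectEngine` function, and the engine names returned as strings are all hypothetical stand-ins.

```go
package main

import "fmt"

// Options is a hypothetical, trimmed-down stand-in for the crawler options.
type Options struct {
	Headless    bool
	ChromeWSUrl string
}

// selectEngine checks ChromeWSUrl first so that -cwu wins over -headless.
// It returns the chosen engine name and an optional warning for the user.
func selectEngine(o Options) (engine string, warning string) {
	switch {
	case o.ChromeWSUrl != "":
		if o.Headless {
			warning = "-headless is ignored because -cwu was provided"
		}
		// This commit message says the hybrid engine is selected for -cwu.
		return "hybrid", warning
	case o.Headless:
		return "headless", ""
	default:
		return "standard", ""
	}
}

func main() {
	engine, warn := selectEngine(Options{
		Headless:    true,
		ChromeWSUrl: "ws://127.0.0.1:9222/devtools/browser/example",
	})
	fmt.Println("engine:", engine)
	fmt.Println("warning:", warn)
}
```

Putting the WebSocket check ahead of the headless check is the whole point: the order of the `case` arms encodes the precedence.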
## Page Load Strategy

Adds --page-load-strategy (-pls) flag with options:
- heuristic (default): Smart wait with URL polling and network idle
- domcontentloaded: Fixed wait time for SPAs with continuous activity
- networkidle: Wait for network to be idle
- load: Wait for page load event only
- none: No waiting, proceed immediately

Also adds --dom-wait-time (-dwt) flag to configure wait time for domcontentloaded strategy (default: 5 seconds).

Fixes timeout issues with modern SPAs that have continuous network activity (WebSockets, polling, streaming APIs) that never reach a truly 'idle' state.

## Pure Headless ChromeWSUrl Support

Enhances the previous -cwu fix (af49d12) to use pure headless engine instead of hybrid, enabling full crawling capabilities:
- Modified runner to use headless.New() when -cwu is set
- Added ChromeWSUrl support to pure headless browser launcher
- Prevents closing external browser (only closes tabs)
- Properly handles temp directory creation with -cwu

## Testing

Tested on TurboTax SPA - increased discovery from 125 to 5,300+ endpoints using '-cwu ws://... -pls domcontentloaded -dwt 10'. Browser stays open for reusable sessions.
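One way the five strategy names could be validated before they reach the crawler is sketched below. The function name `normalizeStrategy` and the lookup table are assumptions made for illustration; the PR's actual validation in options.go may look different.

```go
package main

import "fmt"

// validStrategies lists the strategy names from the commit message above.
var validStrategies = map[string]bool{
	"heuristic":        true,
	"domcontentloaded": true,
	"networkidle":      true,
	"load":             true,
	"none":             true,
}

// normalizeStrategy (hypothetical helper) maps an empty value to the
// documented default and rejects names that are not in the list.
func normalizeStrategy(s string) (string, error) {
	if s == "" {
		return "heuristic", nil
	}
	if !validStrategies[s] {
		return "", fmt.Errorf("unknown page load strategy %q", s)
	}
	return s, nil
}

func main() {
	for _, in := range []string{"", "load", "bogus"} {
		s, err := normalizeStrategy(in)
		fmt.Printf("%q -> %q, err=%v\n", in, s, err)
	}
}
```

Rejecting unknown names early avoids silently falling through to the heuristic branch when a user mistypes the flag value.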
Walkthrough
Adds CLI flags and Options fields to configure page-load strategy and DOM wait time, wires them through runner, crawler, and headless launcher, enables connecting to an existing Chrome via WebSocket (ChromeWSUrl) or launching a new browser, and implements multiple page-load strategies.
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant CLI as Katana CLI
participant Runner as Runner
participant Crawler as Headless Crawler
participant Launcher as Browser Launcher
participant Browser as Chrome
participant Page as Browser Page
User->>CLI: pass flags (--page-load-strategy, --dom-wait-time, --chrome-ws-url)
CLI->>Runner: parse & validate options
Runner->>Crawler: instantiate headless crawler with options
Crawler->>Launcher: NewLauncher(PageLoadStrategy, DOMWaitTime, ChromeWSUrl)
alt ChromeWSUrl provided
Launcher->>Browser: connect via WebSocket (reuse existing)
else
Launcher->>Browser: launch new Chrome (create temp user-data dir)
end
Launcher->>Page: create page & navigate
alt strategy = "none"
Page->>Page: no waiting
else strategy = "load"
Page->>Page: wait for load event
else strategy = "domcontentloaded"
Page->>Page: sleep DOMWaitTime
else strategy = "networkidle"
Page->>Page: wait for network idle
else
Page->>Page: apply heuristic stability checks
end
Page->>User: return crawled content
alt connected via WebSocket
Note right of Launcher: skip browser/process cleanup
else
Launcher->>Browser: close browser
Launcher->>Launcher: cleanup temp dir
end
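The two cleanup paths at the end of the diagram come down to an ownership rule: only tear down what this process created. A small sketch of that rule follows; the `session` type, its fields, and the helper names are hypothetical, and the real launcher talks to Chrome rather than flipping a boolean.

```go
package main

import (
	"fmt"
	"os"
)

// session is a hypothetical stand-in for the browser launcher state.
type session struct {
	wsURL         string // non-empty when attached to an existing Chrome
	tempDir       string // only set when this process launched the browser
	browserClosed bool   // placeholder for an actual browser.Close() call
}

// newSession creates a temp user-data dir only when launching a new browser.
func newSession(wsURL string) (*session, error) {
	s := &session{wsURL: wsURL}
	if wsURL == "" {
		dir, err := os.MkdirTemp("", "katana-sketch-*")
		if err != nil {
			return nil, err
		}
		s.tempDir = dir
	}
	return s, nil
}

// Close skips browser and temp-dir cleanup for attached sessions.
func (s *session) Close() error {
	if s.wsURL != "" {
		return nil // external browser: leave it running for reuse
	}
	s.browserClosed = true
	return os.RemoveAll(s.tempDir)
}

// dirExists reports whether path currently exists on disk.
func dirExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	attached, _ := newSession("ws://127.0.0.1:9222/devtools/browser/example")
	_ = attached.Close()
	fmt.Println("attached browser closed:", attached.browserClosed)

	launched, _ := newSession("")
	_ = launched.Close()
	fmt.Println("launched browser closed:", launched.browserClosed)
	fmt.Println("temp dir still exists:", dirExists(launched.tempDir))
}
```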
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@internal/runner/options.go`:
- Around line 58-65: The current check for options.Headless and
options.ChromeWSUrl logs a message but doesn't explicitly warn when the headless
shorthand (-hh) is used; update the conditional handling around options.Headless
and options.ChromeWSUrl in internal/runner/options.go to emit a clear warning
via gologger.Warning().Msgf that the -hh/-headless flag is ignored when
ChromeWSUrl is provided and ChromeWSUrl takes precedence (include the
ChromeWSUrl in the info log as before), ensuring the message references the -hh
shorthand so users know their flag is being suppressed.
In `@pkg/engine/headless/browser/browser.go`:
- Around line 209-280: The domcontentloaded branch in
BrowserPage.WaitPageLoadHeurisitics currently sleeps for a fixed duration
instead of actually waiting for the DOMContentLoaded lifecycle event; replace
the sleep with an explicit wait for that event by creating a timeout chain (like
other branches use) and calling
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) (or the equivalent
WaitNavigation method on the chained Timeout object) before returning so the
behavior matches the "domcontentloaded" semantics.
In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-229: The domcontentloaded branch in the stability wait switch
is hard-coded to time.Sleep(1 * time.Second), ignoring the CLI --dom-wait-time
setting; update the branch in the function that uses
c.Options.Options.PageLoadStrategy to read the configured DOM wait duration
(e.g., c.Options.Options.DomWaitTime or the appropriate field that holds the
--dom-wait-time value) and call time.Sleep with that duration instead of 1s,
ensuring to keep the existing debug log but include the actual wait duration in
the message.
@fail-open merge conflict, fyi

@coderabbitai create an issue for this PR
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/engine/headless/browser/browser.go (1)
326-420: ⚠️ Potential issue | 🟡 Minor
Temp user‑data dir can leak when page creation fails
If launchBrowserWithDataDir succeeds but browser.Page(...) or later init fails, the deferred cleanup closes the browser but never removes tempDir. That can leak directories on repeated failures.
🧹 Add temp dir cleanup on failure
  defer func() {
      if !successfulPageCreation {
          _ = page.Close()
          // Only close browser if we launched it (not connecting via ChromeWSUrl)
          if l.opts.ChromeWSUrl == "" {
              _ = browser.Close()
          }
+         if shouldCleanupTempDir {
+             _ = os.RemoveAll(tempDir)
+         }
      }
  }()
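The cleanup-on-failure pattern suggested here can be exercised in isolation. The sketch below uses only the standard library; `openPage` and its `fail` parameter are hypothetical stand-ins for "launch the browser, then try to create a page", not functions from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// openPage (hypothetical) creates a temp dir, then simulates page creation.
// If page creation fails, the deferred function removes the temp dir so
// repeated failures do not leak directories.
func openPage(fail bool) (dir string, err error) {
	dir, err = os.MkdirTemp("", "katana-sketch-*")
	if err != nil {
		return "", err
	}
	success := false
	defer func() {
		if !success {
			_ = os.RemoveAll(dir)
		}
	}()
	if fail {
		return dir, errors.New("page creation failed")
	}
	success = true
	return dir, nil
}

// exists reports whether path is present on disk.
func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// remove deletes path; used to tidy up after a successful run.
func remove(path string) { _ = os.RemoveAll(path) }

func main() {
	dir, err := openPage(true)
	fmt.Println("failure path:", err, "dir exists:", exists(dir))

	dir, err = openPage(false)
	fmt.Println("success path:", err, "dir exists:", exists(dir))
	remove(dir)
}
```

The success flag is set only after every fallible step has passed, so any early return leaves it false and triggers the cleanup.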
🤖 Fix all issues with AI agents
In `@pkg/engine/headless/browser/browser.go`:
- Around line 67-75: The DOM wait defaulting currently treats 0 as unset; change
the logic in browser.go so that opts.DOMWaitTime is only defaulted when negative
(e.g., use a check for opts.DOMWaitTime < 0) so an explicit -dwt 0 is honored as
"no extra wait"; update the block that sets opts.DOMWaitTime (the DOM wait
defaulting around opts.DOMWaitTime) to only assign 5 when the value is negative
or otherwise sentinel, leaving 0 intact.
- Around line 210-285: The DOMContentLoaded branch currently arms
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation
has already completed, which can miss the event and block until timeout; change
this to use a post-navigation check such as calling chained.WaitLoad() (the same
primitive used in the "load" case) or a helper that detects whether
DOMContentLoaded has already fired before waiting, then apply the existing
optional sleep using b.launcher.opts.DOMWaitTime; update the "domcontentloaded"
case to call chained := b.Timeout(15 * time.Second); _ = chained.WaitLoad(); if
b.launcher.opts.DOMWaitTime > 0 { time.Sleep(...) } and remove the
WaitNavigation/wait() pairing so the listener cannot miss the event.
In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-249: The code currently calls waitNavigation() (which blocks
on FirstMeaningfulPaint) before the strategy switch, so even "none" and other
strategies always wait; move the FirstMeaningfulPaint lifecycle wait out of the
unconditional waitNavigation() call and into the appropriate branches of the
switch inside the same function (i.e., call/await FirstMeaningfulPaint only for
strategies that need it such as "load" or the default heuristic, skip it
entirely for "none", and keep DOMWaitTime handling in "domcontentloaded"),
ensuring you update or remove the prior waitNavigation() call and instead invoke
the lifecycle wait where strategy-specific waits (e.g., page.WaitStable,
DOMWaitTime sleep, or a short load sleep) are performed so behavior matches the
PageLoadStrategy flag.
// Default to "heuristic" if not specified
if opts.PageLoadStrategy == "" {
	opts.PageLoadStrategy = "heuristic"
}

// Default DOM wait time if not specified
if opts.DOMWaitTime <= 0 {
	opts.DOMWaitTime = 5
}
Honor DOMWaitTime=0 when explicitly set
With <= 0, a user‑supplied -dwt 0 is forced back to 5s, so there’s no way to disable the extra wait. If 0 is intended to mean “no extra wait,” consider only defaulting on negative values or introducing a sentinel.
✏️ Possible tweak (allow 0)
- if opts.DOMWaitTime <= 0 {
+ if opts.DOMWaitTime < 0 {
opts.DOMWaitTime = 5
}
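The sentinel idea in this comment can be sketched as a tiny helper. It assumes the flag's "unset" value is negative (for example a flag default of -1), which is an assumption for illustration and not something the PR states; `resolveDOMWait` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// defaultDOMWaitSeconds mirrors the documented default of 5 seconds.
const defaultDOMWaitSeconds = 5

// resolveDOMWait (hypothetical) treats negative values as "unset" and
// honors an explicit 0 as "no extra wait".
func resolveDOMWait(seconds int) time.Duration {
	if seconds < 0 {
		seconds = defaultDOMWaitSeconds
	}
	return time.Duration(seconds) * time.Second
}

func main() {
	for _, v := range []int{-1, 0, 10} {
		fmt.Printf("-dwt %d -> %s\n", v, resolveDOMWait(v))
	}
}
```

Note that with a plain int field whose zero value also means "not provided", this only works if the CLI layer sets a negative default; otherwise a pointer or a separate "was set" flag would be needed.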
// Respect the page load strategy from launcher options
strategy := b.launcher.opts.PageLoadStrategy

switch strategy {
case "none":
	// Don't wait at all, return immediately
	return nil

case "load":
	// Just wait for the load event
	chained := b.Timeout(15 * time.Second)
	return chained.WaitLoad()

case "domcontentloaded":
	// Wait for the DOMContentLoaded event, then optionally wait additional time
	// for JavaScript to render interactive elements
	chained := b.Timeout(15 * time.Second)
	wait := chained.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded)
	wait()
	// Additional wait time for JS rendering after DOMContentLoaded
	if b.launcher.opts.DOMWaitTime > 0 {
		time.Sleep(time.Duration(b.launcher.opts.DOMWaitTime) * time.Second)
	}
	return nil

case "networkidle":
	// Wait for network activity to stop
	chained := b.Timeout(15 * time.Second)
	_ = chained.WaitLoad()
	_ = chained.WaitIdle(2 * time.Second)
	return nil

case "heuristic":
	fallthrough
default:
	// Use the original heuristic approach
	opts := defaultWaitOptions

	chained := b.Timeout(opts.MaxTimeout)

	// 1. Wait for the basic load event (DOMContentLoaded / load).
	_ = chained.WaitLoad()

	// 2. Capture the current URL so we can detect route changes.
	urlVal, _ := b.Eval("() => window.location.href")
	startURL := ""
	if urlVal != nil {
		startURL = urlVal.Value.Str()
	}

	// 3. Poll for a different URL for up to URLPollTimeout.
	urlChanged := false
	if startURL != "" {
		pollCount := int(opts.URLPollTimeout / opts.URLPollInterval)
		for i := 0; i < pollCount; i++ {
			time.Sleep(opts.URLPollInterval)
			cur, err := b.Eval("() => window.location.href")
			if err == nil && cur != nil && cur.Value.Str() != startURL {
				urlChanged = true
				break
			}
		}
	}

	if urlChanged {
		// 4a. URL changed – short grace period then network idle & done.
		_ = chained.WaitIdle(opts.PostChangeWait)
		return nil
	}

	// 4b. URL didn't change – fall back to broader heuristics.
	_ = chained.WaitIdle(opts.IdleWait)
	_ = b.WaitNewStable(opts.DOMStableWait)

	return nil
}
🧩 Analysis chain
🏁 Script executed:
# First, find where WaitPageLoadHeurisitics is called
rg -n "WaitPageLoadHeurisitics" --type go -A 5 -B 5
Repository: projectdiscovery/katana
Length of output: 5055
🏁 Script executed:
# Find Navigate calls to understand the pattern
rg -n "\.Navigate\(" --type go -A 3 -B 3
Repository: projectdiscovery/katana
Length of output: 1760
🏁 Script executed:
# Check the WaitNavigation implementation to understand its requirements
rg -n "func.*WaitNavigation" --type go -A 20
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Also check if WaitPageLoadHeurisitics is defined in this file
rg -n "func.*WaitPageLoadHeurisitics" --type go -A 5
Repository: projectdiscovery/katana
Length of output: 518
🏁 Script executed:
# Search for WaitNavigation usage patterns in the codebase
rg -n "WaitNavigation" --type go -B 3 -A 3
Repository: projectdiscovery/katana
Length of output: 1170
🏁 Script executed:
# Check if there's Rod browser documentation or comments about WaitNavigation
rg -n "rod\|Rod\|navigation" --type go pkg/engine/headless/browser/browser.go | head -30
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Check the imports to understand where WaitNavigation comes from
head -50 pkg/engine/headless/browser/browser.go | rg -n "import|WaitNavigation"
Repository: projectdiscovery/katana
Length of output: 77
🏁 Script executed:
# Look for any configuration or initialization of page load strategy
rg -n "PageLoadStrategy\|page.load.strategy" --type go -B 2 -A 2
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Check the BrowserPage type and its methods
rg -n "type BrowserPage\|func.*BrowserPage" --type go -A 10 pkg/engine/headless/browser/
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Look at Timeout implementation
rg -n "func.*Timeout" --type go -B 2 -A 10 pkg/engine/headless/browser/
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Check go.mod to see Rod version
cat go.mod | rg -i "rod\|go-rod"
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Search for any comments about navigation timing or WaitNavigation behavior
rg -n "WaitNavigation\|navigation.*timing\|arm.*before\|before.*navigate" --type go -B 2 -A 2 | head -50
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# Read the browser.go file around the WaitPageLoadHeurisitics function
cat -n pkg/engine/headless/browser/browser.go | head -250 | tail -100
Repository: projectdiscovery/katana
Length of output: 4155
🏁 Script executed:
# Check the imports and type definitions at the top of the file
head -100 pkg/engine/headless/browser/browser.go
Repository: projectdiscovery/katana
Length of output: 2731
🏁 Script executed:
# Look for go.mod in various locations
find . -name "go.mod" -type f | head -5
Repository: projectdiscovery/katana
Length of output: 75
🏁 Script executed:
# Check the exact implementation of WaitPageLoadHeurisitics for the domcontentloaded case
sed -n '220,235p' pkg/engine/headless/browser/browser.go
Repository: projectdiscovery/katana
Length of output: 641
Fix WaitNavigation() timing for DOMContentLoaded case
The "domcontentloaded" case calls WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation has already started. Since WaitPageLoadHeurisitics() is called after Navigate() completes in all call sites (see pkg/engine/headless/crawler/crawler.go:435–438), the DOMContentLoaded event may have already fired before the listener is armed, causing wait() to block until timeout (15s) or the next navigation event.
Replace WaitNavigation() with a post-navigation primitive that checks current state. Consider using WaitLoad() (matches the "load" case pattern) or a helper that verifies if DOMContentLoaded has already occurred, followed by the optional DOMWaitTime sleep.
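The race described in this comment (a listener armed after the event has already fired) can be reproduced with a toy model. Everything below is hypothetical and standard-library only: `lifecycle`, `waitNext`, and `waitReady` are illustrative names and do not correspond to rod's API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lifecycle is a toy model of a page that fires one event (for example
// DOMContentLoaded) and remembers that it has fired.
type lifecycle struct {
	mu      sync.Mutex
	fired   bool
	waiters []chan struct{}
}

// fire marks the event as fired and wakes every registered waiter.
func (l *lifecycle) fire() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.fired = true
	for _, w := range l.waiters {
		close(w)
	}
	l.waiters = nil
}

// wait registers a listener. When checkState is true it first checks
// whether the event already happened and returns immediately if so.
func (l *lifecycle) wait(checkState bool, timeoutMs int) bool {
	ch := make(chan struct{})
	l.mu.Lock()
	if checkState && l.fired {
		l.mu.Unlock()
		return true
	}
	l.waiters = append(l.waiters, ch)
	l.mu.Unlock()
	select {
	case <-ch:
		return true
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		return false
	}
}

// waitNext only observes events fired after the call (listener semantics).
func (l *lifecycle) waitNext(timeoutMs int) bool { return l.wait(false, timeoutMs) }

// waitReady checks current state before listening (post-navigation semantics).
func (l *lifecycle) waitReady(timeoutMs int) bool { return l.wait(true, timeoutMs) }

func main() {
	l := &lifecycle{}
	l.fire() // the event fires before anyone is listening

	fmt.Println("listener armed too late:", l.waitNext(50))
	fmt.Println("state check first:", l.waitReady(50))
}
```

The late listener can only time out, while the state-checking variant returns at once, which is the behavior the review asks for in the domcontentloaded branch.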
// Wait the page to be stable a duration based on page load strategy
strategy := c.Options.Options.PageLoadStrategy

switch strategy {
case "none":
	// Don't wait at all
	gologger.Debug().Msgf("page-load-strategy=none: skipping stability wait\n")

case "domcontentloaded":
	// Wait for DOM to render using the configured wait time
	waitTime := time.Duration(c.Options.Options.DOMWaitTime) * time.Second
	gologger.Debug().Msgf("page-load-strategy=domcontentloaded: waiting %s for DOM\n", waitTime)
	time.Sleep(waitTime)

case "load":
	// Wait for load event but don't check network stability
	gologger.Debug().Msgf("page-load-strategy=load: basic load wait only\n")
	time.Sleep(500 * time.Millisecond)

default:
	// heuristic, networkidle, or any other strategy: use WaitStable
	timeStable := time.Duration(c.Options.Options.TimeStable) * time.Second

	if timeout < timeStable {
		gologger.Warning().Msgf("timeout is less than time stable, setting time stable to half of timeout to avoid timeout\n")
		timeStable = timeout / 2
		gologger.Warning().Msgf("setting time stable to %s\n", timeStable)
	}

	if err := page.WaitStable(timeStable); err != nil {
		gologger.Warning().Msgf("could not wait for page to be stable: %s\n", err)
	}
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the relevant sections of the crawl.go file
cd /tmp && find . -name "crawl.go" -path "*/engine/hybrid/*" 2>/dev/null | head -5
Repository: projectdiscovery/katana
Length of output: 49
🏁 Script executed:
# List files in the repo structure to understand where crawl.go is
git ls-files | grep -E "crawl\.go|engine.*hybrid" | head -20
Repository: projectdiscovery/katana
Length of output: 203
🏁 Script executed:
# Get the file and read a larger section around lines 218-249 to see context
wc -l pkg/engine/hybrid/crawl.go
Repository: projectdiscovery/katana
Length of output: 97
🏁 Script executed:
# Read the context around the strategy switch (lines 180-260) to see if waitNavigation() is called
sed -n '180,260p' pkg/engine/hybrid/crawl.go | cat -n
Repository: projectdiscovery/katana
Length of output: 3265
Page‑load strategy is gated by an unconditional FirstMeaningfulPaint wait that undermines strategy behavior
waitNavigation() (line 27, called at line 37) unconditionally blocks on FirstMeaningfulPaint before the strategy switch executes. This means:
- none still waits for FirstMeaningfulPaint instead of skipping all waits
- domcontentloaded and load wait for FirstMeaningfulPaint first before their respective delays
- All strategies are unnecessarily bound by this lifecycle event
Move the lifecycle event wait into the strategy switch so behavior matches the intended flag semantics (skip for none, use appropriate lifecycle events for others).
Suggested fix
- // wait the page to be fully loaded and becoming idle
- waitNavigation := page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint)
+ // wait the page based on page-load strategy
+ var waitNavigation func()
+ switch strategy {
+ case "none":
+ // no lifecycle wait
+ case "domcontentloaded":
+ waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded)
+ case "load":
+ waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameLoad)
+ default:
+ waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint)
+ }
...
- waitNavigation()
+ if waitNavigation != nil {
+ waitNavigation()
+ }
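The mapping in the suggested fix can be pulled out into a small pure function, which makes the "skip for none" behavior easy to test. The sketch below is illustrative only: `lifecycleEventFor` is a hypothetical helper, and the event names are plain strings standing in for the `proto.PageLifecycleEventName...` constants used in the diff.

```go
package main

import "fmt"

// lifecycleEventFor (hypothetical) returns the lifecycle event a strategy
// should wait for. ok is false when no lifecycle wait should be armed.
func lifecycleEventFor(strategy string) (event string, ok bool) {
	switch strategy {
	case "none":
		return "", false
	case "domcontentloaded":
		return "DOMContentLoaded", true
	case "load":
		return "load", true
	default:
		// heuristic, networkidle, and unknown values keep the previous behavior.
		return "firstMeaningfulPaint", true
	}
}

func main() {
	for _, s := range []string{"none", "domcontentloaded", "load", "heuristic"} {
		event, ok := lifecycleEventFor(s)
		fmt.Printf("%-17s wait=%v event=%q\n", s, ok, event)
	}
}
```

A caller would arm the navigation wait only when `ok` is true, which matches the nil check on `waitNavigation` in the suggested diff.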
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/runner/options.go (1)
64-70: ⚠️ Potential issue | 🟡 Minor
Duplicate validation check: second block is unreachable dead code.
Lines 64-67 and 68-70 have identical conditions. The second check will never execute because the first returns an error. Remove the duplicate.
🔧 Proposed fix to remove dead code
  if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") && !options.Headless && !options.HeadlessHybrid {
      return errkit.New("headless (-hl) or hybrid (-hh) mode is required if -ho, -nos or -scp are set")
  }
- if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") && !options.Headless && !options.HeadlessHybrid {
-     return errkit.New("headless mode (-hl) is required if -ho, -nos or -scp are set")
- }
Proposed changes
This PR adds configurable page load strategies and enhances the -cwu (Chrome WebSocket URL) flag to use pure headless mode, solving timeout issues with modern SPAs.

Problem
Modern Single Page Applications (SPAs) with continuous network activity (WebSockets, polling, streaming APIs) never reach a "network idle" state, causing Katana's headless crawler to time out. Additionally, the existing -cwu flag forces hybrid mode, limiting crawling capabilities.

Solution
1. Configurable Page Load Strategy (--page-load-strategy / -pls)
Adds multiple strategies to handle different application types:
- heuristic (default): Smart wait with URL polling and network idle detection - maintains backward compatibility
- domcontentloaded: Fixed wait time for SPAs with continuous activity - solves timeout issues
- networkidle: Wait for network to be idle
- load: Wait for page load event only
- none: No waiting, proceed immediately
Also adds --dom-wait-time (-dwt) flag (default: 5 seconds) to configure wait time for the domcontentloaded strategy.

2. Pure Headless ChromeWSUrl Support
Enhances the previous -cwu fix (af49d12) to use pure headless engine:
- headless.New() instead of hybrid.New() when -cwu is set
- -cwu (only closes tabs)

Benefits
- -cwu enables action discovery, state tracking, diagnostics

Files Changed
- cmd/katana/main.go - Add CLI flags
- pkg/types/options.go - Add option fields
- internal/runner/options.go - Add validation and warnings
- internal/runner/runner.go - Use headless engine for ChromeWSUrl
- pkg/engine/headless/browser/browser.go - Implement strategy logic and ChromeWSUrl support
- pkg/engine/headless/crawler/crawler.go - Pass options through
- pkg/engine/headless/headless.go - Pass options through
- pkg/engine/hybrid/crawl.go - Support strategy in hybrid mode

Proof
Tested on enterprise SPA (complex React application with WebSockets and continuous network activity):
Command:
katana -u https://target \
  -cwu 'ws://localhost:9222/devtools/browser/...' \
  -pls domcontentloaded \
  -dwt 10 \
  -d 5 \
  -no-incognito \
  -fs rdn \
  -xhr \
  -fx
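The WebSocket URL passed to -cwu has to come from the running browser. When Chrome is started with --remote-debugging-port=9222, its HTTP endpoint at /json/version returns a JSON document that includes a webSocketDebuggerUrl field. The sketch below only parses such a response; `wsURLFromVersionJSON` is a hypothetical helper, the sample payload is made up for illustration, and fetching the document over HTTP is left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// versionInfo holds the single field of interest from /json/version.
type versionInfo struct {
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// wsURLFromVersionJSON (hypothetical) extracts the browser-level WebSocket
// URL from the body returned by http://localhost:9222/json/version.
func wsURLFromVersionJSON(body []byte) (string, error) {
	var v versionInfo
	if err := json.Unmarshal(body, &v); err != nil {
		return "", err
	}
	if v.WebSocketDebuggerURL == "" {
		return "", errors.New("no webSocketDebuggerUrl in response")
	}
	return v.WebSocketDebuggerURL, nil
}

func main() {
	// Made-up sample payload; a real response contains more fields.
	sample := []byte(`{"Browser":"Chrome/120.0.0.0","webSocketDebuggerUrl":"ws://localhost:9222/devtools/browser/example-id"}`)
	url, err := wsURLFromVersionJSON(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pass to -cwu:", url)
}
```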
Results:
Key findings:
-fs rdn
Checklist
Summary by CodeRabbit
New Features
Behavior
Documentation