Skip to content

Feat/page load strategy#1514

Open
fail-open wants to merge 7 commits intoprojectdiscovery:devfrom
fail-open:feat/page-load-strategy
Open

Feat/page load strategy#1514
fail-open wants to merge 7 commits intoprojectdiscovery:devfrom
fail-open:feat/page-load-strategy

Conversation

@fail-open
Copy link
Collaborator

@fail-open fail-open commented Feb 3, 2026

Proposed changes

This PR adds configurable page load strategies and enhances the -cwu (Chrome WebSocket URL) flag to use pure headless mode, solving timeout issues with modern SPAs.

Problem

Modern Single Page Applications (SPAs) with continuous network activity (WebSockets, polling, streaming APIs) never reach a "network idle" state, causing Katana's headless crawler to timeout indefinitely. Additionally, the existing
-cwu flag forces hybrid mode, limiting crawling capabilities.

Solution

1. Configurable Page Load Strategy (--page-load-strategy / -pls)
Adds multiple strategies to handle different application types:

  • heuristic (default): Smart wait with URL polling and network idle detection - maintains backward compatibility
  • domcontentloaded: Fixed wait time for SPAs with continuous activity - solves timeout issues
  • networkidle: Wait for network to be idle
  • load: Wait for page load event only
  • none: No waiting, proceed immediately
    Also adds --dom-wait-time (-dwt) flag (default: 5 seconds) to configure wait time for the domcontentloaded strategy.
    2. Pure Headless ChromeWSUrl Support
    Enhances the previous -cwu fix (af49d12) to use pure headless engine:
  • Modified runner to use headless.New() instead of hybrid.New() when -cwu is set
  • Added ChromeWSUrl support to pure headless browser launcher
  • Prevents closing external browser when using -cwu (only closes tabs)
  • Properly handles temp directory creation with external browsers

Benefits

  1. Solves timeout issues for SPAs with continuous network activity
  2. Backward compatible - defaults to existing "heuristic" behavior
  3. Full headless features - pure headless mode with -cwu enables action discovery, state tracking, diagnostics
  4. Browser persistence - external browser stays open for reusable sessions
  5. Performance - faster crawling when appropriate

Files Changed

  • cmd/katana/main.go - Add CLI flags
  • pkg/types/options.go - Add option fields
  • internal/runner/options.go - Add validation and warnings
  • internal/runner/runner.go - Use headless engine for ChromeWSUrl
  • pkg/engine/headless/browser/browser.go - Implement strategy logic and ChromeWSUrl support
  • pkg/engine/headless/crawler/crawler.go - Pass options through
  • pkg/engine/headless/headless.go - Pass options through
  • pkg/engine/hybrid/crawl.go - Support strategy in hybrid mode

Proof

Tested on enterprise SPA (complex React application with WebSockets and continuous network activity):
Command:

katana -u https://target
-cwu 'ws://localhost:9222/devtools/browser/...'
-pls domcontentloaded
-dwt 10
-d 5
-no-incognito
-fs rdn
-xhr
-fx

Results:

  • Before: ~125 static resource URLs, frequent "context deadline exceeded" timeouts, browser closed after run
  • After: over 3000+ endpoints discovered (including 3000+ XHR/Fetch API calls), no timeouts, browser persisted
    Key findings:
  • Discovered GraphQL endpoints, apis
  • Browser stayed open for reusable authenticated sessions
  • No hanging on page loads
  • Proper scope handling across subdomains with -fs rdn

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • New Features

    • Added --page-load-strategy (heuristic, load, domcontentloaded, networkidle, none).
    • Added --dom-wait-time to customize DOM-content-loaded wait duration.
    • Added ability to connect to an existing Chrome instance via WebSocket for headless crawling.
  • Behavior

    • Page load/waiting now follows the selected strategy (including option to skip waits), giving finer timing control during crawls.
  • Documentation

    • CLI docs updated with examples and guidance for the new options.

dogancanbakir and others added 4 commits December 1, 2025 22:32
When both -cwu (Chrome WebSocket URL) and -headless flags are used
together, -cwu should take precedence since the user explicitly wants
to connect to an existing browser session.

Without this fix, when users run:
  katana -headless -u URL -cwu ws://... -no-incognito

The -headless flag causes Katana to launch a new Chrome instance,
completely ignoring the -cwu WebSocket URL. This makes authenticated
crawling via existing browser sessions impossible when following the
official documentation.

Changes:
- Prioritize ChromeWSUrl check in engine selection (runner.go)
- Add warning when -headless is used with -cwu (options.go)
- Ensure hybrid engine is selected when -cwu is specified

This fix maintains backward compatibility while making the documented
workflow actually work as intended.

Fixes issue where -cwu flag is ignored when -headless is present.
## Page Load Strategy
Adds --page-load-strategy (-pls) flag with options:
- heuristic (default): Smart wait with URL polling and network idle
- domcontentloaded: Fixed wait time for SPAs with continuous activity
- networkidle: Wait for network to be idle
- load: Wait for page load event only
- none: No waiting, proceed immediately

Also adds --dom-wait-time (-dwt) flag to configure wait time
for domcontentloaded strategy (default: 5 seconds).

Fixes timeout issues with modern SPAs that have continuous
network activity (WebSockets, polling, streaming APIs) that
never reach a truly 'idle' state.

## Pure Headless ChromeWSUrl Support
Enhances the previous -cwu fix (af49d12) to use pure headless
engine instead of hybrid, enabling full crawling capabilities:
- Modified runner to use headless.New() when -cwu is set
- Added ChromeWSUrl support to pure headless browser launcher
- Prevents closing external browser (only closes tabs)
- Properly handles temp directory creation with -cwu

## Testing
Tested on TurboTax SPA - increased discovery from 125 to 5,300+
endpoints using '-cwu ws://... -pls domcontentloaded -dwt 10'.
Browser stays open for reusable sessions.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 3, 2026

Walkthrough

Adds CLI flags and Options fields to configure page-load strategy and DOM wait time, wires them through runner, crawler, and headless launcher, enables connecting to an existing Chrome via WebSocket (ChromeWSUrl) or launching a new browser, and implements multiple page-load strategies.

Changes

Cohort / File(s) Summary
CLI & Core Options
cmd/katana/main.go, pkg/types/options.go, internal/runner/options.go
Added --page-load-strategy (-pls) and --dom-wait-time (-dwt) flags; new PageLoadStrategy (string) and DOMWaitTime (int) fields on Options; validation for allowed strategies and CLI warnings/logging around ChromeWSUrl/headless combinations.
Runner / Startup
internal/runner/runner.go
When ChromeWSUrl is set, runner uses the pure headless initialization path (headless.New) instead of hybrid; logs/prints Chrome WebSocket URL and warns about redundant/overridden headless/hybrid flags when applicable.
Headless Launcher & Browser
pkg/engine/headless/browser/browser.go
Launcher options gain PageLoadStrategy, ChromeWSUrl, and DOMWaitTime; supports connecting via ChromeWSUrl or launching Chrome with temp user-data dir management; page-load wait logic refactored into strategies: none, load, domcontentloaded, networkidle, heuristic; cleanup/close behavior adjusted when reusing external browser.
Headless Crawler Wiring
pkg/engine/headless/crawler/crawler.go, pkg/engine/headless/headless.go
Crawler and headless layer accept and forward PageLoadStrategy, ChromeWSUrl, and DOMWaitTime into launcher options during initialization.
Hybrid Crawl Strategy
pkg/engine/hybrid/crawl.go
Replaced previous fixed stability wait with a strategy-based switch implementing none, domcontentloaded (configurable wait), load, and heuristic branches; retains timeout/error handling and strategy-specific logs.
Docs
README.md
Adds documentation and examples for -pls/--page-load-strategy and -dwt/--dom-wait-time flags and strategy behavior.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as Katana CLI
    participant Runner as Runner
    participant Crawler as Headless Crawler
    participant Launcher as Browser Launcher
    participant Browser as Chrome
    participant Page as Browser Page

    User->>CLI: pass flags (--page-load-strategy, --dom-wait-time, --chrome-ws-url)
    CLI->>Runner: parse & validate options
    Runner->>Crawler: instantiate headless crawler with options
    Crawler->>Launcher: NewLauncher(PageLoadStrategy, DOMWaitTime, ChromeWSUrl)
    alt ChromeWSUrl provided
        Launcher->>Browser: connect via WebSocket (reuse existing)
    else
        Launcher->>Browser: launch new Chrome (create temp user-data dir)
    end
    Launcher->>Page: create page & navigate
    alt strategy = "none"
        Page->>Page: no waiting
    else strategy = "load"
        Page->>Page: wait for load event
    else strategy = "domcontentloaded"
        Page->>Page: sleep DOMWaitTime
    else strategy = "networkidle"
        Page->>Page: wait for network idle
    else
        Page->>Page: apply heuristic stability checks
    end
    Page->>User: return crawled content
    alt connected via WebSocket
        Note right of Launcher: skip browser/process cleanup
    else
        Launcher->>Browser: close browser
        Launcher->>Launcher: cleanup temp dir
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through flags and stitch the run,

Heuristic springs or DOM-time fun.
Reuse your Chrome or spawn anew,
Tempdirs tidy, waits run true.
A rabbit cheers — the crawl's begun!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Feat/page load strategy' is partially related to the changeset—it refers to the main feature being added (page load strategy configuration), but uses a generic 'Feat/' prefix and lacks specificity about the full scope, which also includes ChromeWSUrl pure headless support and DOM wait time configuration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/runner/options.go`:
- Around line 58-65: The current check for options.Headless and
options.ChromeWSUrl logs a message but doesn't explicitly warn when the headless
shorthand (-hh) is used; update the conditional handling around options.Headless
and options.ChromeWSUrl in internal/runner/options.go to emit a clear warning
via gologger.Warning().Msgf that the -hh/-headless flag is ignored when
ChromeWSUrl is provided and ChromeWSUrl takes precedence (include the
ChromeWSUrl in the info log as before), ensuring the message references the -hh
shorthand so users know their flag is being suppressed.

In `@pkg/engine/headless/browser/browser.go`:
- Around line 209-280: The domcontentloaded branch in
BrowserPage.WaitPageLoadHeurisitics currently sleeps for a fixed duration
instead of actually waiting for the DOMContentLoaded lifecycle event; replace
the sleep with an explicit wait for that event by creating a timeout chain (like
other branches use) and calling
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) (or the equivalent
WaitNavigation method on the chained Timeout object) before returning so the
behavior matches the "domcontentloaded" semantics.

In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-229: The domcontentloaded branch in the stability wait switch
is hard-coded to time.Sleep(1 * time.Second), ignoring the CLI --dom-wait-time
setting; update the branch in the function that uses
c.Options.Options.PageLoadStrategy to read the configured DOM wait duration
(e.g., c.Options.Options.DomWaitTime or the appropriate field that holds the
--dom-wait-time value) and call time.Sleep with that duration instead of 1s,
ensuring to keep the existing debug log but include the actual wait duration in
the message.

@dogancanbakir
Copy link
Member

@fail-open merge conflict, fyi

@Mzack9999
Copy link
Member

@coderabbitai create an issue for this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 5, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@Mzack9999, I'll create an issue for this PR to track the page load strategy implementation and ChromeWSUrl improvements.


🏁 Script executed:

#!/bin/bash

# Create a GitHub issue for PR `#1514`
gh issue create \
  --title "Page Load Strategy Configuration and ChromeWSUrl Handling" \
  --body "This issue tracks the implementation of configurable page load strategies and improved Chrome WebSocket URL handling for Katana.

**Related PR:** https://github.com/projectdiscovery/katana/pull/1514
**Requested by:** `@Mzack9999`

## Overview
The PR adds new CLI flags to configure page load strategies:
- \`--page-load-strategy\` / \`-pls\`: Options include heuristic (default), domcontentloaded, networkidle, load, none
- \`--dom-wait-time\` / \`-dwt\`: Configurable wait time for domcontentloaded strategy (default 5s)

## Key Changes
- Pure headless ChromeWSUrl support: runner now selects headless.New() when -cwu is set
- Multiple page load strategies to address timeout failures on modern SPAs with continuous network activity
- Browser persistence for reusable sessions when using external Chrome via WebSocket URL
- Backward compatibility maintained with heuristic as default strategy

## Benefits
- Resolves SPA timeouts by avoiding indefinite waits for network idle
- Enables full headless features with -cwu flag
- Performance improvements for complex SPAs
- Tested with React SPA: increased endpoint discovery from ~125 to 3000+ URLs

**Comment reference:** https://github.com/projectdiscovery/katana/pull/1514#issuecomment" \
  --assignee "Mzack9999"

Length of output: 54


@Mzack9999, I've successfully created issue #1520 to track the page load strategy configuration and ChromeWSUrl handling implementation from this PR. The issue has been assigned to you and includes a summary of the key changes and benefits.

@Mzack9999 Mzack9999 linked an issue Feb 5, 2026 that may be closed by this pull request
@Mzack9999 Mzack9999 added the Type: Enhancement Most issues will probably ask for additions or changes. label Feb 5, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/engine/headless/browser/browser.go (1)

326-420: ⚠️ Potential issue | 🟡 Minor

Temp user‑data dir can leak when page creation fails

If launchBrowserWithDataDir succeeds but browser.Page(...) or later init fails, the deferred cleanup closes the browser but never removes tempDir. That can leak directories on repeated failures.

🧹 Add temp dir cleanup on failure
 defer func() {
     if !successfulPageCreation {
         _ = page.Close()
         // Only close browser if we launched it (not connecting via ChromeWSUrl)
         if l.opts.ChromeWSUrl == "" {
             _ = browser.Close()
         }
+        if shouldCleanupTempDir {
+            _ = os.RemoveAll(tempDir)
+        }
     }
 }()
🤖 Fix all issues with AI agents
In `@pkg/engine/headless/browser/browser.go`:
- Around line 67-75: The DOM wait defaulting currently treats 0 as unset; change
the logic in browser.go so that opts.DOMWaitTime is only defaulted when negative
(e.g., use a check for opts.DOMWaitTime < 0) so an explicit -dwt 0 is honored as
"no extra wait"; update the block that sets opts.DOMWaitTime (the DOM wait
defaulting around opts.DOMWaitTime) to only assign 5 when the value is negative
or otherwise sentinel, leaving 0 intact.
- Around line 210-285: The DOMContentLoaded branch currently arms
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation
has already completed, which can miss the event and block until timeout; change
this to use a post-navigation check such as calling chained.WaitLoad() (the same
primitive used in the "load" case) or a helper that detects whether
DOMContentLoaded has already fired before waiting, then apply the existing
optional sleep using b.launcher.opts.DOMWaitTime; update the "domcontentloaded"
case to call chained := b.Timeout(15 * time.Second); _ = chained.WaitLoad(); if
b.launcher.opts.DOMWaitTime > 0 { time.Sleep(...) } and remove the
WaitNavigation/wait() pairing so the listener cannot miss the event.

In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-249: The code currently calls waitNavigation() (which blocks
on FirstMeaningfulPaint) before the strategy switch, so even "none" and other
strategies always wait; move the FirstMeaningfulPaint lifecycle wait out of the
unconditional waitNavigation() call and into the appropriate branches of the
switch inside the same function (i.e., call/await FirstMeaningfulPaint only for
strategies that need it such as "load" or the default heuristic, skip it
entirely for "none", and keep DOMWaitTime handling in "domcontentloaded"),
ensuring you update or remove the prior waitNavigation() call and instead invoke
the lifecycle wait where strategy-specific waits (e.g., page.WaitStable,
DOMWaitTime sleep, or a short load sleep) are performed so behavior matches the
PageLoadStrategy flag.

Comment on lines +67 to +75
// Default to "heuristic" if not specified
if opts.PageLoadStrategy == "" {
opts.PageLoadStrategy = "heuristic"
}

// Default DOM wait time if not specified
if opts.DOMWaitTime <= 0 {
opts.DOMWaitTime = 5
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Honor DOMWaitTime=0 when explicitly set

With <= 0, a user‑supplied -dwt 0 is forced back to 5s, so there’s no way to disable the extra wait. If 0 is intended to mean “no extra wait,” consider only defaulting on negative values or introducing a sentinel.

✏️ Possible tweak (allow 0)
- if opts.DOMWaitTime <= 0 {
+ if opts.DOMWaitTime < 0 {
     opts.DOMWaitTime = 5
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Default to "heuristic" if not specified
if opts.PageLoadStrategy == "" {
opts.PageLoadStrategy = "heuristic"
}
// Default DOM wait time if not specified
if opts.DOMWaitTime <= 0 {
opts.DOMWaitTime = 5
}
// Default to "heuristic" if not specified
if opts.PageLoadStrategy == "" {
opts.PageLoadStrategy = "heuristic"
}
// Default DOM wait time if not specified
if opts.DOMWaitTime < 0 {
opts.DOMWaitTime = 5
}
🤖 Prompt for AI Agents
In `@pkg/engine/headless/browser/browser.go` around lines 67 - 75, The DOM wait
defaulting currently treats 0 as unset; change the logic in browser.go so that
opts.DOMWaitTime is only defaulted when negative (e.g., use a check for
opts.DOMWaitTime < 0) so an explicit -dwt 0 is honored as "no extra wait";
update the block that sets opts.DOMWaitTime (the DOM wait defaulting around
opts.DOMWaitTime) to only assign 5 when the value is negative or otherwise
sentinel, leaving 0 intact.

Comment on lines +210 to +285
// Respect the page load strategy from launcher options
strategy := b.launcher.opts.PageLoadStrategy

switch strategy {
case "none":
// Don't wait at all, return immediately
return nil

case "load":
// Just wait for the load event
chained := b.Timeout(15 * time.Second)
return chained.WaitLoad()

case "domcontentloaded":
// Wait for the DOMContentLoaded event, then optionally wait additional time
// for JavaScript to render interactive elements
chained := b.Timeout(15 * time.Second)
wait := chained.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded)
wait()
// Additional wait time for JS rendering after DOMContentLoaded
if b.launcher.opts.DOMWaitTime > 0 {
time.Sleep(time.Duration(b.launcher.opts.DOMWaitTime) * time.Second)
}
return nil

case "networkidle":
// Wait for network activity to stop
chained := b.Timeout(15 * time.Second)
_ = chained.WaitLoad()
_ = chained.WaitIdle(2 * time.Second)
return nil

case "heuristic":
fallthrough
default:
// Use the original heuristic approach
opts := defaultWaitOptions

chained := b.Timeout(opts.MaxTimeout)

// 1. Wait for the basic load event (DOMContentLoaded / load).
_ = chained.WaitLoad()

// 2. Capture the current URL so we can detect route changes.
urlVal, _ := b.Eval("() => window.location.href")
startURL := ""
if urlVal != nil {
startURL = urlVal.Value.Str()
}

// 3. Poll for a different URL for up to URLPollTimeout.
urlChanged := false
if startURL != "" {
pollCount := int(opts.URLPollTimeout / opts.URLPollInterval)
for i := 0; i < pollCount; i++ {
time.Sleep(opts.URLPollInterval)
cur, err := b.Eval("() => window.location.href")
if err == nil && cur != nil && cur.Value.Str() != startURL {
urlChanged = true
break
}
}
}
}

if urlChanged {
// 4a. URL changed – short grace period then network idle & done.
_ = chained.WaitIdle(opts.PostChangeWait)
return nil
}
if urlChanged {
// 4a. URL changed – short grace period then network idle & done.
_ = chained.WaitIdle(opts.PostChangeWait)
return nil
}

// 4b. URL didn't change – fall back to broader heuristics.
_ = chained.WaitIdle(opts.IdleWait)
_ = b.WaitNewStable(opts.DOMStableWait)
// 4b. URL didn't change – fall back to broader heuristics.
_ = chained.WaitIdle(opts.IdleWait)
_ = b.WaitNewStable(opts.DOMStableWait)

return nil
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find where WaitPageLoadHeurisitics is called
rg -n "WaitPageLoadHeurisitics" --type go -A 5 -B 5

Repository: projectdiscovery/katana

Length of output: 5055


🏁 Script executed:

# Find Navigate calls to understand the pattern
rg -n "\.Navigate\(" --type go -A 3 -B 3

Repository: projectdiscovery/katana

Length of output: 1760


🏁 Script executed:

# Check the WaitNavigation implementation to understand its requirements
rg -n "func.*WaitNavigation" --type go -A 20

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Also check if WaitPageLoadHeurisitics is defined in this file
rg -n "func.*WaitPageLoadHeurisitics" --type go -A 5

Repository: projectdiscovery/katana

Length of output: 518


🏁 Script executed:

# Search for WaitNavigation usage patterns in the codebase
rg -n "WaitNavigation" --type go -B 3 -A 3

Repository: projectdiscovery/katana

Length of output: 1170


🏁 Script executed:

# Check if there's Rod browser documentation or comments about WaitNavigation
rg -n "rod\|Rod\|navigation" --type go pkg/engine/headless/browser/browser.go | head -30

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Check the imports to understand where WaitNavigation comes from
head -50 pkg/engine/headless/browser/browser.go | rg -n "import|WaitNavigation"

Repository: projectdiscovery/katana

Length of output: 77


🏁 Script executed:

# Look for any configuration or initialization of page load strategy
rg -n "PageLoadStrategy\|page.load.strategy" --type go -B 2 -A 2

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Check the BrowserPage type and its methods
rg -n "type BrowserPage\|func.*BrowserPage" --type go -A 10 pkg/engine/headless/browser/

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Look at Timeout implementation
rg -n "func.*Timeout" --type go -B 2 -A 10 pkg/engine/headless/browser/

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Check go.mod to see Rod version
cat go.mod | rg -i "rod\|go-rod"

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Search for any comments about navigation timing or WaitNavigation behavior
rg -n "WaitNavigation\|navigation.*timing\|arm.*before\|before.*navigate" --type go -B 2 -A 2 | head -50

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# Read the browser.go file around the WaitPageLoadHeurisitics function
cat -n pkg/engine/headless/browser/browser.go | head -250 | tail -100

Repository: projectdiscovery/katana

Length of output: 4155


🏁 Script executed:

# Check the imports and type definitions at the top of the file
head -100 pkg/engine/headless/browser/browser.go

Repository: projectdiscovery/katana

Length of output: 2731


🏁 Script executed:

# Look for go.mod in various locations
find . -name "go.mod" -type f | head -5

Repository: projectdiscovery/katana

Length of output: 75


🏁 Script executed:

# Check the exact implementation of WaitPageLoadHeurisitics for the domcontentloaded case
sed -n '220,235p' pkg/engine/headless/browser/browser.go

Repository: projectdiscovery/katana

Length of output: 641


Fix WaitNavigation() timing for DOMContentLoaded case

The "domcontentloaded" case calls WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation has already started. Since WaitPageLoadHeurisitics() is called after Navigate() completes in all call sites (see pkg/engine/headless/crawler/crawler.go:435–438), the DOMContentLoaded event may have already fired before the listener is armed, causing wait() to block until timeout (15s) or the next navigation event.

Replace WaitNavigation() with a post-navigation primitive that checks current state. Consider using WaitLoad() (matches the "load" case pattern) or a helper that verifies if DOMContentLoaded has already occurred, followed by the optional DOMWaitTime sleep.

🤖 Prompt for AI Agents
In `@pkg/engine/headless/browser/browser.go` around lines 210 - 285, The
DOMContentLoaded branch currently arms
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation
has already completed, which can miss the event and block until timeout; change
this to use a post-navigation check such as calling chained.WaitLoad() (the same
primitive used in the "load" case) or a helper that detects whether
DOMContentLoaded has already fired before waiting, then apply the existing
optional sleep using b.launcher.opts.DOMWaitTime; update the "domcontentloaded"
case to call chained := b.Timeout(15 * time.Second); _ = chained.WaitLoad(); if
b.launcher.opts.DOMWaitTime > 0 { time.Sleep(...) } and remove the
WaitNavigation/wait() pairing so the listener cannot miss the event.

Comment on lines +218 to +249
// Wait the page to be stable a duration based on page load strategy
strategy := c.Options.Options.PageLoadStrategy

switch strategy {
case "none":
// Don't wait at all
gologger.Debug().Msgf("page-load-strategy=none: skipping stability wait\n")

case "domcontentloaded":
// Wait for DOM to render using the configured wait time
waitTime := time.Duration(c.Options.Options.DOMWaitTime) * time.Second
gologger.Debug().Msgf("page-load-strategy=domcontentloaded: waiting %s for DOM\n", waitTime)
time.Sleep(waitTime)

case "load":
// Wait for load event but don't check network stability
gologger.Debug().Msgf("page-load-strategy=load: basic load wait only\n")
time.Sleep(500 * time.Millisecond)

default:
// heuristic, networkidle, or any other strategy: use WaitStable
timeStable := time.Duration(c.Options.Options.TimeStable) * time.Second

if timeout < timeStable {
gologger.Warning().Msgf("timeout is less than time stable, setting time stable to half of timeout to avoid timeout\n")
timeStable = timeout / 2
gologger.Warning().Msgf("setting time stable to %s\n", timeStable)
}

if err := page.WaitStable(timeStable); err != nil {
gologger.Warning().Msgf("could not wait for page to be stable: %s\n", err)
if err := page.WaitStable(timeStable); err != nil {
gologger.Warning().Msgf("could not wait for page to be stable: %s\n", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the relevant sections of the crawl.go file
cd /tmp && find . -name "crawl.go" -path "*/engine/hybrid/*" 2>/dev/null | head -5

Repository: projectdiscovery/katana

Length of output: 49


🏁 Script executed:

# List files in the repo structure to understand where crawl.go is
git ls-files | grep -E "crawl\.go|engine.*hybrid" | head -20

Repository: projectdiscovery/katana

Length of output: 203


🏁 Script executed:

# Get the file and read a larger section around lines 218-249 to see context
wc -l pkg/engine/hybrid/crawl.go

Repository: projectdiscovery/katana

Length of output: 97


🏁 Script executed:

# Read the context around the strategy switch (lines 180-260) to see if waitNavigation() is called
sed -n '180,260p' pkg/engine/hybrid/crawl.go | cat -n

Repository: projectdiscovery/katana

Length of output: 3265


Page‑load strategy is gated by an unconditional FirstMeaningfulPaint wait that undermines strategy behavior

waitNavigation() (line 27, called at line 37) unconditionally blocks on FirstMeaningfulPaint before the strategy switch executes. This means:

  • none still waits for FirstMeaningfulPaint instead of skipping all waits
  • domcontentloaded and load wait for FirstMeaningfulPaint first before their respective delays
  • All strategies are unnecessarily bound by this lifecycle event

Move the lifecycle event wait into the strategy switch so behavior matches the intended flag semantics (skip for none, use appropriate lifecycle events for others).

Suggested fix
- // wait the page to be fully loaded and becoming idle
- waitNavigation := page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint)
+ // wait the page based on page-load strategy
+ var waitNavigation func()
+ switch strategy {
+ case "none":
+     // no lifecycle wait
+ case "domcontentloaded":
+     waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded)
+ case "load":
+     waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameLoad)
+ default:
+     waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint)
+ }
...
- waitNavigation()
+ if waitNavigation != nil {
+     waitNavigation()
+ }
🤖 Prompt for AI Agents
In `@pkg/engine/hybrid/crawl.go` around lines 218 - 249, The code currently calls
waitNavigation() (which blocks on FirstMeaningfulPaint) before the strategy
switch, so even "none" and other strategies always wait; move the
FirstMeaningfulPaint lifecycle wait out of the unconditional waitNavigation()
call and into the appropriate branches of the switch inside the same function
(i.e., call/await FirstMeaningfulPaint only for strategies that need it such as
"load" or the default heuristic, skip it entirely for "none", and keep
DOMWaitTime handling in "domcontentloaded"), ensuring you update or remove the
prior waitNavigation() call and instead invoke the lifecycle wait where
strategy-specific waits (e.g., page.WaitStable, DOMWaitTime sleep, or a short
load sleep) are performed so behavior matches the PageLoadStrategy flag.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/runner/options.go (1)

64-70: ⚠️ Potential issue | 🟡 Minor

Duplicate validation check — second block is unreachable dead code.

Lines 64-67 and 68-70 have identical conditions. The second check will never execute because the first returns an error. Remove the duplicate.

🔧 Proposed fix to remove dead code
 	if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") &&
 		!options.Headless && !options.HeadlessHybrid {
 		return errkit.New("headless (-hl) or hybrid (-hh) mode is required if -ho, -nos or -scp are set")
 	}
-	if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") && !options.Headless && !options.HeadlessHybrid {
-		return errkit.New("headless mode (-hl) is required if -ho, -nos or -scp are set")
-	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Type: Enhancement Most issues will probably ask for additions or changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Page Load Strategy Configuration and ChromeWSUrl Handling

3 participants