Feat/page load strategy by fail-open · Pull Request #1514 · projectdiscovery/katana

fail-open · 2026-02-03T16:41:04Z

Proposed changes

This PR adds configurable page load strategies and enhances the -cwu (Chrome WebSocket URL) flag to use pure headless mode, solving timeout issues with modern SPAs.

Problem

Modern Single Page Applications (SPAs) with continuous network activity (WebSockets, polling, streaming APIs) never reach a "network idle" state, causing Katana's headless crawler to timeout indefinitely. Additionally, the existing
-cwu flag forces hybrid mode, limiting crawling capabilities.

Solution

1. Configurable Page Load Strategy (--page-load-strategy / -pls)
Adds multiple strategies to handle different application types:

heuristic (default): Smart wait with URL polling and network idle detection - maintains backward compatibility
domcontentloaded: Fixed wait time for SPAs with continuous activity - solves timeout issues
networkidle: Wait for network to be idle
load: Wait for page load event only
none: No waiting, proceed immediately
Also adds --dom-wait-time (-dwt) flag (default: 5 seconds) to configure wait time for the domcontentloaded strategy.
2. Pure Headless ChromeWSUrl Support
Enhances the previous -cwu fix (af49d12) to use pure headless engine:
Modified runner to use headless.New() instead of hybrid.New() when -cwu is set
Added ChromeWSUrl support to pure headless browser launcher
Prevents closing external browser when using -cwu (only closes tabs)
Properly handles temp directory creation with external browsers

Benefits

Solves timeout issues for SPAs with continuous network activity
Backward compatible - defaults to existing "heuristic" behavior
Full headless features - pure headless mode with -cwu enables action discovery, state tracking, diagnostics
Browser persistence - external browser stays open for reusable sessions
Performance - faster crawling when appropriate

Files Changed

cmd/katana/main.go - Add CLI flags
pkg/types/options.go - Add option fields
internal/runner/options.go - Add validation and warnings
internal/runner/runner.go - Use headless engine for ChromeWSUrl
pkg/engine/headless/browser/browser.go - Implement strategy logic and ChromeWSUrl support
pkg/engine/headless/crawler/crawler.go - Pass options through
pkg/engine/headless/headless.go - Pass options through
pkg/engine/hybrid/crawl.go - Support strategy in hybrid mode

Proof

Tested on enterprise SPA (complex React application with WebSockets and continuous network activity):
Command:

katana -u https://target
-cwu 'ws://localhost:9222/devtools/browser/...'
-pls domcontentloaded
-dwt 10
-d 5
-no-incognito
-fs rdn
-xhr
-fx

Results:

Before: ~125 static resource URLs, frequent "context deadline exceeded" timeouts, browser closed after run
After: over 3000+ endpoints discovered (including 3000+ XHR/Fetch API calls), no timeouts, browser persisted
Key findings:
Discovered GraphQL endpoints, apis
Browser stayed open for reusable authenticated sessions
No hanging on page loads
Proper scope handling across subdomains with -fs rdn

Checklist

Pull request is created against the dev branch
All checks passed (lint, unit/integration/regression tests etc.) with my changes
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Summary by CodeRabbit

New Features
- Added --page-load-strategy (heuristic, load, domcontentloaded, networkidle, none).
- Added --dom-wait-time to customize DOM-content-loaded wait duration.
- Added ability to connect to an existing Chrome instance via WebSocket for headless crawling.
Behavior
- Page load/waiting now follows the selected strategy (including option to skip waits), giving finer timing control during crawls.
Documentation
- CLI docs updated with examples and guidance for the new options.

v1.3.0

v1.4.0 (Full headless mode support)

When both -cwu (Chrome WebSocket URL) and -headless flags are used together, -cwu should take precedence since the user explicitly wants to connect to an existing browser session. Without this fix, when users run: katana -headless -u URL -cwu ws://... -no-incognito The -headless flag causes Katana to launch a new Chrome instance, completely ignoring the -cwu WebSocket URL. This makes authenticated crawling via existing browser sessions impossible when following the official documentation. Changes: - Prioritize ChromeWSUrl check in engine selection (runner.go) - Add warning when -headless is used with -cwu (options.go) - Ensure hybrid engine is selected when -cwu is specified This fix maintains backward compatibility while making the documented workflow actually work as intended. Fixes issue where -cwu flag is ignored when -headless is present.

## Page Load Strategy Adds --page-load-strategy (-pls) flag with options: - heuristic (default): Smart wait with URL polling and network idle - domcontentloaded: Fixed wait time for SPAs with continuous activity - networkidle: Wait for network to be idle - load: Wait for page load event only - none: No waiting, proceed immediately Also adds --dom-wait-time (-dwt) flag to configure wait time for domcontentloaded strategy (default: 5 seconds). Fixes timeout issues with modern SPAs that have continuous network activity (WebSockets, polling, streaming APIs) that never reach a truly 'idle' state. ## Pure Headless ChromeWSUrl Support Enhances the previous -cwu fix (af49d12) to use pure headless engine instead of hybrid, enabling full crawling capabilities: - Modified runner to use headless.New() when -cwu is set - Added ChromeWSUrl support to pure headless browser launcher - Prevents closing external browser (only closes tabs) - Properly handles temp directory creation with -cwu ## Testing Tested on TurboTax SPA - increased discovery from 125 to 5,300+ endpoints using '-cwu ws://... -pls domcontentloaded -dwt 10'. Browser stays open for reusable sessions.

coderabbitai · 2026-02-03T16:41:23Z

Walkthrough

Adds CLI flags and Options fields to configure page-load strategy and DOM wait time, wires them through runner, crawler, and headless launcher, enables connecting to an existing Chrome via WebSocket (ChromeWSUrl) or launching a new browser, and implements multiple page-load strategies.

Changes

Cohort / File(s)	Summary
CLI & Core Options `cmd/katana/main.go`, `pkg/types/options.go`, `internal/runner/options.go`	Added `--page-load-strategy` (`-pls`) and `--dom-wait-time` (`-dwt`) flags; new `PageLoadStrategy` (string) and `DOMWaitTime` (int) fields on Options; validation for allowed strategies and CLI warnings/logging around ChromeWSUrl/headless combinations.
Runner / Startup `internal/runner/runner.go`	When `ChromeWSUrl` is set, runner uses the pure headless initialization path (`headless.New`) instead of hybrid; logs/prints Chrome WebSocket URL and warns about redundant/overridden headless/hybrid flags when applicable.
Headless Launcher & Browser `pkg/engine/headless/browser/browser.go`	Launcher options gain `PageLoadStrategy`, `ChromeWSUrl`, and `DOMWaitTime`; supports connecting via ChromeWSUrl or launching Chrome with temp user-data dir management; page-load wait logic refactored into strategies: `none`, `load`, `domcontentloaded`, `networkidle`, `heuristic`; cleanup/close behavior adjusted when reusing external browser.
Headless Crawler Wiring `pkg/engine/headless/crawler/crawler.go`, `pkg/engine/headless/headless.go`	Crawler and headless layer accept and forward `PageLoadStrategy`, `ChromeWSUrl`, and `DOMWaitTime` into launcher options during initialization.
Hybrid Crawl Strategy `pkg/engine/hybrid/crawl.go`	Replaced previous fixed stability wait with a strategy-based switch implementing `none`, `domcontentloaded` (configurable wait), `load`, and heuristic branches; retains timeout/error handling and strategy-specific logs.
Docs `README.md`	Adds documentation and examples for `-pls`/`--page-load-strategy` and `-dwt`/`--dom-wait-time` flags and strategy behavior.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as Katana CLI
    participant Runner as Runner
    participant Crawler as Headless Crawler
    participant Launcher as Browser Launcher
    participant Browser as Chrome
    participant Page as Browser Page

    User->>CLI: pass flags (--page-load-strategy, --dom-wait-time, --chrome-ws-url)
    CLI->>Runner: parse & validate options
    Runner->>Crawler: instantiate headless crawler with options
    Crawler->>Launcher: NewLauncher(PageLoadStrategy, DOMWaitTime, ChromeWSUrl)
    alt ChromeWSUrl provided
        Launcher->>Browser: connect via WebSocket (reuse existing)
    else
        Launcher->>Browser: launch new Chrome (create temp user-data dir)
    end
    Launcher->>Page: create page & navigate
    alt strategy = "none"
        Page->>Page: no waiting
    else strategy = "load"
        Page->>Page: wait for load event
    else strategy = "domcontentloaded"
        Page->>Page: sleep DOMWaitTime
    else strategy = "networkidle"
        Page->>Page: wait for network idle
    else
        Page->>Page: apply heuristic stability checks
    end
    Page->>User: return crawled content
    alt connected via WebSocket
        Note right of Launcher: skip browser/process cleanup
    else
        Launcher->>Browser: close browser
        Launcher->>Launcher: cleanup temp dir
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through flags and stitch the run,

Heuristic springs or DOM-time fun.
Reuse your Chrome or spawn anew,
Tempdirs tidy, waits run true.
A rabbit cheers — the crawl's begun!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Feat/page load strategy' is partially related to the changeset—it refers to the main feature being added (page load strategy configuration), but uses a generic 'Feat/' prefix and lacks specificity about the full scope, which also includes ChromeWSUrl pure headless support and DOM wait time configuration.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@internal/runner/options.go`:
- Around line 58-65: The current check for options.Headless and
options.ChromeWSUrl logs a message but doesn't explicitly warn when the headless
shorthand (-hh) is used; update the conditional handling around options.Headless
and options.ChromeWSUrl in internal/runner/options.go to emit a clear warning
via gologger.Warning().Msgf that the -hh/-headless flag is ignored when
ChromeWSUrl is provided and ChromeWSUrl takes precedence (include the
ChromeWSUrl in the info log as before), ensuring the message references the -hh
shorthand so users know their flag is being suppressed.

In `@pkg/engine/headless/browser/browser.go`:
- Around line 209-280: The domcontentloaded branch in
BrowserPage.WaitPageLoadHeurisitics currently sleeps for a fixed duration
instead of actually waiting for the DOMContentLoaded lifecycle event; replace
the sleep with an explicit wait for that event by creating a timeout chain (like
other branches use) and calling
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) (or the equivalent
WaitNavigation method on the chained Timeout object) before returning so the
behavior matches the "domcontentloaded" semantics.

In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-229: The domcontentloaded branch in the stability wait switch
is hard-coded to time.Sleep(1 * time.Second), ignoring the CLI --dom-wait-time
setting; update the branch in the function that uses
c.Options.Options.PageLoadStrategy to read the configured DOM wait duration
(e.g., c.Options.Options.DomWaitTime or the appropriate field that holds the
--dom-wait-time value) and call time.Sleep with that duration instead of 1s,
ensuring to keep the existing debug log but include the actual wait duration in
the message.

internal/runner/options.go

pkg/engine/headless/browser/browser.go

pkg/engine/hybrid/crawl.go

dogancanbakir · 2026-02-03T17:21:27Z

@fail-open merge conflict, fyi

Mzack9999 · 2026-02-05T14:42:53Z

@coderabbitai create an issue for this PR

coderabbitai · 2026-02-05T14:43:56Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@Mzack9999, I'll create an issue for this PR to track the page load strategy implementation and ChromeWSUrl improvements.

🏁 Script executed:

#!/bin/bash

# Create a GitHub issue for PR `#1514`
gh issue create \
  --title "Page Load Strategy Configuration and ChromeWSUrl Handling" \
  --body "This issue tracks the implementation of configurable page load strategies and improved Chrome WebSocket URL handling for Katana.

**Related PR:** https://github.com/projectdiscovery/katana/pull/1514
**Requested by:** `@Mzack9999`

## Overview
The PR adds new CLI flags to configure page load strategies:
- \`--page-load-strategy\` / \`-pls\`: Options include heuristic (default), domcontentloaded, networkidle, load, none
- \`--dom-wait-time\` / \`-dwt\`: Configurable wait time for domcontentloaded strategy (default 5s)

## Key Changes
- Pure headless ChromeWSUrl support: runner now selects headless.New() when -cwu is set
- Multiple page load strategies to address timeout failures on modern SPAs with continuous network activity
- Browser persistence for reusable sessions when using external Chrome via WebSocket URL
- Backward compatibility maintained with heuristic as default strategy

## Benefits
- Resolves SPA timeouts by avoiding indefinite waits for network idle
- Enables full headless features with -cwu flag
- Performance improvements for complex SPAs
- Tested with React SPA: increased endpoint discovery from ~125 to 3000+ URLs

**Comment reference:** https://github.com/projectdiscovery/katana/pull/1514#issuecomment" \
  --assignee "Mzack9999"

Length of output: 54

@Mzack9999, I've successfully created issue #1520 to track the page load strategy configuration and ChromeWSUrl handling implementation from this PR. The issue has been assigned to you and includes a summary of the key changes and benefits.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/engine/headless/browser/browser.go (1)
326-420: ⚠️ Potential issue | 🟡 Minor

Temp user‑data dir can leak when page creation fails

If launchBrowserWithDataDir succeeds but browser.Page(...) or later init fails, the deferred cleanup closes the browser but never removes tempDir. That can leak directories on repeated failures.
🧹 Add temp dir cleanup on failure
 defer func() {
     if !successfulPageCreation {
         _ = page.Close()
         // Only close browser if we launched it (not connecting via ChromeWSUrl)
         if l.opts.ChromeWSUrl == "" {
             _ = browser.Close()
         }
+        if shouldCleanupTempDir {
+            _ = os.RemoveAll(tempDir)
+        }
     }
 }()

🤖 Fix all issues with AI agents

In `@pkg/engine/headless/browser/browser.go`:
- Around line 67-75: The DOM wait defaulting currently treats 0 as unset; change
the logic in browser.go so that opts.DOMWaitTime is only defaulted when negative
(e.g., use a check for opts.DOMWaitTime < 0) so an explicit -dwt 0 is honored as
"no extra wait"; update the block that sets opts.DOMWaitTime (the DOM wait
defaulting around opts.DOMWaitTime) to only assign 5 when the value is negative
or otherwise sentinel, leaving 0 intact.
- Around line 210-285: The DOMContentLoaded branch currently arms
WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation
has already completed, which can miss the event and block until timeout; change
this to use a post-navigation check such as calling chained.WaitLoad() (the same
primitive used in the "load" case) or a helper that detects whether
DOMContentLoaded has already fired before waiting, then apply the existing
optional sleep using b.launcher.opts.DOMWaitTime; update the "domcontentloaded"
case to call chained := b.Timeout(15 * time.Second); _ = chained.WaitLoad(); if
b.launcher.opts.DOMWaitTime > 0 { time.Sleep(...) } and remove the
WaitNavigation/wait() pairing so the listener cannot miss the event.

In `@pkg/engine/hybrid/crawl.go`:
- Around line 218-249: The code currently calls waitNavigation() (which blocks
on FirstMeaningfulPaint) before the strategy switch, so even "none" and other
strategies always wait; move the FirstMeaningfulPaint lifecycle wait out of the
unconditional waitNavigation() call and into the appropriate branches of the
switch inside the same function (i.e., call/await FirstMeaningfulPaint only for
strategies that need it such as "load" or the default heuristic, skip it
entirely for "none", and keep DOMWaitTime handling in "domcontentloaded"),
ensuring you update or remove the prior waitNavigation() call and instead invoke
the lifecycle wait where strategy-specific waits (e.g., page.WaitStable,
DOMWaitTime sleep, or a short load sleep) are performed so behavior matches the
PageLoadStrategy flag.

coderabbitai · 2026-02-05T14:46:08Z

pkg/engine/headless/browser/browser.go

+	// Default to "heuristic" if not specified
+	if opts.PageLoadStrategy == "" {
+		opts.PageLoadStrategy = "heuristic"
+	}
+
+	// Default DOM wait time if not specified
+	if opts.DOMWaitTime <= 0 {
+		opts.DOMWaitTime = 5
+	}


⚠️ Potential issue | 🟡 Minor

Honor DOMWaitTime=0 when explicitly set

With <= 0, a user‑supplied -dwt 0 is forced back to 5s, so there’s no way to disable the extra wait. If 0 is intended to mean “no extra wait,” consider only defaulting on negative values or introducing a sentinel.

✏️ Possible tweak (allow 0)

- if opts.DOMWaitTime <= 0 { + if opts.DOMWaitTime < 0 { opts.DOMWaitTime = 5 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Default to "heuristic" if not specified

if opts.PageLoadStrategy == "" {

opts.PageLoadStrategy = "heuristic"

}

// Default DOM wait time if not specified

if opts.DOMWaitTime <= 0 {

opts.DOMWaitTime = 5

}

// Default to "heuristic" if not specified

if opts.PageLoadStrategy == "" {

opts.PageLoadStrategy = "heuristic"

}

// Default DOM wait time if not specified

if opts.DOMWaitTime < 0 {

opts.DOMWaitTime = 5

}

🤖 Prompt for AI Agents

In `@pkg/engine/headless/browser/browser.go` around lines 67 - 75, The DOM wait defaulting currently treats 0 as unset; change the logic in browser.go so that opts.DOMWaitTime is only defaulted when negative (e.g., use a check for opts.DOMWaitTime < 0) so an explicit -dwt 0 is honored as "no extra wait"; update the block that sets opts.DOMWaitTime (the DOM wait defaulting around opts.DOMWaitTime) to only assign 5 when the value is negative or otherwise sentinel, leaving 0 intact.

coderabbitai · 2026-02-05T14:46:08Z

pkg/engine/headless/browser/browser.go

+	// Respect the page load strategy from launcher options
+	strategy := b.launcher.opts.PageLoadStrategy
+
+	switch strategy {
+	case "none":
+		// Don't wait at all, return immediately
+		return nil
+
+	case "load":
+		// Just wait for the load event
+		chained := b.Timeout(15 * time.Second)
+		return chained.WaitLoad()
+
+	case "domcontentloaded":
+		// Wait for the DOMContentLoaded event, then optionally wait additional time
+		// for JavaScript to render interactive elements
+		chained := b.Timeout(15 * time.Second)
+		wait := chained.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded)
+		wait()
+		// Additional wait time for JS rendering after DOMContentLoaded
+		if b.launcher.opts.DOMWaitTime > 0 {
+			time.Sleep(time.Duration(b.launcher.opts.DOMWaitTime) * time.Second)
+		}
+		return nil
+
+	case "networkidle":
+		// Wait for network activity to stop
+		chained := b.Timeout(15 * time.Second)
+		_ = chained.WaitLoad()
+		_ = chained.WaitIdle(2 * time.Second)
+		return nil
+
+	case "heuristic":
+		fallthrough
+	default:
+		// Use the original heuristic approach
+		opts := defaultWaitOptions
+
+		chained := b.Timeout(opts.MaxTimeout)
+
+		// 1. Wait for the basic load event (DOMContentLoaded / load).
+		_ = chained.WaitLoad()
+
+		// 2. Capture the current URL so we can detect route changes.
+		urlVal, _ := b.Eval("() => window.location.href")
+		startURL := ""
+		if urlVal != nil {
+			startURL = urlVal.Value.Str()
+		}
+
+		// 3. Poll for a different URL for up to URLPollTimeout.
+		urlChanged := false
+		if startURL != "" {
+			pollCount := int(opts.URLPollTimeout / opts.URLPollInterval)
+			for i := 0; i < pollCount; i++ {
+				time.Sleep(opts.URLPollInterval)
+				cur, err := b.Eval("() => window.location.href")
+				if err == nil && cur != nil && cur.Value.Str() != startURL {
+					urlChanged = true
+					break
+				}
 			}
 		}
-	}

-	if urlChanged {
-		// 4a. URL changed – short grace period then network idle & done.
-		_ = chained.WaitIdle(opts.PostChangeWait)
-		return nil
-	}
+		if urlChanged {
+			// 4a. URL changed – short grace period then network idle & done.
+			_ = chained.WaitIdle(opts.PostChangeWait)
+			return nil
+		}

-	// 4b. URL didn't change – fall back to broader heuristics.
-	_ = chained.WaitIdle(opts.IdleWait)
-	_ = b.WaitNewStable(opts.DOMStableWait)
+		// 4b. URL didn't change – fall back to broader heuristics.
+		_ = chained.WaitIdle(opts.IdleWait)
+		_ = b.WaitNewStable(opts.DOMStableWait)

-	return nil
+		return nil
+	}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find where WaitPageLoadHeurisitics is called rg -n "WaitPageLoadHeurisitics" --type go -A 5 -B 5

Repository: projectdiscovery/katana

Length of output: 5055

🏁 Script executed:

# Find Navigate calls to understand the pattern rg -n "\.Navigate\(" --type go -A 3 -B 3

Repository: projectdiscovery/katana

Length of output: 1760

🏁 Script executed:

# Check the WaitNavigation implementation to understand its requirements rg -n "func.*WaitNavigation" --type go -A 20

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Also check if WaitPageLoadHeurisitics is defined in this file rg -n "func.*WaitPageLoadHeurisitics" --type go -A 5

Repository: projectdiscovery/katana

Length of output: 518

🏁 Script executed:

# Search for WaitNavigation usage patterns in the codebase rg -n "WaitNavigation" --type go -B 3 -A 3

Repository: projectdiscovery/katana

Length of output: 1170

🏁 Script executed:

# Check if there's Rod browser documentation or comments about WaitNavigation rg -n "rod\|Rod\|navigation" --type go pkg/engine/headless/browser/browser.go | head -30

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Check the imports to understand where WaitNavigation comes from head -50 pkg/engine/headless/browser/browser.go | rg -n "import|WaitNavigation"

Repository: projectdiscovery/katana

Length of output: 77

🏁 Script executed:

# Look for any configuration or initialization of page load strategy rg -n "PageLoadStrategy\|page.load.strategy" --type go -B 2 -A 2

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Check the BrowserPage type and its methods rg -n "type BrowserPage\|func.*BrowserPage" --type go -A 10 pkg/engine/headless/browser/

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Look at Timeout implementation rg -n "func.*Timeout" --type go -B 2 -A 10 pkg/engine/headless/browser/

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Check go.mod to see Rod version cat go.mod | rg -i "rod\|go-rod"

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Search for any comments about navigation timing or WaitNavigation behavior rg -n "WaitNavigation\|navigation.*timing\|arm.*before\|before.*navigate" --type go -B 2 -A 2 | head -50

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# Read the browser.go file around the WaitPageLoadHeurisitics function cat -n pkg/engine/headless/browser/browser.go | head -250 | tail -100

Repository: projectdiscovery/katana

Length of output: 4155

🏁 Script executed:

# Check the imports and type definitions at the top of the file head -100 pkg/engine/headless/browser/browser.go

Repository: projectdiscovery/katana

Length of output: 2731

🏁 Script executed:

# Look for go.mod in various locations find . -name "go.mod" -type f | head -5

Repository: projectdiscovery/katana

Length of output: 75

🏁 Script executed:

# Check the exact implementation of WaitPageLoadHeurisitics for the domcontentloaded case sed -n '220,235p' pkg/engine/headless/browser/browser.go

Repository: projectdiscovery/katana

Length of output: 641

Fix WaitNavigation() timing for DOMContentLoaded case

The "domcontentloaded" case calls WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation has already started. Since WaitPageLoadHeurisitics() is called after Navigate() completes in all call sites (see pkg/engine/headless/crawler/crawler.go:435–438), the DOMContentLoaded event may have already fired before the listener is armed, causing wait() to block until timeout (15s) or the next navigation event.

Replace WaitNavigation() with a post-navigation primitive that checks current state. Consider using WaitLoad() (matches the "load" case pattern) or a helper that verifies if DOMContentLoaded has already occurred, followed by the optional DOMWaitTime sleep.

🤖 Prompt for AI Agents

In `@pkg/engine/headless/browser/browser.go` around lines 210 - 285, The DOMContentLoaded branch currently arms WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) after navigation has already completed, which can miss the event and block until timeout; change this to use a post-navigation check such as calling chained.WaitLoad() (the same primitive used in the "load" case) or a helper that detects whether DOMContentLoaded has already fired before waiting, then apply the existing optional sleep using b.launcher.opts.DOMWaitTime; update the "domcontentloaded" case to call chained := b.Timeout(15 * time.Second); _ = chained.WaitLoad(); if b.launcher.opts.DOMWaitTime > 0 { time.Sleep(...) } and remove the WaitNavigation/wait() pairing so the listener cannot miss the event.

coderabbitai · 2026-02-05T14:46:09Z

pkg/engine/hybrid/crawl.go

+	// Wait the page to be stable a duration based on page load strategy
+	strategy := c.Options.Options.PageLoadStrategy
+
+	switch strategy {
+	case "none":
+		// Don't wait at all
+		gologger.Debug().Msgf("page-load-strategy=none: skipping stability wait\n")
+
+	case "domcontentloaded":
+		// Wait for DOM to render using the configured wait time
+		waitTime := time.Duration(c.Options.Options.DOMWaitTime) * time.Second
+		gologger.Debug().Msgf("page-load-strategy=domcontentloaded: waiting %s for DOM\n", waitTime)
+		time.Sleep(waitTime)
+
+	case "load":
+		// Wait for load event but don't check network stability
+		gologger.Debug().Msgf("page-load-strategy=load: basic load wait only\n")
+		time.Sleep(500 * time.Millisecond)
+
+	default:
+		// heuristic, networkidle, or any other strategy: use WaitStable
+		timeStable := time.Duration(c.Options.Options.TimeStable) * time.Second
+
+		if timeout < timeStable {
+			gologger.Warning().Msgf("timeout is less than time stable, setting time stable to half of timeout to avoid timeout\n")
+			timeStable = timeout / 2
+			gologger.Warning().Msgf("setting time stable to %s\n", timeStable)
+		}

-	if err := page.WaitStable(timeStable); err != nil {
-		gologger.Warning().Msgf("could not wait for page to be stable: %s\n", err)
+		if err := page.WaitStable(timeStable); err != nil {
+			gologger.Warning().Msgf("could not wait for page to be stable: %s\n", err)
+		}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the relevant sections of the crawl.go file cd /tmp && find . -name "crawl.go" -path "*/engine/hybrid/*" 2>/dev/null | head -5

Repository: projectdiscovery/katana

Length of output: 49

🏁 Script executed:

# List files in the repo structure to understand where crawl.go is git ls-files | grep -E "crawl\.go|engine.*hybrid" | head -20

Repository: projectdiscovery/katana

Length of output: 203

🏁 Script executed:

# Get the file and read a larger section around lines 218-249 to see context wc -l pkg/engine/hybrid/crawl.go

Repository: projectdiscovery/katana

Length of output: 97

🏁 Script executed:

# Read the context around the strategy switch (lines 180-260) to see if waitNavigation() is called sed -n '180,260p' pkg/engine/hybrid/crawl.go | cat -n

Repository: projectdiscovery/katana

Length of output: 3265

Page‑load strategy is gated by an unconditional FirstMeaningfulPaint wait that undermines strategy behavior

waitNavigation() (line 27, called at line 37) unconditionally blocks on FirstMeaningfulPaint before the strategy switch executes. This means:

none still waits for FirstMeaningfulPaint instead of skipping all waits

domcontentloaded and load wait for FirstMeaningfulPaint first before their respective delays

All strategies are unnecessarily bound by this lifecycle event

Move the lifecycle event wait into the strategy switch so behavior matches the intended flag semantics (skip for none, use appropriate lifecycle events for others).

Suggested fix

- // wait the page to be fully loaded and becoming idle - waitNavigation := page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint) + // wait the page based on page-load strategy + var waitNavigation func() + switch strategy { + case "none": + // no lifecycle wait + case "domcontentloaded": + waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameDOMContentLoaded) + case "load": + waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameLoad) + default: + waitNavigation = page.WaitNavigation(proto.PageLifecycleEventNameFirstMeaningfulPaint) + } ... - waitNavigation() + if waitNavigation != nil { + waitNavigation() + }

🤖 Prompt for AI Agents

In `@pkg/engine/hybrid/crawl.go` around lines 218 - 249, The code currently calls waitNavigation() (which blocks on FirstMeaningfulPaint) before the strategy switch, so even "none" and other strategies always wait; move the FirstMeaningfulPaint lifecycle wait out of the unconditional waitNavigation() call and into the appropriate branches of the switch inside the same function (i.e., call/await FirstMeaningfulPaint only for strategies that need it such as "load" or the default heuristic, skip it entirely for "none", and keep DOMWaitTime handling in "domcontentloaded"), ensuring you update or remove the prior waitNavigation() call and instead invoke the lifecycle wait where strategy-specific waits (e.g., page.WaitStable, DOMWaitTime sleep, or a short load sleep) are performed so behavior matches the PageLoadStrategy flag.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/runner/options.go (1)
64-70: ⚠️ Potential issue | 🟡 Minor

Duplicate validation check — second block is unreachable dead code.

Lines 64-67 and 68-70 have identical conditions. The second check will never execute because the first returns an error. Remove the duplicate.
🔧 Proposed fix to remove dead code
 	if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") &&
 		!options.Headless && !options.HeadlessHybrid {
 		return errkit.New("headless (-hl) or hybrid (-hh) mode is required if -ho, -nos or -scp are set")
 	}
-	if (options.HeadlessOptionalArguments != nil || options.HeadlessNoSandbox || options.SystemChromePath != "") && !options.Headless && !options.HeadlessHybrid {
-		return errkit.New("headless mode (-hl) is required if -ho, -nos or -scp are set")
-	}

dogancanbakir and others added 4 commits December 1, 2025 22:32

Merge pull request projectdiscovery#1472 from projectdiscovery/dev

3b75c8e

v1.3.0

Merge pull request projectdiscovery#1490 from projectdiscovery/dev

474b3b8

v1.4.0 (Full headless mode support)

coderabbitai bot reviewed Feb 3, 2026

View reviewed changes

internal/runner/options.go Outdated Show resolved Hide resolved

pkg/engine/headless/browser/browser.go Show resolved Hide resolved

pkg/engine/hybrid/crawl.go Outdated Show resolved Hide resolved

dogancanbakir requested review from Mzack9999, dwisiswant0 and knakul853 February 3, 2026 17:18

Mzack9999 added 3 commits February 5, 2026 18:32

Merge branch 'dev' into pr/1514

706c3e6

better option handling

fa0a187

lint + docs

0ec0456

Mzack9999 approved these changes Feb 5, 2026

View reviewed changes

coderabbitai bot mentioned this pull request Feb 5, 2026

Page Load Strategy Configuration and ChromeWSUrl Handling #1520

Open

Mzack9999 linked an issue Feb 5, 2026 that may be closed by this pull request

Page Load Strategy Configuration and ChromeWSUrl Handling #1520

Open

Mzack9999 assigned fail-open Feb 5, 2026

Mzack9999 added the Type: Enhancement Most issues will probably ask for additions or changes. label Feb 5, 2026

coderabbitai bot reviewed Feb 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/page load strategy#1514

Feat/page load strategy#1514
fail-open wants to merge 7 commits intoprojectdiscovery:devfrom
fail-open:feat/page-load-strategy

fail-open commented Feb 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 3, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dogancanbakir commented Feb 3, 2026

Uh oh!

Mzack9999 commented Feb 5, 2026

Uh oh!

coderabbitai bot commented Feb 5, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 5, 2026

Uh oh!

coderabbitai bot Feb 5, 2026

Uh oh!

coderabbitai bot Feb 5, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fail-open commented Feb 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed changes

Problem

Solution

Benefits

Files Changed

Proof

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dogancanbakir commented Feb 3, 2026

Uh oh!

Mzack9999 commented Feb 5, 2026

Uh oh!

coderabbitai bot commented Feb 5, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fail-open commented Feb 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 3, 2026 •

edited

Loading