fix(scrape-webpage): harden analyze-webpage.js against bot detection and HTTP/2 errors #82

@arumsey

Summary

analyze-webpage.js uses a hardcoded Chromium executable path and a fragile browser
configuration. Sites that detect headless browsers can block it, for example by rejecting
its HTTP/2 connections or by fingerprinting it as a bot.

Changes needed

  • Replace hardcoded executablePath (/ms-playwright/chromium-1208/chrome-linux/chrome)
    with dynamic Chrome-first auto-detection: try channel: 'chrome' first (real TLS
    fingerprint), fall back to bundled Chromium silently. The hardcoded path breaks if the
    Chromium version changes and doesn't work outside Docker.
  • Add --disable-http2 to browser args to prevent ERR_HTTP2_PROTOCOL_ERROR from
    servers that reject HTTP/2 connections from headless browsers.
  • Switch navigation to domcontentloaded (60s timeout, then a 5s settle delay) instead of
    networkidle. Many sites never reach networkidle, so the script times out unnecessarily.
  • Add explicit timeout: 60000 to the screenshot call to prevent failures on large pages.
  • Align browser context config with run-bulk-import.js: Chrome 131 UA, realistic
    sec-ch-ua headers, locale, timezone, ignoreHTTPSErrors.
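
A minimal sketch of how the changes above might fit together, assuming the Playwright `chromium` object is passed in by the caller. The function names `launchBrowser` and `captureWebpage` are hypothetical, and the user agent, `sec-ch-ua`, locale, and timezone values are illustrative placeholders; the real values should be copied from run-bulk-import.js.

```javascript
// Sketch only. In the real script the caller would pass Playwright's chromium:
//   const { chromium } = require('playwright');

const LAUNCH_ARGS = ['--disable-http2']; // avoids ERR_HTTP2_PROTOCOL_ERROR

// Illustrative context options (assumed values, not taken from run-bulk-import.js).
const CONTEXT_OPTIONS = {
  userAgent:
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) ' +
    'Chrome/131.0.0.0 Safari/537.36',
  extraHTTPHeaders: {
    'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
  },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  ignoreHTTPSErrors: true,
};

// Try the installed Chrome first, then fall back silently to bundled Chromium.
async function launchBrowser(chromium) {
  try {
    return await chromium.launch({ channel: 'chrome', headless: true, args: LAUNCH_ARGS });
  } catch (err) {
    // Chrome is not installed or failed to start: use the bundled browser.
    return chromium.launch({ headless: true, args: LAUNCH_ARGS });
  }
}

// Navigate, let the page settle, and take a screenshot with explicit timeouts.
async function captureWebpage(chromium, url, screenshotPath) {
  const browser = await launchBrowser(chromium);
  try {
    const context = await browser.newContext(CONTEXT_OPTIONS);
    const page = await context.newPage();
    // Wait for DOMContentLoaded rather than networkidle (60s timeout).
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    // Fixed 5s settle delay so late-loading content can render.
    await page.waitForTimeout(5000);
    await page.screenshot({ path: screenshotPath, timeout: 60000 });
  } finally {
    await browser.close();
  }
}
```

Passing `chromium` in as an argument keeps the launch logic free of any hardcoded executable path and makes the fallback easy to exercise with a stub.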
