fix(scraper): check HTTP response status before extracting page content #3
Open
edwardlonsdale wants to merge 11 commits into
All three scrapers (accessibility, cookie, privacy) now capture the response object from page.goto() and return scrape_error with the HTTP status code when the server responds with 4xx/5xx. Previously, error pages were silently sent to Bedrock, resulting in misleading "did not contain an accessibility statement" messages. Also adds character-count logging after text extraction to aid future debugging of similar issues.
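The status check described above can be sketched as a small pure function. This is a minimal illustration, not the PR's actual code: `ResponseLike`, `StatusCheck`, and `checkResponseStatus` are hypothetical names, and only the `status()` method of Playwright's response object is assumed.

```typescript
// Structural type covering the only part of Playwright's Response used here.
interface ResponseLike {
  status(): number;
}

// Hypothetical result shape; the PR's real scrape_error type is not shown.
type StatusCheck =
  | { ok: true }
  | { ok: false; scrapeError: string };

// page.goto() may resolve to null (for example, a same-URL hash navigation),
// so a missing response is not treated as an HTTP failure in this sketch.
function checkResponseStatus(response: ResponseLike | null): StatusCheck {
  if (response === null) return { ok: true };
  const status = response.status();
  if (status >= 400 && status <= 599) {
    // Stop here so the error page text is never sent on for extraction.
    return { ok: false, scrapeError: `HTTP ${status}` };
  }
  return { ok: true };
}
```

A scraper would call it right after navigation, for example `const check = checkResponseStatus(await page.goto(url));`, and return early when `check.ok` is false.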
…rText empty

From AWS IPs, some pages return 200 but render main content asynchronously (WAF challenge) or hide it via CSS. The innerText property respects visibility, returning empty for hidden elements.

Changes:
- extractMainText/extractFullText fall back to textContent when innerText is empty
- All scrapers wait up to 10s for content to appear when initially 0 chars extracted, then re-extract
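A rough sketch of the innerText-to-textContent fallback, assuming the extractors work on a single element. `TextSource` and `extractText` are hypothetical names; the bodies of extractMainText and extractFullText are not shown in this PR page.

```typescript
// Structural stand-in for a DOM element; innerText is optional because
// not every node type exposes it.
interface TextSource {
  innerText?: string;
  textContent: string | null;
}

// Prefer innerText (respects CSS visibility); fall back to textContent
// (ignores visibility) when innerText comes back empty.
function extractText(el: TextSource | null): string {
  if (el === null) return "";
  const visible = el.innerText ?? "";
  if (visible.length > 0) return visible;
  return el.textContent ?? "";
}
```

The trade-off is that textContent can include text from hidden elements, which is why it is only used when innerText yields nothing.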
The waitForFunction was not triggering when innerText returned whitespace-only content (e.g. "\n"), since length > 0. Now uses .trim().length === 0 to correctly detect effectively-empty pages. Also logs the URL being scraped when content is empty, aiding diagnosis of WAF/redirect issues.
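The difference between the two emptiness checks is easy to see in isolation. The helper name `isEffectivelyEmpty` is an assumption for illustration; the commented call shows one way such a predicate might be used with Playwright's `page.waitForFunction`, not necessarily how the PR does it.

```typescript
// Whitespace-only text (such as "\n") counts as empty, unlike a plain
// length check, which would report one character of content.
function isEffectivelyEmpty(text: string): boolean {
  return text.trim().length === 0;
}

// Possible use inside a scraper (sketch only, not executed here):
//
//   if (isEffectivelyEmpty(text)) {
//     await page.waitForFunction(
//       () => (document.body?.innerText ?? "").trim().length > 0,
//       undefined,
//       { timeout: 10_000 },
//     );
//   }
//
// waitForFunction rejects on timeout, so the caller would need to catch
// that and re-extract or report an error.
```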
Wraps all timestamp displays in <time> elements with a title attribute showing full date+time on hover. Replaces the previous string truncation approach.
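As an illustration of the `<time>` wrapping, the sketch below renders a short date with the full date and time in the title attribute. The UI framework used by the PR is not shown, so this assumes plain string templating, UTC display, and a hypothetical `renderTimestamp` helper.

```typescript
// Render an ISO timestamp as a <time> element: short date as the visible
// text, full date+time (UTC) in the title attribute for hover.
function renderTimestamp(iso: string): string {
  const d = new Date(iso);
  if (Number.isNaN(d.getTime())) return ""; // unparseable input: render nothing
  const full = d.toISOString(); // normalised form, always UTC
  const dateOnly = full.slice(0, 10); // YYYY-MM-DD
  const title = `${dateOnly} ${full.slice(11, 19)} UTC`; // YYYY-MM-DD HH:MM:SS UTC
  return `<time datetime="${full}" title="${title}">${dateOnly}</time>`;
}
```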
The container exits with code 1 in CI without any visible error output. This step captures the container logs so we can diagnose the startup crash.
The chromium stage used npx playwright@1 (latest 1.x) to install the browser, but the release stage installs playwright@1.59.1 from the lockfile. When a newer playwright 1.x was released, the browser revision installed in the chromium stage no longer matched what the app expected, causing "Executable doesn't exist" on startup. Pin to 1.59.1 to match pnpm-lock.yaml.
findDeeperStatementLink was matching "Report an accessibility problem" links (/contact/accessibility?service=...) because their text contains "accessibility". This caused the scraper to navigate away from the correct accessibility statement page to a contact form.
…r crash

- Skip domain_mismatch when final URL is www.gov.uk: government services legitimately link to GOV.UK for privacy policies
- When a scrape fails with "has been closed" (browser crash), fail the pg-boss job instead of completing it, triggering automatic retry (retryLimit: 2, retryDelay: 300s)
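Both rules above reduce to small predicates. The sketch below assumes domain_mismatch is decided by comparing hostnames of the requested and final URLs, which this PR page does not confirm; `EXEMPT_HOSTS`, `isDomainMismatch`, and `isBrowserCrash` are hypothetical names.

```typescript
// Hosts that services may legitimately redirect or link to.
const EXEMPT_HOSTS = new Set<string>(["www.gov.uk"]);

// Assumption: a mismatch means the final hostname differs from the
// requested one, unless the final host is exempt.
function isDomainMismatch(requestedUrl: string, finalUrl: string): boolean {
  const requestedHost = new URL(requestedUrl).hostname;
  const finalHost = new URL(finalUrl).hostname;
  if (EXEMPT_HOSTS.has(finalHost)) return false;
  return requestedHost !== finalHost;
}

// Browser crashes surface as errors whose message contains this phrase.
function isBrowserCrash(errorMessage: string): boolean {
  return errorMessage.includes("has been closed");
}
```

In a job handler, a crash detected this way would be rethrown rather than recorded as a completed scrape, so that the queue's retry settings apply.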
- Add isAccessibilityStatement field to Bedrock extraction: pages like register-trailer (valid statement but no WCAG keywords) are now correctly recognized
- Report "Page inaccessible from scraper — possible WAF block" when 0 chars extracted instead of misleading "did not contain a statement"
- Skip Bedrock API calls for empty pages (saves cost)
- Exempt www.gov.uk from domain_mismatch detection: services legitimately link to GOV.UK for privacy policies
- Exclude /contact/ paths from deeper link discovery: prevents following "report accessibility problem" links
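Two of the rules in this list lend themselves to a short sketch: the /contact/ exclusion and the empty-page short circuit. The matching logic inside findDeeperStatementLink is not shown here, so the keyword test below is an assumption, as are the names `LinkCandidate`, `isDeeperStatementCandidate`, and `emptyPageError`.

```typescript
interface LinkCandidate {
  text: string;
  href: string;
}

// Assumption: candidates are links whose text mentions "accessibility".
// Links under /contact/ are rejected so that "report a problem" forms
// are never followed.
function isDeeperStatementCandidate(link: LinkCandidate, baseUrl: string): boolean {
  if (!link.text.toLowerCase().includes("accessibility")) return false;
  const path = new URL(link.href, baseUrl).pathname;
  return !path.includes("/contact/");
}

// Returns the error to report for an empty page, or null when there is
// text worth sending for extraction. Whitespace-only text counts as empty.
function emptyPageError(text: string): string | null {
  return text.trim().length === 0
    ? "Page inaccessible from scraper — possible WAF block"
    : null;
}
```

Checking `emptyPageError` before any Bedrock call is what avoids paying for extraction on pages with no content.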
…s empty

When Bedrock identifies a page as an accessibility statement but cannot extract complianceStatus or wcagStandard (truncated WAF content), log the full extracted text for diagnosis.
- Log both HTML length and innerText length for diagnosis (reveals whether server sent truncated HTML or JS removed content)
- When statement detected but < 500 chars with no structured data, report as scrape_error "Partial content received" instead of misleading success with empty fields
- Use JSON.stringify for content logging to avoid CloudWatch newline splitting
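The partial-content rule and the single-line logging can be sketched as follows. The field names come from the commit messages above, but the `Extraction` shape, the reading of "no structured data" as both fields being null, and the log payload layout are assumptions.

```typescript
// Assumed subset of the Bedrock extraction result.
interface Extraction {
  isAccessibilityStatement: boolean;
  complianceStatus: string | null;
  wcagStandard: string | null;
}

const PARTIAL_CONTENT_THRESHOLD = 500; // characters, per the commit message

// A detected statement that is short and yields no structured fields is
// treated as a truncated page rather than a success with empty fields.
function partialContentError(text: string, extraction: Extraction): string | null {
  const noStructuredData =
    extraction.complianceStatus === null && extraction.wcagStandard === null;
  if (
    extraction.isAccessibilityStatement &&
    noStructuredData &&
    text.length < PARTIAL_CONTENT_THRESHOLD
  ) {
    return "Partial content received";
  }
  return null;
}

// JSON.stringify escapes newlines, so the whole entry stays on one line
// and is not split into several CloudWatch events.
function formatContentLog(url: string, html: string, text: string): string {
  return JSON.stringify({
    url,
    htmlLength: html.length,
    textLength: text.length,
    content: text,
  });
}
```

Comparing `htmlLength` with `textLength` in the log is what distinguishes a truncated server response from content removed after load.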
Summary
- All three scrapers capture the response object from page.goto() and return scrape_error with the HTTP status code when the server responds with 4xx/5xx

Context
The accept-a-refugee-integration-loan accessibility statement URL returns 403 to the scraper (WAF blocking AWS IPs or headless browser). The scraper never checked HTTP status codes, so it extracted text from the error page, sent it to Bedrock, got null back, and reported "no data extracted."

Test plan
- Type check passes (npx tsc --noEmit)
- Tests pass (pnpm test)
- HTTP 403 error shown