fix(scraper): check HTTP response status before extracting page content by edwardlonsdale · Pull Request #3 · co-cddo/octo-observability-compliance-scraper

edwardlonsdale · 2026-05-12T09:52:51Z

Summary

- All three scrapers (accessibility, cookie, privacy) now capture the response object from page.goto() and return scrape_error with the HTTP status code when the server responds with 4xx/5xx.
- Previously, error pages were silently sent to Bedrock, resulting in misleading "did not contain an accessibility statement" messages.
- Adds character-count logging after text extraction to aid future debugging.

Context

The accessibility statement URL for accept-a-refugee-integration-loan returns 403 to the scraper (presumably a WAF blocking AWS IPs or headless browsers). The scraper never checked HTTP status codes, so it extracted text from the error page, sent it to Bedrock, got null back, and reported "no data extracted."
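A minimal sketch of the status check described above, assuming the scraper only needs the numeric status from the object that page.goto() resolves to. `ResponseLike`, `ScrapeError`, and `checkResponseStatus` are hypothetical names for illustration; the PR confirms only that a scrape_error carrying the HTTP status code is returned for 4xx/5xx.

```typescript
// Minimal stand-in for the response object returned by Playwright's page.goto();
// only the method used here is declared so the sketch runs without a browser.
interface ResponseLike {
  status(): number;
}

// Hypothetical result shape: the PR names scrape_error but not its fields.
interface ScrapeError {
  kind: "scrape_error";
  message: string;
}

// Returns a scrape_error for 4xx/5xx responses, or null when extraction may proceed.
function checkResponseStatus(response: ResponseLike | null, url: string): ScrapeError | null {
  // page.goto() can resolve to null (for example on a same-document navigation),
  // in which case there is no status to inspect.
  if (response === null) return null;

  const status = response.status();
  if (status >= 400) {
    return { kind: "scrape_error", message: `HTTP ${status} returned for ${url}` };
  }
  return null;
}
```

Returning early on an error status means the error page is never extracted or sent to Bedrock, so the 403 is reported as such.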

Test plan

- TypeScript compiles (npx tsc --noEmit)
- 97 unit tests pass (pnpm test)
- ESLint clean
- Deployed to playground, service stable
- Trigger re-scrape of affected service and verify HTTP 403 error shown

All three scrapers (accessibility, cookie, privacy) now capture the response object from page.goto() and return scrape_error with the HTTP status code when the server responds with 4xx/5xx. Previously, error pages were silently sent to Bedrock, resulting in misleading "did not contain an accessibility statement" messages. Also adds character-count logging after text extraction to aid future debugging of similar issues.

…rText empty

From AWS IPs, some pages return 200 but render main content asynchronously (WAF challenge) or hide it via CSS. The innerText property respects visibility, returning empty for hidden elements.

Changes:
- extractMainText/extractFullText fall back to textContent when innerText is empty
- All scrapers wait up to 10s for content to appear when initially 0 chars extracted, then re-extract
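The fallback could be sketched as follows. `TextSource` and `extractText` are hypothetical names standing in for the element and for extractMainText/extractFullText, whose real signatures are not shown in the PR.

```typescript
// Structural stand-in for an HTMLElement so the sketch runs outside a browser.
interface TextSource {
  innerText?: string;
  textContent: string | null;
}

// Prefer innerText, which is layout-aware and omits hidden content; fall back
// to textContent, which ignores CSS visibility, when innerText yields nothing.
function extractText(el: TextSource): string {
  const visible = el.innerText ?? "";
  if (visible.length > 0) return visible;
  return el.textContent ?? "";
}
```

The trade-off is that textContent may include text the user never sees, so it is used only as a last resort.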

The waitForFunction was not triggering when innerText returned whitespace-only content (e.g. "\n"), because the length of such content is still greater than 0. The check now uses .trim().length === 0 to correctly detect effectively-empty pages. Also logs the URL being scraped when content is empty, aiding diagnosis of WAF/redirect issues.
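The difference between the two checks, shown in isolation. Both function names are hypothetical; only the .trim().length === 0 expression comes from the commit message.

```typescript
// The earlier style of check: "\n" has length 1, so it counts as content.
function isEmptyByLength(text: string): boolean {
  return text.length === 0;
}

// The corrected check: whitespace-only strings are treated as empty.
function isEffectivelyEmpty(text: string): boolean {
  return text.trim().length === 0;
}
```

With the corrected predicate, a page whose extracted text is just a newline triggers the wait-and-re-extract path instead of being passed on as content.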

Wraps all timestamp displays in <time> elements with a title attribute showing full date+time on hover. Replaces the previous string truncation approach.
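As an illustration only, a helper producing this kind of markup might look like the sketch below. The function name `renderTimestamp`, the date-only visible text, and the UTC formatting are assumptions; the PR does not show the actual filter or its output format.

```typescript
// Renders an ISO 8601 timestamp as a <time> element: a short date is shown
// inline and the full date and time appear in the title (hover) attribute.
// Assumes the input is a valid ISO string; toISOString() throws a RangeError
// for an invalid date.
function renderTimestamp(iso: string): string {
  const isoUtc = new Date(iso).toISOString(); // e.g. 2026-05-12T09:52:51.000Z
  const short = isoUtc.slice(0, 10); // YYYY-MM-DD
  const full = isoUtc.replace("T", " ").slice(0, 19) + " UTC"; // YYYY-MM-DD HH:MM:SS UTC
  return `<time datetime="${isoUtc}" title="${full}">${short}</time>`;
}
```

Unlike string truncation, the full value stays in the markup, so nothing is lost when the display is shortened.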

The container exits with code 1 in CI without visible error output. This step captures the container logs so we can diagnose the startup crash.

The chromium stage used npx playwright@1 (latest 1.x) to install the browser, but the release stage installs playwright@1.59.1 from the lockfile. When a newer playwright 1.x was released, the browser revision installed in the chromium stage no longer matched what the app expected, causing "Executable doesn't exist" on startup. Pin to 1.59.1 to match pnpm-lock.yaml.

findDeeperStatementLink was matching "Report an accessibility problem" links (/contact/accessibility?service=...) because their text contains "accessibility". This caused the scraper to navigate away from the correct accessibility statement page to a contact form.
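A filter that avoids this false match could be sketched as follows, assuming candidates are simple text/href pairs. `LinkCandidate` and `isStatementLinkCandidate` are hypothetical names, and the real findDeeperStatementLink matching rules are not shown in the PR.

```typescript
interface LinkCandidate {
  text: string;
  href: string;
}

// Accepts links whose text mentions "accessibility" unless the path contains
// /contact/, which indicates a contact form rather than a statement.
function isStatementLinkCandidate(link: LinkCandidate): boolean {
  const mentionsAccessibility = link.text.toLowerCase().includes("accessibility");
  // Resolve against a placeholder base so relative hrefs parse; only the
  // pathname is used, so the base host is irrelevant.
  const path = new URL(link.href, "https://placeholder.invalid").pathname;
  // Append a slash so a path ending in /contact is also excluded.
  const isContactPath = `${path}/`.includes("/contact/");
  return mentionsAccessibility && !isContactPath;
}
```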

…r crash

- Skip domain_mismatch when the final URL is www.gov.uk, since government services legitimately link to GOV.UK for privacy policies
- When a scrape fails with "has been closed" (browser crash), fail the pg-boss job instead of completing it, triggering automatic retry (retryLimit: 2, retryDelay: 300s)
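Both rules can be expressed as small predicates. This is a sketch under assumptions: the function names are hypothetical, and comparing hostnames is a guess at how domain_mismatch is detected, since the PR does not show that logic.

```typescript
// True when the final URL's host differs from the requested one, except that
// a final host of www.gov.uk is always accepted.
function isDomainMismatch(requestedUrl: string, finalUrl: string): boolean {
  const requestedHost = new URL(requestedUrl).hostname;
  const finalHost = new URL(finalUrl).hostname;
  if (finalHost === "www.gov.uk") return false;
  return requestedHost !== finalHost;
}

// True when the error message indicates a browser crash, in which case the
// job should be failed (so the queue retries it) rather than completed.
function shouldFailJobForRetry(errorMessage: string): boolean {
  return errorMessage.includes("has been closed");
}
```

Failing the job rather than recording a result lets the queue's retry settings handle transient crashes without extra scheduling code.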

- Add isAccessibilityStatement field to Bedrock extraction, so pages like register-trailer (valid statement but no WCAG keywords) are now correctly recognized
- Report "Page inaccessible from scraper — possible WAF block" when 0 chars extracted instead of misleading "did not contain a statement"
- Skip Bedrock API calls for empty pages (saves cost)
- Exempt www.gov.uk from domain_mismatch detection, since services legitimately link to GOV.UK for privacy policies
- Exclude /contact/ paths from deeper link discovery, which prevents following "report accessibility problem" links

…s empty

When Bedrock identifies a page as an accessibility statement but cannot extract complianceStatus or wcagStandard (truncated WAF content), log the full extracted text for diagnosis.

- Log both HTML length and innerText length for diagnosis (reveals whether server sent truncated HTML or JS removed content)
- When statement detected but < 500 chars with no structured data, report as scrape_error "Partial content received" instead of misleading success with empty fields
- Use JSON.stringify for content logging to avoid CloudWatch newline splitting
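The partial-content rule and the single-line logging can be sketched together. The `Extraction` and `Outcome` shapes and both function names are assumptions for illustration; only the field names isAccessibilityStatement, complianceStatus and wcagStandard, the 500-character threshold, and the "Partial content received" message come from the commit messages.

```typescript
interface Extraction {
  isAccessibilityStatement: boolean;
  complianceStatus: string | null;
  wcagStandard: string | null;
}

type Outcome =
  | { kind: "scrape_error"; message: string }
  | { kind: "ok"; extraction: Extraction };

// A detected statement that is short and yields no structured fields is
// treated as truncated content rather than a successful scrape.
function classifyExtraction(text: string, extraction: Extraction): Outcome {
  const hasStructuredData =
    extraction.complianceStatus !== null || extraction.wcagStandard !== null;
  if (extraction.isAccessibilityStatement && text.length < 500 && !hasStructuredData) {
    return { kind: "scrape_error", message: "Partial content received" };
  }
  return { kind: "ok", extraction };
}

// JSON.stringify escapes newlines as \n, so the whole entry stays on one line
// and is not split into several log events.
function formatContentLog(url: string, text: string): string {
  return JSON.stringify({ url, chars: text.length, content: text });
}
```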

edwardlonsdale added 11 commits May 12, 2026 10:47

feat(ui): add date filter with hover tooltip showing full timestamp

3b5f176


ci: dump compliance-scraper logs on e2e failure

4f59f1f


fix(scraper): log full page content when statement detected but field…

c1146d4


edwardlonsdale wants to merge 11 commits into main from fix/http-status-check
