fix(scraper): check HTTP response status before extracting page content #3

Open
edwardlonsdale wants to merge 11 commits into main from fix/http-status-check

@edwardlonsdale
Collaborator

Summary

  • All three scrapers (accessibility, cookie, privacy) now capture the response object from page.goto() and return scrape_error with the HTTP status code when the server responds with 4xx/5xx
  • Previously, error pages were silently sent to Bedrock, resulting in misleading "did not contain an accessibility statement" messages
  • Adds character-count logging after text extraction to aid future debugging

Context

The accessibility statement URL for accept-a-refugee-integration-loan returns 403 to the scraper (a WAF blocking AWS IPs or headless browsers). The scraper never checked HTTP status codes, so it extracted text from the error page, sent it to Bedrock, got null back, and reported "no data extracted".
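As a rough illustration of the status check described above, the sketch below treats any 4xx/5xx response as a scrape error. `ResponseLike`, `StatusCheck`, and `checkHttpStatus` are hypothetical names used only for this example; the only assumption about Playwright is that `page.goto()` resolves to a response object with a `status()` method, or to `null`.

```typescript
// Minimal structural stand-in for the object returned by Playwright's page.goto().
interface ResponseLike {
  status(): number;
}

// Hypothetical result shape; the real scrape_error type in this PR may differ.
type StatusCheck =
  | { ok: true }
  | { ok: false; kind: "scrape_error"; message: string };

// Treat any 4xx/5xx status as a scrape error.
// A null response (page.goto() can resolve to null, e.g. for hash-only
// navigations) is passed through as ok in this sketch.
function checkHttpStatus(response: ResponseLike | null): StatusCheck {
  if (response === null) return { ok: true };
  const status = response.status();
  if (status >= 400) {
    return { ok: false, kind: "scrape_error", message: `HTTP ${status}` };
  }
  return { ok: true };
}

// Assumed usage inside a scraper:
//   const response = await page.goto(url);
//   const check = checkHttpStatus(response);
//   if (!check.ok) return check; // do not send the error page to Bedrock
```

Keeping the check as a pure function makes it easy to share across the three scrapers and to unit test without launching a browser.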

Test plan

  • TypeScript compiles (npx tsc --noEmit)
  • 97 unit tests pass (pnpm test)
  • ESLint clean
  • Deployed to playground — service stable
  • Trigger re-scrape of affected service and verify HTTP 403 error shown

All three scrapers (accessibility, cookie, privacy) now capture the
response object from page.goto() and return scrape_error with the HTTP
status code when the server responds with 4xx/5xx. Previously, error
pages were silently sent to Bedrock, resulting in misleading
"did not contain an accessibility statement" messages.

Also adds character-count logging after text extraction to aid
future debugging of similar issues.

…rText empty

From AWS IPs, some pages return 200 but render main content
asynchronously (WAF challenge) or hide it via CSS. The innerText
property respects visibility, returning empty for hidden elements.

Changes:
- extractMainText/extractFullText fall back to textContent when
  innerText is empty
- All scrapers wait up to 10s for content to appear when the initial
  extraction returns 0 chars, then re-extract

The waitForFunction wait was not triggered when innerText returned
whitespace-only content (e.g. "\n"), because its length is greater than 0.
The check now uses .trim().length === 0 to correctly detect effectively empty pages.
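The fallback and the whitespace-aware emptiness check could look roughly like the sketch below. `TextSource`, `isEffectivelyEmpty`, and `pickText` are hypothetical names, not the PR's actual helpers, and applying the trimmed check to the fallback decision as well is an assumption.

```typescript
// Structural stand-in for the two DOM properties involved.
interface TextSource {
  innerText: string;
  textContent: string | null;
}

// True when the text has no non-whitespace characters,
// so whitespace-only strings such as "\n" count as empty.
function isEffectivelyEmpty(text: string): boolean {
  return text.trim().length === 0;
}

// Prefer innerText (which respects CSS visibility); fall back to
// textContent when innerText is effectively empty.
function pickText(el: TextSource): string {
  if (!isEffectivelyEmpty(el.innerText)) return el.innerText;
  return el.textContent ?? "";
}

// Assumed usage with Playwright when the first extraction is empty:
//   await page.waitForFunction(
//     () => document.body.innerText.trim().length > 0,
//     undefined,
//     { timeout: 10_000 },
//   );
```

Note that `textContent` also includes text from hidden elements, so the fallback may return content a user would not see.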

Also logs the URL being scraped when content is empty, aiding
diagnosis of WAF/redirect issues.

Wraps all timestamp displays in <time> elements with a title
attribute showing the full date and time on hover. Replaces the previous
string truncation approach.
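A minimal sketch of that markup, assuming timestamps are available as `Date` objects and rendered in UTC. `renderTime` is a hypothetical helper; the project's actual templating and date format are not shown in this PR page.

```typescript
// Render a timestamp as a <time> element: the short date is the visible
// text, and the full date and time appear in the title attribute on hover.
// Throws a RangeError for an invalid Date, because toISOString() does.
function renderTime(date: Date): string {
  const iso = date.toISOString();         // e.g. 2025-01-02T03:04:05.000Z
  const shortDate = iso.slice(0, 10);     // YYYY-MM-DD
  const full = `${shortDate} ${iso.slice(11, 19)} UTC`;
  return `<time datetime="${iso}" title="${full}">${shortDate}</time>`;
}
```

Because every interpolated value is derived from the ISO string, no HTML escaping is needed in this particular sketch.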

The container exits with code 1 in CI without visible error output. This step
captures the container logs so we can diagnose the startup crash.

The chromium stage used npx playwright@1 (latest 1.x) to install
the browser, but the release stage installs playwright@1.59.1 from
the lockfile. When a newer playwright 1.x was released, the browser
revision installed in the chromium stage no longer matched what the
app expected, causing "Executable doesn't exist" on startup.

Pin to 1.59.1 to match pnpm-lock.yaml.

findDeeperStatementLink was matching "Report an accessibility
problem" links (/contact/accessibility?service=...) because their
text contains "accessibility". This caused the scraper to navigate
away from the correct accessibility statement page to a contact form.

…r crash

- Skip domain_mismatch when final URL is www.gov.uk — government
  services legitimately link to GOV.UK for privacy policies
- When a scrape fails with "has been closed" (browser crash), fail
  the pg-boss job instead of completing it, triggering automatic
  retry (retryLimit: 2, retryDelay: 300s)
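The browser-crash handling could be sketched as follows. `ScrapeResult`, `isBrowserCrash`, and `handleScrapeJob` are hypothetical names; the sketch relies on the pg-boss behaviour that a handler which throws causes the job to be marked failed, so the retry settings given when the job was sent apply.

```typescript
// Hypothetical scrape result shape; the real type in this codebase may differ.
interface ScrapeResult {
  status: "ok" | "scrape_error";
  message?: string;
}

// A crashed or closed browser surfaces as an error message containing
// "has been closed", which is the marker this PR matches on.
function isBrowserCrash(result: ScrapeResult): boolean {
  return (
    result.status === "scrape_error" &&
    (result.message ?? "").includes("has been closed")
  );
}

// Job handler body: throwing fails the job instead of completing it,
// which lets the queue retry it automatically.
async function handleScrapeJob(
  scrape: () => Promise<ScrapeResult>,
): Promise<ScrapeResult> {
  const result = await scrape();
  if (isBrowserCrash(result)) {
    throw new Error(`Browser crashed during scrape: ${result.message}`);
  }
  return result;
}

// Assumed enqueue call (the queue name is illustrative only):
//   await boss.send("scrape", { url }, { retryLimit: 2, retryDelay: 300 });
```

Other scrape errors, such as an HTTP 403, still complete the job normally in this sketch, since retrying them is unlikely to help.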

- Add isAccessibilityStatement field to Bedrock extraction so that pages
  like register-trailer (a valid statement with no WCAG keywords) are
  now correctly recognized
- Report "Page inaccessible from scraper — possible WAF block" when
  0 chars extracted instead of misleading "did not contain a statement"
- Skip Bedrock API calls for empty pages (saves cost)
- Exempt www.gov.uk from domain_mismatch detection — services
  legitimately link to GOV.UK for privacy policies
- Exclude /contact/ paths from deeper link discovery — prevents
  following "report accessibility problem" links
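The www.gov.uk exemption and the /contact/ exclusion from the list above might be expressed along these lines. Both functions are hypothetical: the real domain_mismatch rule and the matching logic in findDeeperStatementLink are not shown here, so the hostname comparison and the "accessibility" text match are assumptions.

```typescript
// Hosts that services legitimately link to for shared policy pages.
const EXEMPT_HOSTS = new Set(["www.gov.uk"]);

// Assumed mismatch rule: flag when the final URL's host differs from the
// service's host, unless the final host is exempt.
function isDomainMismatch(serviceUrl: string, finalUrl: string): boolean {
  const finalHost = new URL(finalUrl).hostname;
  if (EXEMPT_HOSTS.has(finalHost)) return false;
  return new URL(serviceUrl).hostname !== finalHost;
}

// Candidate filter for deeper link discovery: the link text mentions
// "accessibility", but the path must not be under /contact/.
function isDeeperStatementCandidate(
  href: string,
  linkText: string,
  baseUrl: string,
): boolean {
  const path = new URL(href, baseUrl).pathname;
  if (path.includes("/contact/")) return false;
  return linkText.toLowerCase().includes("accessibility");
}
```

Resolving `href` against `baseUrl` first means relative links such as /contact/accessibility?service=... are filtered on their path, not on the query string.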
…s empty

When Bedrock identifies a page as an accessibility statement but
cannot extract complianceStatus or wcagStandard (e.g. because a WAF
truncated the content), log the full extracted text for diagnosis.

- Log both HTML length and innerText length for diagnosis (reveals
  whether server sent truncated HTML or JS removed content)
- When a statement is detected but fewer than 500 chars were extracted
  and no structured data was found, report a scrape_error "Partial
  content received" instead of a misleading success with empty fields
- Use JSON.stringify for content logging to avoid CloudWatch
  newline splitting
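A sketch of the partial-content rule and the single-line logging from the list above. The field names isAccessibilityStatement, complianceStatus, and wcagStandard come from this PR; `Extraction`, `Outcome`, `classify`, and `formatContentLog` are hypothetical names, and treating "no structured data" as both fields being null is an assumption.

```typescript
// Assumed shape of the Bedrock extraction result.
interface Extraction {
  isAccessibilityStatement: boolean;
  complianceStatus: string | null;
  wcagStandard: string | null;
}

type Outcome =
  | { status: "success" }
  | { status: "scrape_error"; message: string };

const PARTIAL_CONTENT_THRESHOLD = 500; // chars

// Report partial content when a statement was detected but the text is
// short and neither structured field could be extracted.
function classify(text: string, extraction: Extraction): Outcome {
  const hasStructuredData =
    extraction.complianceStatus !== null || extraction.wcagStandard !== null;
  if (
    extraction.isAccessibilityStatement &&
    text.length < PARTIAL_CONTENT_THRESHOLD &&
    !hasStructuredData
  ) {
    return { status: "scrape_error", message: "Partial content received" };
  }
  return { status: "success" };
}

// JSON.stringify escapes newlines as \n, so the whole extracted text stays
// on one log line and is not split into separate log events.
function formatContentLog(url: string, html: string, text: string): string {
  return `url=${url} htmlLength=${html.length} textLength=${text.length} content=${JSON.stringify(text)}`;
}
```

Comparing htmlLength with textLength in the same log line helps show whether the server sent truncated HTML or whether script removed the content after load.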