diff --git a/README.md b/README.md index 38d6f71..623a3fa 100644 --- a/README.md +++ b/README.md @@ -7,12 +7,12 @@ Test your documentation site against the [Agent-Friendly Documentation Spec](https://agentdocsspec.com). Agents don't use docs like humans. They hit truncation limits, get walls of CSS instead of content, can't follow cross-host redirects, and don't know about quality-of-life improvements like `llms.txt` or `.md` docs pages that would make life swell. Maybe this is because the industry has lacked guidance - until now. -afdocs runs 21 checks across 8 categories to evaluate how well your docs serve agent consumers. 16 are fully implemented; the rest return `skip` until completed. +afdocs runs 22 checks across 8 categories to evaluate how well your docs serve agent consumers. > **Status: Early development (0.x)** > This project is under active development. Check IDs, CLI flags, and output formats may change between minor versions. Feel free to try it out, but don't build automation against specific output until 1.0. > -> Implements [spec v0.1.0](https://agentdocsspec.com/spec) (2026-02-22). +> Implements [spec v0.2.1](https://agentdocsspec.com/spec) (2026-03-15). 
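The `llms.txt` convention mentioned above is just a markdown index served from the site root. A minimal sketch (the section names and paths are illustrative, not anything afdocs emits or requires):

```markdown
# Example Docs

> Documentation for the Example API, organized so agents can find pages without crawling.

## Guides

- [Quick start](https://docs.example.com/quickstart.md): install and make a first request
- [Authentication](https://docs.example.com/api/auth.md): API keys and OAuth flows

## Reference

- [REST API](https://docs.example.com/api/rest.md): endpoints, parameters, and error codes
```

Serving the linked pages as `.md` (as in the paths above) is the companion improvement: agents get clean text instead of an HTML shell.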
## Quick start @@ -43,7 +43,7 @@ Authentication ✓ auth-gate-detection: All 50 sampled pages are publicly accessible Summary - 9 passed, 3 failed, 9 skipped (21 total) + 9 passed, 3 failed, 10 skipped (22 total) ``` ## Install @@ -75,12 +75,34 @@ afdocs check https://docs.example.com --pass-threshold 30000 --fail-threshold 80 | `--format <format>` | `text` | Output format: `text` or `json` | | `-v, --verbose` | | Show per-page details for checks with issues | | `--checks <ids>` | all | Comma-separated list of check IDs | +| `--sampling <strategy>` | `random` | URL sampling strategy (see below) | | `--max-concurrency <n>` | `3` | Maximum concurrent HTTP requests | | `--request-delay <ms>` | `200` | Delay between requests (milliseconds) | | `--max-links <n>` | `50` | Maximum links to test in link checks | | `--pass-threshold <chars>` | `50000` | Size pass threshold (characters) | | `--fail-threshold <chars>` | `100000` | Size fail threshold (characters) | +### Sampling strategies + +By default, afdocs discovers pages from your site (via `llms.txt`, sitemap, or both) and randomly samples up to `--max-links` pages to check. The `--sampling` flag gives you control over how that sample is selected. + +| Strategy | Behavior | +| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `random` | Shuffle discovered URLs and take the first N. Fast and broad, but results vary between runs. | +| `deterministic` | Sort discovered URLs alphabetically, then pick every Nth URL for an even spread. Produces the same sample on repeated runs as long as the URL set is stable. | +| `none` | Skip discovery entirely. Only check the URL you pass on the command line. 
| + +```bash +# Reproducible runs for CI or iteration (same pages every time) +afdocs check https://docs.example.com --sampling deterministic + +# Check a single page without any discovery +afdocs check https://docs.example.com/api/auth --sampling none + +# Check a single page with specific checks +afdocs check https://docs.example.com/api/auth --sampling none --checks page-size-html,redirect-behavior +``` + ### Exit codes - `0` if all checks pass or warn @@ -144,7 +166,7 @@ describe('agent-friendliness', () => { ## Checks -21 checks across 8 categories. Checks marked with \* are not yet implemented and return `skip`. +22 checks across 8 categories. ### Category 1: llms.txt @@ -165,19 +187,20 @@ describe('agent-friendliness', () => { ### Category 3: Page Size and Truncation Risk -| Check | Description | -| ------------------------ | ------------------------------------------------ | -| `page-size-markdown` | Character count when served as markdown | -| `page-size-html` | Character count of HTML and post-conversion size | -| `content-start-position` | How far into the response actual content begins | +| Check | Description | +| ------------------------ | --------------------------------------------------------------- | +| `rendering-strategy` | Whether pages contain server-rendered content or are SPA shells | +| `page-size-markdown` | Character count when served as markdown | +| `page-size-html` | Character count of HTML and post-conversion size | +| `content-start-position` | How far into the response actual content begins | ### Category 4: Content Structure -| Check | Description | -| --------------------------------- | -------------------------------------------------- | -| `tabbed-content-serialization` \* | Whether tabbed content creates oversized output | -| `section-header-quality` \* | Whether headers in tabbed sections include context | -| `markdown-code-fence-validity` | Whether markdown has unclosed code fences | +| Check | Description | +| 
------------------------------ | -------------------------------------------------- | +| `tabbed-content-serialization` | Whether tabbed content creates oversized output | +| `section-header-quality` | Whether headers in tabbed sections include context | +| `markdown-code-fence-validity` | Whether markdown has unclosed code fences | ### Category 5: URL Stability and Redirects @@ -194,18 +217,18 @@ describe('agent-friendliness', () => { ### Category 7: Observability and Content Health -| Check | Description | -| ---------------------------- | ---------------------------------------------- | -| `llms-txt-freshness` \* | Whether `llms.txt` reflects current site state | -| `markdown-content-parity` \* | Whether markdown and HTML versions match | -| `cache-header-hygiene` | Whether cache headers allow timely updates | +| Check | Description | +| ------------------------- | ---------------------------------------------- | +| `llms-txt-freshness` | Whether `llms.txt` reflects current site state | +| `markdown-content-parity` | Whether markdown and HTML versions match | +| `cache-header-hygiene` | Whether cache headers allow timely updates | ### Category 8: Authentication and Access -| Check | Description | -| ---------------------------- | -------------------------------------------------------------------- | -| `auth-gate-detection` | Whether documentation pages require authentication to access content | -| `auth-alternative-access` \* | Whether auth-gated sites provide alternative access paths for agents | +| Check | Description | +| ------------------------- | -------------------------------------------------------------------- | +| `auth-gate-detection` | Whether documentation pages require authentication to access content | +| `auth-alternative-access` | Whether auth-gated sites provide alternative access paths for agents | ## Check dependencies diff --git a/package-lock.json b/package-lock.json index 7cbc6aa..d780a0d 100644 --- a/package-lock.json +++ 
b/package-lock.json @@ -11,6 +11,7 @@ "dependencies": { "chalk": "^5.4.1", "commander": "^13.1.0", + "node-html-parser": "^7.1.0", "turndown": "^7.2.2", "yaml": "^2.7.0" }, @@ -32,7 +33,7 @@ "vitest": "^4.0.18" }, "engines": { - "node": ">=20" + "node": ">=22" } }, "node_modules/@babel/helper-string-parser": { @@ -1807,6 +1808,12 @@ "node": "20 || >=22" } }, + "node_modules/boolbase": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/boolbase/-/boolbase-1.0.0.tgz", + "integrity": "sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww==", + "license": "ISC" + }, "node_modules/brace-expansion": { "version": "5.0.2", "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-5.0.2.tgz", @@ -2042,6 +2049,34 @@ "node": ">= 8" } }, + "node_modules/css-select": { + "version": "5.2.2", + "resolved": "https://registry.npmjs.org/css-select/-/css-select-5.2.2.tgz", + "integrity": "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw==", + "license": "BSD-2-Clause", + "dependencies": { + "boolbase": "^1.0.0", + "css-what": "^6.1.0", + "domhandler": "^5.0.2", + "domutils": "^3.0.1", + "nth-check": "^2.0.1" + }, + "funding": { + "url": "https://github.com/sponsors/fb55" + } + }, + "node_modules/css-what": { + "version": "6.2.2", + "resolved": "https://registry.npmjs.org/css-what/-/css-what-6.2.2.tgz", + "integrity": "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA==", + "license": "BSD-2-Clause", + "engines": { + "node": ">= 6" + }, + "funding": { + "url": "https://github.com/sponsors/fb55" + } + }, "node_modules/debug": { "version": "4.4.3", "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz", @@ -2067,6 +2102,61 @@ "dev": true, "license": "MIT" }, + "node_modules/dom-serializer": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/dom-serializer/-/dom-serializer-2.0.0.tgz", + "integrity": 
"sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg==", + "license": "MIT", + "dependencies": { + "domelementtype": "^2.3.0", + "domhandler": "^5.0.2", + "entities": "^4.2.0" + }, + "funding": { + "url": "https://github.com/cheeriojs/dom-serializer?sponsor=1" + } + }, + "node_modules/domelementtype": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/domelementtype/-/domelementtype-2.3.0.tgz", + "integrity": "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw==", + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/fb55" + } + ], + "license": "BSD-2-Clause" + }, + "node_modules/domhandler": { + "version": "5.0.3", + "resolved": "https://registry.npmjs.org/domhandler/-/domhandler-5.0.3.tgz", + "integrity": "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w==", + "license": "BSD-2-Clause", + "dependencies": { + "domelementtype": "^2.3.0" + }, + "engines": { + "node": ">= 4" + }, + "funding": { + "url": "https://github.com/fb55/domhandler?sponsor=1" + } + }, + "node_modules/domutils": { + "version": "3.2.2", + "resolved": "https://registry.npmjs.org/domutils/-/domutils-3.2.2.tgz", + "integrity": "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw==", + "license": "BSD-2-Clause", + "dependencies": { + "dom-serializer": "^2.0.0", + "domelementtype": "^2.3.0", + "domhandler": "^5.0.3" + }, + "funding": { + "url": "https://github.com/fb55/domutils?sponsor=1" + } + }, "node_modules/emoji-regex": { "version": "8.0.0", "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz", @@ -2074,6 +2164,18 @@ "dev": true, "license": "MIT" }, + "node_modules/entities": { + "version": "4.5.0", + "resolved": "https://registry.npmjs.org/entities/-/entities-4.5.0.tgz", + "integrity": 
"sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw==", + "license": "BSD-2-Clause", + "engines": { + "node": ">=0.12" + }, + "funding": { + "url": "https://github.com/fb55/entities?sponsor=1" + } + }, "node_modules/environment": { "version": "1.1.0", "resolved": "https://registry.npmjs.org/environment/-/environment-1.1.0.tgz", @@ -2513,6 +2615,15 @@ "node": ">=8" } }, + "node_modules/he": { + "version": "1.2.0", + "resolved": "https://registry.npmjs.org/he/-/he-1.2.0.tgz", + "integrity": "sha512-F/1DnUGPopORZi0ni+CvrCgHQ5FyEAHRLSApuYWMmrbSwoN2Mn/7k+Gl38gJnR7yyDZk6WLXwiGod1JOWNDKGw==", + "license": "MIT", + "bin": { + "he": "bin/he" + } + }, "node_modules/headers-polyfill": { "version": "4.0.3", "resolved": "https://registry.npmjs.org/headers-polyfill/-/headers-polyfill-4.0.3.tgz", @@ -3166,6 +3277,28 @@ "dev": true, "license": "MIT" }, + "node_modules/node-html-parser": { + "version": "7.1.0", + "resolved": "https://registry.npmjs.org/node-html-parser/-/node-html-parser-7.1.0.tgz", + "integrity": "sha512-iJo8b2uYGT40Y8BTyy5ufL6IVbN8rbm/1QK2xffXU/1a/v3AAa0d1YAoqBNYqaS4R/HajkWIpIfdE6KcyFh1AQ==", + "license": "MIT", + "dependencies": { + "css-select": "^5.1.0", + "he": "1.2.0" + } + }, + "node_modules/nth-check": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/nth-check/-/nth-check-2.1.1.tgz", + "integrity": "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w==", + "license": "BSD-2-Clause", + "dependencies": { + "boolbase": "^1.0.0" + }, + "funding": { + "url": "https://github.com/fb55/nth-check?sponsor=1" + } + }, "node_modules/obug": { "version": "2.1.1", "resolved": "https://registry.npmjs.org/obug/-/obug-2.1.1.tgz", diff --git a/package.json b/package.json index b9b2626..a1cd25d 100644 --- a/package.json +++ b/package.json @@ -60,6 +60,7 @@ "dependencies": { "chalk": "^5.4.1", "commander": "^13.1.0", + "node-html-parser": "^7.1.0", "turndown": "^7.2.2", 
"yaml": "^2.7.0" }, diff --git a/src/checks/authentication/auth-alternative-access.ts b/src/checks/authentication/auth-alternative-access.ts index 2eb08b6..d5353b9 100644 --- a/src/checks/authentication/auth-alternative-access.ts +++ b/src/checks/authentication/auth-alternative-access.ts @@ -1,12 +1,149 @@ import { registerCheck } from '../registry.js'; import type { CheckContext, CheckResult } from '../../types.js'; -async function check(_ctx: CheckContext): Promise<CheckResult> { +interface AuthGateDetails { + accessible?: number; + authRequired?: number; + softAuthGate?: number; + authRedirect?: number; + testedPages?: number; + pageResults?: Array<{ + url: string; + classification: string; + }>; +} + +interface DetectedPath { + type: string; + description: string; +} + +async function check(ctx: CheckContext): Promise<CheckResult> { + const id = 'auth-alternative-access'; + const category = 'authentication'; + + // Read auth-gate-detection result; skip if it didn't run or docs are all public + const authResult = ctx.previousResults.get('auth-gate-detection'); + if (!authResult) { + return { + id, + category, + status: 'skip', + message: 'auth-gate-detection did not run', + }; + } + + if (authResult.status === 'pass') { + return { + id, + category, + status: 'skip', + message: 'All docs pages are publicly accessible; no alternative access paths needed', + }; + } + + if (authResult.status === 'skip' || authResult.status === 'error') { + return { + id, + category, + status: 'skip', + message: `auth-gate-detection ${authResult.status === 'error' ? 'errored' : 'was skipped'}; cannot assess alternative access`, + }; + } + + // Auth-gate-detection returned warn or fail — look for alternative access paths + const authDetails = (authResult.details ?? {}) as AuthGateDetails; + const gatedCount = + (authDetails.authRequired ?? 0) + + (authDetails.softAuthGate ?? 0) + + (authDetails.authRedirect ?? 0); + const accessibleCount = authDetails.accessible ?? 
0; + const testedCount = authDetails.testedPages ?? 0; + + const detectedPaths: DetectedPath[] = []; + + // 1. Check for public llms.txt + const llmsResult = ctx.previousResults.get('llms-txt-exists'); + if (llmsResult?.status === 'pass' || llmsResult?.status === 'warn') { + detectedPaths.push({ + type: 'public-llms-txt', + description: + 'Site serves a public llms.txt file, giving agents a navigational index even though docs pages are gated', + }); + } + + // 2. Check for publicly accessible markdown + const mdUrlResult = ctx.previousResults.get('markdown-url-support'); + const cnResult = ctx.previousResults.get('content-negotiation'); + if (mdUrlResult?.status === 'pass' || mdUrlResult?.status === 'warn') { + detectedPaths.push({ + type: 'public-markdown', + description: + 'Some pages serve markdown via .md URLs, providing agent-readable content without authentication', + }); + } else if (cnResult?.status === 'pass' || cnResult?.status === 'warn') { + detectedPaths.push({ + type: 'public-markdown', + description: + 'Some pages serve markdown via content negotiation, providing agent-readable content without authentication', + }); + } + + // 3. Check for partially accessible pages (from auth-gate-detection itself) + if (accessibleCount > 0 && gatedCount > 0) { + const pct = Math.round((accessibleCount / testedCount) * 100); + detectedPaths.push({ + type: 'partial-public-access', + description: `${accessibleCount} of ${testedCount} tested pages (${pct}%) are publicly accessible without authentication`, + }); + } + + // Determine status + const manualOnlyNote = + 'Some access paths cannot be detected automatically: bundled SDK docs, CLI doc commands, and MCP servers'; + + let status: 'pass' | 'warn' | 'fail'; + let message: string; + + if (detectedPaths.length === 0) { + status = 'fail'; + message = `No alternative access paths detected for ${gatedCount} auth-gated pages. 
${manualOnlyNote}`; + } else { + // Pass if we found a full-content path (llms.txt + markdown, or most pages accessible). + // Warn if we only found partial paths (llms.txt alone is just an index, not content). + const hasContentPath = detectedPaths.some((p) => p.type === 'public-markdown'); + const hasHighAccessibility = + accessibleCount > 0 && testedCount > 0 && accessibleCount / testedCount >= 0.5; + + if (hasContentPath || hasHighAccessibility) { + status = 'pass'; + } else { + status = 'warn'; + } + + const pathSummary = detectedPaths.map((p) => p.type).join(', '); + message = + status === 'pass' + ? `Alternative access detected (${pathSummary}) for site with ${gatedCount} auth-gated pages` + : `Partial alternative access detected (${pathSummary}) for site with ${gatedCount} auth-gated pages. ${manualOnlyNote}`; + } + return { - id: 'auth-alternative-access', - category: 'authentication', - status: 'skip', - message: 'Not yet implemented', + id, + category, + status, + message, + details: { + gatedPages: gatedCount, + accessiblePages: accessibleCount, + testedPages: testedCount, + detectedPaths, + manualVerificationNeeded: [ + 'Bundled documentation (docs shipped in package/SDK)', + 'CLI-based doc access (e.g. 
`yourproduct docs search "topic"`)', + 'MCP server providing doc access through tool calls', + ], + }, }; } @@ -15,6 +152,6 @@ registerCheck({ category: 'authentication', description: 'Whether an auth-gated documentation site provides alternative access paths for agents', - dependsOn: [['auth-gate-detection']], + dependsOn: [], run: check, }); diff --git a/src/checks/authentication/auth-gate-detection.ts b/src/checks/authentication/auth-gate-detection.ts index 726451d..b36cbe1 100644 --- a/src/checks/authentication/auth-gate-detection.ts +++ b/src/checks/authentication/auth-gate-detection.ts @@ -45,11 +45,18 @@ function detectLoginForm(body: string): string | undefined { return 'Contains password input field'; } - // Check page title for login indicators + // Check page title for login indicators. + // Only match titles that suggest the page IS a login form, not pages that + // mention login as a topic (e.g. "unable to login" in a knowledge base article). + // We require the login keyword to appear at the start or after a separator. 
const titleMatch = /<title[^>]*>(.*?)<\/title>/i.exec(sample); if (titleMatch) { - const title = titleMatch[1].toLowerCase(); - if (/sign\s*in|log\s*in|authenticate/i.test(title)) { + const title = titleMatch[1].toLowerCase().trim(); + if ( + /^(sign\s*in|log\s*in)\b/.test(title) || + /[|\-–—:]\s*(sign\s*in|log\s*in)\s*$/i.test(title) || + /^authenticate\b/.test(title) + ) { return `Page title suggests login: "${titleMatch[1].trim()}"`; } } diff --git a/src/checks/content-structure/markdown-code-fence-validity.ts b/src/checks/content-structure/markdown-code-fence-validity.ts index 354982b..c32394a 100644 --- a/src/checks/content-structure/markdown-code-fence-validity.ts +++ b/src/checks/content-structure/markdown-code-fence-validity.ts @@ -136,6 +136,6 @@ registerCheck({ id: 'markdown-code-fence-validity', category: 'content-structure', description: 'Whether markdown contains unclosed code fences', - dependsOn: [['markdown-url-support', 'content-negotiation']], + dependsOn: [], run: check, }); diff --git a/src/checks/content-structure/section-header-quality.ts b/src/checks/content-structure/section-header-quality.ts index 3010ab0..82484ec 100644 --- a/src/checks/content-structure/section-header-quality.ts +++ b/src/checks/content-structure/section-header-quality.ts @@ -1,12 +1,291 @@ +import { parse } from 'node-html-parser'; import { registerCheck } from '../registry.js'; -import type { CheckContext, CheckResult } from '../../types.js'; +import type { CheckContext, CheckResult, CheckStatus } from '../../types.js'; +import type { DetectedTabGroup } from '../../helpers/detect-tabs.js'; + +interface TabbedPageResult { + url: string; + tabGroups: DetectedTabGroup[]; + totalTabbedChars: number; + status: CheckStatus; + error?: string; +} + +interface GroupHeaderAnalysis { + url: string; + framework: string; + totalHeaders: number; + genericHeaders: number; + contextualHeaders: number; + hasGenericMajority: boolean; + hasCrossGroupGeneric: boolean; +} + +const MD_HEADING_RE = 
/^#{1,6}\s+(.+)$/gm; + +/** + * Extract header text from content that may be HTML, markdown, or a mix (MDX). + * Tries HTML parsing first, then falls back to markdown heading regex. + */ +function extractHeaders(content: string): string[] { + const headers: string[] = []; + + // HTML headers + const root = parse(content); + const htmlHeaders = root.querySelectorAll('h1, h2, h3, h4, h5, h6'); + for (const h of htmlHeaders) { + const text = h.textContent.trim(); + if (text.length > 0) headers.push(text); + } + + // Markdown headers (## Heading) + let match; + while ((match = MD_HEADING_RE.exec(content)) !== null) { + const text = match[1].trim(); + if (text.length > 0) headers.push(text); + } + + return headers; +} + +async function check(ctx: CheckContext): Promise<CheckResult> { + const id = 'section-header-quality'; + const category = 'content-structure'; + + const tabResult = ctx.previousResults.get('tabbed-content-serialization'); + + if (!tabResult || tabResult.status === 'skip') { + return { + id, + category, + status: 'skip', + message: 'Skipped: tabbed-content-serialization did not run', + }; + } + + const tabbedPages = (tabResult.details?.tabbedPages as TabbedPageResult[] | undefined) ?? 
[]; + const pagesWithGroups = tabbedPages.filter((p) => p.tabGroups && p.tabGroups.length > 0); + + if (pagesWithGroups.length === 0) { + return { + id, + category, + status: 'pass', + message: 'No tabbed content found; header quality check not applicable', + }; + } + + const analyses: GroupHeaderAnalysis[] = []; + // Track unique headers per analysis for cross-group pass + const analysisHeaderSets: Set<string>[] = []; + + for (const page of pagesWithGroups) { + for (const group of page.tabGroups) { + if (group.panels.length < 2) continue; + + // Extract headers from each panel + const panelHeaders: Array<{ label: string | null; headers: string[] }> = group.panels.map( + (panel) => ({ + label: panel.label, + headers: extractHeaders(panel.html), + }), + ); + + // Count how many times each header text appears across panels + const headerCounts = new Map<string, number>(); + const uniqueHeaders = new Set<string>(); + for (const ph of panelHeaders) { + for (const h of ph.headers) { + const lower = h.toLowerCase(); + headerCounts.set(lower, (headerCounts.get(lower) ?? 0) + 1); + uniqueHeaders.add(lower); + } + } + + const allHeaders = panelHeaders.flatMap((ph) => ph.headers); + let genericCount = 0; + let contextualCount = 0; + + for (const ph of panelHeaders) { + for (const h of ph.headers) { + const lower = h.toLowerCase(); + const appearsInMultiple = (headerCounts.get(lower) ?? 
0) >= 2; + + // A header is contextual if it includes the panel label or is unique + const includesLabel = ph.label != null && lower.includes(ph.label.toLowerCase()); + + if (includesLabel || !appearsInMultiple) { + contextualCount++; + } else { + genericCount++; + } + } + } + + const totalHeaders = allHeaders.length; + const hasGenericMajority = totalHeaders > 0 && genericCount > totalHeaders / 2; + + analysisHeaderSets.push(uniqueHeaders); + analyses.push({ + url: page.url, + framework: group.framework, + totalHeaders, + genericHeaders: genericCount, + contextualHeaders: contextualCount, + hasGenericMajority, + hasCrossGroupGeneric: false, + }); + } + } + + // Cross-group analysis: detect identical headers repeated across separate tab groups + // on the same page without variant context (e.g. "Build a MongoDB Search Query" + // appearing in 7 driver-specific tab groups). + let crossGroupGenericGroupCount = 0; + let crossGroupTotalGroupCount = 0; + const crossGroupRepeatedHeaders: Array<{ url: string; header: string; groupCount: number }> = []; + + for (const page of pagesWithGroups) { + if (page.tabGroups.length < 2) continue; + + // Collect all panel labels and unique headers per group + const allLabels = new Set<string>(); + const perGroup: Set<string>[] = []; + for (const group of page.tabGroups) { + const headers = new Set<string>(); + for (const panel of group.panels) { + if (panel.label) allLabels.add(panel.label.toLowerCase()); + for (const h of extractHeaders(panel.html)) headers.add(h.toLowerCase()); + } + perGroup.push(headers); + } + + // Count how many groups each header appears in + const headerGroupCount = new Map<string, number>(); + for (const hs of perGroup) { + for (const h of hs) headerGroupCount.set(h, (headerGroupCount.get(h) ?? 0) + 1); + } + + // A header is cross-group generic if it appears in 2+ groups and doesn't + // include any panel label (i.e. 
lacks variant context) + const crossGenericSet = new Set<string>(); + for (const [header, count] of headerGroupCount) { + if (count >= 2 && ![...allLabels].some((l) => header.includes(l))) { + crossGenericSet.add(header); + crossGroupRepeatedHeaders.push({ url: page.url, header, groupCount: count }); + } + } + + // Count groups affected by cross-group generic headers + for (const hs of perGroup) { + if (hs.size === 0) continue; + crossGroupTotalGroupCount++; + if ([...hs].some((h) => crossGenericSet.has(h))) crossGroupGenericGroupCount++; + } + + // Update individual analyses with cross-group flag + if (crossGenericSet.size > 0) { + for (let i = 0; i < analyses.length; i++) { + if (analyses[i].url !== page.url) continue; + if ([...analysisHeaderSets[i]].some((h) => crossGenericSet.has(h))) { + analyses[i].hasCrossGroupGeneric = true; + } + } + } + } + + if (analyses.length === 0 && crossGroupTotalGroupCount === 0) { + return { + id, + category, + status: 'pass', + message: 'Tab groups have fewer than 2 panels; header quality check not applicable', + }; + } + + const groupsWithGenericMajority = analyses.filter((a) => a.hasGenericMajority).length; + const groupsWithHeaders = analyses.filter((a) => a.totalHeaders > 0).length; + + // If no tab panels contain any section headers, we can't evaluate quality + if (groupsWithHeaders === 0 && crossGroupTotalGroupCount === 0) { + return { + id, + category, + status: 'skip', + message: `${pagesWithGroups.length} page(s) with tabs found, but no section headers inside tab panels to evaluate`, + }; + } + + // Identify affected pages: pages where any group has within-group or cross-group issues + const pagesWithWithinGroupIssues = new Set( + analyses.filter((a) => a.hasGenericMajority).map((a) => a.url), + ); + const pagesWithCrossGroupIssues = new Set(crossGroupRepeatedHeaders.map((h) => h.url)); + const affectedPages = new Set([...pagesWithWithinGroupIssues, ...pagesWithCrossGroupIssues]); + + // Count pages where we actually found 
headers to evaluate + const pagesWithHeaders = new Set(analyses.filter((a) => a.totalHeaders > 0).map((a) => a.url)); + + // Scoring: use group-level ratios for fine-grained thresholds + // Within-group: ratio of groups-with-headers that have majority-generic + let withinStatus: CheckStatus = 'pass'; + if (groupsWithHeaders > 0) { + const wRatio = groupsWithGenericMajority / groupsWithHeaders; + if (wRatio > 0.5) withinStatus = 'fail'; + else if (wRatio > 0.25) withinStatus = 'warn'; + } + + // Cross-group: ratio of groups on multi-group pages that have cross-group generics + let crossGroupStatus: CheckStatus = 'pass'; + if (crossGroupTotalGroupCount > 0) { + const cRatio = crossGroupGenericGroupCount / crossGroupTotalGroupCount; + if (cRatio > 0.5) crossGroupStatus = 'fail'; + else if (cRatio > 0.25) crossGroupStatus = 'warn'; + } + + // Combined status: worst of both + const statusRank: Record<CheckStatus, number> = { pass: 0, skip: 0, warn: 1, fail: 2, error: 2 }; + const status: CheckStatus = + statusRank[crossGroupStatus] > statusRank[withinStatus] ? crossGroupStatus : withinStatus; + + // Build a page-oriented message for docs teams + let message: string; + if (affectedPages.size === 0) { + message = `${pagesWithHeaders.size} page(s) with tab headers checked; headers include variant context`; + } else { + // Find the most-repeated cross-group header for a concrete example + const worstHeader = + crossGroupRepeatedHeaders.length > 0 + ? [...crossGroupRepeatedHeaders].sort((a, b) => b.groupCount - a.groupCount)[0] + : null; + + const pageSummary = + `${affectedPages.size} of ${pagesWithHeaders.size} page(s) with tab headers ` + + `don't distinguish between variants`; + + if (worstHeader) { + message = `${pageSummary} (e.g. 
"${worstHeader.header}" repeats across ${worstHeader.groupCount} tab groups)`; + } else { + message = pageSummary; + } + } -async function check(_ctx: CheckContext): Promise<CheckResult> { return { - id: 'section-header-quality', - category: 'content-structure', - status: 'skip', - message: 'Not yet implemented', + id, + category, + status, + message, + details: { + pagesWithTabs: pagesWithGroups.length, + pagesAffected: affectedPages.size, + totalGroupsAnalyzed: analyses.length, + groupsWithHeaders, + groupsWithGenericMajority, + crossGroupGenericGroupCount, + crossGroupTotalGroupCount, + crossGroupRepeatedHeaders, + analyses, + }, }; } @@ -14,6 +293,8 @@ registerCheck({ id: 'section-header-quality', category: 'content-structure', description: 'Whether headers in tabbed sections include variant context', - dependsOn: ['tabbed-content-serialization'], + // No hard dependency: we read from previousResults if available, + // but the check handles missing data gracefully (returns skip). + dependsOn: [], run: check, }); diff --git a/src/checks/content-structure/tabbed-content-serialization.ts b/src/checks/content-structure/tabbed-content-serialization.ts index 6e91f67..3126ef0 100644 --- a/src/checks/content-structure/tabbed-content-serialization.ts +++ b/src/checks/content-structure/tabbed-content-serialization.ts @@ -1,12 +1,229 @@ import { registerCheck } from '../registry.js'; -import type { CheckContext, CheckResult } from '../../types.js'; +import { discoverAndSamplePages } from '../../helpers/get-page-urls.js'; +import { htmlToMarkdown } from '../../helpers/html-to-markdown.js'; +import { fetchPage } from '../../helpers/fetch-page.js'; +import { detectTabGroups } from '../../helpers/detect-tabs.js'; +import { toMdUrls } from '../../helpers/to-md-urls.js'; +import type { CheckContext, CheckResult, CheckStatus } from '../../types.js'; +import type { DetectedTabGroup } from '../../helpers/detect-tabs.js'; + +interface TabbedPageResult { + url: string; + tabGroups: 
DetectedTabGroup[]; + totalTabbedChars: number; + status: CheckStatus; + source?: 'html' | 'md-fallback' | 'markdown'; + error?: string; +} + +function sizeStatus(chars: number): CheckStatus { + if (chars <= 50_000) return 'pass'; + if (chars <= 100_000) return 'warn'; + return 'fail'; +} + +function worstStatus(statuses: CheckStatus[]): CheckStatus { + if (statuses.includes('fail')) return 'fail'; + if (statuses.includes('warn')) return 'warn'; + return 'pass'; +} + +function formatSize(chars: number): string { + if (chars >= 1000) return `${Math.round(chars / 1000)}K`; + return String(chars); +} + +/** + * Try to fetch a .md fallback URL for a page. Returns the body if successful, null otherwise. + */ +async function tryMdFallback(ctx: CheckContext, pageUrl: string): Promise<string | null> { + const candidates = toMdUrls(pageUrl); + for (const mdUrl of candidates) { + try { + const response = await ctx.http.fetch(mdUrl); + if (!response.ok) continue; + const contentType = response.headers.get('content-type') ?? 
''; + if (!contentType.includes('text/markdown') && !contentType.includes('text/plain')) continue; + const body = await response.text(); + // Sanity check: must have some content and not be HTML + if (body.length > 0 && !body.trimStart().startsWith('; + const match = pageResults.find((r) => r.url === url); + return match?.status === 'fail'; +} + +async function analyzePage(ctx: CheckContext, url: string): Promise { + const page = await fetchPage(ctx, url); + + // For markdown responses, run MDX detection directly + if (!page.isHtml) { + const tabGroups = detectTabGroups(page.body); + if (tabGroups.length === 0) { + return { url, tabGroups: [], totalTabbedChars: 0, status: 'pass', source: 'markdown' }; + } + // For markdown content, the serialized size is the raw content of the tab groups + let totalTabbedChars = 0; + for (const group of tabGroups) { + totalTabbedChars += group.htmlSlice.length; + } + return { + url, + tabGroups, + totalTabbedChars, + status: sizeStatus(totalTabbedChars), + source: 'markdown', + }; + } + + // HTML response: try HTML-based detection first + const tabGroups = detectTabGroups(page.body); + if (tabGroups.length > 0) { + let totalTabbedChars = 0; + for (const group of tabGroups) { + const md = htmlToMarkdown(group.htmlSlice); + totalTabbedChars += md.length; + } + return { + url, + tabGroups, + totalTabbedChars, + status: sizeStatus(totalTabbedChars), + source: 'html', + }; + } + + // No tabs found in HTML. If rendering-strategy flagged this as an SPA shell, + // try the markdown path as a fallback so we can still analyze tab content + // for agents that support content negotiation. 
+  if (isSpaShell(ctx, url)) {
+    const mdBody = await tryMdFallback(ctx, url);
+    if (mdBody) {
+      const mdTabGroups = detectTabGroups(mdBody);
+      if (mdTabGroups.length > 0) {
+        let totalTabbedChars = 0;
+        for (const group of mdTabGroups) {
+          totalTabbedChars += group.htmlSlice.length;
+        }
+        return {
+          url,
+          tabGroups: mdTabGroups,
+          totalTabbedChars,
+          status: sizeStatus(totalTabbedChars),
+          source: 'md-fallback',
+        };
+      }
+    }
+  }
+
+  return { url, tabGroups: [], totalTabbedChars: 0, status: 'pass', source: 'html' };
+}
+
+async function check(ctx: CheckContext): Promise<CheckResult> {
+  const id = 'tabbed-content-serialization';
+  const category = 'content-structure';
+
+  const {
+    urls: pageUrls,
+    totalPages,
+    sampled: wasSampled,
+    warnings,
+  } = await discoverAndSamplePages(ctx);
+
+  const results: TabbedPageResult[] = [];
+  const concurrency = ctx.options.maxConcurrency;
+
+  for (let i = 0; i < pageUrls.length; i += concurrency) {
+    const batch = pageUrls.slice(i, i + concurrency);
+    const batchResults = await Promise.all(
+      batch.map(async (url): Promise<TabbedPageResult> => {
+        try {
+          return await analyzePage(ctx, url);
+        } catch (err) {
+          return {
+            url,
+            tabGroups: [],
+            totalTabbedChars: 0,
+            status: 'fail',
+            error: err instanceof Error ? err.message : String(err),
+          };
+        }
+      }),
+    );
+    results.push(...batchResults);
+  }
+
+  const successful = results.filter((r) => !r.error);
+  const fetchErrors = results.filter((r) => r.error).length;
+
+  if (successful.length === 0) {
+    const suffix = fetchErrors > 0 ? `; ${fetchErrors} failed to fetch` : '';
+    return {
+      id,
+      category,
+      status: 'fail',
+      message: `Could not fetch any pages to analyze${suffix}`,
+      details: {
+        totalPages,
+        testedPages: results.length,
+        sampled: wasSampled,
+        fetchErrors,
+        tabbedPages: results,
+        discoveryWarnings: warnings,
+      },
+    };
+  }
+
+  const pagesWithTabs = successful.filter((r) => r.tabGroups.length > 0);
+  const totalGroupsFound = successful.reduce((sum, r) => sum + r.tabGroups.length, 0);
+  const overallStatus = worstStatus(successful.map((r) => r.status));
+  const pageLabel = wasSampled ? 'sampled pages' : 'pages';
+
+  let message: string;
+  if (totalGroupsFound === 0) {
+    message = `No tabbed content detected across ${successful.length} ${pageLabel}`;
+  } else if (overallStatus === 'pass') {
+    message = `${totalGroupsFound} tab group(s) across ${pagesWithTabs.length} of ${successful.length} ${pageLabel}; all serialize under 50K chars`;
+  } else if (overallStatus === 'warn') {
+    const worst = Math.max(...successful.map((r) => r.totalTabbedChars));
+    message = `${totalGroupsFound} tab group(s) found; worst page serializes to ${formatSize(worst)} chars (50K–100K)`;
+  } else {
+    const worst = Math.max(...successful.map((r) => r.totalTabbedChars));
+    message = `${totalGroupsFound} tab group(s) found; worst page serializes to ${formatSize(worst)} chars (over 100K)`;
+  }
+
+  if (fetchErrors > 0) {
+    message += `; ${fetchErrors} failed to fetch`;
+  }
-async function check(_ctx: CheckContext): Promise<CheckResult> {
   return {
-    id: 'tabbed-content-serialization',
-    category: 'content-structure',
-    status: 'skip',
-    message: 'Not yet implemented',
+    id,
+    category,
+    status: overallStatus,
+    message,
+    details: {
+      totalPages,
+      testedPages: results.length,
+      sampled: wasSampled,
+      pagesWithTabs: pagesWithTabs.length,
+      totalGroupsFound,
+      fetchErrors,
+      tabbedPages: results,
+      discoveryWarnings: warnings,
+    },
   };
 }
diff --git a/src/checks/index.ts b/src/checks/index.ts
index c8e0221..ecbfe0a 100644
--- a/src/checks/index.ts
+++ b/src/checks/index.ts
@@ -12,6 +12,7 @@ import './markdown-availability/markdown-url-support.js';
 import './markdown-availability/content-negotiation.js';
 
 // Category 3: Page Size
+import './page-size/rendering-strategy.js';
 import './page-size/page-size-markdown.js';
 import './page-size/page-size-html.js';
 import './page-size/content-start-position.js';
diff --git a/src/checks/llms-txt/llms-txt-exists.ts b/src/checks/llms-txt/llms-txt-exists.ts
index c813277..9631a26 100644
--- a/src/checks/llms-txt/llms-txt-exists.ts
+++ b/src/checks/llms-txt/llms-txt-exists.ts
@@ -1,4 +1,5 @@
 import { registerCheck } from '../registry.js';
+import { isCrossHostRedirect } from '../../helpers/to-md-urls.js';
 import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';
 
 /**
@@ -14,16 +15,6 @@ function getCandidateUrls(baseUrl: string, origin: string): string[] {
   return Array.from(candidates);
 }
 
-function isCrossHostRedirect(originalUrl: string, finalUrl: string): boolean {
-  try {
-    const original = new URL(originalUrl);
-    const final_ = new URL(finalUrl);
-    return original.host !== final_.host;
-  } catch {
-    return false;
-  }
-}
-
 async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
   const candidates = getCandidateUrls(ctx.baseUrl, ctx.origin);
   const discovered: DiscoveredFile[] = [];
@@ -155,6 +146,21 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
     details.redirectedOrigins = redirectedOrigins;
   }
 
+  // Set effectiveOrigin for downstream checks when content lives at a different host.
+  // Derive from redirect URLs on discovered files, or from the fallback redirectedOrigins.
+  if (!ctx.effectiveOrigin) {
+    const crossHostFile = discovered.find((f) => f.crossHostRedirect && f.redirectUrl);
+    if (crossHostFile?.redirectUrl) {
+      try {
+        ctx.effectiveOrigin = new URL(crossHostFile.redirectUrl).origin;
+      } catch {
+        /* ignore malformed */
+      }
+    } else if (redirectedOrigins.length > 0) {
+      ctx.effectiveOrigin = redirectedOrigins[0];
+    }
+  }
+
   if (discovered.length === 0) {
     const redirectNote =
       redirectedOrigins.length > 0
diff --git a/src/checks/observability/llms-txt-freshness.ts b/src/checks/observability/llms-txt-freshness.ts
index 91a2ed9..62a801a 100644
--- a/src/checks/observability/llms-txt-freshness.ts
+++ b/src/checks/observability/llms-txt-freshness.ts
@@ -1,12 +1,435 @@
 import { registerCheck } from '../registry.js';
+import {
+  getUrlsFromCachedLlmsTxt,
+  getUrlsFromSitemap,
+  parseSitemapUrls,
+} from '../../helpers/get-page-urls.js';
+import { isNonPageUrl } from '../../helpers/to-md-urls.js';
 import type { CheckContext, CheckResult } from '../../types.js';
 
-async function check(_ctx: CheckContext): Promise<CheckResult> {
+/**
+ * Normalize a URL to a canonical path for comparison.
+ * Strips trailing slashes, .md/.mdx/.html extensions, and /index variants,
+ * then lowercases the path.
+ */
+export function normalizeUrlPath(url: string): string {
+  try {
+    const parsed = new URL(url);
+    let path = parsed.pathname;
+
+    // Strip /index.md, /index.mdx, /index.html
+    path = path.replace(/\/index\.(?:md|mdx|html?)$/i, '/');
+
+    // Strip .md, .mdx, .html extensions
+    path = path.replace(/\.(?:md|mdx|html?)$/i, '');
+
+    // Strip trailing slash (but keep root /)
+    if (path.length > 1 && path.endsWith('/')) {
+      path = path.slice(0, -1);
+    }
+
+    return path.toLowerCase();
+  } catch {
+    return url.toLowerCase();
+  }
+}
+
+/**
+ * Path patterns that are unlikely to need llms.txt coverage.
+ * These are non-doc pages that commonly appear in sitemaps.
+ */
+const EXCLUDED_PATH_PATTERNS = [
+  /^\/blog(\/|$)/i,
+  /^\/changelog(\/|$)/i,
+  /^\/releases?(\/|$)/i,
+  /^\/pricing(\/|$)/i,
+  /^\/about(\/|$)/i,
+  /^\/careers?(\/|$)/i,
+  /^\/jobs?(\/|$)/i,
+  /^\/contact(\/|$)/i,
+  /^\/legal(\/|$)/i,
+  /^\/privacy(\/|$)/i,
+  /^\/terms(\/|$)/i,
+  /^\/security(\/|$)/i,
+  /^\/status(\/|$)/i,
+  /^\/login(\/|$)/i,
+  /^\/signup(\/|$)/i,
+  /^\/sign-up(\/|$)/i,
+  /^\/sign-in(\/|$)/i,
+  /^\/register(\/|$)/i,
+  /^\/404(\/|$)/i,
+  /^\/500(\/|$)/i,
+];
+
+export function isExcludedPath(normalizedPath: string, baseUrlPath?: string): boolean {
+  if (EXCLUDED_PATH_PATTERNS.some((pattern) => pattern.test(normalizedPath))) {
+    return true;
+  }
+  // Also check relative to the base path prefix (e.g. /docs/changelog → /changelog)
+  if (baseUrlPath && baseUrlPath !== '/' && normalizedPath.startsWith(baseUrlPath)) {
+    const relative = normalizedPath.slice(baseUrlPath.length) || '/';
+    if (EXCLUDED_PATH_PATTERNS.some((pattern) => pattern.test(relative))) {
+      return true;
+    }
+  }
+  return false;
+}
+
+/**
+ * Detect whether a URL set uses locale-prefixed paths and, if so, return the
+ * path segment position where locales appear.
+ *
+ * Detection is empirical: for each path segment position, count how many
+ * distinct 2-letter (or xx-yy) codes appear. If a position has ≥2 distinct
+ * codes and those codes cover >50% of URLs, it's a locale segment.
+ *
+ * Example: `/docs/en/intro` and `/docs/de/intro` → position 1 has codes
+ * `en` and `de` → locale position detected at index 1.
+ */
+export function detectLocalePosition(urls: string[]): number | null {
+  const positionCounts = new Map<number, Map<string, number>>();
+  const positionTotals = new Map<number, number>();
+
+  for (const url of urls) {
+    try {
+      const segments = new URL(url).pathname.split('/').filter(Boolean);
+      for (let i = 0; i < segments.length; i++) {
+        const seg = segments[i].toLowerCase();
+        if (/^[a-z]{2}(-[a-z]{2})?$/.test(seg)) {
+          if (!positionCounts.has(i)) positionCounts.set(i, new Map());
+          const counts = positionCounts.get(i)!;
+          counts.set(seg, (counts.get(seg) ?? 0) + 1);
+          positionTotals.set(i, (positionTotals.get(i) ?? 0) + 1);
+        }
+      }
+    } catch {
+      continue;
+    }
+  }
+
+  for (const [pos, counts] of positionCounts) {
+    if (counts.size < 2) continue;
+    const total = positionTotals.get(pos) ?? 0;
+    if (total > urls.length * 0.5) {
+      return pos;
+    }
+  }
+
+  return null;
+}
+
+/**
+ * Get the dominant value at a given path segment position across a URL set.
+ * Returns null if no consistent value is found.
+ */
+export function getDominantSegment(urls: string[], position: number): string | null {
+  const counts = new Map<string, number>();
+  for (const url of urls) {
+    try {
+      const segments = new URL(url).pathname.split('/').filter(Boolean);
+      if (segments.length > position) {
+        const seg = segments[position].toLowerCase();
+        counts.set(seg, (counts.get(seg) ?? 0) + 1);
+      }
+    } catch {
+      continue;
+    }
+  }
+
+  let dominant = '';
+  let dominantCount = 0;
+  for (const [seg, count] of counts) {
+    if (count > dominantCount) {
+      dominant = seg;
+      dominantCount = count;
+    }
+  }
+
+  // Only return if it covers >50% of the URLs
+  return dominantCount > urls.length * 0.5 ? dominant : null;
+}
+
+/**
+ * Filter URLs to only those whose path segment at `position` matches `locale`.
+ */
+function filterByLocale(urls: string[], locale: string, position: number): string[] {
+  return urls.filter((url) => {
+    try {
+      const segments = new URL(url).pathname.split('/').filter(Boolean);
+      return segments.length > position && segments[position].toLowerCase() === locale;
+    } catch {
+      return false;
+    }
+  });
+}
+
+/** Coverage thresholds */
+const COVERAGE_PASS = 0.95;
+const COVERAGE_WARN = 0.8;
+
+/**
+ * Maximum sitemap URLs to collect for freshness comparison.
+ * Higher than the default MAX_SITEMAP_URLS (500) used for page sampling,
+ * because freshness needs the full sitemap to produce meaningful coverage
+ * percentages. Enterprise docs sites (Stripe, MongoDB) can have thousands
+ * of pages.
+ */
+const MAX_FRESHNESS_SITEMAP_URLS = 50_000;
+
+/**
+ * Try to fetch a docs-specific sitemap at {baseUrl}/sitemap.xml.
+ * Many docs sites host their own sitemap that isn't referenced from robots.txt
+ * (e.g., Loops /docs/sitemap.xml, Supabase /docs/sitemap.xml).
+ */
+async function fetchDocsSitemap(ctx: CheckContext): Promise<string[]> {
+  const baseUrlPath = new URL(ctx.baseUrl).pathname.replace(/\/$/, '');
+  if (!baseUrlPath || baseUrlPath === '/') return [];
+
+  const docsSitemapUrl = `${ctx.origin}${baseUrlPath}/sitemap.xml`;
+  try {
+    const response = await ctx.http.fetch(docsSitemapUrl);
+    if (!response.ok) return [];
+    const xml = await response.text();
+    const parsed = parseSitemapUrls(xml);
+
+    // If it's a sitemap index, follow one level
+    if (parsed.sitemapIndexUrls.length > 0) {
+      const urls: string[] = [];
+      for (const subUrl of parsed.sitemapIndexUrls) {
+        try {
+          const subResp = await ctx.http.fetch(subUrl);
+          if (!subResp.ok) continue;
+          const subXml = await subResp.text();
+          const subParsed = parseSitemapUrls(subXml);
+          urls.push(...subParsed.urls);
+        } catch {
+          // Skip failed fetches
+        }
+      }
+      return urls;
+    }
+
+    return parsed.urls;
+  } catch {
+    return [];
+  }
+}
+
+/**
+ * Scope URLs to the baseUrl path prefix and same origin.
+ */
+function scopeUrls(urls: string[], origin: string, baseUrlPath: string): string[] {
+  return urls.filter((url) => {
+    try {
+      const parsed = new URL(url);
+      if (parsed.origin !== origin) return false;
+      if (baseUrlPath && baseUrlPath !== '/') {
+        if (!parsed.pathname.startsWith(baseUrlPath + '/') && parsed.pathname !== baseUrlPath) {
+          return false;
+        }
+      }
+      if (isNonPageUrl(url)) return false;
+      return true;
+    } catch {
+      return false;
+    }
+  });
+}
+
+async function check(ctx: CheckContext): Promise<CheckResult> {
+  const id = 'llms-txt-freshness';
+  const category = 'observability';
+
+  // 1. Get llms.txt page URLs (with progressive disclosure walking)
+  const llmsTxtUrls = await getUrlsFromCachedLlmsTxt(ctx);
+  if (llmsTxtUrls.length === 0) {
+    return {
+      id,
+      category,
+      status: 'skip',
+      message: 'No page URLs found in llms.txt',
+    };
+  }
+
+  // 2. Get sitemap URLs, with docs-specific sitemap fallback
+  // Use effectiveOrigin when a cross-host redirect was detected, so that
+  // sitemap URLs at the redirected host are accepted rather than filtered out.
+  const effectiveOrigin = ctx.effectiveOrigin ?? ctx.origin;
+  const sitemapWarnings: string[] = [];
+  let sitemapUrls = await getUrlsFromSitemap(
+    ctx,
+    sitemapWarnings,
+    MAX_FRESHNESS_SITEMAP_URLS,
+    effectiveOrigin,
+  );
+  let sitemapSource = 'robots.txt/sitemap.xml';
+  const baseUrlPath = new URL(ctx.baseUrl).pathname.replace(/\/$/, '');
+
+  // Check if main sitemap has any docs URLs
+  let scopedSitemapUrls = scopeUrls(sitemapUrls, effectiveOrigin, baseUrlPath);
+
+  // If the main sitemap has no docs URLs, try a docs-specific sitemap
+  if (scopedSitemapUrls.length === 0 && baseUrlPath && baseUrlPath !== '/') {
+    const docsSitemapUrls = await fetchDocsSitemap(ctx);
+    if (docsSitemapUrls.length > 0) {
+      sitemapUrls = docsSitemapUrls;
+      scopedSitemapUrls = scopeUrls(docsSitemapUrls, effectiveOrigin, baseUrlPath);
+      sitemapSource = `${baseUrlPath}/sitemap.xml`;
+    }
+  }
+
+  if (sitemapUrls.length === 0) {
+    return {
+      id,
+      category,
+      status: 'skip',
+      message:
+        'No sitemap found; cannot assess llms.txt freshness without a sitemap as ground truth',
+      details: { sitemapWarnings },
+    };
+  }
+
+  if (scopedSitemapUrls.length === 0) {
+    return {
+      id,
+      category,
+      status: 'skip',
+      message: `Sitemap has ${sitemapUrls.length} URLs but none are under the docs path prefix (${baseUrlPath || '/'})`,
+      details: {
+        totalSitemapUrls: sitemapUrls.length,
+        baseUrlPath: baseUrlPath || '/',
+        sitemapWarnings,
+      },
+    };
+  }
+
+  // 2b. Locale filtering: if the sitemap uses locale-prefixed paths (e.g. /docs/en/,
+  // /docs/de/), filter to the same locale as the llms.txt URLs. This avoids
+  // penalizing sites for not listing every localized variant in llms.txt.
+  let localeFiltered = false;
+  let detectedLocale: string | null = null;
+  const localePosition = detectLocalePosition(scopedSitemapUrls);
+
+  if (localePosition !== null) {
+    const llmsLocale = getDominantSegment(llmsTxtUrls, localePosition);
+    if (llmsLocale) {
+      detectedLocale = llmsLocale;
+      const before = scopedSitemapUrls.length;
+      scopedSitemapUrls = filterByLocale(scopedSitemapUrls, llmsLocale, localePosition);
+      localeFiltered = scopedSitemapUrls.length < before;
+    }
+  }
+
+  // 3. Normalize both sets for comparison
+  const llmsNormalized = new Set(llmsTxtUrls.map(normalizeUrlPath));
+  const sitemapNormalized = new Map<string, string>(); // normalized -> original URL
+  for (const url of scopedSitemapUrls) {
+    const norm = normalizeUrlPath(url);
+    if (!isExcludedPath(norm, baseUrlPath)) {
+      sitemapNormalized.set(norm, url);
+    }
+  }
+
+  const excludedCount = scopedSitemapUrls.length - sitemapNormalized.size;
+
+  // 4. Missing coverage: in sitemap but not in llms.txt
+  const missingFromLlmsTxt: string[] = [];
+  for (const [norm, originalUrl] of sitemapNormalized) {
+    if (!llmsNormalized.has(norm)) {
+      missingFromLlmsTxt.push(originalUrl);
+    }
+  }
+
+  // 5. Unmatched llms.txt links: in llms.txt but not in sitemap
+  // This could mean either (a) the page was removed (truly stale) or
+  // (b) the sitemap is incomplete. We report it but don't use it to
+  // determine the overall status since we can't distinguish the two
+  // without fetching every URL (which llms-txt-links-resolve handles).
+  const sitemapNormalizedSet = new Set(sitemapNormalized.keys());
+  const unmatchedLlmsTxtUrls: string[] = [];
+  for (const url of llmsTxtUrls) {
+    const norm = normalizeUrlPath(url);
+    // Only check URLs under the same origin and path prefix
+    try {
+      const parsed = new URL(url);
+      if (parsed.origin !== effectiveOrigin) continue;
+      if (
+        baseUrlPath &&
+        baseUrlPath !== '/' &&
+        !parsed.pathname.startsWith(baseUrlPath + '/') &&
+        parsed.pathname !== baseUrlPath
+      ) {
+        continue;
+      }
+    } catch {
+      continue;
+    }
+    if (isExcludedPath(norm, baseUrlPath)) continue;
+    if (!sitemapNormalizedSet.has(norm)) {
+      unmatchedLlmsTxtUrls.push(url);
+    }
+  }
+
+  // 6. Compute metrics
+  const sitemapDocPages = sitemapNormalized.size;
+  const coveredCount = sitemapDocPages - missingFromLlmsTxt.length;
+  const coverageRate = sitemapDocPages > 0 ? coveredCount / sitemapDocPages : 1;
+  const unmatchedRate =
+    llmsTxtUrls.length > 0 ? unmatchedLlmsTxtUrls.length / llmsTxtUrls.length : 0;
+
+  const coveragePct = Math.round(coverageRate * 100);
+  const unmatchedPct = Math.round(unmatchedRate * 100);
+
+  // 7. Determine status based on coverage only
+  // Unmatched links are informational (see note in step 5)
+  let overallStatus: 'pass' | 'warn' | 'fail';
+  if (coverageRate >= COVERAGE_PASS) {
+    overallStatus = 'pass';
+  } else if (coverageRate >= COVERAGE_WARN) {
+    overallStatus = 'warn';
+  } else {
+    overallStatus = 'fail';
+  }
+
+  // 8. Build message
+  const parts: string[] = [];
+  if (overallStatus === 'pass') {
+    parts.push(`llms.txt covers ${coveragePct}% of ${sitemapDocPages} sitemap doc pages`);
+  } else {
+    parts.push(
+      `llms.txt covers ${coveredCount}/${sitemapDocPages} sitemap doc pages (${coveragePct}%); ${missingFromLlmsTxt.length} missing`,
+    );
+  }
+  if (unmatchedLlmsTxtUrls.length > 0) {
+    parts.push(
+      `${unmatchedLlmsTxtUrls.length} llms.txt links not in sitemap (may indicate stale links or incomplete sitemap)`,
+    );
+  }
+
+  const message = parts.join('; ');
+
   return {
-    id: 'llms-txt-freshness',
-    category: 'observability',
-    status: 'skip',
-    message: 'Not yet implemented',
+    id,
+    category,
+    status: overallStatus,
+    message,
+    details: {
+      llmsTxtPageCount: llmsTxtUrls.length,
+      sitemapTotal: sitemapUrls.length,
+      sitemapScoped: scopedSitemapUrls.length,
+      sitemapDocPages,
+      sitemapSource,
+      excludedNonDocPages: excludedCount,
+      ...(localeFiltered ? { localeFiltered: true, detectedLocale } : {}),
+      baseUrlPath: baseUrlPath || '/',
+      coverageRate: coveragePct,
+      missingFromLlmsTxt: missingFromLlmsTxt.slice(0, 50),
+      missingCount: missingFromLlmsTxt.length,
+      unmatchedLlmsTxtUrls: unmatchedLlmsTxtUrls.slice(0, 50),
+      unmatchedCount: unmatchedLlmsTxtUrls.length,
+      unmatchedPct,
+      sitemapWarnings,
+    },
   };
 }
diff --git a/src/checks/observability/markdown-content-parity.ts b/src/checks/observability/markdown-content-parity.ts
index 354a43c..414d64f 100644
--- a/src/checks/observability/markdown-content-parity.ts
+++ b/src/checks/observability/markdown-content-parity.ts
@@ -1,12 +1,672 @@
+import { parse } from 'node-html-parser';
 import { registerCheck } from '../registry.js';
-import type { CheckContext, CheckResult } from '../../types.js';
+import { fetchPage } from '../../helpers/fetch-page.js';
+import type { CheckContext, CheckResult, CheckStatus } from '../../types.js';
+
+/** Thresholds for the percentage of HTML segments not found in markdown.
+ */
+const WARN_THRESHOLD = 5;
+const FAIL_THRESHOLD = 20;
+
+/** Minimum character length for a text segment to be considered meaningful. */
+const MIN_SEGMENT_LENGTH = 20;
+
+/**
+ * Minimum number of unique HTML segments required for a meaningful comparison.
+ * Pages below this threshold auto-pass because the percentage is too volatile
+ * (e.g., 3 breadcrumb items on a 10-segment page = 30% "missing").
+ */
+const MIN_SEGMENTS_FOR_COMPARISON = 10;
+
+/** HTML tags to strip before extracting text (non-content chrome). */
+const STRIP_TAGS = [
+  'script',
+  'style',
+  'nav',
+  'footer',
+  'header',
+  'noscript',
+  'button',
+  'svg',
+  'aside',
+];
+
+/** CSS selectors for common doc-site chrome that lives inside <main>. */
+const STRIP_SELECTORS = [
+  '[aria-label="breadcrumb"]',
+  '[aria-label="pagination"]',
+  '[class*="breadcrumb"]',
+  '[class*="pagination"]',
+  '[class*="prev-next"]',
+  '[class*="prevnext"]',
+  '[class*="page-nav"]',
+  '[class*="feedback"]',
+  '[class*="helpful"]',
+  '[class*="table-of-contents"]',
+  '[class*="toc"]',
+  '[rel="prev"]',
+  '[rel="next"]',
+  '.sr-only',
+];
+
+/**
+ * Segment-level patterns for common non-content text that survives DOM stripping.
+ * Matched against normalized (lowercased, whitespace-collapsed) segments.
+ */
+const NOISE_PATTERNS = [
+  /^last updated/,
+  /^was this page helpful/,
+  /^thank you for your feedback/,
+  /^previous\s+\S.*next\s+\S/, // "Previous X Next Y" pagination
+  /^start from the beginning$/,
+  /^join our .* server/, // "Join our Discord Server..."
+  /^loading video content/,
+  /^\/.+\/.+/, // breadcrumb paths like "/Connect to Neon/..."
+];
+
+interface PageParityResult {
+  url: string;
+  markdownSource: string;
+  status: CheckStatus;
+  /** Percentage of HTML text segments not found in the markdown version. */
+  missingPercent: number;
+  /** Total meaningful text segments extracted from HTML. */
+  totalSegments: number;
+  /** Number of HTML segments not found in the markdown. */
+  missingSegments: number;
+  /** Sample of missing segments for diagnostics. */
+  sampleDiffs: string[];
+  error?: string;
+}
+
+/**
+ * Known HTML tag names used to distinguish real tags from angle-bracket
+ * placeholders in code examples.
+ * Only needs to cover tags that appear in node-html-parser's .text output
+ * (i.e., tags inside <pre> that survive as raw text).
+ */
+const HTML_TAG_NAMES = new Set([
+  'a',
+  'abbr',
+  'address',
+  'article',
+  'aside',
+  'audio',
+  'b',
+  'bdi',
+  'bdo',
+  'blockquote',
+  'body',
+  'br',
+  'button',
+  'canvas',
+  'caption',
+  'cite',
+  'code',
+  'col',
+  'colgroup',
+  'data',
+  'dd',
+  'del',
+  'details',
+  'dfn',
+  'dialog',
+  'div',
+  'dl',
+  'dt',
+  'em',
+  'embed',
+  'fieldset',
+  'figcaption',
+  'figure',
+  'footer',
+  'form',
+  'h1',
+  'h2',
+  'h3',
+  'h4',
+  'h5',
+  'h6',
+  'head',
+  'header',
+  'hr',
+  'html',
+  'i',
+  'iframe',
+  'img',
+  'input',
+  'ins',
+  'kbd',
+  'label',
+  'legend',
+  'li',
+  'link',
+  'main',
+  'map',
+  'mark',
+  'meta',
+  'meter',
+  'nav',
+  'noscript',
+  'object',
+  'ol',
+  'optgroup',
+  'option',
+  'output',
+  'p',
+  'param',
+  'picture',
+  'pre',
+  'progress',
+  'q',
+  'rp',
+  'rt',
+  'ruby',
+  's',
+  'samp',
+  'script',
+  'section',
+  'select',
+  'slot',
+  'small',
+  'source',
+  'span',
+  'strong',
+  'style',
+  'sub',
+  'summary',
+  'sup',
+  'table',
+  'tbody',
+  'td',
+  'template',
+  'textarea',
+  'tfoot',
+  'th',
+  'thead',
+  'time',
+  'title',
+  'tr',
+  'track',
+  'u',
+  'ul',
+  'var',
+  'video',
+  'wbr',
+]);
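The whitelist approach above can be sketched in isolation. This is a hypothetical helper (`isRealHtmlTag` and the reduced tag set are illustrative, not from the patch): an angle-bracket token is treated as a real tag only if its tag name appears in the known-tags set, which lets placeholder tokens in code samples pass through untouched.

```typescript
// Hypothetical sketch: classify an angle-bracket token as a real HTML tag
// or a code-example placeholder. Uses a small subset of the full tag set.
const KNOWN_TAGS = new Set(['a', 'div', 'p', 'code', 'pre', 'span']);

function isRealHtmlTag(token: string): boolean {
  // Strip '<', optional '/', then capture the tag name (hyphens allowed,
  // so custom-element-style placeholders are captured whole).
  const match = token.match(/^<\/?([a-zA-Z][a-zA-Z0-9-]*)/);
  if (!match) return false;
  return KNOWN_TAGS.has(match[1].toLowerCase());
}

console.log(isRealHtmlTag('<div class="x">')); // true
console.log(isRealHtmlTag('<your-api-key>')); // false
```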
+
+/** Block-level HTML elements that should produce line breaks in extracted text. */
+const BLOCK_TAGS = new Set([
+  'p',
+  'div',
+  'h1',
+  'h2',
+  'h3',
+  'h4',
+  'h5',
+  'h6',
+  'li',
+  'tr',
+  'td',
+  'th',
+  'blockquote',
+  'pre',
+  'dt',
+  'dd',
+  'figcaption',
+  'section',
+  'article',
+  'details',
+  'summary',
+  'br',
+  'hr',
+]);
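The block-tag set above drives where line breaks are inserted during text extraction, so each paragraph or list item becomes its own comparable segment. A simplified regex-based stand-in (the real check walks a parsed DOM via node-html-parser; `textWithBreaks` and the reduced tag list are hypothetical):

```typescript
// Sketch: insert a newline at block-element boundaries, then strip all
// remaining tags, so block-level content splits into separate segments.
const BLOCKS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'br'];

function textWithBreaks(html: string): string[] {
  const blockBoundary = new RegExp(`</?(?:${BLOCKS.join('|')})\\b[^>]*>`, 'gi');
  const withBreaks = html.replace(blockBoundary, '\n');
  const stripped = withBreaks.replace(/<[^>]+>/g, ''); // inline tags vanish without breaks
  return stripped
    .split('\n')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

console.log(textWithBreaks('<p>First para</p><ul><li>Item <em>one</em></li></ul>'));
// → ['First para', 'Item one']
```

Note how the inline `<em>` disappears without splitting its segment, while the block-level `<p>` and `<li>` boundaries produce separate lines.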
+
+/**
+ * Minimum link density (0–1) and minimum link count for an element to be
+ * classified as navigation chrome. Navigation panels are structurally
+ * distinguishable from content: they consist almost entirely of links with
+ * very little non-link text between them. Content sections, even link-heavy
+ * ones like "Related resources", include enough description text to stay
+ * well below this threshold.
+ */
+const NAV_LINK_DENSITY_THRESHOLD = 0.7;
+const NAV_MIN_LINK_COUNT = 10;
+
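The two thresholds above combine as a conjunction: an element is classified as navigation chrome only when it has at least `NAV_MIN_LINK_COUNT` links *and* link text makes up at least `NAV_LINK_DENSITY_THRESHOLD` of its total text. A standalone sketch (the function name and precomputed character counts are illustrative; the real check derives them from the parsed DOM):

```typescript
// Sketch of the link-density heuristic: both conditions must hold,
// so link-heavy content sections with real description text survive.
const LINK_DENSITY_THRESHOLD = 0.7;
const MIN_LINK_COUNT = 10;

function looksLikeNav(linkTextChars: number, totalTextChars: number, linkCount: number): boolean {
  if (linkCount < MIN_LINK_COUNT) return false;
  if (totalTextChars === 0) return false;
  return linkTextChars / totalTextChars >= LINK_DENSITY_THRESHOLD;
}

// A sidebar: 24 links, nearly all text inside links.
console.log(looksLikeNav(950, 1000, 24)); // true
// A "Related resources" section: few links, mostly description text.
console.log(looksLikeNav(180, 900, 6)); // false
```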
+/**
+ * Extract plain text from HTML, stripping chrome elements.
+ * Inserts newlines between block-level elements so that paragraphs,
+ * list items, etc. become separate lines in the output.
+ */
+/**
+ * Heuristic selectors for content containers, tried in order when
+ * <main> and <article> are not present. Common across doc platforms
+ * like Mintlify, ReadMe, Docusaurus/Starlight, and custom sites.
+ */
+const CONTENT_SELECTORS = [
+  '[role="main"]',
+  '#content',
+  '.sl-markdown-content',
+  '.markdown-content',
+  '.markdown-body',
+  '.docs-content',
+  '.doc-content',
+  '.main-pane',
+  '.page-content',
+  '.prose',
+];
+
+function extractHtmlText(html: string): string {
+  const root = parse(html);
+
+  // Prefer the tightest content container available.
+  // Priority: heuristic selector inside article/main > article inside main
+  // > article > heuristic selector inside main > main > heuristic on root > body
+  const main = root.querySelector('main');
+  const article = main?.querySelector('article') ?? root.querySelector('article');
+  let content: ReturnType<typeof root.querySelector> = null;
+
+  // Look for a heuristic content selector inside the best semantic container
+  const semanticContainer = article ?? main;
+  if (semanticContainer) {
+    for (const selector of CONTENT_SELECTORS) {
+      content = semanticContainer.querySelector(selector);
+      if (content) break;
+    }
+  }
+  // Fall back to the semantic container itself
+  if (!content) content = semanticContainer;
+
+  // If no semantic container, try heuristic selectors on the root
+  if (!content) {
+    for (const selector of CONTENT_SELECTORS) {
+      content = root.querySelector(selector);
+      if (content) break;
+    }
+  }
+
+  if (!content) content = root.querySelector('body');
+  if (!content) return root.text;
+
+  // Remove non-content elements by tag
+  for (const tag of STRIP_TAGS) {
+    for (const el of content.querySelectorAll(tag)) {
+      el.remove();
+    }
+  }
+
+  // Remove common doc-site chrome by CSS selector
+  for (const selector of STRIP_SELECTORS) {
+    for (const el of content.querySelectorAll(selector)) {
+      el.remove();
+    }
+  }
+
+  // Remove elements that look like navigation based on link density.
+  // Navigation panels (sidebars, header menus) are structurally distinct
+  // from content: they consist almost entirely of links. This catches
+  // nav-like elements that use <div> instead of