[code-infra] Rewrite broken links checker #832
Conversation
Bundle size report
Check out the code infra dashboard for more information about this PR.
```js
expectNotIssue(result.issues, { link: { href: '/nested/page.html' } });

// Verify that external links are not reported
expectNotIssue(result.issues, { link: { href: 'https://example.com' } });
```
We should at least report when an external link is broken or returns a 404.
For now we're not fetching external links at all; we can rely on ahrefs for that. Fetching them would greatly impact performance (the whole crawl currently runs in a few seconds), and it would also greatly increase complexity to do correctly (proper per-origin concurrency, etc.).
I mainly meant checking whether that particular link returns a 404. It could be done with a HEAD request instead of fetching the whole page; there's no need to process the contents.
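For illustration, a minimal sketch of what such a HEAD-based check could look like (the helper name and the 405 fallback behavior are assumptions, not code from this PR):

```js
// Hypothetical sketch: check that an external link resolves without downloading the body.
async function externalLinkIsOk(href) {
  const response = await fetch(href, { method: 'HEAD', redirect: 'follow' });
  // Some servers reject HEAD outright; a real check would likely fall back to GET on 405.
  return response.ok || response.status === 405;
}
```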
The crawler doesn't fetch external links for now. Reasons:
- It would make the crawl too flaky.
- It wouldn't be idempotent: this crawl runs during the docs build, so building the same commit today vs. in a year should behave the same regardless of external links we don't control.
- It would make the crawl too slow. Each external fetch is orders of magnitude slower than an internal one; the whole Material docs site currently crawls locally in about 3 seconds.
- ahrefs is already performing this check out-of-band.
- Done correctly, it adds significant complexity, e.g. per-origin concurrency limits that respect external robots.txt files.
- Many websites implement HEAD internally as a GET without a body, so a HEAD request wouldn't improve performance much.
- We need to fetch the content anyway if we want to resolve a hash target to an id; that is currently the main advantage over e.g. ahrefs for internal links (see the sketch after this list).
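To illustrate that last point, here is a rough sketch of the kind of check that requires page content. cheerio and the helper name are used for illustration only and may not match the PR's actual implementation:

```js
import * as cheerio from 'cheerio';

// Hypothetical sketch: verify that a link's '#fragment' points at an id (or anchor name)
// that actually exists in the fetched target page.
function hashTargetExists(html, hash) {
  const $ = cheerio.load(html);
  const id = decodeURIComponent(hash.slice(1)); // strip the leading '#'
  return $(`[id="${id}"]`).length > 0 || $(`a[name="${id}"]`).length > 0;
}
```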
```js
 * @param {CrawlOptions} rawOptions
 * @returns {Promise<CrawlResult>}
 */
export async function crawl(rawOptions)
```
We should also expose this as a CLI so that it's easier to set up in downstream repos where no major configuration is required.
Most of the options I see here are simple strings, which would work well with a CLI too.
For now all downstream repos consume this as a lib to keep it easily configurable. I think over time the complexity will add up; we could add a CLI when the need arises?
I want to avoid complex config file resolution like the bundle size checker has, and keep it easy to expand the filtering and ignoring logic to accept functions.
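For context, a downstream script consuming it as a lib could look roughly like this (the import path and option values are assumptions for illustration; the option names follow the CrawlOptions typedef quoted below):

```js
// Hypothetical downstream docs script.
import { crawl } from '@mui/internal-code-infra/brokenLinksChecker';

const result = await crawl({
  ignoredPaths: [/^\/blog\//],
  ignoredTargets: new Set(['#main-content']),
});

// Fail the build when issues are found.
if (result.issues.length > 0) {
  process.exitCode = 1;
}
```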
```js
 * @property {RegExp[]} [ignoredPaths]
 * @property {string[]} [ignoredContent]
 * @property {Set<string>} [ignoredTargets]
 * @property {Map<string, Set<string>>} [knownTargets]
 * @property {string[]} [knownTargetsDownloadUrl]
```
It would be helpful to have descriptions of these options when they are used in a downstream script.
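For example, the typedef could carry short descriptions like the following (the wording is guessed from the option names, not taken from the PR, and only a few options are shown):

```js
/**
 * @property {RegExp[]} [ignoredPaths] - URL paths the crawler skips entirely.
 * @property {Set<string>} [ignoredTargets] - Hash targets that are never validated.
 * @property {Map<string, Set<string>>} [knownTargets] - Pre-resolved target ids per page, for pages that aren't crawled.
 */
```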
```js
crawledPages.set(pageUrl, pagePromise);

await pagePromise;
```
Why await here when it's also awaited on line 536?
Because the queue handles concurrency, the task promise shouldn't resolve until the page is completely finished. In principle the promises in crawledPages at line 536 will all have settled by that point; the await on 536 serves mostly to convert the promises back to their resolved values. The task promise serves both as a vessel for the result and as a placeholder for deduplication.
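A minimal sketch of that pattern, assuming a p-queue-style queue (whether the PR actually uses p-queue, and the helper names, are assumptions):

```js
import PQueue from 'p-queue';

const queue = new PQueue({ concurrency: 8 });
const crawledPages = new Map();

function crawlPage(pageUrl) {
  let pagePromise = crawledPages.get(pageUrl);
  if (!pagePromise) {
    // The stored promise carries the result and also acts as a placeholder,
    // so concurrent discoveries of the same URL deduplicate to a single task.
    // queue.add() limits how many pages are processed at once.
    pagePromise = queue.add(() => fetchAndProcessPage(pageUrl)); // fetchAndProcessPage is hypothetical
    crawledPages.set(pageUrl, pagePromise);
  }
  return pagePromise;
}
```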
```js
// Derive counts from issues
const brokenLinks = issues.filter((issue) => issue.type === 'broken-link').length;
const brokenLinkTargets = issues.filter((issue) => issue.type === 'broken-target').length;

const endTime = Date.now();
const durationSeconds = (endTime - startTime) / 1000;
const duration = new Intl.NumberFormat('en-US', {
  style: 'unit',
  unit: 'second',
  maximumFractionDigits: 2,
}).format(durationSeconds);
console.log(chalk.blue(`\nCrawl completed in ${duration}`));
console.log(` Total links found: ${chalk.cyan(crawledLinks.size)}`);
console.log(` Total broken links: ${chalk.cyan(brokenLinks)}`);
console.log(` Total broken link targets: ${chalk.cyan(brokenLinkTargets)}`);
if (options.outPath) {
  console.log(chalk.blue(`Output written to: ${options.outPath}`));
}
```
Based on this, the link checker should definitely be a CLI as well.
If not, maybe just expose a separate method that does all the reporting instead of logging here directly.
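Something along these lines, for example (reportCrawlResult is a hypothetical name; the counts mirror the snippet quoted above):

```js
import chalk from 'chalk';

// Hypothetical split: crawl() only returns data, and callers opt into logging.
export function reportCrawlResult(result, { outPath } = {}) {
  const brokenLinks = result.issues.filter((issue) => issue.type === 'broken-link').length;
  const brokenLinkTargets = result.issues.filter((issue) => issue.type === 'broken-target').length;
  console.log(` Total broken links: ${chalk.cyan(brokenLinks)}`);
  console.log(` Total broken link targets: ${chalk.cyan(brokenLinkTargets)}`);
  if (outPath) {
    console.log(chalk.blue(`Output written to: ${outPath}`));
  }
}
```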
What's the point of adding a CLI if we're never gonna call it? Why not add it at the time we need it?
Rewrite of the broken links checker which allows us to:
Added a single e2e-style test which covers most functionality.