
Conversation

@Janpot Janpot (Member) commented Oct 23, 2025

Rewrite of the broken links checker which allows us to:

  • cover more use cases by checking hash links
  • cover more use cases by operating on a running server instead of markdown only
  • get rid of @mui/monorepo usage in broken links reporting in MUI X

Added a single e2e-style test that covers most of the functionality.

@mui-bot commented Oct 23, 2025

Bundle size report

| Bundle | Parsed size | Gzip size |
| --- | --- | --- |
| @base-ui-components/react | 0B (0.00%) | 0B (0.00%) |
| @mui/x-charts-pro | 0B (0.00%) | 0B (0.00%) |

Details of bundle changes


Check out the code infra dashboard for more information about this PR.

@github-actions github-actions bot added the PR: out-of-date The pull request has merge conflicts and can't be merged. label Oct 26, 2025
@oliviertassinari oliviertassinari temporarily deployed to brokenLinksChecker - mui-tools-public PR #832 October 29, 2025 09:53 — with Render Destroyed
@oliviertassinari oliviertassinari temporarily deployed to brokenLinksChecker - mui-tools-public PR #832 October 29, 2025 09:54 — with Render Destroyed
@Janpot Janpot marked this pull request as ready for review October 29, 2025 09:54
@Janpot Janpot requested a review from a team October 29, 2025 09:54
@oliviertassinari oliviertassinari temporarily deployed to brokenLinksChecker - mui-tools-public PR #832 October 29, 2025 09:55 — with Render Destroyed
expectNotIssue(result.issues, { link: { href: '/nested/page.html' } });

// Verify that external links are not reported
expectNotIssue(result.issues, { link: { href: 'https://example.com' } });
Contributor

We should at least report when an external link is broken or returns a 404.

Member Author

For now we're not fetching external links at all; we can rely on ahrefs for that. Fetching them would greatly impact performance (the whole crawl currently runs in a few seconds), and it would also greatly increase complexity when done correctly (proper per-origin concurrency, etc.).

Contributor

I mainly meant checking whether that particular link returns a 404. It could be done with a HEAD request instead of fetching the whole page; there's no need to process the contents.

@Janpot Janpot (Member Author) Oct 29, 2025

The crawler doesn't fetch external links for now. Reasons:

  • it would make the crawl too flaky
  • it wouldn't be idempotent: this crawl runs during the docs build. If I run the build today vs. in a year on the same commit, it should work regardless of external links that we don't control.
  • it would make the crawl too slow. Each external fetch is orders of magnitude slower than an internal one. It currently runs locally in 3 seconds for the whole Material docs site.
  • ahrefs is already performing this check out-of-band
  • done correctly, it increases complexity because concurrency has to be throttled per URL origin according to external robots.txt files
  • many websites implement HEAD internally as a GET without a body, so a HEAD request wouldn't help performance much
  • we need to fetch the content anyway if we want to resolve a hash target to an id. This is currently the main advantage over e.g. ahrefs for internal links.
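The last point is why a HEAD request isn't enough for hash links: the fragment has to match an element id in the target page's body. A minimal sketch of that idea (hypothetical helper names, not the PR's actual implementation):

```javascript
// Collect all id="..." attributes from a page's HTML.
// (Sketch only: a real crawler would use an HTML parser, not a regex.)
function collectTargets(html) {
  const ids = new Set();
  for (const match of html.matchAll(/\bid="([^"]+)"/g)) {
    ids.add(match[1]);
  }
  return ids;
}

// Check whether a link's hash resolves to an id on the target page.
function hashResolves(href, targetHtml) {
  const { hash } = new URL(href, 'http://localhost');
  if (!hash) {
    return true; // no fragment, nothing to resolve
  }
  return collectTargets(targetHtml).has(hash.slice(1));
}

const html = '<h2 id="installation">Installation</h2><p id="usage">Usage</p>';
console.log(hashResolves('/docs/page#installation', html)); // true
console.log(hashResolves('/docs/page#missing', html)); // false
```

A HEAD response carries no body, so `collectTargets` would have nothing to scan; the check fundamentally requires a GET.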

* @param {CrawlOptions} rawOptions
* @returns {Promise<CrawlResult>}
*/
export async function crawl(rawOptions) {
Contributor

We should also expose this as a CLI so that it's easier to set up in downstream repos where no major configuration is required.
Most of the options I see here are simple strings, which would work well with a CLI too.

@Janpot Janpot (Member Author) Oct 29, 2025

For now, all downstream repos consume this as a library to keep it easily configurable. I think the complexity will add up over time. We could add a CLI when the need arises?

I want to avoid complex config file resolution like the bundle size checker has, and I want the filtering and ignoring logic to stay easy to extend, e.g. by accepting functions.

Comment on lines 251 to 255
* @property {RegExp[]} [ignoredPaths]
* @property {string[]} [ignoredContent]
* @property {Set<string>} [ignoredTargets]
* @property {Map<string, Set<string>>} [knownTargets]
* @property {string[]} [knownTargetsDownloadUrl]
Contributor

It would be helpful to have descriptions of these when they're being used in a downstream script.
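A sketch of what documented options could look like — the property names come from the diff above, but the descriptions are my guesses at their intent, not the PR's actual docs:

```js
/**
 * @typedef {Object} CrawlOptions
 * @property {RegExp[]} [ignoredPaths] Page paths the crawler skips entirely. (assumed)
 * @property {string[]} [ignoredContent] Selectors for page regions whose links are not checked. (assumed)
 * @property {Set<string>} [ignoredTargets] Hash targets never reported as broken. (assumed)
 * @property {Map<string, Set<string>>} [knownTargets] Pre-resolved map of page URL to the ids it contains. (assumed)
 * @property {string[]} [knownTargetsDownloadUrl] URLs from which known-target maps can be downloaded. (assumed)
 */
```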


crawledPages.set(pageUrl, pagePromise);

await pagePromise;
Contributor

Why await here when it's also awaited on line 536?

@Janpot Janpot (Member Author) Oct 29, 2025

Because the queue handles concurrency, the task promise shouldn't resolve until the work is completely finished. In principle, the promises in crawledPages at line 536 will all have settled by that point; the await at line 536 serves mostly to convert the promises back into their resolved values. The task promise serves both as a vessel for the result and as a placeholder for deduplication.
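The promise-as-placeholder pattern described here can be sketched like this (hypothetical names, not the PR's actual code): a Map of URL to Promise both deduplicates concurrent crawls of the same page and later yields the result.

```javascript
const crawledPages = new Map();
let fetchCount = 0;

async function crawlPage(url) {
  if (crawledPages.has(url)) {
    // Placeholder role: a second request for the same URL reuses the
    // in-flight promise instead of starting duplicate work.
    return crawledPages.get(url);
  }
  const pagePromise = (async () => {
    fetchCount += 1; // stand-in for the real fetch + parse work
    return { url, links: [] };
  })();
  crawledPages.set(url, pagePromise);
  // Returning/awaiting the task promise keeps the queue's concurrency
  // accounting honest: the task isn't "done" until the page is processed.
  return pagePromise;
}

async function main() {
  // Two concurrent crawls of '/a' resolve to a single piece of work.
  await Promise.all([crawlPage('/a'), crawlPage('/a'), crawlPage('/b')]);
  console.log(fetchCount); // 2
}
main();
```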

Comment on lines +599 to +616
// Derive counts from issues
const brokenLinks = issues.filter((issue) => issue.type === 'broken-link').length;
const brokenLinkTargets = issues.filter((issue) => issue.type === 'broken-target').length;

const endTime = Date.now();
const durationSeconds = (endTime - startTime) / 1000;
const duration = new Intl.NumberFormat('en-US', {
style: 'unit',
unit: 'second',
maximumFractionDigits: 2,
}).format(durationSeconds);
console.log(chalk.blue(`\nCrawl completed in ${duration}`));
console.log(` Total links found: ${chalk.cyan(crawledLinks.size)}`);
console.log(` Total broken links: ${chalk.cyan(brokenLinks)}`);
console.log(` Total broken link targets: ${chalk.cyan(brokenLinkTargets)}`);
if (options.outPath) {
console.log(chalk.blue(`Output written to: ${options.outPath}`));
}
Contributor

Based on this, the link checker should definitely also be a CLI.

@brijeshb42 brijeshb42 (Contributor) Oct 29, 2025

If not, maybe just expose another method that does all the reporting, instead of reporting here directly.
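The separation suggested here could look roughly like this: `crawl()` returns data only, and a separate function renders the summary, so library consumers can skip or replace it. The `report` function and the result field names below are assumptions for illustration, not the PR's actual API:

```javascript
// Render a crawl summary as text instead of logging inside crawl().
function report(result) {
  const brokenLinks = result.issues.filter((i) => i.type === 'broken-link').length;
  const brokenTargets = result.issues.filter((i) => i.type === 'broken-target').length;
  return [
    `Crawl completed in ${result.durationSeconds.toFixed(2)} sec`,
    `  Total links found: ${result.totalLinks}`,
    `  Total broken links: ${brokenLinks}`,
    `  Total broken link targets: ${brokenTargets}`,
  ].join('\n');
}

// Example with a fabricated result object.
const sampleResult = {
  issues: [
    { type: 'broken-link', link: { href: '/missing' } },
    { type: 'broken-target', link: { href: '/page#nope' } },
  ],
  durationSeconds: 3.1415,
  totalLinks: 42,
};
console.log(report(sampleResult));
```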

Member Author

What's the point of adding a CLI if we're never gonna call it? Why not add it at the time we need it?

@Janpot Janpot merged commit 16fc0c9 into master Oct 29, 2025
9 checks passed
@Janpot Janpot deleted the brokenLinksChecker branch October 29, 2025 11:55
@zannager zannager added the scope: code-infra Involves the code-infra product (https://www.notion.so/mui-org/5562c14178aa42af97bc1fa5114000cd). label Oct 29, 2025



6 participants