[code-infra] Rewrite broken links checker #832
Conversation
Bundle size report
Check out the code infra dashboard for more information about this PR.
```js
expectNotIssue(result.issues, { link: { href: '/nested/page.html' } });

// Verify that external links are not reported
expectNotIssue(result.issues, { link: { href: 'https://example.com' } });
```
We should at least report when an external link is broken or returns a 404.
For now we're not fetching external links at all; we can rely on ahrefs for that. Fetching them would greatly impact performance (the whole crawl currently runs in a few seconds), and it would also greatly increase complexity to do correctly (proper per-origin concurrency, etc.).
I mainly meant checking whether that particular link returns a 404. It could be done with a HEAD request instead of fetching the whole page; there's no need to process the contents.
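For illustration, a minimal sketch of what such a HEAD-based check could look like (the helper name and the 405 fallback behavior are assumptions, not code from this PR):

```js
// Hypothetical sketch: check that an external link resolves without downloading the body.
async function externalLinkIsOk(href) {
  const response = await fetch(href, { method: 'HEAD', redirect: 'follow' });
  // Some servers reject HEAD outright; a real check would likely fall back to GET on 405.
  return response.ok || response.status === 405;
}
```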
The crawler doesn't fetch external links for now. Reasons:
- It would make the crawl too flaky.
- It wouldn't be idempotent: this crawl runs during the docs build, so building the same commit today vs. in a year should behave the same regardless of external links we don't control.
- It would make the crawl too slow. Each external fetch is orders of magnitude slower than an internal one; the whole Material docs site currently crawls locally in about 3 seconds.
- ahrefs is already performing this check out-of-band.
- Done correctly, it adds significant complexity, e.g. per-origin concurrency limits that respect external robots.txt files.
- Many websites implement HEAD internally as a GET without a body, so a HEAD request wouldn't improve performance much.
- We need to fetch the content anyway if we want to resolve a hash target to an id; that is currently the main advantage over e.g. ahrefs for internal links (see the sketch after this list).
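To illustrate that last point, here is a rough sketch of the kind of check that requires page content. cheerio and the helper name are used for illustration only and may not match the PR's actual implementation:

```js
import * as cheerio from 'cheerio';

// Hypothetical sketch: verify that a link's '#fragment' points at an id (or anchor name)
// that actually exists in the fetched target page.
function hashTargetExists(html, hash) {
  const $ = cheerio.load(html);
  const id = decodeURIComponent(hash.slice(1)); // strip the leading '#'
  return $(`[id="${id}"]`).length > 0 || $(`a[name="${id}"]`).length > 0;
}
```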
```js
 * @param {CrawlOptions} rawOptions
 * @returns {Promise<CrawlResult>}
 */
export async function crawl(rawOptions)
```
We should also expose this as a CLI so that it's easier to set up in downstream repos where no major configuration is required.
Most of the options I see here are simple strings, which would work well with a CLI too.
For now all downstream repos consume this as a lib to keep it easily configurable. I think over time the complexity will add up; we could add a CLI when the need arises?
I want to avoid complex config file resolution like the bundle size checker has, and keep it easy to expand the filtering and ignoring logic to accept functions.
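For context, a downstream script consuming it as a lib could look roughly like this (the import path and option values are assumptions for illustration; the option names follow the CrawlOptions typedef quoted below):

```js
// Hypothetical downstream docs script.
import { crawl } from '@mui/internal-code-infra/brokenLinksChecker';

const result = await crawl({
  ignoredPaths: [/^\/blog\//],
  ignoredTargets: new Set(['#main-content']),
});

// Fail the build when issues are found.
if (result.issues.length > 0) {
  process.exitCode = 1;
}
```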
```js
 * @property {RegExp[]} [ignoredPaths]
 * @property {string[]} [ignoredContent]
 * @property {Set<string>} [ignoredTargets]
 * @property {Map<string, Set<string>>} [knownTargets]
 * @property {string[]} [knownTargetsDownloadUrl]
```
It would be helpful to have descriptions of these options when they are used in a downstream script.
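For example, the typedef could carry short descriptions like the following (the wording is guessed from the option names, not taken from the PR, and only a few options are shown):

```js
/**
 * @property {RegExp[]} [ignoredPaths] - URL paths the crawler skips entirely.
 * @property {Set<string>} [ignoredTargets] - Hash targets that are never validated.
 * @property {Map<string, Set<string>>} [knownTargets] - Pre-resolved target ids per page, for pages that aren't crawled.
 */
```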
```js
crawledPages.set(pageUrl, pagePromise);

await pagePromise;
```
Why await here when it's also awaited on line 536?
Because the queue handles concurrency, the task promise shouldn't resolve until the page is completely finished. In principle the promises in crawledPages at line 536 will all have settled by that point; the await on 536 serves mostly to convert the promises back to their resolved values. The task promise serves both as a vessel for the result and as a placeholder for deduplication.
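A minimal sketch of that pattern, assuming a p-queue-style queue (whether the PR actually uses p-queue, and the helper names, are assumptions):

```js
import PQueue from 'p-queue';

const queue = new PQueue({ concurrency: 8 });
const crawledPages = new Map();

function crawlPage(pageUrl) {
  let pagePromise = crawledPages.get(pageUrl);
  if (!pagePromise) {
    // The stored promise carries the result and also acts as a placeholder,
    // so concurrent discoveries of the same URL deduplicate to a single task.
    // queue.add() limits how many pages are processed at once.
    pagePromise = queue.add(() => fetchAndProcessPage(pageUrl)); // fetchAndProcessPage is hypothetical
    crawledPages.set(pageUrl, pagePromise);
  }
  return pagePromise;
}
```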
```js
// Derive counts from issues
const brokenLinks = issues.filter((issue) => issue.type === 'broken-link').length;
const brokenLinkTargets = issues.filter((issue) => issue.type === 'broken-target').length;

const endTime = Date.now();
const durationSeconds = (endTime - startTime) / 1000;
const duration = new Intl.NumberFormat('en-US', {
  style: 'unit',
  unit: 'second',
  maximumFractionDigits: 2,
}).format(durationSeconds);
console.log(chalk.blue(`\nCrawl completed in ${duration}`));
console.log(` Total links found: ${chalk.cyan(crawledLinks.size)}`);
console.log(` Total broken links: ${chalk.cyan(brokenLinks)}`);
console.log(` Total broken link targets: ${chalk.cyan(brokenLinkTargets)}`);
if (options.outPath) {
  console.log(chalk.blue(`Output written to: ${options.outPath}`));
}
```
Based on this, the link checker should definitely be a CLI as well.
If not, maybe just expose a separate method that does all the reporting instead of logging here directly.
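Something along these lines, for example (reportCrawlResult is a hypothetical name; the counts mirror the snippet quoted above):

```js
import chalk from 'chalk';

// Hypothetical split: crawl() only returns data, and callers opt into logging.
export function reportCrawlResult(result, { outPath } = {}) {
  const brokenLinks = result.issues.filter((issue) => issue.type === 'broken-link').length;
  const brokenLinkTargets = result.issues.filter((issue) => issue.type === 'broken-target').length;
  console.log(` Total broken links: ${chalk.cyan(brokenLinks)}`);
  console.log(` Total broken link targets: ${chalk.cyan(brokenLinkTargets)}`);
  if (outPath) {
    console.log(chalk.blue(`Output written to: ${outPath}`));
  }
}
```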
What's the point of adding a CLI if we're never gonna call it? Why not add it at the time we need it?
Rewrite of the broken links checker which allows us to:
Added a single e2e-style test which covers most functionality.