Added Directory Exclusion #27

Status: Open — wants to merge 1 commit into base: main
3 changes: 3 additions & 0 deletions config.ts
@@ -20,6 +20,8 @@ type Config = {
}) => Promise<void>;
/** Optional timeout for waiting for a selector to appear */
waitForSelectorTimeout?: number;
/** Directories to exclude from crawling */
exclude: string[];
};

export const config: Config = {
@@ -28,4 +30,5 @@ export const config: Config = {
selector: `.docs-builder-container`,
maxPagesToCrawl: 50,
outputFileName: "output.json",
exclude: [],
};
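To illustrate how the new field would be used, here is a hypothetical, trimmed-down `config.ts` with `exclude` populated. Only the fields visible in the diff above are included, and the exclude values are illustrative, not part of the PR.

```typescript
// Trimmed sketch of the Config type from the diff; fields hidden by the
// collapsed hunk are omitted here.
type Config = {
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  /** Optional timeout for waiting for a selector to appear */
  waitForSelectorTimeout?: number;
  /** Directories to exclude from crawling */
  exclude: string[];
};

const config: Config = {
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Illustrative values: skip any URL containing these path fragments.
  exclude: ["/changelog/", "/archive/"],
};
```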
27 changes: 25 additions & 2 deletions src/main.ts
@@ -12,6 +12,23 @@ export function getPageHtml(page: Page) {
}, config.selector);
}

function shouldCrawl(url: string): boolean {
  // Returns false when the URL contains any directory listed in
  // config.exclude; otherwise the URL is eligible for crawling.
  for (const dir of config.exclude) {
    if (url.includes(dir)) {
      return false;
    }
  }
  return true;
}

> **Contributor review comment** (on the `url.includes(dir)` check): for consistency's sake, can we make these glob expressions, similar to `include`? We can use minimatch for testing against it.
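The reviewer's glob-based alternative might look like the sketch below. The library they name is `minimatch`; to keep this snippet dependency-free it uses a minimal glob-to-RegExp stand-in supporting only `*`, which is an assumption for illustration, not the project's code.

```typescript
// Minimal stand-in for minimatch: translate a glob whose only special
// character is "*" into a RegExp. Real code would instead call
// minimatch(url, glob) from the "minimatch" package.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*", then expand "*" to ".*".
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Hypothetical glob patterns; the PR's config uses plain substrings.
const excludeGlobs = ["**/api/**", "**/internal/**"];

function shouldCrawl(url: string): boolean {
  return !excludeGlobs.some((glob) => globToRegExp(glob).test(url));
}
```

With this shape, exclusion rules read the same way as the existing `match`/`include` globs instead of being bare substrings.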

if (process.env.NO_CRAWL !== "true") {
// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
@@ -45,9 +62,15 @@ if (process.env.NO_CRAWL !== "true") {
await config.onVisitPage({ page, pushData });
}

// Extract all the href attributes from the anchor tags on the current page
const links = await page.$$eval('a', links => links.map(a => a.href));

// Filter out the links that should not be crawled based on the configuration
const filteredLinks = links.filter(shouldCrawl);

// Enqueue the filtered links that also match the configured glob pattern
await enqueueLinks({
urls: filteredLinks,
globs: [config.match],
});
},
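End to end, the crawler's new filtering step behaves like this standalone sketch; plain arrays stand in for `page.$$eval` and `enqueueLinks` so it runs without a browser, and the URLs and exclude values are illustrative.

```typescript
// Illustrative exclude list, as a user might set in config.ts.
const exclude = ["/blog/"];

// Stand-in for the hrefs collected from the page's anchor tags.
const links = [
  "https://docs.example.com/intro",
  "https://docs.example.com/blog/2023-roadmap",
];

// Same substring check the PR's shouldCrawl performs, inlined here.
const filteredLinks = links.filter(
  (url) => !exclude.some((dir) => url.includes(dir)),
);
// filteredLinks: ["https://docs.example.com/intro"]
```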