feat: Add use to Router for middleware support with pre-handler execution
#1857
New file `code_examples/request_router/router_middleware.py` (+50 lines):

```python
import asyncio
import time

from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router


async def main() -> None:
    # Create a custom router instance
    router = Router[ParselCrawlingContext]()

    # Register a middleware that logs every request before it reaches a handler
    @router.use
    async def logging_middleware(context: ParselCrawlingContext) -> None:
        context.log.info(
            f'Processing request: {context.request.url} label={context.request.label}'
        )

    # Register a middleware that adds a timestamp to the request's user data
    @router.use
    async def timestamp_middleware(context: ParselCrawlingContext) -> None:
        context.request.user_data['start_time'] = time.monotonic()

    @router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with default handler')

    @router.handler('CATEGORY')
    async def category_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with category handler')

    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=10,
    )

    await crawler.run(
        [
            'https://warehouse-theme-metal.myshopify.com/',
            Request.from_url(
                'https://warehouse-theme-metal.myshopify.com/collections/all',
                label='CATEGORY',
            ),
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
```
Changes to `request_router.mdx`:

```diff
@@ -1,7 +1,7 @@
 ---
 id: request-router
 title: Request router
-description: Learn how to use the Router class to organize request handlers, error handlers, and pre-navigation hooks in Crawlee.
+description: Learn how to use the Router class to organize request handlers, middleware, error handlers, and pre-navigation hooks in Crawlee.
 ---

 import ApiLink from '@site/src/components/ApiLink';
@@ -16,6 +16,7 @@
 import FailedRequestHandler from '!!raw-loader!roa-loader!./code_examples/request_router/failed_request_handler.py';
 import PlaywrightPreNavigation from '!!raw-loader!roa-loader!./code_examples/request_router/playwright_pre_navigation.py';
 import AdaptiveCrawlerHandlers from '!!raw-loader!roa-loader!./code_examples/request_router/adaptive_crawler_handlers.py';
+import RouterMiddleware from '!!raw-loader!roa-loader!./code_examples/request_router/router_middleware.py';

 The <ApiLink to="class/Router">`Router`</ApiLink> class manages request flow and coordinates the execution of user-defined logic in Crawlee projects. It routes incoming requests to appropriate user-defined handlers based on labels, manages error scenarios, and provides hooks for pre-navigation execution. The <ApiLink to="class/Router">`Router`</ApiLink> serves as the orchestrator for all crawling operations, ensuring that each request is processed by the correct handler according to its type and label.
@@ -57,6 +58,14 @@
 {BasicRequestHandlers}
 </RunnableCodeBlock>

+## Middleware
+
+Middleware functions are registered with `router.use()` and execute before the matched request handler on every request, regardless of the request label. Multiple middleware functions can be registered; they run sequentially in registration order. If a middleware raises an exception, the execution chain is interrupted and the handler is not called.
+
+<RunnableCodeBlock className="language-python" language="python">
+    {RouterMiddleware}
+</RunnableCodeBlock>
+
 ## Error handlers

 Crawlee provides error handling mechanisms to manage request processing failures. It distinguishes between recoverable errors that may succeed on retry and permanent failures that require alternative handling strategies.
@@ -107,6 +116,6 @@
 ## Conclusion

-This guide introduced you to the <ApiLink to="class/Router">`Router`</ApiLink> class and how to organize your crawling logic. You learned how to use built-in and custom routers, implement request handlers with label-based routing, handle errors with error and failed request handlers, and configure pre-navigation hooks for different crawler types.
+This guide introduced you to the <ApiLink to="class/Router">`Router`</ApiLink> class and how to organize your crawling logic. You learned how to use built-in and custom routers, implement request handlers with label-based routing, add middleware with `router.use()`, handle errors with error and failed request handlers, and configure pre-navigation hooks for different crawler types.

 If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
```
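The middleware semantics the new section describes (run in registration order before the matched handler; an exception aborts the chain) can be sketched with a minimal stand-in router. `MiniRouter` and the dict-based context here are illustrative assumptions for demonstration only, not the actual crawlee API.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Handler = Callable[[dict], Awaitable[None]]


class MiniRouter:
    """Toy router sketching the `use` / handler dispatch semantics."""

    def __init__(self) -> None:
        self._middlewares: list[Handler] = []
        self._handlers: dict[Optional[str], Handler] = {}

    def use(self, fn: Handler) -> Handler:
        # Middlewares are stored (and later run) in registration order.
        self._middlewares.append(fn)
        return fn

    def default_handler(self, fn: Handler) -> Handler:
        self._handlers[None] = fn
        return fn

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers[label] = fn
            return fn
        return register

    async def __call__(self, context: dict) -> None:
        # Every middleware runs first; an exception here skips the handler.
        for mw in self._middlewares:
            await mw(context)
        handler = self._handlers.get(context.get('label'), self._handlers[None])
        await handler(context)


router = MiniRouter()
events: list[str] = []


@router.use
async def log_middleware(context: dict) -> None:
    events.append(f"mw:{context['url']}")


@router.use
async def failing_middleware(context: dict) -> None:
    # Simulates a middleware failure for one specific label.
    if context.get('label') == 'BROKEN':
        raise RuntimeError('middleware failed')


@router.default_handler
async def default(context: dict) -> None:
    events.append('default')


@router.handler('CATEGORY')
async def category(context: dict) -> None:
    events.append('category')


async def main() -> None:
    await router({'url': '/'})
    await router({'url': '/all', 'label': 'CATEGORY'})
    try:
        await router({'url': '/broken', 'label': 'BROKEN'})
    except RuntimeError:
        events.append('chain-interrupted')  # handler was never called


asyncio.run(main())
print(events)
# ['mw:/', 'default', 'mw:/all', 'category', 'mw:/broken', 'chain-interrupted']
```

The third request shows the interruption behavior: the failing middleware raises before dispatch, so no handler runs for that request.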