diff --git a/docs/guides/request_loaders.mdx b/docs/guides/request_loaders.mdx
index 64f2ceeaf8..9211c60d48 100644
--- a/docs/guides/request_loaders.mdx
+++ b/docs/guides/request_loaders.mdx
@@ -31,6 +31,12 @@ And one specific request loader:
 Below is a class diagram that illustrates the relationships between these components and the `RequestQueue`:
 
 ```mermaid
+---
+config:
+  class:
+    hideEmptyMembersBox: true
+---
+
 classDiagram
 
 %% ========================
@@ -39,20 +45,29 @@ classDiagram
 
 class BaseStorage {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + id
+    + name
+    + open()
+    + drop()
 }
 
 class RequestLoader {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + fetch_next_request()
+    + mark_request_as_handled()
+    + is_empty()
+    + is_finished()
+    + get_handled_count()
+    + get_total_count()
+    + to_tandem()
 }
 
 class RequestManager {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + add_request()
+    + add_requests_batched()
+    + reclaim_request()
+    + drop()
 }
 
 %% ========================
@@ -90,6 +105,8 @@ RequestManager <|-- RequestManagerTandem
 
 The `RequestLoader` interface defines the foundation for fetching requests during a crawl. It provides abstract methods for basic operations like retrieving, marking, or checking the status of requests. Concrete implementations, such as `RequestList`, build on this interface to handle specific scenarios. You may create your own loader that reads from an external file, a web endpoint, a database or matches some other specific scenario. For more details refer to the `RequestLoader` API reference.
 
+The `RequestList` can accept an asynchronous generator as input, so requests can be streamed rather than loaded into memory all at once. This can significantly reduce memory usage, especially when working with large sets of URLs.
+
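+For instance, streaming URLs from an async generator might look roughly like this. This is an illustrative sketch: the generator, URL pattern, and processing loop are placeholders, while `fetch_next_request`, `mark_request_as_handled`, and `is_finished` come from the `RequestLoader` interface shown above:
+
+```python
+import asyncio
+from collections.abc import AsyncGenerator
+
+from crawlee.request_loaders import RequestList
+
+
+async def stream_urls() -> AsyncGenerator[str, None]:
+    # Placeholder source; in practice this could read lazily from a file,
+    # database, or web endpoint instead of generating URLs in a loop.
+    for i in range(100_000):
+        yield f'https://example.com/page/{i}'
+
+
+async def main() -> None:
+    # The generator is consumed lazily, so the URLs are never all held
+    # in memory at the same time.
+    request_list = RequestList(requests=stream_urls())
+
+    while not await request_list.is_finished():
+        request = await request_list.fetch_next_request()
+        if request is None:
+            break
+        # Process the request here, then mark it as handled so it is
+        # not served again.
+        await request_list.mark_request_as_handled(request)
+
+
+asyncio.run(main())
+```
+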
 Here is a basic example of working with the `RequestList`:
 
@@ -102,11 +119,11 @@ The `RequestManager` extends `Reque
 
 ## Request manager tandem
 
-The `RequestManagerTandem` class allows you to combine the read-only capabilities `RequestLoader` (like `RequestList`) with a writable capabilities `RequestManager` (like `RequestQueue`). This is useful for scenarios where you need to load initial requests from a static source (like a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication capabilities, ensuring that requests are not processed multiple times. Under the hood, `RequestManagerTandem` manages whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.
+The `RequestManagerTandem` class allows you to combine the read-only capabilities of a `RequestLoader` (like `RequestList`) with the read-write capabilities of a `RequestManager` (like `RequestQueue`). This is useful for scenarios where you need to load initial requests from a static source (like a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication, ensuring that requests are not processed multiple times. Under the hood, `RequestManagerTandem` checks whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.
 
 ### Request list with request queue
 
-This sections describes the combination of the `RequestList` and `RequestQueue` classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but you also need to handle dynamic requests during the crawl process. The `RequestManagerTandem` class facilitates this combination, with the `RequestLoader.to_tandem` method available as a convenient shortcut. Requests from the `RequestList` are processed first by enqueuing them into the default `RequestQueue`, which handles persistence and retries for failed requests.
+This section describes the combination of the `RequestList` and `RequestQueue` classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but you also need to handle dynamic requests during the crawl process. The `RequestManagerTandem` class facilitates this combination, with the `RequestLoader.to_tandem` method available as a convenient shortcut. Requests from the `RequestList` are processed first by enqueuing them into the default `RequestQueue`, which handles persistence and retries failed requests.
diff --git a/docs/guides/storages.mdx b/docs/guides/storages.mdx
index 3135e04ce8..0a5e4da601 100644
--- a/docs/guides/storages.mdx
+++ b/docs/guides/storages.mdx
@@ -61,6 +61,16 @@ You can override default storage IDs using these environment variables: `CRAWLEE
 
 The `RequestQueue` is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run. The `RequestQueue` is highly useful for large-scale and complex crawls.
 
+By default, data are stored using the following path structure:
+
+```text
+{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{INDEX}.json
+```
+
+- `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, set by the environment variable of the same name.
+- `{QUEUE_ID}`: The ID of the request queue; `default` unless specified otherwise.
+- `{INDEX}`: The zero-based index of the record within the queue.
+
 The following code demonstrates the usage of the `RequestQueue`:
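+
+As a quick primer before the full example, here is a stripped-down sketch of the core operations (illustrative only; the URL is arbitrary and the `print` stands in for real processing):
+
+```python
+import asyncio
+
+from crawlee.storages import RequestQueue
+
+
+async def main() -> None:
+    # Open the default request queue; a named queue can be opened
+    # with `RequestQueue.open(name=...)`.
+    request_queue = await RequestQueue.open()
+
+    # Add a request; the queue deduplicates requests automatically.
+    await request_queue.add_request('https://crawlee.dev')
+
+    # Fetch and process requests until the queue is exhausted.
+    while request := await request_queue.fetch_next_request():
+        print(f'Processing {request.url}')
+        # Mark the request as handled so it is not fetched again.
+        await request_queue.mark_request_as_handled(request)
+
+
+asyncio.run(main())
+```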