The <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> interface defines the foundation for fetching requests during a crawl. It provides abstract methods for basic operations such as retrieving requests, marking them as handled, and checking their status. Concrete implementations, such as <ApiLink to="class/RequestList">`RequestList`</ApiLink>, build on this interface to handle specific scenarios. You can also create your own loader that reads from an external file, a web endpoint, a database, or any other source. For more details, refer to the <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> API reference.
The <ApiLink to="class/RequestList">`RequestList`</ApiLink> can accept an asynchronous generator as input. This allows requests to be streamed rather than loaded into memory all at once, which can significantly reduce memory usage, especially when working with large sets of URLs.
Here is a basic example of working with the <ApiLink to="class/RequestList">`RequestList`</ApiLink>:
<CodeBlock className="language-python">
## Request manager tandem
The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class allows you to combine the read-only capabilities of a `RequestLoader` (like <ApiLink to="class/RequestList">`RequestList`</ApiLink>) with the read-write capabilities of a `RequestManager` (like <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>). This is useful for scenarios where you need to load initial requests from a static source (like a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication capabilities, ensuring that requests are not processed multiple times. Under the hood, <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> checks whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.
### Request list with request queue
This section describes the combination of the <ApiLink to="class/RequestList">`RequestList`</ApiLink> and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but you also need to handle dynamic requests during the crawl process. The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class facilitates this combination, with the <ApiLink to="class/RequestLoader#to_tandem">`RequestLoader.to_tandem`</ApiLink> method available as a convenient shortcut. Requests from the <ApiLink to="class/RequestList">`RequestList`</ApiLink> are processed first by enqueuing them into the default <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, which handles persistence and retries for failed requests.
`docs/guides/storages.mdx`
The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run. The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is highly useful for large-scale and complex crawls.
By default, data are stored using the following path structure: