Commit dea6ca3

docs: update request loaders & storages guide (#904)
1 parent 778d48d commit dea6ca3

File tree

2 files changed: +35 -8 lines changed


docs/guides/request_loaders.mdx

+25 -8
@@ -31,6 +31,12 @@ And one specific request loader:
 Below is a class diagram that illustrates the relationships between these components and the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:
 
 ```mermaid
+---
+config:
+  class:
+    hideEmptyMembersBox: true
+---
+
 classDiagram
 
 %% ========================
@@ -39,20 +45,29 @@ classDiagram
 
 class BaseStorage {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + id
+    + name
+    + open()
+    + drop()
 }
 
 class RequestLoader {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + fetch_next_request()
+    + mark_request_as_handled()
+    + is_empty()
+    + is_finished()
+    + get_handled_count()
+    + get_total_count()
+    + to_tandem()
 }
 
 class RequestManager {
     <<abstract>>
-    _attributes_
-    _methods_()
+    + add_request()
+    + add_requests_batched()
+    + reclaim_request()
+    + drop()
 }
 
 %% ========================
@@ -90,6 +105,8 @@ RequestManager <|-- RequestManagerTandem
 
 The <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> interface defines the foundation for fetching requests during a crawl. It provides abstract methods for basic operations like retrieving, marking, or checking the status of requests. Concrete implementations, such as <ApiLink to="class/RequestList">`RequestList`</ApiLink>, build on this interface to handle specific scenarios. You may create your own loader that reads from an external file, a web endpoint, a database or matches some other specific scenario. For more details refer to the <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> API reference.
 
+The <ApiLink to="class/RequestList">`RequestList`</ApiLink> can accept an asynchronous generator as input. This allows the requests to be streamed rather than loaded into memory all at once, which can significantly reduce memory usage, especially when working with large sets of URLs.
+
 Here is a basic example of working with the <ApiLink to="class/RequestList">`RequestList`</ApiLink>:
 
 <CodeBlock className="language-python">
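A minimal sketch of the streamed input described in the added paragraph above: feeding `RequestList` from an asynchronous generator and draining it through the abstract `RequestLoader` methods shown in the class diagram. The `crawlee.request_loaders` import path and the `requests` parameter name are assumptions for illustration; check the `RequestList` API reference.

```python
import asyncio

# Assumed import path for illustration; see the RequestList API reference.
from crawlee.request_loaders import RequestList


async def stream_urls():
    # Yield URLs lazily (e.g. read line by line from a large file or fetched
    # page by page from an API) instead of building the whole list in memory.
    for page in range(1, 1001):
        yield f'https://example.com/catalog?page={page}'


async def main() -> None:
    # Hypothetical `requests` parameter; the generator is consumed on demand.
    request_list = RequestList(requests=stream_urls())

    # Drain the loader using the methods listed in the class diagram above.
    while not await request_list.is_finished():
        request = await request_list.fetch_next_request()
        if request is None:
            break
        print(f'Fetched from loader: {request.url}')
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```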
@@ -102,11 +119,11 @@ The <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> extends `Reque
 
 ## Request manager tandem
 
-The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class allows you to combine the read-only capabilities `RequestLoader` (like <ApiLink to="class/RequestList">`RequestList`</ApiLink>) with a writable capabilities `RequestManager` (like <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>). This is useful for scenarios where you need to load initial requests from a static source (like a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication capabilities, ensuring that requests are not processed multiple times. Under the hood, <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> manages whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.
+The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class allows you to combine the read-only capabilities of a `RequestLoader` (like <ApiLink to="class/RequestList">`RequestList`</ApiLink>) with the read-write capabilities of a `RequestManager` (like <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>). This is useful for scenarios where you need to load initial requests from a static source (like a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication capabilities, ensuring that requests are not processed multiple times. Under the hood, <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> checks whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.
 
 ### Request list with request queue
 
-This sections describes the combination of the <ApiLink to="class/RequestList">`RequestList`</ApiLink> and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but you also need to handle dynamic requests during the crawl process. The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class facilitates this combination, with the <ApiLink to="class/RequestLoader#to_tandem">`RequestLoader.to_tandem`</ApiLink> method available as a convenient shortcut. Requests from the <ApiLink to="class/RequestList">`RequestList`</ApiLink> are processed first by enqueuing them into the default <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, which handles persistence and retries for failed requests.
+This section describes the combination of the <ApiLink to="class/RequestList">`RequestList`</ApiLink> and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but you also need to handle dynamic requests during the crawl process. The <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink> class facilitates this combination, with the <ApiLink to="class/RequestLoader#to_tandem">`RequestLoader.to_tandem`</ApiLink> method available as a convenient shortcut. Requests from the <ApiLink to="class/RequestList">`RequestList`</ApiLink> are processed first by enqueuing them into the default <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, which handles persistence and retries failed requests.
 
 <Tabs groupId="request_manager_tandem">
 <TabItem value="request_manager_tandem_explicit" label="Explicitly usage">
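A minimal sketch of the tandem flow described above, combining a static `RequestList` with a `RequestQueue`. The import paths are assumptions, and `to_tandem()` is assumed to accept an optional `RequestManager` (defaulting to the default `RequestQueue` when omitted); the loop relies only on methods listed in the class diagram.

```python
import asyncio

# Assumed import paths for illustration; see the API reference for exact locations.
from crawlee.request_loaders import RequestList
from crawlee.storages import RequestQueue


async def main() -> None:
    # Read-only source of start URLs.
    request_list = RequestList(requests=['https://crawlee.dev/', 'https://apify.com/'])

    # Writable manager that owns persistence, retries and deduplication.
    request_queue = await RequestQueue.open()

    # Combine the two; to_tandem() is the shortcut mentioned in the guide.
    tandem = await request_list.to_tandem(request_queue)

    # Requests from the list are transferred into the queue and consumed from
    # there; anything added at crawl time goes straight to the queue side.
    while not await tandem.is_finished():
        request = await tandem.fetch_next_request()
        if request is None:
            break
        print(f'Processing: {request.url}')
        await tandem.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```

In practice the tandem would be handed to a crawler rather than drained by hand, which is what the tabs below demonstrate.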

docs/guides/storages.mdx

+10
@@ -61,6 +61,16 @@ You can override default storage IDs using these environment variables: `CRAWLEE
 
 The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run. The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is highly useful for large-scale and complex crawls.
 
+By default, data are stored using the following path structure:
+
+```text
+{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{INDEX}.json
+```
+
+- `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, specified by the environment variable.
+- `{QUEUE_ID}`: The ID of the request queue, "default" by default.
+- `{INDEX}`: Represents the zero-based index of the record within the queue.
+
 The following code demonstrates the usage of the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:
 
 <Tabs groupId="request_queue">
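A minimal sketch of how the placeholders in the added path structure map to actual queues. The `crawlee.storages` import path, the `name=` argument to `open()`, and passing a plain URL string to `add_request()` are assumptions for illustration.

```python
import asyncio

# Assumed import path for illustration.
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default request queue ({QUEUE_ID} = "default"). With local storage,
    # its records should be persisted under
    # {CRAWLEE_STORAGE_DIR}/request_queues/default/ per the path structure above.
    request_queue = await RequestQueue.open()
    await request_queue.add_request('https://crawlee.dev/')

    # A separately named queue would map to {QUEUE_ID} = "products".
    products_queue = await RequestQueue.open(name='products')
    await products_queue.add_request('https://crawlee.dev/python/')


if __name__ == '__main__':
    asyncio.run(main())
```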
