docs: update setting up and running in cloud (#936)
- I updated the "Setting up" and "Running in cloud" pages to make them more beginner-friendly (hopefully) and removed the mentions of Poetry. Including Poetry was unnecessary and could be confusing - Python has plenty of other package managers we could hypothetically mention as well - so I decided to stick with just pip and pipx for the templates. People using a package manager other than pip will likely know how to handle the installation on their own.
- This was reported on Slack by @honzajavorek.

README.md (+4 -2)

@@ -38,10 +38,12 @@ We also have a TypeScript implementation of Crawlee, which you can explore a

We recommend visiting the [Introduction tutorial](https://crawlee.dev/python/docs/introduction) in Crawlee documentation for more information.

- Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command:
+ Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) package on PyPI. This package includes the core functionality, while additional features are available as optional extras to keep dependencies and package size minimal.
+
+ To install Crawlee with all features, run the following command:
```sh
- pip install 'crawlee[all]'
+ python -m pip install 'crawlee[all]'
```
Then, install the [Playwright](https://playwright.dev/) dependencies:
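
Taken together, the installation steps this README hunk describes would look like the following sketch (the `playwright install` command is taken from a later hunk header in this same commit):

```sh
# Install Crawlee with all optional extras, then the Playwright browsers.
python -m pip install 'crawlee[all]'
playwright install
```
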
@@ -27,7 +25,7 @@ We do not test Crawlee in other cloud environments such as Lambda or on specific
## Logging into Apify platform from Crawlee

- To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by using the [Apify CLI](https://github.com/apify/apify-cli) or with environment variables.
+ To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by using the [Apify CLI](https://docs.apify.com/cli/) or with environment variables.

Once you provide credentials to your Apify CLI installation, you will be able to use all the Apify platform features, such as calling Actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on.
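
As a hedged illustration of the two credential options mentioned above - `apify login` is an Apify CLI command, and `APIFY_TOKEN` is the SDK's documented environment variable, but verify both against the current Apify docs:

```sh
# Option 1: log in once via the Apify CLI; the token is stored locally.
apify login

# Option 2: provide the API token through an environment variable.
export APIFY_TOKEN=<your-api-token>
```
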
@@ -142,7 +140,7 @@ If you don't plan to force usage of the platform storages when running the Actor
{/*
### Getting public url of an item in the platform storage

- If you need to share a link to a file stored in a [Key-Value Store](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) on the Apify Platform, you can use the [`get_public_url()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_public_url) method. It accepts a single parameter, `key` - the key of the item you want to share.
+ If you need to share a link to a file stored in a [Key-Value Store](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) on the Apify platform, you can use the [`get_public_url()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_public_url) method. It accepts a single parameter, `key` - the key of the item you want to share.
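
A rough usage sketch of the method this hunk documents - it assumes an async Actor context and that `get_public_url()` is awaitable; the key name is hypothetical:

```python
# Sketch only; exact signatures may differ between SDK versions.
from apify import Actor

async def main() -> None:
    async with Actor:
        store = await Actor.open_key_value_store()
        # 'OUTPUT' is a hypothetical key of an item saved earlier.
        url = await store.get_public_url('OUTPUT')
        Actor.log.info(f'Public URL of the item: {url}')
```
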

docs/guides/http_clients.mdx (+2 -2)

@@ -36,13 +36,13 @@ In Crawlee we currently have two HTTP clients: <ApiLink to="class/HttpxHttpClien

Since <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink> is the default HTTP client, you don't need to install additional packages to use it. If you want to use <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, you need to install `crawlee` with the `curl-impersonate` extra.
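
For context, the client is swapped via a crawler constructor argument; in a sketch like the following, the import paths are assumptions and may differ between Crawlee versions:

```python
# Sketch only: using CurlImpersonateHttpClient instead of the default.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

crawler = BeautifulSoupCrawler(http_client=CurlImpersonateHttpClient())
```
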

docs/guides/storages.mdx (+3 -3)

@@ -33,7 +33,7 @@ Crawlee offers multiple storage types for managing and persisting your crawling

Storage clients in Crawlee are subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. They handle interactions with different storage backends. For instance:

- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>: Stores data in memory and persists it to the local file system.
- - [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify Platform](https://apify.com). The Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).
+ - [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify platform](https://apify.com). The Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).

Each storage client is responsible for maintaining the storages in a specific environment. This abstraction makes it easier to switch between different environments, e.g. between local development and cloud production setup.
@@ -52,7 +52,7 @@ where:
- `{STORAGE_ID}`: The ID of the specific storage instance (default: `default`).
:::info NOTE
- The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface is quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink> interface, in the near future to make it better and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLite](https://www.sqlite.org/).
+ The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface is quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink> interface, in the near future to make it better and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLite](https://sqlite.org/).
:::
You can override default storage IDs using these environment variables: `CRAWLEE_DEFAULT_DATASET_ID`, `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`, or `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`.
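
For example, overriding the default dataset ID for a single run might look like this (the ID value is hypothetical):

```sh
export CRAWLEE_DEFAULT_DATASET_ID=my-dataset
```
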
@@ -117,7 +117,7 @@ The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> implements the <Ap
If you need custom functionality, you can create your own request storage by subclassing the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> class and implementing its required methods.

- For a detailed explanation of the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> and other related components, refer to the [Request loaders guide](https://www.crawlee.dev/python/docs/guides/request-loaders).
+ For a detailed explanation of the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> and other related components, refer to the [Request loaders guide](https://crawlee.dev/python/docs/guides/request-loaders).
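
As a loose illustration of the request-loader family that guide covers - the `RequestList` import path and constructor here are assumptions and may not match your Crawlee version:

```python
# Sketch only: a static list of start URLs, an alternative to RequestQueue.
from crawlee.request_loaders import RequestList

request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])
```
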
- You can verify these by running the following commands:
+ Before installing Crawlee itself, make sure that your system meets the following requirements:
+
+ - **Python 3.9 or higher**: Crawlee requires Python 3.9 or a newer version. You can download Python from the [official website](https://python.org/downloads/).
+ - **Python package manager**: While this guide uses [pip](https://pip.pypa.io/) (the most common package manager), you can also use any other package manager you want. You can download pip from the [official website](https://pip.pypa.io/en/stable/installation/).
+
+ ### Verifying prerequisites
+
+ To check if Python and pip are installed, run the following commands:

```sh
python --version
```

```sh
- pip --version
+ python -m pip --version
```

- ## Installation
+ If these commands return the respective versions, you're ready to continue.
+
+ ## Installing Crawlee
+
+ Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) package on PyPI. This package includes the core functionality, while additional features are available as optional extras to keep dependencies and package size minimal.

- Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. To install the core package, use:
+ ### Basic installation
+
+ To install the core package, run:
```sh
- pip install crawlee
+ python -m pip install crawlee
```
After installation, verify that Crawlee is installed correctly by checking its version:
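
One way to perform that check, assuming the package exposes a `__version__` attribute:

```sh
python -c 'import crawlee; print(crawlee.__version__)'
```
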
@@ -34,50 +49,40 @@ After installation, verify that Crawlee is installed correctly by checking its v
Crawlee offers several optional features through package extras. You can choose to install only the dependencies you need or install everything if you don't mind the package size.

-
- ### Install all features
+ ### Full installation

- If you do not care about the package size, install Crawlee with all features:
+ If you do not mind the package size, you can run the following command to install Crawlee with all optional features:
```sh
- pip install 'crawlee[all]'
+ python -m pip install 'crawlee[all]'
```

- ### Installing only specific extras
+ ### Installing specific extras
Depending on your use case, you may want to install specific extras to enable additional functionality:

- #### BeautifulSoup
-
For using the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, install the `beautifulsoup` extra:
```sh
- pip install 'crawlee[beautifulsoup]'
+ python -m pip install 'crawlee[beautifulsoup]'
```

- #### Parsel
-
For using the <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, install the `parsel` extra:
```sh
- pip install 'crawlee[parsel]'
+ python -m pip install 'crawlee[parsel]'
```

- #### Curl impersonate
-
For using the <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, install the `curl-impersonate` extra:
```sh
- pip install 'crawlee[curl-impersonate]'
+ python -m pip install 'crawlee[curl-impersonate]'
```

- #### Playwright
-
If you plan to use a (headless) browser with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, install Crawlee with the `playwright` extra:
```sh
- pip install 'crawlee[playwright]'
+ python -m pip install 'crawlee[playwright]'
```
After installing the playwright extra, install the necessary Playwright dependencies:
@@ -91,29 +96,58 @@ playwright install
You can install multiple extras at once by using a comma as a separator:
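
For instance, combining two of the extras covered above would look like this:

```sh
python -m pip install 'crawlee[beautifulsoup,playwright]'
```
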
+ The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. The CLI helps you set up a new project in seconds.
+
+ ### Using Crawlee CLI with Pipx

- The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. First, ensure you have [Pipx](https://pipx.pypa.io/) installed:
+ First, ensure you have Pipx installed. You can check if Pipx is installed by running:
```sh
- pipx --help
+ pipx --version
```

- Then, run the CLI and choose from the available templates:
+ If Pipx is not installed, follow the official [installation guide](https://pipx.pypa.io/stable/installation/).
+
+ Then, run the Crawlee CLI using Pipx and choose from the available templates:
```sh
- pipx run crawlee create my-crawler
+ pipx run crawlee create my_crawler
```

+ ### Using Crawlee CLI directly
+
If you already have `crawlee` installed, you can spin it up by running:
```sh
- crawlee create my-crawler
+ crawlee create my_crawler
```

+ Follow the interactive prompts in the CLI to choose a crawler type and set up your new project.
+
+ ### Running your project
+
+ To run your newly created project, navigate to the project directory, activate the virtual environment, and execute the Python interpreter with the project module:
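
Concretely, those steps might look like this on Linux/macOS - the project directory, virtual environment path, and module name are assumptions based on the `my_crawler` example above:

```sh
cd my_crawler
source .venv/bin/activate   # the template's venv location may differ
python -m my_crawler
```
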

docs/introduction/08_refactoring.mdx (+1 -5)

@@ -67,10 +67,6 @@ Initially, using a simple `if` / `else` statement for selecting different logic

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand-line-long `request_handler()` where everything interacts with everything and variables can be used everywhere is not pleasant and is a pain to debug. That's why we prefer to separate routes into their own files.

- {/* TODO: write this once SDK v2 is ready
-

## Next steps

- In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](../deployment/apify-platform) with ease.
-
- */}
+ In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a `Dockerfile` ready, and the next section will show you how to deploy it to the [Apify platform](../deployment/apify-platform) with ease.