
Commit 2f9dae4

docs: update setting up and running in cloud (#936)
- I updated the "Setting up" and "Running in cloud" pages to make them more beginner-friendly (hopefully) and also removed mentions of Poetry. Including Poetry was unnecessary and could be confusing. Not to mention in Python there are X other package managers that we could hypothetically mention as well. So I decided to stay only with Pip and Pipx for templates. People using a package manager other than Pip will likely know how to handle the installation on their own.
- This was reported on Slack by @honzajavorek.
1 parent 54ecc36 commit 2f9dae4

File tree

13 files changed (+122, -96 lines)


Makefile

+5-7
@@ -1,7 +1,5 @@
 .PHONY: clean install-dev build publish-to-pypi lint type-check unit-tests unit-tests-cov integration-tests format check-code build-api-reference run-docs
 
-DIRS_WITH_CODE = src tests docs website
-
 # This is default for local testing, but GitHub workflows override it to a higher value in CI
 INTEGRATION_TESTS_CONCURRENCY = 1
 
@@ -22,11 +20,11 @@ publish-to-pypi:
 	poetry publish --no-interaction -vv
 
 lint:
-	poetry run ruff format --check $(DIRS_WITH_CODE)
-	poetry run ruff check $(DIRS_WITH_CODE)
+	poetry run ruff format --check
+	poetry run ruff check
 
 type-check:
-	poetry run mypy $(DIRS_WITH_CODE)
+	poetry run mypy
 
 unit-tests:
 	poetry run pytest --numprocesses=auto --verbose --cov=src/crawlee tests/unit
@@ -38,8 +36,8 @@ integration-tests:
 	poetry run pytest --numprocesses=$(INTEGRATION_TESTS_CONCURRENCY) tests/integration
 
 format:
-	poetry run ruff check --fix $(DIRS_WITH_CODE)
-	poetry run ruff format $(DIRS_WITH_CODE)
+	poetry run ruff check --fix
+	poetry run ruff format
 
 # The check-code target runs a series of checks equivalent to those performed by pre-commit hooks
 # and the run_checks.yaml GitHub Actions workflow.
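
For contributors, the day-to-day workflow is unchanged by this edit; only the explicit directory list is gone. A minimal sketch of how the affected targets are still invoked, assuming ruff and mypy now pick up their target paths from the project configuration (for example `pyproject.toml`) rather than from the command line:

```sh
# Sketch: invoking the affected Make targets after the change.
# Assumes ruff and mypy resolve their paths from the project configuration
# (e.g. pyproject.toml) instead of an explicit DIRS_WITH_CODE list.
make lint        # runs: poetry run ruff format --check, then poetry run ruff check
make type-check  # runs: poetry run mypy
make format      # runs: poetry run ruff check --fix, then poetry run ruff format
```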

README.md

+4-2
@@ -38,10 +38,12 @@ We also have a TypeScript implementation of the Crawlee, which you can explore a
 
 We recommend visiting the [Introduction tutorial](https://crawlee.dev/python/docs/introduction) in Crawlee documentation for more information.
 
-Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command:
+Crawlee is available as [`crawlee`](https://pypi.org/project/crawlee/) package on PyPI. This package includes the core functionality, while additional features are available as optional extras to keep dependencies and package size minimal.
+
+To install Crawlee with all features, run the following command:
 
 ```sh
-pip install 'crawlee[all]'
+python -m pip install 'crawlee[all]'
 ```
 
 Then, install the [Playwright](https://playwright.dev/) dependencies:
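
The README now calls pip through `python -m pip` instead of the bare `pip` command. The practical difference is that the module form is tied to whichever interpreter `python` resolves to, which avoids installing into the wrong environment when several Python versions coexist. A short sketch of the resulting install flow:

```sh
# Sketch: the python -m form installs into the interpreter you are actually running.
python --version                       # confirm which interpreter is active
python -m pip install 'crawlee[all]'   # install Crawlee with all optional extras
playwright install                     # then install the Playwright browser binaries
```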

docs/deployment/apify_platform.mdx

+2-4
@@ -6,8 +6,6 @@ description: Apify platform - large-scale and high-performance web scraping
 
 import ApiLink from '@site/src/components/ApiLink';
 
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
 import CodeBlock from '@theme/CodeBlock';
 
 import LogWithConfigExample from '!!raw-loader!./code/apify/log_with_config_example.py';
@@ -27,7 +25,7 @@ We do not test Crawlee in other cloud environments such as Lambda or on specific
 
 ## Logging into Apify platform from Crawlee
 
-To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://github.com/apify/apify-cli) or with environment variables.
+To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://docs.apify.com/cli/) or with environment variables.
 
 Once you provide credentials to your Apify CLI installation, you will be able to use all the Apify platform features, such as calling Actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on.
 
@@ -142,7 +140,7 @@ If you don't plan to force usage of the platform storages when running the Actor
 {/*
 ### Getting public url of an item in the platform storage
 
-If you need to share a link to some file stored in a [Key-Value](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) Store on Apify Platform, you can use [`get_public_url()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_public_url) method. It accepts only one parameter: `key` - the key of the item you want to share.
+If you need to share a link to some file stored in a [Key-Value](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) Store on Apify platform, you can use [`get_public_url()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_public_url) method. It accepts only one parameter: `key` - the key of the item you want to share.
 
 <CodeBlock language="python">
   {GetPublicUrlSource}
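
The section above says credentials can be supplied either through the Apify CLI or through environment variables. A hedged sketch of both paths; the exact commands and flags are defined by the Apify CLI docs, so treat the details below as an illustration rather than part of this commit:

```sh
# Sketch: two ways to provide an Apify API token (see the Apify CLI docs for exact flags).
apify login                          # interactive login that stores the token locally
# ...or export the token as an environment variable instead:
export APIFY_TOKEN='<your-api-token>'
```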

docs/guides/http_clients.mdx

+2-2
@@ -36,13 +36,13 @@ In Crawlee we currently have two HTTP clients: <ApiLink to="class/HttpxHttpClien
 Since <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink> is the default HTTP client, you don't need to install additional packages to use it. If you want to use <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, you need to install `crawlee` with the `curl-impersonate` extra.
 
 ```sh
-pip install 'crawlee[curl-impersonate]'
+python -m pip install 'crawlee[curl-impersonate]'
 ```
 
 or install all available extras:
 
 ```sh
-pip install 'crawlee[all]'
+python -m pip install 'crawlee[all]'
 ```
 
 ## How HTTP clients work
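
If you follow the `python -m pip` convention the guide now uses, installing an extra into a fresh project virtual environment looks roughly like this (a sketch; the venv name and activation command depend on your setup and shell):

```sh
# Sketch: installing the curl-impersonate extra into a new virtual environment.
python -m venv .venv
source .venv/bin/activate            # on Windows: .venv\Scripts\activate
python -m pip install 'crawlee[curl-impersonate]'
```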

docs/guides/storages.mdx

+3-3
@@ -33,7 +33,7 @@ Crawlee offers multiple storage types for managing and persisting your crawling
 Storage clients in Crawlee are subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. They handle interactions with different storage backends. For instance:
 
 - <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>: Stores data in memory and persists it to the local file system.
-- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify Platform](https://apify.com). Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).
+- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify platform](https://apify.com). Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).
 
 Each storage client is responsible for maintaining the storages in a specific environment. This abstraction makes it easier to switch between different environments, e.g. between local development and cloud production setup.
 
@@ -52,7 +52,7 @@ where:
 - `{STORAGE_ID}`: The ID of the specific storage instance (default: `default`).
 
 :::info NOTE
-The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface is quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink> interface in the near future and it better and and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLLite](https://www.sqlite.org/).
+The current <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and its interface is quite old and not great. We plan to refactor it, together with the whole <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink> interface in the near future and it better and and easier to use. We also plan to introduce new storage clients for different storage backends - e.g. for [SQLLite](https://sqlite.org/).
 :::
 
 You can override default storage IDs using these environment variables: `CRAWLEE_DEFAULT_DATASET_ID`, `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`, or `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`.
@@ -117,7 +117,7 @@ The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> implements the <Ap
 
 If you need custom functionality, you can create your own request storage by subclassing the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> class and implementing its required methods.
 
-For a detailed explanation of the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> and other related components, refer to the [Request loaders guide](https://www.crawlee.dev/python/docs/guides/request-loaders).
+For a detailed explanation of the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> and other related components, refer to the [Request loaders guide](https://crawlee.dev/python/docs/guides/request-loaders).
 
 ## Dataset
 
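The note about overriding default storage IDs can be made concrete with the environment variables quoted in the guide. A sketch with placeholder values (the project module name `my_crawler` is only illustrative):

```sh
# Sketch: overriding the default storage IDs before running a crawler.
export CRAWLEE_DEFAULT_DATASET_ID='my-dataset'
export CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID='my-kv-store'
export CRAWLEE_DEFAULT_REQUEST_QUEUE_ID='my-request-queue'
python -m my_crawler   # placeholder for however you start your crawler
```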
docs/introduction/01_setting_up.mdx

+67-33
@@ -4,28 +4,43 @@ title: Setting up
 ---
 
 import ApiLink from '@site/src/components/ApiLink';
+import CodeBlock from '@theme/CodeBlock';
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
-To run Crawlee on your computer, ensure you meet the following requirements:
+This guide will help you get started with Crawlee by setting it up on your computer. Follow the steps below to ensure a smooth installation process.
 
-1. [Python](https://www.python.org/) 3.9 or higher installed,
-2. [Pip](https://pip.pypa.io/en/stable/) installed.
+## Prerequisites
 
-You can verify these by running the following commands:
+Before installing Crawlee itself, make sure that your system meets the following requirements:
+
+- **Python 3.9 or higher**: Crawlee requires Python 3.9 or a newer version. You can download Python from the [official website](https://python.org/downloads/).
+- **Python package manager**: While this guide uses [pip](https://pip.pypa.io/) (the most common package manager), you can also use any package manager you want. You can download pip from the [official website](https://pip.pypa.io/en/stable/installation/).
+
+### Verifying prerequisites
+
+To check if Python and pip are installed, run the following commands:
 
 ```sh
 python --version
 ```
 
 ```sh
-pip --version
+python -m pip --version
 ```
 
-## Installation
+If these commands return the respective versions, you're ready to continue.
+
+## Installing Crawlee
+
+Crawlee is available as [`crawlee`](https://pypi.org/project/crawlee/) package on PyPI. This package includes the core functionality, while additional features are available as optional extras to keep dependencies and package size minimal.
 
-Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. To install the core package, use:
+### Basic installation
+
+To install the core package, run:
 
 ```sh
-pip install crawlee
+python -m pip install crawlee
 ```
 
 After installation, verify that Crawlee is installed correctly by checking its version:
@@ -34,50 +49,40 @@ After installation, verify that Crawlee is installed correctly by checking its v
 python -c 'import crawlee; print(crawlee.__version__)'
 ```
 
-Crawlee offers several optional features through package extras. You can choose to install only the dependencies you need or install everything if you don't mind the package size.
-
-### Install all features
+### Full installation
 
-If you do not care about the package size, install Crawlee with all features:
+If you do not mind the package size, you can run the following command to install Crawlee with all optional features:
 
 ```sh
-pip install 'crawlee[all]'
+python -m pip install 'crawlee[all]'
 ```
 
-### Installing only specific extras
+### Installing specific extras
 
 Depending on your use case, you may want to install specific extras to enable additional functionality:
 
-#### BeautifulSoup
-
 For using the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, install the `beautifulsoup` extra:
 
 ```sh
-pip install 'crawlee[beautifulsoup]'
+python -m pip install 'crawlee[beautifulsoup]'
 ```
 
-#### Parsel
-
 For using the <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, install the `parsel` extra:
 
 ```sh
-pip install 'crawlee[parsel]'
+python -m pip install 'crawlee[parsel]'
 ```
 
-#### Curl impersonate
-
 For using the <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, install the `curl-impersonate` extra:
 
 ```sh
-pip install 'crawlee[curl-impersonate]'
+python -m pip install 'crawlee[curl-impersonate]'
 ```
 
-#### Playwright
-
 If you plan to use a (headless) browser with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, install Crawlee with the `playwright` extra:
 
 ```sh
-pip install 'crawlee[playwright]'
+python -m pip install 'crawlee[playwright]'
 ```
 
 After installing the playwright extra, install the necessary Playwright dependencies:
@@ -91,29 +96,58 @@ playwright install
 You can install multiple extras at once by using a comma as a separator:
 
 ```sh
-pip install 'crawlee[beautifulsoup,curl-impersonate]'
+python -m pip install 'crawlee[beautifulsoup,curl-impersonate]'
 ```
 
-## With Crawlee CLI
+## Start a new project
+
+The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. The CLI helps you set up a new project in seconds.
+
+### Using Crawlee CLI with Pipx
 
-The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates. First, ensure you have [Pipx](https://pipx.pypa.io/) installed:
+First, ensure you have Pipx installed. You can check if Pipx is installed by running:
 
 ```sh
-pipx --help
+pipx --version
 ```
 
-Then, run the CLI and choose from the available templates:
+If Pipx is not installed, follow the official [installation guide](https://pipx.pypa.io/stable/installation/).
+
+Then, run the Crawlee CLI using Pipx and choose from the available templates:
 
 ```sh
-pipx run crawlee create my-crawler
+pipx run crawlee create my_crawler
 ```
 
+### Using Crawlee CLI directly
+
 If you already have `crawlee` installed, you can spin it up by running:
 
 ```sh
-crawlee create my-crawler
+crawlee create my_crawler
 ```
 
+Follow the interactive prompts in the CLI to choose a crawler type and set up your new project.
+
+### Running your project
+
+To run your newly created project, navigate to the project directory, activate the virtual environment, and execute the Python interpreter with the project module:
+
+<Tabs>
+  <TabItem value="Linux" label="Linux" default>
+    <CodeBlock language="sh">cd my_crawler/</CodeBlock>
+    <CodeBlock language="sh">source .venv/bin/activate</CodeBlock>
+    <CodeBlock language="sh">python -m my_crawler</CodeBlock>
+  </TabItem>
+  <TabItem value="Windows" label="Windows" default>
+    <CodeBlock language="sh">cd my_crawler/</CodeBlock>
+    <CodeBlock language="sh">venv\Scripts\activate</CodeBlock>
+    <CodeBlock language="sh">python -m my_crawler</CodeBlock>
+  </TabItem>
+</Tabs>
+
+Congratulations! You have successfully set up and executed your first Crawlee project.
+
 ## Next steps
 
 Next, you will learn how to create a very simple crawler and Crawlee components while building it.
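
The new "Running your project" tabs assume the generated project is an importable package that can be started with `python -m` and ships with a pre-created virtual environment. A sketch of the full flow on Linux; the `__main__.py` layout and the `.venv` directory are assumptions about the template, not something stated in this diff:

```sh
# Sketch: end-to-end flow on Linux, assuming the template creates a package
# named my_crawler with a __main__.py and a .venv virtual environment.
pipx run crawlee create my_crawler   # scaffold a new project from a template
cd my_crawler/
source .venv/bin/activate
python -m my_crawler                 # runs the package's __main__.py
```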

docs/introduction/08_refactoring.mdx

+1-5
@@ -67,10 +67,6 @@ Initially, using a simple `if` / `else` statement for selecting different logic
 
 It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long `request_handler()` where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files.
 
-{/* TODO: write this once SDK v2 is ready
-
 ## Next steps
 
-In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](../deployment/apify-platform) with ease.
-
-*/}
+In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a `Dockerfile` ready, and the next section will show you how to deploy it to the [Apify platform](../deployment/apify-platform) with ease.
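
The restored paragraph points at the Dockerfile that CLI-bootstrapped projects already contain. A hedged sketch of how such an image is typically built and run locally; the image tag is illustrative and not taken from this commit:

```sh
# Sketch: building and running the generated Dockerfile locally.
docker build -t my-crawler .
docker run --rm my-crawler
```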
