Optimize package footprint by removing unnecessary deps #1077

Open
vdusek opened this issue Mar 12, 2025 · 4 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Collaborator

vdusek commented Mar 12, 2025

Description

A fresh installation of the Crawlee package, together with all its direct and transitive dependencies, occupies ~75 MB (the virtual environment grows from ~6.6 MB to ~82 MB, as shown below). For context, Scrapy occupies ~88.4 MB.

(venv) $ du -h venv/ --max-depth 0
6,6M	venv/
(venv) $ pip install crawlee
...
(venv) $ du -h venv/ --max-depth 0
82M	venv/

The large size is primarily due to several dependencies that may not be strictly necessary for the core functionality. Below is a detailed breakdown of the dependency tree, individual package sizes, and the summed sizes of Crawlee's direct dependencies.

Dependency tree

crawlee v0.6.4
    ├── apify-fingerprint-datapoints v0.0.2
    ├── browserforge v1.2.3
    │   └── click v8.1.8
    ├── cachetools v5.5.2
    ├── colorama v0.4.6
    ├── docutils v0.21.2
    ├── eval-type-backport v0.2.2
    ├── httpx[brotli, http2, zstd] v0.28.1
    │   ├── anyio v4.8.0
    │   │   ├── idna v3.10
    │   │   └── sniffio v1.3.1
    │   ├── certifi v2025.1.31
    │   ├── httpcore v1.0.7
    │   │   ├── certifi v2025.1.31
    │   │   └── h11 v0.14.0
    │   ├── idna v3.10
    │   ├── brotli v1.1.0 (extra: brotli)
    │   ├── h2 v4.2.0 (extra: http2)
    │   │   ├── hpack v4.1.0
    │   │   └── hyperframe v6.1.0
    │   └── zstandard v0.23.0 (extra: zstd)
    ├── more-itertools v10.6.0
    ├── psutil v7.0.0
    ├── pydantic v2.10.6
    │   ├── annotated-types v0.7.0
    │   ├── pydantic-core v2.27.2
    │   │   └── typing-extensions v4.12.2
    │   └── typing-extensions v4.12.2
    ├── pydantic-settings v2.6.1
    │   ├── pydantic v2.10.6 (*)
    │   └── python-dotenv v1.0.1
    ├── pyee v12.1.1
    │   └── typing-extensions v4.12.2
    ├── rich v13.9.4
    │   ├── markdown-it-py v3.0.0
    │   │   └── mdurl v0.1.2
    │   └── pygments v2.19.1
    ├── sortedcollections v2.1.0
    │   └── sortedcontainers v2.4.0
    ├── tldextract v5.1.3
    │   ├── filelock v3.17.0
    │   ├── idna v3.10
    │   ├── requests v2.32.3
    │   │   ├── certifi v2025.1.31
    │   │   ├── charset-normalizer v3.4.1
    │   │   ├── idna v3.10
    │   │   └── urllib3 v2.3.0
    │   └── requests-file v2.1.0
    │       └── requests v2.32.3 (*)
    ├── typing-extensions v4.12.2
    └── yarl v1.18.3
        ├── idna v3.10
        ├── multidict v6.1.0
        └── propcache v0.3.0

Package sizes

23M     .venv/lib/python3.13/site-packages/zstandard
7,2M    .venv/lib/python3.13/site-packages/_brotli.cpython-313-x86_64-linux-gnu.so
4,9M    .venv/lib/python3.13/site-packages/pygments
4,7M    .venv/lib/python3.13/site-packages/pydantic_core
2,4M    .venv/lib/python3.13/site-packages/docutils
1,9M    .venv/lib/python3.13/site-packages/pydantic
1,1M    .venv/lib/python3.13/site-packages/yarl
1,1M    .venv/lib/python3.13/site-packages/rich
1,1M    .venv/lib/python3.13/site-packages/crawlee
1,0M    .venv/lib/python3.13/site-packages/psutil
836K    .venv/lib/python3.13/site-packages/apify_fingerprint_datapoints
784K    .venv/lib/python3.13/site-packages/propcache
484K    .venv/lib/python3.13/site-packages/urllib3
456K    .venv/lib/python3.13/site-packages/multidict
452K    .venv/lib/python3.13/site-packages/charset_normalizer
436K    .venv/lib/python3.13/site-packages/anyio
376K    .venv/lib/python3.13/site-packages/markdown_it
368K    .venv/lib/python3.13/site-packages/tldextract
368K    .venv/lib/python3.13/site-packages/click
352K    .venv/lib/python3.13/site-packages/idna
328K    .venv/lib/python3.13/site-packages/httpx
324K    .venv/lib/python3.13/site-packages/httpcore
308K    .venv/lib/python3.13/site-packages/certifi
260K    .venv/lib/python3.13/site-packages/h2
236K    .venv/lib/python3.13/site-packages/h11
228K    .venv/lib/python3.13/site-packages/requests
228K    .venv/lib/python3.13/site-packages/more_itertools
228K    .venv/lib/python3.13/site-packages/hpack
136K    .venv/lib/python3.13/site-packages/pydantic_settings
132K    .venv/lib/python3.13/site-packages/typing_extensions.py
124K    .venv/lib/python3.13/site-packages/sortedcontainers
120K    .venv/lib/python3.13/site-packages/browserforge
80K     .venv/lib/python3.13/site-packages/colorama
60K     .venv/lib/python3.13/site-packages/filelock
52K     .venv/lib/python3.13/site-packages/pyee
52K     .venv/lib/python3.13/site-packages/dotenv
44K     .venv/lib/python3.13/site-packages/hyperframe
36K     .venv/lib/python3.13/site-packages/mdurl
36K     .venv/lib/python3.13/site-packages/cachetools
28K     .venv/lib/python3.13/site-packages/sortedcollections
24K     .venv/lib/python3.13/site-packages/annotated_types
16K     .venv/lib/python3.13/site-packages/sniffio
16K     .venv/lib/python3.13/site-packages/eval_type_backport
8,0K    .venv/lib/python3.13/site-packages/_virtualenv.py
8,0K    .venv/lib/python3.13/site-packages/requests_file.py
4,0K    .venv/lib/python3.13/site-packages/_virtualenv.pth
4,0K    .venv/lib/python3.13/site-packages/brotli.py

Extracted via:

du -sh .venv/lib/python*/site-packages/* | sort -hr

Total size per direct dependency

  • httpx[brotli, http2, zstd]: 32.732M
  • pydantic-settings: 6.944M
  • pydantic: 6.756M
  • rich: 6.412M
  • yarl: 2.692M
  • docutils: 2.4M
  • tldextract: 2.260M
  • psutil: 1.0M
  • apify-fingerprint-datapoints: 836K
  • browserforge: 488K
  • more-itertools: 228K
  • pyee: 184K
  • sortedcollections: 152K
  • typing-extensions: 132K
  • colorama: 80K
  • cachetools: 36K
  • eval-type-backport: 16K
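
The totals above appear to be the size of each direct dependency plus all of its transitive dependencies, with every package counted once even if it shows up several times in the tree (idna, for example). A minimal sketch of that calculation, using only the tldextract and rich subtrees and sizes copied from the listings above, could look like this. The names SIZES_K, DEPS, closure and total_size_k are illustrative, and "M" values are read as 1000 K, which is the convention the totals seem to follow.

```python
# Sizes in K, copied from the "Package sizes" listing (1,1M read as 1100 K).
SIZES_K = {
    "tldextract": 368, "filelock": 60, "idna": 352, "requests": 228,
    "certifi": 308, "charset-normalizer": 452, "urllib3": 484,
    "requests-file": 8,
    "rich": 1100, "markdown-it-py": 376, "mdurl": 36, "pygments": 4900,
}

# Direct dependencies, copied from the "Dependency tree" listing.
DEPS = {
    "tldextract": ["filelock", "idna", "requests", "requests-file"],
    "requests": ["certifi", "charset-normalizer", "idna", "urllib3"],
    "requests-file": ["requests"],
    "rich": ["markdown-it-py", "pygments"],
    "markdown-it-py": ["mdurl"],
}


def closure(package: str, deps: dict[str, list[str]]) -> set[str]:
    """Return the package plus all of its transitive dependencies."""
    seen: set[str] = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # shared packages such as idna are counted once
        seen.add(name)
        stack.extend(deps.get(name, []))
    return seen


def total_size_k(package: str) -> int:
    """Sum the sizes (in K) of a package and everything it pulls in."""
    return sum(SIZES_K[name] for name in closure(package, DEPS))


if __name__ == "__main__":
    for name in ("tldextract", "rich"):
        print(f"{name}: {total_size_k(name) / 1000:.3f}M")
```

Note that these per-dependency totals overlap: packages shared between subtrees (idna, certifi, typing-extensions, pydantic) are included in more than one total, so the list does not add up to the size of the whole environment.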

Goal

The goal is to identify and potentially remove or replace dependencies that contribute significantly to the overall package size without compromising its functionality.

vdusek changed the title from "Optimize package footprint by eliminating unnecessary dependencies" to "Optimize package footprint by removing unnecessary deps" on Mar 12, 2025
github-actions bot added the t-tooling label on Mar 12, 2025
@vdusek
Collaborator Author

vdusek commented Mar 12, 2025

Removable

  • httpx[brotli, http2, zstd] - Replace with Impit as the default HTTP client, making this an optional dependency.
  • rich - Currently used only for log tables. Consider eliminating it and/or replacing it with something smaller.
  • docutils - Not used at all.
  • tldextract - Used to extract URL components (subdomain, domain, and suffix). Evaluate whether yarl alone could cover this so that tldextract can be dropped.
  • apify-fingerprint-datapoints - Could be optional as part of HTTPX & Playwright extras if Impit is adopted as the default HTTP client.
  • browserforge - Could be optional as part of HTTPX & Playwright extras if Impit is adopted as the default HTTP client.
  • sortedcollections - Only used for ValueSortedDict in the memory storage RQ client. Since it's not actively maintained, evaluate replacing or removing it.
  • sortedcontainers - Only used for SortedList in the Snapshotter.
  • eval-type-backport - A backport that can be moved to dev deps and removed entirely once support for Python 3.9 is dropped.
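
Several of the items above amount to turning a hard dependency into an optional one. One common way to do that on the code side is to import the module lazily and fail with an actionable message when it is missing. The sketch below is only an illustration of that pattern: the helper name require_optional and the extra name in the error message (crawlee[httpx]) are assumptions, not something the project has defined.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or explain how to install it.

    Hypothetical helper: `extra` is the name of an assumed packaging extra.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f"Install it with: pip install 'crawlee[{extra}]'"
        ) from exc


if __name__ == "__main__":
    # The import only happens when the HTTPX-based client is actually used.
    try:
        httpx = require_optional("httpx", "httpx")
        print("httpx available:", httpx.__version__)
    except ImportError as err:
        print(err)
```

With this approach the base install no longer pays for the large compression libraries unless the user opts in.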

Not removable

  • pydantic-settings
  • pydantic
  • yarl - Used for URLs.
  • psutil - Used for monitoring CPU and memory usage by the Snapshotter.
  • more-itertools
  • pyee - Core component of EventManager.
  • typing-extensions
  • colorama - Utilized for colored logging output.
  • cachetools - Operates as an LRU cache for requests in the RQ implementation.

@Pijukatel
Collaborator

Btw. sortedcollections depends on sortedcontainers

I used sortedcontainers here:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/snapshotter.py#L10

even though it is not explicitly mentioned in the requirements (due to sortedcollections already including it)
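
If ordered insertion turns out to be the only thing needed from SortedList, the standard library's bisect module might be enough. The following is a rough sketch under that assumption, not a drop-in replacement: the class name SmallSortedList is made up, it supports only add, iteration and len, and insertion is O(n), which should be acceptable for short lists but is slower than sortedcontainers for large ones.

```python
import bisect
from typing import Any, Callable, Iterator


class SmallSortedList:
    """Keep items ordered by a key, using only the standard library.

    Hypothetical stand-in for a small subset of sortedcontainers.SortedList.
    """

    def __init__(self, key: Callable[[Any], Any] = lambda item: item) -> None:
        self._key = key
        self._keys: list[Any] = []   # sort keys, kept in ascending order
        self._items: list[Any] = []  # items, kept in the same order as _keys

    def add(self, item: Any) -> None:
        k = self._key(item)
        # bisect_right keeps items with equal keys in insertion order.
        index = bisect.bisect_right(self._keys, k)
        self._keys.insert(index, k)
        self._items.insert(index, item)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


if __name__ == "__main__":
    values = SmallSortedList()
    for number in (3, 1, 2):
        values.add(number)
    print(list(values))
```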

@janbuchar
Collaborator

janbuchar commented Mar 13, 2025

> I used sortedcontainers here: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/snapshotter.py#L10
>
> even though it is not explicitly mentioned in the requirements (due to sortedcollections already including it)

Please add it to the dependencies, you never know when transitive dependencies will change 🙂

@Pijukatel
Collaborator

> Please add it to the dependencies, you never know when transitive dependencies will change 🙂

Here it goes #1083
