Faster page downloads

You can use concurrent.futures' ThreadPoolExecutor to significantly speed up page downloads. It uses a pool of threads to execute calls asynchronously 💪 Since downloading pages is I/O-bound, the threads spend most of their time waiting on the network, so running the requests concurrently gives a big win.
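
As a minimal sketch of the API (not the script from this post — the urls list is just illustrative), executor.map schedules one call per item and yields the results in input order:

import concurrent.futures

import requests

urls = ["https://pybit.es/", "https://peps.python.org/"]  # any list of URLs

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # one requests.get call per URL, executed across the pool's threads
    for url, resp in zip(urls, executor.map(requests.get, urls)):
        print(url, resp.status_code)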

Here I adapted the ThreadPoolExecutor example from the concurrent.futures module docs and added a little command-line flag to compare downloading 10 Pybites articles synchronously versus asynchronously.

Note: when making more requests, you probably want to limit the number of workers / threads for rate limiting (and to be respectful to the server you are hitting).
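
For example, a rough sketch of one way to throttle (polite_download, the worker count, and the delay are illustrative, not from this post): cap max_workers and pause inside each worker:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

urls = ["https://pybit.es/"]  # whatever you are downloading

def polite_download(url):
    requests.get(url)
    time.sleep(0.5)  # crude throttle: each thread pauses before taking its next URL

with ThreadPoolExecutor(max_workers=5) as executor:  # fewer workers = gentler load
    list(executor.map(polite_download, urls))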

First pip install feedparser requests into a virtual environment, for example:
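
$ python -m venv venv && source venv/bin/activate
$ pip install feedparser requests

The code: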

import concurrent.futures
from pathlib import Path
import sys

import feedparser
import requests

URLS = [
    entry.link for entry in
    feedparser.parse("https://pybit.es/feed/").entries
]


def _download_page(url):
    resp = requests.get(url)
    # name the file after the last segment of the URL path
    with open(Path(url).stem, "wb") as f:
        f.write(resp.content)


def download_urls_sync(urls=URLS):
    for url in urls:
        _download_page(url)


def download_urls_async(urls=URLS):
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=32
    ) as executor:
        future_to_url = {
            executor.submit(_download_page, url): url
            for url in urls
        }
        for future in concurrent.futures.as_completed(
            future_to_url
        ):
            # surface any exception raised in the worker thread
            future.result()


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "async":
        download_urls_async()
    else:
        download_urls_sync()

Running this code locally I get an almost 4x performance improvement (6.416s down to 1.719s total, ~3.7x):

$ time python download.py async
python download.py async  0.31s user 0.08s system 22% cpu 1.719 total

$ time python download.py
python download.py  0.29s user 0.05s system 5% cpu 6.416 total

#threading #concurrent