datawookie commented Oct 1, 2021

Hi!

We build a lot of web scrapers using Scrapy and I've been using your package for a while now. It's great for managing our multi-proxy setup.

We have been developing a proxy system that shares its proxy list via a URL. Until now I have been dumping the contents of that URL to a file so that I can read it in via ROTATING_PROXY_LIST_PATH, but this has become a bit of a pain. It occurred to me that it should be possible to read the proxy list directly from a URL.

This pull request includes a small change to the RotatingProxyMiddleware.from_crawler() method to make that possible.

Example: Sharing proxy list at http://127.0.0.1:8800.

In settings.py I then have:

ROTATING_PROXY_LIST_PATH = 'http://127.0.0.1:8800'
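The diff itself is not reproduced in this thread, so the following is only a minimal sketch of the idea, not the actual change to from_crawler(). It assumes the URL serves plain text with one proxy per line, the same layout as a proxy list file. The helper names (is_url, parse_proxy_lines, load_proxy_list) and the timeout default are hypothetical and are not part of scrapy-rotating-proxies.

```python
from typing import List
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source: str) -> bool:
    """Return True if source looks like an http(s) URL rather than a file path."""
    return urlparse(source).scheme in ("http", "https")


def parse_proxy_lines(text: str) -> List[str]:
    """Split text into proxy entries, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_proxy_list(source: str, timeout: float = 10.0) -> List[str]:
    """Load proxies from a URL or a local file (hypothetical helper)."""
    if is_url(source):
        # Remote list: fetch the body and decode it as UTF-8 text.
        with urlopen(source, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    else:
        # Local list: read the file as before.
        with open(source, encoding="utf-8") as handle:
            text = handle.read()
    return parse_proxy_lines(text)
```

With a helper along these lines, the value of ROTATING_PROXY_LIST_PATH could be either a file path or something like 'http://127.0.0.1:8800', and the rest of the middleware would receive the same list of strings in both cases.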

For context, here's a blog post about the proxy system that we are using in conjunction with scrapy-rotating-proxies.

Best regards,
Andrew.

@kaybeudeker

The link to your blog post should be: https://datawookie.dev/blog/2021/10/medusa-multi-headed-tor-proxy/ (instead of pointing to localhost) ;) Great work btw!

datawookie commented Nov 28, 2021

Thanks, @kaybeudeker, I've updated the URL. Appreciate you bringing that to my attention.

Have you tried this out? I'd really appreciate any feedback.

SashiDareddy commented Feb 20, 2022

I had a similar use case: reading proxies from a URL (specifically, an API call to a third party that returns a list of proxies, just as you have). I wrote a small utility function that uses requests.get to fetch the proxies, and I assign the result to ROTATING_PROXY_LIST in settings.py.

utility function:

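```python
import random
from typing import List

import requests


def get_proxies(proxy_json_end_point: str) -> List[str]:
    # Fetch the proxy list; the endpoint returns a JSON array of strings.
    r = requests.get(proxy_json_end_point)
    proxies = r.json()

    # Each entry has the form "host:port;user;pwd".
    proxy_urls = [
        f"http://{user}:{pwd}@{host_port}"
        for (host_port, user, pwd) in [p.split(";") for p in proxies]
    ]
    # Randomise the order of the proxies.
    random.shuffle(proxy_urls)
    print("Proxies:", proxy_urls)
    return proxy_urls
```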

settings.py

ROTATING_PROXY_LIST = get_proxies(os.getenv("PROXY_JSON_ENDPOINT"))

Note: the PROXY_JSON_ENDPOINT environment variable points to the third party's API endpoint, which returns the proxies. I have used a similar approach to fetch proxies listed in a text file hosted in S3.
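The list comprehension in the utility function implies that each entry returned by the endpoint has the form "host:port;user;pwd". As a rough illustration of that transformation without any network call, here is a sketch using made-up sample entries. The function name format_proxy_urls and the sample credentials are hypothetical.

```python
from typing import List


def format_proxy_urls(entries: List[str]) -> List[str]:
    """Turn 'host:port;user;pwd' entries into 'http://user:pwd@host:port' URLs."""
    urls = []
    for entry in entries:
        # Each entry carries exactly three ';'-separated fields.
        host_port, user, pwd = entry.split(";")
        urls.append(f"http://{user}:{pwd}@{host_port}")
    return urls


# Made-up sample data in the assumed format.
sample = ["10.0.0.1:8080;alice;s3cret", "10.0.0.2:8080;bob;hunter2"]
print(format_proxy_urls(sample))
# ['http://alice:s3cret@10.0.0.1:8080', 'http://bob:hunter2@10.0.0.2:8080']
```

Keeping the formatting step separate from the HTTP request makes it easy to reuse for other sources, such as a text file, as long as the entries follow the same layout.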

@datawookie

Hi @TeamHG-Memex, any progress on this? This PR has been languishing for a few months now. Thanks, Andrew.
