Exceptions from concurrent_scrape? #17

Open
williamhakim10 opened this issue Apr 3, 2024 · 1 comment

@williamhakim10
We have code that looks like this:

        from scrapfly import ScrapeConfig, ScrapflyClient, ScrapflyError

        scrapfly = ScrapflyClient(key=self.__scrapfly_api_key, max_concurrency=15)
        targets = [
            ScrapeConfig(
                url=url,
                render_js=True,
                raise_on_upstream_error=False,
                country="us",
                asp=True,
            )
            for url in urls
        ]
        async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
            self.__logger.info(f"Got result: {result}")  # when this code explodes, no log appears
            if isinstance(result, ScrapflyError):  # error from scrapfly itself
                ...
            elif result.error:  # error from upstream
                ...
            else:  # success
                ...

That said, this code sometimes explodes on the async iterator itself, which throws an error like the following without returning a result at all:

<-- 422 | ERR::PROXY::TIMEOUT - Proxy connection or website was too slow and timeout - Proxy or website do not respond after 15s - Check if the website is online or geoblocking, if you are using session, rotate it..Checkout the related doc: https://scrapfly.io/docs/scrape-api/error/ERR::PROXY::TIMEOUT

It seems like there's some kind of bug where the async iterator can itself throw rather than return the exception as a result, which means the entire process blows up. Any ideas on how we might go about fixing this?
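In the meantime, a workaround sketch we're considering (assuming the client exposes an awaitable async_scrape(scrape_config) method, as the SDK docs describe): fan the requests out ourselves with asyncio.gather(return_exceptions=True), so a raised ScrapflyError is captured per URL instead of escaping the loop.

    import asyncio

    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapflyError

    # Workaround sketch, not SDK-provided code. Assumes ScrapflyClient
    # exposes an awaitable async_scrape(scrape_config) method.
    async def scrape_all(scrapfly: ScrapflyClient, targets: list) -> list:
        semaphore = asyncio.Semaphore(15)  # mirrors max_concurrency=15

        async def scrape_one(config: ScrapeConfig):
            async with semaphore:
                return await scrapfly.async_scrape(config)

        # With return_exceptions=True, each element is either a response
        # or an exception instance (e.g. ScrapflyError), so the isinstance
        # branching from the loop above still applies.
        return await asyncio.gather(
            *(scrape_one(config) for config in targets),
            return_exceptions=True,
        )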

As an aside, I wanted to point out that the inconsistent use of typing throughout the library makes it very hard to debug what's actually going on and to reason about which errors can happen and when.

@jjsaunier
Member

I will run some load tests to check. TBH the implementation is quite old and probably not super efficient, since it relies on a thread pool for async behind the scenes. I think it's simpler to build a new implementation on top of the native async client and leverage asyncio's out-of-the-box features.

As an aside, I wanted to point out that the inconsistent use of typing throughout the library makes it very hard to debug what's actually going on and to reason about which errors can happen and when.

Yeah, that's due to the current implementation. We can't really throw in an async way with async for; the only option is to return the exception as a result. I also made the choice to disable throwing in concurrency mode: https://github.com/scrapfly/python-scrapfly/blob/master/scrapfly/client.py#L369 (asyncio.gather has the same issue: it requires return_exceptions=True to avoid stopping everything, and that results in the same typing experience).
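To illustrate with plain asyncio (a minimal sketch, no Scrapfly code involved): with return_exceptions=True, gather() hands exception objects back in-place instead of raising, which is exactly why the caller ends up type-checking every result.

    import asyncio

    async def ok():
        return "ok"

    async def boom():
        raise RuntimeError("boom")

    async def main():
        # return_exceptions=True: boom()'s error comes back as a list
        # element rather than propagating and cancelling everything.
        results = await asyncio.gather(ok(), boom(), return_exceptions=True)
        for result in results:
            if isinstance(result, Exception):
                print("error:", result)  # -> error: boom
            else:
                print("value:", result)  # -> value: ok

    asyncio.run(main())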
