
Exceptions from concurrent_scrape? #17

Open
@williamhakim10

Description

We have code that looks like this:

        scrapfly = ScrapflyClient(key=self.__scrapfly_api_key, max_concurrency=15)
        targets = [
            ScrapeConfig(
                url=url,
                render_js=True,
                raise_on_upstream_error=False,
                country="us",
                asp=True,
            )
            for url in urls
        ]
        async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
            self.__logger.info(f"Got result: {result}")  # when this code explodes, no log appears
            if isinstance(result, ScrapflyError):  # error from scrapfly itself
                ...
            elif result.error:  # error from upstream
                ...
            else:  # success
                ...

However, this code sometimes explodes inside the async iterator itself: an error like the one below is raised and no result is yielded at all (note that the log line never appears).

<-- 422 | ERR::PROXY::TIMEOUT - Proxy connection or website was too slow and timeout - Proxy or website do not respond after 15s - Check if the website is online or geoblocking, if you are using session, rotate it..Checkout the related doc: https://scrapfly.io/docs/scrape-api/error/ERR::PROXY::TIMEOUT

It seems like there's a bug where the async iterator can itself raise rather than yield the exception, which means the entire process blows up. Any ideas how we might go about fixing this?
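
For what it's worth, one possible workaround is to advance the iterator manually and convert a raised error back into a yielded value, matching the isinstance(result, ScrapflyError) branch above. This is only a minimal sketch under a few assumptions: that ScrapflyError is importable from scrapfly.errors, that it (or a subclass) is what __anext__ raises, and the helper name safe_concurrent_scrape is ours, not part of the SDK:

    from scrapfly.errors import ScrapflyError

    async def safe_concurrent_scrape(client, scrape_configs):
        # Advance the iterator manually so a raised ScrapflyError is
        # yielded back to the caller instead of killing the whole loop.
        iterator = client.concurrent_scrape(scrape_configs=scrape_configs).__aiter__()
        while True:
            try:
                result = await iterator.__anext__()
            except StopAsyncIteration:
                break
            except ScrapflyError as exc:
                # Hand the error to the caller as a value, the same shape
                # the loop above already expects.
                result = exc
            yield result

The caveat is that if concurrent_scrape is an async generator, it is finished once it raises, so this only prevents the crash; any remaining in-flight results are still lost. A real fix would need to happen inside the SDK.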

As an aside, I wanted to point out that the inconsistent use of type annotations throughout the library makes it hard to debug what's actually going on and to reason about which errors can occur and when.
