Commit adf5fb2

vdusek and claude committed

fix: migrate to Scrapy's native AsyncCrawlerRunner

Adopt Scrapy 2.14's `AsyncCrawlerRunner` to eliminate the Deferred conversion layer (`deferred_to_future`). The `run_scrapy_actor` function now handles asyncio reactor installation internally, removing boilerplate from user code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent ea1f52b · commit adf5fb2

15 files changed: +40 −62 lines changed

docs/03_guides/06_scrapy.mdx
Lines changed: 3 additions & 3 deletions

```diff
@@ -17,13 +17,13 @@ import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py'
 
 ## Integrating Scrapy with the Apify platform
 
-The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install Twisted's `asyncioreactor` to run Twisted's asyncio-compatible event loop. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
+The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install Twisted's `asyncioreactor` to run Twisted's asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
 
 <CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
 {UnderscoreMainExample}
 </CodeBlock>
 
-In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
+In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` installs Twisted's asyncio-compatible reactor and bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
 
 Make sure the `SCRAPY_SETTINGS_MODULE` environment variable is set to the path of the Scrapy settings module. This variable is also used by the `Actor` class to detect that the project is a Scrapy project, triggering additional actions.
 
@@ -47,7 +47,7 @@ Additional helper functions in the [`apify.scrapy`](https://github.com/apify/api
 - `apply_apify_settings` - Applies Apify-specific components to Scrapy settings.
 - `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects.
 - `initialize_logging` - Configures logging for the Actor environment.
-- `run_scrapy_actor` - Bridges asyncio and Twisted event loops.
+- `run_scrapy_actor` - Installs Twisted's asyncio reactor and bridges asyncio and Twisted event loops.
 
 ## Create a new Apify-Scrapy project
```
docs/03_guides/code/scrapy_project/src/__main__.py
Lines changed: 0 additions & 6 deletions

```diff
@@ -1,11 +1,5 @@
 from __future__ import annotations
 
-from scrapy.utils.reactor import install_reactor
-
-# Install Twisted's asyncio reactor before importing any other Twisted or
-# Scrapy components.
-install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
-
 import os
 
 from apify.scrapy import initialize_logging, run_scrapy_actor
```
docs/03_guides/code/scrapy_project/src/main.py
Lines changed: 5 additions & 6 deletions

```diff
@@ -1,8 +1,8 @@
 from __future__ import annotations
+
 import asyncio
 
-from scrapy.crawler import CrawlerRunner
-from scrapy.utils.defer import deferred_to_future
+from scrapy.crawler import AsyncCrawlerRunner
 
 from apify import Actor
 from apify.scrapy import apply_apify_settings
@@ -23,14 +23,13 @@ async def main() -> None:
         # Apply Apify settings, which will override the Scrapy project settings.
         settings = apply_apify_settings(proxy_config=proxy_config)
 
-        # Create CrawlerRunner and execute the Scrapy spider.
-        crawler_runner = CrawlerRunner(settings)
-        crawl_deferred = crawler_runner.crawl(
+        # Create AsyncCrawlerRunner and execute the Scrapy spider.
+        crawler_runner = AsyncCrawlerRunner(settings)
+        await crawler_runner.crawl(
             Spider,
             start_urls=start_urls,
             allowed_domains=allowed_domains,
         )
-        await deferred_to_future(crawl_deferred)
 
 
 if __name__ == '__main__':
```
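The API change this diff captures: `CrawlerRunner.crawl()` returns a Twisted Deferred, which asyncio code could only await after a `deferred_to_future()` conversion, while `AsyncCrawlerRunner.crawl()` is a coroutine that is awaited directly. A rough stdlib sketch of that difference, in which an asyncio `Future` stands in for the Deferred and both runner classes are simplified stand-ins, not the Scrapy API:

```python
import asyncio

class OldRunner:
    # Stand-in for CrawlerRunner: crawl() returns a Deferred-like object
    # (modeled here by an already-resolved asyncio Future) that previously
    # needed a deferred_to_future() conversion step before awaiting.
    def crawl(self) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        fut.set_result('crawled')
        return fut

class NewRunner:
    # Stand-in for AsyncCrawlerRunner: crawl() is a coroutine, so callers
    # simply `await runner.crawl(...)` with no conversion layer.
    async def crawl(self) -> str:
        return 'crawled'

async def demo() -> tuple[str, str]:
    old = await OldRunner().crawl()  # old style: await the converted object
    new = await NewRunner().crawl()  # new style: await crawl() directly
    return old, new

print(asyncio.run(demo()))  # ('crawled', 'crawled')
```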

pyproject.toml
Lines changed: 1 addition & 5 deletions

```diff
@@ -50,7 +50,7 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
-scrapy = ["scrapy>=2.11.0"]
+scrapy = ["scrapy>=2.14.0"]
 
 [project.urls]
 "Apify Homepage" = "https://apify.com"
@@ -161,10 +161,6 @@ indent-style = "space"
     "PLR2004", # Magic value used in comparison, consider replacing `{value}` with a constant variable
     "PLW0603", # Using the global statement to update `{name}` is discouraged
 ]
-"**/docs/**/scrapy_project/**/__main__.py" = [
-    # Because of asyncioreactor.install() call.
-    "E402", # Module level import not at top of file
-]
 "**/docs/**/scrapy_project/**" = [
     # Local imports are mixed up with the Apify SDK.
     "I001", # Import block is un-sorted or un-formatted
```
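Note the specifier change as well as the version bump: per PEP 440, a compatible-release pin like `scrapy~=2.12.0` means `>=2.12.0, ==2.12.*`, so it would reject Scrapy 2.14 entirely; the new `scrapy>=2.14.0` accepts 2.14 and later. A rough model of the `~=` semantics for three-part versions (simplified, not a PEP 440 implementation):

```python
def compatible_release(version: tuple[int, int, int], spec: tuple[int, int, int]) -> bool:
    # Rough model of PEP 440 "compatible release" (~=) for X.Y.Z versions:
    # "~=2.12.0" is equivalent to ">=2.12.0, ==2.12.*".
    return version >= spec and version[:2] == spec[:2]

# Scrapy 2.14.0 does NOT satisfy the old pin "scrapy~=2.12.0"...
print(compatible_release((2, 14, 0), (2, 12, 0)))  # False
# ...while any 2.12.x patch release does.
print(compatible_release((2, 12, 5), (2, 12, 0)))  # True
```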

src/apify/scrapy/_actor_runner.py
Lines changed: 11 additions & 13 deletions

```diff
@@ -3,24 +3,22 @@
 import asyncio
 from typing import TYPE_CHECKING
 
-from twisted.internet.defer import Deferred, ensureDeferred
-from twisted.internet.task import react
-
 if TYPE_CHECKING:
     from collections.abc import Coroutine
 
 
-async def _run_coro_as_deferred(coro: Coroutine) -> None:
-    """Wrap the given asyncio coroutine in a Task and await its result as a Twisted Deferred."""
-    task = asyncio.ensure_future(coro)
-    await Deferred.fromFuture(task)
-
-
 def run_scrapy_actor(coro: Coroutine) -> None:
     """Start Twisted's reactor and execute the provided Actor coroutine.
 
-    This function initiates the Twisted reactor and runs the given asyncio coroutine (typically the
-    Actor's main) by converting it to a Deferred. This bridges the asyncio and Twisted event loops,
-    enabling the Apify and Scrapy integration to work together.
+    This function installs Twisted's asyncio-compatible reactor, then initiates it and runs the given asyncio
+    coroutine (typically the Actor's main) by converting it to a Deferred. This bridges the asyncio and Twisted
+    event loops, enabling the Apify and Scrapy integration to work together.
     """
-    react(lambda _: ensureDeferred(_run_coro_as_deferred(coro)))
+    from scrapy.utils.reactor import install_reactor  # noqa: PLC0415
+
+    install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
+
+    from twisted.internet.defer import Deferred  # noqa: PLC0415
+    from twisted.internet.task import react  # noqa: PLC0415
+
+    react(lambda _reactor: Deferred.fromFuture(asyncio.ensure_future(coro)))
```
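The core of the new body is `Deferred.fromFuture(asyncio.ensure_future(coro))`: the coroutine is wrapped in an asyncio Task, and the reactor (which now runs on the same asyncio loop) waits on it via a Deferred. A Twisted-free analogue of that wrap-and-drive pattern, using only the stdlib (the helper name and the stand-in coroutine are illustrative, not the SDK's API):

```python
import asyncio

async def actor_main() -> str:
    # Stand-in for the Actor's main coroutine passed to run_scrapy_actor.
    await asyncio.sleep(0)
    return 'done'

def run_actor_coroutine(coro):
    # Twisted-free analogue of run_scrapy_actor's body: wrap the coroutine
    # in a Task on the event loop and drive it to completion. In the real
    # helper, react() starts the asyncio-backed reactor and
    # Deferred.fromFuture(asyncio.ensure_future(coro)) does the wrapping.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(asyncio.ensure_future(coro, loop=loop))
    finally:
        loop.close()

print(run_actor_coroutine(actor_main()))  # done
```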

tests/e2e/test_actor_scrapy.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -28,7 +28,7 @@ async def test_actor_scrapy_title_spider(
     actor = await make_actor(
         'actor-scrapy-title-spider',
         source_files=actor_source_files,
-        additional_requirements=['scrapy~=2.12.0'],
+        additional_requirements=['scrapy>=2.14.0'],
     )
     run_result = await run_actor(
         actor,
```

tests/e2e/test_scrapy/actor_source/__main__.py
Lines changed: 3 additions & 8 deletions

```diff
@@ -1,14 +1,9 @@
 from __future__ import annotations
 
-from scrapy.utils.reactor import install_reactor
+import os
 
-install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
-
-import os  # noqa: E402, I001
-
-from apify.scrapy import initialize_logging, run_scrapy_actor  # noqa: E402
-
-from .main import main  # noqa: E402
+from .main import main
+from apify.scrapy import initialize_logging, run_scrapy_actor
 
 os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'
 
```
Lines changed: 5 additions & 7 deletions

```diff
@@ -1,16 +1,14 @@
-from __future__ import annotations  # noqa: I001
+from __future__ import annotations
 
-from scrapy.crawler import CrawlerRunner
-from scrapy.utils.defer import deferred_to_future
+from scrapy.crawler import AsyncCrawlerRunner
 
+from .spiders import Spider  # ty: ignore[unresolved-import]
 from apify import Actor
 from apify.scrapy import apply_apify_settings
 
-from .spiders import Spider  # ty: ignore[unresolved-import]
-
 
 async def main() -> None:
     async with Actor:
         settings = apply_apify_settings()
-        runner = CrawlerRunner(settings)
-        await deferred_to_future(runner.crawl(Spider, start_urls=['http://localhost:8080/']))
+        runner = AsyncCrawlerRunner(settings)
+        await runner.crawl(Spider, start_urls=['http://localhost:8080/'])
```
Lines changed: 5 additions & 7 deletions

```diff
@@ -1,19 +1,17 @@
-from __future__ import annotations  # noqa: I001
+from __future__ import annotations
 
 import os
 
-from scrapy.crawler import CrawlerRunner
-from scrapy.utils.defer import deferred_to_future
+from scrapy.crawler import AsyncCrawlerRunner
 
+from .spiders import Spider  # ty: ignore[unresolved-import]
 from apify import Actor
 from apify.scrapy import apply_apify_settings
 
-from .spiders import Spider  # ty: ignore[unresolved-import]
-
 
 async def main() -> None:
     async with Actor:
         os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings_custom_pipeline'
         settings = apply_apify_settings()
-        runner = CrawlerRunner(settings)
-        await deferred_to_future(runner.crawl(Spider, start_urls=['http://localhost:8080/']))
+        runner = AsyncCrawlerRunner(settings)
+        await runner.crawl(Spider, start_urls=['http://localhost:8080/'])
```

tests/e2e/test_scrapy/test_basic_spider.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -12,7 +12,7 @@ async def test_basic_spider(make_actor: MakeActorFunction, run_actor: RunActorFu
     actor = await make_actor(
         label='scrapy-basic',
         source_files=get_scrapy_source_files('spider_basic.py', 'BasicSpider'),
-        additional_requirements=['scrapy~=2.12.0'],
+        additional_requirements=['scrapy>=2.14.0'],
     )
     run_result = await run_actor(actor)
     await verify_spider_results(actor, run_result)
```
