Introduction
Redis-based components for Scrapy.
- Free software: MIT license
- Python support: 3.8+
- Scrapy support: 2.6+
- Distributed crawling/scraping
  - You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.
- Distributed post-processing
  - Scraped items get pushed into a Redis queue, so you can start as many post-processing processes as needed, all sharing the same items queue.
- Scrapy plug-and-play components
  - Scheduler + Duplication Filter, Item Pipeline, Base Spiders (see the configuration sketch after this list).
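These components are wired into a project through Scrapy's settings. The following is a minimal sketch of a `settings.py` using scrapy-redis's setting names; the Redis URL is a placeholder assumption for a local instance.

```python
# settings.py -- minimal scrapy-redis wiring (a sketch; adapt to your project)

# Route all requests through the Redis-backed scheduler so every spider
# instance shares one request queue and one duplicate filter.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Push scraped items into Redis for distributed post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Location of the shared Redis instance (assumed local default here).
REDIS_URL = "redis://localhost:6379/0"
```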
In this forked version we added JSON-supported data in Redis. Each queued entry contains `url`, `meta`, and other optional parameters. `meta` is a nested JSON object that carries sub-data. The fork extracts this data and sends another FormRequest with the `url`, `meta`, and additional `formdata`. For example:

`{"url": "https://example.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas"}`

This data can then be accessed in the Scrapy spider through the response, e.g. `response.url`, `response.meta`, `response.url_cookie_key`.
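The fork's exact implementation is not shown on this page, but the behaviour described above maps naturally onto scrapy-redis's `make_request_from_data` hook. The spider below is a hypothetical sketch of that idea; the spider name, Redis key, and the way `formdata` is derived are assumptions, not the fork's actual code.

```python
import json

from scrapy import FormRequest
from scrapy_redis.spiders import RedisSpider


class JsonQueueSpider(RedisSpider):
    """Hypothetical spider consuming JSON-encoded jobs from a Redis list."""

    name = "json_queue"                  # placeholder spider name
    redis_key = "json_queue:start_urls"  # placeholder queue key

    def make_request_from_data(self, data):
        # Each queue entry is a JSON document such as:
        # {"url": "...", "meta": {...}, "url_cookie_key": "..."}
        job = json.loads(data)
        url = job.pop("url")
        meta = job.pop("meta", {})
        # Remaining top-level keys (e.g. url_cookie_key) ride along in
        # meta so the callback can read them from the response.
        meta.update(job)
        # Send a FormRequest carrying the nested meta values as form data.
        formdata = {key: str(value) for key, value in meta.items()}
        return FormRequest(url, formdata=formdata, meta=meta,
                           callback=self.parse)

    def parse(self, response):
        # Values queued in Redis are available via response.meta.
        self.logger.info("job-id=%s", response.meta.get("job-id"))
```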
These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
Requirements
- Python 3.8, 3.9, 3.10, 3.11
- Redis >=5.0
- Scrapy >=2.6.0
- redis-py >=4.2
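As a usage illustration of the distributed post-processing described above, the consumer sketch below pops serialized items from Redis with redis-py. It assumes the default scrapy-redis items key pattern `<spider>:items` and a placeholder spider named `myspider`.

```python
import json

import redis

# Connect to the same Redis instance the crawlers use (assumed local).
r = redis.Redis.from_url("redis://localhost:6379/0")

# scrapy-redis's RedisPipeline pushes items to "<spider>:items" by default;
# "myspider" is a placeholder spider name.
ITEMS_KEY = "myspider:items"

while True:
    # Block until an item arrives, then decode and process it.
    _, raw = r.blpop(ITEMS_KEY)
    item = json.loads(raw)
    print("processed item from:", item.get("url"))
```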