
Python web scraping framework: Scrapy


1. Abstract

Real projects usually start with obtaining data: text mining, machine learning and data mining all need it, so web scraping is particularly important for data collection. Fortunately, Python provides a very good web scraping framework, Scrapy, which can not only scrape pages but also extract and clean the data they contain. This article discusses and analyzes Scrapy's technology, architecture, operating process and application examples. It also explains the significance and future of the framework in data-driven projects, and discusses its existing shortcomings and their impact.

2. Introduction

Web scraping is an important way to prepare data and is often used to collect data from the network in everyday data systems. This article discusses a powerful web scraping framework, Scrapy. Scrapy is a fast, high-level web crawling[1] and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing[2].

Scrapy Architecture

The Scrapy architecture is composed of the Scrapy Engine, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middlewares and Spider Middlewares, as shown in Figure 1[3].

Figure 1. The Scrapy architecture

Scrapy Process

In Figure 1, the green lines show the direction of data flow; a minimal spider illustrating the Spider's part in this flow is sketched after the list.

  • The Scheduler sends requests -> Engine -> Downloader Middlewares -> Downloader
  • The Downloader sends each request and receives a response -> Downloader Middlewares -> Engine -> Spider Middlewares -> Spider
  • The Spider extracts URL addresses and assembles them into new request objects -> Spider Middlewares -> Engine -> Scheduler
  • The Spider extracts data -> Engine -> Item Pipeline
  • The Item Pipeline processes and saves the data
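
To make the Spider's role in this flow concrete, here is a minimal sketch against the public practice site quotes.toscrape.com (the site and selectors are illustrative, not part of the original article): the parse() callback yields extracted items, which the Engine hands to the Item Pipeline, and yields new requests, which go back to the Scheduler.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Illustrative spider; site and selectors are placeholders.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extracted data: Engine -> Item Pipeline
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                "text": quote.xpath('./span[@class="text"]/text()').get(),
                "author": quote.xpath('.//small[@class="author"]/text()').get(),
            }

        # Extracted URL assembled into a new request: Engine -> Scheduler
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```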

Scrapy Example

Scrapy's installation is documented in detail on its official website, and it works well on Linux, macOS and Windows.
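
In most environments it can be installed directly from PyPI:

```
pip install scrapy
```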

Building a simple Scrapy project takes only five steps (a minimal sketch of the resulting files follows the list):

  • New project: create a Scrapy project, command: scrapy startproject X (X is the name of the project).
  • New spider: create a spider, command: scrapy genspider SpiderName targetWebsite.
  • Define fields: in a Python file (e.g. items.py), specify the fields to extract from the target site.
  • Make spiders: write spiders (e.g. spider.py) to crawl the web pages and parse their content.
  • Store contents: design pipelines (e.g. pipelines.py) to store the crawled contents.
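
As a rough sketch of those five steps, assume a made-up project called bookproject that collects book titles and prices from the practice site books.toscrape.com; the project name, fields and selectors are illustrative only, not a definitive implementation.

```python
# Steps 1-2: create the project and a spider (run in a terminal):
#   scrapy startproject bookproject
#   cd bookproject
#   scrapy genspider books books.toscrape.com

import json
import scrapy


# Step 3 -- items.py: declare the fields of the target site.
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()


# Step 4 -- spiders/books.py: crawl the pages and parse their content.
class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for product in response.xpath('//article[@class="product_pod"]'):
            item = BookItem()
            item["title"] = product.xpath("./h3/a/@title").get()
            item["price"] = product.xpath('.//p[@class="price_color"]/text()').get()
            yield item


# Step 5 -- pipelines.py: store the crawled contents
# (enabled in settings.py through the ITEM_PIPELINES setting).
class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("books.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

The crawl is then started with scrapy crawl books from inside the project directory.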

3. Findings

The Scrapy framework integrates the common functions of web page collection into its modules and leaves only the customized parts to the user, freeing the programmer from tedious, repetitive work. The hard parts of web scraping are dealing with anti-scraping measures and achieving large-scale, efficient and stable crawls. Because Scrapy is based on the Twisted framework, it offers features that are superior to many multithreaded web crawler frameworks.

Twisted Framework

Scrapy uses Twisted, an efficient asynchronous networking framework, to handle network communication. Twisted provides methods that allow network operations to be performed without blocking code execution. Figure 2 shows a comparison of Twisted and multithreaded code.

Figure 2. The comparison of Twisted and multithreaded code [4]

Operating system developers have spent decades optimizing thread handling, so raw performance is no longer the concern it once was. However, writing thread-safe code is very difficult, so Twisted code ends up far simpler and safer than the multithreaded equivalent.
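
To make the contrast concrete, here is a minimal sketch of the non-blocking style Twisted encourages (the URLs are placeholders): several requests are issued from a single thread, and callbacks run as the responses arrive, instead of one thread blocking per request.

```python
from twisted.internet import reactor
from twisted.web.client import Agent, readBody


def fetch(url):
    # Issue a GET request; the Deferred's callbacks fire when the
    # response arrives, so nothing here blocks the calling thread.
    agent = Agent(reactor)
    d = agent.request(b"GET", url)
    d.addCallback(readBody)
    d.addCallback(lambda body: print(url.decode(), len(body), "bytes"))
    return d


# Both requests run concurrently on one thread, driven by the reactor.
for url in [b"http://example.com/", b"http://example.org/"]:
    fetch(url)

reactor.callLater(10, reactor.stop)  # stop the event loop after 10 seconds
reactor.run()
```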

On the positive side, Scrapy is asynchronous, the number of concurrent requests is easy to adjust, and data extraction can be made more readable and efficient by replacing regular expressions with XPath selectors. Moreover, when writing middleware, uniform filters can be applied to make the data more accurate and clean. Scrapy also makes it convenient to scrape data from different URLs at the same time, and its interactive shell makes independent debugging easy. Finally, data is stored through pipelines, which makes storage more flexible.
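
For instance, the interactive shell mentioned above lets selectors be tried against a live page before they are put into a spider; a typical (illustrative) session against the practice site quotes.toscrape.com might look like this:

```
$ scrapy shell "http://quotes.toscrape.com/"
>>> response.status
200
>>> response.xpath("//title/text()").get()
'Quotes to Scrape'
```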

However, there are some shortcomings: on its own Scrapy cannot perform distributed scraping, its memory consumption is high, and it cannot handle web pages that render their content with JavaScript. On the other hand, because it is built on the Twisted framework, an exception in one task does not stop the others, which makes such exceptions difficult to detect and handle.

In data-driven big data and machine learning projects, data preparation is essential and cannot be ignored[5]. More and more technologies are being used to prepare datasets, and an important part of this is collecting data from the network. Compared with traditional web scraping tools, Scrapy has received considerable attention and use thanks to its simple programming model and flexible deployment.

Moreover, many companies have built customized scraping frameworks based on Scrapy to carry out data collection, and many extensions have achieved considerable success on GitHub. For example, Scrapy_redis addresses distributed scraping, and Scrapy-splash integrates JavaScript rendering so that JS can be executed within Scrapy.

4. Conclusions

Scrapy's modular architecture and Twisted-based asynchronous core make it a fast and flexible framework for collecting structured data from the web, which is an essential part of data preparation in data-driven projects. Its main shortcomings, the lack of built-in distributed crawling and of JavaScript rendering, are largely covered by extensions such as Scrapy_redis and Scrapy-splash, so it remains a strong choice for the data-collection stage of data mining and machine learning work.

References

  1. Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann.