# Scrapy with selenium

[PyPI](https://pypi.python.org/pypi/scrapy-selenium) [Build Status](https://travis-ci.org/clemfromspace/scrapy-selenium) [Test Coverage](https://codeclimate.com/github/clemfromspace/scrapy-selenium/test_coverage) [Maintainability](https://codeclimate.com/github/clemfromspace/scrapy-selenium/maintainability)

Scrapy middleware to handle JavaScript pages using Selenium.

## Installation
```
$ pip install scrapy-selenium
```

You will also need one of the Selenium [compatible browsers](http://www.seleniumhq.org/about/platforms.jsp).

## Configuration
1. Add the browser to use, the path to the executable, and the arguments to pass to the executable to the scrapy settings:
    ```python
    from shutil import which

    SELENIUM_DRIVER_NAME = 'firefox'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
    SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
    ```

2. Add the `SeleniumMiddleware` to the downloader middlewares:
    ```python
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800
    }
    ```

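For Chrome, the equivalent settings might look like the following sketch (this variant is not spelled out in the README; `chromedriver` and the double-dash argument are assumptions based on the comment above):

```python
from shutil import which

# Hypothetical Chrome variant of the Firefox settings above.
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # None if chromedriver is not on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # chrome uses a double dash, unlike firefox
```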
## Usage
Use `scrapy_selenium.SeleniumRequest` instead of the scrapy built-in `Request`:
```python
from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url, self.parse_result)
```
The request will be handled by selenium, and the response will have an additional `meta` key named `driver`, containing the selenium driver that processed the request.
```python
def parse_result(self, response):
    print(response.meta['driver'].title)
```
For more information about the available driver methods and attributes, refer to the [selenium python documentation](http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webdriver).

The `selector` response attribute works as usual (but contains the html processed by the selenium driver).
```python
def parse_result(self, response):
    print(response.selector.xpath('//title/text()').get())
```

### Additional arguments
The `scrapy_selenium.SeleniumRequest` accepts three additional arguments:

#### `wait_time` / `wait_until`

When used, selenium will perform an [Explicit wait](http://selenium-python.readthedocs.io/waits.html#explicit-waits) before returning the response to the spider.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url,
    self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid'))
)
```

#### `screenshot`
When used, selenium will take a screenshot of the page, and the binary data of the captured PNG will be added to the response `meta`:
```python
yield SeleniumRequest(
    url,
    self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```