scraper

A simple tool to scrape all the data on dd.meteo.gc.ca.

On 2019-07-10, we indexed 15 million files, with the following distribution of file extensions:

[Figure: file format distribution]

How to use

Requirements

  • Node for scraping
  • Redis for queuing and distributing work across multiple workers
  • CouchDB for indexing all the entries

Usage

To start scraping dd.meteo.gc.ca from its root '/', add an entry to the Redis queue:

    redis-cli -n 2 rpush url-0 /

Then start the scraper:

    COUCHDB_URL=http://username:password@localhost:5984 node scraper.js
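This README does not describe how scraper.js walks the site. As a rough illustration only, and assuming the server returns Apache-style HTML directory listings, a worker could split each listing into sub-directories (to push back onto the queue) and files (to index). The helper names `splitListing` and `extensionOf` and the sample file names are hypothetical and are not taken from scraper.js.

```javascript
// Hypothetical helper, NOT taken from scraper.js: split the links found in an
// Apache-style directory listing into sub-directories and files.
function splitListing(basePath, html) {
  const dirs = [];
  const files = [];
  const hrefPattern = /href="([^"]+)"/g;
  for (const match of html.matchAll(hrefPattern)) {
    const href = match[1];
    // Skip column-sort links, absolute links and the parent directory.
    if (
      href.startsWith("?") ||
      href.startsWith("/") ||
      href.startsWith("..") ||
      href.includes("://")
    ) {
      continue;
    }
    const fullPath = basePath + href;
    if (href.endsWith("/")) {
      dirs.push(fullPath); // candidate for the work queue
    } else {
      files.push(fullPath); // candidate for the index
    }
  }
  return { dirs, files };
}

// Hypothetical helper: lower-cased file extension, or "" when the name has none.
function extensionOf(path) {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// Made-up sample listing, for illustration only.
const sample =
  '<a href="?C=N;O=D">Name</a>' +
  '<a href="/">Parent Directory</a>' +
  '<a href="observations/">observations/</a>' +
  '<a href="readme.TXT">readme.TXT</a>';

const { dirs, files } = splitListing("/", sample);
console.log(dirs, files, files.map(extensionOf));
```

Under these assumptions, the paths in `dirs` would be queued for further scraping, while the paths in `files` (and their extensions) would be written to the index.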