Effortlessly scrape the web using just a few keywords!
Scraped PDFs are saved to a separate folder, and the data stored in the MongoDB database can be exported as PDFs or JSON.
This guide walks you through setting up a web scraping environment using Scrapy and MongoDB on Ubuntu/Debian Linux systems. With just a few keywords, you'll be able to scrape the web and store the results in a MongoDB database.
Create and activate a Python virtual environment, then install the project dependencies:
python3 -m venv scraper
source scraper/bin/activate
pip install -r requirements.txt
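To confirm the environment is ready, you can check that Scrapy resolves inside the virtualenv:
scrapy version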
Next, add the official MongoDB 6.0 repository and install the server. Note that apt-key is deprecated on Ubuntu 22.04 and later; the repository line below targets focal (Ubuntu 20.04):
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
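You can verify the server installed correctly before moving on:
mongod --version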
Create a data directory and start the MongoDB server, adjusting the path to your project layout:
mkdir -p ~/path/to/your/project/data/db
mongod --dbpath ~/path/to/your/project/data/db
If you are using WSL, you may want to bind to all interfaces so the server is reachable from Windows, specify a different port, or use a configuration file:
mongod --dbpath ~/data/db --bind_ip 0.0.0.0 --port 27017
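If you go the configuration-file route, a minimal sketch might look like the following (the file location, user name, and dbPath are placeholders; adjust them to your setup):

# mongod.conf (YAML) -- minimal example
storage:
  dbPath: /home/youruser/data/db   # directory must exist before mongod starts
net:
  bindIp: 0.0.0.0                  # reachable from Windows under WSL; restrict outside local development
  port: 27017

Then start the server against it:
mongod --config ~/mongod.conf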
Open a new terminal window and connect to the MongoDB server. MongoDB 6.0 ships with the mongosh shell (the legacy mongo shell is no longer included):
mongosh
If mongosh is not found, install it from the MongoDB repository added above:
sudo apt-get update
sudo apt-get install -y mongodb-mongosh
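Once connected, you can confirm the server is responding from inside the shell:
db.runCommand({ ping: 1 })  // should return { ok: 1 }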
Finally, run the crawler with your keywords:
cd webcrawler
scrapy crawl spider -a keywords="climate change" # replace with any keywords you want to scrape for
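After a crawl finishes, one way to export the stored results from MongoDB as JSON is mongoexport, which is installed alongside the mongodb-org packages above. The database and collection names here are placeholders; use the names configured in your project's Scrapy settings:
mongoexport --db=webcrawler --collection=items --jsonArray --out=results.json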