This is a web crawler built on top of Node.js for harvesting unique question URLs and their corresponding information, such as reference counts, upvotes, and answers, from Stack Overflow.
First, get the repo ready on your PC by cloning it from GitHub:
```bash
git clone https://github.com/Bucephalus-lgtm/crawler.git
```
Then, install the required dependencies:
```bash
cd crawler
npm install
```

Then, simply run:

```bash
npm start
```

Features:

- Finds every unique question URL on the site.
- Counts the total references for every URL.
- Counts the total upvotes and answers for every question.
- Saves the data to MongoDB after it is fully processed.
- Saves the data harvested so far as a CSV file whenever the process gets killed.
- The whole script is written asynchronously.
How it works:

- Within an anonymous async function (so it runs immediately), `fetchPage(url)` is called with the starting URL as input (a minimal sketch of this flow follows the list).
- This function is responsible for parsing the HTML elements and thus obtaining the required fields.
- Within `fetchPage(url)`:
  - Axios makes the HTTP request to the input URL.
  - Cheerio then parses the HTML.
  - Check whether the current page index is the end page.
  - If it is the end page, return the array with the data harvested so far, i.e. `mergedArray`.
  - Else, proceed.
  - A query extracts the required fields.
  - The current page number is then incremented by 1, which yields `nextUrl`.
  - `mergedArray` is updated with the data harvested from the current page.
  - `fetchPage(url)` is called recursively with `nextUrl` as input.
- After the `fetchPage(url)` call completes, the data is saved to the MongoDB database (see the sketch below).
- Importantly, we keep checking whether the Node.js process has been killed.
- When the process is killed, the data harvested so far is saved as `output.csv` in the root directory (see the last sketch below).
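
For illustration, here is a minimal sketch of the recursive crawl described above. The CSS selectors, field names, and URL pattern are assumptions made for the sketch; the repo's actual Cheerio queries may differ.

```js
const axios = require("axios");
const cheerio = require("cheerio");

// Minimal sketch of the recursive fetchPage flow. The selectors
// (".s-post-summary", "a.s-link", ...) and the end-page check are
// hypothetical; the real queries in the repo may differ.
async function fetchPage(url, page, mergedArray) {
  // Axios makes the HTTP request to the input URL.
  const { data: html } = await axios.get(url);

  // Cheerio then parses the HTML.
  const $ = cheerio.load(html);

  // If the page has no questions, treat it as the end page and
  // return the data harvested so far.
  const questions = $(".s-post-summary");
  if (questions.length === 0) return mergedArray;

  // Query the required fields for every question on the page.
  questions.each((_, el) => {
    mergedArray.push({
      url: $(el).find("a.s-link").attr("href"),
      upvotes: $(el).find(".s-post-summary--stats-item-number").first().text(),
      answers: $(el).find(".has-answers .s-post-summary--stats-item-number").text(),
    });
  });

  // Increment the current page by 1 to obtain nextUrl, then recurse.
  const nextPage = page + 1;
  const nextUrl = `https://stackoverflow.com/questions?page=${nextPage}`;
  return fetchPage(nextUrl, nextPage, mergedArray);
}
```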
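
Saving the result to MongoDB after the crawl finishes could look like the following sketch. The connection string, database, and collection names are assumptions, and the repo may use a different driver or schema.

```js
const { MongoClient } = require("mongodb");

// Sketch: an anonymous async function runs the crawl immediately and
// saves the result to MongoDB once fetchPage returns. The URI and the
// db/collection names are placeholders.
(async () => {
  const mergedArray = await fetchPage(
    "https://stackoverflow.com/questions?page=1",
    1,
    []
  );

  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  await client.db("crawler").collection("questions").insertMany(mergedArray);
  await client.close();
})();
```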
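
Finally, a sketch of the save-on-kill behaviour: a signal handler dumps whatever has been harvested so far to `output.csv`. The column names and the shared `mergedArray` variable are assumptions carried over from the sketches above.

```js
const fs = require("fs");

// The array the crawler appends to while running (hypothetical name,
// shared with the fetchPage sketch above).
const mergedArray = [];

// When the Node.js process is killed (e.g. Ctrl+C sends SIGINT), save
// the data harvested so far as output.csv in the root directory.
process.on("SIGINT", () => {
  const header = "url,upvotes,answers";
  const rows = mergedArray.map((q) => `${q.url},${q.upvotes},${q.answers}`);
  fs.writeFileSync("output.csv", [header, ...rows].join("\n"));
  process.exit(0);
});
```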