Website Scraper

A simple tool for scraping website content. It extracts headings and body text and explicitly ignores navigation and footer elements.
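As a rough illustration of that extraction rule, here is a minimal sketch using only Python's standard library. It is not the tool's actual implementation: the names `ContentExtractor` and `extract_content` are hypothetical, and treating `h1`-`h6` and `p` as "headings and body text" (plus skipping `script` and `style`) is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tag sets are illustrative, not taken from the tool itself.
SKIPPED = {"nav", "footer", "script", "style"}
CAPTURED = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class ContentExtractor(HTMLParser):
    """Collects text from headings and paragraphs outside nav/footer."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.capturing = False   # True while inside a heading or paragraph
        self.buffer = []         # text pieces of the current block
        self.blocks = []         # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CAPTURED and self.skip_depth == 0:
            self.buffer = []
            self.capturing = True

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in CAPTURED and self.capturing:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(text)
            self.capturing = False

    def handle_data(self, data):
        if self.capturing and self.skip_depth == 0:
            self.buffer.append(data)


def extract_content(html: str) -> list[str]:
    """Return heading and paragraph text, ignoring nav and footer content."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_content(page_html)` on a page returns a list of strings, one per heading or paragraph, with anything inside `nav` or `footer` left out.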

The one exception: the tool finds all nav elements, extracts any links to other pages on the same domain, and scrapes those pages automatically as well.
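The nav-link discovery step could look something like the sketch below. It assumes a standard-library approach and uses hypothetical names (`NavLinkCollector`, `same_domain_nav_links`); comparing hosts exactly and dropping `#fragment` parts are also assumptions, and the real tool may handle these differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class NavLinkCollector(HTMLParser):
    """Collects href values from <a> tags that sit inside a <nav> element."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0   # > 0 while inside one or more <nav> elements
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def same_domain_nav_links(html: str, page_url: str) -> list[str]:
    """Return absolute, de-duplicated nav links on the same host as page_url."""
    collector = NavLinkCollector()
    collector.feed(html)
    collector.close()

    domain = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment (assumed behavior).
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        is_web_link = parsed.scheme in ("http", "https")
        if is_web_link and parsed.netloc == domain and absolute not in links:
            links.append(absolute)
    return links
```

Restricting discovery to `nav` keeps the crawl limited to the pages the site itself presents as its main sections, rather than following every link found in the body.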

Note: The initial version of this script extracted every link on the page, and that caused a lot of pain and errors lol. This version should work as a quick start for gathering copy from websites you work on.

Installation

Note: this script requires Python 3.11 or greater.

pip install git+https://github.com/brillnt/website-scraper.git

Usage

website-scraper https://example.com

This will create a directory with the domain name and save all scraped content to text files within that directory.

Example output:

example.com/
├── index.txt
├── about.txt
└── contact-us.txt
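One way the URL-to-file mapping behind a layout like this could work is sketched below. The helpers `output_path_for` and `save_page` are hypothetical, and the naming rules (an empty path becomes `index`, nested paths are joined with `-`) are assumptions inferred from the example output rather than confirmed behavior.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, root: Path = Path(".")) -> Path:
    """Map a page URL to <root>/<domain>/<slug>.txt (assumed naming scheme)."""
    parsed = urlparse(url)
    # "/" -> "index"; "/contact-us" -> "contact-us"; "/a/b" -> "a-b" (assumption)
    slug = parsed.path.strip("/").replace("/", "-") or "index"
    return root / parsed.netloc / f"{slug}.txt"


def save_page(url: str, text: str, root: Path = Path(".")) -> Path:
    """Write scraped text to the file for url, creating the domain directory."""
    path = output_path_for(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Under these assumptions, `save_page("https://example.com/about", text)` writes to `example.com/about.txt` relative to the current directory.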
