Healthy Spiders 💪🕷️

A naive crawler-scraper, built with Python, for health-domain websites. It autonomously extracts health-related content from multiple Indonesian health websites using asynchronous crawling and scraping.

Overview

Crawling-Scraping Mechanism

The Healthy Spiders project implements a two-phase pipeline:

Phase 1: Crawling

The crawler discovers content URLs by processing paginated endpoint lists. It:

Fetches pagination pages from predefined URL patterns
Extracts individual content URLs from HTML/JSON responses
Stores discovered URLs in an SQLite queue database
Supports multiple categories per website (articles, discussions, news, etc.)
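
The crawl phase described above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `urls` table schema and the names `LinkExtractor`, `init_queue`, and `enqueue_links` are hypothetical, and the standard-library `html.parser` stands in for whatever parser the spiders really use.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for a real parser)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def init_queue(conn):
    # Hypothetical schema: one row per discovered content URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, category TEXT, status TEXT DEFAULT 'pending')"
    )


def _count(conn):
    return conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]


def enqueue_links(conn, page_url, html, category):
    """Queue every link found on one pagination page; return how many were new."""
    parser = LinkExtractor()
    parser.feed(html)
    rows = [(urljoin(page_url, href), category) for href in parser.links]
    before = _count(conn)
    # INSERT OR IGNORE skips URLs that are already queued.
    conn.executemany("INSERT OR IGNORE INTO urls (url, category) VALUES (?, ?)", rows)
    conn.commit()
    return _count(conn) - before
```

Keeping the URL as the primary key means a link that appears on several pagination pages is queued only once.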

Phase 2: Scraping

The scraper extracts detailed content from discovered URLs. It:

Retrieves pending URLs from the database in configurable batches
Fetches and parses HTML content with BeautifulSoup
Converts HTML to clean Markdown format
Extracts metadata (title, date, author, token count, etc.)
Saves structured data in JSONL (JSON Lines) format
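
As a rough sketch of the batch loop described above, assuming a hypothetical `urls` table with a `status` column: `open_queue`, `fetch_pending`, and `scrape_pending` are illustrative names, and `scrape_one` is a placeholder for the site-specific fetch-and-parse step, which is not shown here.

```python
import json
import sqlite3


def open_queue(path):
    """Open the queue database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
    )
    return conn


def fetch_pending(conn, batch_size):
    """Return up to `batch_size` URLs still marked 'pending'."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    return [row[0] for row in rows]


def scrape_pending(conn, scrape_one, output_path, batch_size=100):
    """Drain the queue batch by batch; return the number of records written."""
    written = 0
    while True:
        batch = fetch_pending(conn, batch_size)
        if not batch:
            return written
        with open(output_path, "a", encoding="utf-8") as out:
            for url in batch:
                try:
                    record = scrape_one(url)  # placeholder: returns a dict
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    status = "done"
                    written += 1
                except Exception:
                    status = "failed"  # kept in the table for a later retry
                conn.execute("UPDATE urls SET status = ? WHERE url = ?", (status, url))
        conn.commit()
```

Because every URL in a batch leaves the 'pending' state, the loop always terminates, and failed URLs stay in the table where they can be retried later.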

Supported Websites

The project currently supports crawling and scraping from:

alodokter.com - Articles and patient-doctor discussions
biofarma.co.id - Health articles
pom.go.id - Public health information and news
halodoc.com - Medical articles
hellosehat.com - Comprehensive health content

Key Features

Asynchronous Processing: Built on Python asyncio for concurrent requests
Rate Limiting: Configurable concurrency limit (MAX_CONCURRENT) to avoid overwhelming servers
Error Handling: Automatic retry mechanisms and failure tracking
Logging: Detailed logging with daily rotation using loguru
Token Counting: Measures content length in tokens using the Qwen3 tokenizer
Database Tracking: SQLite persistence for pagination and URL status
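
The concurrency limit and retry behaviour listed above can be illustrated with plain asyncio. This is a sketch under assumptions, not the project's engine: `fetch_with_retry` and `fetch_all` are hypothetical names, `fetch` is any coroutine function supplied by the caller, and the retry count and backoff values are arbitrary.

```python
import asyncio


async def fetch_with_retry(fetch, url, semaphore, retries=3, backoff=0.0):
    """Run `fetch(url)` under the semaphore, trying up to `retries` times."""
    for attempt in range(1, retries + 1):
        try:
            async with semaphore:  # only a limited number of fetches hold a slot
                return await fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller record the failure
            # Sleep outside the semaphore so a waiting retry does not block a slot.
            await asyncio.sleep(backoff * attempt)


async def fetch_all(fetch, urls, max_concurrent=5):
    """Fetch every URL concurrently; failures come back as exception objects."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_with_retry(fetch, url, semaphore) for url in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)
```

With `return_exceptions=True`, one failing URL does not cancel the rest; the caller can inspect the result list and mark the failed entries for a later run.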

Getting Started

1. Clone the Repository

git clone https://github.com/RubikRif/healthy-spiders.git
cd healthy-spiders

2. Create and Activate Python Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure the Pipeline

Edit config.py to customize the crawling and scraping behavior:

General Configuration

BATCH_SIZE = 100                    # Number of items to process per batch
DB_PATH = "queue.db"                # SQLite database file
MAX_CONCURRENT = 5                  # Maximum concurrent async requests
OUTPUT_PATH = 'output/output.jsonl' # Output file path for scraped data

Website-Specific Configuration

Each website config requires:

domain: The website domain
pages_2b_crawled: Dictionary with pagination patterns and max pages to crawl
contents_2b_scraped: Dictionary with non-paginated URL sources and max pages to scrape, or None when a website has none

Example configuration:

ALODOKTER_CONFIG = {
    'domain': 'alodokter.com',
    'pages_2b_crawled': {
        '/page/': 100,                              # Article pages, max 1162 available
        '/komunitas/diskusi/penyakit/page/': 100   # Discussion pages, max 7423 available
    },
    'contents_2b_scraped': None
}
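
To illustrate how such a config might expand into concrete pagination URLs, here is a small sketch. The function name `build_pagination_urls`, the `https://` scheme, the page numbering starting at 1, and the rule that the page number is appended directly to the pattern are all assumptions; the real spiders may build URLs differently.

```python
def build_pagination_urls(config):
    """Expand each pagination pattern into one URL per page (assumed 1-based)."""
    domain = config["domain"]
    patterns = config.get("pages_2b_crawled") or {}  # tolerate None
    urls = []
    for path_prefix, max_pages in patterns.items():
        for page in range(1, max_pages + 1):
            urls.append(f"https://{domain}{path_prefix}{page}")
    return urls


# A reduced copy of the example config, limited to a few pages.
SAMPLE_CONFIG = {
    "domain": "alodokter.com",
    "pages_2b_crawled": {
        "/page/": 2,
        "/komunitas/diskusi/penyakit/page/": 1,
    },
    "contents_2b_scraped": None,
}

sample_urls = build_pagination_urls(SAMPLE_CONFIG)
```

Under these assumptions, `sample_urls` holds three entries: two article pages and one discussion page.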

Reset Flags

Control behavior between pipeline runs:

RESET_FIRST_PATTERNED_PAGINATION = True  # Reset first pagination batch
RESET_ALL_HALODOC_PAGINATION = True      # Reset Halodoc-specific pagination
RESET_ALL_FAILED_PAGINATION = True       # Retry failed pagination URLs
RESET_ALL_FAILED_URL = True              # Retry failed content URLs
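
One plausible reading of the "retry failed" flags is a status update in the queue database before a run starts. The sketch below assumes hypothetical table names (`pagination`, `urls`) and status values (`pending`, `failed`); it does not cover the first-batch or Halodoc-specific flags, whose exact semantics are not described here.

```python
import sqlite3

ALLOWED_TABLES = {"pagination", "urls"}  # hypothetical table names


def reset_failed(conn, table, enabled):
    """If the flag is on, move 'failed' rows back to 'pending'; return the row count."""
    if not enabled:
        return 0
    if table not in ALLOWED_TABLES:
        # Table names cannot be bound as SQL parameters, so validate explicitly.
        raise ValueError(f"unexpected table: {table}")
    cursor = conn.execute(
        f"UPDATE {table} SET status = 'pending' WHERE status = 'failed'"
    )
    conn.commit()
    return cursor.rowcount
```

A call such as `reset_failed(conn, "urls", RESET_ALL_FAILED_URL)` would then re-queue failed content URLs only when the flag is True.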

5. Run the Pipeline

python main.py

The pipeline will:

Initialize the SQLite database
Generate pagination URLs for all configured websites
Crawl all pagination pages to discover content URLs
Wait 5 seconds between crawler and scraper phases
Scrape all discovered URLs to extract content
Save results to output/output.jsonl
Generate daily log files: healthy_spiders_YYYY-MM-DD.log

Output Format

Scraped content is saved in JSONL format with the following structure:

{
  "url": "https://example.com/article",
  "domain": "example.com",
  "category": "article",
  "title": "Article Title",
  "content": "# Article Title\n\nMarkdown formatted content...",
  "author": "Author Name",
  "date": "2024-01-15",
  "token_count": 1250,
  "hash": "abc123def456...",
  "id": "uuid-string"
}
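
The record above is shown pretty-printed for readability; in a JSONL file each record occupies a single line. A minimal sketch of consuming the output, with `tokens_per_domain` as a hypothetical helper name, could look like this:

```python
import json
from collections import Counter


def tokens_per_domain(lines):
    """Sum `token_count` per `domain` over an iterable of JSONL lines."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        totals[record["domain"]] += record.get("token_count", 0)
    return totals
```

Since an open file iterates line by line, the helper can be fed a file object directly, for example `tokens_per_domain(open("output/output.jsonl", encoding="utf-8"))`.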

Project Structure

healthy-spiders/
├── main.py              # Entry point
├── config.py            # Configuration settings
├── requirements.txt     # Python dependencies
├── core/
│   ├── engine.py        # Main crawler and scraper orchestration
│   ├── router.py        # Route tasks to appropriate website handlers
│   ├── database.py      # SQLite database operations
│   └── utils.py         # Utility functions (token counting, HTML parsing, etc.)
├── spiders/             # Website-specific crawlers and scrapers
│   ├── alodokter.py
│   ├── biofarma.py
│   ├── bpom.py
│   ├── halodoc.py
│   └── hellosehat.py
├── temp/                # Temporary data and tokenizer files
└── output/              # Scraped data output directory

Requirements

Python 3.8+
See requirements.txt for full dependencies

Key dependencies:

curl_cffi - HTTP requests via libcurl bindings (CFFI), with browser impersonation support
beautifulsoup4 - HTML parsing
aiofiles & aiosqlite - Async file and database operations
loguru - Advanced logging
markdownify - HTML to Markdown conversion
transformers - Loads the tokenizer used for token counting

Logging

Application logs are automatically created and rotated daily:

Location: healthy_spiders_YYYY-MM-DD.log
Rotation: Daily at 00:00
Retention: 1 week of logs

Example log output:

2024-01-15 10:30:45 | INFO | Starting healthy spiders 💪🕷️...
2024-01-15 10:30:45 | INFO | Starting crawler...
2024-01-15 10:30:47 | INFO | Crawling 100 pending pagination URLs...
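
For readers who want to reproduce a similar setup without extra dependencies, the sketch below uses Python's standard `logging` module instead of loguru, so it is a stand-in and not the project's configuration. `build_logger` is a hypothetical name, and rotated files get the standard library's suffix naming (for example `healthy_spiders.log.2024-01-15`), which differs from the file names listed above.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_logger(log_dir, name="healthy_spiders"):
    """Return a logger that rotates at midnight and keeps seven rotated files."""
    handler = TimedRotatingFileHandler(
        str(Path(log_dir) / f"{name}.log"),
        when="midnight",  # roll over at 00:00 local time
        backupCount=7,    # keep roughly one week of old logs
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s | %(levelname)s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Calling `build_logger(".").info("Starting crawler...")` writes a line in the same `timestamp | LEVEL | message` shape as the example output.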

License

MIT License - See LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
