A naive crawler-scraper for health domain websites built with Python. This project autonomously extracts health-related content from multiple Indonesian health websites using asynchronous crawling and scraping techniques.
The Healthy Spiders project implements a two-phase pipeline:
The crawler discovers content URLs by processing paginated endpoint lists. It:
- Fetches pagination pages from predefined URL patterns
- Extracts individual content URLs from HTML/JSON responses
- Stores discovered URLs in an SQLite queue database
- Supports multiple categories per website (articles, discussions, news, etc.)
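The crawler steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the table name `pagination`, its columns, the helper names, and the `https://{domain}{pattern}{page}` URL scheme are all assumptions made for the example.

```python
import sqlite3

# Hypothetical config in the same shape as the website configs in config.py.
SITE_CONFIG = {
    "domain": "alodokter.com",
    "pages_2b_crawled": {"/page/": 3},
}


def build_pagination_urls(config):
    """Expand each pagination pattern into numbered page URLs (assumed URL scheme)."""
    urls = []
    for pattern, max_pages in (config["pages_2b_crawled"] or {}).items():
        for page in range(1, max_pages + 1):
            urls.append(f"https://{config['domain']}{pattern}{page}")
    return urls


def enqueue(conn, urls):
    """Insert URLs as 'pending'; the PRIMARY KEY makes repeated inserts a no-op."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pagination ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO pagination (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, used here only for the demo
enqueue(conn, build_pagination_urls(SITE_CONFIG))
pending = conn.execute(
    "SELECT url FROM pagination WHERE status = 'pending' ORDER BY url"
).fetchall()
```

Keying the queue on the URL means a second run over the same patterns does not create duplicate rows, which is one plausible way to make the crawl phase safe to repeat.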
The scraper extracts detailed content from discovered URLs. It:
- Retrieves pending URLs from the database in configurable batches
- Fetches and parses HTML content with BeautifulSoup
- Converts HTML to clean Markdown format
- Extracts metadata (title, date, author, token count, etc.)
- Saves structured data as JSONL (JSON Lines) format
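As a rough sketch of the last two scraper steps, the snippet below assembles one record and serializes it as a single JSONL line. The field names mirror the output structure documented in this README; everything else is an assumption for illustration: the helper names are hypothetical, the hash is assumed to be a SHA-256 digest of the content, and the token count is a whitespace word count standing in for the Qwen3 tokenizer.

```python
import hashlib
import json
import uuid


def build_record(url, domain, category, title, markdown, author, date):
    """Assemble one output record using the README's JSONL field names."""
    return {
        "url": url,
        "domain": domain,
        "category": category,
        "title": title,
        "content": markdown,
        "author": author,
        "date": date,
        # Stand-in only: whitespace word count, NOT the Qwen3 tokenizer.
        "token_count": len(markdown.split()),
        # Assumption: the hash is a SHA-256 hex digest of the Markdown content.
        "hash": hashlib.sha256(markdown.encode("utf-8")).hexdigest(),
        "id": str(uuid.uuid4()),
    }


def to_jsonl_line(record):
    """Serialize to one line; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False) + "\n"


record = build_record(
    url="https://example.com/article",
    domain="example.com",
    category="article",
    title="Judul",
    markdown="# Judul\n\nIsi artikel.",
    author="Author Name",
    date="2024-01-15",
)
line = to_jsonl_line(record)
```

Because `json.dumps` escapes the newlines inside the Markdown content, each record occupies exactly one physical line, which is what lets a JSONL file be appended to and read back record by record.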
The project currently supports crawling and scraping from:
- alodokter.com - Articles and patient-doctor discussions
- biofarma.co.id - Health articles
- pom.go.id - Public health information and news
- halodoc.com - Medical articles
- hellosehat.com - Comprehensive health content
- Asynchronous Processing: Built on Python asyncio for concurrent requests
- Rate Limiting: Configurable concurrency limits to avoid overwhelming servers
- Error Handling: Automatic retry mechanisms and failure tracking
- Logging: Detailed logging with daily rotation using loguru
- Token Counting: Quantifies content using Qwen3 tokenizer
- Database Tracking: SQLite persistence for pagination and URL status
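The concurrency limit and retry behavior listed above could look roughly like the following. This is a sketch under stated assumptions rather than the project's implementation: `fake_fetch` simulates a network call instead of issuing a real request, and `MAX_RETRIES` and the linear backoff are hypothetical, since the README does not state a retry count or backoff policy.

```python
import asyncio

MAX_CONCURRENT = 5  # same name as the config.py setting
MAX_RETRIES = 3     # hypothetical; the README does not state a retry count


async def fake_fetch(url, stats):
    """Stand-in for a real HTTP request; tracks how many calls run at once."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    try:
        await asyncio.sleep(0.01)
        if url.endswith("/bad"):
            raise ConnectionError("simulated failure")
        return f"<html>{url}</html>"
    finally:
        stats["active"] -= 1


async def fetch_with_retry(url, semaphore, stats):
    """Try a URL up to MAX_RETRIES times; return (url, 'done' or 'failed')."""
    for attempt in range(1, MAX_RETRIES + 1):
        async with semaphore:  # at most MAX_CONCURRENT fetches in flight
            try:
                await fake_fetch(url, stats)
                return url, "done"
            except ConnectionError:
                pass
        # Back off outside the semaphore so waiting does not block other URLs.
        await asyncio.sleep(0.01 * attempt)
    return url, "failed"


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_with_retry(u, semaphore, stats) for u in urls)
    )
    return dict(results), stats


urls = [f"https://example.com/{i}" for i in range(20)] + ["https://example.com/bad"]
statuses, stats = asyncio.run(crawl(urls))
```

Returning a status per URL instead of raising lets the caller record failures (for example in the SQLite queue) and retry them on a later run.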
git clone https://github.com/RubikRif/healthy-spiders.git
cd healthy-spiders
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
Edit config.py to customize the crawling and scraping behavior:
BATCH_SIZE = 100 # Number of items to process per batch
DB_PATH = "queue.db" # SQLite database file
MAX_CONCURRENT = 5 # Maximum concurrent async requests
OUTPUT_PATH = 'output/output.jsonl' # Output file path for scraped data
Each website config requires:
- domain: The website domain
- pages_2b_crawled: Dictionary with pagination patterns and max pages to crawl
- contents_2b_scraped: Dictionary with unpatterned URL sources and max pages to scrape
Example configuration:
ALODOKTER_CONFIG = {
    'domain': 'alodokter.com',
    'pages_2b_crawled': {
        '/page/': 100,  # Article pages, max 1162 available
        '/komunitas/diskusi/penyakit/page/': 100  # Discussion pages, max 7423 available
    },
    'contents_2b_scraped': None
}
Control behavior between pipeline runs:
RESET_FIRST_PATTERNED_PAGINATION = True # Reset first pagination batch
RESET_ALL_HALODOC_PAGINATION = True # Reset Halodoc-specific pagination
RESET_ALL_FAILED_PAGINATION = True # Retry failed pagination URLs
RESET_ALL_FAILED_URL = True # Retry failed content URLs
python main.py
The pipeline will:
- Initialize the SQLite database
- Generate pagination URLs for all configured websites
- Crawl all pagination pages to discover content URLs
- Wait 5 seconds between crawler and scraper phases
- Scrape all discovered URLs to extract content
- Save results to output/output.jsonl
- Generate daily log files: healthy_spiders_YYYY-MM-DD.log
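The ordering of the two phases can be illustrated with a small asyncio sketch. The phase functions here are hypothetical stubs; in the project the phases hand URLs to each other through the SQLite queue, not through a return value as shown in this simplified version.

```python
import asyncio

PHASE_PAUSE_SECONDS = 5  # pause between the crawler and scraper phases
events = []              # records the order in which the phases run


async def crawl_phase():
    """Stub crawler: pretends to discover two content URLs."""
    events.append("crawl")
    return ["https://example.com/a", "https://example.com/b"]


async def scrape_phase(urls):
    """Stub scraper: pretends to extract one record per URL."""
    events.append("scrape")
    return [{"url": u} for u in urls]


async def run_pipeline(pause_seconds=PHASE_PAUSE_SECONDS):
    """Crawl first, wait, then scrape what the crawl discovered."""
    urls = await crawl_phase()
    events.append("pause")
    await asyncio.sleep(pause_seconds)
    return await scrape_phase(urls)


# The demo passes 0 so it finishes immediately; the default is 5 seconds.
records = asyncio.run(run_pipeline(pause_seconds=0))
```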
Scraped content is saved in JSONL format with the following structure:
{
  "url": "https://example.com/article",
  "domain": "example.com",
  "category": "article",
  "title": "Article Title",
  "content": "# Article Title\n\nMarkdown formatted content...",
  "author": "Author Name",
  "date": "2024-01-15",
  "token_count": 1250,
  "hash": "abc123def456...",
  "id": "uuid-string"
}
healthy-spiders/
├── main.py # Entry point
├── config.py # Configuration settings
├── requirements.txt # Python dependencies
├── core/
│ ├── engine.py # Main crawler and scraper orchestration
│ ├── router.py # Route tasks to appropriate website handlers
│ ├── database.py # SQLite database operations
│ └── utils.py # Utility functions (token counting, HTML parsing, etc.)
├── spiders/ # Website-specific crawlers and scrapers
│ ├── alodokter.py
│ ├── biofarma.py
│ ├── bpom.py
│ ├── halodoc.py
│ └── hellosehat.py
├── temp/ # Temporary data and tokenizer files
└── output/ # Scraped data output directory
- Python 3.8+
- See requirements.txt for full dependencies
Key dependencies:
- curl_cffi - HTTP requests with CFFI support
- beautifulsoup4 - HTML parsing
- aiofiles & aiosqlite - Async file and database operations
- loguru - Advanced logging
- markdownify - HTML to Markdown conversion
- transformers - Token counting support
Application logs are automatically created and rotated daily:
- Location: healthy_spiders_YYYY-MM-DD.log
- Rotation: Daily at 00:00
- Retention: 1 week of logs
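The project itself logs through loguru. As a dependency-free illustration of the same policy (rotate at midnight, keep about a week), the sketch below uses the standard library's `TimedRotatingFileHandler` instead. It is a swapped-in technique, not the project's setup: the logger name and temporary directory are hypothetical, and the stdlib handler names rotated files `healthy_spiders.log.YYYY-MM-DD` instead of `healthy_spiders_YYYY-MM-DD.log`.

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler

log_dir = tempfile.mkdtemp()  # throwaway directory for the demo
log_path = os.path.join(log_dir, "healthy_spiders.log")

handler = TimedRotatingFileHandler(
    log_path,
    when="midnight",  # roll over daily at 00:00
    backupCount=7,    # keep seven rotated files, roughly one week
    encoding="utf-8",
)
handler.setFormatter(
    logging.Formatter(
        "%(asctime)s | %(levelname)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
)

logger = logging.getLogger("healthy_spiders_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Starting crawler...")
handler.flush()
```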
Example log output:
2024-01-15 10:30:45 | INFO | Starting healthy spiders 💪🕷️...
2024-01-15 10:30:45 | INFO | Starting crawler...
2024-01-15 10:30:47 | INFO | Crawling 100 pending pagination URLs...
MIT License - See LICENSE file for details
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.