A robust Python-based web scraper that extracts comprehensive data from Y Combinator's startup directory, including company information and founder LinkedIn profiles.
- 🔍 Multi-Strategy Extraction: 4-tier fallback system for maximum data coverage
- ⚡ High Performance: Concurrent scraping with configurable worker threads
- 🎯 Data Validation: Strict validation framework ensures high-quality data
- 🛡️ Robust Error Handling: Graceful degradation with comprehensive logging
- 📊 Flexible Configuration: Environment-based configuration for easy customization
- 💾 CSV Export: Clean, structured output with all required fields
For each startup:
- Company Name
- Y Combinator Batch (e.g., W23, S22)
- Short Description (one-liner)
- Founder Name(s)
- Founder LinkedIn URL(s)
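One row of the resulting CSV can be pictured as below. The column names and the "; "-joined multi-founder convention are illustrative, not the scraper's exact output schema (the real scraper exports via pandas; the stdlib `csv` module is used here to keep the sketch minimal):

```python
import csv
import io

# Illustrative column names -- the actual scraper may label them differently.
FIELDS = ["company_name", "batch", "description", "founder_names", "founder_linkedin_urls"]

row = {
    "company_name": "Example AI",
    "batch": "W23",
    "description": "One-liner describing the startup.",
    "founder_names": "Jane Doe; John Smith",  # multiple founders joined with "; "
    "founder_linkedin_urls": "https://www.linkedin.com/in/janedoe; https://www.linkedin.com/in/johnsmith",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```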
- Python 3.8 or higher
- pip package manager
- Clone the repository

```bash
git clone https://github.com/ShariniN/yc-startup-scraper.git
cd yc-startup-scraper
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Configure environment variables

```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your Algolia credentials
# You can find these in Y Combinator's website network requests
```

Then run the scraper:

```bash
python yc_algolia_scraper.py
```

This will:
- Fetch 500 companies from Y Combinator's directory
- Extract founder information using multi-strategy approach
- Save results to `yc_startups_enhanced.csv`
- Display progress and statistics
Edit the `.env` file to customize scraper behavior:

```bash
# Number of companies to scrape
COMPANIES_LIMIT=500

# Number of concurrent worker threads
MAX_WORKERS=10

# Request timeout in seconds
REQUEST_TIMEOUT=15

# Delay between requests (seconds)
RATE_LIMIT_DELAY=0.3

# Output CSV filename
OUTPUT_FILE=yc_startups_enhanced.csv
```

The scraper employs a multi-tiered extraction strategy:
1. Algolia API Integration
   - Discovers Y Combinator's search infrastructure
   - Efficiently retrieves company metadata
   - Checks for pre-existing founder data in the API response
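The Algolia step can be sketched as follows. The application ID, search key, and index name are placeholders you would copy from the network requests mentioned in the setup section, and the payload shape is a minimal guess, not the scraper's exact request:

```python
def algolia_payload(limit: int, page: int = 0) -> dict:
    """Build a minimal query body for one page of company results."""
    return {"query": "", "hitsPerPage": limit, "page": page}

def fetch_companies(app_id: str, search_key: str, index: str, limit: int = 500) -> list:
    """POST a search to Algolia's REST query endpoint and return the hits."""
    import requests  # HTTP dependency kept local to this sketch

    url = f"https://{app_id}-dsn.algolia.net/1/indexes/{index}/query"
    headers = {
        "X-Algolia-Application-Id": app_id,
        "X-Algolia-API-Key": search_key,
    }
    resp = requests.post(url, json=algolia_payload(limit), headers=headers, timeout=15)
    resp.raise_for_status()
    return resp.json().get("hits", [])
```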
2. Four-Strategy Founder Extraction
   - Strategy 1: Parse Next.js embedded JSON (`__NEXT_DATA__`)
   - Strategy 2: Extract from legacy `data-page` attributes
   - Strategy 3: Semantic HTML parsing of founder sections
   - Strategy 4: Broad LinkedIn URL detection with strict validation
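Strategy 1 can be sketched roughly as below. The synthetic HTML and the `props`/`founders` JSON path are invented for the demonstration; the real pages embed a different structure:

```python
import json

from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict:
    """Pull the JSON blob that Next.js embeds in a __NEXT_DATA__ script tag."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or not tag.string:
        return {}
    return json.loads(tag.string)

# Minimal demonstration with a synthetic page (the real JSON shape differs):
html = ('<html><body><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"founders": [{"name": "Jane Doe"}]}}'
        '</script></body></html>')
data = extract_next_data(html)
print(data["props"]["founders"][0]["name"])  # prints: Jane Doe
```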
3. Data Validation Framework
   - Names must be 2-4 capitalized words
   - No numbers or special characters
   - Blacklist filtering removes false positives
   - URL normalization and deduplication
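The validation rules above can be sketched like this. The blacklist entries are illustrative, and the name pattern is deliberately strict (it would also reject names such as "McDonald"), trading recall for precision:

```python
import re
from urllib.parse import urlparse

# Illustrative blacklist -- the real scraper's list is larger.
BLACKLIST = {"privacy policy", "terms of service", "y combinator"}

def is_valid_name(name: str) -> bool:
    """Accept only 2-4 capitalized, letters-only words not on the blacklist."""
    words = name.split()
    if not 2 <= len(words) <= 4:
        return False
    if name.lower() in BLACKLIST:
        return False
    return all(re.fullmatch(r"[A-Z][a-z]+", w) for w in words)

def normalize_linkedin_url(url: str) -> str:
    """Drop query strings, fragments, and trailing slashes so duplicates collapse."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}".rstrip("/")
```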
4. Concurrent Processing
   - ThreadPoolExecutor with configurable workers
   - Rate limiting to respect server resources
   - Comprehensive error handling
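The concurrency layer can be sketched as below, with a stand-in for the real per-company scrape; `MAX_WORKERS` and `RATE_LIMIT_DELAY` mirror the `.env` settings, and the per-worker sleep is one simple way to honor the delay, not necessarily the scraper's exact scheme:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 10
RATE_LIMIT_DELAY = 0.3

def scrape_company(slug: str) -> dict:
    """Stand-in for the real per-company scrape."""
    time.sleep(RATE_LIMIT_DELAY)  # crude rate limiting inside each worker
    return {"slug": slug, "founders": []}

def scrape_all(slugs):
    """Fan out over a thread pool; collect failures instead of aborting."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(scrape_company, s): s for s in slugs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:  # one failed company should not stop the run
                errors.append((futures[fut], exc))
    return results, errors

results, errors = scrape_all(["airbnb", "stripe", "dropbox"])
print(len(results), "scraped,", len(errors), "failed")  # prints: 3 scraped, 0 failed
```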
- requests: HTTP library for API calls and web scraping
- pandas: Data manipulation and CSV export
- BeautifulSoup4: HTML parsing and extraction
- python-dotenv: Environment variable management
- concurrent.futures: Parallel processing (Python standard library, no install required)
The scraper implements strict validation to ensure data quality:
- ✅ Names: 2-4 words, capitalized, no special characters
- ✅ URLs: Normalized, tracking parameters removed
- ✅ Deduplication: Removes duplicate founder entries
- ✅ Blacklist filtering: Excludes common false positives
```bash
# Install development dependencies
pip install -r requirements.txt

# Run the scraper
python yc_algolia_scraper.py

# Check code style
black yc_algolia_scraper.py
```