A robust Python-based web scraper that extracts comprehensive data from Y Combinator's startup directory, including company information and founder LinkedIn profiles.
- 🔍 Multi-Strategy Extraction: 4-tier fallback system for maximum data coverage
- ⚡ High Performance: Concurrent scraping with configurable worker threads
- 🎯 Data Validation: Strict validation framework ensures high-quality data
- 🛡️ Robust Error Handling: Graceful degradation with comprehensive logging
- 📊 Flexible Configuration: Environment-based configuration for easy customization
- 💾 CSV Export: Clean, structured output with all required fields
For each startup:
- Company Name
- Y Combinator Batch (e.g., W23, S22)
- Short Description (one-liner)
- Founder Name(s)
- Founder LinkedIn URL(s)
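One row of the resulting CSV can be pictured as below. The column names and the "; "-joined multi-founder convention are illustrative, not the scraper's exact output schema (the real scraper exports via pandas; the stdlib `csv` module is used here to keep the sketch minimal):

```python
import csv
import io

# Illustrative column names -- the actual scraper may label them differently.
FIELDS = ["company_name", "batch", "description", "founder_names", "founder_linkedin_urls"]

row = {
    "company_name": "Example AI",
    "batch": "W23",
    "description": "One-liner describing the startup.",
    "founder_names": "Jane Doe; John Smith",  # multiple founders joined with "; "
    "founder_linkedin_urls": "https://www.linkedin.com/in/janedoe; https://www.linkedin.com/in/johnsmith",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```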
- Python 3.8 or higher
- pip package manager
- Clone the repository

```bash
git clone https://github.com/ShariniN/yc-startup-scraper.git
cd yc-startup-scraper
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Configure environment variables

```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your Algolia credentials
# You can find these in Y Combinator's website network requests
```

Then run the scraper:

```bash
python yc_algolia_scraper.py
```

This will:
- Fetch 500 companies from Y Combinator's directory
- Extract founder information using multi-strategy approach
- Save results to `yc_startups_enhanced.csv`
- Display progress and statistics
Edit the `.env` file to customize scraper behavior:

```bash
# Number of companies to scrape
COMPANIES_LIMIT=500

# Number of concurrent worker threads
MAX_WORKERS=10

# Request timeout in seconds
REQUEST_TIMEOUT=15

# Delay between requests (seconds)
RATE_LIMIT_DELAY=0.3

# Output CSV filename
OUTPUT_FILE=yc_startups_enhanced.csv
```

The scraper employs a multi-tiered extraction strategy:
1. Algolia API Integration
   - Discovers Y Combinator's search infrastructure
   - Efficiently retrieves company metadata
   - Checks for pre-existing founder data in the API response
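The Algolia step can be sketched as follows. The application ID, search key, and index name are placeholders you would copy from the network requests mentioned in the setup section, and the payload shape is a minimal guess, not the scraper's exact request:

```python
def algolia_payload(limit: int, page: int = 0) -> dict:
    """Build a minimal query body for one page of company results."""
    return {"query": "", "hitsPerPage": limit, "page": page}

def fetch_companies(app_id: str, search_key: str, index: str, limit: int = 500) -> list:
    """POST a search to Algolia's REST query endpoint and return the hits."""
    import requests  # HTTP dependency kept local to this sketch

    url = f"https://{app_id}-dsn.algolia.net/1/indexes/{index}/query"
    headers = {
        "X-Algolia-Application-Id": app_id,
        "X-Algolia-API-Key": search_key,
    }
    resp = requests.post(url, json=algolia_payload(limit), headers=headers, timeout=15)
    resp.raise_for_status()
    return resp.json().get("hits", [])
```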
2. Four-Strategy Founder Extraction
   - Strategy 1: Parse Next.js embedded JSON (`__NEXT_DATA__`)
   - Strategy 2: Extract from legacy `data-page` attributes
   - Strategy 3: Semantic HTML parsing of founder sections
   - Strategy 4: Broad LinkedIn URL detection with strict validation
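Strategy 1 can be sketched roughly as below. The synthetic HTML and the `props`/`founders` JSON path are invented for the demonstration; the real pages embed a different structure:

```python
import json

from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict:
    """Pull the JSON blob that Next.js embeds in a __NEXT_DATA__ script tag."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or not tag.string:
        return {}
    return json.loads(tag.string)

# Minimal demonstration with a synthetic page (the real JSON shape differs):
html = ('<html><body><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"founders": [{"name": "Jane Doe"}]}}'
        '</script></body></html>')
data = extract_next_data(html)
print(data["props"]["founders"][0]["name"])  # prints: Jane Doe
```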
3. Data Validation Framework
   - Names must be 2-4 capitalized words
   - No numbers or special characters
   - Blacklist filtering removes false positives
   - URL normalization and deduplication
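The validation rules above can be sketched like this. The blacklist entries are illustrative, and the name pattern is deliberately strict (it would also reject names such as "McDonald"), trading recall for precision:

```python
import re
from urllib.parse import urlparse

# Illustrative blacklist -- the real scraper's list is larger.
BLACKLIST = {"privacy policy", "terms of service", "y combinator"}

def is_valid_name(name: str) -> bool:
    """Accept only 2-4 capitalized, letters-only words not on the blacklist."""
    words = name.split()
    if not 2 <= len(words) <= 4:
        return False
    if name.lower() in BLACKLIST:
        return False
    return all(re.fullmatch(r"[A-Z][a-z]+", w) for w in words)

def normalize_linkedin_url(url: str) -> str:
    """Drop query strings, fragments, and trailing slashes so duplicates collapse."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}".rstrip("/")
```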
4. Concurrent Processing
   - ThreadPoolExecutor with configurable workers
   - Rate limiting to respect server resources
   - Comprehensive error handling
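The concurrency layer can be sketched as below, with a stand-in for the real per-company scrape; `MAX_WORKERS` and `RATE_LIMIT_DELAY` mirror the `.env` settings, and the per-worker sleep is one simple way to honor the delay, not necessarily the scraper's exact scheme:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 10
RATE_LIMIT_DELAY = 0.3

def scrape_company(slug: str) -> dict:
    """Stand-in for the real per-company scrape."""
    time.sleep(RATE_LIMIT_DELAY)  # crude rate limiting inside each worker
    return {"slug": slug, "founders": []}

def scrape_all(slugs):
    """Fan out over a thread pool; collect failures instead of aborting."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(scrape_company, s): s for s in slugs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:  # one failed company should not stop the run
                errors.append((futures[fut], exc))
    return results, errors

results, errors = scrape_all(["airbnb", "stripe", "dropbox"])
print(len(results), "scraped,", len(errors), "failed")  # prints: 3 scraped, 0 failed
```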
- requests: HTTP library for API calls and web scraping
- pandas: Data manipulation and CSV export
- BeautifulSoup4: HTML parsing and extraction
- python-dotenv: Environment variable management
- concurrent.futures: Parallel processing (Python standard library, no install required)
The scraper implements strict validation to ensure data quality:
- ✅ Names: 2-4 words, capitalized, no special characters
- ✅ URLs: Normalized, tracking parameters removed
- ✅ Deduplication: Removes duplicate founder entries
- ✅ Blacklist filtering: Excludes common false positives
```bash
# Install development dependencies
pip install -r requirements.txt

# Run the scraper
python yc_algolia_scraper.py

# Check code style
black yc_algolia_scraper.py
```