
🚀 Y Combinator Startup Scraper

A robust Python-based web scraper that extracts comprehensive data from Y Combinator's startup directory, including company information and founder LinkedIn profiles.

✨ Features

  • 🔍 Multi-Strategy Extraction: 4-tier fallback system for maximum data coverage
  • ⚡ High Performance: Concurrent scraping with configurable worker threads
  • 🎯 Data Validation: Strict validation framework ensures high-quality data
  • 🛡️ Robust Error Handling: Graceful degradation with comprehensive logging
  • 📊 Flexible Configuration: Environment-based configuration for easy customization
  • 💾 CSV Export: Clean, structured output with all required fields

📋 Data Extracted

For each startup:

  • Company Name
  • Y Combinator Batch (e.g., W23, S22)
  • Short Description (one-liner)
  • Founder Name(s)
  • Founder LinkedIn URL(s)
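Each CSV row holds these fields. As an illustration (the column names and values below are hypothetical; the actual headers are defined by the scraper's output schema):

```csv
company_name,batch,short_description,founder_names,founder_linkedin_urls
Acme AI,W23,"AI copilots for accountants","Jane Doe; John Smith","https://linkedin.com/in/janedoe; https://linkedin.com/in/johnsmith"
```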

🔧 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone the repository
git clone https://github.com/ShariniN/yc-startup-scraper.git
cd yc-startup-scraper
  2. Install dependencies
pip install -r requirements.txt
  3. Configure environment variables
# Copy the example environment file
cp .env.example .env

# Edit .env and add your Algolia credentials
# You can find these by inspecting network requests on Y Combinator's directory pages

🚀 Usage

Basic Usage

python yc_algolia_scraper.py

This will:

  1. Fetch up to 500 companies from Y Combinator's directory (the default COMPANIES_LIMIT)
  2. Extract founder information using the multi-strategy approach
  3. Save results to yc_startups_enhanced.csv
  4. Display progress and statistics

⚙️ Configuration

Edit the .env file to customize scraper behavior:

# Number of companies to scrape
COMPANIES_LIMIT=500

# Number of concurrent worker threads
MAX_WORKERS=10

# Request timeout in seconds
REQUEST_TIMEOUT=15

# Delay between requests (seconds)
RATE_LIMIT_DELAY=0.3

# Output CSV filename
OUTPUT_FILE=yc_startups_enhanced.csv
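A minimal sketch of how these settings might be read at startup, assuming the variable names above and python-dotenv (if the package is absent, this sketch falls back to the shell environment):

```python
import os

try:
    from dotenv import load_dotenv
    load_dotenv()  # pull variables from .env into os.environ
except ImportError:
    pass  # python-dotenv not installed; rely on the shell environment

# Read each setting, falling back to the documented defaults
COMPANIES_LIMIT = int(os.getenv("COMPANIES_LIMIT", "500"))
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "10"))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "15"))
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "0.3"))
OUTPUT_FILE = os.getenv("OUTPUT_FILE", "yc_startups_enhanced.csv")
```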

🏗️ How It Works

Architecture Overview

The scraper employs a multi-tiered extraction strategy:

  1. Algolia API Integration

    • Discovers Y Combinator's search infrastructure
    • Efficiently retrieves company metadata
    • Checks for pre-existing founder data in API response
  2. Four-Strategy Founder Extraction

    • Strategy 1: Parse Next.js embedded JSON (__NEXT_DATA__)
    • Strategy 2: Extract from legacy data-page attributes
    • Strategy 3: Semantic HTML parsing of founder sections
    • Strategy 4: Broad LinkedIn URL detection with strict validation
  3. Data Validation Framework

    • Names must be 2-4 capitalized words
    • Blacklist filtering removes false positives
    • No numbers or special characters
    • URL normalization and deduplication
  4. Concurrent Processing

    • ThreadPoolExecutor with configurable workers
    • Rate limiting to respect server resources
    • Comprehensive error handling
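The fallback chain above can be sketched as follows. This is an illustrative reconstruction showing only strategies 1 and 4; the function names and the JSON path (`props.pageProps.founders`) are assumptions, not the scraper's actual code:

```python
import json
import re

def from_next_data(html):
    """Strategy 1: Next.js pages embed page data as JSON in __NEXT_DATA__."""
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if not m:
        return []
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return []
    # The path into the payload is illustrative; the real key layout may differ.
    return data.get("props", {}).get("pageProps", {}).get("founders", [])

def from_linkedin_urls(html):
    """Strategy 4: broad sweep for LinkedIn profile URLs as a last resort."""
    urls = re.findall(r'https://www\.linkedin\.com/in/[\w-]+', html)
    # dict.fromkeys deduplicates while preserving first-seen order
    return [{"name": None, "linkedin": u} for u in dict.fromkeys(urls)]

def extract_founders(html):
    """Run strategies in priority order; return the first non-empty result."""
    for strategy in (from_next_data, from_linkedin_urls):
        founders = strategy(html)
        if founders:
            return founders
    return []
```

Each strategy either returns usable founder records or an empty list, so adding or reordering tiers only means editing the tuple in `extract_founders`.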

🛠️ Technical Details

Technologies Used

  • requests: HTTP library for API calls and web scraping
  • pandas: Data manipulation and CSV export
  • BeautifulSoup4: HTML parsing and extraction
  • python-dotenv: Environment variable management
  • concurrent.futures: Parallel processing

Validation Rules

The scraper implements strict validation to ensure data quality:

  • ✅ Names: 2-4 words, capitalized, no special characters
  • ✅ URLs: Normalized, tracking parameters removed
  • ✅ Deduplication: Removes duplicate founder entries
  • ✅ Blacklist filtering: Excludes common false positives
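The rules above might look like the following sketch. The blacklist entries and helper names are hypothetical (the real blacklist is presumably longer):

```python
import re

# Hypothetical blacklist of common false positives
BLACKLIST = {"privacy policy", "terms of service", "y combinator"}

def is_valid_founder_name(name):
    """2-4 words, each capitalized, letters only, not blacklisted."""
    if name.strip().lower() in BLACKLIST:
        return False
    words = name.split()
    if not 2 <= len(words) <= 4:
        return False
    # Every word must start with a capital letter and contain letters only
    return all(re.fullmatch(r"[A-Z][A-Za-z]*", w) for w in words)

def normalize_linkedin_url(url):
    """Drop query strings (tracking parameters) and trailing slashes."""
    return url.split("?")[0].rstrip("/")
```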

Development Setup

# Install development dependencies
pip install -r requirements.txt

# Run the scraper
python yc_algolia_scraper.py

# Check code style
black yc_algolia_scraper.py
