This project analyzes Airbnb listings and reviews data across multiple US cities to answer questions such as which states have the most and fewest listings, which host has the most reviews, and how often listings and reviews mention cameras.
Dataset Source: Airbnb Dataset
The dataset contains:
- 34 cities/regions across the United States
- 68 CSV files total (34 listings + 34 reviews files)
- 22GB+ of raw data including 1.4M+ listings and 68M+ reviews
- Data covers major metropolitan areas like New York City, Los Angeles, San Francisco, Chicago, Boston, etc.
- Visit the dataset page: https://big-data-challenge.hdanny.org/dataset/
- Download all 68 CSV files to this project directory
- Ensure all files are in the same directory as the Python scripts
Note: The dataset is large (~22GB), and preprocessing creates an additional 24GB+ database file. Make sure you have sufficient disk space and a stable internet connection for the download.
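As a quick pre-download check, the free space in the project directory can be compared against a rough estimate. This is a minimal sketch: the 46 GiB figure is an assumption derived by adding the sizes quoted in this README (22GB+ of CSVs plus a 24GB+ database), and `has_enough_space` is a hypothetical helper, not part of the project scripts.

```python
import shutil

GIB = 1024 ** 3

# Assumption: rough total from this README (22GB+ raw CSVs + 24GB+ DuckDB file).
REQUIRED_BYTES = (22 + 24) * GIB


def has_enough_space(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True when the free space covers the estimated requirement."""
    return free_bytes >= required_bytes


# Report free space for the current directory (run from the project root).
free = shutil.disk_usage(".").free
status = "OK" if has_enough_space(free) else "probably not enough"
print(f"Free: {free / GIB:.1f} GiB, estimated need: {REQUIRED_BYTES / GIB:.0f} GiB ({status})")
```

Treat the estimate as a lower bound, since both quoted sizes are given as "or more".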
big_data_challenge/
├── README.md # This file
├── requirements.txt # Python dependencies
├── install_packages.py # Package installation script
├── preprocess.py # Optimized data preprocessing (recommended)
├── preprocess_fast.py # Alternative fast preprocessing
├── analysis.py # Original analysis script
├── count_rows.py # Count total rows
├── count_unique.py # Count unique listings/reviews/reviewers
├── state_analysis.py # Find states with most/least listings
├── top_host.py # Find host with most reviews
├── camera_listings.py # Count listings mentioning cameras
├── top_camera_states.py # Find state with highest camera review percentage
├── secret_cameras.py # Find state with highest secret camera listings
├── .gitignore # Git ignore rules
└── env/ # Virtual environment (created after setup)
# Create virtual environment
python3 -m venv env
# Activate virtual environment
source env/bin/activate # On macOS/Linux
# or
env\Scripts\activate # On Windows
# Install packages
pip install -r requirements.txt
Download all CSV files from https://big-data-challenge.hdanny.org/dataset/ into this directory.
Run the optimized preprocessing script to create the DuckDB database:
python3 preprocess.py
This will:
- Process 22GB+ of CSV data
- Create a 24GB+ DuckDB database (airbnb.db)
- Add proper indexes for fast queries
- Take approximately 4-5 minutes on most systems
Alternative: Use the fast preprocessing script:
python3 preprocess_fast.py
Run the scripts in this exact order:
python3 preprocess.py
Time: ~4-5 minutes
Output: Creates airbnb.db database file
python3 count_rows.py
Output: Total listings and reviews counts
python3 count_unique.py
Output: Unique listings, reviews, and reviewers counts
python3 state_analysis.py
Output: States with most and least listings
python3 top_host.py
Output: Host ID with most reviews
python3 camera_listings.py
Output: Count of listings mentioning cameras
python3 top_camera_states.py
Output: State with highest camera review percentage and count
python3 secret_cameras.py
Output: State with highest percentage of secret camera listings and count
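The run order above can also be scripted. The following is a minimal sketch, not part of the repository: `run_all` is a hypothetical helper that launches each script with the current Python interpreter and stops at the first failure, so later analyses never run against a missing or incomplete airbnb.db.

```python
import subprocess
import sys

# Order taken from the list above; preprocess.py must run first.
SCRIPTS = [
    "preprocess.py",
    "count_rows.py",
    "count_unique.py",
    "state_analysis.py",
    "top_host.py",
    "camera_listings.py",
    "top_camera_states.py",
    "secret_cameras.py",
]


def run_all(scripts=SCRIPTS, runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_all()` from the project directory, with the virtual environment active, runs the whole pipeline. The `runner` parameter exists only so the helper can be exercised without launching real processes.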
The preprocessing script includes several optimizations for remote environments:
- Parallel Processing: Uses 4 concurrent workers to process files simultaneously
- Chunked Reading: Processes CSV files in 50,000-row chunks to reduce memory usage
- Optimized Indexes: Creates efficient database indexes for fast queries
- Progress Tracking: Real-time progress bars for long-running operations
- Error Handling: Continues processing even if individual files fail
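The combination of chunked reading, concurrent workers, and per-file error handling listed above can be illustrated with a standard-library sketch. This is not the code in preprocess.py: it swaps in `csv` and a thread pool so that it runs without extra dependencies, and it only counts rows instead of loading them into DuckDB. The function names are hypothetical; the chunk size and worker count are the values quoted above.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

CHUNK_ROWS = 50_000  # chunk size quoted above
MAX_WORKERS = 4      # worker count quoted above


def iter_chunks(path, chunk_rows=CHUNK_ROWS):
    """Yield lists of at most chunk_rows dict rows from one CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            yield chunk


def count_rows(path, chunk_rows=CHUNK_ROWS):
    """Count data rows while holding only one chunk in memory at a time."""
    return sum(len(chunk) for chunk in iter_chunks(path, chunk_rows))


def count_all(paths, max_workers=MAX_WORKERS):
    """Process files concurrently; a failing file is reported and skipped."""
    def safe_count(path):
        try:
            return path, count_rows(path)
        except (OSError, csv.Error, UnicodeDecodeError) as exc:
            print(f"Skipping {path}: {exc}")
            return path, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe_count, paths))
```

Threads keep the sketch short, but CSV parsing in pure Python is CPU-bound, so a process pool or a native reader would likely be needed to see a real speedup.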
Each analysis script outputs results in a specific format:
- count_rows.py: Two numbers (listings_count, reviews_count)
- count_unique.py: Three numbers (unique_listings, unique_reviews, unique_reviewers)
- state_analysis.py: Two state codes (most_listings_state, least_listings_state)
- top_host.py: One number (host_id_with_most_reviews)
- camera_listings.py: One number (camera_listings_count)
- top_camera_states.py: State code and count (state_code, camera_reviews_count)
- secret_cameras.py: State code and count (state_code, secret_camera_count)
- "Conflicting lock" Error
  - Delete airbnb.db and rerun preprocessing
  - Ensure no other scripts are running simultaneously
- Memory Errors
  - The optimized script uses chunked processing to minimize memory usage
  - Ensure at least 8GB RAM available
- Slow Performance
  - Remote environments may be slower than local machines
  - The optimizations reduce processing time from ~10 minutes to ~4-5 minutes
- Missing CSV Files
  - Ensure all 68 CSV files are downloaded to the project directory
  - Check file names match exactly
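For the missing-files case, a short check can confirm how many CSV files are actually in the project directory. This is a sketch with hypothetical helper names. It only counts files, because the expected file names are not listed in this README.

```python
from pathlib import Path

EXPECTED_CSV_COUNT = 68  # 34 listings + 34 reviews files, per this README


def find_csv_files(directory="."):
    """Return the sorted names of CSV files found directly in directory."""
    return sorted(p.name for p in Path(directory).glob("*.csv") if p.is_file())


def check_csv_count(directory=".", expected=EXPECTED_CSV_COUNT):
    """Return (ok, found), where ok is True when the expected count is present."""
    found = find_csv_files(directory)
    return len(found) == expected, found


ok, found = check_csv_count(".")
print(f"Found {len(found)} of {EXPECTED_CSV_COUNT} expected CSV files")
```

If the count is short, printing `found` and comparing it against the dataset page shows which downloads are missing.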
| Environment | Original Script | Optimized Script | Improvement |
|---|---|---|---|
| Local Machine | ~4 minutes | ~4 minutes | Same (already optimized) |
| Remote Environment | ~10+ minutes | ~4-5 minutes | 2x faster |
- duckdb==1.4.3: High-performance analytical database
- pandas==2.3.3: Data manipulation and CSV processing
- tqdm==4.67.1: Progress bars for long operations
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is for educational purposes as part of the Big Data Challenge.