Airbnb Big Data Challenge Analysis

This project analyzes Airbnb listings and reviews data across multiple US cities to answer various analytical questions about the platform's data.

Dataset

Dataset Source: Airbnb Dataset

The dataset contains:

34 cities/regions across the United States
68 CSV files total (34 listings + 34 reviews files)
22GB+ of raw data including 1.4M+ listings and 68M+ reviews
Data covers major metropolitan areas like New York City, Los Angeles, San Francisco, Chicago, Boston, etc.

Download Instructions

Visit the dataset page
Download all 68 CSV files to this project directory
Ensure all files are in the same directory as the Python scripts

Note: The dataset is quite large (~22GB). Make sure you have sufficient disk space and a stable internet connection for the download.

Project Structure

big_data_challenge/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── install_packages.py          # Package installation script
├── preprocess.py                # Optimized data preprocessing (recommended)
├── preprocess_fast.py           # Alternative fast preprocessing
├── analysis.py                  # Original analysis script
├── count_rows.py               # Count total rows
├── count_unique.py             # Count unique listings/reviews/reviewers
├── state_analysis.py           # Find states with most/least listings
├── top_host.py                 # Find host with most reviews
├── camera_listings.py          # Count listings mentioning cameras
├── top_camera_states.py        # Find state with highest camera review percentage
├── secret_cameras.py           # Find state with highest secret camera listings
├── .gitignore                  # Git ignore rules
└── env/                        # Virtual environment (created after setup)

Setup Instructions

1. Install Dependencies

# Create virtual environment
python3 -m venv env

# Activate virtual environment
source env/bin/activate  # On macOS/Linux
# or
env\Scripts\activate     # On Windows

# Install packages
pip install -r requirements.txt

2. Download Dataset

Download all CSV files from https://big-data-challenge.hdanny.org/dataset/ into this directory.

3. Preprocess Data

Run the optimized preprocessing script to create the DuckDB database:

python3 preprocess.py

This will:

Process 22GB+ of CSV data
Create a 24GB+ DuckDB database (airbnb.db)
Add proper indexes for fast queries
Take approximately 4-5 minutes on most systems

Alternative: Use the fast preprocessing script:

python3 preprocess_fast.py

Execution Order

Run the scripts in this exact order:

1. Data Preprocessing (Required)

python3 preprocess.py

Time: ~4-5 minutes Output: Creates airbnb.db database file

2. Analysis Scripts (Run after preprocessing)

Count Rows

python3 count_rows.py

Output: Total listings and reviews counts

Count Unique Values

python3 count_unique.py

Output: Unique listings, reviews, and reviewers counts

State Analysis

python3 state_analysis.py

Output: States with most and least listings

Top Host

python3 top_host.py

Output: Host ID with most reviews

Camera Listings

python3 camera_listings.py

Output: Count of listings mentioning cameras

Top Camera States

python3 top_camera_states.py

Output: State with highest camera review percentage and count

Secret Cameras

python3 secret_cameras.py

Output: State with highest percentage of secret camera listings and count

Performance Optimizations

The preprocessing script includes several optimizations for remote environments:

Parallel Processing: Uses 4 concurrent workers to process files simultaneously
Chunked Reading: Processes CSV files in 50,000-row chunks to reduce memory usage
Optimized Indexes: Creates efficient database indexes for fast queries
Progress Tracking: Real-time progress bars for long-running operations
Error Handling: Continues processing even if individual files fail

Output Format

Each analysis script outputs results in a specific format:

count_rows.py: Two numbers (listings_count, reviews_count)
count_unique.py: Three numbers (unique_listings, unique_reviews, unique_reviewers)
state_analysis.py: Two state codes (most_listings_state, least_listings_state)
top_host.py: One number (host_id_with_most_reviews)
camera_listings.py: One number (camera_listings_count)
top_camera_states.py: State code and count (state_code, camera_reviews_count)
secret_cameras.py: State code and count (state_code, secret_camera_count)

Troubleshooting

Common Issues

"Conflicting lock" Error
- Delete airbnb.db and rerun preprocessing
- Ensure no other scripts are running simultaneously
Memory Errors
- The optimized script uses chunked processing to minimize memory usage
- Ensure at least 8GB RAM available
Slow Performance
- Remote environments may be slower than local machines
- The optimizations reduce processing time from ~10 minutes to ~4-5 minutes
Missing CSV Files
- Ensure all 68 CSV files are downloaded to the project directory
- Check file names match exactly

Performance Comparison

Environment	Original Script	Optimized Script	Improvement
Local Machine	~4 minutes	~4 minutes	Same (already optimized)
Remote Environment	~10+ minutes	~4-5 minutes	2x faster

Dependencies

duckdb==1.4.3: High-performance analytical database
pandas==2.3.3: Data manipulation and CSV processing
tqdm==4.67.1: Progress bars for long operations

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

License

This project is for educational purposes as part of the Big Data Challenge.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Airbnb Big Data Challenge Analysis

Dataset

Download Instructions

Project Structure

Setup Instructions

1. Install Dependencies

2. Download Dataset

3. Preprocess Data

Execution Order

1. Data Preprocessing (Required)

2. Analysis Scripts (Run after preprocessing)

Count Rows

Count Unique Values

State Analysis

Top Host

Camera Listings

Top Camera States

Secret Cameras

Performance Optimizations

Output Format

Troubleshooting

Common Issues

Performance Comparison

Dependencies

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
analysis.py		analysis.py
camera_listings.py		camera_listings.py
count_rows.py		count_rows.py
count_unique.py		count_unique.py
install_packages.py		install_packages.py
preprocess.py		preprocess.py
preprocess_fast.py		preprocess_fast.py
requirements.txt		requirements.txt
secret_cameras.py		secret_cameras.py
state_analysis.py		state_analysis.py
top_camera_states.py		top_camera_states.py
top_host.py		top_host.py

Folders and files

Latest commit

History

Repository files navigation

Airbnb Big Data Challenge Analysis

Dataset

Download Instructions

Project Structure

Setup Instructions

1. Install Dependencies

2. Download Dataset

3. Preprocess Data

Execution Order

1. Data Preprocessing (Required)

2. Analysis Scripts (Run after preprocessing)

Count Rows

Count Unique Values

State Analysis

Top Host

Camera Listings

Top Camera States

Secret Cameras

Performance Optimizations

Output Format

Troubleshooting

Common Issues

Performance Comparison

Dependencies

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages