Analyzing Netflix's global content library to optimize acquisitions, genre diversity, and viewer engagement using SQL, Python, and Tableau.
- Overview
- Business Problem
- Dataset
- Tools & Technologies
- Project Structure
- Data Cleaning & Preparation
- Exploratory Data Analysis (EDA)
- Research Questions & Key Findings
- Dashboard
- How to Run This Project
- Final Recommendations
- Author & Contact
This project evaluates Netflix's content portfolio across genres, content age, geography, and ratings to drive data-informed content strategies. The full pipeline uses SQL for ETL, Python for advanced analysis and statistical testing, and Tableau and Power BI for interactive dashboards.
In the competitive streaming market, Netflix must optimize its library for retention and growth. This project addresses:
- Genre imbalances and diversity gaps
- Geographic content distribution for global expansion
- Content age and relevance for viewer churn reduction
- High-rated vs. underperforming titles
- Portfolio concentration risks by type/region
- Netflix titles CSV (8K+ shows and movies) from public sources
- Columns: title, type, director, cast, country, date_added, release_year, rating, duration, genres, description
- Enriched with geographic and age metrics
- SQL (joins, Group by, Aggregations)
- Python (Pandas, Matplotlib, Seaborn, SciPy)
- Tableau and Power BI (DAX, Interactive Visuals)
- GitHub
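To illustrate the kind of SQL aggregation listed above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `titles`, the sample rows, and the reduced column set are assumptions made for this example; they are not the project's actual schema or queries.

```python
import sqlite3

# Hypothetical sample rows; the real project works from netflix_titles.csv.
rows = [
    ("Title A", "Movie", "United States"),
    ("Title B", "TV Show", "India"),
    ("Title C", "Movie", "India"),
    ("Title D", "Movie", "United States"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, type TEXT, country TEXT)")
conn.executemany("INSERT INTO titles VALUES (?, ?, ?)", rows)

# Count titles per country and type, largest groups first.
query = """
    SELECT country, type, COUNT(*) AS n_titles
    FROM titles
    GROUP BY country, type
    ORDER BY n_titles DESC, country, type
"""
counts = conn.execute(query).fetchall()
conn.close()
print(counts)
```

The same `GROUP BY` pattern should carry over to whichever SQL engine the project uses, with the real table and column names substituted.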
netflix-content-analysis/
│
├── README.md
├── .gitignore
├── requirements.txt
├── Netflix Portfolio Report.pdf
│
├── notebooks/ # Jupyter notebooks
│ ├── exploratory_data_analysis.ipynb
│ └── netflix_portfolio_analysis.ipynb
│
├── scripts/ # Python scripts for ETL
│ ├── data_ingestion.py
│ └── content_summary.py
│
├── dashboard/ # Tableau dashboards
│ └── netflix_portfolio.twbx
│
└── data/ # Raw & processed CSVs
    ├── netflix_titles.csv
    └── content_summary.csv
- Handled missing values: ~5% in country/director (imputed/mode-filled)
- Parsed multi-genres into lists; exploded for analysis
- Calculated content age: (2025 - release_year)
- Filtered invalid ratings/durations; standardized countries (ISO codes)
- Created summary tables: genre counts, geo aggregates
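The cleaning steps above can be sketched in pandas roughly as follows. This is an illustrative outline rather than the project's actual script: the sample rows are invented, and the reference year 2025 is taken from the content-age formula in the list.

```python
import pandas as pd

# Hypothetical sample rows standing in for netflix_titles.csv.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", None, "India", "United States"],
    "release_year": [2019, 2005, 2021, 1998],
    "genres": ["Dramas, Comedies", "Thrillers", "Dramas", "Documentaries, Dramas"],
})

# Fill missing countries with the most frequent value (mode).
df["country"] = df["country"].fillna(df["country"].mode()[0])

# Content age relative to the fixed reference year.
REFERENCE_YEAR = 2025
df["content_age"] = REFERENCE_YEAR - df["release_year"]

# Split the comma-separated genre string into a list, then one row per genre.
df["genres"] = df["genres"].str.split(",")
exploded = df.explode("genres", ignore_index=True)
exploded["genres"] = exploded["genres"].str.strip()

# Summary table: number of distinct titles per genre.
genre_counts = (
    exploded.groupby("genres")["title"].nunique().sort_values(ascending=False)
)
print(genre_counts)
```

Exploding the genre lists means a multi-genre title is counted once under each of its genres, so genre totals can exceed the number of titles.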
Key Distributions:
- Content Age: 70% released post-2010; oldest ~1920s classics
- Ratings: Numeric scores skew high (mean 7.0/10); TV-MA is the dominant maturity rating
- Geography: US (45%), India (20%), UK (10%)
Outliers:
- Longest: 5+ hour docs/epics
- Genre Overlaps: Dramas appear in 60% of titles
Correlations:
- Release Year & Ratings (0.12 weak positive)
- Duration & Ratings (-0.08 for movies)
- Genres & Countries (strong regional ties, e.g., Bollywood in India)
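Pairwise correlations like the ones above can be computed with `DataFrame.corr`. The sketch below assumes Pearson correlation and a numeric rating column named `score`; both are assumptions, since the README does not state the method or the column name, and the sample values are invented.

```python
import pandas as pd

# Hypothetical sample; "score" is an assumed numeric rating column (0-10).
movies = pd.DataFrame({
    "release_year": [1995, 2005, 2012, 2018, 2021],
    "duration_min": [140, 125, 110, 95, 100],
    "score": [7.1, 6.8, 7.4, 6.9, 7.6],
})

# Pearson correlation matrix across the numeric columns.
corr = movies.corr(method="pearson")

# Pull out the two pairs discussed above.
year_vs_score = corr.loc["release_year", "score"]
duration_vs_score = corr.loc["duration_min", "score"]
print(round(year_vs_score, 2), round(duration_vs_score, 2))
```

Coefficients near zero, such as the 0.12 and -0.08 reported above, indicate that neither release year nor duration explains much of the variation in ratings.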
- Geographic Coverage: US/India dominate (65%); only 20% from Africa/LatAm → expansion opportunities
- Portfolio Size by Type: Movies (70%), TV Shows (30%) → diversify originals
- Content Age Breakdown: 40% <5 years old; 25% >20 years → refresh legacy library
- Top Genres: Dramas (25%), Comedies (15%), Thrillers (12%) → over-reliance risk
- Genre Diversity: Treemap shows thrillers/dramas 50% share; niches like Anime <2%
- Ratings Distribution: 35% rated 8+; highest in International (mean 7.4)
- Hypothesis Testing: Significant difference in ratings by region (ANOVA p<0.01) → localize content
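The regional hypothesis test can be sketched as a one-way ANOVA with `scipy.stats.f_oneway`. The region names and rating values below are invented for illustration and are not the project's data; the 0.01 significance level mirrors the p<0.01 result reported above.

```python
from scipy import stats

# Hypothetical rating samples per region; not the project's data.
ratings_by_region = {
    "North America": [6.8, 7.0, 7.1, 6.9, 7.2],
    "Asia": [7.3, 7.5, 7.4, 7.6, 7.2],
    "Europe": [6.5, 6.7, 6.6, 6.9, 6.4],
}

# One-way ANOVA: the null hypothesis is that all regions share the same mean rating.
f_stat, p_value = stats.f_oneway(*ratings_by_region.values())

ALPHA = 0.01  # significance level matching the p<0.01 result reported above
significant = p_value < ALPHA
print(f"F={f_stat:.2f}, p={p_value:.4g}, significant={significant}")
```

A significant result only says that at least one region's mean differs; a post-hoc comparison would be needed to identify which regions drive the difference.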
Interactive Tableau dashboard highlights:
- Global geo map (content by country)
- Portfolio size/content age bars
- Genre diversity treemap/pie
- Top performers & ratings histograms
- Clone the repository:
git clone https://github.com/yourusername/netflix-content-analysis.git
- Install dependencies:
pip install -r requirements.txt
- Ingest & process data:
python scripts/data_ingestion.py
python scripts/content_summary.py
- Run notebooks:
notebooks/exploratory_data_analysis.ipynb
notebooks/netflix_portfolio_analysis.ipynb
- Open dashboard:
dashboard/netflix_portfolio.twbx
- Acquire Regional Content: Prioritize Africa/LatAm (only 20% of the library), targeting a 15% retention boost
- Balance Genres: Invest in niches (Anime, Docs) to hit 30% diversity
- Content Refresh: Phase out >20yr old low-raters; focus on 2015+ originals
- Geo-Targeting: Tailor promos to the India/US strongholds; test thrillers in regions where the genre already performs well
- Originals Strategy: High-rated internationals → scale for global hits
- Monitor Ratings: Threshold 7.5+ for acquisitions
PRANAY JHA
Data Analyst
📧 Email: Pranayjha535@gmail.com
🔗 LinkedIn
