This repository contains a comprehensive Exploratory Data Analysis (EDA) of the Netflix Movies and TV Shows dataset using PostgreSQL and Python.
The project is structured around 15 business-driven analytical questions, demonstrating a wide range of SQL techniques — from basic filtering and aggregation to advanced analytics such as window functions and full-text search.
Additionally, it includes a Python notebook for further data transformation, visualization, and integration with the IMDb dataset, allowing for a deeper exploration of patterns and rating correlations between the two platforms.
The main goal of this project is to analyze Netflix’s content library to extract meaningful insights.
The analysis is performed in two complementary parts:
-
SQL Analysis (PostgreSQL):
A single SQL script defines the database schema, loads the dataset, and answers 15 analytical questions. -
Python Analysis (Jupyter Notebook):
A notebook that performs data cleaning, visualization, and integration with IMDb data to analyze trends in movie ratings and categories.
This project showcases practical applications of the following SQL techniques:
- Schema Definition: Creating relational tables with appropriate data types (
CREATE TABLE). - Data Transformation & Cleaning: Applying
UNNEST,string_to_array,split_part, and type casting (::INT,to_date). - Advanced Aggregation: Using
GROUP BY GROUPING SETSto compute totals and subtotals. - Window Functions: Employing
RANK()to find the top-N items per category. - Common Table Expressions (CTEs): Organizing complex logic with
WITHclauses for readability. - Date/Time Functions: Filtering and aggregating data by time intervals (
CURRENT_DATE,INTERVAL). - Full-Text Search: Implementing search capabilities using
to_tsvectorandplainto_tsquery. - Pattern Matching: Using
LIKEandILIKEfor flexible string filtering.
- Pandas & NumPy: Cleaning, transforming, and restructuring raw data for consistency.
- Exploratory Data Analysis (EDA): Generating descriptive statistics and identifying key trends.
- Data Merging: Integrating Netflix data with IMDb datasets to uncover rating and genre correlations.
- Visualization: Using Matplotlib and Seaborn to visualize rating distributions and patterns.
| Dataset | Source | Description |
|---|---|---|
| Netflix Movies and TV Shows | Kaggle | Main dataset containing Netflix titles and metadata. |
| IMDb Datasets | IMDb Data Interface | External data source providing movie ratings and title information. |
Before running the analysis, make sure you have the following installed:
- PostgreSQL (installed and running)
- Python 3.x
- Jupyter Notebook
- Required Python libraries:
pip install pandas numpy matplotlib seaborn psycopg2-binary
First, run the CREATE TABLE statement from the top of the provided SQL script to create the netflix table structure in your database.
DROP TABLE IF EXISTS netflix;
CREATE TABLE netflix
(
show_id VARCHAR(5),
type VARCHAR(10),
title VARCHAR(250),
director VARCHAR(550),
casts VARCHAR(1050),
country VARCHAR(550),
date_added VARCHAR(55),
release_year INT,
rating VARCHAR(15),
duration VARCHAR(15),
listed_in VARCHAR(250),
description VARCHAR(550)
);