Skip to content

AmandaCLima/Netflix_SQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Netflix Data Analysis with PostgreSQL and Python

Netflix Analysis Banner

This repository contains a comprehensive Exploratory Data Analysis (EDA) of the Netflix Movies and TV Shows dataset using PostgreSQL and Python.
The project is structured around 15 business-driven analytical questions, demonstrating a wide range of SQL techniques — from basic filtering and aggregation to advanced analytics such as window functions and full-text search.

Additionally, it includes a Python notebook for further data transformation, visualization, and integration with the IMDb dataset, allowing for a deeper exploration of patterns and rating correlations between the two platforms.


📊 Project Overview

The main goal of this project is to analyze Netflix’s content library to extract meaningful insights.
The analysis is performed in two complementary parts:

  1. SQL Analysis (PostgreSQL):
    A single SQL script defines the database schema, loads the dataset, and answers 15 analytical questions.

  2. Python Analysis (Jupyter Notebook):
    A notebook that performs data cleaning, visualization, and integration with IMDb data to analyze trends in movie ratings and categories.


✨ Key SQL Concepts Demonstrated

This project showcases practical applications of the following SQL techniques:

  • Schema Definition: Creating relational tables with appropriate data types (CREATE TABLE).
  • Data Transformation & Cleaning: Applying UNNEST, string_to_array, split_part, and type casting (::INT, to_date).
  • Advanced Aggregation: Using GROUP BY GROUPING SETS to compute totals and subtotals.
  • Window Functions: Employing RANK() to find the top-N items per category.
  • Common Table Expressions (CTEs): Organizing complex logic with WITH clauses for readability.
  • Date/Time Functions: Filtering and aggregating data by time intervals (CURRENT_DATE, INTERVAL).
  • Full-Text Search: Implementing search capabilities using to_tsvector and plainto_tsquery.
  • Pattern Matching: Using LIKE and ILIKE for flexible string filtering.

🐍 Key Python Concepts Demonstrated

  • Pandas & NumPy: Cleaning, transforming, and restructuring raw data for consistency.
  • Exploratory Data Analysis (EDA): Generating descriptive statistics and identifying key trends.
  • Data Merging: Integrating Netflix data with IMDb datasets to uncover rating and genre correlations.
  • Visualization: Using Matplotlib and Seaborn to visualize rating distributions and patterns.

💾 Datasets

Dataset Source Description
Netflix Movies and TV Shows Kaggle Main dataset containing Netflix titles and metadata.
IMDb Datasets IMDb Data Interface External data source providing movie ratings and title information.

⚙️ Prerequisites

Before running the analysis, make sure you have the following installed:

  • PostgreSQL (installed and running)
  • Python 3.x
  • Jupyter Notebook
  • Required Python libraries:
    pip install pandas numpy matplotlib seaborn psycopg2-binary
    
    

1. Table Creation

First, run the CREATE TABLE statement from the top of the provided SQL script to create the netflix table structure in your database.

DROP TABLE IF EXISTS netflix;
CREATE TABLE netflix
(
    show_id      VARCHAR(5),
    type         VARCHAR(10),
    title        VARCHAR(250),
    director     VARCHAR(550),
    casts        VARCHAR(1050),
    country      VARCHAR(550),
    date_added   VARCHAR(55),
    release_year INT,
    rating       VARCHAR(15),
    duration     VARCHAR(15),
    listed_in    VARCHAR(250),
    description  VARCHAR(550)
);

About

An exploratory data analysis of the Netflix Movies & TV Shows dataset using PostgreSQL. This repository contains SQL queries to analyze content trends and cast information.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors