This section walks through the steps to identify outliers in numerical data using statistical methods and visualizations such as box plots. It also covers two ways to handle outliers, removing them or transforming the data, using the Titanic dataset.
- Load the dataset:

      import pandas as pd

      df = pd.read_csv('train.csv')
- Import Matplotlib and Seaborn:

      import matplotlib.pyplot as plt
      import seaborn as sns
- Create box plots for numerical columns:

      numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

      plt.figure(figsize=(15, 10))
      for i, col in enumerate(numerical_columns, 1):
          plt.subplot(3, 3, i)
          sns.boxplot(y=col, data=df)
          plt.title(f'Box Plot of {col}')
      plt.tight_layout()
      plt.show()
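The box-plot step hard-codes a 3x3 grid, which only has room for nine numeric columns. As a minimal sketch, assuming you want the grid to grow with the data, the helper below (`grid_shape` is a hypothetical name, not part of Matplotlib or Seaborn) computes how many rows are needed for a given number of plots.

```python
import math


def grid_shape(n_plots, n_cols=3):
    """Return (rows, cols) for a subplot grid that fits n_plots panels.

    Hypothetical helper for illustration; n_cols is an assumed default.
    """
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


print(grid_shape(7))   # (3, 3)
print(grid_shape(10))  # (4, 3)
```

The result can be passed straight to the plotting loop, for example `n_rows, n_cols = grid_shape(len(numerical_columns))` followed by `plt.subplot(n_rows, n_cols, i)`.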
- Calculate the IQR and identify outliers:

      def find_outliers(df, col):
          Q1 = df[col].quantile(0.25)
          Q3 = df[col].quantile(0.75)
          IQR = Q3 - Q1
          lower_bound = Q1 - 1.5 * IQR
          upper_bound = Q3 + 1.5 * IQR
          outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
          return outliers

      for col in numerical_columns:
          outliers = find_outliers(df, col)
          print(f'Outliers in {col}:')
          print(outliers)
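To make the IQR rule concrete, here is a small worked example on made-up data rather than the Titanic file. The `toy` frame and the `iqr_bounds` helper are assumptions for illustration; the arithmetic is the same 1.5 × IQR rule used in `find_outliers`.

```python
import pandas as pd

# Hypothetical toy data: eight ordinary fares and one extreme value
toy = pd.DataFrame({"Fare": [7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0]})


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(toy["Fare"])
# Q1 = 8, Q3 = 26, IQR = 18, so the fences are 8 - 27 and 26 + 27
print(lower, upper)  # -19.0 53.0

# Rows outside the fences are flagged as outliers
flagged = toy[(toy["Fare"] < lower) | (toy["Fare"] > upper)]
print(flagged)
```

Only the 512.0 row falls outside the fences, so it is the single value flagged here.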
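- Option 1: Remove outliers:

      def remove_outliers(df, col):
          Q1 = df[col].quantile(0.25)
          Q3 = df[col].quantile(0.75)
          IQR = Q3 - Q1
          lower_bound = Q1 - 1.5 * IQR
          upper_bound = Q3 + 1.5 * IQR
          # Note: rows where df[col] is NaN fail both comparisons,
          # so they are dropped along with the outliers.
          df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
          return df

      # Each pass filters the result of the previous one, so rows
      # flagged in any numerical column are removed.
      for col in numerical_columns:
          df = remove_outliers(df, col)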
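Because comparisons against NaN are always false, the removal step also drops rows with missing values in the filtered column. If that is not what you want, one possible variation is sketched below on made-up data. `remove_outliers_keep_nan` and the `toy` frame are hypothetical, and the explicit column list is an assumption about which columns are worth filtering.

```python
import numpy as np
import pandas as pd


def remove_outliers_keep_nan(df, col, k=1.5):
    """Drop rows outside the k * IQR fences of col, but keep rows where col is NaN."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    in_range = df[col].between(q1 - k * iqr, q3 + k * iqr)  # NaN evaluates to False
    return df[in_range | df[col].isna()]


# Hypothetical toy data with one missing age and one extreme fare
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 28.0, 30.0, 24.0, 27.0],
    "Fare": [7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 512.0],
})

cleaned = toy
for col in ["Age", "Fare"]:  # filter only the columns chosen explicitly
    cleaned = remove_outliers_keep_nan(cleaned, col)

print(len(toy), "->", len(cleaned))
print(cleaned)
```

Here the row with the 512.0 fare is removed while the row with the missing age survives. Listing the columns by hand also avoids filtering identifier or label columns, where an IQR rule has little meaning.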
- Option 2: Transform data (e.g., log transformation):

      import numpy as np

      df_transformed = df.copy()
      for col in numerical_columns:
          # log1p(x) = log(1 + x): handles zeros, but needs x > -1
          df_transformed[col] = np.log1p(df_transformed[col])

      # Plot transformed data
      plt.figure(figsize=(15, 10))
      for i, col in enumerate(numerical_columns, 1):
          plt.subplot(3, 3, i)
          sns.boxplot(y=col, data=df_transformed)
          plt.title(f'Box Plot of Transformed {col}')
      plt.tight_layout()
      plt.show()
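To see what the log transformation does to the IQR rule, the sketch below counts flagged values before and after `np.log1p` on a small made-up series. The data and the `count_iqr_outliers` helper are assumptions for illustration, not results from the Titanic dataset.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(s, k=1.5):
    """Count values outside the k * IQR fences of a Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())


# Hypothetical right-skewed fares with one large value
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 80.0], name="Fare")
logged = np.log1p(fares)

print(count_iqr_outliers(fares))   # 1 (the 80.0 fare lies above the upper fence of 53)
print(count_iqr_outliers(logged))  # 0 (compression pulls it inside the fences)

# The transformation is reversible with expm1
restored = np.expm1(logged)
print(np.allclose(restored, fares))  # True
```

The transformation compresses large values more than small ones, so a moderately extreme point can fall back inside the fences. A far more extreme value may still be flagged after the transform, so it is worth re-checking the box plots rather than assuming the outliers are gone.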
Please open an issue for any improvements, suggestions, or errors in the content.
You can also contact me on LinkedIn with any other queries or feedback.