Exercise 7: Identifying Outliers

This exercise walks through identifying outliers in numerical data using statistical methods and visualizations such as box plots. It also covers two ways to handle outliers, removing them or transforming the data, using the Titanic dataset.

Step 1: Load the Titanic Dataset

Load the dataset:

import pandas as pd
df = pd.read_csv('train.csv')
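
Before looking for outliers, it can help to confirm which columns are numeric and which contain missing values, since both affect the later steps. The following is a minimal sketch, assuming a hypothetical helper `summarize_numeric` and a small invented frame standing in for `train.csv` (the column names are borrowed from the Titanic data, the values are made up):

```python
import numpy as np
import pandas as pd


def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per numeric column.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "dtype": numeric.dtypes.astype(str),
        "missing": numeric.isna().sum(),
        "unique": numeric.nunique(),  # NaN is not counted as a distinct value
    })


# Toy stand-in for the real dataset (values are invented).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 512.33],
    "Survived": [0, 1, 1, 0],
})

summary = summarize_numeric(toy)
print(summary)
```

Columns with very few distinct values (such as `Survived`) are codes rather than measurements, so an outlier rule may not be meaningful for them.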

Step 2: Identify Outliers Using Box Plots

Import Matplotlib and Seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

Create box plots for numerical columns:

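# Select the numeric columns. For the Titanic data this includes identifier and
# code columns (PassengerId, Survived, Pclass) alongside Age, SibSp, Parch, and Fare.
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Draw one box plot per column on a 3x3 grid (room for up to 9 columns)
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(y=col, data=df)
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()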
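
The points a box plot draws beyond its whiskers can also be read off numerically. The following is a minimal sketch, assuming Matplotlib's `matplotlib.cbook.boxplot_stats` helper with a whisker factor of 1.5 and an invented toy series (not Titanic data):

```python
import numpy as np
import pandas as pd
from matplotlib.cbook import boxplot_stats

# Invented values with one extreme point and one missing entry.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan], name="toy")

# boxplot_stats does not skip NaN, so drop missing values first.
stats = boxplot_stats(values.dropna().to_numpy(), whis=1.5)[0]

print("Q1:", stats["q1"], "Q3:", stats["q3"])
print("Whiskers:", stats["whislo"], "to", stats["whishi"])
print("Points drawn as outliers:", stats["fliers"])
```

The whiskers end at the most extreme data points that still lie within 1.5 times the IQR of the box, so the `fliers` entry holds the points that would be drawn individually.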

Step 3: Identify Outliers Using the IQR Method

Calculate the interquartile range (IQR = Q3 - Q1) and flag values that lie more than 1.5 × IQR below Q1 or above Q3:

def find_outliers(df, col):
    # Quartiles of the column (quantile skips missing values)
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Fences: 1.5 * IQR below Q1 and above Q3
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows whose value falls outside the fences
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    return outliers

for col in numerical_columns:
    outliers = find_outliers(df, col)
    print(f'Outliers in {col}: {len(outliers)} rows')
    print(outliers)
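
To make the arithmetic concrete, here is a small worked sketch on invented values (not Titanic data). The helper name `iqr_bounds` and its `k` parameter are assumptions for illustration:

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Nine invented values: Q1 = 14, Q3 = 22, so IQR = 8.
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200])

lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print("Fences:", lower, upper)    # 14 - 12 = 2.0 and 22 + 12 = 34.0
print("Flagged:", flagged.tolist())
```

Only the value 200 falls outside the fences here. The factor 1.5 is a convention; a larger `k` flags fewer points.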

Step 4: Handle Outliers

Option 1: Remove Outliers

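def remove_outliers(df, col):
    # Compute the IQR fences for this column
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Keep in-range rows. Also keep rows where the value is missing: NaN fails
    # both comparisons, so those rows would otherwise be dropped silently.
    in_range = df[col].between(lower_bound, upper_bound)
    return df[in_range | df[col].isna()]

# This filters on every numerical column in turn, including identifier and code
# columns such as PassengerId, Survived, and Pclass. Restrict the list
# (for example to ['Age', 'Fare']) if that is not what you want.
for col in numerical_columns:
    df = remove_outliers(df, col)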
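
Because removal is applied column by column, it is worth checking how many rows each column would discard before committing to it. The following is a sketch under stated assumptions: `rows_flagged` is a hypothetical helper, and the frame uses invented values with Titanic-style column names:

```python
import numpy as np
import pandas as pd


def rows_flagged(df: pd.DataFrame, col: str, k: float = 1.5) -> int:
    """Count rows whose value in `col` lies outside the IQR fences.

    Missing values are not counted, because NaN fails both comparisons.
    """
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return int(outside.sum())


toy = pd.DataFrame({
    "Fare": [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 300.0],
    # Mostly zeros: Q1 = Q3 = 0, so the IQR is 0 and any non-zero value is flagged.
    "Parch": [0, 0, 0, 0, 0, 0, 0, 1],
    "Age": [20.0, 22.0, np.nan, 25.0, 27.0, 30.0, 31.0, 33.0],
})

report = {col: rows_flagged(toy, col) for col in toy.columns}
print(report)
```

A column dominated by a single value has an IQR of 0, so the rule flags every other value in it. Such columns are usually better left out of the removal loop.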

Option 2: Transform Data (e.g., Log Transformation)

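import numpy as np

# Work on a copy so the original values stay available.
# If Option 1 was run first, df has already had rows removed; reload the data
# to apply the transformation to the full dataset instead.
df_transformed = df.copy()
for col in numerical_columns:
    # log1p computes log(1 + x), so zeros are fine; values must be greater than -1
    df_transformed[col] = np.log1p(df_transformed[col])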

# Plot transformed data
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(y=col, data=df_transformed)
    plt.title(f'Box Plot of Transformed {col}')
plt.tight_layout()
plt.show()
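
To see what the transformation does to the IQR rule, the sketch below counts flagged points before and after `np.log1p` on an invented, right-skewed series (not Titanic data). The helper `count_flagged` is hypothetical:

```python
import numpy as np
import pandas as pd


def count_flagged(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((series < q1 - k * iqr) | (series > q3 + k * iqr)).sum())


# Invented values that double at each step, giving a long right tail.
fares = pd.Series([5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0], name="Fare")

logged = np.log1p(fares)       # log(1 + x) compresses large values
restored = np.expm1(logged)    # inverse transform: exp(y) - 1

before = count_flagged(fares)
after = count_flagged(logged)

print("Flagged before:", before)
print("Flagged after: ", after)
print("Round trip recovers the original:", np.allclose(restored, fares))
```

The transformation keeps every row and pulls the long right tail in, so fewer points are flagged. `np.expm1` maps results back to the original scale when they need to be reported in original units.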


Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
