Exercise 7: Identifying Outliers

This exercise walks through identifying outliers in numerical data with statistical methods (the IQR rule) and visualizations such as box plots. It then covers two ways to handle them, removing the affected rows or transforming the data, using the Titanic dataset.

Step 1: Load the Titanic Dataset

  1. Load the dataset:

    import pandas as pd

    # Assumes train.csv (the Titanic training file) is in the working directory
    df = pd.read_csv('train.csv')
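
Before looking for outliers, it can help to check which columns are numeric and how many values are missing, since missing values affect the later steps. The following is a minimal sketch, not part of the original exercise: `describe_numeric` is a hypothetical helper, and the small `sample` frame uses made-up values so the snippet runs without `train.csv`.

```python
import pandas as pd

def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each numeric column: dtype, missing count, min and max."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "dtype": numeric.dtypes.astype(str),
        "missing": numeric.isna().sum(),
        "min": numeric.min(),
        "max": numeric.max(),
    })

# Tiny stand-in frame with made-up values (not real Titanic rows)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Name": ["A", "B", "C", "D"],
})

summary = describe_numeric(sample)
print(summary)
```

With the real data, the same call would be `describe_numeric(df)`.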

Step 2: Identify Outliers Using Box Plots

  1. Import Matplotlib and Seaborn:

    import matplotlib.pyplot as plt
    import seaborn as sns
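  2. Create box plots for numerical columns:

    # Select every float64/int64 column; on the Titanic data this also picks up
    # identifier and code columns such as PassengerId, Survived and Pclass
    numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
    
    # The 3x3 grid has room for up to nine columns
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(numerical_columns, 1):
        plt.subplot(3, 3, i)
        sns.boxplot(y=col, data=df)
        plt.title(f'Box Plot of {col}')
    plt.tight_layout()
    plt.show()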
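
To read the box plots, it helps to know which numbers they draw. The sketch below computes them by hand for one column, assuming the default whisker length of 1.5 times the IQR. `boxplot_summary` is a hypothetical helper and the `fares` values are made up for illustration.

```python
import pandas as pd

def boxplot_summary(values: pd.Series) -> dict:
    """Return the quantities a default box plot displays for one column."""
    data = values.dropna()
    q1, median, q3 = data.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that are still inside the fences
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        # Points beyond the whiskers are drawn individually
        "n_fliers": int(len(data) - len(inside)),
    }

# Made-up fares with one extreme value
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])
stats = boxplot_summary(fares)
print(stats)
```

Here the box spans 9.5 to 27.0, the upper whisker stops at 30.0, and 512.0 is the single point plotted beyond it.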

Step 3: Identify Outliers Using the IQR Method

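  1. Calculate the IQR and identify outliers:

    def find_outliers(df, col):
        """Return the rows whose value in `col` lies outside the IQR fences."""
        Q1 = df[col].quantile(0.25)  # first quartile (25th percentile)
        Q3 = df[col].quantile(0.75)  # third quartile (75th percentile)
        IQR = Q3 - Q1                # interquartile range
        # Values more than 1.5 * IQR beyond the quartiles count as outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        return outliers
    
    # Print the outlying rows for every numerical column
    for col in numerical_columns:
        outliers = find_outliers(df, col)
        print(f'Outliers in {col}:')
        print(outliers)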
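
Printing every outlying row can be hard to scan, so a compact per-column report is sometimes easier to read. This is a sketch under the same 1.5 * IQR rule: `iqr_bounds` and `outlier_report` are hypothetical helpers, and `sample` holds made-up values.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences for the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_report(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Count the values outside the IQR fences for each column."""
    rows = []
    for col in columns:
        lower, upper = iqr_bounds(df[col])
        mask = (df[col] < lower) | (df[col] > upper)
        rows.append({
            "column": col,
            "lower": lower,
            "upper": upper,
            "n_outliers": int(mask.sum()),
        })
    return pd.DataFrame(rows).set_index("column")

# Made-up values: a skewed column and a column that is mostly zero
sample = pd.DataFrame({
    "Fare": [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0],
    "Parch": [0, 0, 0, 0, 0, 0, 0, 2],
})

report = outlier_report(sample, ["Fare", "Parch"])
print(report)
```

The `Parch` row shows a caveat of the rule. When most values are identical the IQR is zero, both fences collapse onto that value, and every other value is flagged.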

Step 4: Handle Outliers

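  1. Option 1: Remove Outliers

    def remove_outliers(df, col):
        """Keep only the rows whose value in `col` lies within the IQR fences."""
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Comparisons with NaN are False, so rows with a missing value in `col` are dropped too
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
        return df
    
    # Columns are filtered one after another, so each pass computes its quartiles
    # on the rows that survived the previous passes
    for col in numerical_columns:
        df = remove_outliers(df, col)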
  2. Option 2: Transform Data (e.g., Log Transformation)

    import numpy as np
    
    # If Option 1 was run first, df no longer contains the removed rows;
    # reload train.csv to transform the full data
    df_transformed = df.copy()
    # log1p(x) = log(1 + x): defined at zero, but every value must be greater than -1
    for col in numerical_columns:
        df_transformed[col] = np.log1p(df_transformed[col])
    
    # Plot transformed data
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(numerical_columns, 1):
        plt.subplot(3, 3, i)
        sns.boxplot(y=col, data=df_transformed)
        plt.title(f'Box Plot of Transformed {col}')
    plt.tight_layout()
    plt.show()
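
The two options can be compared side by side on a small frame. The sketch below is one possible variant, not the exercise's own method. For removal, it computes all fences on the original data and keeps rows with missing values, whereas the per-column loop in Option 1 recomputes quartiles after each pass and drops rows where the column is missing. For the transform, it applies `np.log1p` to a single skewed column only. `remove_outliers_once` is a hypothetical helper and the `sample` values are made up.

```python
import numpy as np
import pandas as pd

def remove_outliers_once(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Drop rows outside the IQR fences of any listed column.

    All fences come from the original data, and rows with a missing
    value in a column are kept rather than silently dropped.
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        in_range = df[col].between(q1 - k * iqr, q3 + k * iqr)
        keep &= in_range | df[col].isna()
    return df[keep]

# Made-up values standing in for two Titanic columns
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, 28.0, 30.0, 80.0, 24.0],
    "Fare": [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0],
})

# Option 1: removal drops the Age 80 row and the Fare 512 row
cleaned = remove_outliers_once(sample, ["Age", "Fare"])
print(len(sample), "->", len(cleaned), "rows")

# Option 2: the log transform keeps every row but compresses the long right tail
logged_fare = np.log1p(sample["Fare"])
print("skew before:", sample["Fare"].skew(), "after:", logged_fare.skew())

# np.expm1 inverts log1p, so the original scale can be recovered
restored = np.expm1(logged_fare)
```

Removal discards information, while the transform keeps all rows and only reduces the influence of extreme values. A transformed value can still fall outside the IQR fences.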

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
