aayush-0131/Heart-Disease-Prediction-Project

Predictive Analysis of Cardiovascular Disease Risk Factors

Aayush Jha | Indian Statistical Institute (ISI)


Problem Statement

This project uses machine learning to predict the presence of heart disease in patients from a set of medical and demographic features. The goal is to build a reliable classification model and identify the most significant risk factors, in support of preventative healthcare strategies.


Dataset

The analysis was performed on the well-known "Heart Disease UCI" dataset, sourced from Kaggle.


Methodology & Key Steps

  1. Data Cleaning & EDA: Investigated feature distributions, handled missing values represented by '?', and visualized relationships with the target variable to uncover initial insights.
  2. Data Preprocessing: Performed one-hot encoding on categorical variables and scaled all numerical features using StandardScaler to prepare the data for modeling.
  3. Model Implementation: Developed and compared three distinct classification models:
    • Logistic Regression (Baseline)
    • Random Forest Classifier
    • XGBoost Classifier
  4. Evaluation: Assessed model performance on a held-out test set using Accuracy, Precision, Recall, and F1-Score, then selected the best-performing model.
  5. Insight Generation: Extracted feature importances from the top-performing model to identify the key predictors of heart disease.
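
The steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the data is synthetic, the column names (`age`, `thalach`, `cp`, `ca`) and the mode-imputation strategy are assumptions, and XGBoost is left out to keep the sketch dependency-light (an `xgboost.XGBClassifier` would slot into the `models` dictionary the same way, assuming the package is installed).

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real dataset); column names are assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(30, 78, n),
    "thalach": rng.integers(90, 200, n),
    "cp": rng.integers(0, 4, n),
    "ca": rng.choice(["0", "1", "2", "?"], n, p=[0.6, 0.2, 0.15, 0.05]),
})
# Toy target with some built-in signal so the models have something to learn.
df["target"] = ((df["thalach"] > 150) | (df["cp"] == 2)).astype(int)

# Step 1: treat '?' as missing, convert to numeric, impute with the mode
# (mode imputation is one possible strategy, chosen here for illustration).
df["ca"] = pd.to_numeric(df["ca"].replace("?", np.nan))
df["ca"] = df["ca"].fillna(df["ca"].mode()[0])

# Step 2: scale numeric columns, one-hot encode categorical ones.
numeric, categorical = ["age", "thalach", "ca"], ["cp"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["target"],
    test_size=0.2, random_state=42, stratify=df["target"],
)

# Steps 3-4: fit each model in its own pipeline and score it on the test split.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", clone(preprocess)), ("clf", model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", pos_label=1, zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": prec, "recall": rec, "f1": f1,
    }
    print(name, results[name])
```

Fitting the scaler and encoder inside the pipeline means they are learned from the training split only, which avoids leaking test-set statistics into preprocessing.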

Key Findings & Visualizations

Finding 1: Chest pain type is a strong indicator of heart disease. Patients with 'non-anginal chest pain' (cp = 2) were markedly more likely to have heart disease than patients with other chest pain types.

[Figure: Chest Pain vs. Heart Disease]
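
A comparison like this one can be reproduced with a group-wise rate. The sketch below uses a handful of invented rows, not the real dataset, and assumes the columns are named `cp` and `target` with `target = 1` meaning heart disease is present.

```python
import pandas as pd

# Invented toy rows for illustration only (NOT the real dataset).
df = pd.DataFrame({
    "cp":     [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    "target": [0, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

# Share of patients with heart disease within each chest pain type.
rate_by_cp = df.groupby("cp")["target"].mean()
print(rate_by_cp)

# Raw counts per (chest pain type, target) pair, useful for a grouped bar chart.
counts = pd.crosstab(df["cp"], df["target"])
print(counts)
```

Reading the rates alongside the counts matters, since a high rate in a chest pain category with very few patients is weaker evidence than the same rate in a large one.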

Finding 2: Several medical metrics correlate with the presence of heart disease. In the correlation heatmap, thalach (maximum heart rate achieved) and slope (the slope of the peak exercise ST segment) both showed a noticeable correlation with the target variable.

[Figure: Correlation Heatmap]
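
The numbers behind such a heatmap come from a pairwise correlation matrix. The following is a small sketch on synthetic data, where the column names and the strength of each relationship are assumptions made up for the example, showing how features can be ranked by their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; relationships are invented.
rng = np.random.default_rng(1)
n = 200
target = rng.integers(0, 2, n)
df = pd.DataFrame({
    "thalach": 140 + 15 * target + rng.normal(0, 10, n),  # tied to target
    "slope": target + rng.integers(0, 2, n),               # tied to target
    "chol": rng.normal(240, 40, n),                        # unrelated noise
    "target": target,
})

# Pearson correlation between every pair of columns (what a heatmap displays).
corr = df.corr()

# Rank features by absolute correlation with the target.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```

Correlation only captures linear association, so a feature with a low coefficient can still be useful to a tree-based model.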


Model Performance

Random Forest achieved the best overall performance on this prediction task, with the highest accuracy (84%), recall (83%), and F1-score (86%). XGBoost was marginally ahead on precision (89% vs. 88%).

| Model | Accuracy | Precision (class 1) | Recall (class 1) | F1-Score (class 1) |
|---|---|---|---|---|
| Logistic Regression | 78% | 86% | 76% | 81% |
| Random Forest | 84% | 88% | 83% | 86% |
| XGBoost Classifier | 83% | 89% | 82% | 85% |
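
As a quick sanity check on the table, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this for the reported class-1 figures; because the table values are rounded to whole percentages, agreement to within about one percentage point is all that can be expected.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for class 1, copied from the table.
reported = {
    "Logistic Regression": (0.86, 0.76, 0.81),
    "Random Forest": (0.88, 0.83, 0.86),
    "XGBoost Classifier": (0.89, 0.82, 0.85),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{name}: reported F1 {f1:.2f}, recomputed {recomputed:.3f}")
```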

Feature Importance from Random Forest Model:

The model identified which medical factors were most influential in its predictions. This provides a clear, data-driven focus for clinical screening.

[Figure: Feature Importance Chart]

The top three predictors identified were [Feature 1], [Feature 2], and [Feature 3] (placeholders, to be replaced with the top three features reported by the Random Forest model).
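
A ranking like this can be read from a fitted scikit-learn Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical feature names (`signal_a`, `signal_b`, `noise`), so its output says nothing about the real dataset; it only shows the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features with hypothetical names (NOT the real dataset).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Toy target driven mostly by signal_a, partly by signal_b, not by noise.
y = (X["signal_a"] + 0.5 * X["signal_b"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; pair them with column names and sort.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
top3 = importances.head(3)
print(top3)
```

Impurity-based importances can favor high-cardinality or continuous features, so `sklearn.inspection.permutation_importance` on the test set is a reasonable cross-check.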


How to Run

  1. Clone the repository:

    git clone https://github.com/aayush-0131/Heart-Disease-Prediction-Project.git
  2. Install the required dependencies:

    pip install -r requirements.txt
  3. Open and run the Heart_Disease_Analysis.ipynb Jupyter Notebook.
