Aayush Jha | Indian Statistical Institute (ISI)
This project aims to leverage machine learning to predict the presence of heart disease in patients based on a set of medical and demographic features. The goal is to build a reliable classification model and identify the most significant risk factors to aid in preventative healthcare strategies.
The analysis was performed on the well-known "Heart Disease UCI" dataset, sourced from Kaggle.
- Link: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
- Attributes: The dataset includes 14 attributes such as age, sex, chest pain type, resting blood pressure, cholesterol, and more.
- Target Variable: `num` (1 = Heart Disease, 0 = No Heart Disease)
- Data Cleaning & EDA: Investigated feature distributions, handled missing values represented by '?', and visualized relationships with the target variable to uncover initial insights.
- Data Preprocessing: Performed one-hot encoding on categorical variables and scaled all numerical features using `StandardScaler` to prepare the data for modeling.
- Model Implementation: Developed and compared three distinct classification models:
  - Logistic Regression (Baseline)
  - Random Forest Classifier
  - XGBoost Classifier
- Evaluation: Assessed model performance on a held-out test set using Accuracy, Precision, Recall, and F1-Score, then selected the best-performing model.
- Insight Generation: Extracted feature importances from the top-performing model to identify the key predictors of heart disease.
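
The preprocessing and model-comparison steps above could be sketched roughly as follows. This is a minimal illustration on a small synthetic frame, not the notebook's actual code: the column subset, the median imputation strategy, the 75/25 stratified split, and all hyperparameters are assumptions. XGBoost is left out so that the example depends only on pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption: illustrative columns only).
rng = np.random.default_rng(0)
n = 200
chol = rng.integers(150, 350, n).astype(object)
chol[::25] = "?"  # mimic missing values encoded as '?'
df = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "chol": chol,
    "cp": rng.choice(["typical", "atypical", "non-anginal", "asymptomatic"], n),
    "num": rng.integers(0, 2, n),
})

numeric_cols, categorical_cols = ["age", "chol"], ["cp"]
# Coerce '?' (and any other non-numeric token) to NaN.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

X, y = df[numeric_cols + categorical_cols], df["num"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Impute and scale numeric features; one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(name, results[name])
```

Wrapping the preprocessing in a `Pipeline` means the imputer and scaler are fitted on the training split only, which avoids leaking test-set statistics into the model. Because the labels here are random, the printed scores are not meaningful.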
Finding 1: Chest pain type is a strong indicator of heart disease. The analysis showed that patients with 'non-anginal chest pain' (cp = 2) have a significantly higher likelihood of having heart disease compared to other chest pain types.
Finding 2: Several medical metrics are correlated with the target. The correlation heatmap revealed relationships between the features and the outcome. For instance, thalach (maximum heart rate achieved) and slope (the slope of the peak exercise ST segment) each showed a noticeable correlation with the presence of heart disease.
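
A check of the kind behind Finding 2 can be sketched as a ranking of features by their Pearson correlation with the target. The data below is synthetic and deliberately constructed so that `thalach` is related to `num`. The column names follow the dataset's conventions, but the values and the strength of the relationship are assumptions for illustration, not results from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
target = rng.integers(0, 2, n)

df = pd.DataFrame({
    "thalach": 150 - 15 * target + rng.normal(0, 10, n),  # built-in link to target
    "chol": rng.normal(240, 40, n),                        # independent of target
    "num": target,
})

# Pairwise Pearson correlations (the matrix a heatmap would display).
corr = df.corr()

# Rank features by absolute correlation with the target.
target_corr = corr["num"].drop("num").sort_values(key=abs, ascending=False)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations at the top alongside strongly positive ones.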
Random Forest demonstrated the best overall performance for this prediction task, with the highest accuracy (84%), recall (83%), and F1-score (86%). XGBoost achieved marginally higher precision (89% vs. 88%).
| Model | Accuracy | Precision (class 1) | Recall (class 1) | F1-Score (class 1) |
|---|---|---|---|---|
| Logistic Regression | 78% | 86% | 76% | 81% |
| Random Forest | 84% | 88% | 83% | 86% |
| XGBoost Classifier | 83% | 89% | 82% | 85% |
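
For reference, the class-1 metrics in the table are the standard confusion-matrix ratios. The counts in the sketch below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive class from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive
    and one actual positive).
    """
    precision = tp / (tp + fp)  # of predicted positives, how many are correct
    recall = tp / (tp + fn)     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 100-patient test set.
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Recall is often the metric to watch in a screening setting, since a false negative corresponds to a missed case of heart disease.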
Feature Importance from Random Forest Model:
The model identified which medical factors were most influential in its predictions. This provides a clear, data-driven focus for clinical screening.
The top 3 predictors identified were [Feature 1], [Feature 2], and [Feature 3].
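
Importances of this kind are typically read from a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below uses synthetic data with made-up feature names, in which only `feature_a` drives the label. It shows the ranking mechanics only and says nothing about which features ranked highest in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 400

# Hypothetical features: only feature_a carries signal about the label.
X = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "feature_c": rng.normal(size=n),
})
y = (X["feature_a"] + 0.3 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False)
print(top.head(3))
```

Impurity-based importances can favor high-cardinality or continuous features, so a cross-check such as permutation importance on the test set may be worth considering before drawing clinical conclusions.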
- Clone the repository:
  `git clone https://github.com/aayush-0131/Heart-Disease-Prediction-Project.git`
- Install the required dependencies:
  `pip install -r requirements.txt`
- Open and run the `Heart_Disease_Analysis.ipynb` Jupyter Notebook.


