This project predicts health insurance charges from personal factors such as age, BMI, smoking habits, number of children, and region, using machine learning algorithms. The primary goal is to build a predictive model that accurately estimates the insurance premiums individuals can expect based on their health and demographic information. The project also includes a user-friendly interface built with Streamlit, where users enter their details and receive an instant charge prediction.
The dataset used for this project is sourced from Kaggle. It includes the following features:
- Age: The age of the individual.
- Sex: The gender of the insurance contractor (female or male).
- BMI: Body Mass Index, which is a measure of body fat based on height and weight.
- Children: The number of children covered by the insurance policy.
- Smoker: Whether the individual is a smoker or not.
- Region: The individual's residential area in the US (northeast, southeast, southwest, or northwest).
- Charges: The medical costs billed by the health insurance provider.
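As a rough illustration of the feature list above, one row of the dataset could be modeled as a small typed record with basic validation. This is only a sketch: the `InsuranceRecord` class is a hypothetical helper that is not part of the repository, the `"yes"`/`"no"` smoker labels are an assumption, and the sample values are made up.

```python
from dataclasses import dataclass

# Allowed category labels. The sex and region values come from the feature
# list above; the "yes"/"no" smoker labels are an assumption of this sketch.
SEXES = {"female", "male"}
SMOKER_FLAGS = {"yes", "no"}
REGIONS = {"northeast", "southeast", "southwest", "northwest"}


@dataclass(frozen=True)
class InsuranceRecord:
    """One row of the dataset (hypothetical helper, not part of the repo)."""

    age: int
    sex: str
    bmi: float
    children: int
    smoker: str
    region: str
    charges: float

    def __post_init__(self):
        # Reject values that cannot occur in a valid row.
        if self.age < 0 or self.children < 0:
            raise ValueError("age and children must be non-negative")
        if self.bmi <= 0:
            raise ValueError("bmi must be positive")
        if self.sex not in SEXES:
            raise ValueError(f"unknown sex: {self.sex!r}")
        if self.smoker not in SMOKER_FLAGS:
            raise ValueError(f"unknown smoker flag: {self.smoker!r}")
        if self.region not in REGIONS:
            raise ValueError(f"unknown region: {self.region!r}")


# Made-up example values, for illustration only.
record = InsuranceRecord(
    age=30, sex="female", bmi=24.5, children=1,
    smoker="no", region="southwest", charges=4500.0,
)
print(record)
```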
- Data Analysis: Python (pandas, NumPy)
- Machine Learning: scikit-learn (Linear Regression, Support Vector Regression, Decision Tree, Random Forest, K-Nearest Neighbors, Gradient Boosting), XGBoost
- Visualization: Matplotlib, Seaborn
- Model Deployment: Streamlit
- Version Control: Git, GitHub
Prerequisites
Ensure you have Python installed on your machine. You will also need to install the required libraries (run this from the repository root, after cloning):
# Install dependencies
pip install -r requirements.txt
Running the Project
# Clone the repository and move into it
git clone https://github.com/puni-ram48/Health-Insurance-Charges-Prediction.git
cd Health-Insurance-Charges-Prediction
# Run the Streamlit Application
streamlit run app.py
Interact with the Application:
- Open the provided local URL in your web browser.
- Enter the required details: age, sex, BMI, number of children, smoking status, and region.
- The application will predict the insurance charges based on the inputs provided.
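To make the step from form inputs to a prediction more concrete, the sketch below shows one way the entered details could be turned into a numeric feature list using one-hot encoding. It is a library-free illustration, not the code in `app.py`: the `encode_inputs` helper, the `"yes"`/`"no"` smoker labels, and the column order are all assumptions.

```python
# Category order is an assumption made for this sketch; a trained model
# expects whatever order was used when it was fitted.
REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def encode_inputs(age, sex, bmi, children, smoker, region):
    """Turn form inputs into a flat numeric feature list (hypothetical helper)."""
    if sex not in ("female", "male"):
        raise ValueError(f"unknown sex: {sex!r}")
    if smoker not in ("yes", "no"):
        raise ValueError(f"unknown smoker flag: {smoker!r}")
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")

    # Numeric inputs pass through unchanged.
    features = [float(age), float(bmi), float(children)]
    # Binary categories become a single 0/1 flag.
    features.append(1.0 if sex == "male" else 0.0)
    features.append(1.0 if smoker == "yes" else 0.0)
    # Region becomes one 0/1 column per category (one-hot encoding).
    features.extend(1.0 if region == name else 0.0 for name in REGIONS)
    return features


row = encode_inputs(age=30, sex="female", bmi=24.5, children=1,
                    smoker="no", region="southwest")
print(row)  # [30.0, 24.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

A list like this would still need the same scaling that was applied during training before it is passed to a fitted model.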
Data: Contains the dataset for the project.
Project Analysis Report: Final report containing the data analysis, visualizations, and model development.
SVM_Model: Saved machine learning model for deployment.
app.py: Streamlit application script.
requirements.txt: List of required Python libraries.
- Data Preprocessing:
  - Handling missing values (if any).
  - Encoding categorical variables (e.g., smoker, region).
  - Scaling numerical features for better model performance.
- Model Training:
  - Several models, including Linear Regression, Decision Trees, and Random Forest, were trained.
  - Hyperparameter tuning was performed to improve model accuracy.
- Model Evaluation:
  - The models were evaluated using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.
  - The best-performing model was selected for deployment.
- Best Model:
  - The SVM Regression model achieved an R-squared value of 0.8895 and was selected for deployment as the best-performing model.
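For readers unfamiliar with the evaluation metrics listed above, the following sketch computes MAE, MSE, RMSE, and R-squared from their standard definitions in plain Python. The `regression_metrics` function and the sample numbers are illustrative assumptions, not the project's evaluation code or results; scikit-learn's `sklearn.metrics` module offers equivalents such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared (illustrative helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R-squared compares the model's squared error with that of a
    # baseline that always predicts the mean of the true values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared is undefined when every true value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up charges and predictions, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

An R-squared close to 1 means the model explains most of the variance in the charges, while MAE and RMSE report the typical error in the same units as the charges themselves.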
We welcome contributions to this project! If you would like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/YourFeature).
- Make your changes and commit them (git commit -am 'Add some feature').
- Push to the branch (git push origin feature/YourFeature).
- Create a new Pull Request.
Please ensure your code is well-documented.
This project was initiated and completed by Puneetha Dharmapura Shrirama. Special thanks to Jeevitha DS for the guidance and support.
This project is licensed under the MIT License - see the LICENSE file for details.