This repository contains the code, data, and report for a project that predicts the quality of white wine from its physicochemical properties. We also explore whether models trained on white wine data can predict the quality of red wine.

## Contents
- Dataset Description
- Exploratory Data Analysis
- Preparing the Dataset
- Model Fitting and Evaluation
- High-Dimensional Models
- Our Own Gradient Descent Implementation
- Out-Of-Distribution Generalisation
- Acknowledgements
## Dataset Description

We used a dataset containing physicochemical and sensory data for both red and white wines. Each wine is described by 11 attributes, such as acidity, residual sugar, and alcohol content. The goal is to predict the wine quality score, which ranges from 3 to 9.
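For reference, a minimal loading sketch. The paths are illustrative, assuming the standard semicolon-delimited UCI wine-quality CSVs live under a `data/` directory:

```r
library(readr)

# Assumed paths -- adjust to match this repository's layout.
# The UCI wine-quality CSVs are semicolon-delimited.
white <- read_delim("data/winequality-white.csv", delim = ";")
red   <- read_delim("data/winequality-red.csv", delim = ";")

dim(white)            # 11 physicochemical predictors plus quality
table(white$quality)  # quality scores range from 3 to 9
```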
## Exploratory Data Analysis

To get a sense of the data, we plotted the attributes against wine quality. Some notable patterns (a plotting sketch follows the list):
- Fixed acidity: No clear trend with wine quality.
- Free sulfur dioxide: Lower levels might correlate with higher wine quality.
- Alcohol: Higher alcohol content seems to correlate with better wine quality.
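As an example of these plots, a minimal sketch of the alcohol-versus-quality comparison, using the `white` data frame from the loading sketch above:

```r
library(ggplot2)

# Boxplots of alcohol content per quality score: the median alcohol
# level rises with the quality score, matching the trend noted above.
ggplot(white, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot() +
  labs(x = "Quality score", y = "Alcohol (% vol)")
```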
## Preparing the Dataset

We split the white wine data into training (70%) and testing (30%) sets. We normalized the numeric predictors so that all features are on a comparable scale, and set up 5-fold cross-validation for hyperparameter tuning.
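A minimal sketch of this setup with tidymodels (the seed and object names are illustrative):

```r
library(tidymodels)
set.seed(2023)  # assumed seed, for reproducibility

# 70/30 train/test split of the white wine data
white_split <- initial_split(white, prop = 0.7)
white_train <- training(white_split)
white_test  <- testing(white_split)

# Normalize the numeric predictors; quality is the outcome
wine_rec <- recipe(quality ~ ., data = white_train) |>
  step_normalize(all_numeric_predictors())

# 5-fold cross-validation resamples for hyperparameter tuning
wine_folds <- vfold_cv(white_train, v = 5)
```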
## Model Fitting and Evaluation

- Simple Linear Regression: Used three predictors (alcohol, pH, and residual sugar); RMSE of 0.774 on the test set.
- Multiple Linear Regression: Used all predictors, improving the test RMSE to 0.742.
- Random Forest: Our star performer, with a test RMSE of 0.60 (see the fitting sketch after this list).
- Boosted Trees: The runner-up, with an RMSE of 0.645.
- Single Decision Tree: Simpler but less effective, with an RMSE of 0.74.
- Cubist Model: Rule-based trees with linear models in the leaves; RMSE of 0.66.
- Elastic Net: A compromise between lasso and ridge penalties; RMSE of 0.74.
- Support Vector Machine (SVM): RMSE of 0.70.
- Neural Network (MLP): A single-hidden-layer model; RMSE of 0.737.
- Lasso Regression: Performed on par with the Elastic Net; RMSE of 0.74.
- Generalized Additive Model (GAM): Captured non-linear relationships; RMSE of 0.729.
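All of these models follow the same fit-and-evaluate pattern. Here is a minimal sketch for the random forest, reusing the `wine_rec` recipe and split from the setup sketch above (hyperparameters are left at their defaults rather than the tuned values):

```r
# Random forest via the ranger engine, wrapped with the shared recipe
rf_spec <- rand_forest(mode = "regression") |>
  set_engine("ranger")

rf_fit <- workflow() |>
  add_recipe(wine_rec) |>
  add_model(rf_spec) |>
  fit(data = white_train)

# Test-set RMSE
predict(rf_fit, new_data = white_test) |>
  bind_cols(white_test) |>
  rmse(truth = quality, estimate = .pred)
```

Swapping in any of the other models only changes the parsnip specification; the recipe, resamples, and evaluation code stay the same.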
## High-Dimensional Models

We didn't stop at the standard models. We expanded the feature space with polynomial transformations and interaction terms, used PCA to reduce the dimensionality of the result, and fitted Lasso and SVM models on this high-dimensional dataset (see the recipe sketch after the results below).
- SVM without PCA: Best in class among high-dimensional models.
- Lasso with PCA: Outperformed Lasso without PCA.
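A sketch of the feature expansion as a recipes pipeline; the polynomial degree and number of components here are illustrative, not the tuned values:

```r
# Polynomial terms plus pairwise interactions blow up the feature count
highdim_rec <- recipe(quality ~ ., data = white_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_numeric_predictors(), degree = 2) |>
  step_interact(terms = ~ all_numeric_predictors():all_numeric_predictors())

# Variant with PCA to shrink the expanded space back down
highdim_pca_rec <- highdim_rec |>
  step_pca(all_numeric_predictors(), num_comp = 20)
```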
## Our Own Gradient Descent Implementation

Why settle for built-in algorithms when you can write your own? We implemented a custom stochastic gradient descent algorithm with momentum, early stopping, and convergence plotting. Building it ourselves gave us a much deeper understanding of the optimization process and of the model's convergence behavior.
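A minimal sketch of the core idea, pared down to least-squares linear regression with momentum and early stopping on the training loss; the names and defaults are illustrative, and the plotting and validation-set details of the full implementation are omitted:

```r
# Stochastic gradient descent for linear regression -- a sketch,
# not the exact implementation from the report.
sgd_lm <- function(X, y, lr = 0.01, momentum = 0.9,
                   epochs = 100, patience = 5) {
  X <- cbind(1, as.matrix(X))   # prepend an intercept column
  w <- rep(0, ncol(X))          # weights
  v <- rep(0, ncol(X))          # momentum buffer
  best <- Inf
  wait <- 0
  for (epoch in seq_len(epochs)) {
    for (i in sample(nrow(X))) {                  # shuffle each epoch
      grad <- 2 * (sum(X[i, ] * w) - y[i]) * X[i, ]
      v <- momentum * v - lr * grad               # momentum update
      w <- w + v
    }
    loss <- mean((X %*% w - y)^2)                 # epoch-end MSE
    if (loss < best) {
      best <- loss
      wait <- 0
    } else if ((wait <- wait + 1) >= patience) {  # early stopping
      break
    }
  }
  w
}
```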
## Out-Of-Distribution Generalisation

To test how well our models generalise, we applied the white-wine models to the red wine dataset. We observed two kinds of shift (an evaluation sketch follows the list):

- Covariate Shift: The distributions of several predictors differ markedly between red and white wines.
- Concept Shift: Simpler models such as linear regression generalised better than complex models such as Random Forest and Cubist.
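The check itself is short: compare predictor distributions across the two datasets, then score a white-trained model on the red wines and compare against its white test RMSE. Here is a sketch using `red` and the `rf_fit` workflow from the earlier sketches:

```r
# Covariate shift: compare a predictor's distribution across datasets
summary(white$alcohol)
summary(red$alcohol)

# Concept shift: RMSE of the white-trained random forest on red wines
predict(rf_fit, new_data = red) |>
  bind_cols(red) |>
  rmse(truth = quality, estimate = .pred)
```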
## Acknowledgements

We extend our gratitude to the developers of the R packages that made this project possible, including `tidymodels`, `Cubist`, `xgboost`, `ranger`, and `keras`. Special thanks to our team for their hard work and collaboration.
For a deeper dive, check out the full report included in the repository. Cheers to better wine predictions! 🍷