A versatile machine learning template that streamlines data preprocessing, exploratory data analysis, and modeling for both regression and classification tasks. It integrates hyperparameter tuning, model evaluation, and model selection, providing a flexible and efficient framework for machine learning workflows.
This repository provides a comprehensive machine learning template in a Jupyter Notebook file to streamline the key stages of the machine learning workflow for tabular data:
- Data Preprocessing:
  - Load, clean, transform, and save data using `pandas` and `sklearn`.
  - Handle duplicates, incorrect data types, missing values, and outliers.
  - Extract features, scale numerical features, and encode categorical features.
  - Split data into training, validation, and test sets.
- Exploratory Data Analysis (EDA):
  - Analyze descriptive statistics using `pandas` and `numpy`.
  - Visualize distributions, correlations, and relationships using `seaborn` and `matplotlib`.
- Modeling:
  - Train baseline models and perform hyperparameter tuning for regression and classification tasks with `sklearn` and `xgboost`.
  - Evaluate regression models (RMSE, MAPE, R-squared) and classification models (accuracy, precision, recall, F1-score).
  - Visualize feature importance, show model prediction examples, and save the final model with `pickle`.
This template provides a flexible, customizable foundation for various datasets and use cases, making it an ideal starting point for efficient and reproducible machine learning projects. It is specifically tailored to structured tabular data (e.g., .csv, .xls, or SQL tables) using Pandas and Scikit-learn. It is not optimized for text, image, or time series data, which require specialized preprocessing, models, and tools (e.g., TensorFlow, PyTorch).
Use `pandas`, `sklearn`, `sqlalchemy`, and `mysql-connector-python` for data loading, cleaning, transformation, and saving.

- Load data:
  - From a .csv file using `pandas` `read_csv`.
  - From a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `read_sql`.
- Remove duplicates:
  - Drop duplicate rows (e.g., based on the ID column) using `pandas` `drop_duplicates`.
- Handle incorrect data types:
  - Convert string columns to numerical types (`pandas` `astype`) and datetime types (`pandas` `to_datetime`).
- Extract features:
  - Create categorical features from string columns using custom functions with `pandas` `apply`.
  - Create numerical features from string columns using custom functions with `pandas` `apply` and `re` for numeric pattern matching.
  - Create boolean features from string columns using `lambda` functions with `pandas` `apply`.
- Handle missing values:
  - Delete rows with missing values using `pandas` `dropna`.
  - Impute missing values: fill in the median for numerical columns or the mode for categorical columns using `pandas` `fillna`.
- Handle outliers:
  - Remove univariate outliers using statistical methods (e.g., 3 standard deviations or 1.5 IQR) with a custom transformer class that inherits from `sklearn` `BaseEstimator` and `TransformerMixin` (see the sketch after this list).
- Save the preprocessed data:
  - As a .csv file using `pandas` `to_csv`.
  - In a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `to_sql`.
- Train-validation-test split:
  - Split data into training (e.g., 70%), validation (15%), and test (15%) sets using `sklearn` `train_test_split`.
- Polynomial features:
  - Create polynomial features using `sklearn` `PolynomialFeatures`.
- Feature scaling and encoding (a combined sketch follows the list):
  - Scale numerical features using standard scaling with `sklearn` `StandardScaler` or min-max normalization with `MinMaxScaler`.
  - Encode categorical features:
    - Nominal features: use one-hot encoding with `sklearn` `OneHotEncoder`.
    - Ordinal features: use ordinal encoding with `sklearn` `OrdinalEncoder`.
  - Apply scaling and encoding together using `sklearn` `ColumnTransformer`.
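As a sketch of the custom transformer approach mentioned in the outlier step, the class below removes rows outside 1.5 IQR. The class name `OutlierRemover` and its `factor` parameter are illustrative assumptions rather than the notebook's exact implementation, and it assumes a numeric-only DataFrame.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierRemover(BaseEstimator, TransformerMixin):
    """Drop rows whose values fall outside factor * IQR of the fitted data."""

    def __init__(self, factor=1.5):
        self.factor = factor  # 1.5 is the usual IQR rule; increase for a looser filter

    def fit(self, X, y=None):
        # Learn per-column bounds on the (training) data passed to fit.
        q1, q3 = X.quantile(0.25), X.quantile(0.75)
        iqr = q3 - q1
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        # Keep only rows where every column lies within the learned bounds.
        mask = ((X >= self.lower_) & (X <= self.upper_)).all(axis=1)
        return X[mask]


# Example usage on a numeric-only DataFrame (df and num_cols are placeholders):
# df_clean = OutlierRemover(factor=1.5).fit_transform(df[num_cols])
```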
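The split, scaling, and encoding steps could be wired together roughly as follows. The file path, the `target` column, and the column lists are placeholders to adapt to your dataset, not names used by the template.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

df = pd.read_csv("data.csv")                      # placeholder file
X, y = df.drop(columns=["target"]), df["target"]  # "target" is a placeholder

# 70% train, 15% validation, 15% test via two consecutive splits.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

num_cols = ["num_feature_1", "num_feature_2"]  # placeholder numerical columns
nom_cols = ["nominal_feature"]                 # placeholder nominal columns
ord_cols = ["ordinal_feature"]                 # placeholder ordinal columns

# Scale numerical features and encode categorical features in one step.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("nom", OneHotEncoder(handle_unknown="ignore"), nom_cols),
        ("ord", OrdinalEncoder(), ord_cols),
    ]
)

# Fit on the training set only, then apply the same transformation everywhere.
X_train_prep = preprocessor.fit_transform(X_train)
X_val_prep = preprocessor.transform(X_val)
X_test_prep = preprocessor.transform(X_test)
```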
Use `pandas`, `numpy`, `seaborn`, and `matplotlib` for statistical analysis and visualizations.

- Univariate EDA:
  - Numerical columns:
    - Analyze descriptive statistics (e.g., mean, median, standard deviation) using `pandas` `describe`.
    - Visualize distributions with histograms using `seaborn` `histplot` and `matplotlib`.
  - Categorical columns:
    - Examine frequencies using `pandas` `value_counts`.
    - Visualize frequencies with bar plots (`seaborn` `barplot`) or a bar plot matrix (`matplotlib` `subplot`).
- Bivariate EDA:
  - Two numerical columns:
    - Analyze pairwise relationships with a correlation matrix (`pandas` `corr` and `numpy`) and visualize them with a heatmap (`seaborn` `heatmap`).
    - Visualize relationships with scatterplots (`seaborn` `scatterplot`) or a scatterplot matrix (`matplotlib` `subplot`).
  - Numerical and categorical columns:
    - Explore relationships with group-wise statistics (e.g., mean or median by category) using `pandas` `groupby` and `describe`.
    - Visualize results with bar plots (`seaborn` `barplot`) or a bar plot matrix (`matplotlib` `subplot`).
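A condensed sketch of these EDA steps might look like this; the file path and column names are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")  # placeholder file; column names below are placeholders too

# Univariate EDA: descriptive statistics and a histogram for a numerical column.
print(df.describe())
sns.histplot(data=df, x="numerical_column", bins=30)
plt.show()

# Univariate EDA: frequencies and a bar plot for a categorical column.
counts = df["categorical_column"].value_counts()
print(counts)
sns.barplot(x=counts.index, y=counts.values)
plt.show()

# Bivariate EDA: correlation matrix and heatmap for the numerical columns.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Bivariate EDA: group-wise statistics of a numerical column by category.
print(df.groupby("categorical_column")["numerical_column"].describe())
```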
Use `sklearn`, `xgboost`, and `pickle` for model training, evaluation, and saving.

- Train baseline models:
  - Establish performance benchmarks with the following models using `sklearn` and `xgboost`: Linear Regression, Logistic Regression, Elastic Net Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Multi-Layer Perceptron, and XGBoost.
- Hyperparameter tuning:
  - Perform hyperparameter tuning using grid search with `sklearn` `GridSearchCV` or random search with `RandomizedSearchCV` (see the sketch after this list).
- Model evaluation:
  - Regression task:
    - Calculate metrics such as RMSE, MAPE, and R-squared with `sklearn` `mean_squared_error`, `mean_absolute_percentage_error`, and `r2_score`.
    - Analyze errors with residual plots, error distributions, and feature-error relationships using `pandas`, `seaborn`, and `matplotlib`.
  - Classification task:
    - Create a classification report with metrics like accuracy, precision, recall, and F1-score using `sklearn` `classification_report`.
    - Analyze misclassifications using a confusion matrix with `sklearn` `confusion_matrix` and `ConfusionMatrixDisplay`.
    - Explore feature-misclassification relationships using `pandas`, `seaborn`, and `matplotlib`.
- Feature importance:
  - Visualize feature importances using `seaborn` and `matplotlib`, or `xgboost` `plot_importance`.
- Model prediction examples:
  - Show illustrative examples of model predictions with best, worst, and typical cases using `pandas`.
- Save the final model:
  - Save the best-performing model as a .pkl file using `pickle`.
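A minimal sketch of the baseline-plus-tuning workflow for a regression task is shown below. It uses synthetic data from `make_regression` in place of your preprocessed training and validation sets, and it covers only three of the listed models; the parameter grid is an illustrative assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

# Synthetic data stands in for the preprocessed training and validation sets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline models to establish performance benchmarks.
baselines = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = mean_squared_error(y_val, preds) ** 0.5
    print(f"{name}: RMSE={rmse:.2f}, R2={r2_score(y_val, preds):.3f}")

# Hyperparameter tuning with grid search (random search works the same way
# via RandomizedSearchCV with parameter distributions and n_iter).
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```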
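Evaluation and saving could then look roughly like this; again, synthetic data and a plain random forest stand in for your real test set and tuned model.

```python
import pickle

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data and a fitted model stand in for the real test set and the tuned model.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
best_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Regression metrics on the held-out test set.
y_pred = best_model.predict(X_test)
print("MAPE:", mean_absolute_percentage_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

# Residual plot to analyze errors.
sns.scatterplot(x=y_pred, y=y_test - y_pred)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()

# Save the best-performing model as a .pkl file.
with open("final_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```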
Follow the steps below to set up a Python virtual environment, install the required packages, and, if needed, set up environment variables for the project.
- Ensure you have Python installed on your system.
- Create a virtual environment:
  `python -m venv machine-learning-venv`
- Activate the virtual environment:
  - On Windows: `machine-learning-venv\Scripts\activate`
  - On macOS/Linux: `source machine-learning-venv/bin/activate`
- Ensure that `pip` is up to date:
  `pip install --upgrade pip`
- Install the required Python packages using the provided `requirements.txt` file:
  `pip install -r requirements.txt`
You're now ready to use the environment for your machine learning project!
- If your project requires sensitive information, such as API keys or database credentials, it is good practice to store this information securely in a `.env` file. Example `.env` file content:

        # Your API key
        API_KEY="your_api_key_here"

        # Your SQL database credentials
        SQL_USERNAME="your_sql_username_here"
        SQL_PASSWORD="your_sql_password_here"

- Replace the placeholder values with your actual values.
- Add the `.env` file to your `.gitignore` to ensure it is not accidentally committed to version control.
- The environment variables stored in your `.env` file will be loaded into your environment using the `load_dotenv()` function from the `python-dotenv` library.
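A minimal sketch of how the credentials above can be loaded and used, for example to read a MySQL table with `sqlalchemy` and `pandas`; the host, database, and table names are placeholders.

```python
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the variables defined in the .env file into the process environment.
load_dotenv()
user = os.getenv("SQL_USERNAME")
password = os.getenv("SQL_PASSWORD")

# Build a MySQL connection and read a table (host and names are placeholders).
engine = create_engine(f"mysql+mysqlconnector://{user}:{password}@localhost/your_database")
df = pd.read_sql("SELECT * FROM your_table", con=engine)
```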
This project is licensed under the MIT License.
This project was made possible with the help of the following resources:
- Header and footer images: Generated using the FLUX.1 [dev] image generator by Black Forest Labs, via Hugging Face.

