A versatile machine learning template that streamlines data preprocessing, exploratory data analysis, and modeling for both regression and classification tasks. It integrates hyperparameter tuning, model evaluation, and model selection, providing a flexible and efficient framework for machine learning workflows.
This repository provides a comprehensive machine learning template in a Jupyter Notebook file to streamline the key stages of the machine learning workflow for tabular data:
- Data Preprocessing:
  - Load, clean, transform, and save data using `pandas` and `sklearn`.
  - Handle duplicates, incorrect data types, missing values, and outliers.
  - Extract features, scale numerical features, and encode categorical features.
  - Split data into training, validation, and test sets.
- Exploratory Data Analysis (EDA):
  - Analyze descriptive statistics using `pandas` and `numpy`.
  - Visualize distributions, correlations, and relationships using `seaborn` and `matplotlib`.
- Modeling:
  - Train baseline models and perform hyperparameter tuning for regression and classification tasks with `sklearn` and `xgboost`.
  - Evaluate regression models (RMSE, MAPE, R-squared) and classification models (accuracy, precision, recall, F1-score).
  - Visualize feature importance, show model prediction examples, and save the final model with `pickle`.
This template provides a flexible, customizable foundation for various datasets and use cases, making it an ideal starting point for efficient and reproducible machine learning projects. It is specifically tailored to structured tabular data (e.g., .csv, .xls, or SQL tables) using Pandas and Scikit-learn. It is not optimized for text, image, or time series data, which require specialized preprocessing, models, and tools (e.g., TensorFlow, PyTorch).
Use `pandas`, `sklearn`, `sqlalchemy`, and `mysql-connector-python` for data loading, cleaning, transformation, and saving.

- Load data:
  - From a .csv file using `pandas` `read_csv`.
  - From a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `read_sql`.
- Remove duplicates:
  - Drop duplicate rows (e.g., based on the ID column) using `pandas` `drop_duplicates`.
- Handle incorrect data types:
  - Convert string columns to numerical types (`pandas` `astype`) and datetime types (`pandas` `to_datetime`).
- Extract features:
  - Create categorical features from string columns using custom functions with `pandas` `apply`.
  - Create numerical features from string columns using custom functions with `pandas` `apply` and `re` for numeric pattern matching.
  - Create boolean features from string columns using `lambda` functions with `pandas` `apply`.
- Handle missing values:
  - Delete rows with missing values using `pandas` `dropna`.
  - Impute missing values: fill in the median for numerical columns or the mode for categorical columns using `pandas` `fillna`.
- Handle outliers:
  - Remove univariate outliers using statistical methods (e.g., 3 standard deviations or 1.5 IQR) with a custom transformer class that inherits from `sklearn` `BaseEstimator` and `TransformerMixin` (see the sketch after this list).
- Save the preprocessed data:
  - As a .csv file using `pandas` `to_csv`.
  - In a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `to_sql`.
- Train-validation-test split:
  - Split data into training (e.g., 70%), validation (15%), and test (15%) sets using `sklearn` `train_test_split`.
- Polynomial features:
  - Create polynomial features using `sklearn` `PolynomialFeatures`.
- Feature scaling and encoding (a combined sketch follows the list):
  - Scale numerical features using standard scaling with `sklearn` `StandardScaler` or min-max normalization with `MinMaxScaler`.
  - Encode categorical features:
    - Nominal features: use one-hot encoding with `sklearn` `OneHotEncoder`.
    - Ordinal features: use ordinal encoding with `sklearn` `OrdinalEncoder`.
  - Apply scaling and encoding together using `sklearn` `ColumnTransformer`.
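As a sketch of the custom transformer approach mentioned in the outlier step, the class below removes rows outside 1.5 IQR. The class name `OutlierRemover` and its `factor` parameter are illustrative assumptions rather than the notebook's exact implementation, and it assumes a numeric-only DataFrame.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierRemover(BaseEstimator, TransformerMixin):
    """Drop rows whose values fall outside factor * IQR of the fitted data."""

    def __init__(self, factor=1.5):
        self.factor = factor  # 1.5 is the usual IQR rule; increase for a looser filter

    def fit(self, X, y=None):
        # Learn per-column bounds on the (training) data passed to fit.
        q1, q3 = X.quantile(0.25), X.quantile(0.75)
        iqr = q3 - q1
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        # Keep only rows where every column lies within the learned bounds.
        mask = ((X >= self.lower_) & (X <= self.upper_)).all(axis=1)
        return X[mask]


# Example usage on a numeric-only DataFrame (df and num_cols are placeholders):
# df_clean = OutlierRemover(factor=1.5).fit_transform(df[num_cols])
```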
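The split, scaling, and encoding steps could be wired together roughly as follows. The file path, the `target` column, and the column lists are placeholders to adapt to your dataset, not names used by the template.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

df = pd.read_csv("data.csv")                      # placeholder file
X, y = df.drop(columns=["target"]), df["target"]  # "target" is a placeholder

# 70% train, 15% validation, 15% test via two consecutive splits.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

num_cols = ["num_feature_1", "num_feature_2"]  # placeholder numerical columns
nom_cols = ["nominal_feature"]                 # placeholder nominal columns
ord_cols = ["ordinal_feature"]                 # placeholder ordinal columns

# Scale numerical features and encode categorical features in one step.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("nom", OneHotEncoder(handle_unknown="ignore"), nom_cols),
        ("ord", OrdinalEncoder(), ord_cols),
    ]
)

# Fit on the training set only, then apply the same transformation everywhere.
X_train_prep = preprocessor.fit_transform(X_train)
X_val_prep = preprocessor.transform(X_val)
X_test_prep = preprocessor.transform(X_test)
```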
Use `pandas`, `numpy`, `seaborn`, and `matplotlib` for statistical analysis and visualizations.

- Univariate EDA:
  - Numerical columns:
    - Analyze descriptive statistics (e.g., mean, median, standard deviation) using `pandas` `describe`.
    - Visualize distributions with histograms using `seaborn` `histplot` and `matplotlib`.
  - Categorical columns:
    - Examine frequencies using `pandas` `value_counts`.
    - Visualize frequencies with bar plots (`seaborn` `barplot`) or a bar plot matrix (`matplotlib` `subplot`).
- Bivariate EDA:
  - Two numerical columns:
    - Analyze pairwise relationships with a correlation matrix (`pandas` `corr` and `numpy`) and visualize them with a heatmap (`seaborn` `heatmap`).
    - Visualize relationships with scatterplots (`seaborn` `scatterplot`) or a scatterplot matrix (`matplotlib` `subplot`).
  - Numerical and categorical columns:
    - Explore relationships with group-wise statistics (e.g., mean or median by category) using `pandas` `groupby` and `describe`.
    - Visualize results with bar plots (`seaborn` `barplot`) or a bar plot matrix (`matplotlib` `subplot`).
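A condensed sketch of these EDA steps might look like this; the file path and column names are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")  # placeholder file; column names below are placeholders too

# Univariate EDA: descriptive statistics and a histogram for a numerical column.
print(df.describe())
sns.histplot(data=df, x="numerical_column", bins=30)
plt.show()

# Univariate EDA: frequencies and a bar plot for a categorical column.
counts = df["categorical_column"].value_counts()
print(counts)
sns.barplot(x=counts.index, y=counts.values)
plt.show()

# Bivariate EDA: correlation matrix and heatmap for the numerical columns.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Bivariate EDA: group-wise statistics of a numerical column by category.
print(df.groupby("categorical_column")["numerical_column"].describe())
```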
Use `sklearn`, `xgboost`, and `pickle` for model training, evaluation, and saving.

- Train baseline models:
  - Establish performance benchmarks with the following models using `sklearn` and `xgboost`: Linear Regression, Logistic Regression, Elastic Net Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Multi-Layer Perceptron, and XGBoost.
- Hyperparameter tuning:
  - Perform hyperparameter tuning using grid search with `sklearn` `GridSearchCV` or random search with `RandomizedSearchCV` (see the sketch after this list).
- Model evaluation:
  - Regression task:
    - Calculate metrics such as RMSE, MAPE, and R-squared with `sklearn` `mean_squared_error`, `mean_absolute_percentage_error`, and `r2_score`.
    - Analyze errors with residual plots, error distributions, and feature-error relationships using `pandas`, `seaborn`, and `matplotlib`.
  - Classification task:
    - Create a classification report with metrics like accuracy, precision, recall, and F1-score using `sklearn` `classification_report`.
    - Analyze misclassifications using a confusion matrix with `sklearn` `confusion_matrix` and `ConfusionMatrixDisplay`.
    - Explore feature-misclassification relationships using `pandas`, `seaborn`, and `matplotlib`.
- Feature importance:
  - Visualize feature importances using `seaborn` and `matplotlib`, or `xgboost` `plot_importance`.
- Model prediction examples:
  - Show illustrative examples of model predictions with best, worst, and typical cases using `pandas`.
- Save the final model:
  - Save the best-performing model as a .pkl file using `pickle`.
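A minimal sketch of the baseline-plus-tuning workflow for a regression task is shown below. It uses synthetic data from `make_regression` in place of your preprocessed training and validation sets, and it covers only three of the listed models; the parameter grid is an illustrative assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

# Synthetic data stands in for the preprocessed training and validation sets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline models to establish performance benchmarks.
baselines = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = mean_squared_error(y_val, preds) ** 0.5
    print(f"{name}: RMSE={rmse:.2f}, R2={r2_score(y_val, preds):.3f}")

# Hyperparameter tuning with grid search (random search works the same way
# via RandomizedSearchCV with parameter distributions and n_iter).
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```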
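Evaluation and saving could then look roughly like this; again, synthetic data and a plain random forest stand in for your real test set and tuned model.

```python
import pickle

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data and a fitted model stand in for the real test set and the tuned model.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
best_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Regression metrics on the held-out test set.
y_pred = best_model.predict(X_test)
print("MAPE:", mean_absolute_percentage_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

# Residual plot to analyze errors.
sns.scatterplot(x=y_pred, y=y_test - y_pred)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()

# Save the best-performing model as a .pkl file.
with open("final_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```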
Follow the steps below to set up a Python virtual environment, install the required packages, and, if needed, set up environment variables for the project.
- Ensure you have Python installed on your system.
- Create a virtual environment:
  `python -m venv machine-learning-venv`
- Activate the virtual environment:
  - On Windows: `machine-learning-venv\Scripts\activate`
  - On macOS/Linux: `source machine-learning-venv/bin/activate`
- Ensure that `pip` is up to date:
  `pip install --upgrade pip`
- Install the required Python packages using the provided `requirements.txt` file:
  `pip install -r requirements.txt`
You're now ready to use the environment for your machine learning project!
- If your project requires sensitive information, such as API keys or database credentials, it is good practice to store this information securely in a `.env` file. Example `.env` file content:

        # Your API key
        API_KEY="your_api_key_here"

        # Your SQL database credentials
        SQL_USERNAME="your_sql_username_here"
        SQL_PASSWORD="your_sql_password_here"

- Replace the placeholder values with your actual values.
- Add the `.env` file to your `.gitignore` to ensure it is not accidentally committed to version control.
- The environment variables stored in your `.env` file will be loaded into your environment using the `load_dotenv()` function from the `python-dotenv` library.
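A minimal sketch of how the credentials above can be loaded and used, for example to read a MySQL table with `sqlalchemy` and `pandas`; the host, database, and table names are placeholders.

```python
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the variables defined in the .env file into the process environment.
load_dotenv()
user = os.getenv("SQL_USERNAME")
password = os.getenv("SQL_PASSWORD")

# Build a MySQL connection and read a table (host and names are placeholders).
engine = create_engine(f"mysql+mysqlconnector://{user}:{password}@localhost/your_database")
df = pd.read_sql("SELECT * FROM your_table", con=engine)
```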
This project is licensed under the MIT License.
This project was made possible with the help of the following resources:
- Header and footer images: Generated using the FLUX.1 [dev] image generator by Black Forest Labs, via Hugging Face.

