JensBender/machine-learning-template

A versatile machine learning template that streamlines data preprocessing, exploratory data analysis, and modeling for both regression and classification tasks. It integrates hyperparameter tuning, model evaluation, and model selection, providing a flexible and efficient framework for machine learning workflows.


📋 Table of Contents

  1. Summary
  2. Data Preprocessing
  3. Exploratory Data Analysis (EDA)
  4. Modeling
  5. Getting Started
  6. License
  7. Credits

🎯 Summary

This repository provides a comprehensive machine learning template in a Jupyter Notebook file to streamline the key stages of the machine learning workflow for tabular data:

  • Data Preprocessing:
    • Load, clean, transform, and save data using pandas and sklearn.
    • Handle duplicates, incorrect data types, missing values, and outliers.
    • Extract features, scale numerical features, and encode categorical features.
    • Split data into training, validation, and test sets.
  • Exploratory Data Analysis (EDA):
    • Analyze descriptive statistics using pandas and numpy.
    • Visualize distributions, correlations, and relationships using seaborn and matplotlib.
  • Modeling:
    • Train baseline models and perform hyperparameter tuning for regression and classification tasks with sklearn and xgboost.
    • Evaluate regression (RMSE, MAPE, R-squared) and classification models (accuracy, precision, recall, F1-score).
    • Visualize feature importance, show model prediction examples, and save the final model with pickle.

This template provides a flexible, customizable foundation for various datasets and use cases, making it an ideal starting point for efficient and reproducible machine learning projects. It is specifically tailored to structured tabular data (e.g., .csv or .xls files, or SQL tables) using pandas and scikit-learn. It is not optimized for text, image, or time series data, which require specialized preprocessing, models, and tools (e.g., TensorFlow, PyTorch).

🛠️ Built With

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn
  • XGBoost
  • Jupyter Notebook

(back to top)

🧹 Data Preprocessing

Use pandas, sklearn, sqlalchemy, and mysql-connector-python for data loading, cleaning, transformation, and saving. Code sketches illustrating selected steps follow the list below.

  • Load data (see the first sketch after this list):
    • From a .csv file using pandas read_csv.
    • From a MySQL database table using sqlalchemy, mysql-connector-python, and pandas read_sql.
  • Remove duplicates:
    • Drop duplicate rows (e.g., based on the ID column) using pandas drop_duplicates.
  • Handle incorrect data types:
    • Convert string columns to numerical types (pandas astype) and datetime types (pandas to_datetime).
  • Extract features:
    • Create categorical features from string columns using custom functions with pandas apply.
    • Create numerical features from string columns using custom functions with pandas apply, and re for numeric pattern matching.
    • Create boolean features from string columns using lambda functions with pandas apply.
  • Handle missing values:
    • Delete rows with missing values using pandas dropna.
    • Impute missing values: Fill in the median for numerical columns or the mode for categorical columns using pandas fillna.
  • Handle outliers:
    • Remove univariate outliers using statistical methods (e.g., 3 standard deviations or 1.5 IQR) with a custom transformer class that inherits from sklearn BaseEstimator and TransformerMixin (see the second sketch after this list).
  • Save the preprocessed data:
    • As a .csv file using pandas to_csv.
    • In a MySQL database table using sqlalchemy, mysql-connector-python, and pandas to_sql.
  • Train-validation-test split:
    • Split data into training (e.g., 70%), validation (15%), and test (15%) sets using sklearn train_test_split.
  • Polynomial features:
    • Create polynomial features using sklearn PolynomialFeatures.
  • Feature scaling and encoding:
    • Scale numerical features using standard scaling with sklearn StandardScaler or min-max normalization with MinMaxScaler.
    • Encode categorical features:
      • Nominal features: Use one-hot encoding with sklearn OneHotEncoder.
      • Ordinal features: Use ordinal encoding with sklearn OrdinalEncoder.
    • Apply scaling and encoding together using sklearn ColumnTransformer (see the third sketch after this list).
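
A minimal sketch of the loading, cleaning, and feature-extraction steps above. The column names (id, price, date, area_text) and values are hypothetical placeholders, not part of the template:

    import re

    import pandas as pd

    # In practice, load from a file or database, e.g. df = pd.read_csv("data.csv");
    # here, a tiny made-up DataFrame stands in for the raw data
    df = pd.DataFrame({
        "id": [1, 2, 2, 3],
        "price": ["100", "250", "250", None],
        "date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-01"],
        "area_text": ["85 m2", "120 m2", "120 m2", "95 m2"],
    })

    # Drop duplicate rows based on the ID column
    df = df.drop_duplicates(subset="id")

    # Convert string columns to numerical and datetime types
    df["price"] = df["price"].astype(float)
    df["date"] = pd.to_datetime(df["date"])

    # Extract a numerical feature from a string column, e.g. "85 m2" -> 85.0
    def extract_area(text):
        match = re.search(r"\d+(\.\d+)?", str(text))
        return float(match.group()) if match else None

    df["area"] = df["area_text"].apply(extract_area)

    # Impute missing values with the median of the numerical column
    df["price"] = df["price"].fillna(df["price"].median())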
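
A minimal sketch of the custom outlier-removal transformer, here using the 1.5 IQR rule; the price column is again a hypothetical example:

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class IQROutlierRemover(BaseEstimator, TransformerMixin):
        """Remove rows outside factor * IQR in any of the given columns."""

        def __init__(self, columns, factor=1.5):
            self.columns = columns
            self.factor = factor

        def fit(self, X, y=None):
            # Learn per-column bounds on the training data only
            q1 = X[self.columns].quantile(0.25)
            q3 = X[self.columns].quantile(0.75)
            iqr = q3 - q1
            self.lower_ = q1 - self.factor * iqr
            self.upper_ = q3 + self.factor * iqr
            return self

        def transform(self, X):
            # Keep rows within bounds for all monitored columns; since this
            # drops rows, apply it outside a supervised sklearn Pipeline
            mask = np.ones(len(X), dtype=bool)
            for col in self.columns:
                mask &= X[col].between(self.lower_[col], self.upper_[col]).to_numpy()
            return X[mask]

    df = pd.DataFrame({"price": [10, 12, 11, 13, 400]})
    print(IQROutlierRemover(columns=["price"]).fit_transform(df))  # drops the 400 row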
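
And a sketch of the train-validation-test split plus scaling and encoding via ColumnTransformer, with hypothetical column names and made-up data:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

    df = pd.DataFrame({
        "age": [25, 32, 47, 51, 38, 29, 44, 36, 58, 41],
        "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
        "education": ["low", "mid", "high", "high", "mid", "low", "mid", "high", "high", "mid"],
        "target": [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
    })
    X, y = df.drop(columns="target"), df["target"]

    # 70/15/15 split: carve out the test set first, then split the remainder
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.15 / 0.85, random_state=42
    )

    # Scale numerical and encode nominal/ordinal features in one step
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),  # MinMaxScaler works the same way
        ("nom", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("ord", OrdinalEncoder(categories=[["low", "mid", "high"]]), ["education"]),
    ])

    # Fit on the training set only, then apply to all three splits
    X_train_t = preprocessor.fit_transform(X_train)
    X_val_t = preprocessor.transform(X_val)
    X_test_t = preprocessor.transform(X_test)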

(back to top)

🔍 Exploratory Data Analysis (EDA)

Use pandas, numpy, seaborn, and matplotlib for statistical analysis and visualizations. A short code sketch follows the list below.

  • Univariate EDA:
    • Numerical columns:
      • Analyze descriptive statistics (e.g., mean, median, standard deviation) using pandas describe.
      • Visualize distributions with histograms using seaborn histplot and matplotlib.
    • Categorical columns:
      • Examine frequencies using pandas value_counts.
      • Visualize frequencies with bar plots (seaborn barplot) or a bar plot matrix (matplotlib subplot).
  • Bivariate EDA:
    • Two numerical columns:
      • Analyze pairwise relationships with a correlation matrix (pandas corr and numpy) and visualize them with a heatmap (seaborn heatmap).
      • Visualize relationships with scatterplots (seaborn scatterplot) or a scatterplot matrix (matplotlib subplot).
    • Numerical and categorical columns:
      • Explore relationships with group-wise statistics (e.g., mean or median by category) using pandas groupby and describe.
      • Visualize results with bar plots (seaborn barplot) or a bar plot matrix (matplotlib subplot).
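
A short sketch of these EDA steps on a small hypothetical DataFrame:

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({
        "age": [25, 32, 47, 51, 38, 29],
        "income": [30000, 45000, 80000, 90000, 60000, 38000],
        "city": ["A", "B", "A", "C", "B", "A"],
    })

    # Univariate: descriptive statistics and frequencies
    print(df.describe())
    print(df["city"].value_counts())

    # Distribution of a numerical column
    sns.histplot(data=df, x="age", bins=5)
    plt.show()

    # Bivariate: correlation matrix of numerical columns as a heatmap
    sns.heatmap(df[["age", "income"]].corr(), annot=True, cmap="coolwarm")
    plt.show()

    # Group-wise statistics: mean income by city
    print(df.groupby("city")["income"].mean())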

(back to top)

🧠 Modeling

Use sklearn, xgboost, and pickle for model training, evaluation, and saving. Two code sketches follow the list below.

  • Train baseline models:
    • Establish performance benchmarks with the following models using sklearn and xgboost: Linear Regression, Logistic Regression, Elastic Net Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Multi-Layer Perceptron, and XGBoost.
  • Hyperparameter tuning:
    • Perform hyperparameter tuning using grid search with sklearn GridSearchCV or random search with RandomizedSearchCV (see the first sketch after this list).
  • Model evaluation (see the second sketch after this list):
    • Regression task:
      • Calculate metrics such as RMSE, MAPE, and R-squared with sklearn mean_squared_error, mean_absolute_percentage_error, and r2_score.
      • Analyze errors with residual plots, error distributions, and feature-error relationships using pandas, seaborn, and matplotlib.
    • Classification task:
      • Create a classification report with metrics like accuracy, precision, recall, and F1-score using sklearn classification_report.
      • Analyze misclassifications using a confusion matrix with sklearn confusion_matrix and ConfusionMatrixDisplay.
      • Explore feature-misclassification relationships using pandas, seaborn, and matplotlib.
  • Feature importance:
    • Visualize feature importances using seaborn and matplotlib or xgboost plot_importance.
  • Model prediction examples:
    • Show illustrative examples of model predictions with best, worst, and typical cases using pandas.
  • Save the final model:
    • Save the best-performing model as a .pkl file using pickle (shown at the end of the first sketch below).
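
A sketch of baseline training, grid-search tuning, and saving the tuned model with pickle, shown for a classification task on synthetic data; the model selection and parameter grid here are illustrative, not the template's exact configuration:

    import pickle

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from xgboost import XGBClassifier

    # Synthetic stand-in data; use your preprocessed splits instead
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Baseline models: default hyperparameters to establish benchmarks
    baselines = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=42),
        "xgboost": XGBClassifier(random_state=42),
    }
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        print(f"{name}: validation accuracy = {model.score(X_val, y_val):.3f}")

    # Hyperparameter tuning with grid search and 5-fold cross-validation
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
    grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)

    # Save the best-performing model as a .pkl file
    with open("final_model.pkl", "wb") as f:
        pickle.dump(grid.best_estimator_, f)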
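
And a sketch of the evaluation metrics for both task types, using made-up predictions purely for illustration:

    import numpy as np
    from sklearn.metrics import (
        classification_report,
        confusion_matrix,
        mean_absolute_percentage_error,
        mean_squared_error,
        r2_score,
    )

    # Regression: RMSE, MAPE, and R-squared
    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.8, 5.4, 2.9, 6.6])
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"RMSE={rmse:.3f}, MAPE={mape:.3%}, R2={r2:.3f}")

    # Classification: report plus confusion matrix
    y_true = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 0, 0, 1, 1]
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))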

(back to top)

🚀 Getting Started

Follow these steps to set up the virtual environment, install the required packages, and, if needed, set up environment variables for the project.

📦 Set Up Virtual Environment

Create and activate a Python virtual environment for this project, then install the required dependencies:

  • Ensure you have Python installed on your system.
  • Create a virtual environment:
    python -m venv machine-learning-venv
  • Activate the virtual environment:
    • On Windows:
      machine-learning-venv\Scripts\activate
    • On macOS/Linux:
      source machine-learning-venv/bin/activate
  • Ensure that pip is up to date:
    pip install --upgrade pip
  • Install the required Python packages using the provided requirements.txt file:
    pip install -r requirements.txt

You're now ready to use the environment for your machine learning project!

(back to top)

🗝️ Set Up Environment Variables

  • If your project requires sensitive information, such as API keys or database credentials, it is good practice to store this information securely in a .env file. Example .env file content:
    # Your API key
    API_KEY="your_api_key_here"
    
    # Your SQL database credentials
    SQL_USERNAME="your_sql_username_here"
    SQL_PASSWORD="your_sql_password_here"
    
  • Replace the placeholder values with your actual values.
  • Add the .env file to your .gitignore to ensure it is not accidentally committed to version control.
  • The environment variables stored in your .env file will be loaded into your environment using the load_dotenv() function from the python-dotenv library, as sketched below.
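
A minimal sketch of loading these credentials and using them for the MySQL workflow mentioned above; the host, database, and table names are placeholders:

    import os

    import pandas as pd
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    # Read the variables from the .env file into the process environment
    load_dotenv()
    username = os.getenv("SQL_USERNAME")
    password = os.getenv("SQL_PASSWORD")

    # Connect via mysql-connector-python and read a table into a DataFrame
    engine = create_engine(
        f"mysql+mysqlconnector://{username}:{password}@localhost:3306/my_database"
    )
    df = pd.read_sql("SELECT * FROM my_table", con=engine)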

(back to top)

©️ License

This project is licensed under the MIT License.

(back to top)

👏 Credits

This project was made possible with the help of the following resources:

(back to top)
