🛒 Retail Customer Churn Classification

A complete end-to-end machine learning system for predicting customer churn in online retail, featuring database normalization (3NF), 16 experiments, MLflow tracking, and production deployment.

📋 Table of Contents

Project Overview
Features
Architecture
Installation
Usage
Database Schema
Experiments
Deployment
API Documentation
Project Structure

🎯 Project Overview

This project implements a binary classification system to predict customer churn in an online retail environment using the UCI Online Retail Dataset. The system includes:

3NF Normalized Database: Proper database design with SQLite
16 Machine Learning Experiments: 4 algorithms × 4 configurations
Experiment Tracking: MLflow integration with DagsHub
Production API: FastAPI backend for model serving
User Interface: Streamlit frontend for predictions
Docker Deployment: Containerized services with docker-compose

Classification Problem

Target Variable: Customer Churn (Binary)

1 (Churned): Customer did not return within 90 days
0 (Retained): Customer made purchases within 90 days
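
The definition above does not say what the 90 days are measured from. The following is a minimal sketch of one plausible reading, assuming a cutoff date that ends the observation period and a 90-day outcome window after it. The `label_churn` helper and its `cutoff` parameter are hypothetical names, not taken from the project code.

```python
from datetime import date, timedelta

CHURN_WINDOW_DAYS = 90  # window length from the definition above


def label_churn(purchase_dates, cutoff):
    """Return 1 (churned) or 0 (retained) for one customer.

    purchase_dates: iterable of datetime.date values for that customer.
    cutoff: assumed end of the observation period (hypothetical parameter).
    """
    window_end = cutoff + timedelta(days=CHURN_WINDOW_DAYS)
    # Retained if at least one purchase falls in (cutoff, cutoff + 90 days].
    returned = any(cutoff < d <= window_end for d in purchase_dates)
    return 0 if returned else 1


cutoff = date(2011, 9, 1)
print(label_churn([date(2011, 8, 1), date(2011, 10, 15)], cutoff))  # 0
print(label_churn([date(2011, 8, 1)], cutoff))  # 1
```

Under this reading, only purchases before the cutoff would feed the features, so the label never leaks into the inputs.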

✨ Features

✅ 3NF normalized SQLite database
✅ RFM (Recency, Frequency, Monetary) analysis
✅ 16 experiments with different configurations
✅ Hyperparameter tuning with Optuna
✅ PCA for dimensionality reduction
✅ MLflow/DagsHub experiment tracking
✅ FastAPI REST API for inference
✅ Interactive Streamlit dashboard
✅ Docker containerization
✅ CI/CD ready

🏗️ Architecture

┌─────────────────┐      ┌──────────────────┐
│   Streamlit UI  │─────▶│   FastAPI API    │
│  (Port 8501)    │      │   (Port 8000)    │
└─────────────────┘      └──────────────────┘
                                  │
                                  ▼
                         ┌──────────────────┐
                         │  Trained Models  │
                         │   (Pickle files) │
                         └──────────────────┘

🚀 Installation

Prerequisites

Python 3.10+
Docker & Docker Compose (for deployment)
Git

1. Clone Repository

git clone https://github.com/yourusername/retail-churn-classification.git
cd retail-churn-classification

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Download Dataset

Download the UCI Online Retail Dataset from the UCI Machine Learning Repository.

Place the CSV file in: data/raw/online_retail.csv

5. Setup Environment Variables

cp .env.example .env
# Edit .env with your DagsHub credentials

📊 Usage

Step 1: Initialize Database

Create and populate the 3NF normalized database:

python database/init_db.py

Output:

database/retail.db - SQLite database with 4 normalized tables: Customers, Products, Invoices, InvoiceItems

Step 2: Generate ML Dataset

Create features and labels for machine learning:

python data/feature_engineering.py

Output:

data/processed/ml_dataset.csv - ML-ready dataset with RFM features
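
As an illustration of what the RFM features mean, here is a small sketch that derives them with one SQL query over an in-memory SQLite database. It is not the contents of data/feature_engineering.py: the trimmed-down tables, the sample rows, and the reference date are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down versions of Invoices and InvoiceItems with made-up rows.
conn.executescript("""
CREATE TABLE Invoices (InvoiceNo TEXT PRIMARY KEY, CustomerID INTEGER, InvoiceDate TEXT);
CREATE TABLE InvoiceItems (ItemID INTEGER PRIMARY KEY, InvoiceNo TEXT, TotalPrice REAL);
INSERT INTO Invoices VALUES
    ('A1', 1, '2011-06-01'), ('A2', 1, '2011-08-02'), ('B1', 2, '2011-07-03');
INSERT INTO InvoiceItems (InvoiceNo, TotalPrice) VALUES
    ('A1', 20.0), ('A1', 5.0), ('A2', 10.0), ('B1', 7.5);
""")

# Recency: days from the last invoice to the reference date.
# Frequency: number of distinct invoices. Monetary: total amount spent.
RFM_SQL = """
SELECT i.CustomerID,
       CAST(julianday(:ref) - julianday(MAX(i.InvoiceDate)) AS INTEGER) AS Recency,
       COUNT(DISTINCT i.InvoiceNo) AS Frequency,
       SUM(it.TotalPrice) AS Monetary
FROM Invoices i
JOIN InvoiceItems it ON it.InvoiceNo = i.InvoiceNo
GROUP BY i.CustomerID
ORDER BY i.CustomerID
"""

rfm = conn.execute(RFM_SQL, {"ref": "2011-09-01"}).fetchall()
for row in rfm:
    print(row)  # (CustomerID, Recency, Frequency, Monetary)
```

With the sample rows, customer 1 has two invoices totalling 35.0 and last bought 30 days before the reference date.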

Step 3: Run Experiments

Execute all 16 experiments:

python experiments/run_experiments.py

This will:

Train 16 models (4 algorithms × 4 configurations)
Save all models to models/ directory
Save metrics to results/experiment_results.json
Print comparison table
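
The 4 × 4 layout can be pictured as a simple grid. The sketch below enumerates it, assuming the four configurations are the combinations of PCA on/off and Optuna tuning on/off. The README names both techniques but does not spell out the grid, so `CONFIGURATIONS` is a guess and the identifiers are hypothetical.

```python
from itertools import product

ALGORITHMS = ["logistic_regression", "random_forest", "xgboost", "svm"]

# Assumed configurations (not confirmed by the project): PCA x tuning.
CONFIGURATIONS = [
    {"pca": False, "tuned": False},
    {"pca": True, "tuned": False},
    {"pca": False, "tuned": True},
    {"pca": True, "tuned": True},
]


def experiment_grid():
    """Return one dict per experiment, numbered from 1."""
    pairs = product(ALGORITHMS, CONFIGURATIONS)
    return [
        {"id": n, "algorithm": algo, **config}
        for n, (algo, config) in enumerate(pairs, start=1)
    ]


grid = experiment_grid()
print(len(grid))  # 16
```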

Expected Runtime: ~15-30 minutes depending on hardware

Step 4: Track with MLflow/DagsHub

python experiments/mlflow_tracking.py

This will:

Log all 16 experiments to DagsHub
Create comparison visualizations
Save charts to results/

Step 5: Run API Locally

cd api
uvicorn main:app --reload

API will be available at: http://localhost:8000

Step 6: Run Streamlit UI Locally

cd streamlit
streamlit run app.py

UI will be available at: http://localhost:8501

🗄️ Database Schema

3NF Normalized Design

Customers (CustomerID PK, Country, FirstPurchaseDate, LastPurchaseDate, TotalPurchases, TotalSpent)
Products (StockCode PK, Description, UnitPrice)
Invoices (InvoiceNo PK, CustomerID FK, InvoiceDate, Country)
InvoiceItems (ItemID PK, InvoiceNo FK, StockCode FK, Quantity, UnitPrice, TotalPrice)
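
The table list above could be expressed as SQLite DDL roughly as follows. This is a sketch, not the contents of database/init_db.py: the column types and the index name are assumptions, since the README gives only column names and keys.

```python
import sqlite3

# Column types are assumed; only names and PK/FK roles come from the README.
SCHEMA = """
CREATE TABLE Customers (
    CustomerID        INTEGER PRIMARY KEY,
    Country           TEXT,
    FirstPurchaseDate TEXT,
    LastPurchaseDate  TEXT,
    TotalPurchases    INTEGER,
    TotalSpent        REAL
);
CREATE TABLE Products (
    StockCode   TEXT PRIMARY KEY,
    Description TEXT,
    UnitPrice   REAL
);
CREATE TABLE Invoices (
    InvoiceNo   TEXT PRIMARY KEY,
    CustomerID  INTEGER REFERENCES Customers(CustomerID),
    InvoiceDate TEXT,
    Country     TEXT
);
CREATE TABLE InvoiceItems (
    ItemID     INTEGER PRIMARY KEY,
    InvoiceNo  TEXT REFERENCES Invoices(InvoiceNo),
    StockCode  TEXT REFERENCES Products(StockCode),
    Quantity   INTEGER,
    UnitPrice  REAL,
    TotalPrice REAL
);
-- Hypothetical index to speed up per-customer lookups.
CREATE INDEX idx_invoices_customer ON Invoices(CustomerID);
"""

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Enabling `PRAGMA foreign_keys` matters here because SQLite otherwise accepts rows that reference missing customers or products.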

Benefits:

✅ Reduced data redundancy
✅ Easy to update customer/product info
✅ Maintains referential integrity
✅ Optimized queries with indexes

🔬 Experiments

Algorithms Used

Logistic Regression: Fast, interpretable baseline
Random Forest: Ensemble method, handles non-linearity
XGBoost: Gradient boosting, often the strongest performer on tabular data
SVM: Support Vector Machine with RBF kernel

Hyperparameter Tuning

Using Optuna with 50 trials per model:

Bayesian optimization (TPE sampler)
3-fold cross-validation
F1-score as optimization metric

🐳 Deployment

Local Deployment with Docker

# Build and start all services
docker-compose up --build

# Run in background
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Services:

API: http://localhost:8000
Streamlit: http://localhost:8501

Cloud Deployment (DigitalOcean/Render)

Option 1: DigitalOcean

Create a Droplet (Ubuntu 22.04)
Install Docker and Docker Compose
Clone repository
Copy models to server
Run docker-compose up -d
Configure firewall (ports 8000, 8501)

Option 2: Render

Create new Web Service
Connect GitHub repository
Set Docker as runtime
Deploy api and streamlit as separate services
Configure environment variables

Environment Variables

Set the following for deployment:

# API
MODEL_PATH=/app/models/best_model.pkl

# DagsHub (optional for tracking)
DAGSHUB_USER=your-username
DAGSHUB_REPO=retail-churn-classification
DAGSHUB_TOKEN=your-token
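
A service could resolve `MODEL_PATH` at startup along these lines. This is a sketch under the assumption that the model is a plain pickle file, as the architecture diagram suggests. The `load_model` helper is a hypothetical name, not the project's actual loader.

```python
import os
import pickle
from pathlib import Path

DEFAULT_MODEL_PATH = "/app/models/best_model.pkl"  # value shown above


def load_model():
    """Load the pickled model named by MODEL_PATH (hypothetical helper)."""
    path = Path(os.environ.get("MODEL_PATH", DEFAULT_MODEL_PATH))
    if not path.is_file():
        raise FileNotFoundError(f"MODEL_PATH does not point to a file: {path}")
    with path.open("rb") as fh:
        # Unpickling runs arbitrary code, so only load files you trust.
        return pickle.load(fh)
```

Failing fast with a clear error when the file is missing makes a misconfigured container easier to diagnose than a failure on the first prediction request.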

📡 API Documentation

Endpoints

Health Check

GET /health

Model Info

GET /model/info

Single Prediction

POST /predict
Content-Type: application/json

{
  "Recency": 30,
  "Frequency": 5,
  "Monetary": 500.0,
  "InvoiceNo_nunique": 5,
  "Quantity_sum": 50.0,
  "Quantity_mean": 10.0,
  "TotalPrice_sum": 500.0,
  "TotalPrice_mean": 100.0,
  "TotalPrice_std": 0.0,
  "StockCode_nunique": 10,
  "CustomerLifetime": 180,
  "AvgDaysBetweenPurchases": 30.0,
  "Country_United_Kingdom": 1,
  ...
}

Response:

{
  "churn_probability": 0.25,
  "churn_prediction": 0,
  "risk_level": "Low",
  "timestamp": "2024-01-15T10:30:00"
}
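
For calling the endpoint from a script, a standard-library client might look like this sketch. The helper names are hypothetical, and the base URL assumes the API is running locally as described in the Usage section.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local API address from the Usage section


def build_predict_request(features, base_url=BASE_URL):
    """Build a POST /predict request carrying the features as JSON."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The `features` dict has to contain every field the model expects, including those elided from the sample request above. With the API running, `predict(features)["churn_probability"]` would give the probability shown in the sample response.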

Interactive API Docs

Visit http://localhost:8000/docs for Swagger UI
