Weekly-Sales-Transaction-Clustering

Segmenting customer behavior through K-Means clustering on weekly sales data.

🧠 Overview

This project segments customers by applying K-Means clustering to weekly sales transaction data. The goal is to identify distinct groups of customers based on their purchasing patterns, so that businesses can target their marketing, personalize sales approaches, and make informed product recommendations.

🔨 What I Built

This repository offers a suite of tools for data analysis, machine learning, and interactive visualization:

  • Interactive Exploratory Data Analysis (EDA): A dedicated module for generating interactive plots (histograms, box plots, scatter plots) to visually explore dataset features, distributions, and relationships.
  • K-Means Clustering Workflow: A detailed Jupyter Notebook outlining the end-to-end process of customer segmentation, including data loading, preprocessing, feature selection (using trimmed variance), optimal cluster determination (Elbow method, Silhouette scores), K-Means application, and visualization with PCA.
  • Dash Web Application: An interactive web dashboard built with Dash for visualizing and analyzing clustering results. It allows users to explore clusters in a 2D PCA-reduced space and understand the underlying patterns.
  • Streamlit User Interface: A user-friendly Streamlit application that empowers users to perform K-Means clustering interactively. It features dynamic selection of features and number of clusters, and visualizes the results through distribution plots, variance analysis, heatmaps, and PCA-reduced scatter plots.
  • Robust Data Wrangling: A modular script to clean and preprocess raw sales transaction data, specifically removing irrelevant columns and preparing the dataset for clustering.
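The feature selection step mentioned above ranks columns by trimmed variance, so that a handful of extreme values cannot make a feature look more informative than it is. The following is a minimal sketch of that idea, not the notebook's actual code: the function name `top_trimmed_variance_features`, the 10% trim on each tail, and the demo data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import trimmed_var


def top_trimmed_variance_features(df, n_features=5, trim=0.1):
    """Return the names of the n_features columns with the highest trimmed variance.

    Hypothetical helper: `trim` is the fraction of observations dropped from
    each tail before the variance is computed.
    """
    variances = {
        col: float(trimmed_var(df[col].to_numpy(), limits=(trim, trim)))
        for col in df.columns
    }
    ranked = pd.Series(variances).sort_values(ascending=False)
    return ranked.head(n_features).index.tolist()


# Synthetic demo data (not the sales dataset): three normal columns with
# different spreads, plus a constant column with a single extreme outlier.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "a": rng.normal(0, 1, 200),
        "b": rng.normal(0, 5, 200),
        "c": rng.normal(0, 10, 200),
        "spike": [0.0] * 199 + [1000.0],
    }
)
selected = top_trimmed_variance_features(demo, n_features=2)
```

With plain variance the `spike` column would rank first because of its single outlier; after trimming, its variance collapses to zero and the genuinely spread-out columns are selected instead.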

💭 Thought Process

My approach to building this project focused on creating a flexible and intuitive solution for customer segmentation. I prioritized modularity by separating data wrangling into a dedicated wrangle.py file, ensuring that the core data preparation logic could be easily reused across different parts of the project.

For exploratory data analysis, I opted for interactive visualizations using ipywidgets in EDA.py. This decision was driven by the need for dynamic exploration, allowing users to quickly grasp insights into feature distributions and relationships without rerunning code.

The core clustering workflow in Sales Transaction Clustering.ipynb details the analytical journey. I made the key decision to employ K-Means clustering due to its efficiency and interpretability for segmentation tasks. To enhance the robustness of the clustering, StandardScaler was consistently applied to features, addressing the sensitivity of K-Means to feature scales. Furthermore, Principal Component Analysis (PCA) was integrated for dimensionality reduction, which proved crucial for visualizing high-dimensional cluster results in an understandable 2D space. The consideration of both the Elbow method and Silhouette scores for optimal 'k' selection provided a more comprehensive evaluation of clustering quality.
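The workflow described above can be sketched in a few lines with scikit-learn. This is an illustrative outline under stated assumptions rather than the notebook's code: it uses synthetic data from `make_blobs` instead of the sales dataset, and the range of candidate `k` values and the random seeds are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected sales features (assumption).
X, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=42)

# Evaluate a range of k values: inertia feeds the Elbow plot,
# the silhouette score gives a second opinion on cluster quality.
inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, n_init=10, random_state=42),
    )
    labels = model.fit_predict(X)
    X_scaled = model.named_steps["standardscaler"].transform(X)
    inertias[k] = model.named_steps["kmeans"].inertia_
    silhouettes[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest silhouette score and refit.
best_k = max(silhouettes, key=silhouettes.get)
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=best_k, n_init=10, random_state=42),
)
labels = final_model.fit_predict(X)

# Project the scaled features to 2D for plotting the clusters.
X_2d = PCA(n_components=2, random_state=42).fit_transform(
    StandardScaler().fit_transform(X)
)
```

Scaling inside the pipeline keeps the preprocessing and the clustering fitted together, which matters because K-Means relies on Euclidean distances and would otherwise be dominated by the features with the largest numeric ranges.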

To make the clustering analysis accessible to a wider audience, I developed two interactive web applications: one using Dash (dash_app.py) and another with Streamlit (stream_lit.py). The choice to include both frameworks showcases different approaches to building interactive dashboards and allows users to choose their preferred interface. Streamlit, in particular, offered a rapid development pathway for creating a highly interactive and user-friendly experience, allowing dynamic feature selection and real-time visualization of clustering outcomes.

Throughout the project, emphasis was placed on clear visualization using seaborn, matplotlib, and plotly.express to effectively communicate the results of the clustering and provide actionable insights into customer segments.

🛠️ Tools & Tech Stack

| Layer | Technology |
|---|---|
| Language | Python |
| Data Wrangling | Pandas, re (Regular Expressions) |
| Scientific Computing | SciPy |
| Data Visualization | Seaborn, Matplotlib, Plotly Express |
| Interactive UI | ipywidgets, Dash, Streamlit, Jupyter Dash |
| Machine Learning | Scikit-learn (KMeans, StandardScaler, Pipeline, PCA, silhouette_score) |

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Git

Installation

git clone https://github.com/rashadmin/Weekly-Sales-Transaction-Clustering.git
cd Weekly-Sales-Transaction-Clustering
pip install -r requirements.txt

Note: If the repository does not yet include a requirements.txt file, create one before running the install command. Based on the libraries listed in the Tools & Tech Stack section, a sample requirements.txt would include:

pandas
scipy
seaborn
matplotlib
plotly
ipywidgets
scikit-learn
dash
jupyter-dash
streamlit

Data

Ensure the Sales_Transactions_Dataset_Weekly.csv file is present in the project's root directory.

Run

1. Jupyter Notebook for Detailed Analysis:

jupyter notebook "Sales Transaction Clustering.ipynb"

2. Dash Web Application:

python dash_app.py

Open your web browser and navigate to the address displayed in the console (usually http://127.0.0.1:8050/).

3. Streamlit Web Application:

streamlit run stream_lit.py

Open your web browser and navigate to the address displayed in the console (usually http://localhost:8501).

📖 Usage

Example 1: Exploring Data Interactively with EDA.py

To use the interactive EDA features in a Jupyter environment:

# In a Jupyter Notebook or IPython environment
from EDA import make_hist_box_plot, make_scatter_plot
import pandas as pd

df = pd.read_csv('Sales_Transactions_Dataset_Weekly.csv')

# Use ipywidgets to interact with these functions
# Example: make_hist_box_plot(df, 'Feature_Column')
# Example: make_scatter_plot(df, 'Feature_X', 'Feature_Y', 'Cluster_Label')

Example 2: Streamlit Interactive Clustering

Run the Streamlit application to visually perform clustering:

streamlit run stream_lit.py

Interact with the sidebar controls to:

  • Select the number of features for analysis.
  • Choose specific features from a multi-select dropdown.
  • Set the desired number of clusters (K).
  • View distribution plots, variance analysis, correlation heatmaps, and PCA-reduced scatter plots of the clusters.

📄 License

MIT © rashadmin
