Segmenting customer behavior through K-Means clustering on weekly sales data.
This project provides a comprehensive solution for performing customer segmentation using K-Means clustering on weekly sales transaction data. The primary goal is to identify distinct groups of customers based on their purchasing patterns, enabling businesses to implement targeted marketing strategies, personalize sales approaches, and make informed product recommendations.
This repository offers a suite of tools for data analysis, machine learning, and interactive visualization:
- Interactive Exploratory Data Analysis (EDA): A dedicated module for generating interactive plots (histograms, box plots, scatter plots) to visually explore dataset features, distributions, and relationships.
- K-Means Clustering Workflow: A detailed Jupyter Notebook outlining the end-to-end process of customer segmentation, including data loading, preprocessing, feature selection (using trimmed variance), optimal cluster determination (Elbow method, Silhouette scores), K-Means application, and visualization with PCA.
- Dash Web Application: An interactive web dashboard built with Dash for visualizing and analyzing clustering results. It allows users to explore clusters in a 2D PCA-reduced space and understand the underlying patterns.
- Streamlit User Interface: A user-friendly Streamlit application that empowers users to perform K-Means clustering interactively. It features dynamic selection of features and number of clusters, and visualizes the results through distribution plots, variance analysis, heatmaps, and PCA-reduced scatter plots.
- Robust Data Wrangling: A modular script to clean and preprocess raw sales transaction data, specifically removing irrelevant columns and preparing the dataset for clustering.
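The trimmed-variance feature selection mentioned above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper name `select_by_trimmed_variance` is hypothetical, and the synthetic `W0`…`W4` columns merely stand in for the dataset's weekly sales columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

def select_by_trimmed_variance(df, n_features, proportiontocut=0.1):
    """Rank columns by trimmed variance (robust to outliers) and keep the top n."""
    trimmed_var = {
        col: stats.trimboth(df[col].to_numpy(), proportiontocut).var()
        for col in df.columns
    }
    ranked = sorted(trimmed_var, key=trimmed_var.get, reverse=True)
    return df[ranked[:n_features]]

# Synthetic stand-in for the weekly sales columns (W0..W4, increasing spread)
demo = pd.DataFrame({f"W{i}": (i + 1) * np.linspace(-1, 1, 100) for i in range(5)})
top = select_by_trimmed_variance(demo, n_features=2)  # keeps the widest-spread weeks
```

Trimming the extremes before computing variance keeps a handful of outlier transactions from dominating the feature ranking.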
My approach to building this project focused on creating a flexible and intuitive solution for customer segmentation. I prioritized modularity by separating data wrangling into a dedicated wrangle.py file, ensuring that the core data preparation logic could be easily reused across different parts of the project.
For exploratory data analysis, I opted for interactive visualizations using ipywidgets in EDA.py. This decision was driven by the need for dynamic exploration, allowing users to quickly grasp insights into feature distributions and relationships without rerunning code.
The core clustering workflow in Sales Transaction Clustering.ipynb details the analytical journey. I chose K-Means clustering for its efficiency and interpretability in segmentation tasks. Because K-Means is sensitive to feature scales, StandardScaler was applied consistently before clustering. Principal Component Analysis (PCA) was integrated for dimensionality reduction, which proved crucial for visualizing high-dimensional cluster results in an understandable 2D space. Finally, using both the Elbow method and Silhouette scores to select the optimal k provided a more comprehensive evaluation of clustering quality.
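The scaling, k-selection, and PCA steps described above can be sketched as below. This is a condensed sketch on synthetic data (four well-separated blobs standing in for the weekly-sales feature matrix), not the notebook's exact code:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weekly-sales feature matrix (four clear segments)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = make_pipeline(StandardScaler(),
                          KMeans(n_clusters=k, n_init=10, random_state=42))
    labels = model.fit_predict(X)
    inertias[k] = model.named_steps["kmeans"].inertia_  # basis of the Elbow plot
    silhouettes[k] = silhouette_score(X, labels)        # cohesion vs. separation

best_k = max(silhouettes, key=silhouettes.get)

# Project the standardised features to 2D for plotting the final clusters
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
```

Wrapping StandardScaler and KMeans in a single pipeline guarantees the scaler is refit for every candidate k, so no scaling step can be accidentally skipped.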
To make the clustering analysis accessible to a wider audience, I developed two interactive web applications: one using Dash (dash_app.py) and another with Streamlit (stream_lit.py). The choice to include both frameworks showcases different approaches to building interactive dashboards and allows users to choose their preferred interface. Streamlit, in particular, offered a rapid development pathway for creating a highly interactive and user-friendly experience, allowing dynamic feature selection and real-time visualization of clustering outcomes.
Throughout the project, emphasis was placed on clear visualization using seaborn, matplotlib, and plotly.express to effectively communicate the results of the clustering and provide actionable insights into customer segments.
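As a rough illustration of the kind of cluster plot these libraries produce, a matplotlib scatter of PCA-reduced points coloured by K-Means label might look like this (synthetic data and a headless backend; a sketch, not the project's actual figure):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Three synthetic "customer segments" in a 10-dimensional feature space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 10)) for c in (0.0, 5.0, 10.0)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=20)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Customer segments in PCA space")
fig.savefig("clusters.png")
```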
| Layer | Technology |
|---|---|
| Language | Python |
| Data Wrangling | Pandas, re (Regular Expressions) |
| Scientific Computing | SciPy |
| Data Visualization | Seaborn, Matplotlib, Plotly Express |
| Interactive UI | ipywidgets, Dash, Streamlit, Jupyter Dash |
| Machine Learning | Scikit-learn (KMeans, StandardScaler, Pipeline, PCA, silhouette_score) |
- Python 3.8+
- Git
```bash
git clone https://github.com/rashadmin/Weekly-Sales-Transaction-Clustering.git
cd Weekly-Sales-Transaction-Clustering
pip install -r requirements.txt
```

Note: if a `requirements.txt` file is not included in the repository, create one containing the libraries used by the project:

```
pandas
scipy
seaborn
matplotlib
plotly
ipywidgets
scikit-learn
dash
jupyter-dash
streamlit
```
Ensure the `Sales_Transactions_Dataset_Weekly.csv` file is present in the project's root directory.
1. Jupyter Notebook for Detailed Analysis:

   ```bash
   jupyter notebook "Sales Transaction Clustering.ipynb"
   ```

2. Dash Web Application:

   ```bash
   python dash_app.py
   ```

   Open your web browser and navigate to the address displayed in the console (usually http://127.0.0.1:8050/).

3. Streamlit Web Application:

   ```bash
   streamlit run stream_lit.py
   ```

   Open your web browser and navigate to the address displayed in the console (usually http://localhost:8501).
To use the interactive EDA features in a Jupyter environment:
```python
# In a Jupyter Notebook or IPython environment
from EDA import make_hist_box_plot, make_scatter_plot
import pandas as pd

df = pd.read_csv('Sales_Transactions_Dataset_Weekly.csv')

# Use ipywidgets to interact with these functions
# Example: make_hist_box_plot(df, 'Feature_Column')
# Example: make_scatter_plot(df, 'Feature_X', 'Feature_Y', 'Cluster_Label')
```

Run the Streamlit application to perform clustering visually:

```bash
streamlit run stream_lit.py
```

Interact with the sidebar controls to:
- Select the number of features for analysis.
- Choose specific features from a multi-select dropdown.
- Set the desired number of clusters (K).
- View distribution plots, variance analysis, correlation heatmaps, and PCA-reduced scatter plots of the clusters.
- Pandas Documentation — Data manipulation and analysis
- Scikit-learn Documentation — Machine learning algorithms (KMeans, StandardScaler, PCA, Pipeline)
- Streamlit Documentation — Building interactive web applications
- Dash Documentation — Building analytical web applications
- Plotly Express Documentation — High-level interface for Plotly
- Seaborn Tutorial — Statistical data visualization
- Matplotlib Tutorial — Basic plotting library
- ipywidgets Documentation — Interactive HTML widgets for Jupyter notebooks
- SciPy Documentation — Scientific and technical computing
- Python `re` Module Documentation — Regular expression operations
MIT © rashadmin