
15 - Data Mining Project 3: Clustering Comparison (K-Means, Mean-Shift, DBSCAN)



[🇧🇷 Português] [🇬🇧 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester, 2025
Professor: Dr. Daniel Rodrigues da Silva, PhD in Mathematics



Sponsor Quantum Software Development






Important

⚠️ Heads Up







🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.



Tip

This repository is a review of the Statistics course from the undergraduate program Humanistic AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository




Table of Contents

  1. Project Overview
  2. What's in this repo
  3. Quick Start (run the code)
  4. Step-by-Step Explanation
  5. Code Step-by-Step
  6. Algorithms used (K-Means, Mean-Shift, DBSCAN)
  7. How we chose DBSCAN eps (K-distance graph)
  8. Visualization
  9. Results summary & interpretation
  10. Next steps & suggestions
  11. Requirements & environment
  12. References
  13. License & credits



1. Project Overview

This project loads a CSV dataset (Dados-Grupo4.csv), inspects and cleans it, applies feature scaling, and compares three clustering algorithms: K-Means, Mean-Shift, and DBSCAN. It includes plots on a dark background with a turquoise palette and clear explanations to help anyone understand the workflow and results.



2. What's in this repo

  • Dados-Grupo4.csv — main dataset file.
  • notebook.ipynb or run_clustering.py — main code handling loading, cleaning, clustering, and plotting.
  • README.md — this documentation.
  • requirements.txt — list of Python packages needed.



3. Quick Start (run the code)

3.1- Open Colab or your local Python environment.


3.2- Upload Dados-Grupo4.csv to the working folder.


3.3- Install dependencies:


pip install -r requirements.txt

3.4- Either run notebook.ipynb cell by cell or execute:


python run_clustering.py

3.5- Example requirements.txt:


pandas
numpy
matplotlib
seaborn
scikit-learn



4. Step-by-Step Explanation

  • We open the table (CSV) — like opening a spreadsheet.
  • Count how many rows (lines) and columns (types of information) it has.
  • Look at basic numbers: averages, smallest, biggest — this helps understand the data.
  • Remove any extra "Unnamed: 0" column if present.
  • If some boxes are empty, fill them with the most common value (mode).
  • If two rows are identical, delete the duplicates.
  • Scale the numbers so large values don't dominate the patterns.
  • Use three methods to group points (K-Means, Mean-Shift, DBSCAN).
  • Draw the groups as pictures with a dark background and turquoise color.
  • Compare the results and explain what each method discovered.




5. Code Step-by-Step

  • Typical code steps (already in the repo):



What it does: import libraries, set dark theme and turquoise palette, load CSV and print shape.


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure matplotlib for dark background
plt.style.use('dark_background')
sns.set_palette('GnBu_r')

# Load the dataset
df = pd.read_csv('/content/Dados-Grupo4.csv')

# Display the number of rows and columns
print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")



What it does: run df.describe(), drop 'Unnamed: 0' if it exists, fill missing values with the column mode, and drop duplicate rows.



print(df.describe())

if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])

# fill missing
for col in df.columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])

# drop duplicates
df = df.drop_duplicates()
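As a quick sanity check (an extra step, not in the original code), you can confirm the cleaning worked before moving on to scaling:

print('Missing values per column:')
print(df.isnull().sum())
print('Duplicate rows remaining:', df.duplicated().sum())
print(f'Shape after cleaning: {df.shape}')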



What it does: standardize numeric features and produce the initial scatter plot (figsize 12×8).



from sklearn.preprocessing import StandardScaler

columns_to_scale = ['Coluna1', 'Coluna2']  # adapt if columns differ
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[columns_to_scale]), columns=columns_to_scale)

# --- PLOT 1: Initial scatter plot ---
plt.figure(figsize=(12, 8))
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'])
plt.title('Initial Scatter Plot of Scaled Data')
plt.xlabel('Scaled Coluna1')
plt.ylabel('Scaled Coluna2')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

PLOT 1 — Initial Scatter




Tip

👌🏻

To save: add plt.savefig('initial_scatter.png', dpi=300, bbox_inches='tight') before the plt.show() call.



What it does: computes the distance to the 4th nearest neighbor for each point and plots the sorted distances — the K-distance graph used to pick eps.


from sklearn.neighbors import NearestNeighbors
import numpy as np

neigh = NearestNeighbors(n_neighbors=4)
neigh.fit(df_scaled)
distances, indices = neigh.kneighbors(df_scaled)
distances = np.sort(distances[:, 3], axis=0)  # distance to 4th NN

# --- PLOT 2: K-distance graph ---
plt.figure(figsize=(12, 8))
plt.plot(distances)
plt.title('K-distance Graph for DBSCAN')
plt.xlabel('Data Points sorted by Distance')
plt.ylabel('Epsilon (Distance)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

PLOT 2 - generated by plt.plot(distances) + plt.show().

This plot is crucial for choosing the eps value — look for the “elbow” (sharp bend).
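If you prefer not to eyeball the bend, here is a minimal heuristic (an addition, not part of the original notebook): pick the sorted-distance point that falls farthest below the straight line joining the curve's endpoints. It assumes the typical K-distance shape, a slow rise followed by a sharp one, so always confirm against the plot.

import numpy as np

# assumes `distances` is the sorted array computed in the block above
n = len(distances)
chord = np.linspace(distances[0], distances[-1], n)  # line between first and last points
elbow_idx = int(np.argmax(chord - distances))        # farthest point below the chord
print(f'Suggested eps near the elbow: {distances[elbow_idx]:.3f}')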




Tip

👌🏻

To save: plt.savefig('k_distance.png', dpi=300, bbox_inches='tight').



What it does: runs K-Means, Mean-Shift and DBSCAN; stores labels; plots the three results side-by-side.


from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.cluster import estimate_bandwidth

# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df_scaled['kmeans_labels'] = kmeans.fit_predict(df_scaled[['Coluna1','Coluna2']])

# Mean-Shift
bandwidth = estimate_bandwidth(df_scaled[['Coluna1', 'Coluna2']], quantile=0.2, n_samples=len(df_scaled))
meanshift = MeanShift(bandwidth=bandwidth, bin_seeding=True)
df_scaled['meanshift_labels'] = meanshift.fit_predict(df_scaled[['Coluna1','Coluna2']])

# DBSCAN (choose eps from k-distance)
dbscan = DBSCAN(eps=0.25, min_samples=4)
df_scaled['dbscan_labels'] = dbscan.fit_predict(df_scaled[['Coluna1','Coluna2']])

# --- PLOT 3: Comparison (three subplots) ---
plt.figure(figsize=(20, 7))

plt.subplot(1, 3, 1)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['kmeans_labels'], palette='GnBu_r', legend='full')
plt.title('K-Means Clustering (K=3)')
plt.grid(True, linestyle='--', alpha=0.7)

plt.subplot(1, 3, 2)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['meanshift_labels'], palette='GnBu_r', legend='full')
plt.title(f'Mean-Shift (bandwidth={bandwidth:.2f})')
plt.grid(True, linestyle='--', alpha=0.7)

plt.subplot(1, 3, 3)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['dbscan_labels'], palette='GnBu_r', legend='full')
plt.title('DBSCAN (eps=0.25, min_samples=4)')
plt.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

PLOT 3 - Comparison: the sequence plt.subplot(...); sns.scatterplot(...); plt.show() generates the three plots together (K-Means, Mean-Shift, DBSCAN).
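As a follow-up sketch (not part of the original code), you can overlay the centroids that K-Means stores in kmeans.cluster_centers_ to see where each cluster's center sits on the scaled plane:

# assumes kmeans was fit on df_scaled[['Coluna1', 'Coluna2']] as above
centers = kmeans.cluster_centers_  # one (x, y) row per cluster

plt.figure(figsize=(12, 8))
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'],
                hue=df_scaled['kmeans_labels'], palette='GnBu_r', legend='full')
plt.scatter(centers[:, 0], centers[:, 1], c='white', s=200, marker='X',
            edgecolors='black', label='centroids')
plt.title('K-Means Clusters with Centroids')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()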




Tip

👌🏻

To save all three subplots as a single image: before plt.show(), use plt.savefig('comparison_three_algorithms.png', dpi=300, bbox_inches='tight').

To save a separate image per algorithm, move each subplot block into its own cell and save it individually.



What it does: prints how many clusters each method found and optionally computes silhouette score.


print(f"Number of K-Means clusters: {df_scaled['kmeans_labels'].nunique()}")
print(f"Number of Mean-Shift clusters: {df_scaled['meanshift_labels'].nunique()}")
print(f"Number of DBSCAN clusters (excluding noise -1): {df_scaled['dbscan_labels'].nunique() - (1 if -1 in df_scaled['dbscan_labels'].unique() else 0)}")

# Optional: silhouette
from sklearn.metrics import silhouette_score
print('KMeans silhouette:', silhouette_score(df_scaled[['Coluna1','Coluna2']], df_scaled['kmeans_labels']))
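A hedged extension of the block above: silhouette needs at least two clusters, and for DBSCAN it is common to exclude the noise label (-1) before scoring. This sketch also reports how many points DBSCAN marked as noise:

X = df_scaled[['Coluna1', 'Coluna2']]
dbscan_labels = df_scaled['dbscan_labels']
print('DBSCAN noise points:', (dbscan_labels == -1).sum())

mask = dbscan_labels != -1  # drop noise before scoring
if dbscan_labels[mask].nunique() > 1:
    print('DBSCAN silhouette (noise excluded):',
          silhouette_score(X[mask], dbscan_labels[mask]))
if df_scaled['meanshift_labels'].nunique() > 1:
    print('Mean-Shift silhouette:',
          silhouette_score(X, df_scaled['meanshift_labels']))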



9. Results summary & interpretation

  • K-Means (K=3): Found 3 clusters — standard baseline, assumes round groups.
  • Mean-Shift: Found 4 clusters — adapts to dense regions automatically.
  • DBSCAN (eps chosen from K-distance, min_samples=4): Found 5 clusters + noise — good for dense groups and spotting outliers.
  • Interpretation: Each method groups points differently depending on its rules. Like sorting toys by color vs. by how close they are on a shelf — the piles will be different.
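To put a number on the "different piles" intuition, a short sketch (assuming the label columns created earlier) compares the labelings pairwise with the Adjusted Rand Index: 1.0 means identical partitions, values near 0 mean essentially unrelated ones. Note that ARI treats DBSCAN's noise label (-1) as just another cluster.

from sklearn.metrics import adjusted_rand_score

pairs = [('kmeans_labels', 'meanshift_labels'),
         ('kmeans_labels', 'dbscan_labels'),
         ('meanshift_labels', 'dbscan_labels')]
for a, b in pairs:
    print(f'ARI {a} vs {b}: {adjusted_rand_score(df_scaled[a], df_scaled[b]):.3f}')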



10. Next steps & suggestions

- Compute metrics like silhouette score or Davies-Bouldin index to compare methods numerically (see the Davies-Bouldin sketch after the example code below).

- Try other K values for K-Means; test different quantiles for Mean-Shift bandwidth.

- With more than 2 features, try PCA/t-SNE/UMAP for visualization.

- If DBSCAN finds too much noise, adjust eps/min_samples or test HDBSCAN.

- Make a slide comparing all three plots side by side, with one-sentence conclusions.

- Example silhouette code:


from sklearn.metrics import silhouette_score

# score on the two scaled feature columns only, not the added label columns
print('KMeans silhouette:', silhouette_score(df_scaled[['Coluna1', 'Coluna2']], df_scaled['kmeans_labels']))
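And a sketch of the Davies-Bouldin comparison suggested in the first bullet above (lower is better; DBSCAN noise is dropped first, and methods that found only one cluster are skipped):

from sklearn.metrics import davies_bouldin_score

X = df_scaled[['Coluna1', 'Coluna2']]
for col in ['kmeans_labels', 'meanshift_labels', 'dbscan_labels']:
    labels = df_scaled[col]
    mask = labels != -1  # only affects DBSCAN
    if labels[mask].nunique() > 1:
        score = davies_bouldin_score(X[mask], labels[mask])
        print(f'{col}: Davies-Bouldin = {score:.3f}')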



11. Requirements & environment

- Python 3.8 or higher

- pandas, numpy, matplotlib, seaborn, scikit-learn

- Optional: Jupyter Notebook or Google Colab



  • πŸ‘¨πŸ½β€πŸš€ Andson Ribeiro - Slide into my inbox

  • πŸ‘©πŸ»β€πŸš€ Fabiana ⚑️ Campanari - Shoot me an email

  • πŸ‘¨πŸ½β€πŸš€ JosΓ© Augusto de Souza Oliveira - email

  • πŸ§‘πŸΌβ€πŸš€ Luan Fabiano - email

  • πŸ‘¨πŸ½β€πŸš€ Pedro Barrenco - email

  • πŸ§‘πŸΌβ€πŸš€ Pedro Vyctor - Hit me up by email





12. References

1. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.

2. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD-96.

3. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial — Uma Abordagem de Aprendizado de Máquina. 2nd Ed. LTC.

4. Larson, R. & Farber, B. (2015). Estatística Aplicada. Pearson.

5. MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations — origin of K-Means.

6. Comaniciu, D. & Meer, P. (2002). Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE TPAMI.

7. scikit-learn documentation — clustering algorithms.

8. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — general DS reference.







🛸 My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➒➀ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
