15- Data Mining / Project 3 - Clustering Algorithms Exploration and Comparison: K-Means, Mean-Shift, DBSCAN
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Professor Doctor in Mathematics Daniel Rodrigues da Silva
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
- Project Overview
- What's in this repo
- Quick Start (run the code)
- Step-by-Step Explanation
- Code Step-by-Step
- Algorithms used (K-Means, Mean-Shift, DBSCAN)
- How we chose DBSCAN eps (K-distance graph)
- Visualization
- Results summary & interpretation
- Next steps & suggestions
- Requirements & environment
- References
- License & credits
This project loads a CSV dataset (Dados-Grupo4.csv), inspects and cleans it, applies feature scaling, and compares three clustering algorithms: K-Means, Mean-Shift, and DBSCAN. The plots use a dark background with a turquoise palette, and each step is explained in plain language so the workflow and results are easy to follow.
- Dados-Grupo4.csv: main dataset file.
- notebook.ipynb or run_clustering.py: main code handling loading, cleaning, clustering, and plotting.
- README.md: this documentation.
- requirements.txt: list of Python packages needed.
3.1- Open Colab or your local Python environment.
3.2- Upload Dados-Grupo4.csv to the working folder.
3.3- Install dependencies:
pip install -r requirements.txt
3.4- Either run notebook.ipynb cell by cell or execute:
python run_clustering.py
3.5- Example requirements.txt:
pandas
numpy
matplotlib
seaborn
scikit-learn
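Before running the notebook, it can help to confirm that the packages listed above are actually installed. The snippet below is a minimal sketch using only the standard library; the helper name `check_requirements` is hypothetical and not part of this repo.

```python
from importlib import metadata

# Distribution names as listed in the example requirements.txt above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "scikit-learn"]


def check_requirements(packages):
    """Return {package: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in check_requirements(REQUIRED).items():
        status = version or "MISSING - run: pip install -r requirements.txt"
        print(f"{name:<14} {status}")
```

`importlib.metadata` is available from Python 3.8 onward, which matches the minimum version listed in the requirements section.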
- We open the table (CSV), like opening a spreadsheet.
- Count how many rows (lines) and columns (types of information) it has.
- Look at basic numbers (averages, smallest, biggest) to understand the data.
- Remove any extra "Unnamed: 0" column if present.
- If some boxes are empty, fill them with the most common value (mode).
- If two rows are identical, delete duplicates.
- Scale the numbers so large values don't dominate the patterns.
- Use three methods to group points (K-Means, Mean-Shift, DBSCAN).
- Draw the groups as pictures with a dark background and turquoise color.
- Compare the results and explain what each method discovered.
- Typical code steps (already in the repo):
5.1 - Environment & load data
What it does: import libraries, set dark theme and turquoise palette, load CSV and print shape.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Configure matplotlib for dark background
plt.style.use('dark_background')
sns.set_palette('GnBu_r')
# Load the dataset
df = pd.read_csv('/content/Dados-Grupo4.csv')
# Display the number of rows and columns
print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
What it does: runs df.describe(), removes 'Unnamed: 0' if it exists, fills missing values with the mode, and drops duplicates.
print(df.describe())
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])
# Fill missing values with each column's most frequent value (mode)
for col in df.columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])
# Drop duplicate rows
df = df.drop_duplicates()
What it does: standardizes numeric features and produces the initial scatter plot (figsize 12×8).
from sklearn.preprocessing import StandardScaler
columns_to_scale = ['Coluna1', 'Coluna2'] # adapt if columns differ
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[columns_to_scale]), columns=columns_to_scale)
# --- PLOT 1: Initial scatter plot ---
plt.figure(figsize=(12, 8))
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'])
plt.title('Initial Scatter Plot of Scaled Data')
plt.xlabel('Scaled Coluna1')
plt.ylabel('Scaled Coluna2')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
PLOT 1 - Initial Scatter
Tip
To save: add plt.savefig('initial_scatter.png', dpi=300, bbox_inches='tight') before plt.show().
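To see why scaling matters before clustering, the sketch below standardizes a small made-up array by hand. The data and the helper `standardize` are illustrative assumptions, not the project's dataset or code. The helper applies the z-score formula that StandardScaler uses: subtract the column mean and divide by the population standard deviation.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 6000.0],
])


def standardize(values):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # NumPy defaults to ddof=0, as StandardScaler does
    return (values - mean) / std


Z = standardize(X)

# Euclidean distance between the first two rows, before and after scaling.
raw_gap = np.linalg.norm(X[0] - X[1])
scaled_gap = np.linalg.norm(Z[0] - Z[1])

print("column means after scaling:", Z.mean(axis=0).round(6))
print("column stds after scaling: ", Z.std(axis=0).round(6))
print(f"raw distance row0-row1:    {raw_gap:.2f}")
print(f"scaled distance row0-row1: {scaled_gap:.2f}")
```

In the raw data the distance is driven almost entirely by the second column. After scaling, both columns contribute on a comparable footing. Unlike StandardScaler, this hand-written version does not guard against constant columns (zero standard deviation).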
What it does: computes, for each point, the distance to its 4th nearest neighbor (counting the point itself, because the same data is used to fit and to query) and plots the sorted distances. This K-distance graph is used to pick eps.
from sklearn.neighbors import NearestNeighbors
import numpy as np
neigh = NearestNeighbors(n_neighbors=4)
neigh.fit(df_scaled)
distances, indices = neigh.kneighbors(df_scaled)
distances = np.sort(distances[:, 3], axis=0)  # 4th column: 4th neighbor, counting the point itself at distance 0
# --- PLOT 2: K-distance graph ---
plt.figure(figsize=(12, 8))
plt.plot(distances)
plt.title('K-distance Graph for DBSCAN')
plt.xlabel('Data Points sorted by Distance')
plt.ylabel('Epsilon (Distance)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
PLOT 2 - generated by plt.plot(distances) + plt.show().
This plot is crucial for choosing the eps value: look for the "elbow" (sharp bend).
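Reading the elbow by eye is subjective, so a rough numeric cross-check can be useful. The sketch below uses one simple heuristic (an assumption here, not the method used in this project): on normalized axes, it picks the point of the sorted curve that lies farthest below the straight line joining its first and last points. The function `knee_eps` and the synthetic points are hypothetical; in the workflow above, the sorted `distances` array would be passed in instead.

```python
import numpy as np


def knee_eps(sorted_distances):
    """Return (index, eps) at the point farthest below the first-to-last chord.

    Assumes an ascending curve with at least two distinct values.
    """
    y = np.asarray(sorted_distances, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the geometry.
    x_n = (x - x[0]) / (x[-1] - x[0])
    y_n = (y - y[0]) / (y[-1] - y[0])
    # On normalised axes the chord is the line y = x; an ascending convex
    # curve lies below it, so the gap x - y peaks at the bend.
    gap = x_n - y_n
    idx = int(np.argmax(gap))
    return idx, float(y[idx])


# Synthetic stand-in data: two tight blobs plus a few scattered outliers.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(100, 2))
blob_b = rng.normal(loc=(3.0, 3.0), scale=0.1, size=(100, 2))
outliers = rng.uniform(low=-2.0, high=5.0, size=(10, 2))
points = np.vstack([blob_a, blob_b, outliers])

# Brute-force k-distances; column 0 is each point's distance to itself (0).
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
pairwise.sort(axis=1)
k_distances = np.sort(pairwise[:, 3])

idx, eps = knee_eps(k_distances)
print(f"suggested eps ~ {eps:.3f} (curve index {idx} of {len(k_distances)})")
```

Treat the result as a starting point to compare against the plot, not as a replacement for inspecting it.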
What it does: runs K-Means, Mean-Shift and DBSCAN; stores labels; plots the three results side-by-side.
from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.cluster import estimate_bandwidth
# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df_scaled['kmeans_labels'] = kmeans.fit_predict(df_scaled[['Coluna1','Coluna2']])
# Mean-Shift
bandwidth = estimate_bandwidth(df_scaled[['Coluna1', 'Coluna2']], quantile=0.2, n_samples=len(df_scaled))
meanshift = MeanShift(bandwidth=bandwidth, bin_seeding=True)
df_scaled['meanshift_labels'] = meanshift.fit_predict(df_scaled[['Coluna1','Coluna2']])
# DBSCAN (choose eps from k-distance)
dbscan = DBSCAN(eps=0.25, min_samples=4)
df_scaled['dbscan_labels'] = dbscan.fit_predict(df_scaled[['Coluna1','Coluna2']])
# --- PLOT 3: Comparison (three subplots) ---
plt.figure(figsize=(20, 7))
plt.subplot(1, 3, 1)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['kmeans_labels'], palette='GnBu_r', legend='full')
plt.title('K-Means Clustering (K=3)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.subplot(1, 3, 2)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['meanshift_labels'], palette='GnBu_r', legend='full')
plt.title(f'Mean-Shift (bandwidth={bandwidth:.2f})')
plt.grid(True, linestyle='--', alpha=0.7)
plt.subplot(1, 3, 3)
sns.scatterplot(x=df_scaled['Coluna1'], y=df_scaled['Coluna2'], hue=df_scaled['dbscan_labels'], palette='GnBu_r', legend='full')
plt.title('DBSCAN (eps=0.25, min_samples=4)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
PLOT 3 - Comparison: the sequence plt.subplot(...); sns.scatterplot(...); plt.show() generates the three plots together (K-Means, Mean-Shift, DBSCAN).
Tip
To save all three subplots as a single image: before plt.show(), use plt.savefig('comparison_three_algorithms.png', dpi=300, bbox_inches='tight').
To save a separate image for each algorithm: move each subplot block into its own cell and save each figure individually.
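K-Means needs the number of clusters up front, and the code above fixes it at 3. One common way to sanity-check that choice is to sweep several values of K and compare silhouette scores. The sketch below does this on synthetic data from `make_blobs`, which is only a stand-in for the project's scaled features. The range of K values is also an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features (not the project's data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better
    print(f"K={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("K with the highest silhouette:", best_k)
```

The highest silhouette is a hint rather than a verdict. It favors compact, well-separated groups, which is also what K-Means assumes.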
What it does: prints how many clusters each method found and optionally computes silhouette score.
print(f"Number of K-Means clusters: {df_scaled['kmeans_labels'].nunique()}")
print(f"Number of Mean-Shift clusters: {df_scaled['meanshift_labels'].nunique()}")
print(f"Number of DBSCAN clusters (excluding noise -1): {df_scaled['dbscan_labels'].nunique() - (1 if -1 in df_scaled['dbscan_labels'].unique() else 0)}")
# Optional: silhouette
from sklearn.metrics import silhouette_score
print('KMeans silhouette:', silhouette_score(df_scaled[['Coluna1','Coluna2']], df_scaled['kmeans_labels']))
- K-Means (K=3): Found 3 clusters. Standard baseline; assumes roughly round groups.
- Mean-Shift: Found 4 clusters. Adapts to dense regions automatically.
- DBSCAN (eps chosen from K-distance, min_samples=4): Found 5 clusters + noise. Good for dense groups and spotting outliers.
- Interpretation: Each method groups points differently depending on its rules. It is like sorting toys by color versus by how close they are on a shelf: the piles will be different.
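One way to put numbers on the comparison above is to compute the same metrics for every labelling. The sketch below is illustrative only: it runs on synthetic `make_blobs` data rather than Dados-Grupo4.csv, reuses the parameter values shown earlier, and the helper `summarize` is hypothetical. DBSCAN noise points (label -1) are excluded before scoring, which is a design choice of this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's scaled features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)
X = StandardScaler().fit_transform(X)


def summarize(X, labels):
    """Cluster count, noise share, silhouette and Davies-Bouldin for one labelling."""
    labels = np.asarray(labels)
    keep = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[keep]))
    summary = {"clusters": n_clusters, "noise_share": float(1 - keep.mean()),
               "silhouette": None, "davies_bouldin": None}
    # Both scores need at least 2 clusters and fewer clusters than points.
    if 2 <= n_clusters < keep.sum():
        summary["silhouette"] = float(silhouette_score(X[keep], labels[keep]))
        summary["davies_bouldin"] = float(davies_bouldin_score(X[keep], labels[keep]))
    return summary


bandwidth = estimate_bandwidth(X, quantile=0.2)
results = {
    "K-Means": summarize(X, KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)),
    "Mean-Shift": summarize(X, MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X)),
    "DBSCAN": summarize(X, DBSCAN(eps=0.25, min_samples=4).fit_predict(X)),
}
for name, row in results.items():
    print(name, row)
```

Higher silhouette and lower Davies-Bouldin both indicate tighter, better-separated clusters. Because noise is dropped before scoring, DBSCAN's scores should be read together with its noise share.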
- Compute metrics like silhouette score or Davies-Bouldin index to compare methods numerically.
- Try other K values for K-Means; test different quantiles for Mean-Shift bandwidth.
- With more than 2 features, try PCA/t-SNE/UMAP for visualization.
- If DBSCAN finds too much noise, adjust eps/min_samples or test HDBSCAN.
- Make a slide comparing all three plots side by side, with one-sentence conclusions.
- Example silhouette code:
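from sklearn.metrics import silhouette_score
# Score only the feature columns; df_scaled also holds the label columns by this point
print('KMeans silhouette:', silhouette_score(df_scaled[['Coluna1', 'Coluna2']], df_scaled['kmeans_labels']))
- Python 3.8 or higher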
- pandas, numpy, matplotlib, seaborn, scikit-learn
- Optional: Jupyter Notebook or Google Colab
9. Our Crew:
- Andson Ribeiro
- Fabiana Campanari
- José Augusto de Souza Oliveira
- Luan Fabiano
- Pedro Barrenco
- Pedro Vyctor
10. Bibliography
1. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.
2. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
3. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina. 2nd Ed. LTC.
4. Larson, R. & Farber, B. (2015). Estatística Aplicada. Pearson.
5. MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations (origin of K-Means).
6. Comaniciu, D. & Meer, P. (2002). Mean Shift: A Robust Approach Toward Feature Space Analysis.
7. scikit-learn documentation: clustering algorithms.
8. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (general DS reference).
Copyright 2025 Quantum Software Development. Code released under the MIT License.


