- Objective
- About the data
- Research Theory behind Decision-Making Process 🚧
- EDA
- Preprocessing
- Experimentation Theory behind Decision-Making Process 🚧
- Technologies 🚧
- Installation 🚧
This project started a long time ago as a data science project for my personal portfolio, with the goal of gaining experience with NLP techniques for text classification. I later decided to grow it into a fully implemented project combining Machine Learning, Deep Learning (LLMs) and MLOps methods.
This project is based on four datasets extracted from Kaggle:
- Spam Email, with 5,572 rows
- Spam Email Classification Dataset, with 83,446 rows
- Email Classification (Ham-Spam), with 179 rows
- Spam email Dataset, with 5,695 rows
All datasets have a similar format consisting of two columns:
- Message/text/email that contains the emails
- Message/label/spam that contains ham/spam or 0/1 values
Due to Git LFS constraints, neither the datasets nor the concatenated raw dataset created in the notebook inside 01_feature_pipeline are included in this repository.
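A minimal sketch of how the four datasets could be normalized into a single raw dataset with pandas; the file names and source column names below are assumptions for illustration, not the actual ones used in the notebook:

```python
# Sketch: normalize and concatenate the four Kaggle datasets.
# File names and source column names are assumptions.
import pandas as pd

sources = {
    "spam_email.csv": ("Message", "Category"),
    "spam_email_classification.csv": ("text", "label"),
    "email_classification_ham_spam.csv": ("email", "label"),
    "spam_email_dataset.csv": ("text", "spam"),
}

frames = []
for path, (text_col, label_col) in sources.items():
    df = pd.read_csv(path)[[text_col, label_col]]
    df.columns = ["text", "label"]  # unify column names across sources
    frames.append(df)

raw = pd.concat(frames, ignore_index=True)
raw["label"] = raw["label"].replace({"ham": 0, "spam": 1}).astype(int)  # unify labels to 0/1
```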
- Basic overview of the training set features.
- Analysis of different approaches for feature engineering.
- Analysis of different approaches for data cleaning.
- Basic visualization of features.
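A basic overview along these lines could be computed as follows (the path and column names are assumptions carried over from the sketch above):

```python
# Sketch: quick overview of class balance and a simple engineered feature.
import pandas as pd

raw = pd.read_csv("data/raw_dataset.csv")  # hypothetical path to the concatenated dataset

print(raw["label"].value_counts(normalize=True))   # class balance (spam vs. ham)
raw["n_chars"] = raw["text"].str.len()             # message length as a candidate feature
print(raw.groupby("label")["n_chars"].describe())  # does length differ by class?
```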
- Change of data types
- Drop duplicates
- Character cleaning, based on this Notebook (see the sketch after this list):
- Replacement of special characters
- Replacement of emojis
- Conversion to lowercase
- Removal of HTML tags
- Removal of URLs
- Replacement of numbers with "number"
- Replacement of e-mail addresses with "emailaddr"
- Removal of punctuation
- Removal of non-alphabetic characters
- Collapse of multiple whitespaces into single whitespace
- Tokenization
- Removal of stopwords
- Lemmatization
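A minimal sketch of the steps above, assuming NLTK for stopwords and lemmatization. Function names are illustrative; emoji and special-character handling are omitted, and tokenization is a plain whitespace split since the cleaning step already normalizes the text:

```python
# Sketch: character cleaning, tokenization, stopword removal, lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                             # conversion to lowercase
    text = re.sub(r"<[^>]+>", " ", text)            # removal of HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # removal of URLs
    text = re.sub(r"\S+@\S+", " emailaddr ", text)  # replace e-mail addresses
    text = re.sub(r"\d+", " number ", text)         # replace numbers
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation / non-alphabetic chars
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

def preprocess(text: str) -> list[str]:
    tokens = clean_text(text).split()               # tokenization (whitespace split)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]
```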
This section only contains results. For the decision-making process behind them, please refer to the Experimentation README file.
- Base model BOW + Multinomial Naive Bayes
Classification Report (Test):

```
              precision    recall  f1-score   support

           0       0.99      0.97      0.98       453
           1       0.82      0.95      0.88        63

    accuracy                           0.97       516
   macro avg       0.91      0.96      0.93       516
weighted avg       0.97      0.97      0.97       516
```
- Recall 0.95: the model detects 95% of class 1 (spam) instances.
- Precision 0.82: lower than we would like, since there is a significant number of False Positives.
- F1-score 0.88: a decent value, but since recall is high and precision less so, there is room for improvement.
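For reference, a baseline along these lines can be reproduced with scikit-learn roughly as follows; the dataset path and column names are assumptions:

```python
# Sketch: BOW + Multinomial Naive Bayes baseline.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("data/raw_dataset.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vectorizer = CountVectorizer()                   # bag-of-words features
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = MultinomialNB().fit(X_train_bow, y_train)
print(classification_report(y_test, model.predict(X_test_bow)))
```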
- Docker
- AWS
Build the Docker container:

```bash
docker build -t mlflow .
```

Run the Docker container:

```bash
docker run -p 5000:5000 -v $(pwd)/mlflow_artifacts:/mlflow/artifacts mlflow
```
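For context, a minimal Dockerfile consistent with these commands could look like the sketch below; this is an assumption, not the project's actual Dockerfile:

```dockerfile
# Hypothetical Dockerfile for the MLflow tracking server.
FROM python:3.10-slim
RUN pip install --no-cache-dir mlflow
EXPOSE 5000
CMD ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", \
     "--default-artifact-root", "/mlflow/artifacts"]
```

With the volume mount, artifacts written to /mlflow/artifacts inside the container persist in ./mlflow_artifacts on the host, and the MLflow UI is then reachable at http://localhost:5000.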