- Objective
- About the data
- Research Theory behind Decision-Making Process 🚧
- EDA
- Preprocessing
- Experimentation Theory behind Decision-Making Process 🚧
- Technologies 🚧
- Installation 🚧
This project started a long time ago as a data science project for my personal portfolio, with the goal of gaining experience with NLP techniques for text classification. I later decided to grow it into a fully implemented project combining Machine Learning, Deep Learning (LLMs) and MLOps methods.
This project is based on four datasets extracted from Kaggle:
- Spam Email, with 5,572 rows
- Spam Email Classification Dataset, with 83,446 rows
- Email Classification (Ham-Spam), with 179 rows
- Spam email Dataset, with 5,695 rows
All datasets have a similar format consisting of two columns:
- Message/text/email that contains the emails
- Message/label/spam that contains ham/spam or 0/1 values
Due to Git LFS constraints, neither the datasets nor the concatenated raw dataset created in the notebook inside 01_feature_pipeline are included in this repository.
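A minimal sketch of how the four datasets could be normalized into a single raw dataset with pandas; the file names and source column names below are assumptions for illustration, not the actual ones used in the notebook:

```python
# Sketch: normalize and concatenate the four Kaggle datasets.
# File names and source column names are assumptions.
import pandas as pd

sources = {
    "spam_email.csv": ("Message", "Category"),
    "spam_email_classification.csv": ("text", "label"),
    "email_classification_ham_spam.csv": ("email", "label"),
    "spam_email_dataset.csv": ("text", "spam"),
}

frames = []
for path, (text_col, label_col) in sources.items():
    df = pd.read_csv(path)[[text_col, label_col]]
    df.columns = ["text", "label"]  # unify column names across sources
    frames.append(df)

raw = pd.concat(frames, ignore_index=True)
raw["label"] = raw["label"].replace({"ham": 0, "spam": 1}).astype(int)  # unify labels to 0/1
```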
- Basic overview of the training set features.
- Analysis of different approaches for feature engineering.
- Analysis of different approaches for data cleaning.
- Basic visualization of features.
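A basic overview along these lines could be computed as follows (the path and column names are assumptions carried over from the sketch above):

```python
# Sketch: quick overview of class balance and a simple engineered feature.
import pandas as pd

raw = pd.read_csv("data/raw_dataset.csv")  # hypothetical path to the concatenated dataset

print(raw["label"].value_counts(normalize=True))   # class balance (spam vs. ham)
raw["n_chars"] = raw["text"].str.len()             # message length as a candidate feature
print(raw.groupby("label")["n_chars"].describe())  # does length differ by class?
```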
- Change of data types
- Drop duplicates
- Character cleaning, based on this Notebook (see the sketch after this list):
- Replacement of special characters
- Replacement of emojis
- Conversion to lowercase
- Removal of HTML tags
- Removal of URLs
- Replacement of numbers with "number"
- Replacement of e-mail addresses with "emailaddr"
- Removal of punctuation
- Removal of non-alphabetic characters
- Collapse of multiple whitespaces into single whitespace
- Tokenization
- Removal of stopwords
- Lemmatization
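A minimal sketch of the steps above, assuming NLTK for stopwords and lemmatization. Function names are illustrative; emoji and special-character handling are omitted, and tokenization is a plain whitespace split since the cleaning step already normalizes the text:

```python
# Sketch: character cleaning, tokenization, stopword removal, lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                             # conversion to lowercase
    text = re.sub(r"<[^>]+>", " ", text)            # removal of HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # removal of URLs
    text = re.sub(r"\S+@\S+", " emailaddr ", text)  # replace e-mail addresses
    text = re.sub(r"\d+", " number ", text)         # replace numbers
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation / non-alphabetic chars
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

def preprocess(text: str) -> list[str]:
    tokens = clean_text(text).split()               # tokenization (whitespace split)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]
```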
This section only contains results. For the decision-making process behind them, please refer to the Experimentation README file.
- Base model BOW + Multinomial Naive Bayes
Classification Report (Test):

```
              precision    recall  f1-score   support

           0       0.99      0.97      0.98       453
           1       0.82      0.95      0.88        63

    accuracy                           0.97       516
   macro avg       0.91      0.96      0.93       516
weighted avg       0.97      0.97      0.97       516
```
- Recall 0.95: the model detects 95% of class 1 (spam) instances.
- Precision 0.82: lower than we would like, since there is a significant number of False Positives.
- F1-score 0.88: a decent value, but since recall is high and precision less so, there is room for improvement.
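For reference, a baseline along these lines can be reproduced with scikit-learn roughly as follows; the dataset path and column names are assumptions:

```python
# Sketch: BOW + Multinomial Naive Bayes baseline.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("data/raw_dataset.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vectorizer = CountVectorizer()                   # bag-of-words features
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = MultinomialNB().fit(X_train_bow, y_train)
print(classification_report(y_test, model.predict(X_test_bow)))
```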
- Docker
- AWS
Build the Docker container:

```bash
docker build -t mlflow .
```

Run the Docker container:

```bash
docker run -p 5000:5000 -v $(pwd)/mlflow_artifacts:/mlflow/artifacts mlflow
```
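For context, a minimal Dockerfile consistent with these commands could look like the sketch below; this is an assumption, not the project's actual Dockerfile:

```dockerfile
# Hypothetical Dockerfile for the MLflow tracking server.
FROM python:3.10-slim
RUN pip install --no-cache-dir mlflow
EXPOSE 5000
CMD ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", \
     "--default-artifact-root", "/mlflow/artifacts"]
```

With the volume mount, artifacts written to /mlflow/artifacts inside the container persist in ./mlflow_artifacts on the host, and the MLflow UI is then reachable at http://localhost:5000.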