For our assignment, we analyzed SMS text messages to classify them as ‘spam’ or ‘ham’. As online communication has shifted from email to various forms of direct messaging, phishers have adjusted where and how they target individuals with spam. Users want to know that their accounts are secure, and they do not have time to deal with spam notifications and messages. We set out to create a classification model using the following algorithms: Logistic Regression, Random Forest, and an LSTM neural network. For each of these models, we performed feature extraction using the sparse vectorization techniques CountVectorizer and TfidfVectorizer, with bigrams and other n-grams, and tested these against a dense vector representation built from word embeddings. After performing an ablation study on the models and tuning the hyperparameters of each via cross-validation, we compared a set of performance metrics (AUC, precision, recall, and F1 score) to choose the best model. In this paper, we further discuss the business value of counteracting spam and malicious messages through classification, describe the data sources and preprocessing, and explain which model and feature extraction method performs best. Following this, we construct a proposal for deploying our solution into production.
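The comparison of sparse feature extractors described above could look roughly like the following. This is a minimal sketch assuming scikit-learn; the toy messages, the `evaluate` helper, the fold count, and the hyperparameters are illustrative assumptions, not the project's actual data, code, or tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration only (not the project's dataset).
messages = [
    "WIN a free prize now, text CLAIM to 80088",
    "URGENT your account is locked, verify at this link",
    "Congratulations you won cash, reply YES to claim",
    "Free entry in our weekly prize draw, text WIN",
    "Claim your free ringtone now, limited offer",
    "You have been selected for a cash reward, call now",
    "Are we still meeting for lunch today?",
    "Can you pick up milk on the way home",
    "Running late, see you in ten minutes",
    "Thanks for the notes, see you in class",
    "Happy birthday, hope you have a great day",
    "Call me when you get home tonight",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = spam, 0 = ham


def evaluate(vectorizer):
    """Hypothetical helper: score one vectorizer with logistic regression."""
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    # Out-of-fold spam probabilities, pooled across all folds.
    proba = cross_val_predict(
        model, messages, labels, cv=cv, method="predict_proba"
    )[:, 1]
    pred = (proba >= 0.5).astype(int)
    return {
        "auc": roc_auc_score(labels, proba),
        "precision": precision_score(labels, pred, zero_division=0),
        "recall": recall_score(labels, pred, zero_division=0),
        "f1": f1_score(labels, pred, zero_division=0),
    }


results = {
    "count_unigram": evaluate(CountVectorizer()),
    "tfidf_unigram_bigram": evaluate(TfidfVectorizer(ngram_range=(1, 2))),
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is refit on each training fold, so no information from the held-out messages leaks into feature extraction. The same `evaluate` pattern would extend to a Random Forest by swapping the final pipeline step.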