Spam Classifier

End-of-year project for an introductory course in artificial intelligence.

Approach

  1. Preparing Data
    • Create the data table from the raw emails.
    • Naïve Bayes pre-processing (smoothing).
    • k-NN pre-processing (a minimal sketch follows this list):
      • Load only significant words into memory to conserve memory.
      • Remove stopwords.
      • Convert text to lower case.
      • Score numeric tokens.
      • Score $ mentions.
  2. ????
  3. Classify Emails
  4. Stats
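
A minimal sketch of the k-NN pre-processing and the add-one (Laplace) smoothing mentioned above, assuming emails arrive as plain-text strings. The stopword list, token pattern, scoring rules, and function names are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of the pre-processing step: convert case, drop stopwords,
# and score numeric tokens and $ mentions. Illustrative only.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}  # illustrative subset

def preprocess(email_text):
    """Lower-case, drop stopwords, and score numeric tokens and $ mentions."""
    text = email_text.lower()                              # convert case
    dollar_score = text.count("$")                         # score $ mentions
    tokens = re.findall(r"[a-z0-9$]+", text)
    numeric_score = sum(tok.isdigit() for tok in tokens)   # score numeric tokens
    words = [t for t in tokens if t not in STOPWORDS and not t.isdigit()]
    word_counts = Counter(words)   # count remaining words (stopwords and pure numbers dropped)
    return word_counts, numeric_score, dollar_score

def laplace_smoothed_prob(word_count, class_total, vocab_size, alpha=1):
    """Add-one (Laplace) smoothing for Naïve Bayes word probabilities."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)
```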

Future Work

  • Classification of other types of email, such as phishing or critical (high-priority) messages.

References

k-Nearest Neighbours (k-NN)

  • Jason Brownlee. K-Nearest Neighbors for Machine Learning. Accessed on 2019-11-16 at https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/.

    • The entire training dataset is stored.
    • Predictions are made by searching the entire dataset for the K closest instances (a small sketch follows the references).
    • A distance measure is used to compare instances.
      • Euclidean distance: distance(x, xi) = sqrt( sum_j (x_j - xi_j)^2 )
      • Hamming: pending
      • Manhattan: pending
      • Minkowski: pending
      • Tanimoto: pending
      • Jaccard: pending
      • Mahalanobis: pending
      • Cosine: pending
    • Tuning of K.
    • A.K.A.: Instance-Based Learning, Lazy Learning, Non-Parametric.
    • Can be used for regression and classification problems.
      • Regression: prediction based on the mean or median of the K-most similar instances.
      • Classification: output is the most common class among the K-most similar instances.
    • Works well with a small number of input variables; more input variables grow the input space exponentially (the general "Curse of Dimensionality" problem).
    • Preparing Data:
      • Rescale data
      • Reduce dimensionality
  • Hardik Jaroli. K-Nearest Neighbors (KNN) with Python. Accessed on 2019-11-16 at https://datascienceplus.com/k-nearest-neighbors-knn-with-python/.

    • Pros: simplicity, any number of classes, adding data is easy.
    • Cons: prediction cost, poor performance on high-dimensional data, categorical features don't work well.
  • Onel Harrison. Machine Learning Basics with the K-Nearest Neighbors Algorithm. Accessed on 2019-11-16 at https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761.

    • Pros: Simplicity, no model, no parameter tuning, versatility (classification, regression, search).
    • Cons: Slow.
  • Sumit Dua. Text Classification using K Nearest Neighbors. Accessed on 2019-11-16 at https://towardsdatascience.com/text-classification-using-k-nearest-neighbors-46fa8a77acc5.

  • Slides from Catherine

    • CSI 4107
      • Machine Learning for IR:
        • Use standard vector space inverted index methods to find the k nearest neighbors.
        • No feature selection required.
        • No training required.
        • Scales well with large number of classes (no need for n classifiers for n classes).
        • Classes can influence each other.
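
A small sketch tying the notes above together: min-max rescaling of feature vectors, Euclidean distance, lazy prediction by searching the stored dataset for the K nearest neighbours, and majority-vote classification. The feature names and toy data are assumptions for illustration, not the project's actual code.

```python
# k-NN classification sketch based on the mechanics summarised in the references above.
import math
from collections import Counter

def rescale(vectors):
    """Min-max rescale each feature to [0, 1] (the 'rescale data' step)."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

def euclidean_distance(x, xi):
    """distance(x, xi) = sqrt( sum_j (x_j - xi_j)^2 )"""
    return math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))

def knn_classify(train_vectors, train_labels, query, k=3):
    """Lazy learning: store the whole dataset and search it at prediction time."""
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean_distance(query, pair[0]),
    )[:k]
    # Classification: output the most common class among the K nearest.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage with two features per email, e.g. [dollar_score, numeric_score].
X = [[5.0, 2.0], [4.5, 1.5], [0.2, 0.0], [0.1, 1.0]]
y = ["spam", "spam", "ham", "ham"]
X_scaled = rescale(X)
print(knn_classify(X_scaled, y, [0.9, 0.7], k=3))  # -> "spam"
```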
