End-of-year project for an introductory course in artificial intelligence.
- Preparing Data
- Need to create the table.
- Naïve Bayes pre-processing (smoothing; see the sketch after this outline).
- k-NN pre-processing.
- Load only significant words into memory to conserve memory.
- Remove stopwords.
- Convert case.
- Score numeric tokens.
- Score $ mentions.
- ????
- Classify Emails
- Stats
- Classify other types of emails, such as phishing or critical emails.
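
A minimal sketch of the preparation steps above, assuming a bag-of-words count table and add-one (Laplace) smoothing for Naïve Bayes; `STOPWORDS`, `_DOLLAR_`, and `_NUMBER_` are illustrative placeholders, not fixed design decisions:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is"}  # placeholder list

def tokenize(email_text):
    """Convert case, drop stopwords, and map numbers and '$' mentions to marker tokens."""
    tokens = []
    for raw in email_text.lower().split():
        word = raw.strip(".,!?;:()\"'")
        if not word or word in STOPWORDS:
            continue
        if "$" in word:
            tokens.append("_DOLLAR_")   # hypothetical marker for $ mentions
        elif word.isdigit():
            tokens.append("_NUMBER_")   # hypothetical marker for numeric tokens
        else:
            tokens.append(word)
    return tokens

def train_counts(emails, labels):
    """Build the per-class word-count table for Naïve Bayes."""
    counts = {label: Counter() for label in set(labels)}
    for text, label in zip(emails, labels):
        counts[label].update(tokenize(text))
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def word_prob(word, label, counts, vocab):
    """P(word | label) with add-one (Laplace) smoothing: unseen words never get zero."""
    total = sum(counts[label].values())
    return (counts[label][word] + 1) / (total + len(vocab))
```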
-
Jason Brownlee. K-Nearest Neighbors for Machine Learning. Accessed on 2019-11-16 at
https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
- Entire training dataset is stored.
- Predictions made by searching entire dataset for K closest instances.
- A distance measure is used.
- Euclidean distance (see the sketch after this list):
  distance(x, xi) = sqrt( sum_j (xj - xij)^2 )
- Hamming: pending
- Manhattan: pending
- Minkowski: pending
- Tanimoto: pending
- Jaccard: pending
- Mahalanobis: pending
- Cosine: pending
- Tuning of K.
- A.K.A.: Instance-Based learning, Lazy Learning, Non-Parametric.
- Can be used for regression and classification problems.
- Regression: Prediction based on mean or median of K-most similar instances.
- Classification: output is the most common class among the K most similar instances (majority vote).
- Works well with a small number of input variables; more input variables grow the input space exponentially (the general "Curse of Dimensionality" problem).
- Preparing Data:
- Rescale data (e.g. min-max normalization; sketched after this list).
- Lower Dimensionality
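
As referenced above, a minimal k-NN sketch assuming plain Python lists of numeric feature vectors; `knn_predict` and its `mode` flag are illustrative names, not from the sources:

```python
import math
from collections import Counter

def euclidean(x, xi):
    """distance(x, xi) = sqrt( sum_j (xj - xij)^2 )"""
    return math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))

def knn_predict(train_X, train_y, query, k, mode="classification"):
    """The whole training set is stored; prediction searches it for the K closest instances."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: euclidean(query, pair[0]))[:k]
    values = [y for _, y in neighbors]
    if mode == "regression":
        return sum(values) / k                   # mean of the K most similar instances
    return Counter(values).most_common(1)[0][0]  # majority class vote

# e.g. knn_predict([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]], ["ham", "ham", "spam"], [1.5, 2.5], k=2) -> "ham"
```

K itself is tuned empirically, e.g. by trying several values against a held-out set.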
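
And a one-function rescaling sketch, assuming min-max normalization to [0, 1] per feature column so no single large-valued feature dominates the distance:

```python
def min_max_rescale(column):
    """Rescale one feature column to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]
```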
-
Hardik Jaroli. K-Nearest Neighbors (KNN) with Python | DataScience+. Accessed on 2019-11-16 at
https://datascienceplus.com/k-nearest-neighbors-knn-with-python/
- Pros: simplicity, works with any number of classes, easy to add more data.
- Cons: prediction cost, poor performance on high-dimensional data, categorical features don't work well.
-
Onel Harrison. Machine Learning Basics with the K-Nearest Neighbors Algorithm. Accessed on 2019-11-16 at
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
- Pros: simplicity, no model to build, no parameter tuning, versatility (classification, regression, search).
- Cons: slow; every prediction searches the stored data.
-
Sumit Dua. Text Classification using K Nearest Neighbors - Towards Data Science. Accessed on 2019-11-16 at
https://towardsdatascience.com/text-classification-using-k-nearest-neighbors-46fa8a77acc5
-
Slides from Catherine
- CSI 4107
- Machine Learning for IR:
- Use standard vector-space inverted-index methods to find the k nearest neighbors (see the sketch after this list).
- No feature selection required.
- No training required.
- Scales well with large number of classes (no need for n classifiers for n classes).
- Classes can influence each other.
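
A rough sketch of this approach, assuming raw term-frequency weights and an unnormalized dot-product score in place of a full TF-IDF vector-space model; all names here are illustrative:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index

def knn_classify(index, labels, query_tokens, k):
    """Score only documents sharing at least one term with the query,
    then majority-vote over the top-K scored documents."""
    scores = Counter()
    for term, q_tf in Counter(query_tokens).items():
        for doc_id, d_tf in index.get(term, {}).items():
            scores[doc_id] += q_tf * d_tf  # unnormalized dot-product score
    top_k = [doc_id for doc_id, _ in scores.most_common(k)]
    return Counter(labels[d] for d in top_k).most_common(1)[0][0]
```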