Machine Learning & Data Science Natural Language Processing Feature Extraction
In natural language processing, feature extraction refers to the process of transforming raw text data into a format that can be understood and analysed by machine learning algorithms.
A vocabulary vector is a very simple way to represent a word or phrase as a vector. The vector has one element per word in your vocabulary, and you encode a phrase by setting the element for each word it contains to 1 while leaving every other element at 0.
As your vocabulary grows, the vector becomes increasingly sparse (a larger proportion of its elements are zeros), which can hurt the efficiency and performance of various natural language processing tasks.
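The encoding and its sparsity can be illustrated with a minimal sketch. The helper names (`build_vocabulary`, `vocabulary_vector`, `sparsity`), the toy corpus, and the lowercase whitespace tokenisation are all assumptions made for this example, not part of any particular library.

```python
def build_vocabulary(phrases):
    """Map each distinct lowercase word to a fixed index, in first-seen order."""
    vocab = {}
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def vocabulary_vector(phrase, vocab):
    """Return a binary vector with a 1 at the index of every known word in the phrase."""
    vector = [0] * len(vocab)
    for word in phrase.lower().split():
        index = vocab.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] = 1
    return vector


def sparsity(vector):
    """Fraction of the vector's elements that are zero."""
    return vector.count(0) / len(vector)


# Hypothetical toy corpus, used only to build a small vocabulary.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing at dawn",
]
vocab = build_vocabulary(corpus)
vec = vocabulary_vector("the cat sat", vocab)

print(len(vocab))     # number of distinct words in the corpus
print(vec)            # binary vocabulary vector for the phrase
print(sparsity(vec))  # share of zeros in that vector
```

Each new document added to the corpus can only lengthen the vocabulary, while a short phrase still sets only a handful of elements to 1, so the share of zeros rises as the vocabulary grows.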
If you have a set of example phrases split into different classes, you can generate a frequency vector for each class by incrementing a word's count every time that word appears in a phrase belonging to that class. This results in