Ensure tokenisation matches with or without punctuation (Fixes Issue #115) #134
Description
Addressed issue #115, where punctuation and the tokens attached to it were being removed regardless of whether the user specified that they should be.
Changes Made
Created a function that checks whether a token produced by string.split() is in the vocab dictionary, using the same regular expression that was used to isolate tokens when the dictionary was created, or whether it consists entirely of punctuation. Then rewrote the step where docs filter out words that are not in the vocab dictionary so that it keeps a token if it is a vocab word or pure punctuation.
Created a new test to make sure unintended words are not removed during preprocessing.
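For reviewers, the check described above could look roughly like the following. This is only a sketch and not the code in this PR: `TOKEN_PATTERN`, `is_vocab_or_punct`, and `filter_doc` are hypothetical names, the regular expression is an assumed stand-in for whichever pattern the vocabulary was actually built with, and `string.punctuation` only covers ASCII punctuation.

```python
import re
import string

# Assumption: the vocabulary was built with a word-level pattern like this one.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def is_vocab_or_punct(token, vocab):
    """Return True if a whitespace-split token should survive the vocab filter."""
    # Pure punctuation (for example "-" or "...") is kept as-is.
    if token and all(ch in string.punctuation for ch in token):
        return True
    # Otherwise isolate the word parts the same way the vocabulary was built,
    # so that "hello," is matched against the vocab entry "hello".
    parts = TOKEN_PATTERN.findall(token.lower())
    return bool(parts) and all(part in vocab for part in parts)


def filter_doc(doc, vocab):
    """Drop out-of-vocab tokens while keeping attached and lone punctuation."""
    return " ".join(tok for tok in doc.split() if is_vocab_or_punct(tok, vocab))
```

With a vocabulary of `{"hello", "world"}`, calling `filter_doc("Hello, world - foo !", vocab)` keeps `Hello,`, `world`, `-`, and `!`, and drops only `foo`.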
Rationale
I assumed that the reason for keeping only the tokens that were in the vocab set was to filter out unwanted words. Since there are situations in which it is appropriate to preserve lone punctuation, such as [word1 - word2], the new version preserves punctuation as well as vocab words. If the user specifies that punctuation should be removed, that removal happens before this step, so this change has no effect on that case.
To make sure similar removals do not happen again, I added the new test.
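A test along these lines would cover both cases from the rationale. It is a self-contained sketch under stated assumptions: `preprocess` and its `remove_punctuation` flag are hypothetical stand-ins for the real preprocessing entry point, and the `WORD` pattern is assumed rather than taken from the project.

```python
import re
import string

WORD = re.compile(r"\w+")  # assumed stand-in for the project's token pattern


def preprocess(doc, vocab, remove_punctuation=False):
    """Hypothetical stand-in for the preprocessing step under test."""
    if remove_punctuation:
        # Removal runs first, so the vocab filter below never sees punctuation.
        doc = doc.translate(str.maketrans("", "", string.punctuation))
    kept = []
    for tok in doc.split():
        if all(ch in string.punctuation for ch in tok):
            kept.append(tok)  # lone punctuation is preserved
        elif all(w in vocab for w in WORD.findall(tok.lower())):
            kept.append(tok)  # vocab word, possibly with punctuation attached
    return " ".join(kept)


def test_punctuation_kept_by_default():
    vocab = {"word1", "word2"}
    assert preprocess("word1 - word2", vocab) == "word1 - word2"
    assert preprocess("word1, word2!", vocab) == "word1, word2!"
    # Out-of-vocab words are still filtered out.
    assert preprocess("word1 junk, word2", vocab) == "word1 word2"


def test_punctuation_removed_when_requested():
    vocab = {"word1", "word2"}
    assert preprocess("word1, word2!", vocab, remove_punctuation=True) == "word1 word2"
    assert preprocess("word1 - word2", vocab, remove_punctuation=True) == "word1 word2"
```

Both functions follow pytest's `test_` naming convention, so they would be collected automatically if dropped into a test module.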