February71st

Description

Addressed issue #115, where punctuation and the tokens attached to it were being removed regardless of whether the user had asked for punctuation to be removed.

Changes Made

Created a function that checks whether a token produced by string.split() should be kept: either it is in the vocab dictionary (checked with the same regular expression that isolates tokens when the dictionary is created), or it consists entirely of punctuation. Then rewrote the step where docs filter out words that are not in the vocab dictionary, so that a token is kept if it is a vocab word or pure punctuation.
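A minimal sketch of the check described above, not the code from this PR. The names `TOKEN_PATTERN`, `keep_token`, and `filter_doc` are hypothetical, and the pattern `\w+` is only a stand-in for whatever regular expression the library uses when it builds the vocab dictionary.

```python
import re
import string

# Assumption: stand-in for the regular expression used to build the vocab.
TOKEN_PATTERN = re.compile(r"\w+")


def keep_token(token, vocab):
    """Return True if the token is pure punctuation or its word parts are in vocab."""
    # Lone punctuation such as "-" is preserved.
    if token and all(ch in string.punctuation for ch in token):
        return True
    # Isolate the word parts the same way the vocab was built, so that
    # "word2," is matched against the vocab entry "word2".
    words = TOKEN_PATTERN.findall(token)
    return bool(words) and all(word in vocab for word in words)


def filter_doc(doc, vocab):
    """Drop whitespace-separated tokens that are neither vocab words nor punctuation."""
    return " ".join(tok for tok in doc.split() if keep_token(tok, vocab))


vocab = {"word1", "word2"}
print(filter_doc("word1 - word2, junk", vocab))
```

With this sketch, "junk" is dropped because it is not in the vocab, while "-" and "word2," are both kept.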

Created a new test to make sure unintended words are not removed during preprocessing.

Rationale

I assumed that tokens were only kept if they were in the vocab set in order to filter out unwanted words. However, there are situations in which it is appropriate to preserve lone punctuation, such as [word1 - word2]. This is why the new version preserves punctuation as well as vocab words. If the user specifies that punctuation should be removed, that removal happens before this step, so this change has no effect on that case.
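The ordering argument can be illustrated with a small sketch. The helper names `strip_punctuation` and `is_pure_punctuation` are hypothetical, and `str.translate` is used here only as one plausible way to remove punctuation; the library's own removal step may differ.

```python
import string


def strip_punctuation(doc):
    # Delete every ASCII punctuation character from the document.
    return doc.translate(str.maketrans("", "", string.punctuation))


def is_pure_punctuation(token):
    return bool(token) and all(ch in string.punctuation for ch in token)


doc = "word1 - word2"

# Punctuation removal runs first, so the later filter never sees "-".
stripped_tokens = strip_punctuation(doc).split()
punctuation_survivors = [t for t in stripped_tokens if is_pure_punctuation(t)]

print(stripped_tokens)
print(punctuation_survivors)
```

Because no pure-punctuation token survives the earlier removal, the rule that preserves punctuation has nothing to act on in that case.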

To make sure similar removals do not happen again, I added the new test.
