Ensure tokenisation matches with or without punctuation (Fixes Issue #115) #134

February71st · 2025-04-22T17:26:16Z

Description

Addresses issue #115, where punctuation and the tokens attached to it were removed regardless of whether the user had asked for punctuation to be removed.

Changes Made

Created a function that decides whether a token produced by string.split() should be kept. A token is kept if it is in the vocab dictionary (checked with the same regular expression that isolates tokens when the dictionary is created) or if it consists entirely of punctuation. Then rewrote the step where docs filter out words that are not in the vocab dictionary, so that a token is kept if it is a vocab word or pure punctuation.

Created a new test to make sure words are not unintentionally removed during preprocessing.
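The check described above could look roughly like the sketch below. It is not the code in this PR: the names `TOKEN_PATTERN`, `keep_token` and `filter_doc` are hypothetical, and the regular expression is a stand-in for whichever pattern the project actually uses when it builds the vocab dictionary.

```python
import re
import string

# Hypothetical stand-in for the regex used to build the vocab dictionary.
TOKEN_PATTERN = re.compile(r"\b\w+\b")


def keep_token(token, vocab):
    """Return True if a whitespace-split token should survive filtering."""
    # Lone punctuation such as "-" is preserved as-is.
    if token and all(ch in string.punctuation for ch in token):
        return True
    # Isolate the word parts the same way the vocab was tokenised,
    # so "hello," is looked up as "hello".
    pieces = TOKEN_PATTERN.findall(token)
    return bool(pieces) and all(piece in vocab for piece in pieces)


def filter_doc(doc, vocab):
    """Drop out-of-vocab tokens while keeping attached and lone punctuation."""
    return " ".join(t for t in doc.split() if keep_token(t, vocab))


vocab = {"hello", "world", "word1", "word2"}
filtered = filter_doc("hello, unknown world! word1 - word2", vocab)
```

Here `unknown` is dropped because it is not in the vocab, while `hello,` and `world!` are kept because their word parts are.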

Rationale

I assumed that tokens were only kept if they were in the vocab set in order to filter out unwanted words, not to strip punctuation, since there are situations in which it is appropriate to preserve lone punctuation, such as [word1 - word2]. This is why the new version preserves punctuation as well as vocab words. If the user specifies that punctuation should be removed, that removal happens before this step, so the change has no effect on that case.

I added the new test to make sure similar removals do not happen again.
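To illustrate why the ordering makes the change safe, here is a minimal sketch of the two-step flow. The function name `preprocess`, the `remove_punctuation` flag, and the way punctuation is replaced with spaces are all assumptions made for illustration, not the project's actual API.

```python
import re
import string

# Assumption: punctuation removal replaces each punctuation mark with a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(doc, vocab, remove_punctuation):
    # Step 1 (optional): punctuation is stripped before the vocab filter runs.
    if remove_punctuation:
        doc = doc.translate(PUNCT_TABLE)
    # Step 2: keep tokens whose word parts are all in the vocab,
    # plus tokens that contain no word characters at all.
    kept = []
    for token in doc.split():
        words = re.findall(r"\w+", token)
        if not words or all(w in vocab for w in words):
            kept.append(token)
    return " ".join(kept)


vocab = {"word1", "word2"}
with_removal = preprocess("word1 - word2, other", vocab, remove_punctuation=True)
without_removal = preprocess("word1 - word2, other", vocab, remove_punctuation=False)
```

When the flag is set, no punctuation reaches the filter, so the punctuation-preserving branch never triggers; when it is not set, the lone `-` and the trailing comma survive.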

February71st added 5 commits April 22, 2025 01:51

Edited preprocessing to fix an error where a mismatch in tokenisation…

953b91e

… caused punctuation and tokens attached to it would be removed regardless of settings

Cleaned up my previous code in preprocessing

2c82501

Restored num_processes=10 to multiprocess test in preprocessing, afte…

01d6e22

…r accidentally removing it while creating a test for my fix.

Edited my test for readability, added docstring.

04d7042

Removed unneccessary code from my test and fixed a typo in the docstr…

35f4d9d

…ing.
