Related problem
I am trying to remove stopwords from a tokenized corpus.
Keeping only the stopwords and dropping every other word is easy:
let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-in $stopwords)
let result = ($corpus | filter-with $mask)
But I need the opposite: to drop the stopwords and keep all the other words.
Describe the solution you'd like
The elegant solution would be a new command called is-not-in.
(I think this is also called an anti-join in other systems.)
An example:
let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-not-in $stopwords) # <-- requested feature
let tidy = ($corpus | filter-with $mask)
Then $tidy would contain the words in $corpus minus the words in $stopwords.
Describe alternatives you've considered
I tried to negate the mask so that it selects false instead of true, which would also work, but I found no way to negate a boolean inside filter-with.
EDIT:
let tidy = ($corpus | filter-with ($mask | df-not))
can be used, so is-not-in is more of a "nice to have", I guess.
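For comparison, the same filtering can be sketched without dataframes, using plain Nushell lists and the not-in operator. This is only a sketch under assumptions: the word lists below are inline sample data standing in for stopwords.txt and corpus.txt, and I have not compared its speed against the dataframe version on a large corpus.

```nu
# Hypothetical sample data; in practice these would come from
# (open stopwords.txt | lines) and (open corpus.txt | split words).
let stopwords = [the a of]
let corpus = ("the cat sat on a mat" | split words)

# Keep only the words that do not appear in the stopword list.
let tidy = ($corpus | where $it not-in $stopwords)

# Show the filtered words.
$tidy
```

This does not replace the requested is-not-in for dataframes; it only shows the intended result (the corpus minus the stopwords) on a small list.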
Additional context and details
No response