Conversation

@agtsai-i

preprocess.tokenize() pads texts with -2 (the SKIP index), which puts SKIP into the corpus vocabulary and into counts_loose.

_loose_keys_ordered() then prepends the special tokens (OOV and SKIP) while building keys_loose, thus allocating two array entries to SKIP (instead of one, as presumably intended).
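The double entry is easy to see with a toy example. The sketch below is illustrative only: the token ids and the hand-rolled counting stand in for the Corpus internals and are not lda2vec's actual API.

```python
import numpy as np

# Illustrative constants mirroring lda2vec's specials; names are assumptions.
OOV, SKIP = -1, -2

# tokenize() pads every document out to max_length with SKIP (-2).
tokens = np.array([[10, 11, SKIP, SKIP],
                   [12, SKIP, SKIP, SKIP]])

# Counting the loose tokens therefore includes the pad value, so SKIP
# ends up in counts_loose as if it were an ordinary word.
ids, counts = np.unique(tokens, return_counts=True)
counts_loose = dict(zip(ids.tolist(), counts.tolist()))
print(counts_loose)   # {-2: 5, 10: 1, 11: 1, 12: 1}

# Prepending the specials in front of the (already SKIP-containing) keys,
# as _loose_keys_ordered() does, then gives SKIP two slots.
keys_loose = sorted(counts_loose, key=counts_loose.get, reverse=True)
keys = [SKIP, OOV] + keys_loose
print(keys)           # [-2, -1, -2, 10, 11, 12]  <- SKIP appears twice
```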

This becomes a problem when you try to train a model using all of the words in the vocabulary. For example, in lda2vec_run.py:

model.sampler.W.data[:, :] = vectors[:n_vocab, :]

W is created with one more row than the number of unique words plus specials, because n_keys is derived from the length of the concatenated array built in _loose_keys_ordered(), not from the number of unique words in the vocabulary tracked by counts_loose.
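A minimal numeric sketch of the resulting off-by-one (the sizes are made up, but the shape mismatch is the same one the assignment above runs into):

```python
import numpy as np

# Made-up sizes for illustration. With SKIP leaked into counts_loose,
# n_keys counts it twice (once as a special, once as a "word"), so it is
# one larger than the intended vocabulary size.
n_unique = 4                           # unique loose keys, including SKIP
n_specials = 2                         # OOV and SKIP
n_keys = n_specials + n_unique         # 6 rows allocated for W
n_vocab = n_specials + (n_unique - 1)  # 5, what the pretrained vectors cover

dim = 300
W = np.zeros((n_keys, dim))             # sampler weights sized from n_keys
vectors = np.random.rand(n_vocab, dim)  # one pretrained vector per word

# The assignment from lda2vec_run.py then fails, because W has one extra row.
try:
    W[:, :] = vectors[:n_vocab, :]
except ValueError as err:
    print(err)   # cannot broadcast shape (5, 300) into shape (6, 300)
```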
