Hi,
I have a problem re-generating the SGNS embeddings on the Google Ngram corpus.
I followed these steps (also sketched as a small driver script after the list):
1. Use histwords/googlengram/pullscripts/posgrab.py to generate counts for 1-grams.
2. Use histwords/googlengram/pullscripts/downloadandsplit.py, then histwords/googlengram/pullscripts/gramgrab.py (with the context window set to 4).
3. Use histwords/googlengram/pullscripts/runmerge.py on the output of step 2, then histwords/googlengram/pullscripts/indexmerge.py.
4. Use histwords/googlengram/freqperyear.py on the output of step 3.
5. Use histwords/googlengram/makedecades.py on the output of step 3.
6. Use histwords/sgns/makecorpus.py, passing the outputs of steps 1, 4, and 5.
7. Train the embeddings with histwords/sgns/runword2vec.py (using the --sequential option).
8. Use histwords/sgns/postprocessingsgns.py on the trained vectors.
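
In case the ordering is easier to read as code, here is a minimal driver sketch of the pipeline above. The argument lists are placeholders only (I am leaving out the real command-line options for each script); the point is just the order of the stages.

```python
import subprocess

# Sketch of the stage order I used, driven from Python.
# NOTE: the argument lists below are placeholders -- each script takes
# its own options, which are omitted here; only the ordering matters.
steps = [
    ["python", "histwords/googlengram/pullscripts/posgrab.py"],          # step 1: 1-gram counts
    ["python", "histwords/googlengram/pullscripts/downloadandsplit.py"], # step 2
    ["python", "histwords/googlengram/pullscripts/gramgrab.py"],         #   (context window = 4)
    ["python", "histwords/googlengram/pullscripts/runmerge.py"],         # step 3: merge step 2 output
    ["python", "histwords/googlengram/pullscripts/indexmerge.py"],       #   then index the merge
    ["python", "histwords/googlengram/freqperyear.py"],                  # step 4: on step 3 output
    ["python", "histwords/googlengram/makedecades.py"],                  # step 5: on step 3 output
    ["python", "histwords/sgns/makecorpus.py"],                          # step 6: outputs of 1, 4, 5
    ["python", "histwords/sgns/runword2vec.py", "--sequential"],         # step 7: train embeddings
    ["python", "histwords/sgns/postprocessingsgns.py"],                  # step 8: post-process vectors
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop as soon as a stage fails
```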
My problem is that the vectors I generate are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50,000, while yours is about 100,000.
So my question is: is there something wrong in the steps I followed, or can you give me any information on why this happens?
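
For reference, this is roughly how I am comparing the two vocabularies. The file names are my assumption about the layout of the released archive (per-decade vocab pickles), so they may need adjusting:

```python
import pickle

# Assumed layout: one pickled word list per decade (e.g. 1990-vocab.pkl),
# both in my regenerated output and in the downloaded eng-all_sgns archive.
with open("my_sgns_output/1990-vocab.pkl", "rb") as f:
    mine = set(pickle.load(f))
with open("eng-all_sgns/1990-vocab.pkl", "rb") as f:
    pretrained = set(pickle.load(f))

print("mine:", len(mine))                      # ~50,000 in my run
print("pretrained:", len(pretrained))          # ~100,000 in the released vectors
print("overlap:", len(mine & pretrained))      # how many words the two share
print("sample missing:", sorted(pretrained - mine)[:20])  # words absent from my run
```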
Thanks,
chengjun