Hi,
I have a problem re-generating the SGNS embeddings on the Google Ngram corpus.
I followed these steps (also sketched as a small driver script after the list):
1. Use histwords/googlengram/pullscripts/posgrab.py to generate counts for 1-grams.
2. Use histwords/googlengram/pullscripts/downloadandsplit.py, then histwords/googlengram/pullscripts/gramgrab.py (with the context window set to 4).
3. Use histwords/googlengram/pullscripts/runmerge.py on the output of step 2, then histwords/googlengram/pullscripts/indexmerge.py.
4. Use histwords/googlengram/freqperyear.py on the output of step 3.
5. Use histwords/googlengram/makedecades.py on the output of step 3.
6. Use histwords/sgns/makecorpus.py, passing the outputs of steps 1, 4, and 5.
7. Train the embeddings with histwords/sgns/runword2vec.py (using the --sequential option).
8. Use histwords/sgns/postprocessingsgns.py on the trained vectors.
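
In case the ordering is easier to read as code, here is a minimal driver sketch of the pipeline above. The argument lists are placeholders only (I am leaving out the real command-line options for each script); the point is just the order of the stages.

```python
import subprocess

# Sketch of the stage order I used, driven from Python.
# NOTE: the argument lists below are placeholders -- each script takes
# its own options, which are omitted here; only the ordering matters.
steps = [
    ["python", "histwords/googlengram/pullscripts/posgrab.py"],          # step 1: 1-gram counts
    ["python", "histwords/googlengram/pullscripts/downloadandsplit.py"], # step 2
    ["python", "histwords/googlengram/pullscripts/gramgrab.py"],         #   (context window = 4)
    ["python", "histwords/googlengram/pullscripts/runmerge.py"],         # step 3: merge step 2 output
    ["python", "histwords/googlengram/pullscripts/indexmerge.py"],       #   then index the merge
    ["python", "histwords/googlengram/freqperyear.py"],                  # step 4: on step 3 output
    ["python", "histwords/googlengram/makedecades.py"],                  # step 5: on step 3 output
    ["python", "histwords/sgns/makecorpus.py"],                          # step 6: outputs of 1, 4, 5
    ["python", "histwords/sgns/runword2vec.py", "--sequential"],         # step 7: train embeddings
    ["python", "histwords/sgns/postprocessingsgns.py"],                  # step 8: post-process vectors
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop as soon as a stage fails
```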
My problem is that the vectors I generate are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50,000, while yours is about 100,000.
So my question is: is there something wrong in the steps I followed, or can you give me any information on why this happens?
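
For reference, this is roughly how I am comparing the two vocabularies. The file names are my assumption about the layout of the released archive (per-decade vocab pickles), so they may need adjusting:

```python
import pickle

# Assumed layout: one pickled word list per decade (e.g. 1990-vocab.pkl),
# both in my regenerated output and in the downloaded eng-all_sgns archive.
with open("my_sgns_output/1990-vocab.pkl", "rb") as f:
    mine = set(pickle.load(f))
with open("eng-all_sgns/1990-vocab.pkl", "rb") as f:
    pretrained = set(pickle.load(f))

print("mine:", len(mine))                      # ~50,000 in my run
print("pretrained:", len(pretrained))          # ~100,000 in the released vectors
print("overlap:", len(mine & pretrained))      # how many words the two share
print("sample missing:", sorted(pretrained - mine)[:20])  # words absent from my run
```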
Thanks,
chengjun