
SGNS results #7

@Alaa-Ebshihy

Hi,

I have a problem re-generating the SGNS embeddings on the Google Ngram corpus.

I followed these steps:

  1. use histwords/googlengram/pullscripts/posgrab.py to generate counts for the 1-grams
  2. use histwords/googlengram/pullscripts/downloadandsplit.py and then histwords/googlengram/pullscripts/gramgrab.py (with the context set to 4)
  3. use histwords/googlengram/pullscripts/runmerge.py on the output of 2, and then histwords/googlengram/pullscripts/indexmerge.py
  4. use histwords/googlengram/freqperyear.py on the output of 3
  5. use histwords/googlengram/makedecades.py on the output of 3
  6. use histwords/sgns/makecorpus.py, passing it the output of 1, 4, and 5
  7. train embeddings using histwords/sgns/runword2vec.py (with the --sequential option)
  8. use histwords/sgns/postprocessingsgns.py on the trained data (I inspect the result with the sanity-check sketch after this list).
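To see what I actually get out of step 8, I check the per-decade vocabulary and embedding matrix sizes with a small script like the one below. This is only a sketch: it assumes my postprocessed output uses the same per-decade layout as the pre-trained download (`<decade>-w.npy` for the matrix and `<decade>-vocab.pkl` for the word list), and the output directory name is just a placeholder.

```python
import pickle
import numpy as np

OUT_DIR = "output/sgns"          # placeholder path to my postprocessed vectors
DECADES = range(1800, 2000, 10)  # the decades I trained on

for decade in DECADES:
    # assumed file layout, mirroring the pre-trained download
    vecs = np.load("%s/%d-w.npy" % (OUT_DIR, decade))
    with open("%s/%d-vocab.pkl" % (OUT_DIR, decade), "rb") as f:
        vocab = pickle.load(f)
    print("%d: vocab size = %d, matrix shape = %s" % (decade, len(vocab), vecs.shape))
```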

My problem is that the generated vectors are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50000, while yours is about 100000.
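For reference, this is roughly how I compare my vocabulary with the pre-trained one for a single decade (again only a sketch; the file names assume the layout of the unpacked eng-all_sgns.zip and a placeholder path for my own output):

```python
import pickle

def load_vocab(path):
    # vocab files are assumed to be pickled lists of words
    with open(path, "rb") as f:
        return set(pickle.load(f))

mine = load_vocab("output/sgns/1990-vocab.pkl")         # my postprocessed output (placeholder path)
pretrained = load_vocab("eng-all_sgns/1990-vocab.pkl")  # from the unpacked eng-all_sgns.zip

print("mine: %d, pre-trained: %d, overlap: %d"
      % (len(mine), len(pretrained), len(mine & pretrained)))
print("sample of words only in the pre-trained vocab: %s"
      % sorted(pretrained - mine)[:20])
```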

So, my question is: is there anything wrong in the steps I followed? Or can you give me any information about why this happens?

Thanks,
