[id2vec] Run id2vec on Public Git Archive #17

Open
vmarkovtsev opened this issue Feb 14, 2018 · 33 comments

@vmarkovtsev (Collaborator)

Run id2vec on the PGA dataset and produce the model. Publish the model with modelforge. Fix all the bugs found along the way.

This includes updating the Dockerfile for Python 3 and fixing infra issues.

@eiso (Member) commented Feb 22, 2018

This should definitely have a blog post associated with it as well /cc @campoy

@campoy commented Mar 2, 2018

I thought id2vec had been renamed?
Any chance we could do this on TPUs on GCP?
That would easily become a blog post on cloud.google.com/blog.

@vmarkovtsev (Collaborator, Author)

id2vec is a cool name and straight to the point... did you mean ast2vec instead?

We can run it on TPUs since it is TensorFlow under the hood. @zurk do you wish to play?

@zurk commented Mar 5, 2018

@vmarkovtsev do we have TPUs somewhere? I did not know about that. Yes, I definitely want to try.

@eiso (Member) commented Mar 5, 2018

@zurk we have access to them via Google Cloud, in the srcd-playground project. Be careful not to run up a huge bill using them.

@vmarkovtsev (Collaborator, Author)

Rephrasing Eiso: train it on science-3, measure the time, and then try TPUs.

@zurk commented Mar 5, 2018

OK, I need to collect our co-occurrence matrix first.

@zurk commented Mar 21, 2018

Since the whole ML team is having a lot of problems with the Engine, I am pausing my attempts to process the full PGA dataset.
The next steps are:

  1. Use the siva files from PGA with size less than 50 MB.
  2. Work on the next issue about a toy problem: https://github.com/src-d/backlog/issues/1248

@zurk commented Apr 7, 2018

Right now the engine is much better and I am able to process something, thanks to fixes from the DR team, @r0mainK's performance hacks, and @smola's help.

I also classified the siva files into fast and slow ones:
I ran a program which measures how long it takes to process a simple command via the engine on each individual siva file. The command is: engine.repositories.references.head_ref.commits.tree_entries.count(). If this command takes too much time to finish, it is a slow siva file. I want to exclude such files for now to speed up the ML experiments. Here you can find the results of my experiment:
https://gist.github.com/zurk/44c87ef6b31dff6e56a198ebd27f48e4

and here is the list of fast siva files (approximately half of the PGA dataset -- a good starting point):
fast_sivas.txt.zip
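
For reference, the timing harness was roughly like the following; a minimal sketch assuming the Python API of the source{d} engine (`Engine(spark, path, "siva")`), with the threshold and the file path being made-up placeholders:

```python
import time

from pyspark.sql import SparkSession
from sourced.engine import Engine

spark = SparkSession.builder.master("local[*]").appName("siva-timing").getOrCreate()

SLOW_THRESHOLD = 60.0  # seconds; an assumed cutoff, not the one used in the gist

def time_siva(siva_path):
    """Time the benchmark command from above on a single siva file."""
    engine = Engine(spark, siva_path, "siva")
    start = time.time()
    engine.repositories.references.head_ref.commits.tree_entries.count()
    return time.time() - start

fast, slow = [], []
for path in ["pga/siva/00/example.siva"]:  # hypothetical path
    elapsed = time_siva(path)
    (fast if elapsed < SLOW_THRESHOLD else slow).append((path, elapsed))
```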

My next steps:

  1. Extract the co-occurrence matrix via repo2coocc on the fast siva files.
  2. Calculate embeddings for this matrix.

@zurk commented May 23, 2018

Current status:

  1. I decided to use the Apollo preprocessing to speed up the co-occurrence matrix collection. So, I asked @r0mainK to run it on a cluster, since it is related to his current task: https://github.com/src-d/backlog/issues/1196#issuecomment-388740509 By now almost all PGA subdirectories have been preprocessed.
  2. I was able to convert the parquet files from the 00 subdirectory into a coocc model for the Python, Java, and Go languages all together. 🎉 Better than nothing.
  3. However, now I need to write code to merge the Document Frequency models and the Cooccurrence models. As discussed with @vmarkovtsev, I will reuse the old code for the Document Frequency model because it is small enough to be processed on one PC, and write new code for the Cooccurrence model using Spark (the sketch below shows the shape of that merge).

@vmarkovtsev (Collaborator, Author)

@zurk Do we really have the preprocessing in Apollo instead of sourced-ml? If so, could you please move this preprocessing to sourced-ml?

@vmarkovtsev (Collaborator, Author)

I must also note for the future reader that if we had not had problems with Spark, we would not have written the code to merge DFs and Cooccs.

@zurk commented May 23, 2018

Yeah, I have also been thinking about that. Definitely, it is time to move it. We can use the siva -> parquet approach everywhere. Can you please create another issue for this task?

@zurk commented May 23, 2018

PR for the df part: src-d/ml#252

@zurk commented May 24, 2018

Second PR, for the coocc part: src-d/ml#254
@vmarkovtsev please review when you have time.

@zurk commented Jun 9, 2018

It has been a long time since the last update. I will try to report more frequently.

  1. Around 60 of the 255 PGA subdirectories have been processed by repo2coocc, which means 60 co-occurrence matrices and document frequency models needed to be merged.
  2. They were merged with the code in the PRs mentioned above. As a result, I have a matrix of around 700k x 700k.
  3. Running swivel was a problem for several days. The loss just hit NaN values and it was hard to understand why. Finally, the reason was found: I had an int32 overflow during the merge process (a toy reproduction is below). It was fixed, and everything works well after that.
  4. As usual, local and minor improvements to sourced-ml were made.
  5. I ran the same simple experiments @vmarkovtsev did with the legacy embeddings to compare. Here you can find the result: https://gist.github.com/zurk/df7cf66818e11271934581674128eeeb @vmarkovtsev please review when you have time. It works okayish. I do not see results as good as with the legacy embeddings, but some relations can be observed. My thoughts on why it is not as good:
    1. The embedding dimension is 300 instead of 200, so there is more room for noise/overfitting.
    2. There is no filtering by document frequency, so there is more noise.
    3. Once we use our neural splitter, it should give us much better sub-identifiers by itself.

Right now I continue to process PGA via repo2coocc.
I plan to prune the current data using document frequency, train 200-dim embeddings, and continue my experiments.
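
Regarding the int32 overflow in item 3, here is a toy reproduction (an illustration only, not the actual merge code):

```python
import numpy as np

# Two counts of 2e9 each fit in int32, but their sum does not: it wraps
# around to a negative value, which is what poisoned the swivel loss.
counts = np.full(2, 2_000_000_000, dtype=np.int32)
print(counts.sum(dtype=np.int32))  # -294967296, the wrapped value
print(counts.sum(dtype=np.int64))  # 4000000000, the correct total
```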

@vmarkovtsev (Collaborator, Author)

Very good report @zurk, thanks

@eiso (Member) commented Jun 26, 2018

From your report, it's quite obvious that the neural splitter is needed. Nice report!

@vmarkovtsev (Collaborator, Author)

@zurk status?

@zurk commented Jul 2, 2018

There was a pause in the processing because I needed the compute resources for some other tasks. Now it is unpaused.
I have processed 140 of the 210 PGA subdirectories, so it is 66% done.

After that, I need to merge the models into one and run id2vec.
I am thinking of using our cluster to speed up the coocc matrix collection process.

@zurk commented Jul 4, 2018

Two more days and I have +14 PGA subdirectories.
Today I will try to launch the extraction on a cluster.

@zurk commented Aug 9, 2018

It is done. I trained two models, with 40 epochs and with 100. They are on science-3:
/storage/konstantin/emb-0717.asdf and /storage/konstantin/emb-0717-2.asdf
However, I ran my test task on them (https://github.com/src-d/backlog/issues/1249) and found out that I get worse results than before. Now I am doing a fairer comparison.

All kinds of mistakes are possible here, so I am trying to find out what could be wrong.

@zurk commented Aug 23, 2018

I found the problem. It was a bug in the df.greatest() method. Here is the PR: src-d/ml#305
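
For context, this is what the method is supposed to compute, assuming df.greatest(n) selects the n tokens with the highest document frequency (my reading of it; see the PR for the actual fix). A toy reference implementation:

```python
import heapq

df = {"foo": 120, "bar": 3, "baz": 57, "qux": 88}  # toy document frequencies

def greatest(df, n):
    """Return the n (token, frequency) pairs with the highest frequency."""
    return heapq.nlargest(n, df.items(), key=lambda kv: (kv[1], kv[0]))

print(greatest(df, 2))  # [('foo', 120), ('qux', 88)]
```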

@vmarkovtsev (Collaborator, Author)

@zurk please update

@zurk commented Sep 6, 2018

So, last time I had the idea that something was wrong with the Document Frequency model. I decided to take an old one from here: https://github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md and build a new co-occurrence matrix only for the tokens present in both df models (my current one and the old one).
The idea failed: the results are still bad.

Next step: take a deeper look at the new cooccurrence model and compare it with the old one. I will look for the nearest neighbors in co-occurrence matrix space. It is a memory-intensive task, so I tried to avoid it before, but now I have no choice. One more hypothesis about where to look for answers is that something goes wrong when we move from 0.5 PGA to the full PGA. So I want to build one more id2vec model on a random half of PGA and see what performance I can achieve. And it is not about pruning values to 2**32-1; there are only ~200 values which were pruned.
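
The nearest-neighbor check boils down to comparing rows of the co-occurrence matrix; a minimal sketch with scipy/sklearn on a toy matrix (the layout and names are assumptions, not the real models):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy 4-token co-occurrence matrix; rows and columns are indexed by `tokens`.
tokens = ["foo", "bar", "baz", "qux"]
coocc = csr_matrix(np.array([
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
], dtype=np.float64))

def nearest(token, k=2):
    """Return the k tokens whose co-occurrence rows are most similar."""
    i = tokens.index(token)
    sims = cosine_similarity(coocc[i], coocc).ravel()
    sims[i] = -1.0  # exclude the token itself
    return [tokens[j] for j in np.argsort(-sims)[:k]]

print(nearest("foo"))
```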

@eiso (Member) commented Sep 28, 2018

Did we ever figure this out?

@vmarkovtsev (Collaborator, Author)

@eiso No, Konstantin stopped working on this several weeks ago since we all had to work on the style-analyzer to meet the deadline.

@zurk commented Sep 28, 2018

Yes, that is right.
Next step: subtract the old matrix from the new one and look for anomalies in the diff (sketched below).
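
A quick sketch of that diff, assuming both co-occurrence matrices are scipy CSR matrices over the same token index (the alignment code is omitted):

```python
import numpy as np
from scipy.sparse import csr_matrix

old = csr_matrix(np.array([[0, 5], [5, 0]], dtype=np.int64))
new = csr_matrix(np.array([[0, 9], [5, 0]], dtype=np.int64))

delta = new - old
delta.eliminate_zeros()  # drop entries that did not change
diff = delta.tocoo()

# Rank entries by absolute change to surface the anomalies first.
order = np.argsort(-np.abs(diff.data))
for i in order[:10]:
    print(diff.row[i], diff.col[i], diff.data[i])
```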

@vmarkovtsev vmarkovtsev transferred this issue from another repository Jan 8, 2019
@vmarkovtsev vmarkovtsev changed the title sourced.ml: run id2vec on Public Git Archive [id2vec] Run id2vec on Public Git Archive Jan 8, 2019
@vmarkovtsev (Collaborator, Author)

@r0mainK this is all yours now.

@vmarkovtsev (Collaborator, Author)

Actually... @m09 do you think you can do this after Romain engineers a stable UAST processing pipeline? The guy already has a few tasks related to UASTs...

@r0mainK commented Aug 5, 2019

@vmarkovtsev since this is basically a follow-up of the identifier extraction, I don't mind doing this as well.

@vmarkovtsev (Collaborator, Author)

ok

@r0mainK r0mainK removed their assignment Nov 11, 2019