[id2vec] Run id2vec on Public Git Archive #17
Comments
This should definitely have a blog post associated with it as well /cc @campoy |
I thought id2vec had been renamed? |
id2vec is a cool name and straight to the point... did you mean ast2vec instead? We can run it on TPUs since it is Tensorflow under the hood. @zurk do you wish to play? |
@vmarkovtsev do we have TPUs somewhere? I did not know about that. Yes, definitely I want to try. |
@zurk we have access to them via Google Cloud, in the srcd-playground project. Be careful not to run up a huge bill using them. |
Rephrasing Eiso: train it on science-3, measure the time, and then try TPUs. |
ok, need to collect our cooccurrence matrix first. |
blocked by https://github.com/src-d/engine/issues/339 |
Since the whole ML team is having a lot of problems with the Engine, I am pausing my attempts to process the full PGA dataset.
|
Right now the engine is much better and I am able to process some of the data. I also computed which siva files are fast and which are slow: and the list of fast siva files (approximately half of the PGA dataset, a good place to start): My next steps:
|
Current status:
|
@zurk Do we really have the preprocessing in Apollo instead of sourced-ml? If so, could you please move this preprocessing to sourced-ml? |
I must also note for the future reader that if we had not had problems with Spark, we would not have written the code to merge DFs and Cooccs. |
Yeah, I have been thinking about it too. It is definitely time to move it. We can use |
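For the future reader, a minimal sketch of what merging partial DFs and Cooccs amounts to. This is not the actual Apollo or sourced-ml code: the function names and the plain-dict model layout are hypothetical, and it assumes each partial model was built from a disjoint set of documents, so counts can simply be summed.

```python
from collections import Counter


def merge_docfreqs(parts):
    """Sum per-token document counts from several partial models.

    Assumes the parts cover disjoint sets of documents.
    """
    merged = Counter()
    for part in parts:
        merged.update(part)
    return merged


def merge_cooccs(parts):
    """Sum sparse co-occurrence counts keyed by (token, token) pairs.

    Each key is sorted so that (a, b) and (b, a) land in the same cell,
    which also handles parts that were built with different token orders.
    """
    merged = Counter()
    for part in parts:
        for (a, b), count in part.items():
            merged[tuple(sorted((a, b)))] += count
    return merged


# Toy partial models standing in for per-subdirectory results.
df = merge_docfreqs([{"foo": 2, "bar": 1}, {"foo": 3, "baz": 4}])
cooccs = merge_cooccs([
    {("foo", "bar"): 5},
    {("bar", "foo"): 2, ("foo", "baz"): 1},
])
```

Keying the cells by token strings rather than by integer indices avoids having to reconcile the vocabularies of the partial models before summing.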
PR for df part src-d/ml#252 |
second PR for coocc part src-d/ml#254 |
It has been a long time since the last update. I will try to report more frequently.
Right now I continue to process PGA via |
Very good report @zurk, thanks |
From your report, it's quite obvious that the neural splitter is needed. Nice report! |
@zurk status? |
There was a pause in processing because I needed the compute resources for some other tasks. Processing has now resumed. Once it finishes, I need to merge the models into one and run id2vec. |
Two more days, and I have 14 more PGA subdirectories processed. |
It is done. I trained two models, with 40 and 100 epochs. They are on science-3: There may be mistakes of any kind here, so I am trying to find out what could be wrong. |
I found the problem. It was a bug in |
@zurk please update |
So, last time I suspected that something was wrong with the document frequency model. I decided to take an old one from here: https://github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md and build a new cooccurrence matrix only for the tokens present in both df models (my current one and the old one). Next step: take a deeper look at the new cooccurrence model and compare it with the old one. I will look at the nearest neighbors in cooccurrence matrix space. It is a memory-intensive task, so I tried to avoid it before, but now I have no choice. One more hypothesis is that something goes wrong when we move from 0.5 PGA to the full PGA. So I want to build one more id2vec model on a random half of PGA and see what performance I can achieve. And it is not about pruning values to 2**32-1: only ~200 values were pruned. |
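A rough sketch of the checks described above, for anyone picking this up later. All names and the dict-based layout are hypothetical (this is not the sourced-ml API), and cosine similarity over co-occurrence rows is only one assumed way to define "nearest neighbors in cooccurrence matrix space".

```python
import math

UINT32_MAX = 2**32 - 1


def restrict_and_clip(cooccs, df_current, df_old):
    """Keep only pairs whose tokens appear in both df models.

    Counts above 2**32-1 are clipped; the number of clipped cells is returned
    so it can be compared with the ~200 figure mentioned in the thread.
    """
    shared = set(df_current) & set(df_old)
    result, clipped = {}, 0
    for (a, b), count in cooccs.items():
        if a in shared and b in shared:
            if count > UINT32_MAX:
                count = UINT32_MAX
                clipped += 1
            result[(a, b)] = count
    return result, clipped


def rows(cooccs):
    """Expand symmetric pair counts into per-token sparse rows."""
    table = {}
    for (a, b), count in cooccs.items():
        table.setdefault(a, {})[b] = count
        table.setdefault(b, {})[a] = count
    return table


def cosine(u, v):
    """Cosine similarity between two sparse rows stored as dicts."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(table, token, k=3):
    """Return the k tokens whose rows are most similar to the given token's row."""
    scores = [(cosine(table[token], row), other)
              for other, row in table.items() if other != token]
    return [other for _, other in sorted(scores, reverse=True)[:k]]


# Toy data: "tmp" is missing from the old df model, one cell overflows uint32.
cooccs = {("i", "j"): 10, ("i", "k"): 2**33, ("j", "k"): 3, ("i", "tmp"): 7}
df_current = {"i": 5, "j": 4, "k": 3, "tmp": 2}
df_old = {"i": 9, "j": 8, "k": 1}
restricted, clipped = restrict_and_clip(cooccs, df_current, df_old)
neighbors = nearest(rows(restricted), "i", k=1)
```

Running the same `nearest` query against both the old and the new matrix for a handful of well-known identifiers would give a quick sanity check before retraining. On the full PGA matrix the rows would have to be real sparse arrays rather than dicts, which is where the memory cost comes from.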
Did we ever figure this out? |
@eiso No, Konstantin stopped working on this several weeks ago since we all had to work on the style-analyzer to meet the deadline. |
Yes, that is right. |
@r0mainK this is all yours now. |
Actually... @m09 do you think you can do this after Romain engineers a stable UAST processing pipeline? The guy already has a few tasks related to UASTs... |
@vmarkovtsev since this is basically a follow-up of the identifier extraction I don't mind doing this as well |
ok |
Run id2vec on PGA dataset and produce the model. Publish the model with modelforge. Fix all the found bugs.
Includes updating the Dockerfile for Python3 and infra issues.