[id2vec] Run id2vec on Public Git Archive #17

Open
vmarkovtsev opened this issue Feb 14, 2018 · 33 comments

@vmarkovtsev (Collaborator)

Run id2vec on the PGA dataset and produce the model. Publish the model with modelforge. Fix all the bugs found along the way.

This includes updating the Dockerfile for Python 3 and fixing infra issues.

@eiso (Member) commented Feb 22, 2018

This should definitely have a blog post associated with it as well /cc @campoy

@campoy commented Mar 2, 2018

I thought id2vec had been renamed?
Any chance we could do this on TPUs on GCP?
That would easily become a blog post on cloud.google.com/blog.

@vmarkovtsev (Collaborator, Author)

id2vec is a cool name and straight to the point... did you mean ast2vec instead?

We can run it on TPUs since it is TensorFlow under the hood. @zurk do you wish to play?

@zurk commented Mar 5, 2018

@vmarkovtsev do we have TPUs somewhere? I did not know about that. Yes, I definitely want to try.

@eiso (Member) commented Mar 5, 2018

@zurk we have access to them via Google Cloud, in the srcd-playground project. Be careful not to run up a huge bill using them.

@vmarkovtsev (Collaborator, Author)

Rephrasing Eiso: train it on science-3, measure the time, and then try TPUs.

@zurk commented Mar 5, 2018

OK, I need to collect our co-occurrence matrix first.

@zurk commented Mar 21, 2018

Since the whole ML team is having a lot of problems with the Engine, I am pausing my attempts to process the full PGA dataset.
The next steps are:

  1. Use the siva files from PGA with size less than 50 MB.
  2. Work on the next issue about a toy problem: https://github.com/src-d/backlog/issues/1248

@zurk commented Apr 7, 2018

Right now the engine is much better and I am able to process something, thanks to fixes from the DR team, @r0mainK's performance hacks, and @smola's help.

I also classified the siva files into fast and slow ones:
I ran a program which measures how long it takes to process a simple command via the engine on each individual siva file. The command is: engine.repositories.references.head_ref.commits.tree_entries.count(). If this command takes too much time to finish, it is a slow siva file. I want to exclude such files for now to speed up the ML experiments. Here you can find the results of my experiment:
https://gist.github.com/zurk/44c87ef6b31dff6e56a198ebd27f48e4

and here is the list of fast siva files (approximately half of the PGA dataset -- a good starting point):
fast_sivas.txt.zip
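
For reference, the timing harness was roughly like the following; a minimal sketch assuming the Python API of the source{d} engine (`Engine(spark, path, "siva")`), with the threshold and the file path being made-up placeholders:

```python
import time

from pyspark.sql import SparkSession
from sourced.engine import Engine

spark = SparkSession.builder.master("local[*]").appName("siva-timing").getOrCreate()

SLOW_THRESHOLD = 60.0  # seconds; an assumed cutoff, not the one used in the gist

def time_siva(siva_path):
    """Time the benchmark command from above on a single siva file."""
    engine = Engine(spark, siva_path, "siva")
    start = time.time()
    engine.repositories.references.head_ref.commits.tree_entries.count()
    return time.time() - start

fast, slow = [], []
for path in ["pga/siva/00/example.siva"]:  # hypothetical path
    elapsed = time_siva(path)
    (fast if elapsed < SLOW_THRESHOLD else slow).append((path, elapsed))
```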

My next steps:

  1. Extract the co-occurrence matrix via repo2coocc on the fast siva files.
  2. Calculate embeddings for this matrix.

@zurk commented May 23, 2018

Current status:

  1. I decided to use the Apollo preprocessing to speed up the co-occurrence matrix collection. So, I asked @r0mainK to run it on a cluster, since it is related to his current task: https://github.com/src-d/backlog/issues/1196#issuecomment-388740509 By now almost all PGA subdirectories have been preprocessed.
  2. I was able to convert the parquet files from the 00 subdirectory into a coocc model for the Python, Java, and Go languages all together. 🎉 Better than nothing.
  3. However, now I need to write code to merge the Document Frequency models and the Cooccurrence models. As discussed with @vmarkovtsev, I will reuse the old code for the Document Frequency model because it is small enough to be processed on one PC, and write new code for the Cooccurrence model using Spark (the sketch below shows the shape of that merge).

@vmarkovtsev (Collaborator, Author)

@zurk Do we really have the preprocessing in Apollo instead of sourced-ml? If so, could you please move this preprocessing to sourced-ml?

@vmarkovtsev (Collaborator, Author)

I must also note for the future reader that if we had not had problems with Spark, we would not have written the code to merge DFs and Cooccs.

@zurk commented May 23, 2018

Yeah, I have also been thinking about that. Definitely, it is time to move it. We can use the siva -> parquet approach everywhere. Can you please create another issue for this task?

@zurk commented May 23, 2018

PR for the df part: src-d/ml#252

@zurk commented May 24, 2018

Second PR, for the coocc part: src-d/ml#254
@vmarkovtsev please review when you have time.

@zurk commented Jun 9, 2018

It has been a long time since the last update. I will try to report more frequently.

  1. Around 60 of the 255 PGA subdirectories have been processed by repo2coocc, which means 60 co-occurrence matrices and document frequency models needed to be merged.
  2. They were merged with the code in the PRs mentioned above. As a result, I have a matrix of around 700k x 700k.
  3. Running swivel was a problem for several days. The loss just hit NaN values and it was hard to understand why. Finally, the reason was found: I had an int32 overflow during the merge process (a toy reproduction is below). It was fixed, and everything works well after that.
  4. As usual, local and minor improvements to sourced-ml were made.
  5. I ran the same simple experiments @vmarkovtsev did with the legacy embeddings to compare. Here you can find the result: https://gist.github.com/zurk/df7cf66818e11271934581674128eeeb @vmarkovtsev please review when you have time. It works okayish. I do not see results as good as with the legacy embeddings, but some relations can be observed. My thoughts on why it is not as good:
    1. The embedding dimension is 300 instead of 200, so there is more room for noise/overfitting.
    2. There is no filtering by document frequency, so there is more noise.
    3. Once we use our neural splitter, it should give us much better sub-identifiers by itself.

Right now I continue to process PGA via repo2coocc.
I plan to prune the current data using document frequency, train 200-dim embeddings, and continue my experiments.
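
Regarding the int32 overflow in item 3, here is a toy reproduction (an illustration only, not the actual merge code):

```python
import numpy as np

# Two counts of 2e9 each fit in int32, but their sum does not: it wraps
# around to a negative value, which is what poisoned the swivel loss.
counts = np.full(2, 2_000_000_000, dtype=np.int32)
print(counts.sum(dtype=np.int32))  # -294967296, the wrapped value
print(counts.sum(dtype=np.int64))  # 4000000000, the correct total
```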

@vmarkovtsev (Collaborator, Author)

Very good report @zurk, thanks

@eiso (Member) commented Jun 26, 2018

From your report, it's quite obvious that the neural splitter is needed. Nice report!

@vmarkovtsev (Collaborator, Author)

@zurk status?

@zurk commented Jul 2, 2018

There was a pause in the processing because I needed the compute resources for some other tasks. Now it is unpaused.
I have processed 140 of the 210 PGA subdirectories, so it is 66% done.

After that, I need to merge the models into one and run id2vec.
I am thinking of using our cluster to speed up the coocc matrix collection process.

@zurk commented Jul 4, 2018

Two more days and I have +14 PGA subdirectories.
Today I will try to launch the extraction on a cluster.

@zurk commented Aug 9, 2018

It is done. I trained two models, with 40 epochs and with 100. They are on science-3:
/storage/konstantin/emb-0717.asdf and /storage/konstantin/emb-0717-2.asdf
However, I ran my test task on them (https://github.com/src-d/backlog/issues/1249) and found out that I get worse results than before. Now I am doing a fairer comparison.

All kinds of mistakes are possible here, so I am trying to find out what could be wrong.

@zurk commented Aug 23, 2018

I found the problem. It was a bug in the df.greatest() method. Here is the PR: src-d/ml#305
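
For context, this is what the method is supposed to compute, assuming df.greatest(n) selects the n tokens with the highest document frequency (my reading of it; see the PR for the actual fix). A toy reference implementation:

```python
import heapq

df = {"foo": 120, "bar": 3, "baz": 57, "qux": 88}  # toy document frequencies

def greatest(df, n):
    """Return the n (token, frequency) pairs with the highest frequency."""
    return heapq.nlargest(n, df.items(), key=lambda kv: (kv[1], kv[0]))

print(greatest(df, 2))  # [('foo', 120), ('qux', 88)]
```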

@vmarkovtsev (Collaborator, Author)

@zurk please update

@zurk commented Sep 6, 2018

So, last time I had the idea that something was wrong with the Document Frequency model. I decided to take an old one from here: https://github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md and build a new co-occurrence matrix only for the tokens present in both df models (my current one and the old one).
The idea failed: the results are still bad.

Next step: take a deeper look at the new cooccurrence model and compare it with the old one. I will look for the nearest neighbors in co-occurrence matrix space. It is a memory-intensive task, so I tried to avoid it before, but now I have no choice. One more hypothesis about where to look for answers is that something goes wrong when we move from 0.5 PGA to the full PGA. So I want to build one more id2vec model on a random half of PGA and see what performance I can achieve. And it is not about pruning values to 2**32-1; there are only ~200 values which were pruned.
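
The nearest-neighbor check boils down to comparing rows of the co-occurrence matrix; a minimal sketch with scipy/sklearn on a toy matrix (the layout and names are assumptions, not the real models):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy 4-token co-occurrence matrix; rows and columns are indexed by `tokens`.
tokens = ["foo", "bar", "baz", "qux"]
coocc = csr_matrix(np.array([
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
], dtype=np.float64))

def nearest(token, k=2):
    """Return the k tokens whose co-occurrence rows are most similar."""
    i = tokens.index(token)
    sims = cosine_similarity(coocc[i], coocc).ravel()
    sims[i] = -1.0  # exclude the token itself
    return [tokens[j] for j in np.argsort(-sims)[:k]]

print(nearest("foo"))
```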

@eiso (Member) commented Sep 28, 2018

Did we ever figure this out?

@vmarkovtsev (Collaborator, Author)

@eiso No, Konstantin stopped working on this several weeks ago since we all had to work on the style-analyzer to meet the deadline.

@zurk commented Sep 28, 2018

Yes, that is right.
Next step: subtract the old matrix from the new one and look for anomalies in the diff (sketched below).
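
A quick sketch of that diff, assuming both co-occurrence matrices are scipy CSR matrices over the same token index (the alignment code is omitted):

```python
import numpy as np
from scipy.sparse import csr_matrix

old = csr_matrix(np.array([[0, 5], [5, 0]], dtype=np.int64))
new = csr_matrix(np.array([[0, 9], [5, 0]], dtype=np.int64))

delta = new - old
delta.eliminate_zeros()  # drop entries that did not change
diff = delta.tocoo()

# Rank entries by absolute change to surface the anomalies first.
order = np.argsort(-np.abs(diff.data))
for i in order[:10]:
    print(diff.row[i], diff.col[i], diff.data[i])
```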

@vmarkovtsev vmarkovtsev transferred this issue from another repository Jan 8, 2019
@vmarkovtsev vmarkovtsev changed the title sourced.ml: run id2vec on Public Git Archive [id2vec] Run id2vec on Public Git Archive Jan 8, 2019
@vmarkovtsev (Collaborator, Author)

@r0mainK this is all yours now.

@vmarkovtsev (Collaborator, Author)

Actually... @m09 do you think you can do this after Romain engineers a stable UAST processing pipeline? The guy already has a few tasks related to UASTs...

@r0mainK commented Aug 5, 2019

@vmarkovtsev since this is basically a follow-up of the identifier extraction, I don't mind doing this as well.

@vmarkovtsev (Collaborator, Author)

ok

@r0mainK r0mainK removed their assignment Nov 11, 2019