Anyone already working on including this in transformers? #2
Comments
Although it is a fairly small one, the main experiments on algorithmic data are done with a 2-layer Transformer with 400k parameters. I'll leave this issue open so people can share their thoughts.
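For anyone wanting to wire this into a bigger training setup, the core of the method is small enough to sketch inline: keep a per-parameter EMA of the gradients (the slow, low-frequency component) and add an amplified copy of it back onto the raw gradient before the optimizer step. The helper below is my own minimal sketch of that idea, not the reference code; the name `gradfilter_ema` and the `alpha`/`lamb` defaults mirror my reading of the paper and repo and should be treated as assumptions.

```python
# Minimal sketch of a Grokfast-style EMA gradient filter (not the reference code;
# helper name and alpha/lamb defaults are assumptions based on the paper/repo).
from typing import Dict, Optional

import torch
import torch.nn as nn


def gradfilter_ema(
    model: nn.Module,
    grads: Optional[Dict[str, torch.Tensor]] = None,
    alpha: float = 0.98,  # EMA decay of the slow (low-frequency) gradient component
    lamb: float = 2.0,    # amplification factor applied to the slow component
) -> Dict[str, torch.Tensor]:
    """Amplify the slow component of each trainable parameter's gradient in place."""
    if grads is None:
        # Initialize the EMA state from the current gradients.
        grads = {
            n: p.grad.detach().clone()
            for n, p in model.named_parameters()
            if p.requires_grad and p.grad is not None
        }
    for n, p in model.named_parameters():
        if p.requires_grad and p.grad is not None and n in grads:
            grads[n].mul_(alpha).add_(p.grad.detach(), alpha=1.0 - alpha)
            p.grad.add_(grads[n], alpha=lamb)
    return grads


# Typical use in a plain loop: filter after backward, before the optimizer step.
# grads = None
# for batch in loader:
#     loss = compute_loss(model, batch)   # compute_loss is a placeholder
#     loss.backward()
#     grads = gradfilter_ema(model, grads=grads)
#     optimizer.step()
#     optimizer.zero_grad()
```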
Worked the approach into a fork of the HF Transformers Trainer.
Is there an expectation that all weights within the network are trainable, or can it be used for fine-tuning when only some layers are trainable, @ironjr? I tried to insert the code into the Trainer's inner training step as well, but I get an error about NoneType.
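A common cause of that kind of NoneType error is the filter touching parameters whose `.grad` is `None` (frozen layers during partial fine-tuning, or parameters not reached in the graph), so guarding on `p.grad is not None` and only filtering trainable parameters should make it compatible with partial fine-tuning. Below is a rough, hypothetical sketch of how the hook could look as a `Trainer` subclass, reusing the `gradfilter_ema` sketch above; the `training_step` signature varies across Transformers versions, and gradient accumulation, DeepSpeed, or FSDP would need extra care, so treat it as a starting point rather than a drop-in.

```python
# Hypothetical sketch: applying a Grokfast-style gradient filter inside the HF Trainer.
# Reuses the gradfilter_ema helper sketched earlier in this thread; frozen parameters
# (p.grad is None) are skipped there, so partial fine-tuning should not hit NoneType.
from transformers import Trainer


class GrokfastTrainer(Trainer):
    def __init__(self, *args, grok_alpha: float = 0.98, grok_lamb: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.grok_alpha = grok_alpha
        self.grok_lamb = grok_lamb
        self._grok_grads = None  # per-parameter EMA state carried across steps

    def training_step(self, model, inputs, *args, **kwargs):
        # Run the usual forward + backward, then filter the accumulated gradients so
        # the subsequent optimizer.step() sees the amplified slow component.
        loss = super().training_step(model, inputs, *args, **kwargs)
        self._grok_grads = gradfilter_ema(
            model,
            grads=self._grok_grads,
            alpha=self.grok_alpha,
            lamb=self.grok_lamb,
        )
        return loss
```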
@l4b4r4b4b4 how's the result going?
@l4b4r4b4b4 any update?
Had everything implemented in a Transformers fork. I think @ehartford took it a bit further with https://github.com/cognitivecomputations/grokadamw.
I got creative; my implementation is inspired by the paper rather than being a direct implementation of it.
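Out of curiosity about what "inspired by rather than a direct port" could look like: one option is to fold the slow-gradient amplification into the optimizer itself instead of mutating `p.grad` from the training loop. The toy wrapper below is purely my own illustration of that idea and is not GrokAdamW's actual code; the class name, hyperparameter names, and defaults are all assumptions.

```python
# Toy illustration only: an AdamW subclass that amplifies an EMA of each parameter's
# gradient before the normal update. NOT GrokAdamW's implementation; names and
# defaults are assumptions made for the sake of the example.
import torch


class SlowGradAdamW(torch.optim.AdamW):
    def __init__(self, params, *, grok_alpha: float = 0.98, grok_lamb: float = 2.0, **kwargs):
        super().__init__(params, **kwargs)
        self.grok_alpha = grok_alpha
        self.grok_lamb = grok_lamb
        self._grok_ema = {}  # EMA state kept separate from AdamW's own state dict

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue  # skip frozen / unused parameters
                ema = self._grok_ema.get(p)
                if ema is None:
                    ema = p.grad.detach().clone()
                else:
                    ema.mul_(self.grok_alpha).add_(p.grad, alpha=1.0 - self.grok_alpha)
                self._grok_ema[p] = ema
                p.grad.add_(ema, alpha=self.grok_lamb)  # amplify the slow component
        return super().step(closure)
```

Folding it into the optimizer keeps the training loop untouched, at the cost of tying the filter to one particular optimizer implementation.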
Has it been tried on larger models to assess the training time reduction? I am planning to deploy Llama 3.1 70B Instruct for machine translation and am wondering whether fine-tuning it could benefit.
I am looking at the graphic, and the two loss lines seem to track each other; there is no divergence. The absolute difference is likely just an artefact of how the two are initialized.
On Tue, Aug 20, 2024 at 8:41 AM, Eric Hartford wrote:
Mine so far: I'm seeing only marginal divergence vs adamw-fused. An improvement, but not an obvious slam dunk. Maybe I need to improve the default grokking functions.
[image-8.png: https://github.com/user-attachments/assets/397c77a3-09a2-45a1-a610-2b9c2f88cfb0]
Are there open-data models that could be used to test the efficiency of GrokFast? Only GPT-NeoX and OpenLLaMA come to mind at the moment: https://github.com/EleutherAI/gpt-neox and https://github.com/openlm-research/open_llama
I'll try my best, but I thought to check whether anyone else wants to try this in the context of the Transformers Trainer.