Anyone already working on including this in transformers? #2
Comments
Although it is a fairly small one, the main experiments on algorithmic data are done with a 2-layer Transformer with 400k parameters. I'll leave this issue open so people can share their thoughts.
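For anyone wanting to wire this into a bigger training setup, the core of the method is small enough to sketch inline: keep a per-parameter EMA of the gradients (the slow, low-frequency component) and add an amplified copy of it back onto the raw gradient before the optimizer step. The helper below is my own minimal sketch of that idea, not the reference code; the name `gradfilter_ema` and the `alpha`/`lamb` defaults mirror my reading of the paper and repo and should be treated as assumptions.

```python
# Minimal sketch of a Grokfast-style EMA gradient filter (not the reference code;
# helper name and alpha/lamb defaults are assumptions based on the paper/repo).
from typing import Dict, Optional

import torch
import torch.nn as nn


def gradfilter_ema(
    model: nn.Module,
    grads: Optional[Dict[str, torch.Tensor]] = None,
    alpha: float = 0.98,  # EMA decay of the slow (low-frequency) gradient component
    lamb: float = 2.0,    # amplification factor applied to the slow component
) -> Dict[str, torch.Tensor]:
    """Amplify the slow component of each trainable parameter's gradient in place."""
    if grads is None:
        # Initialize the EMA state from the current gradients.
        grads = {
            n: p.grad.detach().clone()
            for n, p in model.named_parameters()
            if p.requires_grad and p.grad is not None
        }
    for n, p in model.named_parameters():
        if p.requires_grad and p.grad is not None and n in grads:
            grads[n].mul_(alpha).add_(p.grad.detach(), alpha=1.0 - alpha)
            p.grad.add_(grads[n], alpha=lamb)
    return grads


# Typical use in a plain loop: filter after backward, before the optimizer step.
# grads = None
# for batch in loader:
#     loss = compute_loss(model, batch)   # compute_loss is a placeholder
#     loss.backward()
#     grads = gradfilter_ema(model, grads=grads)
#     optimizer.step()
#     optimizer.zero_grad()
```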
Worked the approach into a fork of the HF Transformers Trainer.
Is there an expectation that all weights within the network are trainable, or can it be used for fine-tuning when only some layers are trainable, @ironjr? I tried to insert the code into the Trainer's inner training step as well, but I get an error about NoneType.
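A common cause of that kind of NoneType error is the filter touching parameters whose `.grad` is `None` (frozen layers during partial fine-tuning, or parameters not reached in the graph), so guarding on `p.grad is not None` and only filtering trainable parameters should make it compatible with partial fine-tuning. Below is a rough, hypothetical sketch of how the hook could look as a `Trainer` subclass, reusing the `gradfilter_ema` sketch above; the `training_step` signature varies across Transformers versions, and gradient accumulation, DeepSpeed, or FSDP would need extra care, so treat it as a starting point rather than a drop-in.

```python
# Hypothetical sketch: applying a Grokfast-style gradient filter inside the HF Trainer.
# Reuses the gradfilter_ema helper sketched earlier in this thread; frozen parameters
# (p.grad is None) are skipped there, so partial fine-tuning should not hit NoneType.
from transformers import Trainer


class GrokfastTrainer(Trainer):
    def __init__(self, *args, grok_alpha: float = 0.98, grok_lamb: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.grok_alpha = grok_alpha
        self.grok_lamb = grok_lamb
        self._grok_grads = None  # per-parameter EMA state carried across steps

    def training_step(self, model, inputs, *args, **kwargs):
        # Run the usual forward + backward, then filter the accumulated gradients so
        # the subsequent optimizer.step() sees the amplified slow component.
        loss = super().training_step(model, inputs, *args, **kwargs)
        self._grok_grads = gradfilter_ema(
            model,
            grads=self._grok_grads,
            alpha=self.grok_alpha,
            lamb=self.grok_lamb,
        )
        return loss
```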
@l4b4r4b4b4 how's the result going?
@l4b4r4b4b4 any update?
Had everything implemented in a Transformers fork. I think @ehartford took it a bit further with https://github.com/cognitivecomputations/grokadamw.
I got creative; my implementation is inspired by the paper rather than being a direct implementation of it.
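Out of curiosity about what "inspired by rather than a direct port" could look like: one option is to fold the slow-gradient amplification into the optimizer itself instead of mutating `p.grad` from the training loop. The toy wrapper below is purely my own illustration of that idea and is not GrokAdamW's actual code; the class name, hyperparameter names, and defaults are all assumptions.

```python
# Toy illustration only: an AdamW subclass that amplifies an EMA of each parameter's
# gradient before the normal update. NOT GrokAdamW's implementation; names and
# defaults are assumptions made for the sake of the example.
import torch


class SlowGradAdamW(torch.optim.AdamW):
    def __init__(self, params, *, grok_alpha: float = 0.98, grok_lamb: float = 2.0, **kwargs):
        super().__init__(params, **kwargs)
        self.grok_alpha = grok_alpha
        self.grok_lamb = grok_lamb
        self._grok_ema = {}  # EMA state kept separate from AdamW's own state dict

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue  # skip frozen / unused parameters
                ema = self._grok_ema.get(p)
                if ema is None:
                    ema = p.grad.detach().clone()
                else:
                    ema.mul_(self.grok_alpha).add_(p.grad, alpha=1.0 - self.grok_alpha)
                self._grok_ema[p] = ema
                p.grad.add_(ema, alpha=self.grok_lamb)  # amplify the slow component
        return super().step(closure)
```

Folding it into the optimizer keeps the training loop untouched, at the cost of tying the filter to one particular optimizer implementation.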
Has it been tried on larger models to assess the training time reduction? I am planning to deploy Llama 3.1 70B Instruct for machine translation and am wondering whether fine-tuning it could benefit.
I am looking at the graphic, and the two loss lines seem to track each other; there is no divergence. The absolute difference is likely just an artefact of how the two are initialized.
On Tue, Aug 20, 2024 at 8:41 AM, Eric Hartford wrote:
Mine so far: I'm seeing only marginal divergence vs adamw-fused. An improvement, but not an obvious slam dunk. Maybe I need to improve the default grokking functions.
[image-8.png: https://github.com/user-attachments/assets/397c77a3-09a2-45a1-a610-2b9c2f88cfb0]
Are there open-data models that could be used to test the efficiency of GrokFast? Only GPT-NeoX and OpenLLaMA come to mind at the moment: https://github.com/EleutherAI/gpt-neox and https://github.com/openlm-research/open_llama
I'll try my best, but I thought to check whether anyone else wants to try this in the context of the Transformers Trainer.