Implementation of layer-norm in the training script #3256
Conversation
No Taskcluster jobs started for this pull request. The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the default branch) does not allow starting tasks for this pull request. |
Master moved to r2.3
Pretty sure we already have those |
https://github.com/mozilla/tensorflow/blob/r2.3/tensorflow/core/kernels/BUILD#L8655-L8673 Ok, so you just have a few files to add? Please make sure you:
|
Yes, these are new files; I need to check whether they work on r2.3, but I believe they will work just fine. |
Do you mind sharing more on that? |
The experiments we have done are a little old right now. I'm running a new training benchmark, and I'll let you know the results soon. |
The code changes look good to me. Thanks Bernardo!
Everything is working in the binaries; I've already created a PR for Mozilla's TensorFlow: mozilla/tensorflow#124 |
For training from already-existing checkpoints you have to initialize the new layers first. This worked for me:
I also did run a short test transfer-learning the English checkpoint to German with a small dataset (~32h), but layer-norm didn't help here:
Maybe training the reinitialized LayerNorm tensors only and freezing the rest of the network (see #3247) before training the complete network would help here |
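The restore-then-initialize idea above can be sketched framework-agnostically. In this hypothetical sketch (the names and initializers are assumptions, not the actual DeepSpeech variables), the checkpoint is just a dict of arrays: parameters present in the old checkpoint are reused, and anything missing (e.g. the new LayerNorm gamma/beta tensors) gets a fresh default value:

```python
import numpy as np

def restore_with_new_layers(model_params, checkpoint):
    """Copy weights that exist in the checkpoint; initialize the rest.

    model_params: dict name -> shape of the (new) architecture
    checkpoint:   dict name -> np.ndarray from the old model
    """
    restored = {}
    for name, shape in model_params.items():
        if name in checkpoint and checkpoint[name].shape == tuple(shape):
            restored[name] = checkpoint[name]           # reuse old weights
        elif name.endswith("gamma"):
            restored[name] = np.ones(shape)             # LayerNorm scale init
        elif name.endswith("beta"):
            restored[name] = np.zeros(shape)            # LayerNorm shift init
        else:
            restored[name] = np.random.randn(*shape) * 0.01

    return restored

# Old checkpoint only has the dense kernel; the LayerNorm params are new.
ckpt = {"dense/kernel": np.full((4, 4), 0.5)}
params = {"dense/kernel": (4, 4), "dense/ln/gamma": (4,), "dense/ln/beta": (4,)}
out = restore_with_new_layers(params, ckpt)
```

Freezing the rest of the network as suggested would then amount to marking only the `dense/ln/*` entries as trainable for the first training phase.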
@DanBmh you cannot take the weights (checkpoint) from a model trained without layer-norm and use them to finetune/transfer-learn a model that uses layer-norm. The models' architectures are just different. For instance, while the 2nd dense layer from the current checkpoint (without LN) was trained to process tensors in a certain range, the 2nd dense layer in an architecture with LN will process tensors with a completely different range (not only range, but mean and variance as well). That is also why I've put the argument |
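To illustrate the point about distributions: layer normalization forces each activation vector to zero mean and unit variance before the learnable scale/shift, so a downstream layer trained on un-normalized activations sees very different inputs. A minimal NumPy sketch (not the TF implementation used in this PR):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Normalize over the feature axis, then apply learnable scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(8, 128) * 50.0 + 10.0   # activations with large mean/variance
y = layer_norm(x)
# Each row of y now has mean ~0 and std ~1, regardless of x's original scale.
```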
@bernardohenz you're right. I was just hoping that transfer-learning performance wouldn't be that bad, so I did a short test run. |
Thanks @bernardohenz ! Can you send a PR against
This is required for us to be able to run your PR with all your changes and ensure nothing regresses |
@bernardohenz This is not complete; you have not updated |
Yes, I was about to post asking for some help with this. What am I supposed to do? Just replace the old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')? |
Just replace, that is the purpose of those references: our CI will check if the taskcluster index exists, and if not, it will build it. |
@bernardohenz Please don't merge but rebase. |
@bernardohenz Can you please clean the history? No merge, no "revert" of the previous commit. Force-push is fine, no worries. |
Force-pushed from 5e78032 to cdcbfdf
Replacing old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')
Force-pushed from cdcbfdf to 77e6c93
I believe this is ok now. Sorry for the mess. |
Thanks; one thing I forgot, can you please update |
You can see the progress on the Community-TC link 😊 |
Nice :D |
macOS CI was a bit of a burden (it always is), but it's green in the end. I'm going to merge your TensorFlow part, then re-run the PR with the new sha1 and take care of the rest. |
PR that allows the use of layer-norm in the model. For our experiments, it allows for training more epochs without overfitting.
PS: I'll be creating a PR in your TensorFlow repository, as layer-norm uses some dependencies that are not included in your build rule. Nonetheless, I'm sending the new rule here (which was found in
tensorflow/core/kernels/BUILD
). When compiling against version 0.6.1, this rule worked just fine. I'll be compiling the binaries on the
master
branch to check if it still works.