Implementation of layer-norm in the training script #3256


Closed
wants to merge 6 commits

Conversation

bernardohenz
Contributor

This PR allows the use of layer-norm in the model. In our experiments, it allows training for more epochs without overfitting.
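
For a rough idea of what the flag changes, here is a minimal sketch (written against the TF 2.x Keras API for brevity; the dense_block helper, names, and layer placement are illustrative, not the exact diff):

import tensorflow as tf

def dense_block(x, units, layer_norm=False, dropout_rate=0.05):
    # Dense layer followed by the DeepSpeech-style clipped ReLU.
    y = tf.keras.layers.Dense(units)(x)
    if layer_norm:
        # The gamma/beta created here are the new "LayerNorm" variables
        # that are absent from checkpoints trained without the flag.
        y = tf.keras.layers.LayerNormalization(axis=-1)(y)
    y = tf.minimum(tf.nn.relu(y), 20.0)
    if dropout_rate:
        y = tf.keras.layers.Dropout(dropout_rate)(y)
    return y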

PS: I'll be creating a PR in your tensorflow repository, as layer-norm uses some dependencies that are not included in your build rule. Nonetheless, I'm sending the new rule here (it belongs in tensorflow/core/kernels/BUILD). When compiling the 0.6.1 version, the following rule worked just fine:

tf_kernel_library(
    name = "deepspeech_cwise_ops",
    srcs = [
        "cwise_op_less.cc",
        "cwise_op_minimum.cc",
        "cwise_op_mul_1.cc",
        "cwise_op_squared_difference.cc",
        "cwise_op_add_1.cc",
        "cwise_op_add_2.cc",
        "cwise_op_rsqrt.cc",
        "cwise_op_sub.cc",
    ],
    gpu_srcs = [
        "cwise_op_gpu_less.cu.cc",
        "cwise_op_gpu_minimum.cu.cc",
        "cwise_op_gpu_mul.cu.cc",
        "cwise_op_gpu_squared_difference.cu.cc",
        "cwise_op_gpu_add.cu.cc",
        "cwise_op_gpu_rsqrt.cu.cc",
        "cwise_op_gpu_sub.cu.cc",
    ],
    deps = [
        ":cwise_lib",
        "//tensorflow/core:framework",
        "//tensorflow/core:lib",
        "//third_party/eigen3",
    ],
)
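
For reference, those cwise kernels map directly onto the elementwise math of layer normalization. A numpy sketch of the decomposition, just to show why sub, squared_difference, rsqrt, mul, and add are needed (the helper name is made up):

import numpy as np

def layer_norm_from_cwise_ops(x, gamma, beta, eps=1e-6):
    # mean over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    # variance via squared_difference: mean((x - mu)^2)
    var = np.mean(np.square(x - mu), axis=-1, keepdims=True)
    # rsqrt: 1 / sqrt(var + eps) is the normalization factor
    inv = 1.0 / np.sqrt(var + eps)
    # mul and add apply the learned scale and shift
    return gamma * (x - mu) * inv + beta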

I'll be compiling the binaries in the master to check if it still works.

@community-tc-integration

No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the
default branch) does not allow starting tasks for this pull request.

@lissyx
Collaborator

lissyx commented Aug 18, 2020

I'll be compiling the binaries in the master to check if it still works.

Master moved to r2.3

PS: I'll be creating a PR in your tensorflow repository, as layer-norm uses some dependencies that are not included in your build rule. Nonetheless, I'm sending the new rule here (it belongs in tensorflow/core/kernels/BUILD).

Pretty sure we already have those

@lissyx
Collaborator

lissyx commented Aug 18, 2020

I'm sending the new rule here

https://github.com/mozilla/tensorflow/blob/r2.3/tensorflow/core/kernels/BUILD#L8655-L8673

Ok, so you just have a few files to add?

Please make sure you:

  • update tensorflow submodule
  • update taskcluster/.shared.yml tensorflow references

@bernardohenz
Contributor Author

bernardohenz commented Aug 18, 2020

Ok, so you just have a few files to add?

Yes, these are the new files; I still need to check that they work on r2.3, but I believe they will work just fine.

@lissyx
Collaborator

lissyx commented Aug 18, 2020

For our experiments, it allows for training more epochs without overfitting.

Do you mind sharing more on that?

@lissyx lissyx requested a review from reuben August 18, 2020 16:11
@bernardohenz
Contributor Author

bernardohenz commented Aug 18, 2020

Do you mind sharing more on that?

The experiments we did are a little old by now. I'm running a new training benchmark right now; I'll let you know the results soon.

Contributor

@reuben reuben left a comment


The code changes look good to me. Thanks Bernardo!

@bernardohenz
Contributor Author

Everything is working in the binaries; I've already created a PR for mozilla tensorflow: mozilla/tensorflow#124

@DanBmh
Contributor

DanBmh commented Aug 20, 2020

For training from already-existing checkpoints, you have to initialize the new layers first. The following worked for me:

# training/deepspeech_training/util/checkpoints.py

def _load_checkpoint(session, checkpoint_path, allow_drop_layers):
    [...]

    if FLAGS.layer_norm:
        for v in load_vars:
            if v.op.name not in vars_in_ckpt:
                if 'LayerNorm' in v.name:
                    # Variable is missing from the checkpoint but belongs to
                    # layer norm: schedule it for fresh initialization.
                    init_vars.add(v)
                else:
                    msg = "Tried to train with layer normalization but there was " \
                          "a missing variable other than the LayerNorm tensors: {}"
                    log_error(msg.format(v))
                    sys.exit(1)
        # Everything scheduled for fresh initialization is excluded
        # from the checkpoint restore.
        load_vars -= init_vars

I also ran a short test transfer-learning the English checkpoint to German with a small dataset (~32 h), but layer-norm didn't help here:

| Dataset  | Additional Infos         | Losses                                 | Training epochs of best model | Total training duration |
| -------- | ------------------------ | -------------------------------------- | ----------------------------- | ----------------------- |
| Voxforge | without layer-norm       | Test: 30.655203, Validation: 33.655750 | 9                             | 48 min                  |
| Voxforge | with layer normalization | Test: 57.330410, Validation: 61.025009 | 45                            | 2:37 h                  |

Maybe training only the reinitialized LayerNorm tensors and freezing the rest of the network (see #3247) before training the complete network would help here; a sketch of that idea follows.
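
A minimal sketch of that two-phase idea, assuming the TF1-style graph the training script builds (the helper name is made up, and loss stands for the CTC loss tensor the script already defines):

import tensorflow.compat.v1 as tfv1

def make_train_ops(loss, warmup_lr=1e-4):
    all_vars = tfv1.trainable_variables()
    ln_vars = [v for v in all_vars if 'LayerNorm' in v.name]

    optimizer = tfv1.train.AdamOptimizer(learning_rate=warmup_lr)

    # Phase 1: gradients flow only into the freshly initialized
    # LayerNorm gamma/beta tensors; the rest of the network is frozen.
    warmup_op = optimizer.minimize(loss, var_list=ln_vars)

    # Phase 2: fine-tune the complete network as usual.
    train_op = optimizer.minimize(loss, var_list=all_vars)
    return warmup_op, train_op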

@bernardohenz
Contributor Author

@DanBmh you cannot take the weights (checkpoint) from a model trained without layer-norm and use them to finetune/transfer-learn a model that uses layer-norm. The architectures are simply different.

For instance, while the 2nd dense layer from the current checkpoint (without LN) was trained to process tensors in a certain range, the 2nd dense layer in an architecture with LN will process tensors with a completely different range (not only range, but mean and variance as well).

That's also why I've put the layer_norm argument along with n_hidden in the geometry section. These arguments dictate the geometry/architecture of your model. If you wish to finetune/transfer-learn from a trained model, you should stick with the architecture it was trained with.
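
To make the range argument concrete, a quick numpy illustration (the shapes and ranges are made up for the example):

import numpy as np

# Activations after a clipped ReLU lie roughly in [0, 20] ...
x = np.random.uniform(0.0, 20.0, size=(1, 2048))

# ... but after layer norm the next layer sees zero-mean, unit-variance input.
ln = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

print(x.mean(), x.std())    # roughly 10 and 5.8
print(ln.mean(), ln.std())  # roughly 0 and 1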

@DanBmh
Contributor

DanBmh commented Aug 20, 2020

For instance, while the 2nd dense layer from the current checkpoint (without LN) was trained to process tensors in a certain range, the 2nd dense layer in an architecture with LN will process tensors with a completely different range (not only range, but mean and variance as well).

@bernardohenz you're right. I was just hoping that the transfer-learning performance wouldn't be that bad, so I did a short test run.

@lissyx
Collaborator

lissyx commented Aug 24, 2020

Everything is working in the binaries; I've already created a PR for mozilla tensorflow: mozilla/tensorflow#124

Thanks @bernardohenz !

Can you send a PR against mozilla/STT that:

  • changes .gitmodules to fetch your tensorflow repo
  • changes tensorflow sha1 checkout to your changes
  • changes taskcluster/.shared.yml tensorflow SHA1 references to your new sha1

This is required for us to be able to run your PR with all your changes and ensure nothing regresses.

@lissyx lissyx self-requested a review August 24, 2020 15:10
@lissyx
Collaborator

lissyx commented Aug 24, 2020

@bernardohenz This is not complete, you have not updated taskcluster/.shared.yml

@bernardohenz
Contributor Author

@bernardohenz This is not complete, you have not updated taskcluster/.shared.yml

Yes, I was about to post asking for some help with this. What am I supposed to do? Just replace the old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')?
Or should I compile something and upload it to you?

@lissyx
Collaborator

lissyx commented Aug 24, 2020

@bernardohenz This is not complete, you have not updated taskcluster/.shared.yml

Yes, I was about to post asking for some help with this. What am I supposed to do? Just replace the old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')?
Or should I compile something and upload it to you?

Just replace, that is the purpose of those references: our CI will check if the taskcluster index exists, and if not, it will build it.

@lissyx
Collaborator

lissyx commented Aug 24, 2020

@bernardohenz Please don't merge but rebase.

@lissyx
Collaborator

lissyx commented Aug 24, 2020

@bernardohenz Can you please clean the history? No merge, no "revert" of the previous commit. Force-push is fine, no worries.

@bernardohenz
Contributor Author

I believe this is ok now. Sorry for the mess.

@lissyx
Collaborator

lissyx commented Aug 24, 2020

I believe this is ok now. Sorry for the mess.

Thanks; one thing I forgot: can you please update the tensorflow version reference in taskcluster/.build.yml with the value matching your `git describe --long --tags`?

@lissyx
Collaborator

lissyx commented Aug 24, 2020

You can see the progress on the Community-TC link 😊

@bernardohenz
Contributor Author

You can see the progress on the Community-TC link

Nice :D

@lissyx
Collaborator

lissyx commented Aug 25, 2020

You can see the progress on the Community-TC link

Nice :D

macOS CI was a bit of a burden (it always is), but it's green in the end. I'm going to merge your TensorFlow part, then re-run the PR with the new sha1 and take care of the rest.

@lissyx lissyx closed this Aug 25, 2020