Correctly drop tokens in SwitchTransformer #37123
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Would love to be part of this project one day.
LGTM but I'm not an expert on MoE routing! @zucchini-nlp @ArthurZucker if you're happy with it feel free to merge
Looks okay to me, as long as the slow tests are green.
r""" | ||
This test checks if the token dropping actually drops tokens. | ||
""" | ||
config = SwitchTransformersConfig(expert_capacity=0) # we drop everything |
Using SwitchTransformersConfig with defaults initializes a huge model; let's make it tiny. We can even move this under the general model tests.
Note that this config is used to initialize a single expert-MLP module with 8 experts, hidden size 768, and intermediate size 2048, not a full model. There is no attention and there are no stacked layers. In my case, this test ran very fast on CPU.
I could adjust the shapes to the ones chosen in SwitchTransformersModelTester, but I don't think it would have much impact.
I am new to HF testing and this part of the code. If I moved this test to the general model tests, I assume I would have to initialize and run a whole model. In that case, I could not easily assert the result of the module: embeddings, attention, residual connections, and the head would all influence the final output. The advantage of testing SwitchTransformersSparseMLP directly is that I know the result needs to be all zeros, independent of the input (and of the input/model shapes).
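For readers following along, here is a minimal sketch of what such a direct module test could look like. It is an approximation, not the merged test: the SwitchTransformersSparseMLP import path, its return format, and the chosen input shape are assumptions based on the transformers source at the time of this PR.

```python
import torch

from transformers import SwitchTransformersConfig
from transformers.models.switch_transformers.modeling_switch_transformers import (
    SwitchTransformersSparseMLP,
)


def test_token_dropping_sketch():
    """With expert_capacity=0 every token is dropped, so after this PR the
    sparse MLP output must be all zeros regardless of the input."""
    config = SwitchTransformersConfig(expert_capacity=0)  # we drop everything
    moe = SwitchTransformersSparseMLP(config)
    hidden_states = torch.randn(2, 3, config.d_model)  # (batch, seq_len, hidden)
    output = moe(hidden_states)[0]  # forward also returns router logits / expert indices
    assert torch.all(output == 0), "some tokens were not dropped"
```

Testing the sparse MLP module directly keeps the expected result trivial (all zeros), exactly as argued above.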
run-slow: switch_transformers
This comment contains run-slow, running the specified jobs: models: ['models/switch_transformers']
Force-pushed from f30dc00 to 8a79849 (Compare)
run-slow: switch_transformers
This comment contains run-slow, running the specified jobs: models: ['models/switch_transformers']
pytest/custom-tests (slow CI job): it's green, so this might be ready to go, @zucchini-nlp?
Cool, thanks! Merging.
What does this PR do?
Previously, the identity function was used for dropped tokens, which were then scaled by the weight of an expert that was never applied to the hidden states. This was misleading, because dropping means the expert weight is zero. Instead of trying to fix the weight, we take an easier approach and initialize with zeros.
Fixes #37017
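To make the change concrete, here is a hedged sketch of the routing logic described above. It is not a verbatim diff of the merged code: the function name and argument names are made up for illustration, and only the initialization of next_states is the point of this PR.

```python
import torch


def sparse_mlp_forward_sketch(hidden_states, router_mask, router_probs, experts):
    """Illustrative sketch of Switch Transformer token routing
    (not the actual transformers source)."""
    # Before this PR: next_states = hidden_states.clone(), so a dropped token
    # kept its input value and was still scaled by the weight of an expert
    # that never processed it.
    # After this PR: start from zeros, so tokens that no expert processes
    # (e.g. because expert_capacity is exhausted) contribute nothing.
    next_states = torch.zeros_like(hidden_states)
    for idx, expert in enumerate(experts):
        token_indices = router_mask[:, :, idx].bool()
        if token_indices.any():
            next_states[token_indices] = expert(hidden_states[token_indices]).to(next_states.dtype)
    # Scale each token by the router probability of its selected expert.
    return router_probs * next_states
```

With expert_capacity=0 no token is routed to any expert, so the output is all zeros, which is exactly what the new test asserts.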
Related work
https://github.com/tensorflow/mesh/blob/e6798a2610a2c2f4c4cd236d8214422cb1ecc00a/mesh_tensorflow/transformer/moe.py#L1144 mentions that it needs to be zeroed out.
https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py#L507C18-L507C31 combines the results without any clone initialization beforehand.
Who can review?
text models: @ArthurZucker
person who last made major changes: @zucchini-nlp
person who requested the PR: @Rocketknight1