
Conversation

@mdchuc mdchuc commented May 30, 2024

While trying to train Vision Mamba in bidirectional mode inside a masked autoencoder network, I ran into NaN loss. Switching from mixed precision to full precision fixed the problem but significantly increased training time (nearly doubling it). Looking at the code, summing the forward and backward hidden_states/residuals roughly doubles their magnitude compared to the original hidden states (after patch embedding). Dividing the sum by 2 resolved the NaN loss and allowed mixed-precision training to continue.

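A minimal sketch of the proposed change, using dummy tensors in place of the actual per-direction block outputs (tensor names and shapes here are illustrative assumptions, not the repository's code):

```python
import torch

# Toy shapes: (batch, tokens, dim), standing in for the outputs of the
# forward and backward passes of a bidirectional Vision Mamba layer.
B, L, D = 2, 196, 192
hidden_states_f = torch.randn(B, L, D)   # forward-direction hidden states
hidden_states_b = torch.randn(B, L, D)   # backward-direction hidden states (already un-flipped)
residual_f = torch.randn(B, L, D)
residual_b = torch.randn(B, L, D)

# Current combination: a plain sum, which grows the magnitude of both
# tensors relative to the post-patch-embedding hidden states.
hidden_states_sum = hidden_states_f + hidden_states_b
residual_sum = residual_f + residual_b

# Proposed fix: average the two directions so the scale stays comparable
# to the unidirectional case, keeping fp16 activations from overflowing to NaN.
hidden_states = (hidden_states_f + hidden_states_b) / 2
residual = (residual_f + residual_b) / 2

# The averaged output is exactly half the magnitude of the summed one.
print(hidden_states_sum.abs().mean() / hidden_states.abs().mean())  # 2.0
```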