With the config dp1_tp8_pp1_acc1_mbs64_seq2048_zero0_tpmodeALL for a 1B model, the all-reduce is taking very long.

Edit: in another profiling run it looks like this is because we are too close to the memory limit (peak reserved: 72462.00 MiB).
Comparison with REDUCE_SCATTER
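For context on that comparison, here is a minimal sketch of the communication difference I would expect between the two TP modes, assuming the REDUCE_SCATTER mode scatters the row-parallel output along the sequence dimension (sequence-parallel style) so each rank keeps only a 1/tp slice of the activation. The function names and shapes below are hypothetical, not nanotron's actual API; given that peak reserved memory is near the limit, the lower activation footprint is presumably the relevant difference rather than the raw communication volume.

```python
import torch
import torch.distributed as dist

# Hypothetical shapes: the full activation is [seq, batch, hidden]; tp = TP group size.

# ALL_REDUCE mode: every rank ends up holding the full activation tensor.
def row_parallel_output_all_reduce(partial_out, tp_group):
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return partial_out  # [seq, batch, hidden] replicated on every rank

# REDUCE_SCATTER mode (sequence-parallel style): each rank keeps only its
# [seq/tp, batch, hidden] slice, so downstream activations use 1/tp of the memory.
def row_parallel_output_reduce_scatter(partial_out, tp_group):
    tp = dist.get_world_size(tp_group)
    out = torch.empty_like(partial_out.chunk(tp, dim=0)[0])
    dist.reduce_scatter_tensor(out, partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return out  # [seq/tp, batch, hidden] shard
```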
Can we explore better alternatives for ShardedCrossEntropy?
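As a baseline for comparing alternatives, here is a minimal sketch of a vocab-parallel cross entropy that never gathers the full logits: the collectives run over [batch, seq]-sized tensors (the running max, the sum of exponentials, and the target logit) rather than the full [batch, seq, vocab] logits. Names and shapes are assumptions for illustration, not the actual ShardedCrossEntropy implementation.

```python
import torch
import torch.distributed as dist

def vocab_parallel_cross_entropy(sharded_logits, targets, tp_group):
    """Per-token cross entropy over logits sharded along the vocab dim.

    sharded_logits: [batch, seq, vocab/tp] local shard of the logits
    targets:        [batch, seq] global token ids (torch.long)
    tp_group:       process group holding the tensor-parallel ranks
    """
    tp_rank = dist.get_rank(tp_group)
    vocab_per_rank = sharded_logits.size(-1)
    vocab_start = tp_rank * vocab_per_rank
    vocab_end = vocab_start + vocab_per_rank

    # 1) Global max over the vocab dim for numerical stability: [batch, seq] all-reduce.
    logits_max = sharded_logits.max(dim=-1).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=tp_group)
    shifted = sharded_logits - logits_max.unsqueeze(-1)

    # 2) Global log-sum-exp denominator: another [batch, seq] all-reduce.
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=tp_group)

    # 3) Pick the target logit from the rank that owns it, zero elsewhere, then sum.
    in_shard = (targets >= vocab_start) & (targets < vocab_end)
    local_targets = (targets - vocab_start).clamp(0, vocab_per_rank - 1)
    target_logits = shifted.gather(-1, local_targets.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(in_shard, target_logits, torch.zeros_like(target_logits))
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=tp_group)

    # Per-token loss: log(sum of exps) minus the (max-shifted) target logit.
    return sum_exp.log() - target_logits
```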