With the config dp1_tp8_pp1_acc1_mbs64_seq2048_zero0_tpmodeALL for a 1B model, the all-reduce is taking very long.

Edit: in another profiling run it looks like this is because we are too close to the memory limit (peak reserved: 72462.00 MiB).
Comparison with REDUCE_SCATTER
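For context on that comparison, here is a minimal sketch of the communication difference I would expect between the two TP modes, assuming the REDUCE_SCATTER mode scatters the row-parallel output along the sequence dimension (sequence-parallel style) so each rank keeps only a 1/tp slice of the activation. The function names and shapes below are hypothetical, not nanotron's actual API; given that peak reserved memory is near the limit, the lower activation footprint is presumably the relevant difference rather than the raw communication volume.

```python
import torch
import torch.distributed as dist

# Hypothetical shapes: the full activation is [seq, batch, hidden]; tp = TP group size.

# ALL_REDUCE mode: every rank ends up holding the full activation tensor.
def row_parallel_output_all_reduce(partial_out, tp_group):
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return partial_out  # [seq, batch, hidden] replicated on every rank

# REDUCE_SCATTER mode (sequence-parallel style): each rank keeps only its
# [seq/tp, batch, hidden] slice, so downstream activations use 1/tp of the memory.
def row_parallel_output_reduce_scatter(partial_out, tp_group):
    tp = dist.get_world_size(tp_group)
    out = torch.empty_like(partial_out.chunk(tp, dim=0)[0])
    dist.reduce_scatter_tensor(out, partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return out  # [seq/tp, batch, hidden] shard
```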
Can we explore better alternatives for ShardedCrossEntropy?
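As a baseline for comparing alternatives, here is a minimal sketch of a vocab-parallel cross entropy that never gathers the full logits: the collectives run over [batch, seq]-sized tensors (the running max, the sum of exponentials, and the target logit) rather than the full [batch, seq, vocab] logits. Names and shapes are assumptions for illustration, not the actual ShardedCrossEntropy implementation.

```python
import torch
import torch.distributed as dist

def vocab_parallel_cross_entropy(sharded_logits, targets, tp_group):
    """Per-token cross entropy over logits sharded along the vocab dim.

    sharded_logits: [batch, seq, vocab/tp] local shard of the logits
    targets:        [batch, seq] global token ids (torch.long)
    tp_group:       process group holding the tensor-parallel ranks
    """
    tp_rank = dist.get_rank(tp_group)
    vocab_per_rank = sharded_logits.size(-1)
    vocab_start = tp_rank * vocab_per_rank
    vocab_end = vocab_start + vocab_per_rank

    # 1) Global max over the vocab dim for numerical stability: [batch, seq] all-reduce.
    logits_max = sharded_logits.max(dim=-1).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=tp_group)
    shifted = sharded_logits - logits_max.unsqueeze(-1)

    # 2) Global log-sum-exp denominator: another [batch, seq] all-reduce.
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=tp_group)

    # 3) Pick the target logit from the rank that owns it, zero elsewhere, then sum.
    in_shard = (targets >= vocab_start) & (targets < vocab_end)
    local_targets = (targets - vocab_start).clamp(0, vocab_per_rank - 1)
    target_logits = shifted.gather(-1, local_targets.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(in_shard, target_logits, torch.zeros_like(target_logits))
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=tp_group)

    # Per-token loss: log(sum of exps) minus the (max-shifted) target logit.
    return sum_exp.log() - target_logits
```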