Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openmpi launchers for deepseed should not set unsupported --num_gpus #3319

Open
chiragjn opened this issue Jan 2, 2025 · 2 comments · May be fixed by #3357
Open

openmpi launchers for deepseed should not set unsupported --num_gpus #3319

chiragjn opened this issue Jan 2, 2025 · 2 comments · May be fixed by #3357

Comments

@chiragjn
Copy link

chiragjn commented Jan 2, 2025

To reproduce:

accelerate launch --mixed_precision bf16 --use_deepspeed --deepspeed_multinode_launcher openmpi --deepspeed_hostfile /etc/mpi/hostfile --same_network --monitor_interval "30" --num_machines "2" --num_processes "2" --main_process_port 29500 --machine_rank "-1" train.py

Which translates to

deepspeed --no_local_rank --hostfile /etc/mpi/hostfile --launcher openmpi --master_port 29500 --num_gpus 1 train.py

Which will lead to

openmpi backend does not support limiting num nodes/gpus

Deepspeed error reference:

https://github.com/microsoft/DeepSpeed/blob/3573858e7ce2c723b8c43231c6c6b0cf97dca2fc/deepspeed/launcher/multinode_runner.py#L141C44-L141C92


Accelerate should skip setting num gpus for openmpi based launchers in the else clause here:

if num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
cmd = ["deepspeed", "--no_local_rank"]
cmd.extend(["--hostfile", str(args.deepspeed_hostfile), "--launcher", str(args.deepspeed_multinode_launcher)])
if args.deepspeed_exclusion_filter is not None:
cmd.extend(
[
"--exclude",
str(args.deepspeed_exclusion_filter),
]
)
elif args.deepspeed_inclusion_filter is not None:
cmd.extend(
[
"--include",
str(args.deepspeed_inclusion_filter),
]
)
else:
cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])

@chiragjn
Copy link
Author

@muellerzr I can contribute a fix for this. Let me know if that sounds good

@muellerzr
Copy link
Collaborator

Thanks for pointing out the issue @chiragjn ! Yes a PR would be welcomed 🤗

@chiragjn chiragjn linked a pull request Jan 20, 2025 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants