Issue with Multi-GPU Training Configuration #61
Hi @Shadowfax-YJ, can you try running `mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger`? Let me know if that helped.
@danielzuegner I have a similar question. I only have 4090 GPUs, and a single GPU is not able to run the mattergen-train task. Even when I use four 4090 GPUs, set `devices: 4` in trainer/default.yaml, and use the command you wrote above, I still get "out of memory" errors. What solution can you advise if we keep using 4090 GPUs?
Hi @yuhao1982, have you tried increasing the gradient accumulation? See this tip from the README: "Note that a single GPU's memory usually is not enough for the batch size of 512, hence we accumulate gradients over 4 batches. If you still run out of memory, increase this further." Try successively increasing the `trainer.accumulate_grad_batches` value until training fits in memory.
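As a concrete sketch of what that override could look like on the command line (the `trainer.accumulate_grad_batches` path is an assumption inferred from the config interpolation shown further down this thread; the right value depends on your GPU memory):

```bash
# Hedged sketch: double gradient accumulation from 4 to 8 via a Hydra override.
# trainer.accumulate_grad_batches is assumed from the ${trainer.accumulate_grad_batches}
# interpolation in the batch_size config below.
mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger trainer.accumulate_grad_batches=8
```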
Looks reasonable. Can you share the full config that gets printed at the start of training?
Yes, exactly. But this is cut off; I'd need to see the full config.
Ah, I think I know what the problem is. Can you modify the batch_size:

# total batch size of 512, adjust for number of devices, nodes, and gradient accumulation
train: ${eval:'(512 // ${trainer.accumulate_grad_batches}) // (${trainer.devices} * ${trainer.num_nodes})'}
val: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}
test: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}

And try again? This correctly accounts for gradient accumulation; we were only doing this for alex_mp_20 so far.
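To see how these expressions resolve, here is the arithmetic for an assumed single-node setup with `trainer.devices=4`, `trainer.num_nodes=1`, and `trainer.accumulate_grad_batches=4` (these values are illustrative, not from the thread):

```text
train: (512 // 4) // (4 * 1) = 128 // 4 = 32   # per-device, per-step batch size
val:   64 // (4 * 1) = 16
test:  64 // (4 * 1) = 16
# Effective train batch: 32 per device x 4 devices x 1 node x 4 accumulation steps = 512
```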
I greatly appreciate your work on this project! I encountered an issue while trying to set up multi-GPU training, and I was hoping you could provide some guidance. I modified the devices parameter in conf/trainer/default.yaml to 4, then ran the following command:
mattergen-train data_module=mp_20 ~trainer.logger --config-name=csp
However, I encountered the following error:
mattergen-train: error: unrecognized arguments: hydra.run.dir="/mattergen/outputs/singlerun/2025-02-17/17-24-58" hydra.job.name=train_ddp_process_2
usage: mattergen-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun]
[--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
[--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides ...]
Could you please advise on how to properly configure the training for multiple GPUs? Any assistance would be greatly appreciated!
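For reference, the change described above would look something like this in conf/trainer/default.yaml (a minimal sketch; the file's actual contents and other keys may differ):

```yaml
# conf/trainer/default.yaml (sketch; other keys omitted)
devices: 4      # number of GPUs to use on this node
num_nodes: 1    # single-machine training
```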