
Issue with Multi-GPU Training Configuration #61

Open
Shadowfax-YJ opened this issue Feb 17, 2025 · 9 comments
Labels: bug (Something isn't working)

Comments

@Shadowfax-YJ

I greatly appreciate your work on this project! I encountered an issue while trying to set up multi-GPU training, and I was hoping you could provide some guidance. I modified the devices parameter in conf/trainer/default.yaml to 4, then ran the following command:

mattergen-train data_module=mp_20 ~trainer.logger --config-name=csp
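
(For reference, the devices change in conf/trainer/default.yaml was roughly along these lines; the surrounding keys are assumed here and may differ in the actual file:)

# conf/trainer/default.yaml (sketch; only the relevant key shown)
devices: 4  # run training across 4 GPUs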

However, I encountered the following error:

mattergen-train: error: unrecognized arguments: hydra.run.dir="/mattergen/outputs/singlerun/2025-02-17/17-24-58" hydra.job.name=train_ddp_process_2
usage: mattergen-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun]
[--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
[--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides ...]

Could you please advise on how to properly configure the training for multiple GPUs? Any assistance would be greatly appreciated!

@danielzuegner
Contributor

Hi @Shadowfax-YJ,

can you try passing the --config-name argument first, i.e.,

mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger

Let me know if that helped.

@yuhao1982

@danielzuegner I have a similar question. I only have 4090 GPUs, and a single GPU is not able to run the mattergen-train task. Even when I use four 4090 GPUs, set devices to 4 in trainer/default.yaml, and use the command you wrote above, I still get "out of memory" errors. What solution would you advise if we keep using 4090 GPUs?

@danielzuegner
Contributor

Hi @yuhao1982,

have you tried increasing the gradient accumulation? See this tip from the README:

Tip

Note that a single GPU's memory usually is not enough for the batch size of 512, hence we accumulate gradients over 4 batches. If you still run out of memory, increase this further.

Try successively increasing the trainer.accumulate_grad_batches argument (currently 4) until the batches fit into memory.
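
For example, the accumulation can be raised from the command line by reusing the overrides from above:

mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger trainer.accumulate_grad_batches=8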

@yuhao1982

[Screenshot of the config file]

Is the above config file correct? Even when I change accumulate_grad_batches to 128, it still runs out of memory.

@yuhao1982

[Screenshot of the out-of-memory error on GPU 0]

The above is the error message for GPU 0; the other three GPUs show the same error messages at the same time.

@danielzuegner
Contributor

danielzuegner commented Feb 21, 2025

Looks reasonable. Can you share the full config that gets printed at the start of training?

@ClaudioZeni added the bug label on Feb 21, 2025
@yuhao1982

[Screenshot of the output printed at the start of training]

This is the start of the training.

@danielzuegner
Contributor

Yes, exactly. But this is cut off; I'd need to see the full config.

@danielzuegner
Contributor

Ah, I think I know what the problem is. Can you modify the batch_size in mattergen/conf/data_module/mp_20.yaml to be:

batch_size:
  # total batch size of 512, adjust for number of devices, nodes, and gradient accumulation
  train:  ${eval:'(512 // ${trainer.accumulate_grad_batches}) // (${trainer.devices} * ${trainer.num_nodes})'}
  val: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}
  test: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}

And try again? This correctly accounts for gradient accumulation; we were only doing this for alex_mp_20 so far.
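
For example, with trainer.devices=4, trainer.num_nodes=1, and trainer.accumulate_grad_batches=4, this works out to a per-device train batch size of (512 // 4) // (4 * 1) = 32, and a val/test batch size of 64 // (4 * 1) = 16.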
