
Issue with Multi-GPU Training Configuration #61

Open
Shadowfax-YJ opened this issue Feb 17, 2025 · 9 comments
Labels: bug (Something isn't working)

Comments

@Shadowfax-YJ

I greatly appreciate your work on this project! I encountered an issue while trying to set up multi-GPU training, and I was hoping you could provide some guidance. I modified the devices parameter in conf/trainer/default.yaml to 4, then ran the following command:

mattergen-train data_module=mp_20 ~trainer.logger --config-name=csp
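
(For reference, the devices change in conf/trainer/default.yaml was roughly along these lines; the surrounding keys are assumed here and may differ in the actual file:)

# conf/trainer/default.yaml (sketch; only the relevant key shown)
devices: 4  # run training across 4 GPUs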

However, I encountered the following error:

mattergen-train: error: unrecognized arguments: hydra.run.dir="/mattergen/outputs/singlerun/2025-02-17/17-24-58" hydra.job.name=train_ddp_process_2
usage: mattergen-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun]
[--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
[--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides ...]

Could you please advise on how to properly configure the training for multiple GPUs? Any assistance would be greatly appreciated!

@danielzuegner
Contributor

Hi @Shadowfax-YJ,

can you try passing the --config-name argument first, i.e.,

mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger

Let me know if that helped.

@yuhao1982

@danielzuegner I have a similar question. I only have 4090 GPUs, and a single GPU is not able to run the mattergen-train task. Even when I use four 4090 GPUs, set devices to 4 in trainer/default.yaml, and use the command you wrote above, I still get "out of memory" errors. What solution would you advise if we keep using 4090 GPUs?

@danielzuegner
Contributor

Hi @yuhao1982,

have you tried increasing the gradient accumulation? See this tip from the README:

Tip

Note that a single GPU's memory usually is not enough for the batch size of 512, hence we accumulate gradients over 4 batches. If you still run out of memory, increase this further.

Try successively increasing the trainer.accumulate_grad_batches argument (currently 4) until the batches fit into memory.
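
For example, the accumulation can be raised from the command line by reusing the overrides from above:

mattergen-train --config-name=csp data_module=mp_20 ~trainer.logger trainer.accumulate_grad_batches=8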

@yuhao1982

[Screenshot of the config file]

Is the above config file correct? Even when I change accumulate_grad_batches to 128, it still runs out of memory.

@yuhao1982

[Screenshot of the out-of-memory error on GPU 0]

The above is the error message for GPU 0; the other three GPUs show the same error messages at the same time.

@danielzuegner
Contributor

danielzuegner commented Feb 21, 2025

Looks reasonable. Can you share the full config that gets printed at the start of training?

@ClaudioZeni added the bug label on Feb 21, 2025
@yuhao1982

[Screenshot of the output printed at the start of training]

This is the start of the training.

@danielzuegner
Contributor

Yes, exactly. But this is cut off; I'd need to see the full config.

@danielzuegner
Contributor

Ah, I think I know what the problem is. Can you modify the batch_size in mattergen/conf/data_module/mp_20.yaml to be:

batch_size:
  # total batch size of 512, adjust for number of devices, nodes, and gradient accumulation
  train:  ${eval:'(512 // ${trainer.accumulate_grad_batches}) // (${trainer.devices} * ${trainer.num_nodes})'}
  val: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}
  test: ${eval:'64 // (${trainer.devices} * ${trainer.num_nodes})'}

And try again? This correctly accounts for gradient accumulation; we were only doing this for alex_mp_20 so far.
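
For example, with trainer.devices=4, trainer.num_nodes=1, and trainer.accumulate_grad_batches=4, this works out to a per-device train batch size of (512 // 4) // (4 * 1) = 32, and a val/test batch size of 64 // (4 * 1) = 16.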
