Skip to content

Training from checkpoint fails when using multiple GPUs #28

Closed
@JWHennessey

Description

@JWHennessey

Hi,

Thanks for implementing the paper in pytorch. I am having problems training a model starting from a specific checkpoint when using multiple GPUs.

When using a single GPU I can run train.py using the following command:
python -m torch.distributed.launch --nproc_per_node=1 train.py --batch 16 --iter 150000 --ckpt checkpoint/start_ckpt.pt dataset
The first sample images look like they are from when I stopped training the network previously.

When I use 4 GPUs I run into a Cuda out of memory issue, using the command:
python -m torch.distributed.launch --nproc_per_node=4 train.py --batch 16 --iter 150000 --ckpt checkpoint/start_ckpt.pt dataset

I then get the following CUDA: Out of memory error

load model: checkpoint/start_ckpt.pt load model: checkpoint/start_ckpt.pt load model: checkpoint/start_ckpt.pt Traceback (most recent call last): File "trainoad model: checkpoint/start_ckpt.pt train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device) File "train.py", line 189, in train real_pred = discriminator(real_img) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward output = self.module(*inputs[0], **kwargs[0]) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/home/jameswhennessey/stylegan2-pytorch/model.py", line 647, in forward out = self.convs(input) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/home/jameswhennessey/stylegan2-pytorch/model.py", line 598, in forward out = self.conv2(out) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 82, in forward return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale) File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 86, in fused_leaky_relu return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale) File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 55, in forward out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale) File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 86, in fused_leaky_relu return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale) File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 55, in forward out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale) RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.75 GiB total capacity; 8.03 GiB already allocated; 101.38 MiB free; 565.07 MiB cached) (malloc at /opt/conda/conda-bld/pytorch_1573049306803/work/c10/cuda/CUDACachingAllocator.cpp:267) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f397a01f687 in /opt/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so) ......

Do you have any suggestions on how to resolve this issue?

Thanks in advance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions