Training from checkpoint fails when using multiple GPUs

Hi, 

Thanks for implementing the paper in PyTorch. I am having trouble resuming training from a checkpoint when using multiple GPUs.

When using a single GPU I can run `train.py` using the following command:
```
python -m torch.distributed.launch --nproc_per_node=1 train.py --batch 16 --iter 150000 --ckpt checkpoint/start_ckpt.pt dataset
```
The first sample images look like they continue from the point where I previously stopped training, so the checkpoint appears to load correctly.

When I use 4 GPUs with the following command, I run into a CUDA out-of-memory error:
```
python -m torch.distributed.launch --nproc_per_node=4 train.py --batch 16 --iter 150000 --ckpt checkpoint/start_ckpt.pt dataset
```

I then get the following CUDA out-of-memory error:

```
load model: checkpoint/start_ckpt.pt
load model: checkpoint/start_ckpt.pt
load model: checkpoint/start_ckpt.pt
Traceback (most recent call last):
  File "trainoad model: checkpoint/start_ckpt.pt
    train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
  File "train.py", line 189, in train
    real_pred = discriminator(real_img)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jameswhennessey/stylegan2-pytorch/model.py", line 647, in forward
    out = self.convs(input)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jameswhennessey/stylegan2-pytorch/model.py", line 598, in forward
    out = self.conv2(out)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 82, in forward
    return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
  File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 86, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 55, in forward
    out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 86, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/home/jameswhennessey/stylegan2-pytorch/op/fused_act.py", line 55, in forward
    out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.75 GiB total capacity; 8.03 GiB already allocated; 101.38 MiB free; 565.07 MiB cached) (malloc at /opt/conda/conda-bld/pytorch_1573049306803/work/c10/cuda/CUDACachingAllocator.cpp:267)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f397a01f687 in /opt/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
......
```
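The memory figures in the `RuntimeError` can be broken down with a little arithmetic. The sketch below is only a rough diagnostic, not a confirmed explanation: it parses the sizes from the message and subtracts this process's reported figures from the total capacity. The helper names `parse_oom` and `unaccounted_mib` are hypothetical, and it assumes that "cached" means memory held by this process's caching allocator on top of "already allocated".

```python
import re

# Memory figures copied verbatim from the RuntimeError in the traceback.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.75 GiB total "
    "capacity; 8.03 GiB already allocated; 101.38 MiB free; 565.07 MiB cached)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the GPU index and the reported sizes, converted to MiB."""
    gpu = int(re.search(r"GPU (\d+)", message).group(1))
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|cached)"
    sizes = {
        label: float(value) * UNIT_TO_MIB[unit]
        for value, unit, label in re.findall(pattern, message)
    }
    return gpu, sizes


def unaccounted_mib(sizes):
    """Capacity left over after this process's allocated, cached and free figures.

    Assumption: "cached" is extra memory held by this process's allocator.
    The remainder covers CUDA context overhead and any other processes.
    """
    return (
        sizes["total capacity"]
        - sizes["already allocated"]
        - sizes["free"]
        - sizes["cached"]
    )


gpu, sizes = parse_oom(OOM_MESSAGE)
print(f"GPU {gpu}: {unaccounted_mib(sizes):.2f} MiB not covered by this process's figures")
```

Under that assumption, roughly 7 GiB on GPU 0 is not explained by the failing process. That would be consistent with other ranks also holding memory on GPU 0, but it does not prove it. Watching `nvidia-smi` during startup would show which processes occupy GPU 0. If several ranks do appear there, one thing worth checking is whether the checkpoint is loaded with the `map_location` argument of `torch.load`.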

Do you have any suggestions on how to resolve this issue? 

Thanks in advance. 


