Invalid Usage of NCCL Library #265

Open
awsankur opened this issue Dec 17, 2024 · 4 comments

Trying to train tiny llama on 4 P5.48xlarge instances and I see this error:

 0: [rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.23.4
 0: [rank3]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

Library versions used:

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04

ARG GDRCOPY_VERSION=v2.4.1
ARG EFA_INSTALLER_VERSION=1.37.0
ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
ARG NCCL_VERSION=v2.23.4-1
ARG NCCL_TESTS_VERSION=v2.13.10
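
For what it's worth, a quick way to double-check which NCCL build PyTorch actually picks up at runtime inside the container (a sketch, assuming the same pyxis --container-image plugin used in the submission script below):

srun -N 1 --container-image $IMAGE bash -c '
    # NCCL version as reported by PyTorch
    python -c "import torch; print(torch.cuda.nccl.version())"
    # confirm the libfabric EFA provider is visible inside the container
    fi_info -p efa | head -n 5
'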

Parallelism parameters used:

parallelism:
  dp: 4
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  tp: 4
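
For reference, dp × pp × tp = 4 × 2 × 4 = 32 with expert_parallel_size = 1, which matches the 4 nodes × 8 GPUs requested below, so the topology itself is consistent with the world size. A minimal shell check of that invariant (just a sketch; the variable names are mine, and it assumes the product of the parallel dimensions must equal the total number of ranks):

DP=4; PP=2; TP=4
NNODES=4; GPUS_PER_NODE=8
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
# the product of the parallel dims has to match the number of launched processes
if [ $((DP * PP * TP)) -ne "$WORLD_SIZE" ]; then
    echo "parallelism mismatch: dp*pp*tp=$((DP * PP * TP)) vs WORLD_SIZE=$WORLD_SIZE" >&2
fi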

Slurm submission script:

#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH -N 4 # number of nodes to use, 4 p5 = 32 H100 GPUs
#SBATCH --ntasks-per-node 8 # Number of GPU per node
#SBATCH --gpus-per-node=8 # number of GPU we reserve. Uncomment for AWS ParallelCluster
#SBATCH --job-name=huggingface # name of your job
#SBATCH --exclusive
#SBATCH --nodelist=p5-dy-gpu-[9-12]


###########################
###### User Variables #####
###########################

# default variables for Enroot
: "${DATA_PATH:=/fsxl/awsankur}"
: "${FSX_MOUNT:=$DATA_PATH:$DATA_PATH}"
: "${IMAGE:=/fsxl/awsankur/huggingface.sqsh}"

## Plenty of EFA level variables
export FI_PROVIDER=efa
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_HUGE_PAGE=0    # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory.  Disabling huge page causes minor performance hit.
##export FI_EFA_ENABLE_SHM_TRANSFER=1
export NCCL_DEBUG=INFO

export CUDA_DEVICE_MAX_CONNECTIONS=1
# Trying to avoid hangs
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

declare -a ARGS=(
    --container-image $IMAGE
    --container-mount-home
    --container-mounts $FSX_MOUNT
    --no-container-remap-root
)

# Calculate total number of processes
export NNODES=$SLURM_NNODES
export GPUS_PER_NODE=8
export WORLD_SIZE=$(($NNODES * $GPUS_PER_NODE))


declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_NNODES
    --rdzv_id=$SLURM_JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$(hostname)
)

### Nanotron specific
export NANOTRON_BENCHMARK=1
### Disable wandb
export WANDB_MODE=disabled

srun "${ARGS[@]}" -l torchrun "${TORCHRUN_ARGS[@]}" run_train.py --config-file examples/config_tiny_llama.yaml
@awsankur (Author)

issue.zip

Attaching Dockerfile, Slurm submission script and full output log in issue.zip

@awsankur (Author)

I get the same error with an older version of NCCL:

5: [rank12]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.3
 5: [rank12]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

@Hugoch (Member) commented Dec 18, 2024

Hi @awsankur!

From your logs you have

 1: p5-dy-gpu-9:1970263:1970590 [1] init.cc:783 NCCL WARN Duplicate GPU detected : rank 9 and rank 1 both on CUDA device 64000

which indicates that multiple processes are trying to use the same GPUs.

It is likely caused by the way you launch your script via Slurm:

#SBATCH --ntasks-per-node 8 # Number of GPU per node

torchrun is responsible for launching one process per GPU on each node. Here, --ntasks-per-node 8 instructs Slurm to run the srun torchrun command 8 times per node, so you end up with 8 × 8 = 64 processes per node competing for the same 8 GPUs.

Can you retry with the following?

#SBATCH --ntasks-per-node 1 # Only launch 1 torchrun process per node
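
Roughly, the combination would look like this (a sketch adapted from your script; the master-node lookup and the :29500 rendezvous port are my additions, and the container arguments are elided):

#SBATCH -N 4
#SBATCH --ntasks-per-node=1   # one launcher task per node; torchrun spawns the 8 GPU processes
#SBATCH --gpus-per-node=8

export GPUS_PER_NODE=8
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun -l torchrun \
    --nproc_per_node=$GPUS_PER_NODE \
    --nnodes=$SLURM_NNODES \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_NODE:29500 \
    run_train.py --config-file examples/config_tiny_llama.yaml

That way Slurm starts one task per node and torchrun fans out to the 8 local GPUs.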

@awsankur (Author)

I get a CUDA OOM error when I set #SBATCH --ntasks-per-node 1.

 Traceback (most recent call last):
0:   File "/workspace/huggingface/nanotron/run_train.py", line 233, in <module>
0:     trainer = DistributedTrainer(config_file)
0:   File "/workspace/huggingface/nanotron/src/nanotron/trainer.py", line 147, in __init__
0:     self.parallel_context = ParallelContext(
0:   File "/workspace/huggingface/nanotron/src/nanotron/parallel/context.py", line 55, in __init__
0:     self.set_device()
0:   File "/workspace/huggingface/nanotron/src/nanotron/parallel/context.py", line 133, in set_device
0:     torch.cuda.set_device(torch.cuda.device(device_id))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 478, in set_device
0:     torch._C._cuda_setDevice(device)
0: RuntimeError: CUDA error: out of memory
0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
0: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
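
One thing I still need to rule out (just a guess, not confirmed from the logs) is stale processes from the earlier duplicate-GPU runs still holding memory on the nodes. A quick check before relaunching:

# list any compute processes still resident on the four nodes
srun -N 4 --ntasks-per-node=1 -w p5-dy-gpu-[9-12] \
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv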
