Invalid Usage of NCCL Library #265

Open
awsankur opened this issue Dec 17, 2024 · 4 comments

Trying to train tiny llama on 4 P5.48xlarge instances and I see this error:

 0: [rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.23.4
 0: [rank3]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

Library versions used:

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04

ARG GDRCOPY_VERSION=v2.4.1
ARG EFA_INSTALLER_VERSION=1.37.0
ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
ARG NCCL_VERSION=v2.23.4-1
ARG NCCL_TESTS_VERSION=v2.13.10
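
For what it's worth, a quick way to double-check which NCCL build PyTorch actually picks up at runtime inside the container (a sketch, assuming the same pyxis --container-image plugin used in the submission script below):

srun -N 1 --container-image $IMAGE bash -c '
    # NCCL version as reported by PyTorch
    python -c "import torch; print(torch.cuda.nccl.version())"
    # confirm the libfabric EFA provider is visible inside the container
    fi_info -p efa | head -n 5
'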

Parallelism parameters used:

parallelism:
  dp: 4
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  tp: 4
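
For reference, dp × pp × tp = 4 × 2 × 4 = 32 with expert_parallel_size = 1, which matches the 4 nodes × 8 GPUs requested below, so the topology itself is consistent with the world size. A minimal shell check of that invariant (just a sketch; the variable names are mine, and it assumes the product of the parallel dimensions must equal the total number of ranks):

DP=4; PP=2; TP=4
NNODES=4; GPUS_PER_NODE=8
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
# the product of the parallel dims has to match the number of launched processes
if [ $((DP * PP * TP)) -ne "$WORLD_SIZE" ]; then
    echo "parallelism mismatch: dp*pp*tp=$((DP * PP * TP)) vs WORLD_SIZE=$WORLD_SIZE" >&2
fi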

Slurm submission script:

#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH -N 4 # number of nodes to use, 4 p5 = 32 H100 GPUs
#SBATCH --ntasks-per-node 8 # Number of GPU per node
#SBATCH --gpus-per-node=8 # number of GPU we reserve. Uncomment for AWS ParallelCluster
#SBATCH --job-name=huggingface # name of your job
#SBATCH --exclusive
#SBATCH --nodelist=p5-dy-gpu-[9-12]


###########################
###### User Variables #####
###########################

# default variables for Enroot
: "${DATA_PATH:=/fsxl/awsankur}"
: "${FSX_MOUNT:=$DATA_PATH:$DATA_PATH}"
: "${IMAGE:=/fsxl/awsankur/huggingface.sqsh}"

## Plenty of EFA level variables
export FI_PROVIDER=efa
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_HUGE_PAGE=0    # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory.  Disabling huge page causes minor performance hit.
##export FI_EFA_ENABLE_SHM_TRANSFER=1
export NCCL_DEBUG=INFO

export CUDA_DEVICE_MAX_CONNECTIONS=1
# Trying to avoid hangs
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

declare -a ARGS=(
    --container-image $IMAGE
    --container-mount-home
    --container-mounts $FSX_MOUNT
    --no-container-remap-root
)

# Calculate total number of processes
export NNODES=$SLURM_NNODES
export GPUS_PER_NODE=8
export WORLD_SIZE=$(($NNODES * $GPUS_PER_NODE))


declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_NNODES
    --rdzv_id=$SLURM_JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$(hostname)
)

### Nanotron specific
export NANOTRON_BENCHMARK=1
### Disable wandb
export WANDB_MODE=disabled

srun "${ARGS[@]}" -l torchrun "${TORCHRUN_ARGS[@]}" run_train.py --config-file examples/config_tiny_llama.yaml
@awsankur (Author)

issue.zip

Attaching Dockerfile, Slurm submission script and full output log in issue.zip

@awsankur (Author)

I get the same error with an older version of NCCL:

5: [rank12]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.3
 5: [rank12]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

@Hugoch (Member) commented Dec 18, 2024

Hi @awsankur!

From your logs you have

 1: p5-dy-gpu-9:1970263:1970590 [1] init.cc:783 NCCL WARN Duplicate GPU detected : rank 9 and rank 1 both on CUDA device 64000

which indicates that multiple processes are trying to use the same GPUs.

It is likely caused by the way you launch your script via Slurm:

#SBATCH --ntasks-per-node 8 # Number of GPU per node

torchrun is responsible for launching one process per GPU on each node. Here, --ntasks-per-node 8 instructs Slurm to run the srun torchrun command 8 times per node, so you end up with 8 × 8 = 64 processes per node competing for the same 8 GPUs.

Can you retry with the following?

#SBATCH --ntasks-per-node 1 # Only launch 1 torchrun process per node
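
Roughly, the combination would look like this (a sketch adapted from your script; the master-node lookup and the :29500 rendezvous port are my additions, and the container arguments are elided):

#SBATCH -N 4
#SBATCH --ntasks-per-node=1   # one launcher task per node; torchrun spawns the 8 GPU processes
#SBATCH --gpus-per-node=8

export GPUS_PER_NODE=8
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun -l torchrun \
    --nproc_per_node=$GPUS_PER_NODE \
    --nnodes=$SLURM_NNODES \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_NODE:29500 \
    run_train.py --config-file examples/config_tiny_llama.yaml

That way Slurm starts one task per node and torchrun fans out to the 8 local GPUs.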

@awsankur (Author)

I get a CUDA OOM error when I set #SBATCH --ntasks-per-node 1.

 Traceback (most recent call last):
0:   File "/workspace/huggingface/nanotron/run_train.py", line 233, in <module>
0:     trainer = DistributedTrainer(config_file)
0:   File "/workspace/huggingface/nanotron/src/nanotron/trainer.py", line 147, in __init__
0:     self.parallel_context = ParallelContext(
0:   File "/workspace/huggingface/nanotron/src/nanotron/parallel/context.py", line 55, in __init__
0:     self.set_device()
0:   File "/workspace/huggingface/nanotron/src/nanotron/parallel/context.py", line 133, in set_device
0:     torch.cuda.set_device(torch.cuda.device(device_id))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 478, in set_device
0:     torch._C._cuda_setDevice(device)
0: RuntimeError: CUDA error: out of memory
0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
0: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
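
One thing I still need to rule out (just a guess, not confirmed from the logs) is stale processes from the earlier duplicate-GPU runs still holding memory on the nodes. A quick check before relaunching:

# list any compute processes still resident on the four nodes
srun -N 4 --ntasks-per-node=1 -w p5-dy-gpu-[9-12] \
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv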
