Commit 867ac74

fxmarty authored and ErikKaum committed
Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD
1 parent 6ab7ade commit 867ac74

2 files changed: 7 additions and 2 deletions

Dockerfile

Lines changed: 6 additions & 1 deletion
@@ -40,7 +40,9 @@ RUN cargo build --profile release-opt
 # Adapted from: https://github.com/pytorch/pytorch/blob/master/Dockerfile
 FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS pytorch-install
 
+# NOTE: When updating PyTorch version, beware to remove `pip install nvidia-nccl-cu12==2.22.3` below in the Dockerfile. Context: https://github.com/huggingface/text-generation-inference/pull/2099
 ARG PYTORCH_VERSION=2.3.0
+
 ARG PYTHON_VERSION=3.10
 # Keep in sync with `server/pyproject.toml
 ARG CUDA_VERSION=12.1
@@ -241,7 +243,10 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_cuda.txt && \
-    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
+    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
+    pip install nvidia-nccl-cu12==2.22.3
+
+ENV LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2
 
 # Deps before the binaries
 # The binaries change on every build given we burn the SHA into them
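
Note: the hunk above pins `nvidia-nccl-cu12==2.22.3` and sets `LD_PRELOAD` to the library that package installs. One way to sanity-check this inside the built image is sketched below. It is only an illustration, not part of the commit: the function name `nccl_preload_report` is hypothetical, it relies on the Linux-specific `/proc/self/maps` file, and it assumes that matching `libnccl` in a file name is enough to identify the library.

```python
import os
from pathlib import Path


def nccl_preload_report(environ=None, maps_path="/proc/self/maps"):
    """Report which NCCL libraries LD_PRELOAD names and which are mapped.

    Hypothetical helper for illustration; not part of text-generation-inference.
    """
    environ = os.environ if environ is None else environ
    # LD_PRELOAD entries may be separated by spaces or colons.
    entries = environ.get("LD_PRELOAD", "").replace(":", " ").split()
    preload = [e for e in entries if "libnccl" in Path(e).name]

    mapped = set()
    try:
        with open(maps_path) as maps:
            for line in maps:
                fields = line.split(maxsplit=5)
                # The sixth field, when present, is the backing file path.
                if len(fields) == 6 and "libnccl" in fields[5]:
                    mapped.add(fields[5].strip())
    except OSError:
        # /proc is Linux-specific; report nothing mapped if it is unreadable.
        pass

    return {
        "preload_entries": preload,
        "preload_missing": [e for e in preload if not Path(e).is_file()],
        "mapped_libraries": sorted(mapped),
    }


if __name__ == "__main__":
    print(nccl_preload_report())
```

In a container built from this Dockerfile, one would expect `preload_missing` to be empty and `mapped_libraries` to contain the path given in `ENV LD_PRELOAD`. A non-empty `preload_missing` would suggest the path no longer matches the installed package, for example after a Python version change.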

server/Makefile

Lines changed: 1 addition & 1 deletion
@@ -35,5 +35,5 @@ run-dev:
 	SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=2 text_generation_server/cli.py serve bigscience/bloom-560m --sharded
 
 export-requirements:
-	poetry export -o requirements_cuda.txt --without-hashes
+	poetry export -o requirements_cuda.txt --without-hashes --with cuda
 	poetry export -o requirements_rocm.txt --without-hashes
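
Note: the commit message says poetry cannot handle the conflict between torch and nccl, which is presumably why the Dockerfile applies the NCCL pin with a separate `pip install` after `requirements_cuda.txt`. To see what the exported file itself pins, a small parser along these lines could be used. This is a sketch only: `pinned_version` is a hypothetical helper, it handles only simple `name==version` lines with optional extras and environment markers, and it is not a full requirements-file parser.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with `==` for `package`, or None if absent.

    Hypothetical helper for illustration; handles only `name==version` lines.
    """

    def normalize(name):
        # Package names compare case-insensitively, with runs of -, _ and .
        # treated as equivalent (PEP 503 normalization).
        return re.sub(r"[-_.]+", "-", name).lower()

    wanted = normalize(package)
    pattern = re.compile(
        r"([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*==\s*([^\s;]+)"
    )
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = pattern.match(line)
        if match and normalize(match.group(1)) == wanted:
            return match.group(2)
    return None


if __name__ == "__main__":
    # Reads the file produced by `make export-requirements`, if present.
    with open("requirements_cuda.txt") as handle:
        print(pinned_version(handle.read(), "nvidia-nccl-cu12"))
```

If the printed version differs from 2.22.3, the later `pip install nvidia-nccl-cu12==2.22.3` step in the Dockerfile is what overrides it in the image.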
