[ci] Hold CUDA packages in GPU base image to stop cuda-compat churn #64116

Open
elliot-barn wants to merge 1 commit into master from elliot-barn-hold-cuda-pkgs-gpu-base

@elliot-barn elliot-barn commented Jun 15, 2026

Collaborator

The blanket `apt-get upgrade` pulls whatever cuda-compat/toolkit version
NVIDIA's mutable apt index advertises, which causes transient .deb 404s on
rebuilds and churns the container's effective CUDA driver version (triggering an
NCCL cuMem peer-access failure on non-P2P GPUs, CUDA error 217; see
NVIDIA/nccl#1838). `apt-mark hold` the cuda-* packages before upgrading so the
base image's toolkit is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
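
To illustrate which packages a `cuda-*` pattern would cover, here is a minimal sketch. It assumes the pattern behaves like an ordinary shell glob on package names; `matches_cuda_glob` is a hypothetical helper and the package names are illustrative only, since the real set depends on what the base image has installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the PR): returns 0 when a package name
# matches the glob cuda-*, and 1 otherwise.
matches_cuda_glob() {
  case "$1" in
    cuda-*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Illustrative package names; the actual list depends on the base image.
for pkg in cuda-compat-12-4 cuda-toolkit-12-4 libnccl2 libcudnn8; do
  if matches_cuda_glob "$pkg"; then
    echo "held:     $pkg"
  else
    echo "not held: $pkg"
  fi
done
```

Under that assumption, only names that start with `cuda-` are selected; libraries such as NCCL or cuDNN that are packaged under other prefixes would still be eligible for upgrade.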

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the GPU base Dockerfile to pin CUDA packages before running `apt-get upgrade` to prevent issues with mutable CUDA packages. Feedback suggests improving the package pinning command to robustly handle cases where no CUDA packages are found, avoiding potential shell exits due to `set -o pipefail`.

# container's effective CUDA driver version (via cuda-compat forward-compat),
# which can flip on NCCL's cuMem host path and break multi-GPU collectives on
# non-P2P GPUs (CUDA error 217). See NVIDIA/nccl#1838.
apt-mark hold $(dpkg-query -W -f='${Package}\n' 'cuda-*' 2>/dev/null) 2>/dev/null || true


medium

Since `set -o pipefail` is active, we must be careful with pipelines. If `dpkg-query` finds no packages matching `cuda-*`, it exits with status 1. Piping into `xargs -r` and appending `|| true` handles this without triggering a shell exit, and `xargs -r` also avoids calling `apt-mark hold` with no package arguments.

dpkg-query -W -f='${Package}\n' 'cuda-*' 2>/dev/null | xargs -r apt-mark hold || true
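
The behaviour of the suggested pipeline can be sketched without apt installed. In this minimal sketch, `no_match` and `two_matches` are hypothetical stand-ins for `dpkg-query` (one exits 1 with no output, the other prints two illustrative package names), and `echo hold` stands in for `apt-mark hold`. It assumes an `xargs` that supports `-r`, as GNU xargs does.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a dpkg-query call that matches nothing: no output, exit status 1.
no_match() { return 1; }

# Stand-in for a dpkg-query call that matches two (illustrative) packages.
two_matches() { printf '%s\n' cuda-compat-12-4 cuda-toolkit-12-4; }

# `echo hold` replaces `apt-mark hold` so the sketch runs anywhere.
# With pipefail, the failing left side fails the whole pipeline; `|| true`
# absorbs that, and `xargs -r` skips the command when stdin is empty.
empty_case=$(no_match | xargs -r echo hold || true)
full_case=$(two_matches | xargs -r echo hold || true)

echo "empty: '${empty_case}'"
echo "full:  '${full_case}'"
```

In the empty case nothing is executed and the script keeps going despite `set -e`; in the populated case the stand-in command receives both names as arguments in a single invocation.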

@elliot-barn elliot-barn added the `go` label (add ONLY when ready to merge, run all tests) Jun 16, 2026
@elliot-barn elliot-barn marked this pull request as ready for review June 16, 2026 02:11
@elliot-barn elliot-barn requested a review from a team as a code owner June 16, 2026 02:11
@elliot-barn elliot-barn requested a review from Sparks0219 June 16, 2026 02:11
@ray-gardener ray-gardener Bot added the devprod label Jun 16, 2026