[ci] Hold CUDA packages in GPU base image to stop cuda-compat churn #64116
elliot-barn wants to merge 1 commit into
The blanket `apt-get upgrade` pulls whatever cuda-compat/toolkit version NVIDIA's mutable apt index advertises. This causes transient .deb 404s on rebuilds and churns the container's effective CUDA driver version, which triggers an NCCL cuMem peer-access failure on non-P2P GPUs (CUDA error 217, NVIDIA/nccl#1838). `apt-mark hold` the cuda-* packages before upgrading so the base image's toolkit is preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
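One way to confirm that the hold covers every installed cuda-* package is to diff the installed list against the held list. The following is a minimal sketch, not part of this PR: the `unheld` helper and the sample package names are hypothetical, and the sample strings stand in for the output of `dpkg-query` and `apt-mark showhold` so the sketch runs without apt.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print every name in the first list that is absent from the second list.
# Both arguments are newline-separated package names.
unheld() {
  comm -23 <(printf '%s\n' "$1" | sort -u) <(printf '%s\n' "$2" | sort -u)
}

# Made-up sample data standing in for:
#   installed=$(dpkg-query -W -f='${Package}\n' 'cuda-*')
#   held=$(apt-mark showhold)
installed=$'cuda-compat-12-4\ncuda-toolkit-12-4'
held=$'cuda-toolkit-12-4'

# Any name printed here is a package the upgrade could still replace.
missing=$(unheld "$installed" "$held")
echo "not held: ${missing:-none}"
```

In an image build, a non-empty `missing` could be treated as a failure so that an unheld CUDA package is caught before `apt-get upgrade` runs.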
Code Review
This pull request updates the GPU base Dockerfile to pin CUDA packages before running apt-get upgrade to prevent issues with mutable CUDA packages. Feedback suggests improving the package pinning command to robustly handle cases where no CUDA packages are found, avoiding potential shell exits due to set -o pipefail.
# container's effective CUDA driver version (via cuda-compat forward-compat),
# which can flip on NCCL's cuMem host path and break multi-GPU collectives on
# non-P2P GPUs (CUDA error 217). See NVIDIA/nccl#1838.
apt-mark hold $(dpkg-query -W -f='${Package}\n' 'cuda-*' 2>/dev/null) 2>/dev/null || true
Since `set -o pipefail` is active, we must be careful with pipelines. If `dpkg-query` finds no packages matching `cuda-*`, it exits with status 1. Using a pipeline with `xargs -r` and appending `|| true` is a robust way to handle this without triggering a shell exit, while also avoiding running `apt-mark hold` with empty arguments (which would otherwise list all held packages).
dpkg-query -W -f='${Package}\n' 'cuda-*' 2>/dev/null | xargs -r apt-mark hold || true
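The behaviour of the suggested pipeline under `set -o pipefail` can be checked without dpkg or apt. This is a rough sketch under stated assumptions: `query_none`, `query_some`, and `hold_from` are hypothetical stand-ins, the package names are made up, and `echo apt-mark hold` prints the command instead of running it. It relies on the `-r` option of GNU `xargs`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins so the sketch runs without dpkg or apt installed.
# query_none mimics a query that matches nothing: no output, exit status 1.
query_none() { return 1; }
# query_some mimics a match; the package names are made up for illustration.
query_some() { printf '%s\n' cuda-compat-12-4 cuda-toolkit-12-4; }

# Pipe the query into xargs -r, which skips the command on empty input.
# The trailing `|| true` absorbs the non-zero pipeline status from pipefail.
hold_from() {
  "$1" 2>/dev/null | xargs -r echo apt-mark hold || true
}

none_out=$(hold_from query_none)
some_out=$(hold_from query_some)
printf 'no match: [%s]\n' "$none_out"
printf 'match:    [%s]\n' "$some_out"
```

With no match, the script keeps running and the hold command is never invoked. With matches, all package names are passed to a single invocation.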