Update EFA installer version #3844

Open. Wants to merge 11 commits into base: main.
28 changes: 7 additions & 21 deletions docker/Dockerfile
@@ -34,7 +34,7 @@ ARG MOFED_VERSION=5.5-1.0.3.2

 # Version of EFA Drivers to install (for AWS Elastic Fabric Adapter support)
 # Leave blank for no EFA Drivers
-ARG AWS_OFI_NCCL_VERSION=v1.7.4-aws
+ARG EFA_INSTALLER_VERSION=1.39.0

 # Upgrade certifi to resolve CVE-2022-23491
 ARG CERTIFI_VERSION='>=2022.12.7'
@@ -202,18 +202,17 @@ RUN if [ -z "$PYTORCH_NIGHTLY_URL" ] ; then \
         torchvision==${TORCHVISION_VERSION}.${PYTORCH_NIGHTLY_VERSION} ; \
     fi

-#####################################
-# Install EFA and AWS-OFI-NCCL plugin
-#####################################
+#############
+# Install EFA
+#############

-ARG EFA_INSTALLER_VERSION=1.38.1
-ARG AWS_OFI_NCCL_VERSION
+ARG EFA_INSTALLER_VERSION

 ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
Review comment on this line:

Did you mean to also update LD_LIBRARY_PATH, dropping /opt/aws-ofi-nccl/install/lib in favor of the path from the installer, now that the custom build is gone? According to the referenced doc, it is in /opt/amazon/ofi-nccl.

 ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH
 ENV FI_EFA_USE_DEVICE_RDMA=1

-RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
+RUN if [ -n "$EFA_INSTALLER_VERSION" ] ; then \
         apt-get update && \
         apt-get install -y --no-install-recommends \
             hwloc \
@@ -223,7 +222,7 @@ RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
         rm -rf /var/lib/apt/lists/* ; \
     fi

-RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
+RUN if [ -n "$EFA_INSTALLER_VERSION" ] ; then \
         cd /tmp && \
         curl -OsS https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
         tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
@@ -233,19 +232,6 @@ RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
         rm -rf /tmp/aws-efa-installer* ; \
     fi

-RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
-        git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
-        cd /opt/aws-ofi-nccl && \
-        git checkout ${AWS_OFI_NCCL_VERSION} && \
-        ./autogen.sh && \
-        ./configure --prefix=/opt/aws-ofi-nccl/install \
-            --with-libfabric=/opt/amazon/efa/ \
-            --with-cuda=/usr/local/cuda \
-            --disable-tests \
-            --enable-platform-aws && \
-        make && make install ; \
-    fi
-
 ###################################
 # Mellanox OFED driver installation
 ###################################
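
The review comment on the `ENV LD_LIBRARY_PATH` line asks whether the path should stop pointing at `/opt/aws-ofi-nccl/install/lib` now that the custom build is gone. A minimal Python sketch of that substitution follows. The helper name `swap_library_dir` is hypothetical, and the replacement directory `/opt/amazon/ofi-nccl/lib` is an assumption: the reviewer only names `/opt/amazon/ofi-nccl`, so the exact library subdirectory should be checked against a built image.

```python
def swap_library_dir(search_path: str, old_dir: str, new_dir: str) -> str:
    """Replace old_dir with new_dir in a colon-separated search path.

    Hypothetical helper for illustration; not part of this PR.
    """
    result = []
    for entry in search_path.split(':'):
        # Substitute the stale directory, keep everything else in order.
        candidate = new_dir if entry.rstrip('/') == old_dir.rstrip('/') else entry
        # Skip duplicates so the new directory is not listed twice.
        if candidate not in result:
            result.append(candidate)
    return ':'.join(result)


# Value of LD_LIBRARY_PATH from the Dockerfile, without the trailing $LD_LIBRARY_PATH.
current = (
    '/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:'
    '/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib'
)

# ASSUMPTION: the installer's plugin libraries live under this directory.
updated = swap_library_dir(current, '/opt/aws-ofi-nccl/install/lib', '/opt/amazon/ofi-nccl/lib')
print(updated)
```

If the installer already registers its library directory with the dynamic linker, the entry may be unnecessary altogether; that also needs confirming in the image rather than assuming.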
60 changes: 30 additions & 30 deletions docker/build_matrix.yaml
@@ -1,7 +1,7 @@
 # This file is automatically generated by generate_build_matrix.py. DO NOT EDIT!
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-6-0-cu124
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -15,9 +15,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.6.3
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-6-0-cu126
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -31,9 +31,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: 1.39.0
   IMAGE_NAME: torch-2-6-0-cu124-aws
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -47,9 +47,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.6.3
+  EFA_INSTALLER_VERSION: 1.39.0
   IMAGE_NAME: torch-2-6-0-cu126-aws
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -63,9 +63,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:22.04
+- BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-6-0-cpu
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -79,9 +79,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-5-1-cu124
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -94,9 +94,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: 1.39.0
   IMAGE_NAME: torch-2-5-1-cu124-aws
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -109,9 +109,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:22.04
+- BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-5-1-cpu
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -124,9 +124,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-4-1-cu124
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -139,9 +139,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: 1.39.0
   IMAGE_NAME: torch-2-4-1-cu124-aws
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -154,9 +154,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:22.04
+- BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-4-1-cpu
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -169,9 +169,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-6-0-cu124-ub2004
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -184,9 +184,9 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '20.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu20.04
+- BASE_IMAGE: nvidia/cuda:12.6.3-cudnn-devel-ubuntu20.04
   CUDA_VERSION: 12.6.3
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: torch-2-6-0-cu126-ub2004
   MOFED_VERSION: ''
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -199,10 +199,10 @@
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '20.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+- BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   COMPOSER_INSTALL_COMMAND: mosaicml[all]==0.30.0
   CUDA_VERSION: 12.4.1
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: composer-0-30-0
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
@@ -216,10 +216,10 @@
   TARGET: composer_stage
   TORCHVISION_VERSION: 0.21.0
   UBUNTU_VERSION: '22.04'
-- AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:22.04
+- BASE_IMAGE: ubuntu:22.04
   COMPOSER_INSTALL_COMMAND: mosaicml[all]==0.30.0
   CUDA_VERSION: ''
+  EFA_INSTALLER_VERSION: ''
   IMAGE_NAME: composer-0-30-0-cpu
   MOFED_VERSION: latest-23.10
   NVIDIA_REQUIRE_CUDA_OVERRIDE: ''
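
The regenerated entries can be sanity-checked in a few lines of Python. The sketch below is an assumption about what is worth checking, not part of this PR: the function `check_entry` is hypothetical, and the two sample entries copy only fields visible in the diff above. The rule that EFA images leave `MOFED_VERSION` empty is inferred from the entries shown here, not stated by the author.

```python
def check_entry(entry: dict) -> list:
    """Return a list of problems found in one build-matrix entry.

    Hypothetical check for illustration; not part of this PR.
    """
    problems = []
    # The old key should be gone after regenerating the matrix.
    if 'AWS_OFI_NCCL_VERSION' in entry:
        problems.append('stale AWS_OFI_NCCL_VERSION key')
    # Every entry is expected to carry the new key, even if it is empty.
    if 'EFA_INSTALLER_VERSION' not in entry:
        problems.append('missing EFA_INSTALLER_VERSION key')
    # ASSUMPTION: in the entries shown in this diff, EFA images leave MOFED_VERSION empty.
    elif entry['EFA_INSTALLER_VERSION'] and entry.get('MOFED_VERSION'):
        problems.append('both EFA and MOFED versions are set')
    return problems


# Two entries reduced to the fields visible in the diff.
entries = [
    {'IMAGE_NAME': 'torch-2-6-0-cu124', 'EFA_INSTALLER_VERSION': '', 'MOFED_VERSION': 'latest-23.10'},
    {'IMAGE_NAME': 'torch-2-6-0-cu124-aws', 'EFA_INSTALLER_VERSION': '1.39.0', 'MOFED_VERSION': ''},
]

for entry in entries:
    print(entry['IMAGE_NAME'], check_entry(entry))
```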
9 changes: 5 additions & 4 deletions docker/generate_build_matrix.py
@@ -21,6 +21,7 @@

 PRODUCTION_PYTHON_VERSION = '3.12'
 PRODUCTION_PYTORCH_VERSION = '2.6.0'
+EFA_INSTALLER_VERSION = '1.39.0'
 PRODUCTION_UBUNTU_VERSION = '22.04'


@@ -286,9 +287,9 @@ def _main():

         # Skip EFA drivers if not using EFA
         if interconnect != 'EFA':
-            entry['AWS_OFI_NCCL_VERSION'] = ''
+            entry['EFA_INSTALLER_VERSION'] = ''
         else:
-            entry['AWS_OFI_NCCL_VERSION'] = 'v1.11.0-aws'
+            entry['EFA_INSTALLER_VERSION'] = EFA_INSTALLER_VERSION

         pytorch_entries.append(entry)

@@ -317,7 +318,7 @@ def _main():
             'TARGET': 'composer_stage',
             'TORCHVISION_VERSION': _get_torchvision_version(pytorch_version),
             'MOFED_VERSION': 'latest-23.10',
-            'AWS_OFI_NCCL_VERSION': '',
+            'EFA_INSTALLER_VERSION': '',
             'COMPOSER_INSTALL_COMMAND': f'mosaicml[all]=={composer_version}',
             'TAGS': _get_composer_tags(
                 composer_version=composer_version,
@@ -342,7 +343,7 @@ def _main():
         if entry['CUDA_VERSION']:
             if entry['MOFED_VERSION'] != '':
                 interconnect = 'Infiniband'
-            elif entry['AWS_OFI_NCCL_VERSION'] != '':
+            elif entry['EFA_INSTALLER_VERSION'] != '':
                 interconnect = 'EFA'
         cuda_version = f"{entry['CUDA_VERSION']} ({interconnect})" if entry['CUDA_VERSION'] else 'cpu'
         linux_distro = f"Ubuntu {entry['UBUNTU_VERSION']}"
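
The generator change comes down to two pieces of logic: set `EFA_INSTALLER_VERSION` only for EFA images, and later infer the interconnect from whichever version field is non-empty. The sketch below restates that logic as standalone Python under stated assumptions. The helper names `efa_version_for` and `interconnect_label` are hypothetical, and the value `interconnect` holds before the `if` chain is not visible in the diff, so the `'unknown'` default is a placeholder.

```python
EFA_INSTALLER_VERSION = '1.39.0'  # module-level constant added by this PR


def efa_version_for(interconnect: str) -> str:
    """Return the EFA installer version for an image, or '' to skip EFA."""
    return EFA_INSTALLER_VERSION if interconnect == 'EFA' else ''


def interconnect_label(entry: dict) -> str:
    """Infer the interconnect of a CUDA image from its version fields."""
    # ASSUMPTION: the real default is not shown in the diff.
    interconnect = 'unknown'
    if entry['CUDA_VERSION']:
        if entry['MOFED_VERSION'] != '':
            interconnect = 'Infiniband'
        elif entry['EFA_INSTALLER_VERSION'] != '':
            interconnect = 'EFA'
    return interconnect


entry = {
    'CUDA_VERSION': '12.4.1',
    'MOFED_VERSION': '',
    'EFA_INSTALLER_VERSION': efa_version_for('EFA'),
}
print(interconnect_label(entry))  # prints "EFA"
```

Keeping the version in one constant means a future installer bump touches a single line in the generator, with `build_matrix.yaml` regenerated from it.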