From b7fc4177591c8af967523907d3c491bf142f8c98 Mon Sep 17 00:00:00 2001
From: Keita Watanabe
Date: Fri, 1 May 2026 08:48:57 +0000
Subject: [PATCH] nemo: drop OMPI_MCA_* Dockerfile ENVs (mpirun-only, never invoked)

Per AWS NCCL/EFA team review (Brian Barrett, bbarrett@), the following 3 ENVs
in 3.test_cases/megatron/nemo/{Dockerfile,kubernetes/Dockerfile} are
unnecessary:

  - OMPI_MCA_pml=^cm,ucx
  - OMPI_MCA_btl=tcp,self
  - OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent

These are OpenMPI MCA flags. They only affect mpirun behavior. A code audit
confirms the test case never invokes mpirun:

  $ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
        3.test_cases/megatron/nemo/kubernetes/ \
        --include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
  (no matches)

The launchers all use srun --container-image (Pyxis) -> ft_launcher ->
torchrun-style worker spawn (slurm/run.py via nemo-run), or kubeflow
PyTorchJob (kubernetes/{pretrain,finetune}_*.py). None of these read
OMPI_MCA_* env vars.

The only mpirun reference in this test case is the Dockerfile-level
"mpirun-real" wrapper (Dockerfile lines 109-112; same in
kubernetes/Dockerfile lines 113-116) which redefines mpirun as a thin shim
over mpirun.real. The wrapper exists for users who drop into the container
interactively; it is never invoked by the test case scripts. A user running
mpirun manually can set MCA flags on the command line.

Kept: OPAL_PREFIX (legitimate; needed for OpenMPI lib resolution),
NCCL_SOCKET_IFNAME (Brian did not flag; conservative keep).
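
For a manual run, the three dropped settings could be passed on the command
line roughly as sketched here. This relies on the Open MPI convention that an
environment variable OMPI_MCA_<name>=<value> corresponds to
"--mca <name> <value>" on the mpirun command line. The process count and
./my_mpi_app are hypothetical placeholders, and the sketch only prints the
command instead of running it:

```shell
# Translate the removed OMPI_MCA_<name>=<value> settings into --mca flags.
mca_flags=""
for kv in "pml=^cm,ucx" "btl=tcp,self" "btl_tcp_if_exclude=lo,docker0,veth_def_agent"; do
  name=${kv%%=*}    # text before the first '='
  value=${kv#*=}    # text after the first '='
  mca_flags="${mca_flags} --mca ${name} ${value}"
done
mca_flags=${mca_flags# }    # drop the leading space

# Print the resulting command; "-np 2" and "./my_mpi_app" are placeholders.
echo "mpirun ${mca_flags} -np 2 ./my_mpi_app"
```

Setting the flags per invocation keeps them out of the image, so they apply
only to the interactive mpirun runs that actually read them.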
---
 3.test_cases/megatron/nemo/Dockerfile            | 6 +-----
 3.test_cases/megatron/nemo/kubernetes/Dockerfile | 6 +-----
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/3.test_cases/megatron/nemo/Dockerfile b/3.test_cases/megatron/nemo/Dockerfile
index e10cbfc57..bac5bd711 100644
--- a/3.test_cases/megatron/nemo/Dockerfile
+++ b/3.test_cases/megatron/nemo/Dockerfile
@@ -122,11 +122,7 @@ RUN mv $OPEN_MPI_PATH/bin/mpirun $OPEN_MPI_PATH/bin/mpirun.real \
 ######################
 RUN pip install transformers==${TRANSFORMERS_VERSION} sentencepiece python-etcd
 
-## Set Open MPI variables to exclude network interface and conduit.
-ENV OMPI_MCA_pml=^cm,ucx \
-    OMPI_MCA_btl=tcp,self \
-    OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent\
-    OPAL_PREFIX=/opt/amazon/openmpi \
+ENV OPAL_PREFIX=/opt/amazon/openmpi \
     NCCL_SOCKET_IFNAME=^docker,lo,veth
 
 ## Turn off PMIx Error https://github.com/open-mpi/ompi/issues/7516
diff --git a/3.test_cases/megatron/nemo/kubernetes/Dockerfile b/3.test_cases/megatron/nemo/kubernetes/Dockerfile
index bcb2f9c2b..e93015aa6 100644
--- a/3.test_cases/megatron/nemo/kubernetes/Dockerfile
+++ b/3.test_cases/megatron/nemo/kubernetes/Dockerfile
@@ -125,11 +125,7 @@ RUN mv $OPEN_MPI_PATH/bin/mpirun $OPEN_MPI_PATH/bin/mpirun.real \
 ######################
 RUN pip install transformers==${TRANSFORMERS_VERSION} sentencepiece python-etcd
 
-## Set Open MPI variables to exclude network interface and conduit.
-ENV OMPI_MCA_pml=^cm,ucx \
-    OMPI_MCA_btl=tcp,self \
-    OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent\
-    OPAL_PREFIX=/opt/amazon/openmpi \
+ENV OPAL_PREFIX=/opt/amazon/openmpi \
     NCCL_SOCKET_IFNAME=^docker,lo,veth
 
 ## Turn off PMIx Error https://github.com/open-mpi/ompi/issues/7516