nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review #1078
Merged
Per AWS NCCL/EFA team review (Brian Barrett, bbarrett@), the following
3 ENVs in 3.test_cases/megatron/nemo/{Dockerfile,kubernetes/Dockerfile}
are unnecessary:
- OMPI_MCA_pml=^cm,ucx
- OMPI_MCA_btl=tcp,self
- OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent
These are OpenMPI MCA flags. They only affect mpirun behavior. A code
audit confirms the test case never invokes mpirun:
$ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
3.test_cases/megatron/nemo/kubernetes/ \
--include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
(no matches)
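The same audit can be sketched in Python for use in a CI check. This is a minimal sketch, not part of the PR: the function name find_mpi_launcher_refs is hypothetical, and matching on file suffix only approximates grep's --include globs.

```python
import re
from pathlib import Path

# File types the grep audit looks at (from the --include flags above).
LAUNCHER_SUFFIXES = {".sh", ".py", ".sbatch", ".yaml"}
MPI_LAUNCHER = re.compile(r"mpirun|mpiexec")


def find_mpi_launcher_refs(roots, suffixes=LAUNCHER_SUFFIXES):
    """Return (path, line_number, line) for every line mentioning mpirun or mpiexec.

    Hypothetical helper: walks each root recursively and scans only files
    whose suffix is in `suffixes`.
    """
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if MPI_LAUNCHER.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling find_mpi_launcher_refs(["3.test_cases/megatron/nemo/slurm", "3.test_cases/megatron/nemo/kubernetes"]) from the repository root should return an empty list if the grep result above holds.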
The launchers all use srun --container-image (Pyxis) -> ft_launcher
-> torchrun-style worker spawn (slurm/run.py via nemo-run), or
kubeflow PyTorchJob (kubernetes/{pretrain,finetune}_*.py). None of
these read OMPI_MCA_* env vars.
The only mpirun reference in this test case is the Dockerfile-level
"mpirun-real" wrapper (Dockerfile lines 109-112; same in
kubernetes/Dockerfile lines 113-116) which redefines mpirun as a thin
shim over mpirun.real. The wrapper exists for users who drop into the
container interactively; it is never invoked by the test case scripts.
A user running mpirun manually can set MCA flags on the command line.
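For an interactive user who still wants the old settings, the removed ENVs map directly onto Open MPI's --mca <name> <value> command-line option. The sketch below shows that translation; the helper name mca_cli_args and the "-np 2 ./my_app" tail are hypothetical placeholders, not something this test case ships.

```python
import shlex

OMPI_MCA_PREFIX = "OMPI_MCA_"


def mca_cli_args(env):
    """Translate OMPI_MCA_<name>=<value> entries into ["--mca", name, value, ...].

    Entries without the OMPI_MCA_ prefix are ignored.
    """
    args = []
    for name, value in env.items():
        if name.startswith(OMPI_MCA_PREFIX):
            args += ["--mca", name[len(OMPI_MCA_PREFIX):], value]
    return args


# The three ENVs this change removes from the Dockerfiles.
removed_envs = {
    "OMPI_MCA_pml": "^cm,ucx",
    "OMPI_MCA_btl": "tcp,self",
    "OMPI_MCA_btl_tcp_if_exclude": "lo,docker0,veth_def_agent",
}

# "-np 2 ./my_app" is a placeholder workload, assumed for illustration only.
command = ["mpirun", *mca_cli_args(removed_envs), "-np", "2", "./my_app"]
print(shlex.join(command))
```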
Kept: OPAL_PREFIX (legitimate; needed for OpenMPI lib resolution),
NCCL_SOCKET_IFNAME (Brian did not flag; conservative keep).
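The removed/kept split can be checked mechanically against a Dockerfile's text. The following is a rough sketch under stated assumptions: it only understands the ENV KEY=VALUE form (not the legacy "ENV KEY VALUE" form), and the names env_keys and check_env_cleanup are hypothetical.

```python
import re
import shlex

REMOVED = ("OMPI_MCA_pml", "OMPI_MCA_btl", "OMPI_MCA_btl_tcp_if_exclude")
KEPT = ("OPAL_PREFIX", "NCCL_SOCKET_IFNAME")


def env_keys(dockerfile_text):
    """Collect variable names set by ENV instructions (KEY=VALUE form only)."""
    # Join backslash-continued lines so a multi-line ENV becomes one logical line.
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    keys = set()
    for line in joined.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("ENV "):
            for token in shlex.split(stripped[4:]):
                name, sep, _value = token.partition("=")
                if sep:
                    keys.add(name)
    return keys


def check_env_cleanup(dockerfile_text):
    """Return (leftover, missing): removed ENVs still set, kept ENVs no longer set."""
    keys = env_keys(dockerfile_text)
    leftover = [k for k in REMOVED if k in keys]
    missing = [k for k in KEPT if k not in keys]
    return leftover, missing
```

A Dockerfile that matches this change should yield two empty lists.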
bwbarrett
approved these changes
May 5, 2026
Summary
Per AWS NCCL/EFA team review, the following 3 ENVs in the nemo Dockerfiles
are unnecessary:
Removed:
- OMPI_MCA_pml=^cm,ucx
- OMPI_MCA_btl=tcp,self
- OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent
Kept (legitimate, not under review):
- OPAL_PREFIX=/opt/amazon/openmpi (needed for OpenMPI lib resolution).
- NCCL_SOCKET_IFNAME=^docker,lo,veth (not flagged by the review; conservative keep).
Why these can be safely removed (code audit)
OMPI_MCA_* ENVs only affect mpirun behavior. The nemo test case never invokes mpirun in any of its launcher scripts (the grep audit in the commit message above returns no matches).
The launchers all use:
- slurm/run.py → nemo-run SlurmExecutor → srun --container-image ... ft_launcher ... -m nemo_run.core.runners.fdl_runner (PMIx-based, not MPI).
- kubernetes/{pretrain,finetune}_*.py → kubeflow PyTorchJob (no MPI).
The only mpirun reference inside this test case is the Dockerfile
"mpirun-real" wrapper script setup (Dockerfile lines 109-112; same
pattern in kubernetes/Dockerfile lines 113-116). It re-exposes
mpirun as a thin shim over mpirun.real, intended for users who drop
into the container interactively. It is never called by any test case
script. A user running mpirun manually can set MCA flags on the
command line as needed.
Therefore the OMPI_MCA_* ENVs are read by no executed code path
that this test case exercises.
Empirical 32-node Llama 3.1 8B verification (partial)
Hardware: SageMaker HyperPod Slurm cluster, 32 × p6-b300.48xlarge
(B300, sm_103). Recipe: llm.llama31_8b.pretrain_recipe(num_nodes=32) via
slurm/run.py, --max_steps 50. Same allocation
(compute-gpu-st-distributed-ml-[2-33], node 1 excluded for ECC issues), same
slurm/env_vars.json. Two images differing only in the 3 OMPI_MCA Dockerfile ENVs:
- aws-nemo:26.02-v6 (PR 1072 base)
- aws-nemo:26.02-v6-stripped (this PR)
N1 ran the full 50 iterations cleanly (mean iter time iters 5-49 =
0.3212 s/iter, steady-state ~0.27 s/iter, loss stable at 11.03).
N2 ran 10 iterations cleanly, then was requeued by Slurm due to
hardware failure on compute-gpu-st-distributed-ml-2 mid-run
(slurmctld: requeue job JobId=414 due to failure of node ...). The
allocated cluster window ended before the requeued retry could land,
so we have only the first 10 iterations of N2.
Apples-to-apples iter-5-through-9 comparison (the 5 iters both runs
captured):
What we can and cannot claim from this 10-iter slice:
nodes, and runs 10 training iterations end-to-end. The OMPI_MCA
removal does not break any code path that training start exercises.
distinguishable signal — the variance of early iters dominates.
The verdict still rests on the code audit above (OMPI_MCA flags can't
reach srun-launched workers anyway). The partial empirical run
contributes a "no startup regression" check, not a steady-state perf
check. Re-running the steady-state check on the next available
allocation can be added as a follow-up if reviewers want it.
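For the follow-up run, the windowed means quoted in this section (iters 5-49 for the full run, iters 5-9 for the overlap) could be computed along these lines. This is a sketch only: it assumes iterations are numbered from 0 and that per-iteration times have already been parsed from the training log into a list; both function names are hypothetical.

```python
from statistics import mean


def common_window(len_a, len_b, warmup):
    """Inclusive (first, last) iteration range that both runs captured after warmup.

    Assumes 0-based iteration numbering, so a run of n iterations covers 0..n-1.
    """
    last = min(len_a, len_b) - 1
    if last < warmup:
        raise ValueError("runs share no iterations past the warmup window")
    return warmup, last


def mean_iter_time(iter_times, first, last):
    """Mean seconds per iteration over iterations first..last (inclusive)."""
    window = iter_times[first:last + 1]
    if len(window) != last - first + 1:
        raise ValueError("log does not cover the requested iteration window")
    return mean(window)


# A 50-iteration run and a 10-iteration run, skipping 5 warmup iterations,
# overlap on iterations 5 through 9.
first, last = common_window(50, 10, warmup=5)
```

Reporting the per-run mean alongside the window bounds keeps a short overlap like this one from being mistaken for a steady-state number.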
Note: bumped image required a megatron-core pin to run end-to-end
The N1/N2 images are PR #1072's bumped Dockerfile plus a
MEGATRON_CORE_VERSION=core_v0.15.3 ARG that pins the
/opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree to
match what NeMo 2.7.x is API-compatible with. Without this pin, the
recipe crashes at iter 0 on
get_megatron_optimizer() got an unexpected keyword argument 'no_weight_decay_cond'
(kwargs removed in megatron-core 0.16.x). This pin lives on PR #1072's
branch, not on this PR's branch (this PR branches off main, which uses
nemo:25.07.00, where the pin is not needed).
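A mismatch like this can be caught before a 32-node launch by inspecting the callee's signature inside the image. The sketch below uses stand-in functions with made-up signatures; they are NOT the real megatron-core definitions, and accepts_keyword is a hypothetical helper.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for illustration only; the signatures are assumptions.
def optimizer_factory_with_kwarg(config, model_chunks, no_weight_decay_cond=None):
    return None


def optimizer_factory_without_kwarg(config, model_chunks):
    return None


print(accepts_keyword(optimizer_factory_with_kwarg, "no_weight_decay_cond"))
print(accepts_keyword(optimizer_factory_without_kwarg, "no_weight_decay_cond"))
```

In a real preflight, the stand-ins would be replaced by the function imported from the megatron-core tree installed in the image.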
Test plan
- grep -rn audit confirms zero mpirun invocations in launcher scripts.
- docker build succeeds with the 3 ENVs removed (no Dockerfile syntax errors; the \ line continuation re-flows correctly into the remaining OPAL_PREFIX and NCCL_SOCKET_IFNAME ENV).
- llm.llama31_8b pretrain at 32n × 8 B300 completes cleanly (mean 0.3212 s/iter iters 5-49).
regression. Full 50-iter retry blocked on cluster hardware
failure mid-run + capacity-block timeout — a 5-sample early-iter
mean is too noisy to be a steady-state comparison and is not
reported. Flagged as deferred follow-up rather than in-scope
for this PR.