nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review by KeitaW · Pull Request #1078 · awslabs/awsome-distributed-ai

KeitaW · 2026-05-01T08:49:28Z

Summary

Per AWS NCCL/EFA team review, 3 ENVs in the nemo Dockerfiles are
unnecessary. This PR removes them from both files:

3.test_cases/megatron/nemo/Dockerfile             | 6 +-----
3.test_cases/megatron/nemo/kubernetes/Dockerfile  | 6 +-----
                                                    2 files, 10 deletions

Removed:

OMPI_MCA_pml=^cm,ucx
OMPI_MCA_btl=tcp,self
OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent

Kept (legitimate, not under review):

OPAL_PREFIX=/opt/amazon/openmpi — needed for OpenMPI lib resolution.
NCCL_SOCKET_IFNAME=^docker,lo,veth — not flagged by the review;
conservative keep.

Why these can be safely removed (code audit)

OMPI_MCA_* ENVs only affect mpirun behavior. The nemo test case
never invokes mpirun in any of its launcher scripts:

$ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
                                3.test_cases/megatron/nemo/kubernetes/ \
       --include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
(no matches)

The launchers all use:

Slurm: slurm/run.py → nemo-run SlurmExecutor → srun --container-image ... ft_launcher ... -m nemo_run.core.runners.fdl_runner (PMIx-based, not MPI).
Kubernetes: kubernetes/{pretrain,finetune}_*.py → kubeflow PyTorchJob (no MPI).

The only mpirun reference inside this test case is the Dockerfile
"mpirun-real" wrapper script setup (Dockerfile lines 109-112; same
pattern in kubernetes/Dockerfile lines 113-116) — this re-exposes
mpirun as a thin shim over mpirun.real, intended for users who drop
into the container interactively. It is never called by any test case
script. A user running mpirun manually can set MCA flags on the
command line as needed.

Therefore the OMPI_MCA_* ENVs are read by no executed code path
that this test case exercises.
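
The grep audit above can also be reproduced as a small script, for example to re-run it in CI. This is a minimal sketch, not part of the PR: the function name `find_mpi_launches` is hypothetical, and it assumes the same two launcher directories, file suffixes, and pattern as the grep command.

```python
import re
from pathlib import Path

# Same pattern and file types as the grep audit above.
LAUNCHER_PATTERN = re.compile(r"mpirun|mpiexec")
SUFFIXES = {".sh", ".py", ".sbatch", ".yaml"}


def find_mpi_launches(roots):
    """Return (path, line_number, line) for every line mentioning mpirun or mpiexec."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            # Only regular files with one of the audited suffixes are scanned.
            if not path.is_file() or path.suffix not in SUFFIXES:
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if LAUNCHER_PATTERN.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    roots = [
        "3.test_cases/megatron/nemo/slurm",
        "3.test_cases/megatron/nemo/kubernetes",
    ]
    for path, lineno, line in find_mpi_launches(roots):
        print(f"{path}:{lineno}:{line}")
```

An empty result corresponds to the "(no matches)" outcome of the grep command.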

Empirical 32-node Llama 3.1 8B verification (partial)

Hardware: SageMaker HyperPod Slurm cluster, 32 × p6-b300.48xlarge
(B300, sm_103). Recipe: llm.llama31_8b.pretrain_recipe(num_nodes=32)
via slurm/run.py, --max_steps 50. Same allocation
(compute-gpu-st-distributed-ml-[2-33], node 1 excluded for ECC
issues), same slurm/env_vars.json. Two images differing only in the
3 OMPI_MCA Dockerfile ENVs:

Variant	Image	OMPI_MCA ENVs in Dockerfile
N1	`aws-nemo:26.02-v6` (PR 1072 base)	kept (3 lines)
N2	`aws-nemo:26.02-v6-stripped` (this PR)	dropped
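
One way to confirm that the two images differ as described is to inspect the environment from inside each container. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the three variables would otherwise arrive through the image's ENV. A base image could set other OMPI_MCA_* variables on its own, so the first helper lists everything with that prefix.

```python
import os

# The three variables this PR removes from the Dockerfiles.
REMOVED_VARS = ("OMPI_MCA_pml", "OMPI_MCA_btl", "OMPI_MCA_btl_tcp_if_exclude")


def ompi_mca_vars(environ):
    """Return every OMPI_MCA_* variable found in the given environment mapping."""
    return {k: v for k, v in environ.items() if k.startswith("OMPI_MCA_")}


def leftover_removed_vars(environ):
    """Return the subset of REMOVED_VARS that is still set in the environment."""
    return [name for name in REMOVED_VARS if name in environ]


if __name__ == "__main__":
    print("OMPI_MCA_* set:", ompi_mca_vars(os.environ))
    print("removed vars still present:", leftover_removed_vars(os.environ))
```

Run inside N1, the second line would be expected to list all three names; inside N2, it would be expected to print an empty list.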

N1 ran the full 50 iterations cleanly (mean iter time iters 5-49 =
0.3212 s/iter, steady-state ~0.27 s/iter, loss stable at 11.03).

N2 ran 10 iterations cleanly, then was requeued by Slurm due to
hardware failure on compute-gpu-st-distributed-ml-2 mid-run
(slurmctld: requeue job JobId=414 due to failure of node ...). The
allocated cluster window ended before the requeued retry could land,
so we have only the first 10 iterations of N2.

Apples-to-apples iter-5-through-9 comparison (the 5 iters both runs
captured):

iter	N1 (s)	N2 (s)	Δ = N2 − N1 (s)
5	1.523	1.525	+0.001 (NCCL warmup spike, both)
6	0.307	0.343	+0.036
7	0.309	0.309	-0.000
8	0.892	1.004	+0.112 (validation step, both)
9	0.305	0.448	+0.143

Caveat. Iters 5-9 are dominated by NCCL warmup (1.5 s spike at
iter 5, both runs) and the validation step at iter 8 (0.9-1.0 s, both
runs). N1's iters 45-49 run at 0.27-0.31 s/iter (much faster than 5-9),
so the 5-iter window is not representative of steady state. A 5-sample
mean over this window is not statistically meaningful, so none is
reported, to avoid implying a comparable signal.
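
For reference, the per-iteration deltas can be recomputed from the table as N2 minus N1. This is a minimal sketch using only the rounded values printed in the table; the function name `step_deltas` is hypothetical. Recomputing from the rounded values gives +0.002 for iter 5 rather than the +0.001 shown, presumably because the table's Δ was taken from unrounded timings (an assumption). No mean is computed, in line with the caveat.

```python
# Per-iteration step times in seconds, copied from the table above.
N1 = {5: 1.523, 6: 0.307, 7: 0.309, 8: 0.892, 9: 0.305}
N2 = {5: 1.525, 6: 0.343, 7: 0.309, 8: 1.004, 9: 0.448}

# Iterations that both runs flag as atypical.
ATYPICAL = {5: "NCCL warmup", 8: "validation step"}


def step_deltas(baseline, candidate):
    """Return {iter: candidate - baseline}, rounded to 3 decimals, for shared iterations."""
    shared = sorted(set(baseline) & set(candidate))
    return {i: round(candidate[i] - baseline[i], 3) for i in shared}


if __name__ == "__main__":
    for it, delta in step_deltas(N1, N2).items():
        note = ATYPICAL.get(it, "")
        print(f"iter {it}: {delta:+.3f} s {note}".rstrip())
```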

What we can and cannot claim from this 10-iter slice:

✅ N2 builds, imports cleanly, NCCL bootstraps 256 ranks across 32
nodes, and runs 10 training iterations end-to-end. The OMPI_MCA
removal does not break any code path that training start exercises.
✅ N1 vs N2 step times in iters 5-9 do not show a statistically
distinguishable signal; the variance of early iters dominates.

The verdict still rests on the code audit above (the srun-launched
workers never read the OMPI_MCA flags). The partial empirical run
contributes a "no startup regression" check, not a steady-state perf
check. A steady-state re-run on the next available allocation can be
added as a follow-up if reviewers want it.

Note: bumped image required a megatron-core pin to run end-to-end

The N1/N2 images are built from PR #1072's bumped Dockerfile plus a
MEGATRON_CORE_VERSION=core_v0.15.3 ARG that pins the
/opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree to a
version that is API-compatible with NeMo 2.7.x. Without this pin, the
recipe crashes at iter 0 with:

get_megatron_optimizer() got an unexpected keyword argument 'no_weight_decay_cond'

(the kwarg was removed in megatron-core 0.16.x). The pin lives on
PR #1072's branch, not on this PR's branch: this PR branches off main,
which uses nemo:25.07.00, where the pin is not needed.
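
A mismatch like this can be detected before launch by checking whether the installed function still accepts the keyword. The sketch below shows the general technique with `inspect.signature`. The helper `accepts_kwarg` and the two stand-in functions are hypothetical and do not reproduce the real megatron-core signatures; in practice the check would be pointed at the imported `get_megatron_optimizer`.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins, NOT the real megatron-core signatures.
def optimizer_with_kwarg(config, no_weight_decay_cond=None):
    return "ok"


def optimizer_without_kwarg(config):
    return "ok"


if __name__ == "__main__":
    for fn in (optimizer_with_kwarg, optimizer_without_kwarg):
        print(fn.__name__, accepts_kwarg(fn, "no_weight_decay_cond"))
```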

Test plan

grep -rn audit confirms zero mpirun invocations in launcher
scripts.
docker build succeeds with the 3 ENVs removed (no Dockerfile
syntax errors; the \ line continuation re-flows correctly into
the remaining OPAL_PREFIX and NCCL_SOCKET_IFNAME ENV).
N1 baseline: 50-iter llm.llama31_8b pretrain at 32n × 8 B300
completes cleanly (mean 0.3212 s/iter iters 5-49).
N2 stripped: 10-iter portion completes cleanly with no startup
regression. Full 50-iter retry blocked on cluster hardware
failure mid-run + capacity-block timeout — a 5-sample early-iter
mean is too noisy to be a steady-state comparison and is not
reported. Flagged as deferred follow-up rather than in-scope
for this PR.
CI lint / static analysis.

Per AWS NCCL/EFA team review (Brian Barrett, bbarrett@), the following 3 ENVs in 3.test_cases/megatron/nemo/{Dockerfile,kubernetes/Dockerfile} are unnecessary:

- OMPI_MCA_pml=^cm,ucx
- OMPI_MCA_btl=tcp,self
- OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent

These are OpenMPI MCA flags. They only affect mpirun behavior. A code audit confirms the test case never invokes mpirun:

$ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
    3.test_cases/megatron/nemo/kubernetes/ \
    --include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
(no matches)

The launchers all use srun --container-image (Pyxis) -> ft_launcher -> torchrun-style worker spawn (slurm/run.py via nemo-run), or kubeflow PyTorchJob (kubernetes/{pretrain,finetune}_*.py). None of these read OMPI_MCA_* env vars.

The only mpirun reference in this test case is the Dockerfile-level "mpirun-real" wrapper (Dockerfile lines 109-112; same in kubernetes/Dockerfile lines 113-116) which redefines mpirun as a thin shim over mpirun.real. The wrapper exists for users who drop into the container interactively; it is never invoked by the test case scripts. A user running mpirun manually can set MCA flags on the command line.

Kept: OPAL_PREFIX (legitimate; needed for OpenMPI lib resolution), NCCL_SOCKET_IFNAME (Brian did not flag; conservative keep).

KeitaW changed the title ~~nemo: drop OMPI_MCA_* Dockerfile ENVs (mpirun-only, never invoked)~~ nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review May 1, 2026

bwbarrett approved these changes May 5, 2026

KeitaW merged commit 32b41dc into main May 6, 2026
5 checks passed

KeitaW deleted the kw/env-var-hygiene-nemo branch May 6, 2026 08:13
