nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review #1078

Merged
KeitaW merged 1 commit into main from kw/env-var-hygiene-nemo on May 6, 2026

KeitaW commented May 1, 2026

Summary

Per AWS NCCL/EFA team review, 3 OMPI_MCA_* ENVs in the nemo Dockerfiles
are unnecessary. This PR removes them from both files:

3.test_cases/megatron/nemo/Dockerfile             | 6 +-----
3.test_cases/megatron/nemo/kubernetes/Dockerfile  | 6 +-----
                                                    2 files, 10 deletions

Removed:

  • OMPI_MCA_pml=^cm,ucx
  • OMPI_MCA_btl=tcp,self
  • OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent

Kept (legitimate, not under review):

  • OPAL_PREFIX=/opt/amazon/openmpi — needed for OpenMPI lib resolution.
  • NCCL_SOCKET_IFNAME=^docker,lo,veth — not flagged by the review;
    conservative keep.
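
One way to confirm that a Dockerfile no longer sets any of the removed variables is to parse its ENV instructions. The sketch below is illustrative only: the helper names are hypothetical, the sample text is a stand-in rather than a copy of the repository's Dockerfile, and only the `ENV KEY=value` form (with `\` continuations) is handled.

```python
import shlex


def env_keys(dockerfile_text):
    """Return variable names set by ENV instructions, following '\\' continuations."""
    logical_lines, buffer = [], ""
    for raw in dockerfile_text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # Join a continued physical line onto the next one.
            buffer += line[:-1] + " "
            continue
        logical_lines.append(buffer + line)
        buffer = ""
    if buffer:
        logical_lines.append(buffer)

    keys = []
    for line in logical_lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "ENV":
            for token in shlex.split(parts[1]):
                if "=" in token:
                    keys.append(token.split("=", 1)[0])
    return keys


def leftover_mca_keys(dockerfile_text):
    """Return any OMPI_MCA_* variables still set via ENV."""
    return [k for k in env_keys(dockerfile_text) if k.startswith("OMPI_MCA_")]


# Stand-in fragment (assumed layout, not the real file).
SAMPLE = """\
ENV OPAL_PREFIX=/opt/amazon/openmpi \\
    NCCL_SOCKET_IFNAME=^docker,lo,veth
"""

if __name__ == "__main__":
    print(env_keys(SAMPLE))
    print(leftover_mca_keys(SAMPLE))
```

An empty list from `leftover_mca_keys` means no ENV instruction in the text sets an OMPI_MCA_* variable.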

Why these can be safely removed (code audit)

OMPI_MCA_* ENVs only affect mpirun behavior. The nemo test case
never invokes mpirun in any of its launcher scripts:

$ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
                                3.test_cases/megatron/nemo/kubernetes/ \
       --include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
(no matches)
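
The same audit can be sketched in Python for use where grep is not convenient, for example in a CI check. This is a minimal approximation of the grep command above, not part of the PR; `find_mpi_launchers` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Same pattern and file extensions as the grep audit above.
LAUNCHER_PATTERN = re.compile(r"mpirun|mpiexec")
LAUNCHER_SUFFIXES = {".sh", ".py", ".sbatch", ".yaml"}


def find_mpi_launchers(roots):
    """Return (path, line_number, line) for every mpirun/mpiexec mention."""
    hits = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # Skip missing directories instead of failing.
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in LAUNCHER_SUFFIXES:
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if LAUNCHER_PATTERN.search(line):
                    hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    roots = [
        "3.test_cases/megatron/nemo/slurm",
        "3.test_cases/megatron/nemo/kubernetes",
    ]
    for path, number, line in find_mpi_launchers(roots):
        print(f"{path}:{number}: {line}")
```

An empty result corresponds to the "(no matches)" outcome of the grep run.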

The launchers all use:

  • Slurm: slurm/run.py → nemo-run SlurmExecutor → srun --container-image ... ft_launcher ... -m nemo_run.core.runners.fdl_runner (PMIx-based, not MPI).
  • Kubernetes: kubernetes/{pretrain,finetune}_*.py → kubeflow PyTorchJob (no MPI).

The only mpirun reference inside this test case is the Dockerfile
"mpirun-real" wrapper script setup (Dockerfile lines 109-112; same
pattern in kubernetes/Dockerfile lines 113-116) — this re-exposes
mpirun as a thin shim over mpirun.real, intended for users who drop
into the container interactively. It is never called by any test case
script. A user running mpirun manually can set MCA flags on the
command line as needed.

Therefore the OMPI_MCA_* ENVs are read by no executed code path
that this test case exercises.

Empirical 32-node Llama 3.1 8B verification (partial)

Hardware: SageMaker HyperPod Slurm cluster, 32 × p6-b300.48xlarge
(B300, sm_103). Recipe: llm.llama31_8b.pretrain_recipe(num_nodes=32)
via slurm/run.py, --max_steps 50. Same allocation
(compute-gpu-st-distributed-ml-[2-33], node 1 excluded for ECC
issues), same slurm/env_vars.json. Two images differing only in the
3 OMPI_MCA Dockerfile ENVs:

| Variant | Image | OMPI_MCA ENVs in Dockerfile |
|---|---|---|
| N1 | aws-nemo:26.02-v6 (PR 1072 base) | kept (3 lines) |
| N2 | aws-nemo:26.02-v6-stripped (this PR) | dropped |

N1 ran the full 50 iterations cleanly (mean iter time iters 5-49 =
0.3212 s/iter, steady-state ~0.27 s/iter, loss stable at 11.03).

N2 ran 10 iterations cleanly, then was requeued by Slurm due to
hardware failure on compute-gpu-st-distributed-ml-2 mid-run
(slurmctld: requeue job JobId=414 due to failure of node ...). The
allocated cluster window ended before the requeued retry could land,
so we have only the first 10 iterations of N2.

Apples-to-apples iter-5-through-9 comparison (the 5 iters both runs
captured):

| iter | N1 (s) | N2 (s) | Δ |
|---|---|---|---|
| 5 | 1.523 | 1.525 | +0.001 (NCCL warmup spike, both) |
| 6 | 0.307 | 0.343 | +0.036 |
| 7 | 0.309 | 0.309 | -0.000 |
| 8 | 0.892 | 1.004 | +0.112 (validation step, both) |
| 9 | 0.305 | 0.448 | +0.143 |

Caveat. Iters 5-9 are dominated by NCCL warmup (1.5s spike at
iter 5, both runs) and the validation step at iter 8 (0.9-1.0s, both
runs). N1's iters 45-49 are 0.27-0.31 s/iter (much faster than 5-9),
so the 5-iter window is not representative of steady-state. A 5-sample
mean over this window is not statistically meaningful — taking one is
omitted to avoid implying a comparable signal.
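
To illustrate the caveat, the sketch below filters out iterations whose step time is far above the window median. The step times are copied from the table above; the 2x-median cutoff and the helper name are assumptions made for illustration, not something the PR used.

```python
from statistics import median

# Step times in seconds for iterations 5-9, copied from the table above.
N1_STEP_TIMES = {5: 1.523, 6: 0.307, 7: 0.309, 8: 0.892, 9: 0.305}
N2_STEP_TIMES = {5: 1.525, 6: 0.343, 7: 0.309, 8: 1.004, 9: 0.448}


def typical_iterations(step_times, factor=2.0):
    """Keep iterations whose step time is below factor x the window median.

    The factor is an arbitrary illustrative threshold, not a tuned value.
    """
    cutoff = factor * median(step_times.values())
    return sorted(i for i, t in step_times.items() if t < cutoff)


if __name__ == "__main__":
    print(typical_iterations(N1_STEP_TIMES))  # [6, 7, 9]
    print(typical_iterations(N2_STEP_TIMES))  # [6, 7, 9]
```

Under this assumed cutoff, the warmup (iter 5) and validation (iter 8) steps drop out of both runs, leaving three samples per run, which is too few to support a steady-state comparison.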

What we can and cannot claim from this 10-iter slice:

  • ✅ N2 builds, imports cleanly, NCCL bootstraps 256 ranks across 32
    nodes, and runs 10 training iterations end-to-end. The OMPI_MCA
    removal does not break any code path that training start exercises.
  • ✅ N1 vs N2 step times in iters 5-9 do not show a statistically
    distinguishable signal; the variance of early iters dominates.

The verdict still rests on the code audit above (no code path in the
srun-launched workers reads OMPI_MCA_* variables). The partial empirical
run contributes a "no startup regression" check, not a steady-state perf
check. A steady-state re-run on the next available allocation can be
added as a follow-up if reviewers want it.

Note: bumped image required a megatron-core pin to run end-to-end

The N1/N2 images are PR #1072's bumped Dockerfile plus a
MEGATRON_CORE_VERSION=core_v0.15.3 ARG that pins the
/opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree to
match what NeMo 2.7.x is API-compatible with. Without this pin, the
recipe crashes at iter 0 with
get_megatron_optimizer() got an unexpected keyword argument 'no_weight_decay_cond'
(kwargs removed in megatron-core 0.16.x). This
pin lives on PR #1072's branch, not on this PR's branch (this PR
branches off main which uses nemo:25.07.00, where the pin is not
needed).
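
This kind of version mismatch can be detected before a run starts by inspecting a callable's signature. The sketch below uses two hypothetical stand-in functions; it does not reproduce the real get_megatron_optimizer signature in either megatron-core release.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    named = params.get(name)
    if named is not None and named.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an API before and after a keyword was removed.
def optimizer_factory_old(config, no_weight_decay_cond=None):
    return ("old", config)


def optimizer_factory_new(config):
    return ("new", config)


if __name__ == "__main__":
    for factory in (optimizer_factory_old, optimizer_factory_new):
        ok = accepts_keyword(factory, "no_weight_decay_cond")
        print(f"{factory.__name__}: accepts no_weight_decay_cond = {ok}")
```

Calling the newer stand-in with the removed keyword raises a TypeError of the same "unexpected keyword argument" shape as the crash described above.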

Test plan

  • grep -rn audit confirms zero mpirun invocations in launcher
    scripts.
  • docker build succeeds with the 3 ENVs removed (no Dockerfile
    syntax errors; the \ line continuation re-flows correctly into
    the remaining OPAL_PREFIX and NCCL_SOCKET_IFNAME ENV).
  • N1 baseline: 50-iter llm.llama31_8b pretrain on 32 nodes × 8 B300 GPUs
    completes cleanly (mean 0.3212 s/iter iters 5-49).
  • N2 stripped: 10-iter portion completes cleanly with no startup
    regression. The full 50-iter retry was blocked by a mid-run cluster
    hardware failure and the capacity-block timeout. A 5-sample early-iter
    mean is too noisy to be a steady-state comparison and is not
    reported. The full run is flagged as a deferred follow-up rather than
    in scope for this PR.
  • CI lint / static analysis.

Per AWS NCCL/EFA team review (Brian Barrett, bbarrett@), the following
3 ENVs in 3.test_cases/megatron/nemo/{Dockerfile,kubernetes/Dockerfile}
are unnecessary:

  - OMPI_MCA_pml=^cm,ucx
  - OMPI_MCA_btl=tcp,self
  - OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent

These are OpenMPI MCA flags. They only affect mpirun behavior. A code
audit confirms the test case never invokes mpirun:

  $ grep -rn -E "(mpirun|mpiexec)" 3.test_cases/megatron/nemo/slurm/ \
                                  3.test_cases/megatron/nemo/kubernetes/ \
       --include="*.sh" --include="*.py" --include="*.sbatch" --include="*.yaml"
  (no matches)

The launchers all use srun --container-image (Pyxis) -> ft_launcher
-> torchrun-style worker spawn (slurm/run.py via nemo-run), or
kubeflow PyTorchJob (kubernetes/{pretrain,finetune}_*.py). None of
these read OMPI_MCA_* env vars.

The only mpirun reference in this test case is the Dockerfile-level
"mpirun-real" wrapper (Dockerfile lines 109-112; same in
kubernetes/Dockerfile lines 113-116) which redefines mpirun as a thin
shim over mpirun.real. The wrapper exists for users who drop into the
container interactively; it is never invoked by the test case scripts.
A user running mpirun manually can set MCA flags on the command line.

Kept: OPAL_PREFIX (legitimate; needed for OpenMPI lib resolution),
NCCL_SOCKET_IFNAME (Brian did not flag; conservative keep).

KeitaW changed the title from "nemo: drop OMPI_MCA_* Dockerfile ENVs (mpirun-only, never invoked)" to "nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review" on May 1, 2026
KeitaW merged commit 32b41dc into main on May 6, 2026
5 checks passed
KeitaW deleted the kw/env-var-hygiene-nemo branch on May 6, 2026 08:13