nemo: bump to nemo:26.02 and sync Slurm + Kubernetes Dockerfiles by KeitaW · Pull Request #1072 · awslabs/awsome-distributed-ai

KeitaW · 2026-04-29T22:38:55Z

Summary

Brings the NeMo test case onto the nvcr.io/nvidia/nemo:26.02 container and
syncs the Kubernetes Dockerfile (which had drifted about a year behind) to the same
software stack as the Slurm Dockerfile. Both now produce the same image: same
FROM, same EFA/NCCL/aws-ofi-nccl versions, same LD_LIBRARY_PATH.
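
One way to keep the two Dockerfiles from drifting again is a small parity check over their pins. The sketch below is not part of this PR: it assumes versions are declared as `ARG NAME=VALUE` defaults, and the ARG names in the sample fragments are illustrative rather than copied from the real files (the version values come from the stack-delta tables).

```python
import re


def pinned_versions(dockerfile_text):
    """Collect the FROM image and every `ARG NAME=VALUE` default from Dockerfile text."""
    pins = {}
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if line.upper().startswith("FROM "):
            # For multi-stage builds the last FROM wins in this simple sketch.
            pins["FROM"] = line.split()[1]
            continue
        match = re.match(r"ARG\s+(\w+)=(\S+)", line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def stack_drift(a_text, b_text):
    """Return {name: (value_in_a, value_in_b)} for every pin that differs."""
    a, b = pinned_versions(a_text), pinned_versions(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Illustrative fragments only; the ARG names are assumptions.
slurm = """FROM nvcr.io/nvidia/nemo:26.02
ARG EFA_INSTALLER_VERSION=1.48.0
ARG NCCL_VERSION=v2.30.4-1
"""
k8s_before = """FROM nvcr.io/nvidia/nemo:25.04.01
ARG EFA_INSTALLER_VERSION=1.37.0
ARG NCCL_VERSION=v2.23.4-1
"""

for name, (left, right) in stack_drift(slurm, k8s_before).items():
    print(f"{name}: {left} != {right}")
```

An empty result from `stack_drift` means the two files agree on every pin the parser can see; values set by other means (for example inside `RUN` steps) are outside what this sketch checks.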

Stack delta — Slurm Dockerfile

| Component | Before | After |
|---|---|---|
| NGC NeMo base | 25.07.00 | 26.02 |
| GDRCopy | v2.5 | v2.5.2 |
| EFA installer | 1.47.0 | 1.48.0 |
| NCCL | v2.27.7-1 | v2.30.4-1 |
| nccl-tests | v2.16.9 | v2.18.3 |
| AWS_OFI_NCCL (ARG) | (commented) | v1.19.0 (uncommented; bundled in EFA) |
| transformers | 4.56.1 | 4.57.6 |

Stack delta — Kubernetes Dockerfile

| Component | Before | After |
|---|---|---|
| NGC NeMo base | 25.04.01 | 26.02 |
| GDRCopy | v2.4.1 | v2.5.2 |
| EFA installer | 1.37.0 | 1.48.0 |
| NCCL | v2.23.4-1 | v2.30.4-1 |
| nccl-tests | v2.13.10 | v2.18.3 |
| AWS_OFI_NCCL | v1.13.2-aws (built from source) | v1.19.0 (bundled in EFA, source build dropped) |
| transformers | 4.48.1 | 4.57.6 |

The Kubernetes Dockerfile previously rebuilt aws-ofi-nccl from source (~20
lines + libhwloc-dev). EFA installer ≥ 1.47 bundles aws-ofi-nccl, so the source
build is replaced with a verification step.
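
The actual verification step lives in the Dockerfile. As a rough illustration of the idea, the Python sketch below looks for the plugin in the two install locations named under "NGC OFI plugin path" in this description. The glob pattern for the plugin file name is an assumption, and the function names are hypothetical.

```python
from pathlib import Path

# The two install locations named in this PR description.
DEFAULT_PLUGIN_DIRS = ("/opt/amazon/ofi-nccl/lib", "/opt/amazon/aws-ofi-nccl/lib")


def find_ofi_plugins(dirs=DEFAULT_PLUGIN_DIRS, pattern="libnccl-net*.so*"):
    """Return every file matching `pattern` in the given directories.

    The default pattern is an assumption about how the plugin file is named.
    """
    found = []
    for name in dirs:
        directory = Path(name)
        if directory.is_dir():
            found.extend(sorted(directory.glob(pattern)))
    return found


def verify_ofi_plugin(dirs=DEFAULT_PLUGIN_DIRS):
    """Fail loudly if no plugin file exists in any candidate directory."""
    plugins = find_ofi_plugins(dirs)
    if not plugins:
        searched = ", ".join(map(str, dirs))
        raise FileNotFoundError(f"no aws-ofi-nccl plugin found in: {searched}")
    return plugins
```

Accepting either location keeps the check valid whether the plugin came from the NGC package or from the EFA installer.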

Stack delta — Slurm host venv (`slurm/venv.sh`)

| Component | Before | After |
|---|---|---|
| NeMo-Run | arbitrary commit `4d05653…` | v0.9.0 release tag |
| torch | 2.6.0 | 2.10.0 |
| Megatron-LM | arbitrary commit `b5d90de…` | core_v0.16.1 (matches container) |
| nemo_toolkit | 2.1.0 | 2.7.3 |
| nvidia-resiliency-ext | 0.2.1 | 0.4.1 |
| mamba-ssm wheel | cu118 + torch2.0 (incompatible w/ pinned torch 2.6) | removed (dead) |
| opencc | 1.1.6 (no cp312 wheel) | resolver-picked (transitive via nemo_toolkit) |

Compatibility fixes (independently committed)

The base nemo:26.02 image and the nemo_toolkit==2.7.3 upgrade together
surface seven issues that the previous test case predates. Each is in its own
commit so the diff against main is reviewable piecewise:

- Stale EFA package state (commit 3c76373a):
  efa_installer.sh aborted with
  `/etc/ld.so.conf.d/000_efa.conf is installed by efa-profile but doesn't exist`.
  Added a `dpkg --purge --force-all` of the stale efa-profile,
  libfabric1-aws, openmpi40-aws, openmpi50-aws, and
  libnccl-ofi-ngc-v2 packages before the EFA install step.
- NGC OFI plugin path (commit ae0c0745):
  NGC images install the OFI NCCL plugin via libnccl-ofi-ngc-v2 to
  /opt/amazon/aws-ofi-nccl/lib, not the stock-EFA path
  /opt/amazon/ofi-nccl/lib. Updated LD_LIBRARY_PATH to cover both, and
  rewrote the verify step to match either location.
- apt autoremove sweeps base packages (commit 6528a81c):
  The previous unconditional `apt autoremove -y` deleted base-image
  packages that no installed Debian package depends on. Removed.
- libcusparseLt.so.0 SONAME missing (commit e94e1431):
  The libcusparselt0-cuda-13 package on nemo:26.02 installs
  libcusparseLt.so.0.9.1.1 into the nested
  /usr/lib/x86_64-linux-gnu/libcusparseLt/13/ directory but creates
  neither the libcusparseLt.so.0 SONAME symlink nor an
  /etc/ld.so.conf.d entry. `import torch` then crashes with
  `ImportError: libcusparseLt.so.0: cannot open shared object file`.
  Fixed by writing the dir into /etc/ld.so.conf.d/000_libcusparselt.conf
  and running ldconfig (which both registers the path and creates the
  SONAME symlink).
- megatron-core version mismatch (commit 7fc0aefd):
  nemo_toolkit==2.7.3 imports
  megatron.core.inference.model_inference_wrappers.inference_wrapper_config
  which only exists in core_v0.16.x. The previous
  core_v0.17.0rc0 pin in venv.sh reorganized that module tree and
  broke the entire `from nemo.collections import llm` import path.
  Repinned to core_v0.16.1 (what the container ships) so host and
  container megatron-core APIs match.
- get_nmt_tokenizer import path move (commit fb285c81):
  nemo_toolkit==2.7.x removed nemo.collections.nlp entirely.
  slurm/run.py still imported get_nmt_tokenizer from there.
  Repointed to nemo.collections.common.tokenizers.tokenizer_utils
  (same function, new home).
- Container PATH puts /usr/bin before /opt/venv/bin (commits
  b316549b, b36dfd75, c54c9f27):
  nemo:26.02 ships its Python packages (torch, nemo,
  megatron-core, nemo_run, …) into a uv-managed venv at /opt/venv/.
  ft_launcher's shebang `#!/opt/venv/bin/python3` resolves correctly,
  but torchelastic spawns workers via `python -m nemo_run.core.runners.fdl_runner`
  and `python` is searched on PATH, where /usr/bin/python (no nemo_run)
  wins. Workers crash with
  `ModuleNotFoundError: No module named 'nemo_run'`.
  Setting PATH in slurm/env_vars.json does not fix it (Pyxis appends
  the env-vars PATH to the image PATH rather than prepending). The fix is in
  the Dockerfile: `ENV PATH=/opt/venv/bin:…:$PATH`. The matching
  env_vars.json change keeps /opt/slurm/bin on PATH so the SBATCH
  wrapper can still call srun/scontrol.

Misc

- Default `--container_image` in slurm/run.py and the four
  kubernetes/{finetune,pretrain}_*.py launchers updated from
  nemo:24.12 → nemo:26.02.
- slurm/README.md, kubernetes/README.md, and the data-processing pod
  template updated for the new image tag and sqsh filename
  (aws-nemo-26-02.sqsh).

Smoke test (1×B300 + 4×B300, container)

Run via Pyxis on a P6-B300 node, launching /opt/venv/bin/python (the
uv-managed venv that nemo:26.02 ships its packages in — system python
doesn't see them):

torch:           2.10.0a0+b558c986e8.nv25.11
cuda:            13.0
device:          NVIDIA B300 SXM6 AC
compute capability: (10, 3)              # sm_103 confirmed native
nemo:            2.7.1
megatron-core:   0.16.1
torch.distributed init_process_group(backend="nccl"): rank 0..7,  world  8   (1 node)
torch.distributed init_process_group(backend="nccl"): rank 0..31, world 32   (4 nodes)
smoke OK

The 32-rank NCCL init across 4 nodes plus the
import torch + nemo + megatron-core chain is the minimum end-to-end
check that the container is functional on B300 with the bumped stack.
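
The version lines in the report above could be gathered with something like the following. This is a minimal sketch, not the script used for the smoke test; the distribution names passed in at the bottom are assumptions about how the packages are registered in the venv.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to what `pip list` shows.
for name, version in installed_versions(
    ["torch", "nemo_toolkit", "megatron-core"]
).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Run with /opt/venv/bin/python inside the container, a `NOT INSTALLED` line would point at the same system-python-versus-venv confusion described above.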

End-to-end recipe verified

The initial draft of this PR documented an upstream NeMo 2.7.1 ↔
megatron-core 0.16.1 API mismatch bundled inside the nemo:26.02
base image as a "Known limitation." The commit
`nemo: pin bundled megatron-core to 0.15.3` resolves it by replacing
the in-place /opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/
tree with megatron-core core_v0.15.3, the version that NeMo 2.7.x
is API-compatible with. The one pin addresses two upstream issues:

- `get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=..., lr_mult=...)`:
  kwargs removed in megatron-core 0.16.x; present in 0.15.3.
- `from megatron.core.dist_checkpointing.strategies import tensorstore`:
  submodule removed in 0.16.x (tensorstore.py, zarr.py,
  two_stage.py, resharding.py all dropped); present in 0.15.3.

slurm/venv.sh got the matching pin so the host nemo-run venv stays
in lockstep with the container.

End-to-end verification on B300 with the pinned stack:

| Scale | Iterations | Mean iter time (iters 5-49) | Steady-state | Loss |
|---|---|---|---|---|
| 1n × 8 GPU | 50 | ~8.3 s | ~8.3 s | 11.03 stable |
| 32n × 8 GPU | 50 | 0.3212 s | ~0.27 s | 11.03 stable |

The 32n run completed cleanly through 50 iterations without invoking
the upstream-broken code paths.
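
For readers reproducing the "mean iter time (iters 5-49)" column, a minimal sketch of the windowed mean follows. It assumes iterations are numbered from 0 and that the first five are discarded as warm-up. The timings in the example are synthetic, not the measured data.

```python
from statistics import mean


def mean_iter_time(iter_times, first=5, last=49):
    """Mean of per-iteration times for iterations first..last, inclusive.

    `iter_times[i]` is assumed to hold the wall-clock seconds of iteration i.
    """
    window = iter_times[first:last + 1]
    if not window:
        raise ValueError("no iterations fall inside the requested window")
    return mean(window)


# Synthetic timings (NOT measurements): slow warm-up, a few slower
# iterations just inside the window, then a steady state.
timings = [2.0] * 5 + [0.5] * 5 + [0.25] * 40

print(f"mean over iters 5-49: {mean_iter_time(timings):.4f} s")
print(f"steady state (last 10 iters): {mean(timings[-10:]):.4f} s")
```

A windowed mean that sits above the steady-state figure, as in the 32-node row, is what a few slower iterations early in the window would produce.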

Test plan

- docker build + enroot import succeeds from the new Slurm Dockerfile.
- Container loads on B300 (sm_103) with all Dockerfile fixes;
  import torch / nemo / megatron-core succeed; 8-rank and 32-rank
  NCCL init work (1n and 4n).
- docker build + docker push succeeds from the new Kubernetes
  Dockerfile (skipped here because EKS bring-up is out of scope; user
  verifies).
- `bash slurm/venv.sh` produces a working venv on the head node;
  nemo.collections.llm imports cleanly.
- `python slurm/run.py --container_image ~/aws-nemo-26-02.sqsh --nodes 1 --max_steps 50`
  runs 50-iter Llama 3.1 8B pretrain end-to-end
  cleanly with the megatron-core 0.15.3 pin (loss stable 11.03,
  ~8.3 s/iter at 1n).
- Llama 3.1 8B pretrain at 32 nodes via run.py recipe runs cleanly
  end-to-end (mean iter time iters 5-49 = 0.3212 s/iter, loss stable).
- Intermediate scales (4/8/16) not re-benchmarked in this window;
  not blocking the bump.

Slurm Dockerfile (../Dockerfile):

- FROM nemo:25.07.00 -> 26.02
- GDRCOPY v2.5 -> v2.5.2
- EFA 1.47.0 -> 1.48.0
- NCCL v2.27.7-1 -> v2.30.4-1
- NCCL_TESTS v2.16.9 -> v2.18.3
- TRANSFORMERS 4.56.1 -> 4.57.6
- AWS_OFI_NCCL_VERSION ARG made explicit (was commented out) at v1.19.0

Kubernetes Dockerfile was a year behind (FROM nemo:25.04.01 / EFA 1.37 / NCCL 2.23.4). Synced to the Slurm Dockerfile so both produce the same software stack: same FROM, ARGs, and EFA-bundled aws-ofi-nccl. Drops the explicit aws-ofi-nccl source build and replaces it with a verification step that reads from /opt/amazon/ofi-nccl (where EFA >=1.47 puts it). LD_LIBRARY_PATH updated to the bundled location.

Slurm host-side venv (slurm/venv.sh):

- NeMo-Run arbitrary commit -> v0.9.0 release tag
- torch 2.6.0 -> 2.10.0
- Megatron-LM arbitrary commit -> core_v0.17.0 (matches container)
- nemo_toolkit 2.1.0 -> 2.7.3 (matches the 26.02 container's nemo)
- nvidia-resiliency-ext 0.2.1 -> 0.4.1

Drops the mamba-ssm wheel install: the wheel was pinned to cu118 + torch 2.0 right next to a torch 2.6 install (flatly incompatible), nothing in the run.py path imports mamba_ssm, and pinning a working wheel for current torch is out of scope here.

Updates the default `--container_image` in run.py and the four kubernetes/{finetune,pretrain}_*.py launchers from nemo:24.12 to nemo:26.02, and bumps the corresponding image tags / sqsh filenames in slurm/README.md, kubernetes/README.md, and the data-processing pod template.

…-nccl)

The nemo:26.02 base image ships several CUDA add-on libraries (libcusparseLt0, libcudnn-frontend, ...) that no installed Debian package Depends: on; torch and transformer_engine load them via dlopen at runtime. apt autoremove sees them as orphaned and deletes them, after which the very first `import torch` inside the container crashes with:

    ImportError: libcusparseLt.so.0: cannot open shared object file: No such file or directory

Caught by the post-build smoke run on 1n B300; reproduces 100% of the time on the prior image. Removing the autoremove step keeps the rest of the cleanup intact (we still purge stale EFA pkgs explicitly before the EFA installer runs) and lets the base image's library set survive.

The previous "drop apt autoremove" fix turned out to be the wrong diagnosis: libcusparseLt0 is in fact installed (libcusparselt0-cuda-13 0.9.1.1-1), but the package places its runtime in a nested directory /usr/lib/x86_64-linux-gnu/libcusparseLt/13/ and ships no /etc/ld.so.conf.d snippet to register it. `ldconfig -p` shows libcusparse (the regular CUSPARSE) but not libcusparseLt. `import torch` dlopens libcusparseLt.so.0 and crashes:

    ImportError: libcusparseLt.so.0: cannot open shared object file

Adding the nested dir to LD_LIBRARY_PATH lets the dlopen resolve. Keep the autoremove drop too; it's still the safer behavior even though it wasn't the root cause of this specific crash.

The previous LD_LIBRARY_PATH-only approach didn't work — confirmed via runtime test inside the container. The directory contains libcusparseLt.so.0.9.1.1 but NO libcusparseLt.so.0 SONAME symlink, so torch's dlopen("libcusparseLt.so.0") fails even when the path is on LD_LIBRARY_PATH (dlopen needs the exact filename, not the SONAME of some other file in the dir). The right fix is to write an /etc/ld.so.conf.d entry and run ldconfig, which both registers the directory and creates the SONAME symlink. This mirrors what the libcusparselt0-cuda-13 .deb postinst should have done but didn't. Verified the symlink-shim equivalent (mkdir+ln) lets `import torch` succeed in a container smoke test against the same image; the ldconfig approach baked into the Dockerfile is the durable form.
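
The missing-symlink condition described above can be detected with a short directory scan. The sketch below is a diagnostic aid, not part of the Dockerfile fix. It assumes the SONAME is the library stem plus the major version (which matches libcusparseLt.so.0 here); a real SONAME is recorded in the ELF header, so treat the filename rule as a heuristic.

```python
import re
from pathlib import Path

# Matches fully versioned names such as libcusparseLt.so.0.9.1.1.
VERSIONED = re.compile(r"^(?P<stem>lib.+\.so)\.(?P<major>\d+)\.\d+(\.\d+)*$")


def missing_soname_links(directory):
    """Return (expected_link_name, real_file_name) pairs whose link is absent.

    Heuristic: the expected link is `<stem>.<major>`, e.g. libfoo.so.0 for
    libfoo.so.0.9.1.1. This is an assumption, not read from the ELF header.
    """
    directory = Path(directory)
    missing = []
    for path in sorted(directory.iterdir()):
        match = VERSIONED.match(path.name)
        if match is None or path.is_symlink():
            continue
        link_name = f"{match['stem']}.{match['major']}"
        if not (directory / link_name).exists():
            missing.append((link_name, path.name))
    return missing
```

Pointed at the nested libcusparseLt directory from this commit message, a non-empty result would indicate that the loader still has nothing named libcusparseLt.so.0 to open.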

The previous core_v0.17.0 pin broke `from nemo.collections import llm` because core_v0.17.0rc0 reorganized the inference wrapper module tree and dropped

    megatron.core.inference.model_inference_wrappers.inference_wrapper_config

which nemo_toolkit 2.7.3 still imports unconditionally. The container itself (nemo:26.02) ships megatron-core 0.16.1, so pinning the host venv to the same tag both fixes the import and keeps the host/container megatron-core APIs in sync (so saved tensors, checkpoint formats, etc. match across the boundary).

Also drop the explicit opencc==1.1.6 pin: the resolver picks a cp312-compatible version automatically when nemo_toolkit pulls it in transitively, and our manual pin failed to find a wheel for Python 3.12.
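
A quick preflight check for this class of breakage is to ask whether the module path still exists before importing anything heavy. The helper below is a hypothetical sketch and is not in `venv.sh`; only the dotted module name is taken from this commit message.

```python
import importlib.util


def module_exists(dotted_name):
    """Return True if `dotted_name` can be located, without importing it.

    Note: find_spec imports the parent packages as a side effect, and raises
    ModuleNotFoundError when a parent is missing, which is treated as "no".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


# Module path taken from the commit message above.
REQUIRED = "megatron.core.inference.model_inference_wrappers.inference_wrapper_config"

if not module_exists(REQUIRED):
    print(f"missing module: {REQUIRED}")
```

Running this in the host venv right after a megatron-core repin would flag a reorganized module tree before the first `from nemo.collections import llm`.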

In nemo_toolkit 2.7.x, `nemo.collections.nlp` was removed and the NLP tokenizer utilities moved to `nemo.collections.common.tokenizers`. The shipped run.py still imports from the old path:

    from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

which crashes immediately with ModuleNotFoundError on the new container and venv. The function itself is unchanged at the new location:

    nemo/collections/common/tokenizers/tokenizer_utils.py
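
This commit simply repoints the import. If a script had to run against both old and new nemo_toolkit releases, one alternative (not what this PR does) would be a fallback import. The sketch below is a generic helper under that assumption; `import_first` and `load_get_nmt_tokenizer` are hypothetical names, and the two module paths are the ones quoted above.

```python
import importlib


def import_first(module_names, attribute):
    """Return `attribute` from the first module in `module_names` that has it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module (or one of its parents) is not available
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(f"{attribute!r} not found in any of: {', '.join(module_names)}")


# New location first, then the pre-2.7 location quoted in the commit message.
TOKENIZER_UTILS_MODULES = (
    "nemo.collections.common.tokenizers.tokenizer_utils",
    "nemo.collections.nlp.modules.common.tokenizer_utils",
)


def load_get_nmt_tokenizer():
    """Resolve get_nmt_tokenizer lazily so importing this file never needs nemo."""
    return import_first(TOKENIZER_UTILS_MODULES, "get_nmt_tokenizer")
```

The direct repoint chosen in the PR is simpler and is enough when only the 26.02 stack is supported.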

nemo:26.02 ships its Python packages (torch, nemo, megatron-core, nemo_run, ...) into /opt/venv/, a uv-managed venv. The system /usr/bin/python has none of them. nemo-run's ft_launcher invokes `/usr/bin/python -m nemo_run.core.runners.fdl_runner` and crashes:

    ModuleNotFoundError: No module named 'nemo_run'

Adding /opt/venv/bin to PATH at the front routes `python` to the venv interpreter that has nemo_run installed. env_vars.json is the right place for this: it's loaded by run.py and forwarded into every container task.

The previous commit set PATH to /opt/venv/bin:... but dropped /opt/slurm/bin, which broke the SBATCH wrapper script — line 77 'srun: command not found'. Add /opt/slurm/bin back so both the host wrapper and the container have working PATHs.

Fix torchelastic worker spawn. The nemo:26.02 image's default PATH lists /usr/bin before /opt/venv/bin, so `which python` returns /usr/bin/python inside the container. ft_launcher's shebang correctly points to the venv python, but when torchelastic spawns workers it uses `python` from PATH — and /usr/bin/python doesn't have nemo_run, megatron, etc. installed. Setting PATH via env_vars.json doesn't help because Pyxis appends rather than prepends container env. The fix has to be in the image's ENV PATH.
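
The append-versus-prepend difference is easy to demonstrate in isolation. The sketch below models PATH lookup with `shutil.which`; it is an illustration of the lookup order only and does not reproduce how Pyxis actually merges environments. The `resolve` helper is a hypothetical name.

```python
import os
import shutil


def resolve(command, image_path, extra_dirs, mode):
    """Return the executable `command` resolves to after merging PATH entries.

    mode="append" places extra_dirs after the image PATH;
    mode="prepend" places them before it.
    """
    image_dirs = image_path.split(os.pathsep)
    if mode == "append":
        dirs = image_dirs + list(extra_dirs)
    elif mode == "prepend":
        dirs = list(extra_dirs) + image_dirs
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return shutil.which(command, path=os.pathsep.join(dirs))
```

With an image PATH that already lists /usr/bin first, `resolve("python", "/usr/bin", ["/opt/venv/bin"], "append")` can only ever return the system interpreter when one exists there, which is why the commit moves the fix into the image's `ENV PATH`.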

The previous value (set when the test case was on nemo:24.07) hard-disabled the fused-attention backend. transformer_engine in nemo:26.02 asserts that NVTE_FUSED_ATTN matches the chosen attention backend ('auto' wants 1):

    AssertionError: NVTE_FUSED_ATTN set to 0, but expected 1 for attention backend type auto. unset NVTE_FLASH_ATTN, NVTE_FUSED_ATTN and NVTE_UNFUSED_ATTN. Use the --attention-backend argument...

Fix: remove the override and let TE pick the backend.
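
Removing the override amounts to dropping the three variables named in the assertion message from whatever environment mapping is forwarded to the container. The sketch below assumes that mapping is a flat JSON object of strings; the sample content is made up and is not the real `env_vars.json`.

```python
import json

# The three variables the transformer_engine assertion asks to unset.
TE_BACKEND_OVERRIDES = ("NVTE_FLASH_ATTN", "NVTE_FUSED_ATTN", "NVTE_UNFUSED_ATTN")


def strip_attention_overrides(env):
    """Return a copy of `env` without the attention-backend override variables."""
    return {key: value for key, value in env.items() if key not in TE_BACKEND_OVERRIDES}


# Made-up sample; assumes a flat {"NAME": "value"} layout.
sample = json.loads('{"NVTE_FUSED_ATTN": "0", "FI_PROVIDER": "efa"}')
print(json.dumps(strip_attention_overrides(sample), indent=2))
```

The function returns a new dict and leaves its input untouched, so it can be applied to a loaded config before it is written back or forwarded.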

The nvcr.io/nvidia/nemo:26.02 base image ships an internally inconsistent component pair: NeMo 2.7.1 source at /opt/NeMo and megatron-core 0.16.1 at /opt/Megatron-Bridge/3rdparty/Megatron-LM/. NeMo 2.7.1 was developed against an older megatron-core (~0.14/0.15) and calls APIs that 0.16.x removed:

1. /opt/NeMo/nemo/lightning/_strategy_lib.py:712 calls get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=..., lr_mult=...). 0.16.x removed those three kwargs (they're now conditional via `config_overrides`). Hits as `TypeError: get_megatron_optimizer() got an unexpected keyword argument 'no_weight_decay_cond'` at iter 0.
2. /opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py:39 imports `tensorstore` from megatron.core.dist_checkpointing.strategies. 0.16.x removed strategies/tensorstore.py (along with zarr.py, two_stage.py, resharding.py). Hits as `ImportError: cannot import name 'tensorstore' from megatron.core.dist_checkpointing.strategies`.

Both are NVIDIA upstream container packaging bugs; the proper fix would be `nemo:26.02.x` shipping a NeMo source that matches the bundled megatron-core. Until then, this commit pins the bundled megatron-core back to core_v0.15.3 (latest 0.15.x), where both APIs still exist.

- Dockerfile: new `MEGATRON_CORE_VERSION=core_v0.15.3` ARG and a RUN step that git-clones the pinned tag and replaces the in-place /opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree.
- slurm/venv.sh: matching pin for the host-side nemo-run venv (was 0.16.1 to match the broken container; now 0.15.3 to match the pinned container).

Verified end-to-end:

- Smoke at 1n × 8 B300: 50 iter Llama 3.1 8B pretrain runs cleanly, loss stable at 11.03, ~8.3 s/iter (1n).
- Bench at 32n × 8 B300: 50 iter pretrain runs cleanly, mean iter time iters 5-49 = 0.3212 s/iter, steady-state ~0.27 s/iter.

Removes the "Known upstream limitation: slurm/run.py recipe + nemo:26.02" caveat from the PR description.
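
A removed-kwarg mismatch like the first one above can be caught before iteration 0 by inspecting the callee's signature. The sketch below uses two stand-in functions with made-up signatures to mimic the old and new styles. They are not the real `get_megatron_optimizer`, and `unsupported_kwargs` is a hypothetical helper.

```python
import inspect


def unsupported_kwargs(func, names):
    """Return the subset of `names` that `func` cannot accept as keyword arguments."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # a **kwargs catch-all accepts any keyword
    keyword_ok = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return [name for name in names if name not in keyword_ok]


# Stand-ins only: signatures invented to mirror the kwargs named above.
def optimizer_old_style(config, model_chunks, no_weight_decay_cond=None,
                        scale_lr_cond=None, lr_mult=1.0):
    return None


def optimizer_new_style(config, model_chunks, config_overrides=None):
    return None


CALLER_KWARGS = ["no_weight_decay_cond", "scale_lr_cond", "lr_mult"]
print(unsupported_kwargs(optimizer_old_style, CALLER_KWARGS))
print(unsupported_kwargs(optimizer_new_style, CALLER_KWARGS))
```

In a real container the same helper could be pointed at the imported megatron-core function to confirm that the pinned version still accepts the kwargs NeMo passes.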

KeitaW added 12 commits April 29, 2026 13:48

nemo: purge stale EFA pkgs from base image before installing 1.48

3c76373

nemo: handle NGC libnccl-ofi-ngc-v2 install path (/opt/amazon/aws-ofi…

ae0c074

…-nccl)

KeitaW mentioned this pull request May 1, 2026

nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review #1078

Merged

5 tasks

KeitaW requested a review from pbelevich May 8, 2026 10:29

KeitaW marked this pull request as ready for review May 8, 2026 10:29

KeitaW merged commit 02be70c into main May 15, 2026
5 checks passed

KeitaW deleted the kw/nemo-26.02-stack-sync branch May 15, 2026 07:36

KeitaW merged 13 commits into main from kw/nemo-26.02-stack-sync
