nemo: bump to nemo:26.02 and sync Slurm + Kubernetes Dockerfiles #1072
Merged
Slurm Dockerfile (../Dockerfile):
FROM nemo:25.07.00 -> 26.02
GDRCOPY v2.5 -> v2.5.2
EFA 1.47.0 -> 1.48.0
NCCL v2.27.7-1 -> v2.30.4-1
NCCL_TESTS v2.16.9 -> v2.18.3
TRANSFORMERS 4.56.1 -> 4.57.6
AWS_OFI_NCCL_VERSION ARG made explicit (was commented out) at v1.19.0
Kubernetes Dockerfile was a year behind (FROM nemo:25.04.01 / EFA 1.37 /
NCCL 2.23.4). Synced to the Slurm Dockerfile so both produce the same
software stack: same FROM, ARGs, and EFA-bundled aws-ofi-nccl. Drops
the explicit aws-ofi-nccl source build and replaces it with a
verification step that reads from /opt/amazon/ofi-nccl (where EFA >=1.47
puts it). LD_LIBRARY_PATH updated to the bundled location.
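A verification step of this kind could look roughly like the sketch below. It is not the check the Dockerfile runs; the two directories are the ones named in this PR, while the glob pattern for the plugin's file name and the helper names are assumptions for illustration.

```python
from pathlib import Path

# Both directories are named in this PR; which one is populated depends on
# whether the plugin came from the EFA installer or from the NGC base image.
DEFAULT_PLUGIN_DIRS = (
    "/opt/amazon/ofi-nccl/lib",
    "/opt/amazon/aws-ofi-nccl/lib",
)


def find_ofi_nccl_plugin(search_dirs=DEFAULT_PLUGIN_DIRS, pattern="libnccl-net*.so*"):
    """Return the first plugin library found under search_dirs, or None.

    The glob pattern is an assumption about the plugin's file name.
    """
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue  # tolerate locations that do not exist in this image
        matches = sorted(directory.glob(pattern))
        if matches:
            return matches[0]
    return None


def verify_ofi_nccl_plugin(search_dirs=DEFAULT_PLUGIN_DIRS):
    """Raise if no plugin library is present, so an image build fails early."""
    plugin = find_ofi_nccl_plugin(search_dirs)
    if plugin is None:
        raise FileNotFoundError(f"no aws-ofi-nccl plugin found in {list(search_dirs)}")
    return plugin
```

Failing loudly at build time is the point: a missing plugin otherwise only shows up later, once NCCL is initialized at run time.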
Slurm host-side venv (slurm/venv.sh):
NeMo-Run arbitrary commit -> v0.9.0 release tag
torch 2.6.0 -> 2.10.0
Megatron-LM arbitrary commit -> core_v0.17.0 (matches container)
nemo_toolkit 2.1.0 -> 2.7.3 (matches the 26.02 container's nemo)
nvidia-resiliency-ext 0.2.1 -> 0.4.1
Drops the mamba-ssm wheel install — the wheel was pinned to cu118 +
torch 2.0 right next to a torch 2.6 install (flatly incompatible),
nothing in the run.py path imports mamba_ssm, and pinning a working
wheel for current torch is out of scope here.
Updates the default `--container_image` in run.py and the four
kubernetes/{finetune,pretrain}_*.py launchers from nemo:24.12 to
nemo:26.02, and bumps the corresponding image tags / sqsh filenames in
slurm/README.md, kubernetes/README.md, and the data-processing pod
template.
The nemo:26.02 base image ships several CUDA add-on libraries
(libcusparseLt0, libcudnn-frontend, ...) that no installed Debian package
Depends: on — torch and transformer_engine load them via dlopen at
runtime. apt autoremove sees them as orphaned and deletes them, after
which the very first `import torch` inside the container crashes with:
ImportError: libcusparseLt.so.0: cannot open shared object file:
No such file or directory
Caught by the post-build smoke run on 1n B300; reproduces 100% of the
time on the prior image. Removing the autoremove step keeps the rest of
the cleanup intact (we still purge stale EFA pkgs explicitly before the
EFA installer runs) and lets the base image's library set survive.
The previous "drop apt autoremove" fix turned out to be the wrong
diagnosis — libcusparseLt0 is in fact installed (libcusparselt0-cuda-13
0.9.1.1-1), but the package places its runtime in a nested directory
/usr/lib/x86_64-linux-gnu/libcusparseLt/13/ and ships no
/etc/ld.so.conf.d snippet to register it. ldconfig -p shows libcusparse
(the regular CUSPARSE) but not libcusparseLt. `import torch` dlopens
libcusparseLt.so.0 and crashes:
ImportError: libcusparseLt.so.0: cannot open shared object file
Adding the nested dir to LD_LIBRARY_PATH lets the dlopen resolve. Keep
the autoremove drop too — it's still the safer behavior even though it
wasn't the root cause of this specific crash.
The previous LD_LIBRARY_PATH-only approach didn't work — confirmed via
runtime test inside the container. The directory contains
libcusparseLt.so.0.9.1.1 but NO libcusparseLt.so.0 SONAME symlink, so
torch's dlopen("libcusparseLt.so.0") fails even when the path is on
LD_LIBRARY_PATH (dlopen needs the exact filename, not the SONAME of
some other file in the dir).
The right fix is to write an /etc/ld.so.conf.d entry and run ldconfig,
which both registers the directory and creates the SONAME symlink. This
mirrors what the libcusparselt0-cuda-13 .deb postinst should have done
but didn't.
Verified the symlink-shim equivalent (mkdir+ln) lets `import torch`
succeed in a container smoke test against the same image; the ldconfig
approach baked into the Dockerfile is the durable form.
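The missing-symlink condition described above can be detected with a small directory scan. The following is a minimal sketch, assuming the conventional naming where the SONAME is the file name cut at its first version component; it does not read the SONAME from the ELF header, and `missing_soname_links` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches fully versioned shared objects such as libfoo.so.0.9.1.1 and captures
# the name up to the first version component (libfoo.so.0).
_VERSIONED = re.compile(r"^(?P<soname>lib.+\.so\.\d+)\.[\d.]+$")


def missing_soname_links(lib_dir):
    """Map each absent SONAME-style file name to the versioned file behind it.

    Assumption: the SONAME equals the file name truncated to its first version
    component. The real SONAME lives in the ELF header and is not inspected.
    """
    names = {p.name for p in Path(lib_dir).iterdir()}
    missing = {}
    for name in sorted(names):
        match = _VERSIONED.match(name)
        if match and match.group("soname") not in names:
            missing[match.group("soname")] = name
    return missing
```

Run against the nested libcusparseLt directory before the fix, a scan like this would be expected to report `libcusparseLt.so.0` as absent; after `ldconfig` has created the symlink it should return an empty mapping.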
The previous core_v0.17.0 pin broke `from nemo.collections import llm`
because core_v0.17.0rc0 reorganized the inference wrapper module tree
and dropped:
megatron.core.inference.model_inference_wrappers.inference_wrapper_config
which nemo_toolkit 2.7.3 still imports unconditionally. The container
itself (nemo:26.02) ships megatron-core 0.16.1, so pinning the host
venv to the same tag both fixes the import and keeps the host/container
megatron-core APIs in sync (so saved tensors, checkpoint formats, etc.
match across the boundary).
Also drop the explicit opencc==1.1.6 pin: the resolver picks a
cp312-compatible version automatically when nemo_toolkit pulls it in
transitively, and our manual pin failed to find a wheel for Python 3.12.
In nemo_toolkit 2.7.x, `nemo.collections.nlp` was removed and the NLP
tokenizer utilities moved to `nemo.collections.common.tokenizers`. The
shipped run.py still imports from the old path:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
which crashes immediately with ModuleNotFoundError on the new container
and venv. The function itself is unchanged at the new location:
nemo/collections/common/tokenizers/tokenizer_utils.py
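For scripts that have to run against both old and new nemo_toolkit versions, a fallback import is one option. This is only a sketch of that pattern, not what this commit does (the commit simply repoints the import); `import_first` is a hypothetical helper.

```python
import importlib


def import_first(attr, module_paths):
    """Return attr from the first module in module_paths that provides it."""
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # ModuleNotFoundError is a subclass of ImportError.
            errors.append(f"{path}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Usage with the two module paths named above (requires nemo_toolkit):
# get_nmt_tokenizer = import_first(
#     "get_nmt_tokenizer",
#     [
#         "nemo.collections.common.tokenizers.tokenizer_utils",
#         "nemo.collections.nlp.modules.common.tokenizer_utils",
#     ],
# )
```

Listing the new location first means the old path is only tried on environments that predate the move.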
nemo:26.02 ships its Python packages (torch, nemo, megatron-core,
nemo_run, ...) into /opt/venv/, a uv-managed venv. The system
/usr/bin/python has none of them. nemo-run's ft_launcher invokes
`/usr/bin/python -m nemo_run.core.runners.fdl_runner` and crashes:
ModuleNotFoundError: No module named 'nemo_run'
Adding /opt/venv/bin to PATH at the front routes `python` to the venv
interpreter that has nemo_run installed. env_vars.json is the right place
for this — it's loaded by run.py and forwarded into every container task.
The previous commit set PATH to /opt/venv/bin:... but dropped /opt/slurm/bin, which broke the SBATCH wrapper script — line 77 'srun: command not found'. Add /opt/slurm/bin back so both the host wrapper and the container have working PATHs.
Fix torchelastic worker spawn. The nemo:26.02 image's default PATH lists /usr/bin before /opt/venv/bin, so `which python` returns /usr/bin/python inside the container. ft_launcher's shebang correctly points to the venv python, but when torchelastic spawns workers it uses `python` from PATH — and /usr/bin/python doesn't have nemo_run, megatron, etc. installed. Setting PATH via env_vars.json doesn't help because Pyxis appends rather than prepends container env. The fix has to be in the image's ENV PATH.
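The effect of PATH ordering on which `python` a spawned worker gets can be reproduced with `shutil.which`, which performs the same left-to-right directory search. A minimal sketch follows; the helper names are hypothetical and the code only models the lookup, it does not touch Pyxis or torchelastic.

```python
import os
import shutil


def resolve_python(path_entries, name="python"):
    """Return the executable that `name` resolves to for the given PATH order."""
    return shutil.which(name, path=os.pathsep.join(path_entries))


def prepend_to_path(path_value, directory):
    """Put directory first on a PATH string, dropping any later duplicate."""
    rest = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + rest)
```

Inside the container, `resolve_python(os.environ["PATH"].split(os.pathsep))` shows which interpreter a bare `python` would start. Appending `/opt/venv/bin` after `/usr/bin` leaves that result unchanged, which is why the fix has to prepend in the image's `ENV PATH`.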
The previous value (set when the test case was on nemo:24.07) hard-disabled
the fused-attention backend. transformer_engine in nemo:26.02 asserts that
NVTE_FUSED_ATTN matches the chosen attention backend ('auto' wants 1):
AssertionError: NVTE_FUSED_ATTN set to 0, but expected 1 for attention
backend type auto. unset NVTE_FLASH_ATTN, NVTE_FUSED_ATTN and
NVTE_UNFUSED_ATTN. Use the --attention-backend argument...
Fix: remove the override and let TE pick the backend.
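Removing the override amounts to dropping the backend variables from whatever environment mapping is forwarded to the job. A minimal sketch, assuming the environment is held as a flat name-to-value mapping; the variable names come from the assertion message above, and `drop_te_backend_overrides` is a hypothetical helper.

```python
# Variable names taken from the assertion message quoted above.
TE_BACKEND_OVERRIDES = ("NVTE_FLASH_ATTN", "NVTE_FUSED_ATTN", "NVTE_UNFUSED_ATTN")


def drop_te_backend_overrides(env_vars):
    """Return a copy of env_vars without the attention-backend overrides."""
    return {k: v for k, v in env_vars.items() if k not in TE_BACKEND_OVERRIDES}
```

Returning a copy keeps the caller's mapping untouched, so the same helper can be applied to a loaded config without side effects.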
The nvcr.io/nvidia/nemo:26.02 base image ships an internally
inconsistent component pair: NeMo 2.7.1 source at /opt/NeMo and
megatron-core 0.16.1 at /opt/Megatron-Bridge/3rdparty/Megatron-LM/.
NeMo 2.7.1 was developed against an older megatron-core (~0.14/0.15)
and calls APIs that 0.16.x removed:
1. /opt/NeMo/nemo/lightning/_strategy_lib.py:712 calls
get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=...,
lr_mult=...). 0.16.x removed those three kwargs (they're now
conditional via `config_overrides`). Hits as
`TypeError: get_megatron_optimizer() got an unexpected keyword
argument 'no_weight_decay_cond'` at iter 0.
2. /opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py:39 imports
`tensorstore` from megatron.core.dist_checkpointing.strategies.
0.16.x removed strategies/tensorstore.py (along with zarr.py,
two_stage.py, resharding.py). Hits as `ImportError: cannot import
name 'tensorstore' from megatron.core.dist_checkpointing.strategies`.
Both are NVIDIA upstream container packaging bugs; the proper fix would
be `nemo:26.02.x` shipping a NeMo source that matches the bundled
megatron-core. Until then, this commit pins the bundled megatron-core
back to core_v0.15.3 (latest 0.15.x), where both APIs still exist.
- Dockerfile: new `MEGATRON_CORE_VERSION=core_v0.15.3` ARG and a RUN
step that git-clones the pinned tag and replaces the in-place
/opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree.
- slurm/venv.sh: matching pin for the host-side nemo-run venv (was
0.16.1 to match the broken container; now 0.15.3 to match the
pinned container).
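Both mismatches can be probed without starting a training run. The sketch below shows one way to do that with the standard library; `accepts_kwargs` and `has_submodule` are hypothetical helpers and are not part of NeMo or megatron-core.

```python
import importlib.util
import inspect

_KEYWORD_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def accepts_kwargs(func, *names):
    """True if func can be called with every keyword in names."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True  # a **kwargs catch-all accepts any keyword
    return all(n in params and params[n].kind in _KEYWORD_KINDS for n in names)


def has_submodule(package, submodule):
    """True if package.submodule can be located.

    find_spec imports the parent package but not the submodule itself.
    """
    try:
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ModuleNotFoundError:
        return False
```

In an environment with megatron-core installed, `has_submodule("megatron.core.dist_checkpointing.strategies", "tensorstore")` corresponds to issue 2, and `accepts_kwargs(get_megatron_optimizer, "no_weight_decay_cond", "scale_lr_cond", "lr_mult")` to issue 1, with `get_megatron_optimizer` imported from wherever the installed version exposes it.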
Verified end-to-end:
- Smoke at 1n × 8 B300: 50 iter Llama 3.1 8B pretrain runs cleanly,
loss stable at 11.03, ~8.3 s/iter (1n).
- Bench at 32n × 8 B300: 50 iter pretrain runs cleanly,
mean iter time iters 5-49 = 0.3212 s/iter, steady-state ~0.27 s/iter.
Removes the "Known upstream limitation: slurm/run.py recipe + nemo:26.02"
caveat from the PR description.
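The "mean iter time iters 5-49" figure excludes the first iterations as warm-up. A minimal sketch of that calculation, assuming per-iteration times are available as a list and that iterations are numbered from 0 (the numbering is an assumption; the log format is not shown here):

```python
from statistics import fmean


def mean_iter_time(iter_times, first=5, last=49):
    """Mean seconds per iteration over iterations first..last inclusive (0-based)."""
    window = iter_times[first : last + 1]
    if not window:
        raise ValueError("no iterations in the requested window")
    return fmean(window)
```

Skipping the leading iterations keeps one-off costs such as NCCL initialization and kernel compilation out of the average.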
Summary
Brings the NeMo test case onto the `nvcr.io/nvidia/nemo:26.02` container and
syncs the Kubernetes Dockerfile (which had drifted about a year behind) to the
same software stack as the Slurm Dockerfile. Both now produce the same image:
same FROM, same EFA/NCCL/aws-ofi-nccl versions, same `LD_LIBRARY_PATH`.
Stack delta: Slurm Dockerfile
Stack delta: Kubernetes Dockerfile
The Kubernetes Dockerfile previously rebuilt aws-ofi-nccl from source (~20
lines + `libhwloc-dev`). EFA installer ≥ 1.47 bundles ofi-nccl, so the source
build is replaced with a verification step.
Stack delta: Slurm host venv (`slurm/venv.sh`)
4d05653… b5d90de…
Compatibility fixes (independently committed)
The base `nemo:26.02` image and the `nemo_toolkit==2.7.3` upgrade together
surface seven issues that the previous test case predates. Each is in its own
commit so the diff against `main` is reviewable piecewise:
1. Stale EFA package state (commit 3c76373a):
   `efa_installer.sh` aborted with
   `/etc/ld.so.conf.d/000_efa.conf is installed by efa-profile but doesn't exist`.
   Added a `dpkg --purge --force-all` of the stale `efa-profile`,
   `libfabric1-aws`, `openmpi40-aws`, `openmpi50-aws`, and
   `libnccl-ofi-ngc-v2` packages before the EFA install step.
2. NGC OFI plugin path (commit ae0c0745):
   NGC images install the OFI NCCL plugin via `libnccl-ofi-ngc-v2` to
   `/opt/amazon/aws-ofi-nccl/lib`, not the stock-EFA path
   `/opt/amazon/ofi-nccl/lib`. Updated `LD_LIBRARY_PATH` to cover both, and
   rewrote the verify step to match either location.
3. `apt autoremove` sweeps base packages (commit 6528a81c):
   The previous unconditional `apt autoremove -y` deleted base-image
   packages that no installed Debian package depends on. Removed.
4. `libcusparseLt.so.0` SONAME missing (commit e94e1431):
   The `libcusparselt0-cuda-13` package on `nemo:26.02` installs
   `libcusparseLt.so.0.9.1.1` into the nested
   `/usr/lib/x86_64-linux-gnu/libcusparseLt/13/` directory but creates
   neither the `libcusparseLt.so.0` SONAME symlink nor an
   `/etc/ld.so.conf.d` entry. `import torch` then crashes with
   `ImportError: libcusparseLt.so.0: cannot open shared object file`.
   Fixed by writing the dir into `/etc/ld.so.conf.d/000_libcusparselt.conf`
   and running `ldconfig` (which both registers the path and creates the
   SONAME symlink).
5. megatron-core version mismatch (commit 7fc0aefd):
   `nemo_toolkit==2.7.3` imports
   `megatron.core.inference.model_inference_wrappers.inference_wrapper_config`,
   which exists in core_v0.16.x but not in core_v0.17.0rc0, where that module
   tree was reorganized. The previous `core_v0.17.0rc0` pin in `venv.sh`
   therefore broke the entire `from nemo.collections import llm` import path.
   Repinned to `core_v0.16.1` (what the container ships) so host and
   container megatron-core APIs match. A later commit moves both to
   core_v0.15.3 (see "End-to-end recipe verified").
6. `get_nmt_tokenizer` import path move (commit fb285c81):
   `nemo_toolkit==2.7.x` removed `nemo.collections.nlp` entirely.
   `slurm/run.py` still imported `get_nmt_tokenizer` from there.
   Repointed to `nemo.collections.common.tokenizers.tokenizer_utils`
   (same function, new home).
7. Container `PATH` puts `/usr/bin` before `/opt/venv/bin` (commits
   b316549b, b36dfd75, c54c9f27):
   `nemo:26.02` ships its Python packages (torch, nemo, megatron-core,
   nemo_run, …) into a uv-managed venv at `/opt/venv/`. `ft_launcher`'s
   shebang `#!/opt/venv/bin/python3` resolves correctly, but torchelastic
   spawns workers via `python -m nemo_run.core.runners.fdl_runner`, and
   `python` is searched on PATH, where `/usr/bin/python` (no nemo_run)
   wins. Workers crash: `ModuleNotFoundError: No module named 'nemo_run'`.
   Setting `PATH` in `slurm/env_vars.json` does not fix it (Pyxis appends
   the env-vars PATH to the image PATH rather than prepending). Fix is in
   the Dockerfile: `ENV PATH=/opt/venv/bin:…:$PATH`. The matching
   `env_vars.json` change keeps `/opt/slurm/bin` on PATH so the SBATCH
   wrapper can still call `srun`/`scontrol`.
Misc
- `--container_image` in `slurm/run.py` and the four
  `kubernetes/{finetune,pretrain}_*.py` launchers updated from `nemo:24.12`
  → `nemo:26.02`.
- `slurm/README.md`, `kubernetes/README.md`, and the data-processing pod
  template updated for the new image tag and sqsh filename
  (`aws-nemo-26-02.sqsh`).
Smoke test (1×B300 + 4×B300, container)
Run via Pyxis on a P6-B300 node, launching `/opt/venv/bin/python` (the
uv-managed venv that `nemo:26.02` ships its packages in; system `python`
doesn't see them):
The 32-rank NCCL init across 4 nodes plus the
`import torch + nemo + megatron-core` chain is the minimum end-to-end
check that the container is functional on B300 with the bumped stack.
End-to-end recipe verified
The initial draft of this PR documented an upstream NeMo 2.7.1 ↔
megatron-core 0.16.1 API mismatch bundled inside the `nemo:26.02` base
image as a "Known limitation." Commit
`nemo: pin bundled megatron-core to 0.15.3` resolves it by replacing
the in-place `/opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/`
tree with megatron-core core_v0.15.3, which is the version NeMo 2.7.x
is API-compatible with. Two upstream issues were addressed in one
pin:
- `get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=..., lr_mult=...)`:
  kwargs removed in megatron-core 0.16.x; present in 0.15.3.
- `from megatron.core.dist_checkpointing.strategies import tensorstore`:
  submodule removed in 0.16.x (`tensorstore.py`, `zarr.py`,
  `two_stage.py`, `resharding.py` all dropped); present in 0.15.3.
`slurm/venv.sh` got the matching pin so the host nemo-run venv stays
in lockstep with the container.
End-to-end verification on B300 with the pinned stack:
The 32n run completed cleanly through 50 iterations without invoking
the upstream-broken code paths.
Test plan
- `docker build` + `enroot import` succeeds from the new Slurm Dockerfile.
- `import torch / nemo / megatron-core` succeed; 8-rank and 32-rank
  NCCL init work (1n and 4n).
- `docker build` + `docker push` succeeds from the new Kubernetes
  Dockerfile (skipped here because EKS bring-up is out of scope; user
  verifies).
- `bash slurm/venv.sh` produces a working venv on the head node;
  `nemo.collections.llm` imports cleanly.
- `python slurm/run.py --container_image ~/aws-nemo-26-02.sqsh --nodes 1 --max_steps 50`
  runs 50-iter Llama 3.1 8B pretrain end-to-end cleanly with the
  megatron-core 0.15.3 pin (loss stable 11.03, ~8.3 s/iter at 1n).
- 32n × 8 B300 bench: 50-iter pretrain runs cleanly end-to-end (mean iter
  time iters 5-49 = 0.3212 s/iter, loss stable).
- Intermediate scales (4/8/16) not re-benchmarked in this window;
  not blocking the bump.