nemo: bump to nemo:26.02 and sync Slurm + Kubernetes Dockerfiles #1072

Merged
KeitaW merged 13 commits into main from kw/nemo-26.02-stack-sync
May 15, 2026
@KeitaW KeitaW commented Apr 29, 2026

Summary

Brings the NeMo test case onto the nvcr.io/nvidia/nemo:26.02 container and
syncs the Kubernetes Dockerfile (which had drifted about a year behind) to the same
software stack as the Slurm Dockerfile. Both now produce the same image: same
FROM, same EFA/NCCL/aws-ofi-nccl versions, same LD_LIBRARY_PATH.

Stack delta — Slurm Dockerfile

| Component | Before | After |
|---|---|---|
| NGC NeMo base | 25.07.00 | 26.02 |
| GDRCopy | v2.5 | v2.5.2 |
| EFA installer | 1.47.0 | 1.48.0 |
| NCCL | v2.27.7-1 | v2.30.4-1 |
| nccl-tests | v2.16.9 | v2.18.3 |
| AWS_OFI_NCCL (ARG) | (commented) | v1.19.0 (uncommented; bundled in EFA) |
| transformers | 4.56.1 | 4.57.6 |

Stack delta — Kubernetes Dockerfile

| Component | Before | After |
|---|---|---|
| NGC NeMo base | 25.04.01 | 26.02 |
| GDRCopy | v2.4.1 | v2.5.2 |
| EFA installer | 1.37.0 | 1.48.0 |
| NCCL | v2.23.4-1 | v2.30.4-1 |
| nccl-tests | v2.13.10 | v2.18.3 |
| AWS_OFI_NCCL | v1.13.2-aws (built from source) | v1.19.0 (bundled in EFA, source build dropped) |
| transformers | 4.48.1 | 4.57.6 |

The Kubernetes Dockerfile previously rebuilt aws-ofi-nccl from source (~20
lines + libhwloc-dev). EFA installer ≥ 1.47 bundles ofi-nccl, so the source
build is replaced with a verification step.
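
A verification step of this kind amounts to searching the candidate install directories for the plugin library. The sketch below is illustrative only and is not the Dockerfile step itself: the two directories are the ones named elsewhere in this description, while the function name and the `libnccl-net*.so*` filename pattern are assumptions.

```python
from pathlib import Path

# Candidate plugin directories named in this PR: the stock EFA path first,
# then the path used by the NGC libnccl-ofi-ngc-v2 package.
CANDIDATE_DIRS = ("/opt/amazon/ofi-nccl/lib", "/opt/amazon/aws-ofi-nccl/lib")

# Assumption: the plugin shared object matches this glob. Adjust it if the
# installed filename differs.
PLUGIN_GLOB = "libnccl-net*.so*"


def find_ofi_nccl_plugin(candidate_dirs=CANDIDATE_DIRS, pattern=PLUGIN_GLOB):
    """Return the first matching plugin file, or None if no directory has one."""
    for directory in candidate_dirs:
        base = Path(directory)
        if not base.is_dir():
            continue  # skip locations that do not exist in this image
        matches = sorted(base.glob(pattern))
        if matches:
            return matches[0]
    return None
```

A build step could call `find_ofi_nccl_plugin()` and fail the build when it returns `None`, which is cheaper to debug than a missing plugin discovered at NCCL init time.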

Stack delta — Slurm host venv (slurm/venv.sh)

| Component | Before | After |
|---|---|---|
| NeMo-Run | arbitrary commit 4d05653… | v0.9.0 release tag |
| torch | 2.6.0 | 2.10.0 |
| Megatron-LM | arbitrary commit b5d90de… | core_v0.16.1 (matches container) |
| nemo_toolkit | 2.1.0 | 2.7.3 |
| nvidia-resiliency-ext | 0.2.1 | 0.4.1 |
| mamba-ssm wheel | cu118 + torch2.0 (incompatible w/ pinned torch 2.6) | removed (dead) |
| opencc | 1.1.6 (no cp312 wheel) | resolver-picked (transitive via nemo_toolkit) |

Compatibility fixes (independently committed)

The base nemo:26.02 image and the nemo_toolkit==2.7.3 upgrade together
surface seven issues that the previous test case predates. Each fix is
committed separately so the diff against main is reviewable piecewise:

  1. Stale EFA package state (commit 3c76373a):
    efa_installer.sh aborted with
    `/etc/ld.so.conf.d/000_efa.conf is installed by efa-profile but doesn't exist`.
    Added a dpkg --purge --force-all of the stale efa-profile,
    libfabric1-aws, openmpi40-aws, openmpi50-aws, and
    libnccl-ofi-ngc-v2 packages before the EFA install step.

  2. NGC OFI plugin path (commit ae0c0745):
    NGC images install the OFI NCCL plugin via libnccl-ofi-ngc-v2 to
    /opt/amazon/aws-ofi-nccl/lib, not the stock-EFA path
    /opt/amazon/ofi-nccl/lib. Updated LD_LIBRARY_PATH to cover both, and
    rewrote the verify step to match either location.

  3. apt autoremove sweeps base packages (commit 6528a81c):
    The previous unconditional apt autoremove -y deleted base-image
    packages that no installed Debian package depends on. Removed.

  4. libcusparseLt.so.0 SONAME missing (commit e94e1431):
    The libcusparselt0-cuda-13 package on nemo:26.02 installs
    libcusparseLt.so.0.9.1.1 into the nested
    /usr/lib/x86_64-linux-gnu/libcusparseLt/13/ directory but creates
    neither the libcusparseLt.so.0 SONAME symlink nor an
    /etc/ld.so.conf.d entry. import torch then crashes with
    ImportError: libcusparseLt.so.0: cannot open shared object file.
    Fixed by writing the dir into /etc/ld.so.conf.d/000_libcusparselt.conf
    and running ldconfig (which both registers the path and creates the
    SONAME symlink).

  5. megatron-core version mismatch (commit 7fc0aefd):
    nemo_toolkit==2.7.3 imports
    megatron.core.inference.model_inference_wrappers.inference_wrapper_config,
    which exists in core_v0.16.x. The previous venv.sh pin,
    core_v0.17.0rc0, points at a release that reorganized that module
    tree, which broke the entire from nemo.collections import llm
    import path. Repinned to core_v0.16.1 (what the container ships) so
    host and container megatron-core APIs match. A later commit moves
    both to core_v0.15.3 (see "End-to-end recipe verified").

  6. get_nmt_tokenizer import path move (commit fb285c81):
    nemo_toolkit==2.7.x removed nemo.collections.nlp entirely.
    slurm/run.py still imported get_nmt_tokenizer from there.
    Repointed to nemo.collections.common.tokenizers.tokenizer_utils
    (same function, new home).

  7. Container PATH puts /usr/bin before /opt/venv/bin (commits
    b316549b, b36dfd75, c54c9f27):
    nemo:26.02 ships its Python packages (torch, nemo,
    megatron-core, nemo_run, …) into a uv-managed venv at /opt/venv/.
    ft_launcher's shebang #!/opt/venv/bin/python3 resolves correctly,
    but torchelastic spawns workers via python -m nemo_run.core.runners.fdl_runner
    and python is searched on PATH — where /usr/bin/python (no nemo_run)
    wins. Workers crash:
    ModuleNotFoundError: No module named 'nemo_run'.
    Setting PATH in slurm/env_vars.json does not fix it (Pyxis appends
    the env-vars PATH to the image PATH rather than prepending). Fix is in
    the Dockerfile: ENV PATH=/opt/venv/bin:…:$PATH. The matching
    env_vars.json change keeps /opt/slurm/bin on PATH so the SBATCH
    wrapper can still call srun/scontrol.
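
The PATH-ordering failure in item 7 can be reproduced outside the container. The sketch below is a model, not the launcher code: it builds two throwaway directories standing in for /usr/bin and /opt/venv/bin and uses the standard-library `shutil.which` to show that the first match on PATH wins. Appending the venv directory therefore changes nothing, while prepending it does. The helper names and the temporary layout are hypothetical, and a POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile


def make_fake_python(directory):
    """Create an executable placeholder named 'python' in `directory`."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "python")
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve_python(path_entries):
    """Resolve 'python' the way a PATH search does: first match wins."""
    return shutil.which("python", path=os.pathsep.join(path_entries))


root = tempfile.mkdtemp()
system_bin = os.path.join(root, "usr", "bin")        # stands in for /usr/bin
venv_bin = os.path.join(root, "opt", "venv", "bin")  # stands in for /opt/venv/bin
system_python = make_fake_python(system_bin)
venv_python = make_fake_python(venv_bin)

# venv directory appended after the image PATH: the system interpreter still wins.
appended = resolve_python([system_bin, venv_bin])
# venv directory prepended, as an ENV PATH line in the image does: the venv wins.
prepended = resolve_python([venv_bin, system_bin])

shutil.rmtree(root)  # remove the throwaway directories
```

Here `appended` resolves to the stand-in for /usr/bin/python and `prepended` to the stand-in for /opt/venv/bin/python, which is why the fix has to live in the image's own PATH.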

Misc

  • Default --container_image in slurm/run.py and the four
    kubernetes/{finetune,pretrain}_*.py launchers updated from
    nemo:24.12 → nemo:26.02.
  • slurm/README.md, kubernetes/README.md, and the data-processing pod
    template updated for the new image tag and sqsh filename
    (aws-nemo-26-02.sqsh).

Smoke test (1×B300 + 4×B300, container)

Run via Pyxis on a P6-B300 node, launching /opt/venv/bin/python (the
uv-managed venv that nemo:26.02 ships its packages in — system python
doesn't see them):

torch:           2.10.0a0+b558c986e8.nv25.11
cuda:            13.0
device:          NVIDIA B300 SXM6 AC
compute capability: (10, 3)              # sm_103 confirmed native
nemo:            2.7.1
megatron-core:   0.16.1
torch.distributed init_process_group(backend="nccl"): rank 0..7,  world  8   (1 node)
torch.distributed init_process_group(backend="nccl"): rank 0..31, world 32   (4 nodes)
smoke OK

The 32-rank NCCL init across 4 nodes plus the
import torch + nemo + megatron-core chain is the minimum end-to-end
check that the container is functional on B300 with the bumped stack.
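
The version-reporting half of such a smoke check can be approximated with the standard library alone. The sketch below is an assumption about how one might write it, not the script used for this PR: it covers only installed package versions (not the CUDA device query or the NCCL init), and the distribution names passed in are the caller's choice.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing(distributions):
    """Return the distributions that this interpreter cannot see."""
    found = installed_versions(distributions)
    return [name for name, version in found.items() if version is None]
```

Running `missing(["torch", "nemo_toolkit", "megatron-core"])` (names assumed) under both /usr/bin/python and /opt/venv/bin/python would make the difference between the two interpreters visible before any distributed launch is attempted.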

End-to-end recipe verified

The initial draft of this PR documented an upstream NeMo 2.7.1 ↔
megatron-core 0.16.1 API mismatch bundled inside the nemo:26.02
base image as a "Known limitation." The commit
"nemo: pin bundled megatron-core to 0.15.3" resolves it by replacing
the in-place /opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/
tree with megatron-core core_v0.15.3, a version that NeMo 2.7.x is
API-compatible with. The single pin addresses two upstream issues:

  1. get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=..., lr_mult=...):
    kwargs removed in megatron-core 0.16.x; present in 0.15.3.
  2. from megatron.core.dist_checkpointing.strategies import tensorstore
    — submodule removed in 0.16.x (tensorstore.py, zarr.py,
    two_stage.py, resharding.py all dropped); present in 0.15.3.

slurm/venv.sh got the matching pin so the host nemo-run venv stays
in lockstep with the container.
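
A keyword-argument removal like the first issue can be caught before iteration 0 by inspecting the callee's signature. The sketch below is a generic check under stated assumptions: `optimizer_old` and `optimizer_new` are hypothetical stand-ins whose parameter lists only imitate the two API generations described here, and they are not the real megatron-core signatures.

```python
import inspect


def unsupported_kwargs(func, kwarg_names):
    """Return the names in `kwarg_names` that `func` would reject as keywords."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing is unsupported.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {name for name, p in params.items() if p.kind in keyword_kinds}
    return [name for name in kwarg_names if name not in accepted]


# Hypothetical stand-ins for the older and newer optimizer entry points.
def optimizer_old(config, model_chunks, no_weight_decay_cond=None,
                  scale_lr_cond=None, lr_mult=1.0):
    return None


def optimizer_new(config, model_chunks, config_overrides=None):
    return None


# The three kwargs that this PR reports NeMo 2.7.1 passing.
NEMO_KWARGS = ["no_weight_decay_cond", "scale_lr_cond", "lr_mult"]
```

Calling `unsupported_kwargs` on the real `get_megatron_optimizer` inside the container would list which of the three kwargs the bundled megatron-core no longer accepts.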

End-to-end verification on B300 with the pinned stack:

| Scale | Iterations | Mean iter time (iters 5-49) | Steady-state | Loss |
|---|---|---|---|---|
| 1n × 8 GPU | 50 | ~8.3 s | ~8.3 s | 11.03 stable |
| 32n × 8 GPU | 50 | 0.3212 s | ~0.27 s | 11.03 stable |

The 32n run completed cleanly through 50 iterations without invoking
the upstream-broken code paths.

Test plan

  • docker build + enroot import succeeds from the new Slurm Dockerfile.
  • Container loads on B300 (sm_103) with all Dockerfile fixes;
    import torch / nemo / megatron-core succeed; 8-rank and 32-rank
    NCCL init work (1n and 4n).
  • docker build + docker push succeeds from the new Kubernetes
    Dockerfile (skipped here because EKS bring-up is out of scope; user
    verifies).
  • bash slurm/venv.sh produces a working venv on the head node;
    nemo.collections.llm imports cleanly.
  • python slurm/run.py --container_image ~/aws-nemo-26-02.sqsh --nodes 1 --max_steps 50
    runs a 50-iter Llama 3.1 8B pretrain end-to-end cleanly with the
    megatron-core 0.15.3 pin (loss stable at 11.03, ~8.3 s/iter at 1n).
  • Llama 3.1 8B pretrain at 32 nodes via run.py recipe runs cleanly
    end-to-end (mean iter time iters 5-49 = 0.3212 s/iter, loss stable).
    Intermediate scales (4/8/16 nodes) were not re-benchmarked in this
    window; not blocking the bump.

KeitaW added 12 commits April 29, 2026 13:48
Slurm Dockerfile (../Dockerfile):
  FROM nemo:25.07.00 -> 26.02
  GDRCOPY    v2.5    -> v2.5.2
  EFA        1.47.0  -> 1.48.0
  NCCL       v2.27.7-1 -> v2.30.4-1
  NCCL_TESTS v2.16.9 -> v2.18.3
  TRANSFORMERS 4.56.1 -> 4.57.6
  AWS_OFI_NCCL_VERSION ARG made explicit (was commented out) at v1.19.0

Kubernetes Dockerfile was a year behind (FROM nemo:25.04.01 / EFA 1.37 /
NCCL 2.23.4). Synced to the Slurm Dockerfile so both produce the same
software stack: same FROM, ARGs, and EFA-bundled aws-ofi-nccl. Drops
the explicit aws-ofi-nccl source build and replaces it with a
verification step that reads from /opt/amazon/ofi-nccl (where EFA >=1.47
puts it). LD_LIBRARY_PATH updated to the bundled location.

Slurm host-side venv (slurm/venv.sh):
  NeMo-Run    arbitrary commit -> v0.9.0 release tag
  torch       2.6.0  -> 2.10.0
  Megatron-LM arbitrary commit -> core_v0.17.0 (matches container)
  nemo_toolkit 2.1.0 -> 2.7.3 (matches the 26.02 container's nemo)
  nvidia-resiliency-ext 0.2.1 -> 0.4.1
Drops the mamba-ssm wheel install — the wheel was pinned to cu118 +
torch 2.0 right next to a torch 2.6 install (flatly incompatible),
nothing in the run.py path imports mamba_ssm, and pinning a working
wheel for current torch is out of scope here.

Updates the default `--container_image` in run.py and the four
kubernetes/{finetune,pretrain}_*.py launchers from nemo:24.12 to
nemo:26.02, and bumps the corresponding image tags / sqsh filenames in
slurm/README.md, kubernetes/README.md, and the data-processing pod
template.

The nemo:26.02 base image ships several CUDA add-on libraries
(libcusparseLt0, libcudnn-frontend, ...) that no installed Debian package
Depends: on — torch and transformer_engine load them via dlopen at
runtime. apt autoremove sees them as orphaned and deletes them, after
which the very first `import torch` inside the container crashes with:

    ImportError: libcusparseLt.so.0: cannot open shared object file:
    No such file or directory

Caught by the post-build smoke run on 1n B300; reproduces 100% of the
time on the prior image. Removing the autoremove step keeps the rest of
the cleanup intact (we still purge stale EFA pkgs explicitly before the
EFA installer runs) and lets the base image's library set survive.

The previous "drop apt autoremove" fix turned out to be the wrong
diagnosis — libcusparseLt0 is in fact installed (libcusparselt0-cuda-13
0.9.1.1-1), but the package places its runtime in a nested directory
/usr/lib/x86_64-linux-gnu/libcusparseLt/13/ and ships no
/etc/ld.so.conf.d snippet to register it. ldconfig -p shows libcusparse
(the regular cuSPARSE) but not libcusparseLt. `import torch` dlopens
libcusparseLt.so.0 and crashes:

    ImportError: libcusparseLt.so.0: cannot open shared object file

Adding the nested dir to LD_LIBRARY_PATH lets the dlopen resolve. Keep
the autoremove drop too — it's still the safer behavior even though it
wasn't the root cause of this specific crash.

The previous LD_LIBRARY_PATH-only approach didn't work, as confirmed via
runtime test inside the container. The directory contains
libcusparseLt.so.0.9.1.1 but NO libcusparseLt.so.0 SONAME symlink, so
torch's dlopen("libcusparseLt.so.0") fails even when the path is on
LD_LIBRARY_PATH (dlopen needs the exact filename, not the SONAME of
some other file in the dir).

The right fix is to write an /etc/ld.so.conf.d entry and run ldconfig,
which both registers the directory and creates the SONAME symlink. This
mirrors what the libcusparselt0-cuda-13 .deb postinst should have done
but didn't.
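
The role of the SONAME symlink can be shown with a small filesystem model. This is a simplified sketch, not the dynamic loader: `find_library` is a hypothetical helper that only checks for an exact filename in each search directory, which is the part of the lookup that fails here. The placeholder file is empty and is never loaded.

```python
import os
import tempfile


def find_library(filename, search_dirs):
    """Model a loader search: return the first entry named exactly `filename`."""
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate
    return None


lib_dir = tempfile.mkdtemp()
real_file = os.path.join(lib_dir, "libcusparseLt.so.0.9.1.1")
open(real_file, "wb").close()  # empty placeholder, not a real shared object

# Only the fully versioned file exists, so a lookup by SONAME finds nothing,
# even though the directory itself is on the search path.
before = find_library("libcusparseLt.so.0", [lib_dir])

# Add the SONAME symlink, the piece the package did not create.
os.symlink("libcusparseLt.so.0.9.1.1", os.path.join(lib_dir, "libcusparseLt.so.0"))
after = find_library("libcusparseLt.so.0", [lib_dir])
```

In this model `before` is `None` and `after` points at the new symlink, which resolves to the versioned file. Running ldconfig on a registered directory produces the same link from the SONAME recorded in the library.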

Verified the symlink-shim equivalent (mkdir+ln) lets `import torch`
succeed in a container smoke test against the same image; the ldconfig
approach baked into the Dockerfile is the durable form.

The previous core_v0.17.0 pin broke `from nemo.collections import llm`
because core_v0.17.0rc0 reorganized the inference wrapper module tree
and dropped:

    megatron.core.inference.model_inference_wrappers.inference_wrapper_config

which nemo_toolkit 2.7.3 still imports unconditionally. The container
itself (nemo:26.02) ships megatron-core 0.16.1, so pinning the host
venv to the same tag both fixes the import and keeps the host/container
megatron-core APIs in sync (so saved tensors, checkpoint formats, etc.
match across the boundary).
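
A pin like this can be sanity-checked before a full import by asking the import system whether the module path still resolves. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name is hypothetical, and the check only proves that the module can be located, not that its contents are compatible.

```python
import importlib.util

# Module path that nemo_toolkit 2.7.3 imports, as quoted in this commit message.
REQUIRED_MODULE = (
    "megatron.core.inference.model_inference_wrappers.inference_wrapper_config"
)


def module_available(dotted_name):
    """Return True if `dotted_name` can be located without importing the leaf module."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised (as ModuleNotFoundError) when a parent package is missing,
        # or as ImportError when importing a parent package fails.
        return False
```

Running `module_available(REQUIRED_MODULE)` in the host venv after `slurm/venv.sh` finishes would flag a mismatched megatron-core tag without waiting for a training launch to fail.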

Also drop the explicit opencc==1.1.6 pin: the resolver picks a
cp312-compatible version automatically when nemo_toolkit pulls it in
transitively, and our manual pin failed to find a wheel for Python 3.12.

In nemo_toolkit 2.7.x, `nemo.collections.nlp` was removed and the NLP
tokenizer utilities moved to `nemo.collections.common.tokenizers`. The
shipped run.py still imports from the old path:

    from nemo.collections.nlp.modules.common.tokenizer_utils
        import get_nmt_tokenizer

which crashes immediately with ModuleNotFoundError on the new container
and venv. The function itself is unchanged at the new location:

    nemo/collections/common/tokenizers/tokenizer_utils.py
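
This PR simply repoints the import. For scripts that must run against both old and new nemo_toolkit versions, an alternative is a small fallback shim. The sketch below is that alternative, not what run.py does: `import_first_available` is a hypothetical helper, and only the two module paths are taken from this commit message.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that imports cleanly from (module, attribute) pairs."""
    errors = []
    for module_name, attribute in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attribute}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# New location first, pre-2.7 location second.
GET_NMT_TOKENIZER_CANDIDATES = [
    ("nemo.collections.common.tokenizers.tokenizer_utils", "get_nmt_tokenizer"),
    ("nemo.collections.nlp.modules.common.tokenizer_utils", "get_nmt_tokenizer"),
]
```

With nemo installed, `import_first_available(GET_NMT_TOKENIZER_CANDIDATES)` would return the function from whichever location exists. A single pinned stack, as in this PR, does not need the fallback.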
nemo:26.02 ships its Python packages (torch, nemo, megatron-core,
nemo_run, ...) into /opt/venv/, a uv-managed venv. The system
/usr/bin/python has none of them. nemo-run's ft_launcher invokes
`/usr/bin/python -m nemo_run.core.runners.fdl_runner` and crashes:

    ModuleNotFoundError: No module named 'nemo_run'

Adding /opt/venv/bin to PATH at the front routes `python` to the venv
interpreter that has nemo_run installed. env_vars.json is the right place
for this, since it's loaded by run.py and forwarded into every container task.

The previous commit set PATH to /opt/venv/bin:... but dropped
/opt/slurm/bin, which broke the SBATCH wrapper script — line 77
'srun: command not found'. Add /opt/slurm/bin back so both the
host wrapper and the container have working PATHs.

Fix torchelastic worker spawn. The nemo:26.02 image's default PATH lists
/usr/bin before /opt/venv/bin, so `which python` returns /usr/bin/python
inside the container. ft_launcher's shebang correctly points to the venv
python, but when torchelastic spawns workers it uses `python` from PATH —
and /usr/bin/python doesn't have nemo_run, megatron, etc. installed.

Setting PATH via env_vars.json doesn't help because Pyxis appends rather
than prepends container env. The fix has to be in the image's ENV PATH.

The previous NVTE_FUSED_ATTN=0 value (set when the test case was on nemo:24.07) hard-disabled
the fused-attention backend. transformer_engine in nemo:26.02 asserts that
NVTE_FUSED_ATTN matches the chosen attention backend ('auto' wants 1):

  AssertionError: NVTE_FUSED_ATTN set to 0, but expected 1 for attention
  backend type auto. unset NVTE_FLASH_ATTN, NVTE_FUSED_ATTN and
  NVTE_UNFUSED_ATTN. Use the --attention-backend argument...

Fix: remove the override and let TE pick the backend.
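
Removing the override amounts to filtering the backend variables out of whatever environment mapping is forwarded to the job. The sketch below assumes that mapping is a flat JSON object. The three variable names come from the assertion message quoted above; the sample content and the helper name are hypothetical.

```python
import json

# Variable names taken from the assertion message quoted above.
TE_BACKEND_OVERRIDES = ("NVTE_FLASH_ATTN", "NVTE_FUSED_ATTN", "NVTE_UNFUSED_ATTN")


def strip_te_backend_overrides(env):
    """Return a copy of `env` without the Transformer Engine backend overrides."""
    return {key: value for key, value in env.items() if key not in TE_BACKEND_OVERRIDES}


# Hypothetical environment file content, for illustration only.
example = json.loads('{"NVTE_FUSED_ATTN": "0", "NCCL_DEBUG": "INFO"}')
cleaned = strip_te_backend_overrides(example)
```

After the filter, `cleaned` keeps only the unrelated variables, so Transformer Engine is free to choose its own attention backend.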
The nvcr.io/nvidia/nemo:26.02 base image ships an internally
inconsistent component pair: NeMo 2.7.1 source at /opt/NeMo and
megatron-core 0.16.1 at /opt/Megatron-Bridge/3rdparty/Megatron-LM/.
NeMo 2.7.1 was developed against an older megatron-core (~0.14/0.15)
and calls APIs that 0.16.x removed:

  1. /opt/NeMo/nemo/lightning/_strategy_lib.py:712 calls
     get_megatron_optimizer(no_weight_decay_cond=..., scale_lr_cond=...,
     lr_mult=...). 0.16.x removed those three kwargs (they're now
     conditional via `config_overrides`). Hits as
     `TypeError: get_megatron_optimizer() got an unexpected keyword
     argument 'no_weight_decay_cond'` at iter 0.

  2. /opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py:39 imports
     `tensorstore` from megatron.core.dist_checkpointing.strategies.
     0.16.x removed strategies/tensorstore.py (along with zarr.py,
     two_stage.py, resharding.py). Hits as `ImportError: cannot import
     name 'tensorstore' from megatron.core.dist_checkpointing.strategies`.

Both are NVIDIA upstream container packaging bugs; the proper fix would
be `nemo:26.02.x` shipping a NeMo source that matches the bundled
megatron-core. Until then, this commit pins the bundled megatron-core
back to core_v0.15.3 (latest 0.15.x), where both APIs still exist.

  - Dockerfile: new `MEGATRON_CORE_VERSION=core_v0.15.3` ARG and a RUN
    step that git-clones the pinned tag and replaces the in-place
    /opt/Megatron-Bridge/3rdparty/Megatron-LM/megatron/core/ tree.
  - slurm/venv.sh: matching pin for the host-side nemo-run venv (was
    0.16.1 to match the broken container; now 0.15.3 to match the
    pinned container).

Verified end-to-end:
  - Smoke at 1n × 8 B300: 50 iter Llama 3.1 8B pretrain runs cleanly,
    loss stable at 11.03, ~8.3 s/iter (1n).
  - Bench at 32n × 8 B300: 50 iter pretrain runs cleanly,
    mean iter time iters 5-49 = 0.3212 s/iter, steady-state ~0.27 s/iter.

Removes the "Known upstream limitation: slurm/run.py recipe + nemo:26.02"
caveat from the PR description.
@KeitaW KeitaW requested a review from pbelevich May 8, 2026 10:29
@KeitaW KeitaW marked this pull request as ready for review May 8, 2026 10:29
@KeitaW KeitaW merged commit 02be70c into main May 15, 2026
5 checks passed
@KeitaW KeitaW deleted the kw/nemo-26.02-stack-sync branch May 15, 2026 07:36