nccl-tests: drop unnecessary NCCL/MPI env vars per AWS NCCL/EFA review #1077

Merged: KeitaW merged 1 commit into main from kw/env-var-hygiene-nccl-tests on May 6, 2026

@KeitaW (Collaborator) commented May 1, 2026

Summary

Per AWS NCCL/EFA team review, the following 5 settings in
micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch are
unnecessary on current EFA releases:

  • NCCL_BUFFSIZE=8388608
  • NCCL_P2P_NET_CHUNKSIZE=524288
  • NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so
  • --mca pml ^ucx
  • --mca btl tcp,self (and the related --mca btl_tcp_if_exclude)

This PR removes them from all five places they appear under
micro-benchmarks/nccl-tests/:

micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch                              | 16 ----
micro-benchmarks/nccl-tests/slurm/nccl-tests-container.sbatch                        | 12 ----
micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests/nccl-tests-ami.sbatch    |  3 ---
micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml                               | 15 -----
micro-benchmarks/nccl-tests/kubernetes/nccl-tests-gb200.yaml                         | 12 ----
                                                                                       58 deletions

Kept (legitimate, not under review): FI_PROVIDER, FI_EFA_FORK_SAFE,
NCCL_DEBUG, NCCL_SOCKET_IFNAME, LD_LIBRARY_PATH, NCCL_NVLS_ENABLE,
NCCL_MNNVL_ENABLE.

Empirical verification

Hardware: SageMaker HyperPod Slurm cluster, 32 (and 8) × p6-b300.48xlarge
(B300, sm_103). Image: post-#1070 stack (CUDA 13.0.2 + EFA 1.48 +
AWS_OFI_NCCL v1.19 + NCCL v2.30.4 + nccl-tests v2.18.3). Same allocation
across A/D pairs.

Baseline ("A") = pre-PR setting. Stripped ("D") = with the 5 settings
removed. Bench command: all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100
and alltoall_perf with the same flags.
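
As a rough illustration of the A/D toggle described above, the sketch below sets or clears the three NCCL environment variables before a run. The function names `set_variant` and `bench_cmd` are hypothetical and not part of this PR. The two mpirun MCA flags are command-line arguments rather than environment variables, so this sketch does not cover them.

```shell
# Hypothetical helpers for an A/D comparison; not taken from the PR's scripts.

# set_variant A  -> export the three pre-PR NCCL tunables (baseline)
# set_variant D  -> make sure they are unset (stripped)
set_variant() {
    case "$1" in
        A)
            export NCCL_BUFFSIZE=8388608
            export NCCL_P2P_NET_CHUNKSIZE=524288
            export NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so
            ;;
        D)
            unset NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_TUNER_PLUGIN
            ;;
        *)
            echo "unknown variant: $1 (expected A or D)" >&2
            return 1
            ;;
    esac
}

# Print the benchmark command line quoted in the PR body for a given binary.
bench_cmd() {
    printf '%s -b 8 -e 16G -f 2 -g 1 -c 1 -n 100\n' "$1"
}
```

Calling `set_variant A` or `set_variant D` before launching the command printed by `bench_cmd all_reduce_perf` keeps the two runs identical apart from the tunables, provided both run in the same allocation.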

16-GiB peak OOP busbw (GB/s):

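| Nodes | Collective | A baseline | D stripped | Δ (D vs A) |
|------:|------------|-----------:|-----------:|-----------:|
| 8     | all_reduce |     767.13 |     767.17 | +0.005% ✓  |
| 8     | alltoall   |      67.56 |      67.16 | −0.59% ✓   |
| 32    | all_reduce |     764.86 |     764.78 | −0.01% ✓   |
| 32    | alltoall   |      52.02 |      57.64 | +10.81% ★  |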

★ Removing the NCCL tunables improves 32-node alltoall busbw by
10.8% at 16 GiB. The baseline (A) value of 52.02 GB/s extends the alltoall
regression already documented in #1070's body (2n=192 → 4n=81 → 32n=52);
stripping the tunables reverses the trend at 32 nodes.
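
The Δ column can be approximately reproduced from the rounded busbw values in the table. The helper below, `pct_delta`, is a hypothetical name used only for illustration. Because the table inputs are rounded to two decimals, the 32-node alltoall row comes out at about +10.80% rather than the reported +10.81%; presumably the reported figure was computed from unrounded measurements, which is an assumption.

```shell
# Hypothetical helper: percentage change of D (stripped) relative to A (baseline).
# Usage: pct_delta <baseline_busbw> <stripped_busbw>
pct_delta() {
    awk -v a="$1" -v d="$2" 'BEGIN {
        if (a == 0) {
            print "baseline must be non-zero" > "/dev/stderr"
            exit 1
        }
        # Signed percentage with two decimals, e.g. +10.80 or -0.59
        printf "%+.2f\n", (d - a) / a * 100
    }'
}

# Examples using the rounded table values:
#   pct_delta 52.02 57.64   prints +10.80
#   pct_delta 67.56 67.16   prints -0.59
```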

Most likely root cause: with NCCL_TUNER_PLUGIN set, the AWS-OFI tuner
selects an alltoall algorithm/protocol that was tuned for older NCCL
versions, whereas NCCL v2.30's built-in tuner selects a better one for the
same payload. The 8n alltoall result (−0.59%) and both all_reduce results
(within ±0.01%) indicate the change is otherwise neutral.

Why the MCA flag removal is a code-audit decision (not in the table)

--mca pml ^ucx and --mca btl tcp,self are mpirun flags. The shipped
nccl-tests-container.sbatch uses srun --mpi=pmix, so MCA flags would be
inert in that path regardless. AWS NCCL/EFA team confirmed they are not
necessary in either path on current EFA releases (--mca pml ^ucx only
matters if a UCX install is broken; --mca btl tcp,self is not helpful).

Out of scope (deliberate)

  • 17–18 other matches across 1.architectures/aws-batch/,
    3.test_cases/pytorch/deepspeed/, 3.test_cases/megatron/bionemo/,
    validation_and_observability/, etc. Those test cases aren't in the
    current bump scope; a follow-up sweep PR will address them.
  • NCCL_SOCKET_IFNAME=^docker,lo,veth — not flagged by the review;
    conservatively kept.
  • The nemo:26.02 test case's OMPI_MCA_* Dockerfile ENVs — handled in a
    sibling PR (different scope, branches off main, doesn't depend on this
    one).
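
The follow-up sweep mentioned above could start from a recursive search for the flagged settings. This is only a sketch: the function name `find_leftover_settings` is hypothetical, and the search pattern is an assumption derived from the five settings listed in this PR, so it may miss variants such as the OMPI_MCA_* Dockerfile ENVs.

```shell
# Hypothetical helper: list file:line matches for the settings flagged in this PR.
# Usage: find_leftover_settings [directory]   (defaults to the current directory)
find_leftover_settings() {
    grep -rnE \
        -e 'NCCL_BUFFSIZE|NCCL_P2P_NET_CHUNKSIZE|NCCL_TUNER_PLUGIN|--mca[[:space:]]+(pml|btl)' \
        "${1:-.}"
}
```

Running it from the repository root prints one `path:line:content` entry per match and returns a non-zero status when nothing is found, which makes it usable as a simple check in a script.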

Test plan

  • Build the post-#1070 nccl-tests image on a B300 node and import it to enroot (#1070: "nccl-tests: bump to CUDA 13.0.2 / NCCL 2.30.4 and add sm_103 (B300)").
  • Run all_reduce_perf and alltoall_perf at 8 and 32 nodes, with
    and without the 5 settings; same allocation across the A/D pair.
  • Compare 16-GiB peak OOP busbw — table above.
  • Verify avg busbw (across all sizes) is unchanged: A 32n alltoall
    avg = 20.6382 GB/s; D 32n alltoall avg = 20.643 GB/s (Δ ≈ 0).
  • CI lint / static analysis (no logic changes — pure deletions).
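
The average-busbw comparison in the test plan could be scripted roughly as follows. This sketch assumes each run's output file contains the nccl-tests summary line of the form `# Avg bus bandwidth    : <value>`; the function names `avg_busbw` and `within_tolerance`, and any tolerance value passed to them, are hypothetical.

```shell
# Hypothetical helper: extract the average bus bandwidth from an nccl-tests log.
# Assumes a summary line like "# Avg bus bandwidth    : 20.6382".
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$1"
}

# Hypothetical helper: succeed when |D - A| / A is within tol percent.
# Usage: within_tolerance <baseline> <candidate> <tolerance_percent>
within_tolerance() {
    awk -v a="$1" -v d="$2" -v tol="$3" 'BEGIN {
        delta = (d - a) / a * 100
        if (delta < 0) delta = -delta
        exit !(delta <= tol)
    }'
}
```

With the averages quoted above, `within_tolerance 20.6382 20.643 1` succeeds, since the two values differ by roughly 0.02%.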

Brian Barrett (AWS NCCL/EFA team) flagged the following 5 settings in
micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch as unnecessary:

  - NCCL_BUFFSIZE=8388608
  - NCCL_P2P_NET_CHUNKSIZE=524288
  - NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so
  - --mca pml ^ucx
  - --mca btl tcp,self  (and the related --mca btl_tcp_if_exclude)

This commit removes them from all five places they appear under
micro-benchmarks/nccl-tests/ (slurm/{,topology-aware-nccl-tests/}*.sbatch
and kubernetes/*.yaml).

Empirical verification at 8 and 32 nodes on B300 (HyperPod Slurm,
EFA 1.48 + NCCL v2.30.4 + AWS_OFI_NCCL v1.19, post-PR-#1070 stack)
shows the change is at worst neutral and at 32-node alltoall a
+10.8% busbw improvement at 16 GiB:

| Nodes | Collective | Baseline | Stripped | Delta     |
|------:|------------|---------:|---------:|----------:|
| 8     | all_reduce |   767.13 |   767.17 | +0.005%   |
| 8     | alltoall   |    67.56 |    67.16 | -0.59%    |
| 32    | all_reduce |   764.86 |   764.78 | -0.01%    |
| 32    | alltoall   |    52.02 |    57.64 | **+10.81%** |

(GB/s OOP busbw at 16 GiB peak; same allocation pre/post; 32n
alltoall gain attributed to NCCL_TUNER_PLUGIN routing alltoall via a
stale algorithm/protocol selection that NCCL v2.30's built-in tuner
does not use.)

OpenMPI MCA flags (--mca pml ^ucx, --mca btl tcp,self) are mpirun-
specific. nccl-tests-container.sbatch uses srun --mpi=pmix and never
sets them. AWS NCCL/EFA team confirmed they are not necessary on
current EFA releases.

Kept: FI_PROVIDER, FI_EFA_FORK_SAFE, NCCL_DEBUG, NCCL_SOCKET_IFNAME,
LD_LIBRARY_PATH, NCCL_NVLS_ENABLE, NCCL_MNNVL_ENABLE.
@@ -35,18 +35,6 @@ export FI_EFA_FORK_SAFE=1
## NCCL Environment variables
export NCCL_DEBUG=INFO
Review comment (Contributor):

This shouldn't be set to INFO. Change it to WARN and add a comment telling users to switch to INFO if they need detailed output; INFO-level logging can skew the final results.
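
A minimal sketch of what this suggestion could look like in the sbatch script. It illustrates the reviewer's request and is not the merged change; it also assumes the job inherits the submitting shell's environment, so that a value set at submit time takes precedence.

```shell
## NCCL Environment variables
# Default to WARN so logging does not distort benchmark numbers.
# Set NCCL_DEBUG=INFO in the environment before submitting if detailed
# initialization output is needed (for example on a first run).
export NCCL_DEBUG="${NCCL_DEBUG:-WARN}"
```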

@paragao (Contributor) left a comment

You can only remove those if the EFA Installer in use is 1.47.0 or higher. Earlier EFA Installer versions, especially those before 1.43.0, do not set those flags properly.

Please make sure you also update the Dockerfile to install EFA Installer 1.47.0.

A bonus would be to add instructions on how to check that EFA, AWS-OFI-NCCL, and the optimal NCCL flags were set automatically.

Suggestions on that:

  • For the first run, enable NCCL_DEBUG=INFO.
  • Check which aws-ofi-nccl version was loaded: grep aws-ofi <job_id>.out
  • Check that efa is used as the provider: grep provider <job_id>.out
  • Check that NCCL_BUFFSIZE and NCCL_P2P_NET_CHUNKSIZE are set properly: cat <job_id>.out | grep -E "BUFFSIZE|CHUNKSIZE"
  • If everything is OK, remove the NCCL_DEBUG export.
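
The checks suggested in this comment could be bundled into one small script, sketched below. The function name `check_nccl_log` is hypothetical, and the patterns are the ones from the list above; whether those patterns are sufficient evidence of a correct setup is not established here.

```shell
# Hypothetical helper: run the three grep checks from the list above on a job log.
# Usage: check_nccl_log <job_id>.out
# Returns the number of checks that found no matching line (0 means all matched).
check_nccl_log() {
    log="$1"
    missing=0
    for pattern in 'aws-ofi' 'provider' 'BUFFSIZE|CHUNKSIZE'; do
        echo "== lines matching: $pattern =="
        if ! grep -E -e "$pattern" "$log"; then
            echo "(none found)"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

The log must come from a run with NCCL_DEBUG=INFO, as the first step in the list says; otherwise the lines being searched for would not be printed.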

@bwbarrett commented:

@paragao that's not correct, but also we need to push customers to update rather than hold their hands on this.

@KeitaW merged commit a59b4db into main on May 6, 2026 (5 checks passed)
@KeitaW deleted the kw/env-var-hygiene-nccl-tests branch on May 6, 2026 00:57
@paragao (Contributor) commented May 6, 2026

> @paragao that's not correct, but also we need to push customers to update rather than hold their hands on this.

@bwbarrett which part is not correct: the EFA Installer requirement or the checks? Can you please let us know the best approach here?

The last time I did a full check on a cluster, I used those commands and they worked. Since we are building a full cluster checkup tool that includes EFA, your thoughts on best practices are welcome.

4 participants