nccl-tests: drop unnecessary NCCL/MPI env vars per AWS NCCL/EFA review #1077
Brian Barrett (AWS NCCL/EFA team) flagged the following 5 settings in
micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch as unnecessary:
- NCCL_BUFFSIZE=8388608
- NCCL_P2P_NET_CHUNKSIZE=524288
- NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so
- --mca pml ^ucx
- --mca btl tcp,self (and the related --mca btl_tcp_if_exclude)
This commit removes them from all five places they appear under
micro-benchmarks/nccl-tests/ (slurm/{,topology-aware-nccl-tests/}*.sbatch
and kubernetes/*.yaml).
Empirical verification at 8 and 32 nodes on B300 (HyperPod Slurm,
EFA 1.48 + NCCL v2.30.4 + AWS_OFI_NCCL v1.19, post-PR-#1070 stack)
shows the change is neutral to within 0.6% everywhere except 32-node
alltoall, where busbw at 16 GiB improves by 10.8%:
| Nodes | Collective | Baseline | Stripped | Delta |
|------:|------------|---------:|---------:|----------:|
| 8 | all_reduce | 767.13 | 767.17 | +0.005% |
| 8 | alltoall | 67.56 | 67.16 | -0.59% |
| 32 | all_reduce | 764.86 | 764.78 | -0.01% |
| 32 | alltoall | 52.02 | 57.64 | **+10.81%** |
(GB/s OOP busbw at 16 GiB peak; same allocation pre/post; 32n
alltoall gain attributed to NCCL_TUNER_PLUGIN routing alltoall via a
stale algorithm/protocol selection that NCCL v2.30's built-in tuner
does not use.)
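The Delta column is the relative change of the stripped run against the baseline. A minimal sketch of that arithmetic follows, assuming the rounded GB/s values from the table as inputs; the function name `pct_delta` is illustrative and not part of the repository. Because the inputs are already rounded to two decimals, the last digit can differ slightly from the table.

```shell
#!/usr/bin/env bash
# Relative change of the stripped run versus the baseline, in percent.
# pct_delta is a hypothetical helper; inputs are busbw values in GB/s.
pct_delta() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%+.2f%%\n", (new - base) / base * 100 }'
}

pct_delta 52.02 57.64   # 32-node alltoall
pct_delta 67.56 67.16   # 8-node alltoall
```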
OpenMPI MCA flags (--mca pml ^ucx, --mca btl tcp,self) are mpirun-
specific. nccl-tests-container.sbatch uses srun --mpi=pmix and never
sets them. AWS NCCL/EFA team confirmed they are not necessary on
current EFA releases.
Kept: FI_PROVIDER, FI_EFA_FORK_SAFE, NCCL_DEBUG, NCCL_SOCKET_IFNAME,
LD_LIBRARY_PATH, NCCL_NVLS_ENABLE, NCCL_MNNVL_ENABLE.
    @@ -35,18 +35,6 @@ export FI_EFA_FORK_SAFE=1
    ## NCCL Environment variables
    export NCCL_DEBUG=INFO
This shouldn't be set to INFO. Change it to WARN and add a comment telling users to switch to INFO if they need detailed output. INFO-level logging can take a toll on the final results.
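One way the requested change could look in the sbatch script is sketched below. The `${NCCL_DEBUG:-WARN}` default is an assumption on my part, not something the reviewer asked for: it lets a user override the level from the submitting environment without editing the file.

```shell
## NCCL Environment variables
# Default to WARN so logging does not perturb the benchmark numbers.
# Set NCCL_DEBUG=INFO in the submitting environment (or edit this line)
# when detailed NCCL/EFA initialization output is needed.
export NCCL_DEBUG="${NCCL_DEBUG:-WARN}"
```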
paragao left a comment
You can only remove those if the EFA Installer used is 1.47.0 or higher. Previous EFA Installer versions, especially those before 1.43.0, do not set those flags properly.
Please make sure you also update the Dockerfile to install EFA Installer 1.47.0.
A bonus addition would be instructions on how to check that EFA, AWS-OFI-NCCL, and the optimal NCCL flags were set automatically.
Suggestions on that:
- For the first run, enable `NCCL_DEBUG=INFO`
- Check the aws-ofi-nccl version loaded: `grep aws-ofi <job_id>.out`
- Check that efa is used as the provider: `grep provider <job_id>.out`
- Check that NCCL_BUFFSIZE and NCCL_P2P_NET_CHUNKSIZE are set properly: `cat <job_id>.out | grep -E "BUFFSIZE|CHUNKSIZE"`
- If everything is OK, remove the `NCCL_DEBUG` export
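The checks listed above could be wrapped in one helper, sketched here. The function name `check_nccl_log` and the section labels are illustrative; the grep patterns are the ones from the suggestions, and what the matching lines actually look like depends on the NCCL and plugin versions in use. Whether these checks are the recommended approach is still being discussed in this thread.

```shell
#!/usr/bin/env bash
# Summarize what a job log written with NCCL_DEBUG=INFO reports about the
# EFA/NCCL stack. check_nccl_log is a hypothetical helper.
check_nccl_log() {
  local log="$1"
  [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }

  echo "== aws-ofi-nccl lines =="
  grep 'aws-ofi' "$log" | head -n 5

  echo "== provider lines =="
  grep 'provider' "$log" | head -n 5

  echo "== buffer/chunk size lines =="
  grep -E 'BUFFSIZE|CHUNKSIZE' "$log" | head -n 5

  # An empty section means no match; that is reported, not treated as an error.
  return 0
}
```

Usage: `check_nccl_log <job_id>.out` after a first run with `NCCL_DEBUG=INFO`.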
@paragao that's not correct, but also we need to push customers to update rather than hold their hands on this.
@bwbarrett which part is not correct: the EFA Installer requirement or the checks? Can you let us know the best approach here? The last time I did a full check on a cluster I used those commands and they worked. Since we are building a full cluster checkup tool that includes EFA, your thoughts on best practices are welcome.
Summary
Per AWS NCCL/EFA team review, the following 5 settings in
`micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch` are unnecessary on current EFA releases:
- `NCCL_BUFFSIZE=8388608`
- `NCCL_P2P_NET_CHUNKSIZE=524288`
- `NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so`
- `--mca pml ^ucx`
- `--mca btl tcp,self` (and the related `--mca btl_tcp_if_exclude`)
This PR removes them from all five places they appear under
`micro-benchmarks/nccl-tests/` (`slurm/{,topology-aware-nccl-tests/}*.sbatch` and `kubernetes/*.yaml`).
Kept (legitimate, not under review):
`FI_PROVIDER`, `FI_EFA_FORK_SAFE`, `NCCL_DEBUG`, `NCCL_SOCKET_IFNAME`, `LD_LIBRARY_PATH`, `NCCL_NVLS_ENABLE`, `NCCL_MNNVL_ENABLE`.
Empirical verification
Hardware: SageMaker HyperPod Slurm cluster, 32 (and 8) × p6-b300.48xlarge
(B300, sm_103). Image: post-#1070 stack (CUDA 13.0.2 + EFA 1.48 +
AWS_OFI_NCCL v1.19 + NCCL v2.30.4 + nccl-tests v2.18.3). Same allocation
across A/D pairs.
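Baseline ("A") = pre-PR settings. Stripped ("D") = with the 5 settings
removed. Bench command:
`all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100` and
`alltoall_perf` with the same flags.
16-GiB peak OOP busbw (GB/s):
| Nodes | Collective | Baseline (A) | Stripped (D) | Delta |
|------:|------------|-------------:|-------------:|----------:|
| 8 | all_reduce | 767.13 | 767.17 | +0.005% |
| 8 | alltoall | 67.56 | 67.16 | -0.59% |
| 32 | all_reduce | 764.86 | 764.78 | -0.01% |
| 32 | alltoall | 52.02 | 57.64 | **+10.81%** |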
★ Removing the NCCL tunables improves 32-node alltoall by
+10.8% at 16 GiB. The baseline value (52.02 GB/s) extends the alltoall regression
already documented in #1070's body (2n=192 → 4n=81 → 32n=52); stripping
the tunables recovers part of that drop at 32 nodes (57.64 GB/s).
Most likely root cause:
`NCCL_TUNER_PLUGIN` selects an alltoall algorithm/protocol via the AWS-OFI tuner that was tuned for older NCCL
versions; NCCL v2.30 selects a better algorithm/protocol for the same
payload via its built-in tuner. The 8n alltoall result (-0.59%) and both
all_reduce results (within 0.01%) indicate the change is otherwise neutral.
Why the MCA flag removal is a code-audit decision (not in the table)
`--mca pml ^ucx` and `--mca btl tcp,self` are `mpirun` flags. The shipped `nccl-tests-container.sbatch` uses `srun --mpi=pmix`, so MCA flags would be inert in that path regardless. AWS NCCL/EFA team confirmed they are not
necessary in either path on current EFA releases (`--mca pml ^ucx` only
matters if a UCX install is broken; `--mca btl tcp,self` is not helpful).
Out of scope (deliberate)
- `1.architectures/aws-batch/`, `3.test_cases/pytorch/deepspeed/`, `3.test_cases/megatron/bionemo/`, `validation_and_observability/`, etc. Those test cases aren't in the current bump scope; a follow-up sweep PR will address them.
- `NCCL_SOCKET_IFNAME=^docker,lo,veth`: not flagged by the review; conservatively kept.
- `nemo:26.02` test case's `OMPI_MCA_*` Dockerfile ENVs: handled in a sibling PR (different scope, branches off `main`, doesn't depend on this one).
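For the follow-up sweep, remaining uses of these settings elsewhere in the repository could be located with a recursive grep along the following lines. This is a sketch: the function name `find_legacy_nccl_settings` is illustrative, and the pattern is an assumption that covers the five settings named in the summary plus the `OMPI_MCA_*` environment-variable form mentioned above.

```shell
#!/usr/bin/env bash
# List remaining uses of the settings removed in this PR under a directory.
# find_legacy_nccl_settings is a hypothetical helper. grep exits with
# status 1 when nothing matches, which here means the tree is clean.
find_legacy_nccl_settings() {
  local root="${1:-.}"
  grep -rnE \
    -e 'NCCL_BUFFSIZE|NCCL_P2P_NET_CHUNKSIZE|NCCL_TUNER_PLUGIN' \
    -e '--mca[[:space:]]+(pml|btl)' \
    -e 'OMPI_MCA_(pml|btl)' \
    "$root"
}
```

Usage: `find_legacy_nccl_settings 3.test_cases/` prints `file:line:match` for each hit. The `--mca btl` pattern also matches `--mca btl_tcp_if_exclude`.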
Test plan
`all_reduce_perf` and `alltoall_perf` at 8 and 32 nodes, with
and without the 5 settings; same allocation across the A/D pair.
avg = 20.6382 GB/s; D 32n alltoall avg = 20.643 GB/s (Δ ≈ 0).
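
An A/D pair like the one in this test plan can be run inside a single allocation by toggling the environment between launches. The sketch below covers only the three NCCL environment variables, using the values this PR removes. The function name `set_variant` and the `BENCH_ARGS` variable are illustrative, and the two MCA flags are launcher arguments that would have to be toggled on the `mpirun` command line separately.

```shell
#!/usr/bin/env bash
# Select the environment for one half of an A/D pair.
#   A = baseline: export the three NCCL tunables with the removed values.
#   D = stripped: unset them so NCCL and the plugin choose their defaults.
# set_variant is a hypothetical helper.
set_variant() {
  case "$1" in
    A)
      export NCCL_BUFFSIZE=8388608
      export NCCL_P2P_NET_CHUNKSIZE=524288
      export NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/libnccl-ofi-tuner.so
      ;;
    D)
      unset NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_TUNER_PLUGIN
      ;;
    *)
      echo "usage: set_variant A|D" >&2
      return 2
      ;;
  esac
}

# Benchmark arguments quoted in the summary.
BENCH_ARGS="-b 8 -e 16G -f 2 -g 1 -c 1 -n 100"
```

Call `set_variant A`, launch `all_reduce_perf` and `alltoall_perf` with `$BENCH_ARGS` through the script's launcher, then call `set_variant D` and repeat on the same nodes.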