Skip to content

Conversation

@bluecrayon52
Copy link
Contributor

Issue #, if available:

Description of changes:
Updated OFI NCCL plugin path to /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu
Updated CUDA toolkit path to /usr/local/cuda/lib64
Updated OFI NCCL tuner plugin path to /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Contributor

@amanshanbhag amanshanbhag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value: $PATH:/opt/amazon/efa/bin:/usr/bin
- name: LD_LIBRARY_PATH
value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add /opt/amazon/ofi-nccl/lib/aarch64-linux-gnu for compatibility with aarch64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we do for NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so?

- NCCL_P2P_NET_CHUNKSIZE=524288
- -x
- NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
- NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so
Copy link
Contributor

@KeitaW KeitaW Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible solution: adding a tuner selection mechanism here?

+              - /bin/bash
+              - -lc
+              - |
+                # NCCL TUNER PLUGIN FALLBACK
+                TUNER_CANDIDATES=(
+                  /opt/amazon/ofi-nccl/lib/*/libnccl-ofi-tuner.so
+                )
+                for c in "${TUNER_CANDIDATES[@]}"; do
+                  for f in $c; do
+                    if [ -f "$f" ]; then export NCCL_TUNER_PLUGIN="$f"; break 2; fi
+                  done
+                done
+                exec /opt/amazon/openmpi/bin/mpirun --allow-run-as-root 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need libnccl-tuner.so, libnccl-tuner-ncclnet.so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, just libnccl-ofi-tuner.so should be enogh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea seems good though. What if we just did

            command:
            - /bin/bash
            - -lc
            - | 
              export NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/$(uname -m)-linux-gnu/libnccl-ofi-tuner.so
              exec /opt/amazon/openmpi/bin/mpirun \
                --allow-run-as-root \
...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using the AWS EFA Installer, there is no real reason to set LD_LIBRARY_PATH both for the ofi-nccl and the libfabric as well (/opt/amazon/efa/lib) as those are already set in /etc/ld.so.conf.d.
For choosing the tuner itself, you could just set NCCL_TUNER_PLUGIN=ofi (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-tuner-plugin)

ls -ll /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/
total 332
-rw-r--r--. 1 root root 339744 Aug 12 23:37 libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-net.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-ofi-tuner.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-tuner-ofi.so -> libnccl-net-ofi.so

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there is no reason to add $LD_LIBRARY_PATH, this never gets interpolated as you assume it does.
It's better to make sure the Dockerfile have the custom paths for libs we want to use as defaults in /etc/ld.so.conf.d as configuration files (i will create a PR for this, as well as adding back the ability to use a custom build aws-ofi-nccl for testing purposes)

@nghtm
Copy link
Contributor

nghtm commented Oct 30, 2025

@bluecrayon52 can we close out on comments and merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants