-
Notifications
You must be signed in to change notification settings - Fork 148
update nccl-tests.yaml paths #878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. Please also make the same changes to
| value: $PATH:/opt/amazon/efa/bin:/usr/bin | ||
| - name: LD_LIBRARY_PATH | ||
| value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH | ||
| value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/usr/local/cuda/lib64:$LD_LIBRARY_PATH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add /opt/amazon/ofi-nccl/lib/aarch64-linux-gnu for compatibility with aarch64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do for NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so?
| - NCCL_P2P_NET_CHUNKSIZE=524288 | ||
| - -x | ||
| - NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so | ||
| - NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible solution: adding a tuner selection mechanism here?
+ - /bin/bash
+ - -lc
+ - |
+ # NCCL TUNER PLUGIN FALLBACK
+ TUNER_CANDIDATES=(
+ /opt/amazon/ofi-nccl/lib/*/libnccl-ofi-tuner.so
+ )
+ for c in "${TUNER_CANDIDATES[@]}"; do
+ for f in $c; do
+ if [ -f "$f" ]; then export NCCL_TUNER_PLUGIN="$f"; break 2; fi
+ done
+ done
+ exec /opt/amazon/openmpi/bin/mpirun --allow-run-as-root
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need libnccl-tuner.so, libnccl-tuner-ncclnet.so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, just libnccl-ofi-tuner.so should be enogh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea seems good though. What if we just did
command:
- /bin/bash
- -lc
- |
export NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/$(uname -m)-linux-gnu/libnccl-ofi-tuner.so
exec /opt/amazon/openmpi/bin/mpirun \
--allow-run-as-root \
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are using the AWS EFA Installer, there is no real reason to set LD_LIBRARY_PATH both for the ofi-nccl and the libfabric as well (/opt/amazon/efa/lib) as those are already set in /etc/ld.so.conf.d.
For choosing the tuner itself, you could just set NCCL_TUNER_PLUGIN=ofi (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-tuner-plugin)
ls -ll /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/
total 332
-rw-r--r--. 1 root root 339744 Aug 12 23:37 libnccl-net-ofi.so
lrwxrwxrwx. 1 root root 18 Aug 12 23:37 libnccl-net.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root 18 Aug 12 23:37 libnccl-ofi-tuner.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root 18 Aug 12 23:37 libnccl-tuner-ofi.so -> libnccl-net-ofi.so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there is no reason to add $LD_LIBRARY_PATH, this never gets interpolated as you assume it does.
It's better to make sure the Dockerfile have the custom paths for libs we want to use as defaults in /etc/ld.so.conf.d as configuration files (i will create a PR for this, as well as adding back the ability to use a custom build aws-ofi-nccl for testing purposes)
|
@bluecrayon52 can we close out on comments and merge? |
Issue #, if available:
Description of changes:
Updated OFI NCCL plugin path to
/opt/amazon/ofi-nccl/lib/x86_64-linux-gnuUpdated CUDA toolkit path to
/usr/local/cuda/lib64Updated OFI NCCL tuner plugin path to
/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.soBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.