@nghtm nghtm commented Sep 5, 2025

Issue #, if available:

Description of changes: Remove the OFI-NCCL installation step, since OFI-NCCL comes precompiled with the EFA installer. Update NCCL to the latest version compatible with the OFI-NCCL release installed in the pcluster AMI as of 09/03/2025 (v1.14.2, https://github.com/aws/aws-ofi-nccl/releases/tag/v1.14.2).

This reference branch will be used for the AI/ML pcluster workshop, which specifies pcluster 3.13 and does not require installing OFI-NCCL.

https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US/03-cluster/02-setup-cluster

$ dpkg -l | grep nccl
ii  libnccl-ofi:amd64                          1.14.2-1                                amd64        NCCL libfabric plugin
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Confirming successful NCCL tests on the AMI and in a container using these post-install scripts.

Test on 2x g5.8xlarge (4x A10 per node) below:
nccl-all-reduce-val.log


@nghtm (Author) commented Sep 5, 2025

Unable to create a fork on /aws-samples due to a permissions error:

git push upstream ref/workshop

remote: Permission to aws-samples/aws-parallelcluster-post-install-scripts.git denied to nghtm.
fatal: unable to access 'https://github.com/aws-samples/aws-parallelcluster-post-install-scripts.git/': The requested URL returned error: 403

@KeitaW (Contributor) left a comment

Shouldn't we maintain a way to install a custom aws-ofi-nccl? Ideally, we would skip installation if the specified aws-ofi-nccl version is already preinstalled, and compile it otherwise.
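
A minimal sketch of the skip-or-compile check suggested here, under stated assumptions: the function names `installed_ofi_nccl_version` and `ofi_nccl_action` and the variable `AWS_OFI_NCCL_VERSION` are hypothetical and not part of the existing post-install scripts. It relies on the `libnccl-ofi` package name shown in the `dpkg -l` output in the PR description, and assumes that dropping the Debian revision suffix (`-1`) yields the upstream release number.

```shell
# Hypothetical helpers; not taken from the existing post-install scripts.

# Print the installed libnccl-ofi package version, or nothing if the
# package (or dpkg-query itself) is not present.
installed_ofi_nccl_version() {
    command -v dpkg-query >/dev/null 2>&1 || return 0
    dpkg-query -W -f='${Version}' libnccl-ofi 2>/dev/null || true
}

# Print "skip" when the wanted release matches the installed package,
# otherwise print "build".
#   $1: wanted aws-ofi-nccl release, e.g. 1.14.2 or v1.14.2
#   $2: installed package version, e.g. 1.14.2-1 (may be empty)
ofi_nccl_action() {
    _ofi_wanted="${1#v}"        # tolerate a leading "v" as in release tags
    _ofi_installed="${2%%-*}"   # drop the Debian revision suffix ("-1")
    if [ -n "$_ofi_installed" ] && [ "$_ofi_installed" = "$_ofi_wanted" ]; then
        echo skip
    else
        echo build
    fi
}

# Example use inside a post-install script (variable name is an assumption):
#   action=$(ofi_nccl_action "$AWS_OFI_NCCL_VERSION" "$(installed_ofi_nccl_version)")
#   if [ "$action" = "build" ]; then
#       ... existing compile-from-source steps ...
#   fi
```

Keeping the comparison separate from the `dpkg-query` call means the decision logic can be exercised without a Debian-based host. Exact string matching is deliberately strict: any mismatch, including a newer preinstalled version, falls back to compiling the requested release.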
