
Initial nccl implementation -- experimental #1

Open · aniabrown-nvidia wants to merge 8 commits into cuda
Conversation

aniabrown-nvidia (Collaborator) commented:

Initial NCCL implementation using only the default stream. It contains some temporary fixes to get the code to compile and some temporary experimentation with MPI performance; these will need cleanup later. I suggest merging into a separate nccl branch for now.

Requires linking additional libraries: -lnvToolsExt -lnccl
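For orientation, here is a minimal, hedged sketch (not the code in this PR) of the general pattern described above: bootstrapping an NCCL communicator over MPI and issuing a collective on the default stream, with an nvToolsExt range to show where `-lnvToolsExt` comes in. All names and sizes are illustrative.

```cpp
// Illustrative sketch only. Build line (example):
//   mpicxx nccl_sketch.cpp -lcudart -lnccl -lnvToolsExt
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <nvToolsExt.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  cudaSetDevice(rank % 4);  // assumes up to 4 GPUs per node; adjust to the machine

  // Rank 0 creates the NCCL unique id; every rank receives it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  const size_t n = 1 << 20;  // arbitrary size for the example
  double *buf;
  cudaMalloc(&buf, n * sizeof(double));

  nvtxRangePushA("allreduce");  // nvToolsExt range, visible in Nsight timelines
  ncclAllReduce(buf, buf, n, ncclDouble, ncclSum, comm, 0);  // stream 0 = default
  cudaStreamSynchronize(0);     // NCCL calls are asynchronous w.r.t. the host
  nvtxRangePop();

  cudaFree(buf);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```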

Some extra context for commits:
01a6df2, e77d120 -- very temporary fixes to compile-time and run-time errors outside of the dgemms; needs looking into
5c4ce5b -- bug fix: source memory was not being allocated. At the time of writing the commit I had assumed memory for the sources was allocated every iteration. That is not the case, but we may still want to allocate a pool of memory for atrip early in program execution, particularly to address the following point (see the pooled-allocation sketch after this list).
5c4ce5b -- this alloc was taking a similar amount of time as the dgemm for No=50. Need to check whether it still takes non-negligible time at larger sizes.
46c56b9 -- removes a host-device transfer that is not needed in the GPU source version
e95ca45 -- this 'warm up' was for experimentation only; it was used to test point-to-point handle creation at the start of the app
7878a14 -- switches point-to-point comms to use NCCL (see the sketch below)
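On the 5c4ce5b allocation cost: a pool allocated once at startup, as suggested above, could look roughly like the following. This is a hypothetical pattern, not atrip's code; `source_pool` and the function names are made up.

```cpp
#include <cuda_runtime.h>

// Hypothetical pooled allocation for the source buffers. Instead of paying
// a cudaMalloc per iteration (which was comparable to the dgemm at No=50),
// allocate once early in program execution and reuse the buffer.
static double *source_pool = nullptr;

void init_source_pool(size_t count) {   // call once at startup
  cudaMalloc(&source_pool, count * sizeof(double));
}

double *get_source_buffer() {           // call every iteration
  return source_pool;  // no per-iteration allocation
}

void free_source_pool() {               // call once at shutdown
  cudaFree(source_pool);
  source_pool = nullptr;
}
```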
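And for 7878a14, the NCCL point-to-point API (available since NCCL 2.7) typically replaces an MPI send/recv pair along these lines. Again a sketch under assumptions: a ring exchange on the default stream, with `comm`, `rank`, and `nranks` set up as in the first snippet.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch of an MPI_Sendrecv-style point-to-point exchange done with NCCL.
void ring_exchange(const double *sendbuf, double *recvbuf, size_t count,
                   int rank, int nranks, ncclComm_t comm) {
  int next = (rank + 1) % nranks;
  int prev = (rank - 1 + nranks) % nranks;

  // Grouping lets the send and recv progress together, avoiding deadlock;
  // this is the NCCL analogue of MPI_Sendrecv.
  ncclGroupStart();
  ncclSend(sendbuf, count, ncclDouble, next, comm, 0);  // default stream
  ncclRecv(recvbuf, count, ncclDouble, prev, comm, 0);
  ncclGroupEnd();
  cudaStreamSynchronize(0);  // the transfers are async w.r.t. the host
}
```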
