Initial nccl implementation -- experimental #1
Initial NCCL implementation using only the default stream. It contains some temporary fixes to get the code to compile and some temporary experiments with MPI performance; both will need cleanup later. I suggest merging into a separate nccl branch for now.
Requires linking additional libraries:
-lnvToolsExt -lnccl
Some extra context for commits:
01a6df2, e77d120 -- very temporary fixes for compile-time and run-time errors outside of the dgemms; these need looking into
5c4ce5b -- Bug fix: source memory was not being allocated. When I wrote the commit I assumed memory for sources was allocated every iteration. That is not the case, but we may still want to allocate a pool of memory for atrip early in program execution, particularly to address the following commit.
5c4ce5b -- This allocation took about as long as the dgemm for No=50. Need to check whether it still takes non-negligible time for larger sizes.
46c56b9 -- removes a host-device transfer that is not needed in the GPU source version
e95ca45 -- this 'warm up' was for experimentation only; it was used to test point-to-point handle creation at the start of the app
7878a14 -- switches point-to-point comms to use NCCL
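
For the memory-pool idea raised in the 5c4ce5b notes, here is a minimal sketch of what reusing buffers could look like. It is not code from this PR: `BufferPool`, `acquire`, and `release` are hypothetical names, and the defaults use host `malloc`/`free` as a stand-in so the sketch runs without a GPU. A GPU build would presumably pass wrappers around its device allocator instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper, not part of atrip: caches buffers by size so that
// repeated requests for the same size skip the underlying allocator.
class BufferPool {
public:
  using Alloc = std::function<void*(std::size_t)>;
  using Free = std::function<void(void*)>;

  // Assumption: host malloc/free stand in for the device allocator here.
  explicit BufferPool(
      Alloc alloc = [](std::size_t n) { return std::malloc(n); },
      Free free_fn = [](void* p) { std::free(p); })
      : alloc_(std::move(alloc)), free_(std::move(free_fn)) {}

  BufferPool(const BufferPool&) = delete;
  BufferPool& operator=(const BufferPool&) = delete;

  // Frees every buffer the pool ever handed out, idle or still in use.
  ~BufferPool() {
    for (auto& entry : idle_)
      for (void* p : entry.second) free_(p);
    for (auto& entry : in_use_) free_(entry.first);
  }

  // Returns an idle buffer of exactly `bytes` if one exists,
  // otherwise calls the underlying allocator once.
  void* acquire(std::size_t bytes) {
    auto it = idle_.find(bytes);
    if (it != idle_.end() && !it->second.empty()) {
      void* p = it->second.back();
      it->second.pop_back();
      in_use_[p] = bytes;
      return p;
    }
    void* p = alloc_(bytes);
    assert(p != nullptr);
    ++allocations_;
    in_use_[p] = bytes;
    return p;
  }

  // Hands a buffer back to the pool for later reuse; it is not freed.
  void release(void* p) {
    auto it = in_use_.find(p);
    assert(it != in_use_.end());
    idle_[it->second].push_back(p);
    in_use_.erase(it);
  }

  // Number of times the underlying allocator was actually called.
  std::size_t allocations() const { return allocations_; }

private:
  Alloc alloc_;
  Free free_;
  std::map<std::size_t, std::vector<void*>> idle_;
  std::map<void*, std::size_t> in_use_;
  std::size_t allocations_ = 0;
};
```

With this shape, a buffer acquired, released, and acquired again at the same size would hit the allocator only once, which is the cost the No=50 measurement points at. Whether an exact-size cache is the right policy for atrip's buffer sizes is an open question.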