Add NCCL support to DistributedArray #130
Hi @tharittk,
Great start!
I have started reviewing your PR and I am providing here a number of general comments; I will then make more specific ones in the code itself (in progress...):
- One of the successful decisions that we made in PyLops since the beginning was to limit the number of mandatory dependencies to the minimum (basically numpy and scipy) and allow a lot of optional dependencies which a user can choose to install or not, whilst still being able to use the library. Similarly, for `pylops-mpi` we extended the mandatory dependencies to `mpi4py` for obvious reasons, but no more than that. In our current implementation of `DistributedArray` we support a double backend via the `engine` keyword (numpy and cupy); however, we never import `cupy` directly - this is possible because we use the methods `get_module`, `get_array_module`, `get_module_name` in the module `pylops.utils.backend` - have a look at them to understand what they do (see also the small sketch after this list). I will also provide more detailed comments in the code.
- Whilst we now want to provide an NCCL backend using `cupy`, I still believe that we should try not to make `cupy` a mandatory dependency, also because there are various flavors of cupy installations based on the CUDA drivers, so we cannot just add one to the dependencies in the `pyproject.toml` file. Based on the above, I suggest adding the following to the `pyproject.toml` file:
```toml
[project.optional-dependencies]
cupy11 = [
    "cupy-cuda11x",
]
cupy12 = [
    "cupy-cuda12x",
]
```
to be tested. We should also add `nccl`, but I see that there are a lot of nccl packages on PyPI, so we need to check which one is the correct one.
- I very much like your detailed description of the setup in the PR. I suggest you have a go at updating `README.md` and `docs/source/installation.rst`. Here you could do like in the current files: add new targets to the `Makefile` with your installation instructions and refer to them - in #132.
- Same for the tests: you can add some text in the `docs/source/installation.rst` file (both the `mpirun` and `srun` ones, as these are valid for any SLURM cluster and not really NCSA Delta specific) - in #132.
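As a quick reference for the backend-selection pattern mentioned in the first bullet, here is a minimal sketch of how the `pylops.utils.backend` helpers keep code backend-agnostic without importing cupy at module level (illustrative only):

```python
import numpy as np
from pylops.utils.backend import get_array_module, get_module, get_module_name

x = np.arange(4.0)
ncp = get_array_module(x)      # numpy here, cupy if x were a cupy array
print(get_module_name(ncp))    # -> "numpy"

# select the module by engine name, as DistributedArray does via its engine keyword
ncp = get_module("numpy")      # or get_module("cupy") when cupy is installed
y = ncp.ones_like(x)
```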
pylops_mpi/utils/deps.py (outdated)

```python
# error message at import of available package
def nccl_import(message: Optional[str] = None) -> str:
    nccl_test = (
        # TODO: setting of OS env should go to READme somewhere
```
This is explained here in the documentation of PyLops: https://pylops.readthedocs.io/en/stable/gpu.html
We could add a similar page to the PyLops-MPI documentation and in the README simply refer to that page for details on how to use NCCL.
To be done in https://github.com/PyLops/pylops-mpi/pull/132/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52
@hongyx11 for sure there are interesting scientific problems where the data and/or model is complex valued. For now I think we can leave this out, but I think it would make a nice PR for @tharittk later on in the GSoC project to handle this: probably the easiest will be to pack/unpack complex arrays into real/imag concatenations… I think we may be able to do this with a decorator that we add to all methods that do communications 😀
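Just to sketch the decorator idea (a minimal illustration under the assumption that the array is the first argument of the communication method; the name and details are hypothetical):

```python
from functools import wraps
import numpy as np

def pack_complex(func):
    """Pack a complex array into a real/imag concatenation before the
    wrapped communication call and rebuild the complex array afterwards."""
    @wraps(func)
    def wrapper(self, arr, *args, **kwargs):
        if not np.iscomplexobj(arr):
            return func(self, arr, *args, **kwargs)
        packed = np.concatenate([arr.real, arr.imag])
        out = np.asarray(func(self, packed, *args, **kwargs))
        half = out.shape[0] // 2
        return out[:half] + 1j * out[half:]
    return wrapper
```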
@tharittk very good work!
You will see that I left a lot of comments, but most are very minor and intended to help you make your new code as consistent as possible with the various PyLops conventions in terms of docstrings, code patterns, etc.
My only main comment remains with regard to the many changes you made in the code from `base_comm` to hard-coded `MPI.COMM_WORLD`. @hongyx11 do you see a scenario where this may be dangerous? I was reading a bit and it seems `MPI.COMM_WORLD` is almost always used as the primary base communicator, but there are some other options like using `MPI_Comm_create`? With the current changes, if that were to happen we would still get all info from `MPI.COMM_WORLD`. Another idea that I had could be to add a new optional input parameter called `base_comm_nccl` (defaulting to None) and check if this is not None; in that case this would take priority. However, since we have `base_comm`, we could keep using it throughout the code where you now changed to `MPI.COMM_WORLD`.
I think after you have addressed all of these comments, we may be ready for your first PR to be merged 🚀
@tharittk I also triggered the CI and as you can see the actions fail. But I think I have identified the problem 😄 The way the availability check is currently written, the nccl import is attempted even when cupy is not installed.
If instead you check for cupy first, it will work, as it will stop already at the first check and, if cupy isn't present, will return False... try to change it and we will see if this is the only problem or if more will arise (from now on the CI will be automatically triggered at any push you do).
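For illustration, a hedged sketch of the ordering being suggested (not the actual code in `pylops_mpi/utils/deps.py`):

```python
import importlib.util

def nccl_enabled() -> bool:
    # check for cupy first and stop right away if it is missing,
    # so that the nccl import below is never even attempted
    if importlib.util.find_spec("cupy") is None:
        return False
    try:
        from cupy.cuda import nccl  # noqa: F401
        return True
    except ImportError:
        return False
```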
Overall, it looks great!
@tharittk, some general comments for you to address, but it's a great start.
@mrava87, regarding the case of avoiding `MPI.COMM_WORLD`, I think it might be better to introduce `base_comm_nccl`, and we consistently use `self.base_comm` everywhere instead of hardcoding `MPI.COMM_WORLD`.
We can do something like this for now:

```python
# Check and override base_comm with MPI.COMM_WORLD for now
if base_comm_nccl is not None and isinstance(base_comm_nccl, NCCLCommunicator):
    self._base_comm = MPI.COMM_WORLD
    self._base_comm_nccl = base_comm_nccl
```
Thanks @rohanbabbar04 for taking the time to review this first PR of @tharittk! I agree with your judgment on the `MPI.COMM_WORLD` usage: let's try to go for the option of having two `base_comm` inputs and use NCCL if both are set… @tharittk try to have a go at it and we can discuss all together to make sure it's done in the best way to ensure backward compatibility also in edge scenarios 😀
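A rough sketch of this direction, just to fix ideas (parameter names and the helper property are assumptions, not the merged implementation):

```python
from mpi4py import MPI

class DistributedArray:
    def __init__(self, global_shape, base_comm=MPI.COMM_WORLD,
                 base_comm_nccl=None, engine="numpy"):
        # always keep an MPI communicator around (ranks, sizes, setup logic)
        self._base_comm = base_comm
        # remember the NCCL communicator when one is provided
        self._base_comm_nccl = base_comm_nccl

    @property
    def _collective_comm(self):
        # prefer NCCL for collective calls whenever it has been set
        return self._base_comm_nccl if self._base_comm_nccl is not None else self._base_comm
```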
@tharittk this looks great to me, and I think you have addressed all our suggestions!
@rohanbabbar04 just waiting to merge in case you want to take another look, otherwise good to go for me :)
You can replace this line with:

```python
from pylops_mpi import DistributedArray
```

It should work, and then you can uncomment the dottest.
I added the support for NCCL collective calls to `DistributedArray.py`. These calls include `allReduce`, `allGather`, and `bcast`. The user passes the `cupy.cuda.nccl.NcclCommunicator` object instead of `MPI.COMM_WORLD` to the `DistributedArray` constructor to use NCCL.

Setup
Fork from the main branch and install cupy with nccl in the environment:

```sh
$ conda env create -f environment-dev.yml
$ conda activate pylops_mpi
$ conda install -c conda-forge cupy cudnn cutensor nccl
$ pip install -e .
```
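Putting this together, usage as described above would look roughly like the following (a hedged sketch: the NCCL bootstrap via an MPI broadcast of the unique id and the exact constructor arguments are assumptions):

```python
from mpi4py import MPI
import cupy as cp
from cupy.cuda import nccl
from pylops_mpi import DistributedArray

rank, size = MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# rank 0 creates the NCCL unique id and MPI broadcasts it to all ranks
uid = nccl.get_unique_id() if rank == 0 else None
uid = MPI.COMM_WORLD.bcast(uid, root=0)
nccl_comm = nccl.NcclCommunicator(size, uid, rank)

# pass the NCCL communicator instead of MPI.COMM_WORLD (with the cupy engine)
x = DistributedArray(global_shape=10 * size, base_comm=nccl_comm, engine="cupy")
```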
Changes Made

DistributedArray
- `base_comm` now takes either `mpi4py.MPI.Comm` or `NcclCommunicator`.
- `sub_comm` assignment is carried through `subcomm_split()`. cupy nccl doesn't provide a split API like its C++ counterpart, so this function implements the split manually.
- add `_allgather` - MPI provides `allgather` (returns a value) and `Allgather` (requires a recv buffer), while cupy nccl only provides `allGather`, which requires a recv buffer. This function provides a unified API for both cases of `base_comm`. Now the call is `self._allgather()` instead of `self.base_comm.allgather`, which only worked with MPI (see the sketch after this list).
- add NCCL support to `_allreduce` and `_allreduce_subcomm`.
- `bcast` in `__setitem__` can also use the `NcclCommunicator` counterpart.
- Explicitly use `MPI.COMM_WORLD.Get_size()` and `MPI.COMM_WORLD.Get_rank()` in the `rank()` and `size()` getters instead of `base_comm`. This is because NCCL recommends launching with one MPI process per GPU -- violating this may cause a deadlock. This means that in either MPI or NCCL, the global rank is managed through MPI. Plus, cupy nccl does not provide a rank getter API.
- `dot()` and `_compute_vector_norm()` now use `ncp = get_module(self.engine)` to reflect that computation may now be done in GPU memory (cupy array).
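For reference, a simplified sketch of what such a unified `_allgather` could look like (illustrative only, hard-coded to float64 and not the code in this PR):

```python
import cupy as cp
from cupy.cuda import nccl
from mpi4py import MPI

def _allgather(base_comm, send_buf):
    """Gather send_buf from all ranks with either an MPI or a NCCL base_comm."""
    if isinstance(base_comm, nccl.NcclCommunicator):
        send_buf = cp.ascontiguousarray(cp.asarray(send_buf, dtype=cp.float64))
        recv_buf = cp.empty(send_buf.size * MPI.COMM_WORLD.Get_size(),
                            dtype=cp.float64)
        # NCCL only exposes the buffered variant: allGather(send, recv, count, dtype, stream)
        base_comm.allGather(send_buf.data.ptr, recv_buf.data.ptr, send_buf.size,
                            nccl.NCCL_FLOAT64, cp.cuda.Stream.null.ptr)
        return recv_buf
    # the MPI path keeps the lowercase allgather, which returns the gathered values
    return base_comm.allgather(send_buf)
```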
test_distributedarray_nccl.py
- Mirrors `test_distributedarray.py`, but now tests with `base_comm` being `cupy.cuda.nccl.NcclCommunicator`.
- `dtype=complex` is not supported by the NCCL calls for now, so some test cases involving complex numbers are removed.
- In `assert_allclose`, `.get()` is called because under NCCL the array lives in GPU memory (and is not implicitly copied to the CPU).

Testing
The test is done in two scenarios: on NCSA Delta A40x40

```sh
srun -N 1 -n 3 --gpus-per-node=3 pytest --with-mpi
```

for 3 GPUs, or on a local workstation with 3 GPUs

```sh
mpirun -n 3 pytest --with-mpi
```

and, on Delta, also across 2 nodes (6 GPUs):

```sh
srun -N 2 -n 6 --gpus-per-node=3 pytest --with-mpi
```