Add NCCL support to DistributedArray #130


Merged: 8 commits merged into PyLops:main on May 29, 2025

Conversation

tharittk (Collaborator)

I added support for NCCL collective calls to DistributedArray.py. These calls include allReduce, allGather, and bcast. To use NCCL, the user passes a cupy.cuda.nccl.NcclCommunicator object instead of MPI.COMM_WORLD to the DistributedArray constructor.
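
A minimal sketch of this usage (the device selection and the construction of the NCCL communicator are illustrative, not part of this PR's diff):

from mpi4py import MPI
import cupy as cp
import cupy.cuda.nccl as nccl
import pylops_mpi

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
# one MPI process per GPU, as NCCL recommends
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# an NCCL communicator is built from a unique id created on rank 0
# and shared with every rank through MPI
nccl_id = nccl.get_unique_id() if rank == 0 else None
nccl_id = comm.bcast(nccl_id, root=0)
nccl_comm = nccl.NcclCommunicator(size, nccl_id, rank)

# pass the NCCL communicator instead of MPI.COMM_WORLD and use the cupy engine
x = pylops_mpi.DistributedArray(global_shape=10 * size,
                                base_comm=nccl_comm, engine="cupy")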

Setup

Fork from the main branch and install CuPy with NCCL into the environment:
$ conda env create -f environment-dev.yml
$ conda activate pylops_mpi
$ conda install -c conda-forge cupy cudnn cutensor nccl
$ pip install -e .

Changes Made

DistributedArray

  • base_comm now takes either mpi4py.MPI.Comm or NcclCommunicator

  • sub_comm assignment is carried out through subcomm_split(). CuPy's NCCL wrapper does not provide a split API like its C++ counterpart, so this function implements the split manually (see the sketches after this list)

  • add _allgather - MPI provides allgather (which returns a value) and Allgather (which requires a receive buffer), while CuPy's NCCL only provides allGather, which requires a receive buffer. This function provides a unified API for both kinds of base_comm: the call is now self._allgather() instead of self.base_comm.allgather, which only worked with MPI (see the sketches after this list)

  • add NCCL support to _allreduce and _allreduce_subcomm

  • bcast in __setitem__ can also use its NcclCommunicator counterpart

  • Explicitly use MPI.COMM_WORLD.Get_size() and MPI.COMM_WORLD.Get_rank() in the rank() and size() getters instead of base_comm. This is because NCCL recommends launching with one MPI process per GPU; violating this may cause a deadlock. This means that in both MPI and NCCL, the global rank is managed through MPI. In addition, CuPy's NCCL wrapper does not provide a rank getter API.

  • dot() and _compute_vector_norm() now use ncp = get_module(self.engine) to reflect that computation may now be done in GPU memory on CuPy arrays
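
A minimal sketch of one way to implement the manual split mentioned above (not necessarily the exact code in this PR; MPI is used only to group ranks and exchange the per-group NCCL unique id, and the function name nccl_subcomm_split is illustrative):

from mpi4py import MPI
from cupy.cuda import nccl

def nccl_subcomm_split(mask):
    """Create a per-group NcclCommunicator from a list `mask` of group colors"""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    # group ranks that share the same color; key keeps the original ordering
    sub_comm = comm.Split(color=mask[rank], key=rank)
    sub_rank, sub_size = sub_comm.Get_rank(), sub_comm.Get_size()
    # rank 0 of each group creates the NCCL unique id and broadcasts it via MPI
    nccl_id = nccl.get_unique_id() if sub_rank == 0 else None
    nccl_id = sub_comm.bcast(nccl_id, root=0)
    return nccl.NcclCommunicator(sub_size, nccl_id, sub_rank)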
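
And a minimal sketch of the unified allgather dispatch described above, written here as a free function rather than a method; dtype handling is simplified to float64 for illustration and the buffer sizes are not the exact merged code:

import cupy as cp
from mpi4py import MPI
from cupy.cuda import nccl

def _allgather(base_comm, send_buf):
    """Gather send_buf from all ranks, whichever communicator is in use"""
    if isinstance(base_comm, nccl.NcclCommunicator):
        # NCCL only offers allGather with an explicit receive buffer, so
        # allocate one on the GPU and return it as a single cupy array
        send_buf = cp.ascontiguousarray(cp.asarray(send_buf, dtype=cp.float64))
        recv_buf = cp.zeros(MPI.COMM_WORLD.Get_size() * send_buf.size,
                            dtype=send_buf.dtype)
        base_comm.allGather(send_buf.data.ptr, recv_buf.data.ptr,
                            send_buf.size, nccl.NCCL_FLOAT64,
                            cp.cuda.Stream.null.ptr)
        return recv_buf
    # MPI path: the lowercase allgather returns a list of per-rank objects
    return base_comm.allgather(send_buf)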

test_distributedarray_nccl.py

  • Almost an exact copy of test_distributedarray.py, but the tests now run with base_comm being a cupy.cuda.nccl.NcclCommunicator
  • NCCL does not support complex dtypes, so test cases involving complex numbers are removed
  • In order to call assert_allclose, .get() is called because under NCCL the array lives in GPU memory (and is not implicitly copied to the CPU); see the sketch after this list
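
A short sketch of the pattern used in the tests (dist_arr and expected are illustrative names, not the actual test fixtures):

from numpy.testing import assert_allclose

# under NCCL the local array is a cupy array on the GPU: copy it back to the
# host with .get() before comparing against a numpy reference
assert_allclose(dist_arr.local_array.get(), expected, rtol=1e-14)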

Testing

The tests were run in two scenarios on NCSA Delta A40x40:

  • 1 node with multiple GPUs (2 and 3 GPUs), with the following commands:
    srun -N 1 -n 3 --gpus-per-node=3 pytest --with-mpi (for 3 GPUs)
    or, on a local workstation with 3 GPUs:
    mpirun -n 3 pytest --with-mpi
  • 2 nodes with multiple GPUs (2 and 3 GPUs per node):
    srun -N 2 -n 6 --gpus-per-node=3 pytest --with-mpi

@mrava87 (Contributor) left a comment

Hi @tharittk,
Great start!

I have started reviewing your PR and I am providing here a number of general comments; I will then make more specific ones in the code itself (in progress...):

  • One of the successful decisions that we made in PyLops since the beginning was to limit the number of mandatory dependencies to the minimum (basically numpy and scipy) and allow a lot of optional dependencies which a user can choose to install or not, whilst still being able to use the library. Similarly, for pylops-mpi we extended the mandatory dependencies to mpi4py for obvious reasons, but no more than that. In our current implementation of DistributedArray we support a double backend via the engine keyword (numpy and cupy), however we never import cupy - this is possible because we use the methods get_module, get_array_module, get_module_name in the module pylops.utils.backend - have a look at them to understand what they do (see the sketch at the end of this comment). I will also provide more detailed comments in the code.
  • Whilst we now want to provide an NCCL backend using cupy, I still believe that we should try not to make cupy a mandatory dependency, also because there are various flavors of cupy installations based on the CUDA drivers, so we cannot just add one to the dependencies in the pyproject.toml file. Based on the above, I suggest adding the following to the pyproject.toml file:
[project.optional-dependencies]
cupy11 = [
    "cupy-cuda11x",
]
cupy12 = [
    "cupy-cuda12x",
]

to be tested. We should also add nccl, but I see that there are a lot of nccl packages on PyPI so we need to check which one is the correct one.

  • I very much like your detailed description of the setup in the PR. I suggest you have a go at updating README.md and docs/source/installation.rst. Here you could do as in the current files: add new targets to the Makefile with your installation instructions and refer to them - in #132
  • Same for the tests: you can add some text in the docs/source/installation.rst file (both the mpirun and srun commands, as these are valid for any SLURM cluster and not really NCSA Delta specific) - in #132
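
A short sketch of how those backend helpers are typically used (assuming the current pylops.utils.backend API):

from pylops.utils.backend import get_array_module, get_module, get_module_name

ncp = get_module("numpy")      # returns numpy, or cupy when "cupy" is requested
x = ncp.ones(10)

ncp = get_array_module(x)      # infer the module from an existing array
print(get_module_name(ncp))    # "numpy" or "cupy"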

@mrava87 assigned hongyx11 and mrava87 and unassigned hongyx11 and mrava87 May 19, 2025
@mrava87 requested a review from hongyx11 May 19, 2025 20:03
@mrava87 added the enhancement label May 19, 2025
# error message at import of available package
def nccl_import(message: Optional[str] = None) -> str:
    nccl_test = (
# TODO: setting of OS env should go to README somewhere
@mrava87 (Contributor) commented May 20, 2025

This is explained here in the documentation of PyLops: https://pylops.readthedocs.io/en/stable/gpu.html

We could add a similar page to the PyLops-MPI documentation and in the README simply say that we refer to the documentation for details on how to use NCCL.

To be done in https://github.com/PyLops/pylops-mpi/pull/132/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52

@hongyx11 (Collaborator)

  • Good news: I was able to run the tests in my local environment. I think it's quite clean and clear.
  • @mrava87, complex support is important to PyLops, right? We might find another way to do it.

@mrava87 (Contributor) commented May 21, 2025

  • Good news: I was able to run the tests in my local environment. I think it's quite clean and clear.
  • @mrava87, complex support is important to PyLops, right? We might find another way to do it.

@hongyx11 for sure, there are interesting scientific problems where the data and/or model is complex-valued. For now I think we can leave this out, but I think it would make a nice PR for @tharittk later on in the GSoC project to handle this: probably the easiest will be to pack/unpack complex arrays into real/imag concatenations… I think we may be able to do this with a decorator that we add to all methods that do communications 😀

@mrava87 mentioned this pull request May 23, 2025
@mrava87 (Contributor) left a comment

@tharittk very good work!

You will see that I left a lot of comments, but most are very minor and intended to help you make your new code as consistent as possible with the various pylops conventions in terms of docstrings, code patterns, etc.

My only main comment remains w.r.t. the many changes you made in the code from base_comm to hard-coded MPI.COMM_WORLD. @hongyx11, do you see a scenario where this may be dangerous? I was reading a bit and it seems MPI.COMM_WORLD is almost always used as the primary base communicator, but there are some other options like using MPI_Comm_create; with the current changes, if that were to happen we would still get all info from MPI.COMM_WORLD. Another idea I had is to add a new optional input parameter called base_comm_nccl = None (defaulting to None) and check if it is not None; in that case it would take priority, but since we have base_comm we could keep using it throughout the code where you have now changed to MPI.COMM_WORLD.

I think after you have addressed all of these comments, we may be ready for your first PR to be merged 🚀

@mrava87 (Contributor) commented May 24, 2025

@tharittk I also triggered the CI and as you can see the actions fail. But I think I have identified the problem 😄

The way util.find_spec works is that it tries to go all the way to nccl assuming cupy is present, but since it is not, it fails:

util.find_spec("cupy.cuda.nccl")

If instead you do

 util.find_spec("cupy") is not None and util.find_spec("cupy.cuda.nccl") is not None

it will work, as it will stop already at the first check and, if cupy isn't present, return False... Try to change it and we will see if this is the only problem or if more arise (from now on, the CI will be automatically triggered at any push you do).
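
Consolidated, the suggested check would look roughly like this (a sketch; the variable name nccl_enabled is illustrative):

from importlib import util

# check cupy first so the nccl submodule lookup is skipped when cupy is absent
nccl_enabled = (
    util.find_spec("cupy") is not None
    and util.find_spec("cupy.cuda.nccl") is not None
)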

@rohanbabbar04 (Collaborator) left a comment

Overall, it looks great!

@tharittk, some general comments for you to address, but it's a great start.

@mrava87, regarding the case of avoiding MPI.COMM_WORLD, I think it might be better to introduce base_comm_nccl, and we consistently use self.base_comm everywhere instead of hardcoding MPI.COMM_WORLD.

We can do something like this for now,

# Check and override base_comm with MPI.COMM_WORLD for now
if base_comm_nccl is not None and isinstance(base_comm_nccl, NcclCommunicator):
    self._base_comm = MPI.COMM_WORLD
    self._base_comm_nccl = base_comm_nccl

@mrava87 (Contributor) commented May 25, 2025

Overall, it looks great!

@tharittk, some general comments for you to address, but it's a great start.

@mrava87, regarding the case of avoiding MPI.COMM_WORLD, I think it might be better to introduce base_comm_nccl, and we consistently use self.base_comm everywhere instead of hardcoding MPI.COMM_WORLD.

We can do something like this for now,

# Check and override base_comm with MPI.COMM_WORLD for now
if base_comm_nccl is not None and isinstance(base_comm_nccl, NcclCommunicator):
    self._base_comm = MPI.COMM_WORLD
    self._base_comm_nccl = base_comm_nccl

Thanks @rohanbabbar04 for taking the time to review this first PR of @tharittk

I agree with your judgment on the MPI.COMM_WORLD usage: let’s try to go for the option of having two base_comm inputs and use NCCL if both are set… @tharittk try to have a go at it and we can discuss all together to make sure it’s done in the best way to ensure backward compatibility also in edge scenarios 😀

@hongyx11 (Collaborator)

The code looks great! @mrava87 @tharittk

@mrava87 (Contributor) left a comment

@tharittk this looks great to me, and I think you have addressed all our suggestions!

@rohanbabbar04 just waiting to merge in case you want to take another look, otherwise good to go for me :)

@mrava87 removed the request for review from hongyx11 May 27, 2025 21:45
@rohanbabbar04 (Collaborator) left a comment

https://github.com/tharittk/pylops-mpi/blob/3dc41fe740b2a608fa909692bab8a8d5987968fa/pylops_mpi/utils/dottest.py#L7

You can replace this line with the following; it should work, and then you can uncomment the dottest:

from pylops_mpi import DistributedArray

@mrava87 merged commit f9f5d13 into PyLops:main May 29, 2025
61 checks passed