Add NCCL support to DistributedArray #130
Hi @tharittk,
Great start!
I have started reviewing your PR and I am providing here a number of general comments; I will then make more specific ones in the code itself (in progress...):
- One of the successful decisions that we made in PyLops since the beginning was to limit the number of mandatory dependencies to the minimum (basically numpy and scipy) and allow a lot of optional dependencies which a user can choose to install or not, whilst still being able to use the library. Similarly, for `pylops-mpi` we extended the mandatory dependencies to `mpi4py` for obvious reasons, but no more than that. In our current implementation of `DistributedArray` we support a double backend via the `engine` keyword (numpy and cupy); however, we never import `cupy` directly - this is possible because we use the methods `get_module`, `get_array_module`, `get_module_name` in the module `pylops.utils.backend` - have a look at them to understand what they do (see also the small sketch after this list). I will also provide more detailed comments in the code.
- Whilst we now want to provide an NCCL backend using `cupy`, I still believe that we should try not to make `cupy` a mandatory dependency, also because there are various flavors of cupy installations based on the CUDA drivers, so we cannot just add one to the dependencies in the `pyproject.toml` file. Based on the above, I suggest adding the following to the `pyproject.toml` file:
```toml
[project.optional-dependencies]
cupy11 = [
    "cupy-cuda11x",
]
cupy12 = [
    "cupy-cuda12x",
]
```
to be tested. We should also add `nccl`, but I see that there are a lot of nccl packages on PyPI, so we need to check which one is the correct one.
- I very much like your detailed description of the setup in the PR. I suggest you have a go at updating `README.md` and `docs/source/installation.rst`. Here you could do like in the current files: add new targets to the `Makefile` with your installation instructions and refer to them - in #132.
- Same for the tests: you can add some text in the `docs/source/installation.rst` file (both the `mpirun` and `srun` ones, as these are valid for any SLURM cluster and not really NCSA Delta specific) - in #132.
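As a quick reference for the backend-selection pattern mentioned in the first bullet, here is a minimal sketch of how the `pylops.utils.backend` helpers keep code backend-agnostic without importing cupy at module level (illustrative only):

```python
import numpy as np
from pylops.utils.backend import get_array_module, get_module, get_module_name

x = np.arange(4.0)
ncp = get_array_module(x)      # numpy here, cupy if x were a cupy array
print(get_module_name(ncp))    # -> "numpy"

# select the module by engine name, as DistributedArray does via its engine keyword
ncp = get_module("numpy")      # or get_module("cupy") when cupy is installed
y = ncp.ones_like(x)
```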
pylops_mpi/utils/deps.py (outdated)

```python
# error message at import of available package
def nccl_import(message: Optional[str] = None) -> str:
    nccl_test = (
        # TODO: setting of OS env should go to READme somewhere
```
This is explained here in the documentation of PyLops: https://pylops.readthedocs.io/en/stable/gpu.html
We could add a similar page to the PyLops-MPI documentation and in the README simply refer to that page for details on how to use NCCL.
To be done in https://github.com/PyLops/pylops-mpi/pull/132/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52
@hongyx11 for sure there are interesting scientific problems where the data and/or model is complex valued. For now I think we can leave this out, but I think it would make a nice PR for @tharittk later on in the GSoC project to handle this: probably the easiest will be to pack/unpack complex arrays into real/imag concatenations… I think we may be able to do this with a decorator that we add to all methods that do communications 😀
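Just to sketch the decorator idea (a minimal illustration under the assumption that the array is the first argument of the communication method; the name and details are hypothetical):

```python
from functools import wraps
import numpy as np

def pack_complex(func):
    """Pack a complex array into a real/imag concatenation before the
    wrapped communication call and rebuild the complex array afterwards."""
    @wraps(func)
    def wrapper(self, arr, *args, **kwargs):
        if not np.iscomplexobj(arr):
            return func(self, arr, *args, **kwargs)
        packed = np.concatenate([arr.real, arr.imag])
        out = np.asarray(func(self, packed, *args, **kwargs))
        half = out.shape[0] // 2
        return out[:half] + 1j * out[half:]
    return wrapper
```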
@tharittk very good work!
You will see that I left a lot of comments, but most are very minor and intended to help you make your new code as consistent as possible with the various PyLops conventions in terms of docstrings, code patterns, etc.
My only main comment remains with regard to the many changes you made in the code from `base_comm` to hard-coded `MPI.COMM_WORLD`. @hongyx11 do you see a scenario where this may be dangerous? I was reading a bit and it seems `MPI.COMM_WORLD` is almost always used as the primary base communicator, but there are some other options like using `MPI_Comm_create`? With the current changes, if that were to happen we would still get all info from `MPI.COMM_WORLD`. Another idea that I had could be to add a new optional input parameter called `base_comm_nccl` (defaulting to None) and check if this is not None; in that case this would take priority. However, since we have `base_comm`, we could keep using it throughout the code where you now changed to `MPI.COMM_WORLD`.
I think after you have addressed all of these comments, we may be ready for your first PR to be merged 🚀
@tharittk I also triggered the CI and as you can see the actions fail. But I think I have identified the problem 😄 The way the availability check is currently written, the nccl import is attempted even when cupy is not installed.
If instead you check for cupy first, it will work, as it will stop already at the first check and, if cupy isn't present, will return False... try to change it and we will see if this is the only problem or if more will arise (from now on the CI will be automatically triggered at any push you do).
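For illustration, a hedged sketch of the ordering being suggested (not the actual code in `pylops_mpi/utils/deps.py`):

```python
import importlib.util

def nccl_enabled() -> bool:
    # check for cupy first and stop right away if it is missing,
    # so that the nccl import below is never even attempted
    if importlib.util.find_spec("cupy") is None:
        return False
    try:
        from cupy.cuda import nccl  # noqa: F401
        return True
    except ImportError:
        return False
```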
Overall, it looks great!
@tharittk, some general comments for you to address, but it's a great start.
@mrava87, regarding the case of avoiding `MPI.COMM_WORLD`, I think it might be better to introduce `base_comm_nccl`, and we consistently use `self.base_comm` everywhere instead of hardcoding `MPI.COMM_WORLD`.
We can do something like this for now:

```python
# Check and override base_comm with MPI.COMM_WORLD for now
if base_comm_nccl is not None and isinstance(base_comm_nccl, NCCLCommunicator):
    self._base_comm = MPI.COMM_WORLD
    self._base_comm_nccl = base_comm_nccl
```
Thanks @rohanbabbar04 for taking the time to review this first PR of @tharittk! I agree with your judgment on the `MPI.COMM_WORLD` usage: let's try to go for the option of having two `base_comm` inputs and use NCCL if both are set… @tharittk try to have a go at it and we can discuss all together to make sure it's done in the best way to ensure backward compatibility also in edge scenarios 😀
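A rough sketch of this direction, just to fix ideas (parameter names and the helper property are assumptions, not the merged implementation):

```python
from mpi4py import MPI

class DistributedArray:
    def __init__(self, global_shape, base_comm=MPI.COMM_WORLD,
                 base_comm_nccl=None, engine="numpy"):
        # always keep an MPI communicator around (ranks, sizes, setup logic)
        self._base_comm = base_comm
        # remember the NCCL communicator when one is provided
        self._base_comm_nccl = base_comm_nccl

    @property
    def _collective_comm(self):
        # prefer NCCL for collective calls whenever it has been set
        return self._base_comm_nccl if self._base_comm_nccl is not None else self._base_comm
```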
@tharittk this looks great to me, and I think you have addressed all our suggestions!
@rohanbabbar04 just waiting to merge in case you want to take another look, otherwise good to go for me :)
You can replace this line with:

```python
from pylops_mpi import DistributedArray
```

It should work, and then you can uncomment the dottest.
I added the support for NCCL collective calls to `DistributedArray.py`. These calls include `allReduce`, `allGather`, and `bcast`. The user passes the `cupy.cuda.nccl.NcclCommunicator` object instead of `MPI.COMM_WORLD` to the `DistributedArray` constructor to use NCCL.

Setup
Fork from the main branch and install cupy with nccl in the environment:

```sh
$ conda env create -f environment-dev.yml
$ conda activate pylops_mpi
$ conda install -c conda-forge cupy cudnn cutensor nccl
$ pip install -e .
```
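Putting this together, usage as described above would look roughly like the following (a hedged sketch: the NCCL bootstrap via an MPI broadcast of the unique id and the exact constructor arguments are assumptions):

```python
from mpi4py import MPI
import cupy as cp
from cupy.cuda import nccl
from pylops_mpi import DistributedArray

rank, size = MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# rank 0 creates the NCCL unique id and MPI broadcasts it to all ranks
uid = nccl.get_unique_id() if rank == 0 else None
uid = MPI.COMM_WORLD.bcast(uid, root=0)
nccl_comm = nccl.NcclCommunicator(size, uid, rank)

# pass the NCCL communicator instead of MPI.COMM_WORLD (with the cupy engine)
x = DistributedArray(global_shape=10 * size, base_comm=nccl_comm, engine="cupy")
```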
Changes Made

DistributedArray
- `base_comm` now takes either `mpi4py.MPI.Comm` or `NcclCommunicator`.
- `sub_comm` assignment is carried through `subcomm_split()`. cupy nccl doesn't provide a split API like its C++ counterpart, so this function implements the split manually.
- add `_allgather` - MPI provides `allgather` (returns a value) and `Allgather` (requires a recv buffer), while cupy nccl only provides `allGather`, which requires a recv buffer. This function provides a unified API for both cases of `base_comm`. Now the call is `self._allgather()` instead of `self.base_comm.allgather`, which only worked with MPI (see the sketch after this list).
- add NCCL support to `_allreduce` and `_allreduce_subcomm`.
- `bcast` in `__setitem__` can also use the `NcclCommunicator` counterpart.
- Explicitly use `MPI.COMM_WORLD.Get_size()` and `MPI.COMM_WORLD.Get_rank()` in the `rank()` and `size()` getters instead of `base_comm`. This is because NCCL recommends launching with one MPI process per GPU -- violating this may cause a deadlock. This means that in either MPI or NCCL, the global rank is managed through MPI. Plus, cupy nccl does not provide a rank getter API.
- `dot()` and `_compute_vector_norm()` now use `ncp = get_module(self.engine)` to reflect that computation may now be done in GPU memory (cupy array).
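For reference, a simplified sketch of what such a unified `_allgather` could look like (illustrative only, hard-coded to float64 and not the code in this PR):

```python
import cupy as cp
from cupy.cuda import nccl
from mpi4py import MPI

def _allgather(base_comm, send_buf):
    """Gather send_buf from all ranks with either an MPI or a NCCL base_comm."""
    if isinstance(base_comm, nccl.NcclCommunicator):
        send_buf = cp.ascontiguousarray(cp.asarray(send_buf, dtype=cp.float64))
        recv_buf = cp.empty(send_buf.size * MPI.COMM_WORLD.Get_size(),
                            dtype=cp.float64)
        # NCCL only exposes the buffered variant: allGather(send, recv, count, dtype, stream)
        base_comm.allGather(send_buf.data.ptr, recv_buf.data.ptr, send_buf.size,
                            nccl.NCCL_FLOAT64, cp.cuda.Stream.null.ptr)
        return recv_buf
    # the MPI path keeps the lowercase allgather, which returns the gathered values
    return base_comm.allgather(send_buf)
```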
test_distributedarray_nccl.py
- Mirrors `test_distributedarray.py`, but now tests with `base_comm` being `cupy.cuda.nccl.NcclCommunicator`.
- `dtype=complex` is not supported by the NCCL calls for now, so some test cases involving complex numbers are removed.
- In `assert_allclose`, `.get()` is called because under NCCL the array lives in GPU memory (and is not implicitly copied to the CPU).

Testing
The test is done in two scenarios: on NCSA Delta A40x40

```sh
srun -N 1 -n 3 --gpus-per-node=3 pytest --with-mpi
```

for 3 GPUs, or on a local workstation with 3 GPUs

```sh
mpirun -n 3 pytest --with-mpi
```

and, on Delta, also across 2 nodes (6 GPUs):

```sh
srun -N 2 -n 6 --gpus-per-node=3 pytest --with-mpi
```