Documentation Update for NCCL #132

Merged · 5 commits · Jun 3, 2025
14 changes: 12 additions & 2 deletions Makefile
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
NUM_PROCESSES = 3

.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials

pipcheck:
ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
make pipcheck
$(PIP) install -r requirements-dev.txt && $(PIP) install -e .

dev-install_nccl:
make pipcheck
$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .

install_conda:
conda env create -f environment.yml && conda activate pylops_mpi && pip install .

install_conda_nccl:
conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .

dev-install_conda:
conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .

dev-install_conda_nccl:
conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .

lint:
flake8 pylops_mpi/ tests/ examples/ tutorials/

tests:
mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi

# assuming NUM_PRCESS <= number of gpus available
# assuming NUM_PROCESSES <= number of gpus available
tests_nccl:
mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi

4 changes: 4 additions & 0 deletions README.md
@@ -34,6 +34,10 @@ and running the following command:
```
make install_conda
```
Optionally, if you work in a multi-GPU environment and want to enable NVIDIA's Collective Communication Library (NCCL), install the environment with
```
make install_conda_nccl
```

## Run Pylops-MPI
Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command.
75 changes: 73 additions & 2 deletions docs/source/gpu.rst
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thin
wrappers around PyLops operators, some of the operators in PyLops that lack a GPU implementation cannot be used in PyLops-mpi either when working with
cupy arrays.

Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly-optimized
collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage
proprietary technologies like NVLink that may be available in their infrastructure for fast data communication.

.. note::

Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the ``NCCL`` backend.
However, this is optional as users may opt out of NCCL simply by not passing a ``cupy.cuda.nccl.NcclCommunicator`` to
the :class:`pylops_mpi.DistributedArray`.
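
For instance, here is a minimal sketch of a distributed array that relies purely on MPI, simply because no NCCL communicator is passed (the constructor arguments mirror those of the NCCL example below; nothing else is assumed):

.. code-block:: python

import numpy as np
import cupy as cp

import pylops_mpi

# No ``base_comm_nccl`` is passed: collective calls fall back to MPI
d_dist = pylops_mpi.DistributedArray(global_shape=400,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="cupy", dtype=np.float32)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=np.float32)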

Example
-------
@@ -79,7 +88,69 @@ your GPU:
The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
PyLops-mpi will figure this out!

Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to :class:`pylops_mpi.DistributedArray`
as follows:

.. code-block:: python

import numpy as np
import cupy as cp

import pylops
import pylops_mpi
from pylops_mpi.utils._nccl import initialize_nccl_comm

# Initialize NCCL Communicator
nccl_comm = initialize_nccl_comm()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
base_comm_nccl=nccl_comm,
partition=pylops_mpi.Partition.BROADCAST,
engine="cupy", dtype=dtype)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))

Review comment (Contributor): So far, this is not going to work as we haven't enabled MPIVStack for NCCL, but I think it's fine, we will get there soon anyway.

Review reply (Contributor, Author): Ah yes, NCCL-enabled MPIVStack will be coming soon.

HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist
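
To inspect the result, the distributed array can be gathered into a single global array and, if needed, moved back to the host. The following is a short sketch that assumes the usual ``asarray`` method of :class:`pylops_mpi.DistributedArray` and standard CuPy host conversion:

.. code-block:: python

# Gather the distributed result into a single global array (still on the GPU)
y = y_dist.asarray()

# Move the result back to the host as a NumPy array
y_host = cp.asnumpy(y)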

Under the hood, PyLops-MPI uses both the MPI communicator and the NCCL communicator to manage distributed operations. Each GPU is logically bound to
one MPI process. In fact, minor communications, like those dealing with array-related shapes and sizes, are still performed using MPI, while collective calls on arrays, like AllReduce, are carried out through NCCL.
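
As an illustration of this binding, the following conceptual sketch (not PyLops-MPI's internal code) shows how an MPI process is commonly pinned to a single GPU before any NCCL communication takes place:

.. code-block:: python

from mpi4py import MPI
import cupy as cp

# Bind this MPI process to one of the locally visible GPUs
rank = MPI.COMM_WORLD.Get_rank()
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()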

.. note::

The CuPy backend is in active development, with many examples not yet in the docs.
You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
The CuPy and NCCL backend is in active development, with many examples not yet in the docs.
You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.

Support for NCCL Backend
------------------------
In the following, we provide a list of modules (i.e., operators and solvers) where we plan to support NCCL, alongside their current status:

.. list-table::
:widths: 50 25
:header-rows: 1

* - modules
- NCCL supported
* - :class:`pylops_mpi.DistributedArray`
- /
* - :class:`pylops_mpi.basicoperators.MPIVStack`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPIHStack`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPIGradient`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
- Ongoing
* - :class:`pylops_mpi.basicoperators.MPILaplacian`
- Ongoing
* - :class:`pylops_mpi.optimization.basic.cg`
- Ongoing
* - :class:`pylops_mpi.optimization.basic.cgls`
- Ongoing
* - ISTA Solver
- Planned
* - Complex Numeric Data Type for NCCL
- Planned
4 changes: 4 additions & 0 deletions docs/source/index.rst
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collaboration of multiple
computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated in an efficient and
parallelized manner.

PyLops-MPI also supports NVIDIA's Collective Communication Library (`NCCL <https://developer.nvidia.com/nccl>`_) for high-performance
GPU-to-GPU communication. PyLops-MPI's NCCL engine works congruently with MPI by delegating GPU-to-GPU communication tasks to the
highly-optimized NCCL, while leveraging MPI for CPU-side coordination and orchestration.

Get started by :ref:`installing PyLops-MPI <Installation>` and following our quick tour.

Terminology
37 changes: 37 additions & 0 deletions docs/source/installation.rst
@@ -45,6 +45,14 @@ Fork the `PyLops-MPI repository <https://github.com/PyLops/pylops-mpi>`_ and clone it.
We recommend installing dependencies into a separate environment.
For that end, we provide a `Makefile` with useful commands for setting up the environment.

Enable NVIDIA Collective Communication Library
==============================================
To obtain highly-optimized performance on GPU clusters, PyLops-MPI also supports NVIDIA's Collective Communication Library
(`NCCL <https://developer.nvidia.com/nccl>`_). Two additional dependencies are required: CuPy and NCCL.

* `CuPy with NCCL <https://docs.cupy.dev/en/stable/install.html>`_
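
Once these are installed (the step-by-step instructions below show how), a quick sanity check can confirm that both CuPy and NCCL are visible from Python. This is a small sketch relying only on standard CuPy calls:

.. code-block:: python

import cupy as cp
from cupy.cuda import nccl

print(cp.cuda.runtime.runtimeGetVersion())  # CUDA runtime version seen by CuPy
print(nccl.get_version())                   # NCCL version detected by CuPy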


Step-by-step installation for users
***********************************

@@ -89,6 +97,12 @@ For a ``conda`` environment, run

This will create and activate an environment called ``pylops_mpi``, with all required and optional dependencies.

If you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ in PyLops-MPI, run this instead

.. code-block:: bash

>> make dev-install_conda_nccl

Pip
---
If you prefer a ``pip`` installation, we provide the following command
@@ -100,6 +114,23 @@ If you prefer a ``pip`` installation, we provide the following command
Note that, differently from the ``conda`` command, the above **will not** create a virtual environment.
Make sure you create and activate your environment beforehand.

Similarly, if you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ but prefer using pip,
you must first check the CUDA version of your system:

.. code-block:: bash

>> nvidia-smi

The `Makefile` is pre-configured with CUDA 12.x. If you use this version, run

.. code-block:: bash

>> make dev-install_nccl

Otherwise, you can change the command in the `Makefile` to match your CUDA version,
e.g., if you use CUDA 11.x, change ``cupy-cuda12x`` and ``nvidia-nccl-cu12`` to ``cupy-cuda11x`` and ``nvidia-nccl-cu11``,
and then run the command.

Run tests
=========
To ensure that everything has been setup correctly, run tests:
@@ -110,6 +141,12 @@ To ensure that everything has been set up correctly, run tests:

Make sure no tests fail; this guarantees that the installation has been successful.

If PyLops-MPI is installed with NCCL support, also run the NCCL tests:

.. code-block:: bash

>> make tests_nccl

Run examples and tutorials
==========================
Since the sphinx-gallery creates examples/tutorials using only a single process, it is highly recommended to test the