Documentation Update for NCCL #132
Conversation
Just left a few comments on what you have done so far. This needs much more work to be ready, but hopefully with the checklist I put in the PR description you will have an easier time navigating the documentation and adding the NCCL-related bits.
> To obtain highly optimized performance on GPU clusters, PyLops-MPI also supports NVIDIA's Collective Communication Library (`NCCL <https://developer.nvidia.com/nccl>`_).
When we worked with MPI we did not assume that users would be forced to use conda, so we also provided some quick instructions on how to set up MPI (see above) in case one wants to use pip to install mpi4py. I don't know how easy it would be to do the same for NCCL, but I am saying this so you can at least understand why this page was written this way.
I would move this later, next to where we have `make install_conda` and `make dev-install_conda`, saying that to use NCCL you would instead use the commands `make install_conda_nc` and `make dev-install_conda_nc` (and add these to the Makefile).
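Just to make the suggestion concrete, a rough sketch of what such Makefile targets could look like — the target names, the environment file name, and the recipe contents are all assumptions for illustration, not what is in the repo:

```make
# Hypothetical targets for an NCCL-enabled conda environment.
# Names and the environment file are assumptions for illustration only.
install_conda_nc:
	conda env create -f environment-dev.yml
	conda install -c conda-forge -y cupy nccl

dev-install_conda_nc: install_conda_nc
	pip install -e .
```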
Currently, I saw that there is an NCCL package on pip, like `pip install nvidia-nccl-cu12`, with separate packages listed for CUDA 11 and 12. It could be a bit more complicated than conda, as conda can automatically detect the version based on the driver: `conda install -c conda-forge cupy nccl`.
But I guess the users/developers have to define their own CUDA version anyway, because for CuPy there are also separate packages for CUDA 11 and 12, i.e. `pip install cupy-cuda11x`. I can try making a venv and testing `pip install cupy-cuda12x` & `pip install nvidia-nccl-cu12`, since I have CUDA driver version 12. If it works cleanly, I can add instructions for installing NCCL with pip to the doc.
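To spell out the point about pip: unlike conda, the user has to pick the wheels matching their CUDA major version themselves. A tiny hypothetical helper (the function name is made up; the wheel names are the ones published on PyPI) to illustrate the mapping:

```python
def nccl_pip_wheels(cuda_major):
    """Return the (CuPy, NCCL) pip wheel names for a given CUDA major version.

    Hypothetical helper for illustration: with pip, the wheel must be
    chosen manually to match the installed CUDA driver.
    """
    wheels = {
        11: ("cupy-cuda11x", "nvidia-nccl-cu11"),
        12: ("cupy-cuda12x", "nvidia-nccl-cu12"),
    }
    try:
        return wheels[cuda_major]
    except KeyError:
        raise ValueError(f"no known pip wheels for CUDA {cuda_major}") from None
```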
Precisely 😄 I also like to use conda, but for users that are not allowed to (as well as for ourselves, since in CI it is better to use barebones Python with venv), it would be great if we can find a pip-only solution too.
Ongoing update - in parallel to NCCL implementation PR (#130)
Tasks
- `README`: mention the possibility of using NCCL instead of MPI for distributed CuPy arrays; update the install, example, and tests sections with NCCL-related commands
- `index.rst`: similar to README, reflect the new NCCL engine
- `gpu.rst`: document the new env variable (`NCCL_PYLOPS_MPI`), add NCCL to the example, and perhaps consider adding a table like in https://pylops.readthedocs.io/en/stable/gpu.html to document which features are supported in NCCL and which are not, e.g. the missing support for complex numbers (this can also serve as a live roadmap for your work: as we progress we should see more and more features being supported by both MPI and NCCL)
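On the env variable: the exact semantics of `NCCL_PYLOPS_MPI` live in the implementation PR, but assuming it is an on/off toggle (an assumption on my part), the docs could show something like this:

```python
import os

# Sketch only: assume NCCL_PYLOPS_MPI is an on/off switch that enables the
# NCCL engine unless explicitly set to "0". The real semantics are defined
# in the NCCL implementation PR, not here.
def nccl_enabled():
    return os.environ.get("NCCL_PYLOPS_MPI", "1") != "0"
```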