Add support for NCCL in communication_object#195
Add support for NCCL in communication_object#195msimberg wants to merge 55 commits intoghex-org:masterfrom
communication_object#195Conversation
There was a problem hiding this comment.
Pull request overview
Adds NCCL-aware behavior to communication_object including group semantics and stream-aware scheduling, along with plumbing updates across device utilities, Python bindings, and tests.
Changes:
- Refactors packing/sending/receiving/unpacking flow and introduces scheduled exchange (
schedule_exchange/schedule_wait) with CUDA stream dependencies. - Adds CUDA event + event pool utilities and updates CUDA stream synchronization behavior.
- Updates CMake and tests/bindings to accommodate the NCCL backend and new async APIs.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| test/unstructured/test_user_concepts.cpp | Adds async data-descriptor test and adjusts tests to handle NCCL thread-safe limitations; disables a known-crashing test. |
| test/test_context.cpp | Wraps context/barrier tests to accept expected NCCL thread-safe runtime error. |
| test/structured/regular/test_simple_regular_domain.cpp | Wraps simulation in exception handling for NCCL thread-safe limitation; notes NCCL warning. |
| test/structured/regular/test_regular_domain.cpp | Wraps exchange tests for NCCL thread-safe limitation; comments out two correctness checks under NCCL. |
| test/structured/regular/test_local_rma.cpp | Wraps RMA test for NCCL thread-safe limitation; notes NCCL warning. |
| test/bindings/python/test_unstructured_domain_descriptor.py | Adds GPU/CPU parameterization and new async schedule_exchange test using CUDA streams (CuPy). |
| include/ghex/util/moved_bit.hpp | Marks defaulted special members noexcept. |
| include/ghex/packer.hpp | Refactors packers to pack a single buffer into provided pointer (separating send posting from packing). |
| include/ghex/device/stream.hpp | Adjusts default stream stub and adds operator bool. |
| include/ghex/device/event_pool.hpp | Adds event_pool abstraction with CUDA and non-CUDA stubs. |
| include/ghex/device/event.hpp | Adds cuda_event abstraction with CUDA and non-CUDA stubs. |
| include/ghex/device/cuda/stream.hpp | Changes CUDA stream sync implementation and adds moved-state assertions. |
| include/ghex/device/cuda/runtime.hpp | Extends CUDA↔HIP macro mapping with cudaErrorNotReady and cudaStreamWaitEvent. |
| include/ghex/device/cuda/event_pool.hpp | Implements CUDA event pool used for scheduling stream dependencies. |
| include/ghex/device/cuda/event.hpp | Implements CUDA event wrapper with record/query helpers. |
| include/ghex/communication_object.hpp | Adds schedule_exchange/schedule_wait flow, group start/end, event-based stream dependencies, and refactors send/recv/unpack structure. |
| ext/oomph | Updates oomph submodule revision required for NCCL/group support. |
| cmake/ghex_external_dependencies.cmake | Adds NCCL as transport backend option and links oomph_nccl; adds NCCL find_package guarded by GHEX_USE_NCCL. |
| bindings/python/src/_pyghex/unstructured/field_descriptor.cpp | Minor comment/formatting fix in GPU buffer accessor; removes stray ;. |
| bindings/python/src/_pyghex/unstructured/communication_object.cpp | Adds Python bindings for schedule_exchange/schedule_wait and CUDA stream extraction. |
| CMakeLists.txt | Sets RPATH properties for oomph_nccl when NCCL backend is selected. |
| .github/workflows/CI.yml | Builds with verbose output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const auto protocol_version = cuda_stream_protocol[0].cast<std::size_t>(); | ||
| if (protocol_version == 0) | ||
| { | ||
| std::stringstream error; | ||
| error << "Expected `__cuda_stream__` protocol version 0, but got " | ||
| << protocol_version; | ||
| throw pybind11::type_error(error.str()); | ||
| } |
There was a problem hiding this comment.
The __cuda_stream__ protocol version check is inverted: this currently throws when protocol_version is 0 (the expected version) and accepts non-zero versions. Change the condition to reject versions other than 0 (e.g., protocol_version != 0) to match the error message and the intended behavior.
| cudaEvent_t m_event; | ||
| ghex::util::moved_bit m_moved; | ||
| bool m_recorded; | ||
|
|
||
| cuda_event() { | ||
| GHEX_CHECK_CUDA_RESULT(cudaEventCreateWithFlags(&m_event, cudaEventDisableTiming)) | ||
| }; |
There was a problem hiding this comment.
m_recorded is never initialized, but it is read in is_ready(). This is undefined behavior and can cause spurious readiness results. Initialize it to false (e.g., via in-class initializer bool m_recorded{false}; and/or in the constructor initializer list).
| #include <ghex/util/moved_bit.hpp> | ||
| #include <cassert> | ||
| #include <memory> | ||
| #include <vector> |
There was a problem hiding this comment.
std::runtime_error is used but <stdexcept> is not included in this header, which can break compilation depending on transitive includes. Add #include <stdexcept> to this file.
| #include <vector> | |
| #include <vector> | |
| #include <stdexcept> |
| */ | ||
| void rewind() | ||
| { | ||
| if (m_moved) { throw std::runtime_error("ERROR: Can not reset a moved pool."); } |
There was a problem hiding this comment.
std::runtime_error is used but <stdexcept> is not included in this header, which can break compilation depending on transitive includes. Add #include <stdexcept> to this file.
|
|
||
| // By returning `true` we emulate the behaviour of a | ||
| // CUDA `stream` that has been moved. | ||
| constexpr operator bool() const noexcept { return true; } |
There was a problem hiding this comment.
This stub operator bool() always returns true, which makes every device::stream instance appear "moved" if callers interpret this similarly to the CUDA implementation (where operator bool() reflects moved-state). Consider tracking moved-state in the stub (e.g., with ghex::util::moved_bit) and returning that, or return false consistently to mean "not moved".
| constexpr operator bool() const noexcept { return true; } | |
| constexpr operator bool() const noexcept { return false; } |
| // TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos... | ||
| // EXPECT_TRUE(check_values(field_2a)); | ||
| // EXPECT_TRUE(check_values(field_2b)); |
There was a problem hiding this comment.
Commenting out these assertions removes coverage for field_2a/field_2b correctness and can mask regressions for non-NCCL backends as well. Prefer conditioning the checks on the active transport backend (skip only for NCCL), or mark the NCCL case as an expected failure/skip with a clear reason while keeping the assertions active for other transports.
| // TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos... | |
| // EXPECT_TRUE(check_values(field_2a)); | |
| // EXPECT_TRUE(check_values(field_2b)); | |
| #ifndef GHEX_USE_NCCL | |
| EXPECT_TRUE(check_values(field_2a)); | |
| EXPECT_TRUE(check_values(field_2b)); | |
| #else | |
| // Known issue: field_2a and field_2b are currently wrong with NCCL | |
| // (different pattern and halos). Checks are disabled only for NCCL. | |
| #endif |
| #if 0 | ||
| // This test results in a segmentation fault. The error is | ||
| // also present on `master` (61f9ebbae4). |
There was a problem hiding this comment.
Disabling the test via #if 0 removes it entirely from compilation and can hide bitrot. Prefer converting it to a runtime skip (e.g., GTEST_SKIP() with the linked reason) or using a disabled test name (DISABLED_...) so the test still builds and can be re-enabled more easily.
| //test_in_place_receive_oversubscribe(ctxt); | ||
| if (thread_safe) test_in_place_receive_threads(ctxt); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Disabling the test via #if 0 removes it entirely from compilation and can hide bitrot. Prefer converting it to a runtime skip (e.g., GTEST_SKIP() with the linked reason) or using a disabled test name (DISABLED_...) so the test still builds and can be re-enabled more easily.
| catch (std::runtime_error const& e) | ||
| { |
There was a problem hiding this comment.
Use throw; instead of throw e; when rethrowing inside a catch block. throw e; creates a new exception object (and can lose original exception context), while throw; preserves the original exception.
test/test_context.cpp
Outdated
| else { throw e; } | ||
| } |
There was a problem hiding this comment.
Use throw; instead of throw e; when rethrowing inside a catch block. throw e; creates a new exception object (and can lose original exception context), while throw; preserves the original exception.
|
|
||
| # --------------------------------------------------------------------- | ||
| # nccl setup | ||
| # --------------------------------------------------------------------- | ||
| if(GHEX_USE_NCCL) | ||
| find_package(NCCL REQUIRED) | ||
| endif() |
There was a problem hiding this comment.
To do: is this needed in ghex or enough in oomph?
| // TODO: Returns "NCCL WARN PXN should not use host buffers for data" with NCCL. Why? Test works | ||
| // with NCCL_PXN_DISABLE=1. |
There was a problem hiding this comment.
To do: check if this is still relevant.
| // TODO: NCCL fails with "NCCL WARN Trying to recv to self without a matching send". Inherent to | ||
| // test? Avoidable? |
| // TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos... | ||
| // EXPECT_TRUE(check_values(field_2a)); | ||
| // EXPECT_TRUE(check_values(field_2b)); |
| // TODO: NCCL fails with "NCCL WARN Trying to recv to self without a matching send". Inherent to | ||
| // test? Avoidable? |
| [](context::message_type&, context::rank_type, context::tag_type) { | ||
| })); |
There was a problem hiding this comment.
To do: can we leave out the callback?
| [](context::message_type&, context::rank_type, | ||
| context::tag_type) {} |
There was a problem hiding this comment.
Check if we can remove the empty callback?
Nothing to see here yet.
This requires ghex-org/oomph#55 and #190.
Updates
communication_objectto use thestart_group/end_groupfunctionality from oomph/NCCL, as well as takingis_stream_awareinto account.Also does a minor refactoring of packer and communication object helper functions so that the different stages are a bit easier to follow:
This replaces #185.