Skip to content

Add support for NCCL in communication_object#195

Draft
msimberg wants to merge 55 commits intoghex-org:masterfrom
msimberg:nccl-2
Draft

Add support for NCCL in communication_object#195
msimberg wants to merge 55 commits intoghex-org:masterfrom
msimberg:nccl-2

Conversation

@msimberg
Copy link
Copy Markdown
Collaborator

@msimberg msimberg commented Dec 22, 2025

Nothing to see here yet.

This requires ghex-org/oomph#55 and #190.

Updates communication_object to use the start_group/end_group functionality from oomph/NCCL, as well as taking is_stream_aware into account.

Also does a minor refactoring of packer and communication object helper functions so that the different stages are a bit easier to follow:

  • optional: sync before packing
  • pack
  • send
  • recv
  • unpack
  • optional: sync after unpacking

This replaces #185.

@philip-paul-mueller philip-paul-mueller marked this pull request as ready for review February 3, 2026 09:15
@philip-paul-mueller philip-paul-mueller marked this pull request as draft February 3, 2026 09:15
@msimberg msimberg requested a review from Copilot March 20, 2026 09:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds NCCL-aware behavior to communication_object including group semantics and stream-aware scheduling, along with plumbing updates across device utilities, Python bindings, and tests.

Changes:

  • Refactors packing/sending/receiving/unpacking flow and introduces scheduled exchange (schedule_exchange / schedule_wait) with CUDA stream dependencies.
  • Adds CUDA event + event pool utilities and updates CUDA stream synchronization behavior.
  • Updates CMake and tests/bindings to accommodate the NCCL backend and new async APIs.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
test/unstructured/test_user_concepts.cpp Adds async data-descriptor test and adjusts tests to handle NCCL thread-safe limitations; disables a known-crashing test.
test/test_context.cpp Wraps context/barrier tests to accept expected NCCL thread-safe runtime error.
test/structured/regular/test_simple_regular_domain.cpp Wraps simulation in exception handling for NCCL thread-safe limitation; notes NCCL warning.
test/structured/regular/test_regular_domain.cpp Wraps exchange tests for NCCL thread-safe limitation; comments out two correctness checks under NCCL.
test/structured/regular/test_local_rma.cpp Wraps RMA test for NCCL thread-safe limitation; notes NCCL warning.
test/bindings/python/test_unstructured_domain_descriptor.py Adds GPU/CPU parameterization and new async schedule_exchange test using CUDA streams (CuPy).
include/ghex/util/moved_bit.hpp Marks defaulted special members noexcept.
include/ghex/packer.hpp Refactors packers to pack a single buffer into provided pointer (separating send posting from packing).
include/ghex/device/stream.hpp Adjusts default stream stub and adds operator bool.
include/ghex/device/event_pool.hpp Adds event_pool abstraction with CUDA and non-CUDA stubs.
include/ghex/device/event.hpp Adds cuda_event abstraction with CUDA and non-CUDA stubs.
include/ghex/device/cuda/stream.hpp Changes CUDA stream sync implementation and adds moved-state assertions.
include/ghex/device/cuda/runtime.hpp Extends CUDA↔HIP macro mapping with cudaErrorNotReady and cudaStreamWaitEvent.
include/ghex/device/cuda/event_pool.hpp Implements CUDA event pool used for scheduling stream dependencies.
include/ghex/device/cuda/event.hpp Implements CUDA event wrapper with record/query helpers.
include/ghex/communication_object.hpp Adds schedule_exchange/schedule_wait flow, group start/end, event-based stream dependencies, and refactors send/recv/unpack structure.
ext/oomph Updates oomph submodule revision required for NCCL/group support.
cmake/ghex_external_dependencies.cmake Adds NCCL as transport backend option and links oomph_nccl; adds NCCL find_package guarded by GHEX_USE_NCCL.
bindings/python/src/_pyghex/unstructured/field_descriptor.cpp Minor comment/formatting fix in GPU buffer accessor; removes stray ;.
bindings/python/src/_pyghex/unstructured/communication_object.cpp Adds Python bindings for schedule_exchange/schedule_wait and CUDA stream extraction.
CMakeLists.txt Sets RPATH properties for oomph_nccl when NCCL backend is selected.
.github/workflows/CI.yml Builds with verbose output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +58 to +65
const auto protocol_version = cuda_stream_protocol[0].cast<std::size_t>();
if (protocol_version == 0)
{
std::stringstream error;
error << "Expected `__cuda_stream__` protocol version 0, but got "
<< protocol_version;
throw pybind11::type_error(error.str());
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __cuda_stream__ protocol version check is inverted: this currently throws when protocol_version is 0 (the expected version) and accepts non-zero versions. Change the condition to reject versions other than 0 (e.g., protocol_version != 0) to match the error message and the intended behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +33
cudaEvent_t m_event;
ghex::util::moved_bit m_moved;
bool m_recorded;

cuda_event() {
GHEX_CHECK_CUDA_RESULT(cudaEventCreateWithFlags(&m_event, cudaEventDisableTiming))
};
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_recorded is never initialized, but it is read in is_ready(). This is undefined behavior and can cause spurious readiness results. Initialize it to false (e.g., via in-class initializer bool m_recorded{false}; and/or in the constructor initializer list).

Copilot uses AI. Check for mistakes.
#include <ghex/util/moved_bit.hpp>
#include <cassert>
#include <memory>
#include <vector>
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::runtime_error is used but <stdexcept> is not included in this header, which can break compilation depending on transitive includes. Add #include <stdexcept> to this file.

Suggested change
#include <vector>
#include <vector>
#include <stdexcept>

Copilot uses AI. Check for mistakes.
*/
void rewind()
{
if (m_moved) { throw std::runtime_error("ERROR: Can not reset a moved pool."); }
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::runtime_error is used but <stdexcept> is not included in this header, which can break compilation depending on transitive includes. Add #include <stdexcept> to this file.

Copilot uses AI. Check for mistakes.

// By returning `true` we emulate the behaviour of a
// CUDA `stream` that has been moved.
constexpr operator bool() const noexcept { return true; }
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stub operator bool() always returns true, which makes every device::stream instance appear "moved" if callers interpret this similarly to the CUDA implementation (where operator bool() reflects moved-state). Consider tracking moved-state in the stub (e.g., with ghex::util::moved_bit) and returning that, or return false consistently to mean "not moved".

Suggested change
constexpr operator bool() const noexcept { return true; }
constexpr operator bool() const noexcept { return false; }

Copilot uses AI. Check for mistakes.
Comment on lines +703 to +705
// TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos...
// EXPECT_TRUE(check_values(field_2a));
// EXPECT_TRUE(check_values(field_2b));
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting out these assertions removes coverage for field_2a/field_2b correctness and can mask regressions for non-NCCL backends as well. Prefer conditioning the checks on the active transport backend (skip only for NCCL), or mark the NCCL case as an expected failure/skip with a clear reason while keeping the assertions active for other transports.

Suggested change
// TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos...
// EXPECT_TRUE(check_values(field_2a));
// EXPECT_TRUE(check_values(field_2b));
#ifndef GHEX_USE_NCCL
EXPECT_TRUE(check_values(field_2a));
EXPECT_TRUE(check_values(field_2b));
#else
// Known issue: field_2a and field_2b are currently wrong with NCCL
// (different pattern and halos). Checks are disabled only for NCCL.
#endif

Copilot uses AI. Check for mistakes.
Comment on lines +151 to +153
#if 0
// This test results in a segmentation fault. The error is
// also present on `master` (61f9ebbae4).
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling the test via #if 0 removes it entirely from compilation and can hide bitrot. Prefer converting it to a runtime skip (e.g., GTEST_SKIP() with the linked reason) or using a disabled test name (DISABLED_...) so the test still builds and can be re-enabled more easily.

Copilot uses AI. Check for mistakes.
//test_in_place_receive_oversubscribe(ctxt);
if (thread_safe) test_in_place_receive_threads(ctxt);
}
#endif
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling the test via #if 0 removes it entirely from compilation and can hide bitrot. Prefer converting it to a runtime skip (e.g., GTEST_SKIP() with the linked reason) or using a disabled test name (DISABLED_...) so the test still builds and can be re-enabled more easily.

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +27
catch (std::runtime_error const& e)
{
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use throw; instead of throw e; when rethrowing inside a catch block. throw e; creates a new exception object (and can lose original exception context), while throw; preserves the original exception.

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +35
else { throw e; }
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use throw; instead of throw e; when rethrowing inside a catch block. throw e; creates a new exception object (and can lose original exception context), while throw; preserves the original exception.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +113

# ---------------------------------------------------------------------
# nccl setup
# ---------------------------------------------------------------------
if(GHEX_USE_NCCL)
find_package(NCCL REQUIRED)
endif()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do: is this needed in ghex or enough in oomph?

Comment on lines +932 to +933
// TODO: Returns "NCCL WARN PXN should not use host buffers for data" with NCCL. Why? Test works
// with NCCL_PXN_DISABLE=1.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do: check if this is still relevant.

Comment on lines +369 to +370
// TODO: NCCL fails with "NCCL WARN Trying to recv to self without a matching send". Inherent to
// test? Avoidable?
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check again.

Comment on lines +703 to +705
// TODO: field_2a and 2b are wrong with NCCL, others ok. Why? Different pattern and halos...
// EXPECT_TRUE(check_values(field_2a));
// EXPECT_TRUE(check_values(field_2b));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check this.

Comment on lines +477 to +478
// TODO: NCCL fails with "NCCL WARN Trying to recv to self without a matching send". Inherent to
// test? Avoidable?
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check.

Comment on lines +635 to +636
[](context::message_type&, context::rank_type, context::tag_type) {
}));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do: can we leave out the callback?

Comment on lines +708 to +709
[](context::message_type&, context::rank_type,
context::tag_type) {}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if we can remove the empty callback?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants