Skip to content
Merged
10 changes: 10 additions & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,13 @@ rapids_cmake_write_version_file(include/rapidsmpf/version_config.hpp)
# Set a default build type if none was specified
rapids_cmake_build_type(Release)

# Set RAPIDSMPF_DEBUG default based on build type
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
set(RAPIDSMPF_DEBUG_DEFAULT ON)
else()
set(RAPIDSMPF_DEBUG_DEFAULT OFF)
endif()

Copy link
Member

@pentschev pentschev Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that the default is always off, even when CMake build type is debug, thus requiring you to specify it manually via --cmake-args, for example. The reason for that is we are more likely to build in debug mode to do general debugging than to require the extended NVTX annotations.

Furthermore, I think it makes more sense to have a different name, debug invokes a meaning of trying to resolve a problem, whereas I think this is more about performance validation/investigation. How do you feel about RAPIDSMPF_NVTX_DETAIL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A DEBUG mode macro came up in several discussions. So, I would like this to be used for not just nvtx annotations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAPIDSMPF_VERBOSE or simply RAPIDSMPF_DETAIL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DEBUG is a more specific term IMO, I'd prefer RAPIDSMPF_DETAIL. In any case, I think we should only enable it when explicitly specified, and not just depend on CMAKE_BUILD_TYPE=Debug to enable it by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, let's not use DEBUG.

But DETAIL might not be ideal either. It could be confused with our detail namespace and with macros like RAPIDSMPF_CONCAT_DETAIL_

# ##################################################################################################
# * build options ---------------------------------------------------------------------------------

Expand All @@ -53,6 +60,7 @@ option(BUILD_SHARED_LIBS "Build RapidsMPF shared library" ON)
option(CUDA_STATIC_RUNTIME "Statically link the CUDA runtime" OFF)
option(RAPIDSMPF_CLANG_TIDY "Enable clang-tidy during compilation" OFF)
option(RAPIDSMPF_ASAN "Enable AddressSanitizer" OFF)
option(RAPIDSMPF_DEBUG "Enable debug mode" ${RAPIDSMPF_DEBUG_DEFAULT})

message(STATUS "librapidsmpf build options:")
message(STATUS " BUILD_MPI_SUPPORT : ${BUILD_MPI_SUPPORT}")
Expand All @@ -67,6 +75,7 @@ message(STATUS " BUILD_SHARED_LIBS : ${BUILD_SHARED_LIBS}")
message(STATUS " CUDA_STATIC_RUNTIME : ${CUDA_STATIC_RUNTIME}")
message(STATUS " RAPIDSMPF_CLANG_TIDY : ${RAPIDSMPF_CLANG_TIDY}")
message(STATUS " RAPIDSMPF_ASAN : ${RAPIDSMPF_ASAN}")
message(STATUS " RAPIDSMPF_DEBUG : ${RAPIDSMPF_DEBUG}")

# Copy options to our prefix to prevent upstream projects from modifying them.
set(RAPIDSMPF_HAVE_MPI ${BUILD_MPI_SUPPORT})
Expand Down Expand Up @@ -245,6 +254,7 @@ target_compile_definitions(
$<$<BOOL:${RAPIDSMPF_HAVE_STREAMING}>:RAPIDSMPF_HAVE_STREAMING>
$<$<BOOL:${RAPIDSMPF_HAVE_CUPTI}>:RAPIDSMPF_HAVE_CUPTI>
$<$<BOOL:${RAPIDSMPF_HAVE_NUMA}>:RAPIDSMPF_HAVE_NUMA>
$<$<BOOL:${RAPIDSMPF_DEBUG}>:RAPIDSMPF_DEBUG>
)

rapids_cuda_set_runtime(rapidsmpf USE_STATIC ${CUDA_STATIC_RUNTIME})
Expand Down
18 changes: 18 additions & 0 deletions cpp/src/shuffler/shuffler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,9 @@ class Shuffler::Progress {
* @return The progress state of the shuffler.
*/
ProgressThread::ProgressState operator()() {
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE("Shuffler.Progress", p_iters++);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we embed the RAPIDSMPF_DEBUG check inside RAPIDSMPF_NVTX_SCOPED_RANGE()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing that means we disable it completely unless built with that flag. Most software, cuDF inclusive, AFAIK, doesn't do that, NVTX symbols are always present. I think the primary goal of this PR is not disabling NVTX entirely, but rather the very verbose/fine-grained NVTX annotations.

Copy link
Member

@madsbk madsbk Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then call it RAPIDSMPF_NVTX_SCOPED_RANGE_VERBOSE ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that option I like, +1 to that.

auto const t0_event_loop = Clock::now();

// Tags for each stage of the shuffle
Expand All @@ -186,7 +188,9 @@ class Shuffler::Progress {
{
auto const t0_send_metadata = Clock::now();
auto ready_chunks = shuffler_.outgoing_postbox_.extract_all_ready();
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE("meta_send", ready_chunks.size());
#endif
for (auto&& chunk : ready_chunks) {
// All messages in the chunk maps to the same key (checked by the PostBox)
// thus we can use the partition ID of the first message in the chunk to
Expand Down Expand Up @@ -226,8 +230,10 @@ class Shuffler::Progress {
// `incoming_chunks_`.
{
auto const t0_metadata_recv = Clock::now();
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE("meta_recv");
int i = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: now that this needs guards, rename i to something a bit more descriptive.

#endif
while (true) {
auto const [msg, src] = shuffler_.comm_->recv_any(metadata_tag);
if (msg) {
Expand All @@ -245,17 +251,23 @@ class Shuffler::Progress {
} else {
break;
}
#if RAPIDSMPF_DEBUG
i++;
#endif
}
stats.add_duration_stat(
"event-loop-metadata-recv", Clock::now() - t0_metadata_recv
);
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_MARKER("meta_recv_iters", i);
#endif
}

// Post receives for incoming chunks
{
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE("post_chunk_recv", incoming_chunks_.size());
#endif
auto const t0_post_incoming_chunk_recv = Clock::now();
for (auto it = incoming_chunks_.begin(); it != incoming_chunks_.end();) {
auto& [src, chunk] = *it;
Expand Down Expand Up @@ -342,6 +354,7 @@ class Shuffler::Progress {
// requested data.
{
auto const t0_init_gpu_data_send = Clock::now();
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE(
"init_gpu_send",
std::transform_reduce(
Expand All @@ -352,6 +365,7 @@ class Shuffler::Progress {
[](auto& kv) { return kv.second.size(); }
)
);
#endif
// ready_ack_receives_ are separated by rank so that we
// can guarantee that we don't match messages out of order
// when using the UCXX communicator. See comment in
Expand Down Expand Up @@ -379,7 +393,9 @@ class Shuffler::Progress {
// Check if any data in transit is finished.
{
auto const t0_check_future_finish = Clock::now();
#if RAPIDSMPF_DEBUG
RAPIDSMPF_NVTX_SCOPED_RANGE("check_fut_finish", in_transit_futures_.size());
#endif
if (!in_transit_futures_.empty()) {
std::vector<ChunkID> finished =
shuffler_.comm_->test_some(in_transit_futures_);
Expand Down Expand Up @@ -439,7 +455,9 @@ class Shuffler::Progress {
std::unordered_map<Rank, std::vector<std::unique_ptr<Communicator::Future>>>
ready_ack_receives_; ///< Receives matching ready for data messages.

#if RAPIDSMPF_DEBUG
int64_t p_iters = 0; ///< Number of progress iterations (for NVTX)
#endif
};

std::vector<PartID> Shuffler::local_partitions(
Expand Down