@pentschev
Member

Introduce a new launcher for RapidsMPF that does not depend on mpirun. The new launcher, currently named rrun (as in RapidsMPF run), provides basic functionality to launch processes locally across multiple GPUs and is designed to be extended with different backends. The only existing backend uses file-based synchronization. In the future we plan to support other synchronization mechanisms, including over the network via a specialized service, Slurm, Kubernetes, and others that may be requested. Additionally, it will support automatic discovery and configuration of system topology per process (i.e., per GPU).

This initial implementation only supports single-node multi-GPU. The implementation for multi-node multi-GPU already exists, using SSH to spawn multiple processes, but was split into a different PR to keep this one shorter.
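
To illustrate what file-based synchronization between locally launched processes can look like, here is a minimal sketch. It is not taken from this PR: the helper names `announce_rank` and `wait_for_all_ranks`, the `rank_<N>.ready` file naming, and the polling interval are all assumptions made for illustration. Each process drops a marker file into a shared coordination directory and then polls until every rank has checked in.

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper (not rrun's API): a rank announces itself by creating
// "<coord_dir>/rank_<rank>.ready".
void announce_rank(fs::path const& coord_dir, int rank) {
    fs::create_directories(coord_dir);
    std::ofstream ofs(coord_dir / ("rank_" + std::to_string(rank) + ".ready"));
    ofs << rank << '\n';
}

// Hypothetical helper: poll until all `nranks` marker files exist or the
// timeout elapses. Returns true only if every rank checked in.
bool wait_for_all_ranks(
    fs::path const& coord_dir, int nranks, std::chrono::milliseconds timeout
) {
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        int found = 0;
        for (int r = 0; r < nranks; ++r) {
            if (fs::exists(coord_dir / ("rank_" + std::to_string(r) + ".ready"))) {
                ++found;
            }
        }
        if (found == nranks) {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

A scheme like this needs nothing beyond a shared filesystem, which is presumably why it suits a single-node launcher; a network or scheduler backend would replace the polling loop.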

@pentschev self-assigned this Oct 31, 2025
@pentschev requested review from a team as code owners October 31, 2025 14:14
@pentschev added the "feature request" (New feature or request) label Oct 31, 2025
@pentschev requested a review from AyodeAwe October 31, 2025 14:14
@pentschev added the "non-breaking" (Introduces a non-breaking change) label Oct 31, 2025
@pentschev
Member Author

Thanks @madsbk, I think I've addressed/responded to everything. Could you take another look when you have a chance?

// Write to temporary file
std::ofstream ofs(tmp_path, std::ios::binary | std::ios::trunc);
if (!ofs) {
throw std::runtime_error("Failed to open temporary file: " + tmp_path);
Member

ping

Comment on lines +231 to +233
if (!ifs) {
throw std::runtime_error("Failed to open file for reading: " + path);
}
Member

Use RAPIDSMPF_EXPECTS()

Member Author

Also fixed in db112cb

suppress_output->store(true, std::memory_order_relaxed);
// Forward signal to all local children
for (pid_t pid : pids) {
(void)kill(pid, sig);
Member

Suggested change
- (void)kill(pid, sig);
+ std::ignore = kill(pid, sig);
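
For context, a self-contained sketch of forwarding a signal to child processes while explicitly discarding the return value of `kill` with `std::ignore` might look as follows. The helpers `forward_signal`, `spawn_idle_child`, and `wait_for_signal_exit` are hypothetical names used only for this illustration and are not part of the PR; the sketch assumes a POSIX system.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <tuple>
#include <unistd.h>
#include <vector>

// Hypothetical helper: forward `sig` to every child. Failures are ignored on
// purpose, since a child may already have exited by the time we signal it.
void forward_signal(std::vector<pid_t> const& pids, int sig) {
    for (pid_t pid : pids) {
        std::ignore = kill(pid, sig);
    }
}

// Hypothetical demo helper: fork a child that blocks until it is signalled.
pid_t spawn_idle_child() {
    pid_t pid = fork();
    if (pid == 0) {
        for (;;) {
            pause();
        }
    }
    return pid;
}

// Hypothetical demo helper: reap `pid` and return the signal that terminated
// it, 0 if it exited normally, or -1 if waitpid failed.
int wait_for_signal_exit(pid_t pid) {
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Both `(void)kill(...)` and `std::ignore = kill(...)` discard the result; the suggestion is a style preference for the more explicit C++ form.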

Comment on lines 579 to 586
{
std::error_code ec;
std::filesystem::remove_all(cfg.coord_dir, ec);
if (ec) {
std::cerr << "Warning: Failed to cleanup directory: " << cfg.coord_dir
<< ": " << ec.message() << std::endl;
}
}
Member

Suggested change
- {
-     std::error_code ec;
-     std::filesystem::remove_all(cfg.coord_dir, ec);
-     if (ec) {
-         std::cerr << "Warning: Failed to cleanup directory: " << cfg.coord_dir
-                   << ": " << ec.message() << std::endl;
-     }
- }
+ std::error_code ec;
+ std::filesystem::remove_all(cfg.coord_dir, ec);
+ if (ec) {
+     std::cerr << "Warning: Failed to cleanup directory: " << cfg.coord_dir
+               << ": " << ec.message() << std::endl;
+ }
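
The cleanup pattern in this snippet relies on the `std::error_code` overload of `std::filesystem::remove_all`, which reports failure through `ec` instead of throwing. A minimal standalone sketch of the same idea, wrapped in a hypothetical helper `cleanup_coord_dir` (the name and the boolean return value are assumptions, not code from the PR):

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

// Hypothetical helper: best-effort removal of a coordination directory.
// A failure is reported as a warning rather than thrown, so cleanup never
// masks the launcher's real exit status.
bool cleanup_coord_dir(std::filesystem::path const& dir) {
    std::error_code ec;
    std::filesystem::remove_all(dir, ec);
    if (ec) {
        std::cerr << "Warning: Failed to cleanup directory: " << dir.string()
                  << ": " << ec.message() << std::endl;
        return false;
    }
    return true;
}
```

Removing a directory that does not exist is not an error with this overload, so the helper can be called unconditionally at shutdown.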

Member Author

@pentschev left a comment

I think I've addressed your new comments now, @madsbk. Sorry again for forgetting to push some of the changes from yesterday.

// Write to temporary file
std::ofstream ofs(tmp_path, std::ios::binary | std::ios::trunc);
if (!ofs) {
throw std::runtime_error("Failed to open temporary file: " + tmp_path);
Member Author

Sorry, I committed these changes locally yesterday but didn't push them. 🤦

Fixed in db112cb

Comment on lines +231 to +233
if (!ifs) {
throw std::runtime_error("Failed to open file for reading: " + path);
}
Member Author

Also fixed in db112cb

Member

@madsbk left a comment

LGTM, thanks @pentschev

Comment on lines 210 to 221
{
std::error_code ec;
std::filesystem::rename(tmp_path, path, ec);
if (ec) {
std::error_code rm_ec;
std::filesystem::remove(tmp_path, rm_ec); // Clean up temp file
RAPIDSMPF_FAIL(
"Failed to rename " + tmp_path + " to " + path + ": " + ec.message(),
std::runtime_error
);
}
}
Member

Suggested change
- {
-     std::error_code ec;
-     std::filesystem::rename(tmp_path, path, ec);
-     if (ec) {
-         std::error_code rm_ec;
-         std::filesystem::remove(tmp_path, rm_ec);  // Clean up temp file
-         RAPIDSMPF_FAIL(
-             "Failed to rename " + tmp_path + " to " + path + ": " + ec.message(),
-             std::runtime_error
-         );
-     }
- }
+ std::error_code ec;
+ std::filesystem::rename(tmp_path, path, ec);
+ if (ec) {
+     std::error_code rm_ec;
+     std::filesystem::remove(tmp_path, rm_ec);  // Clean up temp file
+     RAPIDSMPF_FAIL(
+         "Failed to rename " + tmp_path + " to " + path + ": " + ec.message(),
+         std::runtime_error
+     );
+ }
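
The write-to-temporary-then-rename pattern under review here can be sketched end to end as below. This is an illustration under stated assumptions, not the PR's code: the function names `write_file_atomically` and `read_file` and the `.tmp` suffix are hypothetical, and plain `throw` is used in place of `RAPIDSMPF_FAIL`. On POSIX systems, renaming within one filesystem replaces the target atomically, so a reader sees either the old file or the complete new one.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical helper: write `content` to "<path>.tmp", then rename it over
// `path` so readers never observe a partially written file.
void write_file_atomically(std::string const& path, std::string const& content) {
    std::string const tmp_path = path + ".tmp";
    {
        std::ofstream ofs(tmp_path, std::ios::binary | std::ios::trunc);
        if (!ofs) {
            throw std::runtime_error("Failed to open temporary file: " + tmp_path);
        }
        ofs << content;
        ofs.flush();
        if (!ofs) {
            throw std::runtime_error("Failed to write temporary file: " + tmp_path);
        }
    }  // The stream is closed here, before the rename.

    std::error_code ec;
    std::filesystem::rename(tmp_path, path, ec);
    if (ec) {
        std::error_code rm_ec;
        std::filesystem::remove(tmp_path, rm_ec);  // Clean up temp file
        throw std::runtime_error(
            "Failed to rename " + tmp_path + " to " + path + ": " + ec.message()
        );
    }
}

// Hypothetical helper: read a whole file back into a string.
std::string read_file(std::string const& path) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) {
        throw std::runtime_error("Failed to open file for reading: " + path);
    }
    return std::string(
        std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>()
    );
}
```

In this sketch the inner scope around the `std::ofstream` is needed to close the file before renaming, whereas the scope around `ec` in the reviewed code served no such purpose, which is what the suggestion removes.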

std::vector<int> gpus;
std::stringstream ss(gpu_str);
std::string item;

Member

Suggested change

RAPIDSMPF_FAIL("Invalid GPU ID: " + item, std::runtime_error);
}
}

Member

Suggested change

@pentschev
Member Author

@madsbk as discussed offline, we can't use RAPIDSMPF_{EXPECTS,FAIL} in the files that will be compiled for the rrun tool because they create a CUDA dependency for error checks and also indirectly a version check. We could have moved those to different files without CUDA dependencies but chose not to. Therefore, a note was added to the relevant files in 55757a4.
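
As a rough illustration of error checking without a CUDA dependency, the sketch below uses a small plain-C++ helper in place of `RAPIDSMPF_EXPECTS` and applies it to parsing a comma-separated GPU list. The names `expects` and `parse_gpu_list`, and the rule that IDs must be non-negative integers, are assumptions for this example and may differ from what the PR actually does.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for RAPIDSMPF_EXPECTS that depends only on the
// C++ standard library.
inline void expects(bool condition, std::string const& message) {
    if (!condition) {
        throw std::runtime_error(message);
    }
}

// Hypothetical helper: parse a list such as "0,1,3" into GPU IDs, rejecting
// empty, non-numeric, or negative entries.
std::vector<int> parse_gpu_list(std::string const& gpu_str) {
    std::vector<int> gpus;
    std::stringstream ss(gpu_str);
    std::string item;
    while (std::getline(ss, item, ',')) {
        std::size_t pos = 0;
        int id = -1;
        try {
            id = std::stoi(item, &pos);
        } catch (std::exception const&) {
            throw std::runtime_error("Invalid GPU ID: " + item);
        }
        // Reject trailing garbage (e.g. "1x") and negative IDs.
        expects(pos == item.size() && id >= 0, "Invalid GPU ID: " + item);
        gpus.push_back(id);
    }
    return gpus;
}
```

For example, `parse_gpu_list("0,1,3")` yields the IDs 0, 1, and 3, while `parse_gpu_list("0,x")` throws `std::runtime_error`.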
