Add RapidsMPF launcher rrun
#616
Merged
Changes from 47 commits
66 commits
- 39c9cf9 Add blastoff (pentschev)
- 63fe70f Multi-node support via SSH (pentschev)
- fffc5a5 Add support to pass environment variables to spawned processes (pentschev)
- 593ab52 Disable output buffering (pentschev)
- 1f5aa97 Remove `initialized` synchronization file that causes deadlocks (pentschev)
- 700b5c0 Periodically refresh NFS directory cache (pentschev)
- 5a83c37 Add --tag-output support (pentschev)
- adc5eeb Support terminating remote SSH child processes (pentschev)
- a2fac42 Omit verbose `sh kill` messages (pentschev)
- 2e05013 Fix broken early termination (pentschev)
- e6b749b Acknowledge terminate request (pentschev)
- 88ce281 Use std::chrono::milliseconds instead of int (pentschev)
- 629aa18 Use string_view where possible (pentschev)
- 71c87e2 Use std::filesystem::create_directories (pentschev)
- 696f3c6 Use rename and remove from std::filesystem (pentschev)
- deba42a Use std::filesystem to attempt refreshing NFS directories (pentschev)
- decb3e8 Remove remove_dir_recursive/create_coord_dir in favor of std::filesystem (pentschev)
- 509b627 Replace usleep with sleep_for (pentschev)
- 0f9887c Use dedicated signal handling thread (pentschev)
- 6c9be66 Close stdout/stderr pipes when termination begins (pentschev)
- 9c880ac Fix per-line atomicity of output (pentschev)
- 6a38dde Add rrun smoketests (pentschev)
- fb1c1e9 Remove SSH support (pentschev)
- 1ec2650 Move UCXX implementation to new files, fix compile-time checks (pentschev)
- 6126103 Fix missing nodiscard (pentschev)
- 839973c Update cpp/src/bootstrap/file_backend.cpp (KyleFromNVIDIA)
- 12b3947 Update cpp/tools/CMakeLists.txt (KyleFromNVIDIA)
- 8680224 Add and use Conda recipe for tools (KyleFromNVIDIA)
- e4ae979 Fix header (KyleFromNVIDIA)
- 7342798 Remove ignore_run_exports (KyleFromNVIDIA)
- 1a3aae9 Revert "Remove ignore_run_exports" (KyleFromNVIDIA)
- b88f010 Revert "Add and use Conda recipe for tools" (KyleFromNVIDIA)
- 07b7e3c Put tools in librapidsmpf (KyleFromNVIDIA)
- 4aaa231 Merge remote-tracking branch 'upstream/main' into rrun (pentschev)
- aa9eefc Unify ucxx-bootstrap and ucxx (pentschev)
- 9204eaf Install tools into librapidsmpf wheel (KyleFromNVIDIA)
- 6afb08a Bring namespace open/close bracket closer to format (pentschev)
- 51d0f14 rapidsmpf component (KyleFromNVIDIA)
- 61ecf6a Use Duration alias (pentschev)
- 8df332d Simplify backend selection code (pentschev)
- 615c5ee Fix older mentions (pentschev)
- 97ef6d5 Remove unnecessary break statement (pentschev)
- 9ba0e3d Print FileBackend destructor errors to stderr (pentschev)
- 40465cc Cleanup temporary directory during FileBackend destructor (pentschev)
- 2cd918f Further cleanups (pentschev)
- 467a861 Merge remote-tracking branch 'origin/rrun' into rrun (pentschev)
- 28e9969 Merge remote-tracking branch 'upstream/main' into rrun (pentschev)
- b6aba18 Document return of generate_session_id (pentschev)
- db112cb Use RAPIDSMPF_{EXPECTS,FAIL} (pentschev)
- 73fde7d Use std::ignore instead of void cast (pentschev)
- 8ee9525 Use seconds instead of milliseconds as defaults (pentschev)
- 3aa29d9 Merge remote-tracking branch 'origin/rrun' into rrun (pentschev)
- 22e4742 Use Duration in wait_for_file (pentschev)
- 7df3104 Linting (pentschev)
- d9fff6d More linting (pentschev)
- 4926172 Clarify GPU indices (pentschev)
- a02ffd5 Fix linting (pentschev)
- 1b7c558 Merge branch 'main' into rrun (pentschev)
- a9563ad More code formatting changes (pentschev)
- e190b68 Merge branch 'main' into rrun (pentschev)
- 6325b32 Fix build errors (pentschev)
- eafe45e Fix build error attempt 2 (pentschev)
- 9ff9d3a Formatting (pentschev)
- 51e010c Revert "Use RAPIDSMPF_{EXPECTS,FAIL}" (pentschev)
- df05f11 Replace other uses of RAPIDSMPF_{EXPECTS,FAIL} with throw (pentschev)
- 55757a4 Add note on not using `RAPIDSMPF_{EXPECTS,FAIL}` (pentschev)
@@ -0,0 +1,139 @@

```cpp
/**
 * SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
 * SPDX-License-Identifier: Apache-2.0
 */

#pragma once

#include <chrono>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>

#include <rapidsmpf/config.hpp>
#include <rapidsmpf/utils.hpp>

namespace rapidsmpf::bootstrap {

/// @brief Type alias for communicator::Rank
using Rank = std::int32_t;

/// @brief Type alias for duration type
using rapidsmpf::Duration;

/**
 * @brief Backend types for process coordination and bootstrapping.
 */
enum class Backend {
    /**
     * @brief Automatically detect the best backend based on environment.
     *
     * Detection order:
     * 1. File-based (default fallback)
     */
    AUTO,

    /**
     * @brief File-based coordination using a shared directory.
     *
     * Uses filesystem for rank coordination and address exchange. Works on single-node
     * and multi-node with shared storage (e.g., NFS) via SSH. Requires RAPIDSMPF_RANK,
     * RAPIDSMPF_NRANKS, RAPIDSMPF_COORD_DIR environment variables.
     */
    FILE,
};

/**
 * @brief Context information for the current process/rank.
 *
 * This structure contains the rank assignment and total rank count,
 * along with additional metadata about the execution environment.
 */
struct Context {
    /** @brief This process's rank (0-indexed). */
    Rank rank;

    /** @brief Total number of ranks in the job. */
    Rank nranks;

    /** @brief Backend used for coordination. */
    Backend backend;

    /** @brief Coordination directory (for FILE backend). */
    std::optional<std::string> coord_dir;
};

/**
 * @brief Initialize the bootstrap context from environment variables.
 *
 * This function reads environment variables to determine rank, nranks, and
 * backend configuration. It should be called early in the application lifecycle.
 *
 * Environment variables checked (in order of precedence):
 * - RAPIDSMPF_RANK: Explicitly set rank
 * - RAPIDSMPF_NRANKS: Explicitly set total rank count
 * - RAPIDSMPF_COORD_DIR: File-based coordination directory
 *
 * @param backend Backend to use (default: AUTO for auto-detection).
 * @return Context object containing rank and coordination information.
 * @throws std::runtime_error if environment is not properly configured.
 *
 * @code
 * auto ctx = rapidsmpf::bootstrap::init();
 * std::cout << "I am rank " << ctx.rank << " of " << ctx.nranks << std::endl;
 * @endcode
 */
Context init(Backend backend = Backend::AUTO);

/**
 * @brief Broadcast data from root rank to all other ranks.
 *
 * This is a helper function for broadcasting small amounts of data during
 * bootstrapping. It uses the underlying backend's coordination mechanism.
 *
 * @param ctx Bootstrap context.
 * @param data Data buffer to broadcast (both input on root, output on others).
 * @param size Size of data in bytes.
 * @param root Root rank performing the broadcast (default: 0).
 */
void broadcast(Context const& ctx, void* data, std::size_t size, Rank root = 0);

/**
 * @brief Perform a barrier synchronization across all ranks.
 *
 * This ensures all ranks reach this point before any rank proceeds.
 *
 * @param ctx Bootstrap context.
 */
void barrier(Context const& ctx);

/**
 * @brief Store a key-value pair in the coordination backend.
 *
 * This is useful for custom coordination beyond UCXX address exchange.
 *
 * @param ctx Bootstrap context.
 * @param key Key name.
 * @param value Value to store.
 */
void put(Context const& ctx, std::string const& key, std::string const& value);

/**
 * @brief Retrieve a value from the coordination backend.
 *
 * This function blocks until the key is available or timeout occurs.
 *
 * @param ctx Bootstrap context.
 * @param key Key name to retrieve.
 * @param timeout Timeout duration.
 * @return Value associated with the key.
 * @throws std::runtime_error if key not found within timeout.
 */
std::string get(
    Context const& ctx,
    std::string const& key,
    Duration timeout = std::chrono::milliseconds{30000}
);

}  // namespace rapidsmpf::bootstrap
```
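
The `put`/`get` pair declared in this header describes a blocking key-value exchange over a shared directory. The following is a minimal, self-contained sketch of how such a pair could be built on `std::filesystem`; it is not the PR's `file_backend.cpp`. Everything in the `sketch` namespace is hypothetical: the functions take a directory path instead of a `Context`, `Duration` is assumed to be `std::chrono::duration<double>`, and the `.tmp` suffix and 10 ms polling interval are illustrative choices.

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <system_error>
#include <thread>

namespace sketch {  // hypothetical namespace, not part of rapidsmpf

namespace fs = std::filesystem;

// Assumption: a floating-point seconds duration stands in for rapidsmpf::Duration.
using Duration = std::chrono::duration<double>;

// Store `value` in the file `coord_dir/key`.
// The value is written to a temporary file first and then renamed, so a
// concurrent reader sees either no file or the complete value.
inline void put(
    fs::path const& coord_dir, std::string const& key, std::string const& value
) {
    fs::create_directories(coord_dir);
    fs::path const tmp_path = coord_dir / (key + ".tmp");
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) {
            throw std::runtime_error("cannot open " + tmp_path.string());
        }
        out << value;
    }  // the stream is flushed and closed here, before the rename
    fs::rename(tmp_path, coord_dir / key);
}

// Poll for `coord_dir/key` until it exists or `timeout` elapses.
// Throws std::runtime_error if the key does not appear in time.
inline std::string get(
    fs::path const& coord_dir,
    std::string const& key,
    Duration timeout = std::chrono::seconds{30},
    Duration poll_interval = std::chrono::milliseconds{10}
) {
    using Clock = std::chrono::steady_clock;
    fs::path const path = coord_dir / key;
    auto const deadline =
        Clock::now() + std::chrono::duration_cast<Clock::duration>(timeout);
    while (true) {
        std::error_code ec;  // non-throwing overload: a failed check counts as "absent"
        if (fs::exists(path, ec)) {
            std::ifstream in(path, std::ios::binary);
            if (in) {
                return std::string(
                    std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()
                );
            }
        }
        if (Clock::now() >= deadline) {
            throw std::runtime_error("key '" + key + "' not available within timeout");
        }
        std::this_thread::sleep_for(poll_interval);
    }
}

}  // namespace sketch
```

In this sketch one process would call `sketch::put(dir, "addr", value)` while the others call `sketch::get(dir, "addr")`. The write-then-rename step is what keeps a reader from picking up a half-written value; on network filesystems such as NFS, directory caching may delay visibility, which a real backend would have to account for.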