Support for dynamic system topology discovery #624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: pentschev wants to merge 47 commits into rapidsai:main from pentschev:topology-discovery
+1,235
−16
Changes from 6 commits
Commits
2f53487
Add topology discovery tool
pentschev da7304d
Improve network topology discovery to account PCIe for proximity
pentschev ee789d1
Refactor into separate API and CLI tool
pentschev 5a3cd48
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev 4dbb65b
Add NVML dependency
pentschev e080dfc
Fix CMakeLists.txt linting
pentschev 1c5ad2e
Cleanup
pentschev 73e56be
Use `std::optional` instead of additional `bool`
pentschev 9ff4917
Merge branch 'main' into topology-discovery
pentschev c671ff2
Fix linting
pentschev 3f8debd
Apply std::optional changes to cpp file
pentschev e5bb157
Code formatting
pentschev 8ec3d31
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev 48aa0a6
Update CMakeLists
pentschev 1c45505
Do not link to nvml
pentschev 9af9451
Test topology discovery
pentschev 4e9bee7
Improve docs
pentschev 859acbb
Link to NVML again and allow missing DSO
pentschev bc8f9fb
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev 1148fee
Enable debug output
pentschev 120f32c
More debugging output
pentschev c15814f
Disable failure on first error
pentschev b27c087
Fix disable failure on first error
pentschev dd8a4b2
Print numa_node contents
pentschev 2fd6577
Revert debug output
pentschev 0683a46
Remove memory binding validation
pentschev 3088770
Fix clang-tidy failures
pentschev 320ec5a
Do not link, dlopen
pentschev 89bc8c7
Fix more clang-tidy failures
pentschev 872dcbb
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev cb96bed
One more clang-tidy error...
pentschev 4f85ad9
Attempt to remove linking to librapidsmpf
pentschev 7f44047
Fix typo
pentschev f306994
Add CMAKE_DL_LIBS
pentschev 4b7fe9d
Add wheels dependency
pentschev 94d3505
Install nvidia-nvml-dev in devcontainers
pentschev 2950cb8
Attempt to find nvml.h from wheels
pentschev 9b2cb9a
Set SITE_PACKAGES
pentschev 28dc7ee
Make SITE_PACKAGES available for singlecomm build also
pentschev 9eaf2c2
Install NVML packages to venv
pentschev 506769d
Set SITE_PACKAGES in devcontainer
pentschev cf0c51e
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev a7b1723
Attempt to find nvml.h in CUDAToolkit_INCLUDE_DIRS
pentschev b9d8be8
Remove NVML install from devcontainers
pentschev b59afb8
Remove setting NVML_INCLUDE_DIR
pentschev 5310fe5
Merge remote-tracking branch 'upstream/main' into topology-discovery
pentschev 0985a0c
Add pyproject to build-nvml
pentschev
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| /** | ||
| * SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| */ | ||
| | ||
| #pragma once | ||
| | ||
| #include <string> | ||
| #include <vector> | ||
| | ||
| namespace rapidsmpf { | ||
| | ||
| /** | ||
| * @brief GPU information. | ||
| */ | ||
| struct GpuTopologyInfo { | ||
| unsigned int id; ///< GPU device ID. | ||
| std::string name; ///< GPU device name. | ||
| std::string pci_bus_id; ///< PCI bus ID. | ||
| std::string uuid; ///< GPU UUID. | ||
| int numa_node; ///< NUMA node ID (-1 if unknown). | ||
| std::string cpu_affinity_list; ///< CPU affinity list. | ||
| std::vector<int> cpu_cores; ///< List of CPU core IDs. | ||
| std::vector<int> memory_binding; ///< NUMA nodes for memory binding. | ||
| std::vector<std::string> | ||
| network_devices; ///< Network devices (NICs) optimal for this GPU. | ||
| }; | ||
| | ||
| /** | ||
| * @brief Network device information. | ||
| */ | ||
| struct NetworkDeviceInfo { | ||
| std::string name; ///< Device name (e.g., "mlx5_0"). | ||
| int numa_node; ///< NUMA node ID (-1 if unknown). | ||
| std::string pci_bus_id; ///< PCI bus ID. | ||
| }; | ||
| | ||
| /** | ||
| * @brief System topology information. | ||
| */ | ||
| struct SystemTopologyInfo { | ||
| std::string hostname; ///< System hostname. | ||
| unsigned int num_gpus; ///< Total number of GPUs. | ||
| int num_numa_nodes; ///< Total number of NUMA nodes. | ||
| int num_network_devices; ///< Total number of network devices. | ||
| std::vector<GpuTopologyInfo> gpus; ///< GPU topology information. | ||
| std::vector<NetworkDeviceInfo> network_devices; ///< Network device information. | ||
| }; | ||
| | ||
| /** | ||
| * @brief PCIe topology path types. | ||
| */ | ||
| enum class PciePathType { | ||
| PIX = 0, ///< Connection traversing at most a single PCIe bridge (best). | ||
| PXB = 1, ///< Connection traversing multiple PCIe bridges. | ||
| PHB = 2, ///< Connection traversing PCIe Host Bridge. | ||
| NODE = 3, ///< Connection traversing PCIe and interconnect within NUMA node. | ||
| SYS = 4 ///< Connection traversing NUMA interconnect (worst). | ||
| }; | ||
| | ||
| /** | ||
| * @brief Discover system topology including GPUs, NUMA nodes, and network devices. | ||
| * | ||
| * This class provides methods to discover system topology information using NVML | ||
| * and /sys filesystem queries. It dynamically identifies GPU-to-NUMA-to-NIC mappings | ||
| * based on PCIe topology. | ||
| * | ||
| * Example usage: | ||
| * @code | ||
| * rapidsmpf::TopologyDiscovery discovery; | ||
| * if (discovery.discover()) { | ||
| * auto topology = discovery.get_topology(); | ||
| * } | ||
| * @endcode | ||
| */ | ||
| class TopologyDiscovery { | ||
| public: | ||
| /** | ||
| * @brief Construct a TopologyDiscovery instance. | ||
| */ | ||
| TopologyDiscovery() = default; | ||
| | ||
| /** | ||
| * @brief Destroy the TopologyDiscovery instance. | ||
| */ | ||
| ~TopologyDiscovery() = default; | ||
| | ||
| /** | ||
| * @brief Discover system topology. | ||
| * | ||
| * This method performs the actual discovery of GPUs, NUMA nodes, CPU affinity, | ||
| * and network devices. It must be called before `get_topology()`. | ||
| * | ||
| * @return true if discovery was successful, false otherwise. | ||
| */ | ||
| bool discover(); | ||
| | ||
| /** | ||
| * @brief Get the discovered topology information. | ||
| * | ||
| * @return SystemTopologyInfo structure containing all topology data. | ||
| * @note `discover()` must be called first. | ||
| */ | ||
| SystemTopologyInfo const& get_topology() const { | ||
| return topology_; | ||
| } | ||
| | ||
| /** | ||
| * @brief Check if topology has been discovered. | ||
| * | ||
| * @return true if `discover()` has been called successfully. | ||
| */ | ||
| bool is_discovered() const { | ||
| return discovered_; | ||
| } | ||
| | ||
| private: | ||
| SystemTopologyInfo topology_; ///< Discovered topology information. | ||
| | ||
| bool discovered_{false}; ///< Flag indicating if topology has been discovered. | ||
| }; | ||
| | ||
| } // namespace rapidsmpf | ||
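
`GpuTopologyInfo` stores both a `cpu_affinity_list` string and an expanded `cpu_cores` vector. The diff does not show how one is derived from the other, so the following is only a minimal sketch. It assumes the string uses the Linux cpulist notation (comma-separated IDs and inclusive ranges such as `0-3,8-11`). `parse_cpu_list` is a hypothetical helper name and is not part of this PR.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): expand a Linux-style CPU list such
// as "0-3,8-11" into individual core IDs. Assumes well-formed input;
// std::stoi throws std::invalid_argument on a token that is not a number.
std::vector<int> parse_cpu_list(std::string const& cpu_list) {
    std::vector<int> cores;
    std::istringstream stream(cpu_list);
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token.empty()) {
            continue;  // Tolerate empty fields such as a trailing comma.
        }
        auto const dash = token.find('-');
        if (dash == std::string::npos) {
            // Single core ID, e.g. "8".
            cores.push_back(std::stoi(token));
        } else {
            // Inclusive range, e.g. "0-3".
            int const first = std::stoi(token.substr(0, dash));
            int const last = std::stoi(token.substr(dash + 1));
            for (int core = first; core <= last; ++core) {
                cores.push_back(core);
            }
        }
    }
    return cores;
}
```

Under that assumption, `parse_cpu_list(gpu.cpu_affinity_list)` would yield the same kind of list that `cpu_cores` is documented to hold.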
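
The `PciePathType` values are ordered from best (`PIX = 0`) to worst (`SYS = 4`), so a smaller value means a closer connection. One way that ordering could be used to fill `GpuTopologyInfo::network_devices` is sketched here. The enum is mirrored locally so the snippet compiles on its own. `NicCandidate` and `closest_nics` are hypothetical names, and the PR's actual selection logic is not visible in this diff.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Local mirror of the enum from the header above, so the sketch is standalone.
enum class PciePathType { PIX = 0, PXB = 1, PHB = 2, NODE = 3, SYS = 4 };

// Hypothetical pairing of a NIC name with its path type relative to one GPU.
struct NicCandidate {
    std::string name;
    PciePathType path;
};

// Return the names of all NICs that share the best (lowest) path type,
// preserving their input order. Returns an empty vector for empty input.
std::vector<std::string> closest_nics(std::vector<NicCandidate> const& candidates) {
    std::vector<std::string> result;
    if (candidates.empty()) {
        return result;
    }
    // Scoped enums compare by underlying value, so PIX < PXB < ... < SYS.
    auto const best = std::min_element(
                          candidates.begin(),
                          candidates.end(),
                          [](NicCandidate const& a, NicCandidate const& b) {
                              return a.path < b.path;
                          }
    )->path;
    for (auto const& candidate : candidates) {
        if (candidate.path == best) {
            result.push_back(candidate.name);
        }
    }
    return result;
}
```

Keeping every NIC that ties for the best path, rather than only the first, matches the fact that `network_devices` is a vector and not a single name.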
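
Both `GpuTopologyInfo::numa_node` and `NetworkDeviceInfo::numa_node` are documented as `-1` when unknown, and the class comment mentions `/sys` filesystem queries. On Linux, a PCI device's NUMA node is exposed in `/sys/bus/pci/devices/<bus id>/numa_node`. The sketch below reads that file. It assumes the bus ID is already in the form sysfs uses (for example `0000:3b:00.0`). `read_numa_node` is a hypothetical name, and whether the PR reads this exact file is not visible in this diff.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the PR): read the NUMA node of a PCI device
// from sysfs. Returns -1 if the file is missing or unreadable, or if it
// does not start with an integer.
int read_numa_node(std::string const& pci_bus_id) {
    std::ifstream file("/sys/bus/pci/devices/" + pci_bus_id + "/numa_node");
    int node = -1;
    if (!(file >> node)) {
        // A failed extraction may overwrite `node`, so return -1 explicitly.
        return -1;
    }
    return node;
}
```

The kernel itself reports `-1` in this file when it has no NUMA information for the device, so the fallback and the kernel's own value map onto the same "unknown" convention.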