
Conversation

@pentschev
Member

@pentschev pentschev commented Nov 3, 2025

This PR introduces a new topology discovery feature that enables automatic detection of the system topology, including GPU-to-NUMA-to-NIC mappings, by using NVML to query GPU information and by querying /sys directly to build a comprehensive view of the system's PCIe topology.

The core changes include:

  • New TopologyDiscovery class with supporting data structures to compose system/GPU/network topology information
  • CLI tool to inspect and dump system topology to a JSON file

This tool will later be integrated into the rrun launcher #616 for automatic topology discovery and configuration. It will also be able to read JSON files that override dynamic discovery, using the declarative file to set CPU/memory/network affinity instead.
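
For illustration, a minimal sketch of the NVML + /sys lookups described above (this is not the PR's implementation: it assumes nvml.h is available and links NVML directly, whereas the actual code dlopens libnvidia-ml.so.1, and all names here are made up):

#include <nvml.h>

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

// Convert NVML's "00000000:06:00.0" bus id into the sysfs form "0000:06:00.0".
std::string to_sysfs_bus_id(std::string bus_id) {
  std::transform(bus_id.begin(), bus_id.end(), bus_id.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (bus_id.size() > 12) bus_id = bus_id.substr(bus_id.size() - 12);
  return bus_id;
}

// Read a single-line sysfs attribute such as numa_node or local_cpulist.
std::string read_sysfs(const std::string& path) {
  std::ifstream in(path);
  std::string line;
  std::getline(in, line);
  return line;
}

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) return 1;

  unsigned int count = 0;
  nvmlDeviceGetCount_v2(&count);
  for (unsigned int i = 0; i < count; ++i) {
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;

    char name[NVML_DEVICE_NAME_BUFFER_SIZE];
    nvmlDeviceGetName(dev, name, sizeof(name));

    nvmlPciInfo_t pci;
    nvmlDeviceGetPciInfo_v3(dev, &pci);

    // Map the GPU's PCIe address to its NUMA node and local CPUs via /sys.
    const std::string sysfs = "/sys/bus/pci/devices/" + to_sysfs_bus_id(pci.busId);
    std::cout << "GPU " << i << " (" << name << ")"
              << " numa_node=" << read_sysfs(sysfs + "/numa_node")
              << " cpulist=" << read_sysfs(sysfs + "/local_cpulist") << "\n";
  }

  nvmlShutdown();
  return 0;
}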

Sample JSON output for a DGX-1
{
  "system": {
    "hostname": "dgx13",
    "num_gpus": 8,
    "num_numa_nodes": 2,
    "num_network_devices": 4
  },
  "gpus": [
    {
      "id": 0,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:06:00.0",
      "uuid": "GPU-b41abe4a-8553-43be-9d49-f1c4591959fa",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_0"]
    },
    {
      "id": 1,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:07:00.0",
      "uuid": "GPU-d30af75d-3d17-4155-9659-416789520ca5",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_0"]
    },
    {
      "id": 2,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:0A:00.0",
      "uuid": "GPU-d417ef25-b26f-4381-84fb-a6725b5b05ad",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_1"]
    },
    {
      "id": 3,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:0B:00.0",
      "uuid": "GPU-900a54bc-9e88-432e-b70f-42772a5c7f3e",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_1"]
    },
    {
      "id": 4,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:85:00.0",
      "uuid": "GPU-8cbe65cb-8b1f-4d32-af73-0d6267702fea",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_2"]
    },
    {
      "id": 5,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:86:00.0",
      "uuid": "GPU-8c39dca7-bd84-46c7-80f5-86ef16e3e163",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_2"]
    },
    {
      "id": 6,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:89:00.0",
      "uuid": "GPU-ef048cce-30c3-4844-8aec-9863bedbf67c",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_3"]
    },
    {
      "id": 7,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:8A:00.0",
      "uuid": "GPU-ff15fdcb-3bba-4a21-8e2f-f3ab897e21a0",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_3"]
    }
  ],
  "network_devices": [
    {
      "name": "mlx5_3",
      "numa_node": 1,
      "pci_bus_id": "0000:8b:00.0"
    },
    {
      "name": "mlx5_1",
      "numa_node": 0,
      "pci_bus_id": "0000:0c:00.0"
    },
    {
      "name": "mlx5_2",
      "numa_node": 1,
      "pci_bus_id": "0000:84:00.0"
    },
    {
      "name": "mlx5_0",
      "numa_node": 0,
      "pci_bus_id": "0000:05:00.0"
    }
  ]
}
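
As a rough sketch of how a launcher could later consume such a file to pin a process to the cores local to one GPU (purely hypothetical: the PR does not prescribe a JSON library or this API; nlohmann/json and the file name topology.json are assumptions for illustration):

#include <sched.h>

#include <fstream>
#include <iostream>
#include <string>

#include <nlohmann/json.hpp>

int main(int argc, char** argv) {
  const int gpu_id = argc > 1 ? std::stoi(argv[1]) : 0;

  // "topology.json" is a hypothetical dump produced by the CLI tool.
  std::ifstream in("topology.json");
  const auto topology = nlohmann::json::parse(in);

  cpu_set_t set;
  CPU_ZERO(&set);
  for (const auto& gpu : topology["gpus"]) {
    if (gpu["id"].get<int>() != gpu_id) continue;
    // Pin the calling process to the cores listed under this GPU's cpu_affinity.
    for (const auto& core : gpu["cpu_affinity"]["cores"]) CPU_SET(core.get<int>(), &set);
  }

  if (CPU_COUNT(&set) == 0 || sched_setaffinity(0, sizeof(set), &set) != 0) {
    std::cerr << "failed to apply CPU affinity for GPU " << gpu_id << "\n";
    return 1;
  }
  return 0;
}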

@pentschev pentschev self-assigned this Nov 3, 2025
@pentschev pentschev added the feature request New feature or request label Nov 3, 2025
@pentschev pentschev requested review from a team as code owners November 3, 2025 21:49
@pentschev pentschev added the non-breaking Introduces a non-breaking change label Nov 3, 2025
@pentschev pentschev requested a review from a team as a code owner November 3, 2025 21:49
@pentschev pentschev requested a review from bdice November 3, 2025 21:49
Member

@madsbk madsbk left a comment

nice

@pentschev pentschev requested a review from a team as a code owner November 5, 2025 08:43
Contributor

@bdice bdice left a comment

This is really nice work. I have one question.

"postCreateCommand": [
"/bin/bash",
"-c",
"VENV_DIR=\"/home/coder/.local/share/venvs/${DEFAULT_VIRTUAL_ENV:-rapids}\" && ( [ -x \"$VENV_DIR/bin/python\" ] || python -m venv \"$VENV_DIR\" ) && \"$VENV_DIR/bin/python\" -m pip install --upgrade pip && \"$VENV_DIR/bin/python\" -m pip install nvidia-nvml-dev-cu12 && SITE_PACKAGES=\"$(\"$VENV_DIR/bin/python\" -c 'import site; print(site.getsitepackages()[0])')\" && sed -i '/^export SITE_PACKAGES=/d' /home/coder/.bashrc && printf 'export SITE_PACKAGES=\"%s\"\\n' \"$SITE_PACKAGES\" >> /home/coder/.bashrc"
Contributor

Can you explain what this does? It installs nvidia-nvml-dev-cu12. And what's the rest?

Maybe the better fix is for us to install the NVML system libraries in https://github.com/rapidsai/devcontainers/tree/main/features/src/cuda?

cc: @trxcllnt

Member Author

It does:

  1. Sets VENV_DIR using prior knowledge of the venv's default location
  2. Activates the venv and installs the NVML package
  3. Determines the location of Python's site-packages, stores it in the SITE_PACKAGES variable, and exports it via the user's .bashrc (first removing any existing export line, then appending the new one)

The SITE_PACKAGES variable must be set because CMake uses it to find the nvml.h file. Outside of devcontainers the variable is set by the build scripts, but those don't run in devcontainers AFAICT.

I don't have a preference for where this should live. I'm fine with the devcontainer having the package already installed and the variable already set, but I wanted something here to demonstrate that this was indeed working, and it looks like it is.

Contributor

I think our pip devcontainers generally install their CUDA dependencies as system dependencies. @trxcllnt Please correct me if I'm wrong here. We probably want to align with that. There might be a better way to do this.

Contributor

Yeah the pip devcontainers should install the cuda-nvml-dev-13-0 package via apt, which we can do during the image build by adding a feature to the devcontainer.json (example here).

And for CMake packages that are only available via pip, the CMAKE_PREFIX_PATH modification we do here should make find_package() work.

find_package(CUDAToolkit) should allow us to link to the CUDA::nvml target, is that not the case?

Member Author

@pentschev pentschev Nov 7, 2025

Yeah the pip devcontainers should install the cuda-nvml-dev-13-0 package via apt, which we can do during the image build by adding a feature to the devcontainer.json (example here).

Where can I see a list of available features to find the right name for NVML?

And for CMake packages that are only available via pip, the CMAKE_PREFIX_PATH modification we do here should make find_package() work.

find_package(CUDAToolkit) should allow us to link to the CUDA::nvml target, is that not the case?

We don't want to link to NVML, only to find nvml.h; we are dlopening libnvidia-ml.so.1. Plus, the pip package seems to ship only nvml.h, which is all I need. Will CMAKE_PREFIX_PATH/find_package() somehow help in this specific case?
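
(For context, the runtime loading is roughly the sketch below; this is only an illustration of the dlopen approach, not the actual code in this PR, and it needs nothing from NVML at build time beyond nvml.h.)

#include <nvml.h>   // only types and enums are needed at build time

#include <dlfcn.h>

#include <iostream>

int main() {
  // Resolve the driver-provided library at runtime instead of linking CUDA::nvml.
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::cerr << "NVML not available: " << dlerror() << "\n";
    return 1;
  }

  using init_fn = nvmlReturn_t (*)();
  using count_fn = nvmlReturn_t (*)(unsigned int*);
  auto init = reinterpret_cast<init_fn>(dlsym(handle, "nvmlInit_v2"));
  auto get_count = reinterpret_cast<count_fn>(dlsym(handle, "nvmlDeviceGetCount_v2"));
  auto fini = reinterpret_cast<init_fn>(dlsym(handle, "nvmlShutdown"));

  if (init && get_count && fini && init() == NVML_SUCCESS) {
    unsigned int count = 0;
    get_count(&count);
    std::cout << "NVML loaded at runtime, " << count << " GPU(s) found\n";
    fini();
  }

  dlclose(handle);
  return 0;
}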

Contributor

The list of options is here; it looks like nvml is installed by default. I believe nvml.h should be in the CUDAToolkit_INCLUDE_DIRS list populated by find_package(CUDAToolkit).

