
Conversation

@pentschev
Member

@pentschev pentschev commented Nov 3, 2025

This PR introduces a new topology discovery feature that enables automatic detection of the system topology, including GPU-to-NUMA-to-NIC mappings, by using NVML to query GPU information and by querying /sys directly to build a comprehensive view of the system's PCIe topology.

The core changes include:

  • New TopologyDiscovery class with supporting data structures to compose system/GPU/network topology information
  • CLI tool to inspect and dump system topology to a JSON file

This tool will later be integrated into the rrun launcher #616 for automatic topology discovery and configuration. It will also be able to read JSON files that override dynamic discovery, using the declarative file to set CPU/memory/network affinity instead.
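
For illustration, a minimal sketch of the NVML + /sys lookups described above (this is not the PR's implementation: it assumes nvml.h is available and links NVML directly, whereas the actual code dlopens libnvidia-ml.so.1, and all names here are made up):

#include <nvml.h>

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

// Convert NVML's "00000000:06:00.0" bus id into the sysfs form "0000:06:00.0".
std::string to_sysfs_bus_id(std::string bus_id) {
  std::transform(bus_id.begin(), bus_id.end(), bus_id.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (bus_id.size() > 12) bus_id = bus_id.substr(bus_id.size() - 12);
  return bus_id;
}

// Read a single-line sysfs attribute such as numa_node or local_cpulist.
std::string read_sysfs(const std::string& path) {
  std::ifstream in(path);
  std::string line;
  std::getline(in, line);
  return line;
}

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) return 1;

  unsigned int count = 0;
  nvmlDeviceGetCount_v2(&count);
  for (unsigned int i = 0; i < count; ++i) {
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;

    char name[NVML_DEVICE_NAME_BUFFER_SIZE];
    nvmlDeviceGetName(dev, name, sizeof(name));

    nvmlPciInfo_t pci;
    nvmlDeviceGetPciInfo_v3(dev, &pci);

    // Map the GPU's PCIe address to its NUMA node and local CPUs via /sys.
    const std::string sysfs = "/sys/bus/pci/devices/" + to_sysfs_bus_id(pci.busId);
    std::cout << "GPU " << i << " (" << name << ")"
              << " numa_node=" << read_sysfs(sysfs + "/numa_node")
              << " cpulist=" << read_sysfs(sysfs + "/local_cpulist") << "\n";
  }

  nvmlShutdown();
  return 0;
}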

Sample JSON output for a DGX-1
{
  "system": {
    "hostname": "dgx13",
    "num_gpus": 8,
    "num_numa_nodes": 2,
    "num_network_devices": 4
  },
  "gpus": [
    {
      "id": 0,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:06:00.0",
      "uuid": "GPU-b41abe4a-8553-43be-9d49-f1c4591959fa",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_0"]
    },
    {
      "id": 1,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:07:00.0",
      "uuid": "GPU-d30af75d-3d17-4155-9659-416789520ca5",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_0"]
    },
    {
      "id": 2,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:0A:00.0",
      "uuid": "GPU-d417ef25-b26f-4381-84fb-a6725b5b05ad",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_1"]
    },
    {
      "id": 3,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:0B:00.0",
      "uuid": "GPU-900a54bc-9e88-432e-b70f-42772a5c7f3e",
      "numa_node": 0,
      "cpu_affinity": {
        "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
      },
      "memory_binding": [0],
      "network_devices": ["mlx5_1"]
    },
    {
      "id": 4,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:85:00.0",
      "uuid": "GPU-8cbe65cb-8b1f-4d32-af73-0d6267702fea",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_2"]
    },
    {
      "id": 5,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:86:00.0",
      "uuid": "GPU-8c39dca7-bd84-46c7-80f5-86ef16e3e163",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_2"]
    },
    {
      "id": 6,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:89:00.0",
      "uuid": "GPU-ef048cce-30c3-4844-8aec-9863bedbf67c",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_3"]
    },
    {
      "id": 7,
      "name": "Tesla V100-SXM2-32GB",
      "pci_bus_id": "00000000:8A:00.0",
      "uuid": "GPU-ff15fdcb-3bba-4a21-8e2f-f3ab897e21a0",
      "numa_node": 1,
      "cpu_affinity": {
        "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
      },
      "memory_binding": [1],
      "network_devices": ["mlx5_3"]
    }
  ],
  "network_devices": [
    {
      "name": "mlx5_3",
      "numa_node": 1,
      "pci_bus_id": "0000:8b:00.0"
    },
    {
      "name": "mlx5_1",
      "numa_node": 0,
      "pci_bus_id": "0000:0c:00.0"
    },
    {
      "name": "mlx5_2",
      "numa_node": 1,
      "pci_bus_id": "0000:84:00.0"
    },
    {
      "name": "mlx5_0",
      "numa_node": 0,
      "pci_bus_id": "0000:05:00.0"
    }
  ]
}
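
As a rough sketch of how a launcher could later consume such a file to pin a process to the cores local to one GPU (purely hypothetical: the PR does not prescribe a JSON library or this API; nlohmann/json and the file name topology.json are assumptions for illustration):

#include <sched.h>

#include <fstream>
#include <iostream>
#include <string>

#include <nlohmann/json.hpp>

int main(int argc, char** argv) {
  const int gpu_id = argc > 1 ? std::stoi(argv[1]) : 0;

  // "topology.json" is a hypothetical dump produced by the CLI tool.
  std::ifstream in("topology.json");
  const auto topology = nlohmann::json::parse(in);

  cpu_set_t set;
  CPU_ZERO(&set);
  for (const auto& gpu : topology["gpus"]) {
    if (gpu["id"].get<int>() != gpu_id) continue;
    // Pin the calling process to the cores listed under this GPU's cpu_affinity.
    for (const auto& core : gpu["cpu_affinity"]["cores"]) CPU_SET(core.get<int>(), &set);
  }

  if (CPU_COUNT(&set) == 0 || sched_setaffinity(0, sizeof(set), &set) != 0) {
    std::cerr << "failed to apply CPU affinity for GPU " << gpu_id << "\n";
    return 1;
  }
  return 0;
}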

@pentschev pentschev self-assigned this Nov 3, 2025
@pentschev pentschev added the feature request New feature or request label Nov 3, 2025
@pentschev pentschev requested review from a team as code owners November 3, 2025 21:49
@pentschev pentschev added the non-breaking Introduces a non-breaking change label Nov 3, 2025
@pentschev pentschev requested a review from a team as a code owner November 3, 2025 21:49
@pentschev pentschev requested a review from bdice November 3, 2025 21:49
Member

@madsbk madsbk left a comment

nice

@pentschev pentschev requested a review from a team as a code owner November 5, 2025 08:43
Contributor

@bdice bdice left a comment

This is really nice work. I have one question.

"postCreateCommand": [
"/bin/bash",
"-c",
"VENV_DIR=\"/home/coder/.local/share/venvs/${DEFAULT_VIRTUAL_ENV:-rapids}\" && ( [ -x \"$VENV_DIR/bin/python\" ] || python -m venv \"$VENV_DIR\" ) && \"$VENV_DIR/bin/python\" -m pip install --upgrade pip && \"$VENV_DIR/bin/python\" -m pip install nvidia-nvml-dev-cu12 && SITE_PACKAGES=\"$(\"$VENV_DIR/bin/python\" -c 'import site; print(site.getsitepackages()[0])')\" && sed -i '/^export SITE_PACKAGES=/d' /home/coder/.bashrc && printf 'export SITE_PACKAGES=\"%s\"\\n' \"$SITE_PACKAGES\" >> /home/coder/.bashrc"
Contributor

Can you explain what this does? It installs nvidia-nvml-dev-cu12. And what's the rest?

Maybe the better fix is for us to install the NVML system libraries in https://github.com/rapidsai/devcontainers/tree/main/features/src/cuda?

cc: @trxcllnt

Member Author

It does:

  1. Sets VENV_DIR using prior knowledge of the venv's default location
  2. Activates the venv and installs the NVML package
  3. Determines the location of Python's site-packages, stores it in the SITE_PACKAGES variable, and exports it via the user's .bashrc (first removing any existing export line, then appending the new one)

The SITE_PACKAGES variable must be set because CMake uses it to find the nvml.h file. Outside of devcontainers the variable is set by the build scripts, but those don't run in devcontainers AFAICT.

I don't have a preference for where this should live. I'm fine with the devcontainer having the package already installed and the variable already set, but I wanted something here to demonstrate that this was indeed working, and it looks like it is.

Contributor

I think our pip devcontainers generally install their CUDA dependencies as system dependencies. @trxcllnt Please correct me if I'm wrong here. We probably want to align with that. There might be a better way to do this.

Contributor

Yeah the pip devcontainers should install the cuda-nvml-dev-13-0 package via apt, which we can do during the image build by adding a feature to the devcontainer.json (example here).

And for CMake packages that are only available via pip, the CMAKE_PREFIX_PATH modification we do here should make find_package() work.

find_package(CUDAToolkit) should allow us to link to the CUDA::nvml target, is that not the case?

Member Author

@pentschev pentschev Nov 7, 2025

Yeah the pip devcontainers should install the cuda-nvml-dev-13-0 package via apt, which we can do during the image build by adding a feature to the devcontainer.json (example here).

Where can I see a list of available features to find the right name for NVML?

And for CMake packages that are only available via pip, the CMAKE_PREFIX_PATH modification we do here should make find_package() work.

find_package(CUDAToolkit) should allow us to link to the CUDA::nvml target, is that not the case?

We don't want to link to NVML, only to find nvml.h; we are dlopening libnvidia-ml.so.1. Plus, the pip package seems to ship only nvml.h, which is all I need. Will CMAKE_PREFIX_PATH/find_package() somehow help in this specific case?
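
(For context, the runtime loading is roughly the sketch below; this is only an illustration of the dlopen approach, not the actual code in this PR, and it needs nothing from NVML at build time beyond nvml.h.)

#include <nvml.h>   // only types and enums are needed at build time

#include <dlfcn.h>

#include <iostream>

int main() {
  // Resolve the driver-provided library at runtime instead of linking CUDA::nvml.
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::cerr << "NVML not available: " << dlerror() << "\n";
    return 1;
  }

  using init_fn = nvmlReturn_t (*)();
  using count_fn = nvmlReturn_t (*)(unsigned int*);
  auto init = reinterpret_cast<init_fn>(dlsym(handle, "nvmlInit_v2"));
  auto get_count = reinterpret_cast<count_fn>(dlsym(handle, "nvmlDeviceGetCount_v2"));
  auto fini = reinterpret_cast<init_fn>(dlsym(handle, "nvmlShutdown"));

  if (init && get_count && fini && init() == NVML_SUCCESS) {
    unsigned int count = 0;
    get_count(&count);
    std::cout << "NVML loaded at runtime, " << count << " GPU(s) found\n";
    fini();
  }

  dlclose(handle);
  return 0;
}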

Contributor

The list of options is here; it looks like nvml is installed by default. I believe nvml.h should be in the CUDAToolkit_INCLUDE_DIRS list populated by find_package(CUDAToolkit).

