Support for dynamic system topology discovery #624
base: main
Conversation
madsbk left a comment:
nice
Co-authored-by: Mads R. B. Kristensen <[email protected]>
bdice left a comment:
This is really nice work. I have one question.
```json
"postCreateCommand": [
    "/bin/bash",
    "-c",
    "VENV_DIR=\"/home/coder/.local/share/venvs/${DEFAULT_VIRTUAL_ENV:-rapids}\" && ( [ -x \"$VENV_DIR/bin/python\" ] || python -m venv \"$VENV_DIR\" ) && \"$VENV_DIR/bin/python\" -m pip install --upgrade pip && \"$VENV_DIR/bin/python\" -m pip install nvidia-nvml-dev-cu12 && SITE_PACKAGES=\"$(\"$VENV_DIR/bin/python\" -c 'import site; print(site.getsitepackages()[0])')\" && sed -i '/^export SITE_PACKAGES=/d' /home/coder/.bashrc && printf 'export SITE_PACKAGES=\"%s\"\\n' \"$SITE_PACKAGES\" >> /home/coder/.bashrc"
]
```
Can you explain what this does? I can see it installs `nvidia-nvml-dev-cu12`, but what does the rest do?
Maybe the better fix is for us to install the NVML system libraries in https://github.com/rapidsai/devcontainers/tree/main/features/src/cuda?
cc: @trxcllnt
It does:
- Set `VENV_DIR` with the preconceived knowledge of the default location
- Activate the venv and install the NVML package
- Determine the location of Python's site-packages and store it in the `SITE_PACKAGES` variable, making it exportable via the user's `.bashrc` (first deleting any existing export line, then appending a fresh one)
The `SITE_PACKAGES` variable is required to be set because CMake uses it to find the `nvml.h` file. Outside of devcontainers, setting the variable happens in the build scripts, but those don't run in devcontainers AFAICT.
I don't have a strong preference for where this should live; I'm fine with the devcontainer shipping the package preinstalled and the variable preset, but I wanted something to demonstrate that this was indeed working here, and it looks like it is.
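The two pieces the one-liner computes can be sketched in a few lines of Python. This is a minimal illustration, not the PR's code: the header path below is an assumption about the `nvidia-nvml-dev-cu12` wheel layout, not something confirmed in this thread.

```python
# Sketch of what the postCreateCommand derives: the site-packages directory
# (the value exported as SITE_PACKAGES) and the location where the pip-shipped
# nvml.h might live. The wheel-internal path is a hypothetical example.
import os
import site


def site_packages_dir() -> str:
    # Same expression the one-liner runs inside the venv's python.
    return site.getsitepackages()[0]


def candidate_nvml_header(sp: str) -> str:
    # Hypothetical wheel layout; verify against the actual package contents.
    return os.path.join(sp, "nvidia", "cuda_nvml_dev", "include", "nvml.h")


sp = site_packages_dir()
print(sp)
print(candidate_nvml_header(sp))
```

CMake would then only need `SITE_PACKAGES` on its include search path to locate the header.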
I think our pip devcontainers generally install their CUDA dependencies as system dependencies. @trxcllnt Please correct me if I'm wrong here. We probably want to align with that. There might be a better way to do this.
Yeah, the pip devcontainers should install the `cuda-nvml-dev-13-0` package via apt, which we can do during the image build by adding a feature to the `devcontainer.json` (example here).
And for CMake packages that are only available via pip, the `CMAKE_PREFIX_PATH` modification we do here should make `find_package()` work.
`find_package(CUDAToolkit)` should allow us to link to the `CUDA::nvml` target, is that not the case?
> Yeah, the pip devcontainers should install the `cuda-nvml-dev-13-0` package via apt, which we can do during the image build by adding a feature to the `devcontainer.json` (example here).

Where can I see a list of available features to find the right name for NVML?

> And for CMake packages that are only available via pip, the `CMAKE_PREFIX_PATH` modification we do here should make `find_package()` work.
>
> `find_package(CUDAToolkit)` should allow us to link to the `CUDA::nvml` target, is that not the case?
We don't want to link to NVML, just find `nvml.h`; we are dlopening `libnvidia-ml.so.1`. Plus, the pip package seems to only ship `nvml.h`, which is all I need. Will `CMAKE_PREFIX_PATH`/`find_package()` somehow help in this specific case?
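The dlopen-at-runtime approach mentioned above can be illustrated with a small sketch. This is not the PR's implementation (which is C++); it only shows the pattern of loading `libnvidia-ml.so.1` from the driver at runtime and degrading gracefully when it is absent.

```python
# Sketch: load the NVML driver library at runtime instead of linking it
# at build time. Only the header (nvml.h) is needed to compile; the actual
# library comes from the installed NVIDIA driver, if any.
import ctypes


def try_load_nvml():
    """Return a handle to libnvidia-ml.so.1, or None when the driver
    library is absent (e.g. a machine without an NVIDIA driver)."""
    try:
        return ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return None


nvml = try_load_nvml()
print("NVML available:", nvml is not None)
```

Because the library is resolved at runtime, the build machine never needs `libnvidia-ml.so.1` installed, only the header.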
The list of options is here; it looks like `nvml` is installed by default. I believe `nvml.h` should be in the `CUDAToolkit_INCLUDE_DIRS` list populated by `find_package(CUDAToolkit)`.
This PR introduces a new topology discovery feature that enables automatic detection of system topology, including GPU-to-NUMA-to-NIC mappings. It uses NVML to query GPU information and directly queries `/sys` to build a comprehensive view of the system's PCIe topology.

The core changes include:
- `TopologyDiscovery` class with supporting data structures to compose system/GPU/network topology information

This tool will later be integrated in the `rrun` launcher #616 for automatic topology discovery and configuration. It will also allow reading JSON files to override dynamic discovery and instead use the declarative file to set CPU/memory/network affinity.

Sample JSON output for a DGX-1:
```json
{
  "system": { "hostname": "dgx13", "num_gpus": 8, "num_numa_nodes": 2, "num_network_devices": 4 },
  "gpus": [
    { "id": 0, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:06:00.0",
      "uuid": "GPU-b41abe4a-8553-43be-9d49-f1c4591959fa", "numa_node": 0,
      "cpu_affinity": { "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] },
      "memory_binding": [0], "network_devices": ["mlx5_0"] },
    { "id": 1, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:07:00.0",
      "uuid": "GPU-d30af75d-3d17-4155-9659-416789520ca5", "numa_node": 0,
      "cpu_affinity": { "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] },
      "memory_binding": [0], "network_devices": ["mlx5_0"] },
    { "id": 2, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:0A:00.0",
      "uuid": "GPU-d417ef25-b26f-4381-84fb-a6725b5b05ad", "numa_node": 0,
      "cpu_affinity": { "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] },
      "memory_binding": [0], "network_devices": ["mlx5_1"] },
    { "id": 3, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:0B:00.0",
      "uuid": "GPU-900a54bc-9e88-432e-b70f-42772a5c7f3e", "numa_node": 0,
      "cpu_affinity": { "cpulist": "0-19,40-59",
        "cores": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] },
      "memory_binding": [0], "network_devices": ["mlx5_1"] },
    { "id": 4, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:85:00.0",
      "uuid": "GPU-8cbe65cb-8b1f-4d32-af73-0d6267702fea", "numa_node": 1,
      "cpu_affinity": { "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] },
      "memory_binding": [1], "network_devices": ["mlx5_2"] },
    { "id": 5, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:86:00.0",
      "uuid": "GPU-8c39dca7-bd84-46c7-80f5-86ef16e3e163", "numa_node": 1,
      "cpu_affinity": { "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] },
      "memory_binding": [1], "network_devices": ["mlx5_2"] },
    { "id": 6, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:89:00.0",
      "uuid": "GPU-ef048cce-30c3-4844-8aec-9863bedbf67c", "numa_node": 1,
      "cpu_affinity": { "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] },
      "memory_binding": [1], "network_devices": ["mlx5_3"] },
    { "id": 7, "name": "Tesla V100-SXM2-32GB", "pci_bus_id": "00000000:8A:00.0",
      "uuid": "GPU-ff15fdcb-3bba-4a21-8e2f-f3ab897e21a0", "numa_node": 1,
      "cpu_affinity": { "cpulist": "20-39,60-79",
        "cores": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] },
      "memory_binding": [1], "network_devices": ["mlx5_3"] }
  ],
  "network_devices": [
    { "name": "mlx5_3", "numa_node": 1, "pci_bus_id": "0000:8b:00.0" },
    { "name": "mlx5_1", "numa_node": 0, "pci_bus_id": "0000:0c:00.0" },
    { "name": "mlx5_2", "numa_node": 1, "pci_bus_id": "0000:84:00.0" },
    { "name": "mlx5_0", "numa_node": 0, "pci_bus_id": "0000:05:00.0" }
  ]
}
```
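The `cpulist` strings in the sample output (e.g. `"0-19,40-59"`) are the standard Linux cpulist format and expand to the corresponding `cores` arrays. A hypothetical parser, not taken from the PR, to make the relationship concrete:

```python
# Expand a Linux-style cpulist ("0-19,40-59") into the explicit list of
# core IDs, matching the "cores" arrays shown in the sample JSON.
def parse_cpulist(cpulist: str) -> list[int]:
    cores: list[int] = []
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            cores.append(int(part))
    return cores


# The cpulist for GPUs 0-3 expands to cores 0-19 plus 40-59.
print(parse_cpulist("0-19,40-59"))
```

This is the same format exposed by the kernel in files such as `/sys/devices/system/node/node0/cpulist`.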