design-proposal: Support associating NUMA nodes with GPU devices #38
base: main
Conversation
Signed-off-by: jinshi <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Migrated from kubevirt/community#394 @lyarwood
Hey @jimkingstone! Thanks for this proposal. Please read the instructions on how to propose VEPs here: https://github.com/kubevirt/enhancements/blob/main/README.md. For example, please create a tracking issue. In addition, please use the GitHub template format for this PR instead of copy-pasting the kubevirt/community template. Thanks.
/hold
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
OK, the issue has been created, and this PR template has been updated. @iholder101 |
/cc I am interested in reviewing this proposal, as it could have overlapping solutions with DRA GPU devices. cc @rthallisey
Thank you @jimkingstone for looking into this issue. First, I'm not 100% sure we need a formal design proposal to address this, mainly because there is no change to the KubeVirt API. That is, indeed, an issue we have been trying to resolve for a while.

My primary concern here is that so far KubeVirt has delegated the creation of controllers to libvirt. This can get very complex, very quickly, especially when there are multiple different devices involved. How confident are we that we want to open this door?

/cc @alicefr @andreabolognani what do you think?
Thanks to @vladikr for the suggestion. As for whether we need a formal proposal, I've also submitted the related code changes (kubevirt/kubevirt#14406), which can be reviewed alongside the proposal.

Regarding whether we should directly use a topology file: as mentioned in the proposal, "Compared to directly providing a topology file, this approach more accurately reflects the physical GPU NUMA relationship and is more friendly to NUMA topology awareness because users do not need to manually obtain and depend on a topology file."

This change does introduce multiple different devices. I've done some initial testing on an NVIDIA A100 machine, and the results (as shown in the screenshot in kubevirt/kubevirt#13926) are as expected. However, more testing on different machine models may be needed for further validation.
```bash
cat /sys/bus/pci/devices/0000\:60\:00.0/numa_node
0
```
Apologies if this is a stupid question, but is the device guaranteed to be present in the same pNUMA node as the pCPUs we've been scheduled to? I'm assuming so, otherwise none of this would work, but I have no idea if that's done by the device plugin, the CPU/memory/topology managers, etc.
Thanks to @lyarwood for the question.
Yes, it relies on the kubelet to enable and configure CPU, memory, device, and topology manager policies to ensure that the allocated CPU, memory, and devices are on the same NUMA node.
For example, with the following kubelet configuration:

```yaml
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
featureGates:
  CPUManagerPolicyAlphaOptions: true
memoryManagerPolicy: Static
topologyManagerPolicy: restricted
topologyManagerScope: pod
reservedMemory:
- numaNode: 0
  limits:
    memory: "1424Mi"
```

If the requested CPU, memory, and device resources cannot be aligned on the same NUMA node, pod creation will fail.
```yaml
  memory: # memory NUMA settings
    hugepages: # (KubeVirt already supported)
      pageSize: "2Mi"
```
You still need to request a GPU in the VM, right?
Yes, sorry for the oversight. The GPU settings in the VM have been added.
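For readers following the thread, a minimal sketch of what the GPU request in the VM(I) spec could look like, using the existing `devices.gpus` API (the device name below is illustrative and not taken from the proposal):

```yaml
spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/GA100_A100_PCIE_40GB  # illustrative resource name exposed by the GPU device plugin
        name: gpu1
```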
```yaml
configuration:
  developerConfiguration:
    featureGates:
    - GPUDeviceNUMA # the feature gate to control the enabling and disabling of this functionality
```
The feature gate is a temporary way of enabling and disabling this functionality.
If we want cluster admins to be able to enable/disable or otherwise configure the feature, then an additional set of configurables should be introduced somewhere.
Could we expose a VM CR annotation or other fields to VM creators to control enabling/disabling the feature?
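As a purely hypothetical illustration of the annotation idea (the annotation key below is made up for discussion and is not an existing KubeVirt API):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
  annotations:
    # hypothetical per-VM switch for the GPU NUMA mapping behaviour; not an existing KubeVirt API
    kubevirt.io/gpu-numa-mapping: "enabled"
```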
| <cell id="1" cpus="40,41,42,43,44,45,46,47" memory="32212254720" unit="b"></cell> | ||
| </numa> | ||
| </cpu> | ||
| ... |
Might be worth calling out the above in the context of guestMappingPassthrough and the limitations with that policy at the moment.
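For context, a rough sketch of how guest NUMA passthrough is requested in the VMI spec today, assuming the existing `guestMappingPassthrough` policy together with dedicated CPUs and hugepages (the exact field placement shown here is illustrative):

```yaml
spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}  # mirror the pinned host CPUs' NUMA topology into the guest
    memory:
      hugepages:
        pageSize: 2Mi
```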
Thx, it's been added.
Signed-off-by: jinshi <[email protected]>
@jimkingstone @vladikr I've given this proposal and the corresponding PR a cursory look. IIUC the main idea is that, when the feature you're introducing is enabled, KubeVirt will go the extra mile and figure out the host NUMA node for each GPU, look up the corresponding guest NUMA node based on that information, and ensure that a PCIe Expander Bus assigned to the correct guest NUMA node is used to attach the GPU. Seems fairly reasonable overall, and certainly desirable when it comes to performance.

There are a few things that I'm concerned/wondering about though.

Have you made sure that these PCIe controllers, for which you're allocating indexes explicitly, are always added to the domain XML before any other PCIe controller? Failing to do so might trip up libvirt's PCI address allocation logic.

AFAIK PCIe Expander Buses, just like most other PCIe controllers, can't be hot(un)plugged. Are assigned devices generally expected to be hot(un)pluggable in KubeVirt? If so, this might impose a new limitation.

Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.
Signed-off-by: jinshi <[email protected]>
Thanks to @andreabolognani for the questions. Apologies for the delay in responding.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close.

/lifecycle stale
/remove-lifecycle stale
The main idea makes a lot of sense to me, thank you!
I don't want to be pedantic about spelling/grammar/punctuation, but the document is currently hard to read because of such mistakes; please proofread (see my comments for some examples).

> # Design
>
> We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to

Suggested change:

> We propose emulating a `pcie-expander-bus (pxb-pcie)` in the VM to

> We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
> configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)

Suggested change:

> configure NUMA and expose a `pcie-root-port` for PCIe devices (GPU devices)

> to plug into, according to the [QEMU device placement strategy](
> https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):

Suggested change:

> https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74).

> PCIe endpoint devices are not themselves associated with NUMA nodes, rather the
> bus they are connected to has affinity. The root complex(pcie.0) is not

Suggested change:

> bus they are connected to has a NUMA affinity. The root complex(pcie.0) is not

> It is not possible to plug PCIe endpoint devices directly into the
> `pcie-expander-bus (pxb-pcie)`, so it is necessary to add `pcie-root-port` into

Suggested change:

> `pcie-expander-bus (pxb-pcie)`, so it is necessary to add a `pcie-root-port` into

> ## Update/Rollback Compatibility
>
> The feature gate `GPUDeviceNUMA` can control the enabling and disabling of

Please ignore the feature gate in this section. Feature-gated code is disabled by default, so it obviously doesn't affect updates. Please assume that the feature gate is enabled as part of the update. Can you think of any potential issue then?
I’d like to follow up on the latest progress of this proposal, @jimkingstone @vladikr @alicefr
@andreabolognani I completely agree with this point. In fact, NUMA topology awareness inside KubeVirt guests is not only critical for GPUs, but also for other high-performance devices such as InfiniBand, RoCE adapters, and even NVLink. For workloads like NCCL, having the host device's NUMA association transparently passed through to the guest VM is essential for performance. The lack of this feature in KubeVirt is currently a blocker for our work, and we're very willing to actively collaborate to help move this feature forward as quickly as possible. Thanks.
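For context, such devices are typically assigned today through the generic `hostDevices` API, which the proposed NUMA mapping would presumably need to cover as well. A minimal sketch (the device name below is illustrative and not taken from the proposal):

```yaml
spec:
  domain:
    devices:
      hostDevices:
      - deviceName: mellanox.com/MT27800_FAMILY_CONNECTX_5  # illustrative resource name for an RDMA-capable NIC
        name: rdma-nic1
```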
VEP Metadata
Tracking issue:
#39
SIG label:
/sig compute
What this PR does
Adds a design proposal for associating NUMA nodes with GPU devices, in order to optimize computing performance in KubeVirt GPU virtual machines.
Special notes for your reviewer
It's a follow-up to kubevirt/community#394.
Related issue: kubevirt/kubevirt#13926