
Conversation


@jimkingstone commented Apr 7, 2025

VEP Metadata

Tracking issue:
#39

SIG label:
/sig compute

What this PR does

Adds a design proposal to support associating NUMA nodes with GPU devices, in order to optimize compute performance of KubeVirt GPU virtual machines.

Special notes for your reviewer

It's a follow-up to kubevirt/community#394.
Related issue: kubevirt/kubevirt#13926

@kubevirt-bot added the release-note and dco-signoff: yes labels on Apr 7, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jimkingstone changed the title from "docs(proposal): add KubeVirt GPU NUMA proposal" to "design-proposal: Support associating NUMA nodes with GPU devices" on Apr 7, 2025
@jimkingstone (Author)

Migrated from kubevirt/community#394 @lyarwood

@iholder101 (Contributor)

Hey @jimkingstone! Thanks for this proposal.

Please read the instructions on how to propose VEPs here: https://github.com/kubevirt/enhancements/blob/main/README.md. For example, please create a tracking issue.

In addition, please use the github template format for this PR instead of copy-pasting the kubevirt/community template.

Thanks.

@iholder101 (Contributor)

/hold

@kubevirt-bot added the do-not-merge/hold label on Apr 7, 2025
@kubevirt-bot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot added the do-not-merge/release-note-label-needed label and removed the release-note label on Apr 7, 2025
@jimkingstone (Author) commented Apr 7, 2025

> Hey @jimkingstone! Thanks for this proposal.
>
> Please read the instructions on how to propose VEPs here: https://github.com/kubevirt/enhancements/blob/main/README.md. For example, please create a tracking issue.
>
> In addition, please use the github template format for this PR instead of copy-pasting the kubevirt/community template.
>
> Thanks.

OK, the tracking issue has been created, and the PR template has been updated. @iholder101

@alaypatel07 (Contributor)

/cc
/assign

I am interested in reviewing this proposal, as it could have overlapping solutions with DRA GPU devices.

cc @rthallisey

@vladikr (Member) commented Apr 7, 2025

Thank you @jimkingstone for looking into this issue.

First, I'm not 100% sure we need a formal design proposal to address this, mainly because there is no change to the KubeVirt API.

That is indeed an issue we have been trying to resolve for a while.
In the past, I tried to expose the NUMA-related information (the collection of vCPUs associated with a specific NUMA node, in the case without guest NUMA mapping) using device metadata: kubevirt/kubevirt#6946

My primary concern here is that, so far, KubeVirt has delegated the creation of controllers to libvirt. This can get very complex very quickly, especially when multiple different devices are involved. How confident are we that we want to open this door?

/cc @alicefr @andreabolognani what do you think?

@jimkingstone (Author) commented Apr 8, 2025

> Thank you @jimkingstone for looking into this issue.
>
> First, I'm not 100% sure we need a formal design proposal to address this, mainly because there is no change to the KubeVirt API.
>
> That is indeed an issue we have been trying to resolve for a while. In the past, I tried to expose the NUMA-related information (the collection of vCPUs associated with a specific NUMA node, in the case without guest NUMA mapping) using device metadata: kubevirt/kubevirt#6946
>
> My primary concern here is that, so far, KubeVirt has delegated the creation of controllers to libvirt. This can get very complex very quickly, especially when multiple different devices are involved. How confident are we that we want to open this door?
>
> /cc @alicefr @andreabolognani what do you think?

Thanks to @vladikr for the suggestion.

As for whether we need a formal proposal, I've also submitted the related code changes (kubevirt/kubevirt#14406), which can be reviewed alongside the proposal.

Regarding whether we should directly use a topology file: as mentioned in the proposal, "Compared to directly providing a topology file, this approach more accurately reflects the physical GPU NUMA relationship and is more friendly to NUMA topology awareness because users do not need to manually obtain and depend on a topology file."

This change does involve multiple different devices. I've done some initial testing on an NVIDIA A100 machine, and the results (see the screenshot in kubevirt/kubevirt#13926) are as expected. However, more testing on different machine models may be needed for further validation.


```bash
cat /sys/bus/pci/devices/0000\:60\:00.0/numa_node
0
```

@lyarwood (Member)

Apologies if this is a stupid question, but is the device guaranteed to be present in the same pNUMA node as the pCPUs we've been scheduled to? I'm assuming so, otherwise none of this would work, but I have no idea if that's done by the device plugin, CPU, memory, or topology manager, etc.

@jimkingstone (Author)

Thanks to @lyarwood for the question.

Yes, it relies on the kubelet to enable and configure CPU, memory, device, and topology manager policies to ensure that the allocated CPU, memory, and devices are on the same NUMA node.

For example, with the following Kubelet configuration:

```yaml
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
featureGates:
  CPUManagerPolicyAlphaOptions: true
memoryManagerPolicy: Static
topologyManagerPolicy: restricted
topologyManagerScope: pod
reservedMemory:
- numaNode: 0
  limits:
    memory: "1424Mi"
```

If the requested CPU, memory, and device resources cannot be aligned on the same NUMA node, the pod creation will fail.

```yaml
memory: # memory NUMA settings
  hugepages: # (KubeVirt already supported)
    pageSize: "2Mi"
```

Member

You still need to request a GPU in the VM, right?

@jimkingstone (Author)

Yes, sorry for the oversight. The GPU settings in the VM have been added.
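
For illustration, a minimal sketch of what such a VMI spec might look like with both the GPU request and guest NUMA passthrough enabled (the name, device plugin resource, and sizes below are examples, not taken from the proposal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-numa-example            # example name
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true   # dedicated CPUs are required for guest NUMA passthrough
      numa:
        guestMappingPassthrough: {} # existing KubeVirt guest NUMA mapping
    memory:
      hugepages:
        pageSize: "2Mi"             # hugepages are required by guestMappingPassthrough
    devices:
      gpus:
      - name: gpu1
        deviceName: nvidia.com/GA100_A100_PCIE_40GB  # example device plugin resource name
    resources:
      requests:
        memory: 8Gi
```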

```yaml
configuration:
  developerConfiguration:
    featureGates:
    - GPUDeviceNUMA # the feature gate to control the enabling and disabling of this functionality
```
Member

The feature gate is a temporary way of enabling and disabling this functionality.

If we want cluster admins to be able to enable/disable or otherwise configure the feature then an additional set of configurables should be introduced somewhere.

@jimkingstone (Author) commented Apr 10, 2025

Could we expose a VM CR annotation, or other fields, to let VM creators enable/disable the feature?
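
As a purely hypothetical sketch of that idea (the annotation key below does not exist in KubeVirt and is only meant to illustrate the question):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
  annotations:
    kubevirt.io/gpu-device-numa: "enabled"  # hypothetical per-VM toggle, not an existing KubeVirt API
```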

```xml
    <cell id="1" cpus="40,41,42,43,44,45,46,47" memory="32212254720" unit="b"></cell>
  </numa>
</cpu>
...
```
Member

Might be worth calling out the above in the context of guestMappingPassthrough and the limitations with that policy at the moment.

@jimkingstone (Author)

Thx, it's been added.

@andreabolognani

@jimkingstone @vladikr I've given this proposal and the corresponding PR a cursory look.

IIUC the main idea is that, when the existing cpu.numa.guestMappingPassthrough feature is enabled, the guest NUMA topology is going to be configured so that it matches the host NUMA topology.

When the feature you're introducing is enabled, KubeVirt will go the extra mile and figure out the host NUMA node for each GPU, look up the corresponding guest NUMA node based on that information, and ensure that a PCIe Expander Bus assigned to the correct guest NUMA node is used to attach the GPU.

Seems fairly reasonable overall, and certainly desirable when it comes to performance. There are a few things that I'm concerned/wondering about though.

Have you made sure that these PCIe controllers, for which you're allocating indexes explicitly, are always added to the domain XML before any other PCIe controller? Failing to do so might trip up libvirt's PCI address allocation logic.

AFAIK PCIe Expander Buses, just like most other PCIe controllers, can't be hot(un)plugged. Are assigned devices generally expected to be hot(un)pluggable in KubeVirt? If so, this might impose a new limitation.

Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

@jimkingstone (Author)

> @jimkingstone @vladikr I've given this proposal and the corresponding PR a cursory look.
>
> IIUC the main idea is that, when the existing cpu.numa.guestMappingPassthrough feature is enabled, the guest NUMA topology is going to be configured so that it matches the host NUMA topology.
>
> When the feature you're introducing is enabled, KubeVirt will go the extra mile and figure out the host NUMA node for each GPU, look up the corresponding guest NUMA node based on that information, and ensure that a PCIe Expander Bus assigned to the correct guest NUMA node is used to attach the GPU.
>
> Seems fairly reasonable overall, and certainly desirable when it comes to performance. There are a few things that I'm concerned/wondering about though.
>
> Have you made sure that these PCIe controllers, for which you're allocating indexes explicitly, are always added to the domain XML before any other PCIe controller? Failing to do so might trip up libvirt's PCI address allocation logic.
>
> AFAIK PCIe Expander Buses, just like most other PCIe controllers, can't be hot(un)plugged. Are assigned devices generally expected to be hot(un)pluggable in KubeVirt? If so, this might impose a new limitation.
>
> Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

Thanks to @andreabolognani for the questions. Apologies for the delay in responding.

1. Yes, these PCIe controllers will be explicitly added, with their index and bus numbers allocated, and we will ensure the ordering of the indexes (see the sketch below). The remaining PCIe controllers will be allocated automatically by libvirt.

2. Hot-plugging is not supported at the moment. I have added this limitation to the "Non Goals" section of the proposal.

3. Currently, we only apply this to GPU devices, i.e., we use vm.spec.template.spec.domain.devices.gpus to determine whether a device is associated with a NUMA node. I have added this limitation to both the "Non Goals" and "API Examples" sections of the proposal.
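
For illustration, a minimal libvirt domain XML sketch of the intended layout (a `pcie-expander-bus` pinned to a guest NUMA node, a `pcie-root-port` plugged into it, and the assigned GPU behind that port). All index, bus, and address values below are illustrative, not the actual values KubeVirt would generate:

```xml
<devices>
  <!-- Expander bus associated with guest NUMA node 0; busNr and index are illustrative -->
  <controller type='pci' index='10' model='pcie-expander-bus'>
    <target busNr='100'>
      <node>0</node>
    </target>
  </controller>
  <!-- Root port plugged into the expander bus (bus 0x0a refers to controller index 10) -->
  <controller type='pci' index='11' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
  </controller>
  <!-- Assigned GPU attached behind the root port (bus 0x0b refers to controller index 11) -->
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0000' bus='0x60' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </hostdev>
</devices>
```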

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot added the lifecycle/stale label on Jul 21, 2025
@vladikr (Member) commented Jul 28, 2025

/remove-lifecycle stale

@kubevirt-bot removed the lifecycle/stale label on Jul 28, 2025
@jean-edouard (Contributor) left a comment

The main idea makes a lot of sense to me, thank you!
I don't want to be pedantic about spelling/grammar/punctuation, but the document is currently hard to read because of such mistakes; please proofread (see my comments for some examples).


# Design

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
Contributor

Suggested change:
-We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
+We propose emulating a `pcie-expander-bus (pxb-pcie)` in the VM to

# Design

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
Contributor

Suggested change:
-configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
+configure NUMA and expose a `pcie-root-port` for PCIe devices (GPU devices)

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
to plug into, according to the [QEMU device placement strategy](
https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):
Contributor

Suggested change:
-https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):
+https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74).

https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):

PCIe endpoint devices are not themselves associated with NUMA nodes, rather the
bus they are connected to has affinity. The root complex(pcie.0) is not
Contributor

Suggested change:
-bus they are connected to has affinity. The root complex(pcie.0) is not
+bus they are connected to has a NUMA affinity. The root complex(pcie.0) is not

be added and associated with a NUMA node.

It is not possible to plug PCIe endpoint devices directly into the
`pcie-expander-bus (pxb-pcie)`, so it is necessary to add `pcie-root-port` into
Contributor

Suggested change:
-`pcie-expander-bus (pxb-pcie)`, so it is necessary to add `pcie-root-port` into
+`pcie-expander-bus (pxb-pcie)`, so it is necessary to add a `pcie-root-port` into


## Update/Rollback Compatibility

The feature gate `GPUDeviceNUMA` can control the enabling and disabling of
Contributor

Please ignore the feature gate in this section. Feature-gated code is disabled by default, so it obviously doesn't affect updates. Please assume that the feature gate is enabled as part of the update. Can you think of any potential issue then?

@fanzhangio

I’d like to follow up on the latest progress of this proposal, @jimkingstone @vladikr @alicefr

> Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

@andreabolognani I completely agree with this point. In fact, NUMA topology awareness inside KubeVirt guests is not only critical for GPUs, but also for other high-performance devices such as InfiniBand, RoCE adapters, and even NVLink. For workloads like NCCL, having the host device’s NUMA association transparently passed through to the guest VM is essential for performance.

The lack of this feature in KubeVirt is currently a blocker for our work, and we’re very willing to actively collaborate to help move this feature forward as quickly as possible. Thanks.
