
Conversation


@jimkingstone commented Apr 7, 2025

VEP Metadata

Tracking issue:
#39

SIG label:
/sig compute

What this PR does

Adds a design proposal to support associating NUMA nodes with GPU devices, in order to optimize compute performance of KubeVirt GPU virtual machines.

Special notes for your reviewer

It's a follow-up to kubevirt/community#394.
Related issue: kubevirt/kubevirt#13926

@kubevirt-bot added the release-note and dco-signoff: yes labels on Apr 7, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jimkingstone changed the title from "docs(proposal): add KubeVirt GPU NUMA proposal" to "design-proposal: Support associating NUMA nodes with GPU devices" on Apr 7, 2025
@jimkingstone (Author)

Migrated from kubevirt/community#394 @lyarwood

@iholder101 (Contributor)

Hey @jimkingstone! Thanks for this proposal.

Please read the instructions on how to propose VEPs here: https://github.com/kubevirt/enhancements/blob/main/README.md. For example, please create a tracking issue.

In addition, please use the github template format for this PR instead of copy-pasting the kubevirt/community template.

Thanks.

@iholder101 (Contributor)

/hold

@kubevirt-bot added the do-not-merge/hold label on Apr 7, 2025
@kubevirt-bot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot added the do-not-merge/release-note-label-needed label and removed the release-note label on Apr 7, 2025
@jimkingstone (Author) commented Apr 7, 2025

> Hey @jimkingstone! Thanks for this proposal.
>
> Please read the instructions on how to propose VEPs here: https://github.com/kubevirt/enhancements/blob/main/README.md. For example, please create a tracking issue.
>
> In addition, please use the github template format for this PR instead of copy-pasting the kubevirt/community template.
>
> Thanks.

OK, the tracking issue has been created, and the PR template has been updated. @iholder101

@alaypatel07 (Contributor)

/cc
/assign

I am interested in reviewing this proposal, as it could have overlapping solutions with DRA GPU devices.

cc @rthallisey

@vladikr (Member) commented Apr 7, 2025

Thank you @jimkingstone for looking into this issue.

First, I'm not 100% sure we need a formal design proposal to address this, mainly because there is no change to the KubeVirt API.

That is indeed an issue we have been trying to resolve for a while.
In the past, I tried to expose the NUMA-related information (the collection of vCPUs associated with a specific NUMA node, in the case without guest NUMA mapping) using device metadata: kubevirt/kubevirt#6946

My primary concern here is that, so far, KubeVirt has delegated the creation of controllers to libvirt. This can get very complex very quickly, especially when multiple different devices are involved. How confident are we that we want to open this door?

/cc @alicefr @andreabolognani what do you think?

@jimkingstone (Author) commented Apr 8, 2025

> Thank you @jimkingstone for looking into this issue.
>
> First, I'm not 100% sure we need a formal design proposal to address this, mainly because there is no change to the KubeVirt API.
>
> That is indeed an issue we have been trying to resolve for a while. In the past, I tried to expose the NUMA-related information (the collection of vCPUs associated with a specific NUMA node, in the case without guest NUMA mapping) using device metadata: kubevirt/kubevirt#6946
>
> My primary concern here is that, so far, KubeVirt has delegated the creation of controllers to libvirt. This can get very complex very quickly, especially when multiple different devices are involved. How confident are we that we want to open this door?
>
> /cc @alicefr @andreabolognani what do you think?

Thanks to @vladikr for the suggestion.

As for whether we need a formal proposal, I've also submitted the related code changes (kubevirt/kubevirt#14406), which can be reviewed alongside the proposal.

Regarding whether we should directly use a topology file: as mentioned in the proposal, "Compared to directly providing a topology file, this approach more accurately reflects the physical GPU NUMA relationship and is more friendly to NUMA topology awareness because users do not need to manually obtain and depend on a topology file."

This change does involve multiple different devices. I've done some initial testing on an NVIDIA A100 machine, and the results (see the screenshot in kubevirt/kubevirt#13926) are as expected. However, more testing on different machine models may be needed for further validation.


```bash
cat /sys/bus/pci/devices/0000\:60\:00.0/numa_node
0
```

@lyarwood (Member)

Apologies if this is a stupid question, but is the device guaranteed to be present in the same pNUMA node as the pCPUs we've been scheduled to? I'm assuming so, otherwise none of this would work, but I have no idea if that's done by the device plugin, CPU, memory, or topology manager, etc.

@jimkingstone (Author)

Thanks to @lyarwood for the question.

Yes, it relies on the kubelet to enable and configure CPU, memory, device, and topology manager policies to ensure that the allocated CPU, memory, and devices are on the same NUMA node.

For example, with the following Kubelet configuration:

```yaml
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
featureGates:
  CPUManagerPolicyAlphaOptions: true
memoryManagerPolicy: Static
topologyManagerPolicy: restricted
topologyManagerScope: pod
reservedMemory:
- numaNode: 0
  limits:
    memory: "1424Mi"
```

If the requested CPU, memory, and device resources cannot be aligned on the same NUMA node, the pod creation will fail.

```yaml
memory: # memory NUMA settings
  hugepages: # (KubeVirt already supported)
    pageSize: "2Mi"
```

Member

You still need to request a GPU in the VM, right?

@jimkingstone (Author)

Yes, sorry for the oversight. The GPU settings in the VM have been added.
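
For illustration, a minimal sketch of what such a VMI spec might look like with both the GPU request and guest NUMA passthrough enabled (the name, device plugin resource, and sizes below are examples, not taken from the proposal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-numa-example            # example name
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true   # dedicated CPUs are required for guest NUMA passthrough
      numa:
        guestMappingPassthrough: {} # existing KubeVirt guest NUMA mapping
    memory:
      hugepages:
        pageSize: "2Mi"             # hugepages are required by guestMappingPassthrough
    devices:
      gpus:
      - name: gpu1
        deviceName: nvidia.com/GA100_A100_PCIE_40GB  # example device plugin resource name
    resources:
      requests:
        memory: 8Gi
```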

```yaml
configuration:
  developerConfiguration:
    featureGates:
    - GPUDeviceNUMA # the feature gate to control the enabling and disabling of this functionality
```
Member

The feature gate is a temporary way of enabling and disabling this functionality.

If we want cluster admins to be able to enable/disable or otherwise configure the feature then an additional set of configurables should be introduced somewhere.

@jimkingstone (Author) commented Apr 10, 2025

Could we expose a VM CR annotation, or other fields, to let VM creators enable/disable the feature?
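
As a purely hypothetical sketch of that idea (the annotation key below does not exist in KubeVirt and is only meant to illustrate the question):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
  annotations:
    kubevirt.io/gpu-device-numa: "enabled"  # hypothetical per-VM toggle, not an existing KubeVirt API
```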

```xml
    <cell id="1" cpus="40,41,42,43,44,45,46,47" memory="32212254720" unit="b"></cell>
  </numa>
</cpu>
...
```
Member

Might be worth calling out the above in the context of guestMappingPassthrough and the limitations with that policy at the moment.

@jimkingstone (Author)

Thx, it's been added.

@andreabolognani

@jimkingstone @vladikr I've given this proposal and the corresponding PR a cursory look.

IIUC the main idea is that, when the existing cpu.numa.guestMappingPassthrough feature is enabled, the guest NUMA topology is going to be configured so that it matches the host NUMA topology.

When the feature you're introducing is enabled, KubeVirt will go the extra mile and figure out the host NUMA node for each GPU, look up the corresponding guest NUMA node based on that information, and ensure that a PCIe Expander Bus assigned to the correct guest NUMA node is used to attach the GPU.

Seems fairly reasonable overall, and certainly desirable when it comes to performance. There are a few things that I'm concerned/wondering about though.

Have you made sure that these PCIe controllers, for which you're allocating indexes explicitly, are always added to the domain XML before any other PCIe controller? Failing to do so might trip up libvirt's PCI address allocation logic.

AFAIK PCIe Expander Buses, just like most other PCIe controllers, can't be hot(un)plugged. Are assigned devices generally expected to be hot(un)pluggable in KubeVirt? If so, this might impose a new limitation.

Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

@jimkingstone (Author)

> @jimkingstone @vladikr I've given this proposal and the corresponding PR a cursory look.
>
> IIUC the main idea is that, when the existing cpu.numa.guestMappingPassthrough feature is enabled, the guest NUMA topology is going to be configured so that it matches the host NUMA topology.
>
> When the feature you're introducing is enabled, KubeVirt will go the extra mile and figure out the host NUMA node for each GPU, look up the corresponding guest NUMA node based on that information, and ensure that a PCIe Expander Bus assigned to the correct guest NUMA node is used to attach the GPU.
>
> Seems fairly reasonable overall, and certainly desirable when it comes to performance. There are a few things that I'm concerned/wondering about though.
>
> Have you made sure that these PCIe controllers, for which you're allocating indexes explicitly, are always added to the domain XML before any other PCIe controller? Failing to do so might trip up libvirt's PCI address allocation logic.
>
> AFAIK PCIe Expander Buses, just like most other PCIe controllers, can't be hot(un)plugged. Are assigned devices generally expected to be hot(un)pluggable in KubeVirt? If so, this might impose a new limitation.
>
> Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

Thanks to @andreabolognani for the questions. Apologies for the delay in responding.

1. Yes, these PCIe controllers will be explicitly added, with their index and bus numbers allocated, and we will ensure the ordering of the indexes (see the sketch below). The remaining PCIe controllers will be allocated automatically by libvirt.

2. Hot-plugging is not supported at the moment. I have added this limitation to the "Non Goals" section of the proposal.

3. Currently, we only apply this to GPU devices, i.e., we use vm.spec.template.spec.domain.devices.gpus to determine whether a device is associated with a NUMA node. I have added this limitation to both the "Non Goals" and "API Examples" sections of the proposal.
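
For illustration, a minimal libvirt domain XML sketch of the intended layout (a `pcie-expander-bus` pinned to a guest NUMA node, a `pcie-root-port` plugged into it, and the assigned GPU behind that port). All index, bus, and address values below are illustrative, not the actual values KubeVirt would generate:

```xml
<devices>
  <!-- Expander bus associated with guest NUMA node 0; busNr and index are illustrative -->
  <controller type='pci' index='10' model='pcie-expander-bus'>
    <target busNr='100'>
      <node>0</node>
    </target>
  </controller>
  <!-- Root port plugged into the expander bus (bus 0x0a refers to controller index 10) -->
  <controller type='pci' index='11' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
  </controller>
  <!-- Assigned GPU attached behind the root port (bus 0x0b refers to controller index 11) -->
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0000' bus='0x60' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </hostdev>
</devices>
```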

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot added the lifecycle/stale label on Jul 21, 2025
@vladikr (Member) commented Jul 28, 2025

/remove-lifecycle stale

@kubevirt-bot removed the lifecycle/stale label on Jul 28, 2025
@jean-edouard (Contributor) left a comment

The main idea makes a lot of sense to me, thank you!
I don't want to be pedantic about spelling/grammar/punctuation, but the document is currently hard to read because of such mistakes; please proofread (see my comments for some examples).


# Design

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
Contributor

Suggested change:
-We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
+We propose emulating a `pcie-expander-bus (pxb-pcie)` in the VM to

# Design

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
Contributor

Suggested change:
-configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
+configure NUMA and expose a `pcie-root-port` for PCIe devices (GPU devices)

We propose using emulating a `pcie-expander-bus (pxb-pcie)` in the VM to
configure NUMA node and expose a `pcie-root-port` for PCIe device (GPU devices)
to plug into, according to the [QEMU device placement strategy](
https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):
Contributor

Suggested change:
-https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):
+https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74).

https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L37-L74):

PCIe endpoint devices are not themselves associated with NUMA nodes, rather the
bus they are connected to has affinity. The root complex(pcie.0) is not
Contributor

Suggested change:
-bus they are connected to has affinity. The root complex(pcie.0) is not
+bus they are connected to has a NUMA affinity. The root complex(pcie.0) is not

be added and associated with a NUMA node.

It is not possible to plug PCIe endpoint devices directly into the
`pcie-expander-bus (pxb-pcie)`, so it is necessary to add `pcie-root-port` into
Contributor

Suggested change:
-`pcie-expander-bus (pxb-pcie)`, so it is necessary to add `pcie-root-port` into
+`pcie-expander-bus (pxb-pcie)`, so it is necessary to add a `pcie-root-port` into


## Update/Rollback Compatibility

The feature gate `GPUDeviceNUMA` can control the enabling and disabling of
Contributor

Please ignore the feature gate in this section. Feature-gated code is disabled by default, so it obviously doesn't affect updates. Please assume that the feature gate is enabled as part of the update. Can you think of any potential issue then?

@fanzhangio

I’d like to follow up on the latest progress of this proposal, @jimkingstone @vladikr @alicefr

> Why is this restricted to GPUs? It seems to me that it should apply to all assigned devices, including e.g. network adapters.

@andreabolognani I completely agree with this point. In fact, NUMA topology awareness inside KubeVirt guests is not only critical for GPUs, but also for other high-performance devices such as InfiniBand, RoCE adapters, and even NVLink. For workloads like NCCL, having the host device’s NUMA association transparently passed through to the guest VM is essential for performance.

The lack of this feature in KubeVirt is currently a blocker for our work, and we’re very willing to actively collaborate to help move this feature forward as quickly as possible. Thanks.
