vishesh92
Member

@vishesh92 vishesh92 commented Jul 4, 2025

Description

This PR allows GPU devices to be attached to an Instance on KVM hosts via PCI, mdev, or VF.
CWiki Design doc: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts
Doc PR: apache/cloudstack-documentation#526
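
For readers unfamiliar with how these attachment modes look at the hypervisor level, here is a minimal sketch of the libvirt `<hostdev>` elements commonly used for PCI passthrough and for mdev devices. It is an illustration only, not the XML this PR generates: the helper names `pci_hostdev` and `mdev_hostdev`, the PCI address, and the UUID are all hypothetical.

```python
import xml.etree.ElementTree as ET


def pci_hostdev(domain: int, bus: int, slot: int, function: int) -> str:
    """Build a libvirt <hostdev> element for a PCI device (hypothetical helper)."""
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain:04x}",
        bus=f"0x{bus:02x}",
        slot=f"0x{slot:02x}",
        function=f"0x{function:x}",
    )
    return ET.tostring(hostdev, encoding="unicode")


def mdev_hostdev(uuid: str) -> str:
    """Build a libvirt <hostdev> element for a mediated device (hypothetical helper)."""
    hostdev = ET.Element("hostdev", mode="subsystem", type="mdev", model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", uuid=uuid)
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    # Example values only; a real host reports its own addresses and UUIDs.
    print(pci_hostdev(0x0000, 0x01, 0x00, 0x0))
    print(mdev_hostdev("4b20d080-1b54-4048-85b3-a6a62d165c01"))
```

A VF shows up on the host as its own PCI function, so the PCI form would typically apply to it as well; whether the PR handles VFs exactly this way is an assumption.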

Generated summary

This pull request introduces several changes across multiple files, focusing on enhancing GPU-related functionality, adding new properties for VM hooks, and updating resource management capabilities. The most significant updates include the addition of GPU properties and event types, the introduction of new VM shell script properties, and modifications to resource limits and types to support GPU devices.

GPU-related enhancements:

VM hook properties:

Resource management updates:

Miscellaneous updates:

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Screenshots (if appropriate):

How Has This Been Tested?

This was tested locally on my laptop with passthrough of a consumer graphics card (NVIDIA RTX 3050). Because no suitable hardware was available, I could not test vGPU profiles or mdev.

Framework level testing was done using the simulator plugin.

How did you try to break this feature and the system with this change?

@vishesh92 vishesh92 changed the title from "Integrate gpu" to "Feature: Add GPU support for KVM" Jul 4, 2025
@vishesh92 vishesh92 changed the title from "Feature: Add GPU support for KVM" to "Feature: Add support for GPU with KVM hosts" Jul 4, 2025
@vishesh92 vishesh92 requested a review from Copilot July 4, 2025 09:12
@vishesh92 vishesh92 requested a review from Copilot July 4, 2025 09:16

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enables GPU support for KVM hosts by updating both backend utilities and the compute offering UI to attach and configure GPU cards and vGPU profiles.

  • Removed a stray comment in the script utility header.
  • Updated AddComputeOffering.vue to let users select GPU cards, vGPU profiles, GPU count, and display options.

Reviewed Changes

Copilot reviewed 153 out of 153 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| utils/src/main/java/com/cloud/utils/script/Script.java | Removed an extraneous comment line at the top of the file. |
| ui/src/views/offering/AddComputeOffering.vue | Renamed form fields for GPU card and profile selection, added count/display controls and data-fetch methods. |

Comments suppressed due to low confidence (1)

ui/src/views/offering/AddComputeOffering.vue:262

  • The form field name 'vgpuprofile' may conflict with the API parameter 'vgpuprofileid'. Consider renaming it to 'vgpuprofileid' to maintain consistency and avoid confusion when mapping form values to request parameters.
        <a-form-item name="vgpuprofile" ref="vgpuprofile" :label="$t('label.vgpu.profile')" v-if="!isSystem && form.gpucardid && vgpuProfiles.length > 0">
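
One way to keep a form field name distinct from the API parameter name without confusion is an explicit rename table applied when the request is built. The sketch below illustrates that idea in Python rather than in the UI's own Vue code. Only `vgpuprofile`, `vgpuprofileid`, and `gpucardid` come from this thread; `FORM_TO_API` and `to_request_params` are hypothetical names, and this is not how the PR itself maps the values.

```python
# Hypothetical rename table: form field name -> API parameter name.
FORM_TO_API = {"vgpuprofile": "vgpuprofileid"}


def to_request_params(form: dict) -> dict:
    """Translate form values into request parameters, skipping empty fields."""
    params = {}
    for field, value in form.items():
        if value is None or value == "":
            continue  # unset form fields are not sent
        params[FORM_TO_API.get(field, field)] = value
    return params


if __name__ == "__main__":
    form = {"gpucardid": "card-1", "vgpuprofile": "profile-1", "name": ""}
    print(to_request_params(form))
```

Renaming the form field to match the API parameter, as the comment suggests, would make the table unnecessary; the table only makes an existing mismatch explicit.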

codecov bot commented Jul 4, 2025

Codecov Report

❌ Patch coverage is 45.42233% with 1848 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.17%. Comparing base (4aed972) to head (69289ab).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...java/org/apache/cloudstack/gpu/GpuServiceImpl.java | 78.20% | 107 Missing and 73 partials ⚠️ |
| .../com/cloud/agent/manager/MockAgentManagerImpl.java | 0.00% | 113 Missing ⚠️ |
| ...main/java/com/cloud/simulator/MockGpuDeviceVO.java | 0.00% | 89 Missing ⚠️ |
| ...n/java/com/cloud/resource/ResourceManagerImpl.java | 0.00% | 81 Missing ⚠️ |
| ...che/cloudstack/api/response/GpuDeviceResponse.java | 31.31% | 68 Missing ⚠️ |
| ...oudstack/api/response/ServiceOfferingResponse.java | 0.00% | 60 Missing ⚠️ |
| ...chema/src/main/java/com/cloud/gpu/GpuDeviceVO.java | 34.78% | 59 Missing and 1 partial ⚠️ |
| ...ema/src/main/java/com/cloud/gpu/VgpuProfileVO.java | 36.90% | 52 Missing and 1 partial ⚠️ |
| ...c/main/java/com/cloud/agent/api/VgpuTypesInfo.java | 54.46% | 51 Missing ⚠️ |
| .../main/java/com/cloud/gpu/dao/GpuDeviceDaoImpl.java | 65.75% | 47 Missing and 3 partials ⚠️ |

... and 72 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11143      +/-   ##
============================================
+ Coverage     17.00%   17.17%   +0.16%     
- Complexity    14727    14983     +256     
============================================
  Files          5832     5869      +37     
  Lines        517620   521513    +3893     
  Branches      62996    63474     +478     
============================================
+ Hits          88008    89553    +1545     
- Misses       419673   421894    +2221     
- Partials       9939    10066     +127     

| Flag | Coverage Δ |
| --- | --- |
| uitests | 3.75% <ø> (-0.07%) ⬇️ |
| unittests | 18.15% <45.42%> (+0.19%) ⬆️ |

@apache apache deleted a comment from blueorangutan Jul 4, 2025
@sureshanaparti sureshanaparti added this to the 4.21.0 milestone Jul 4, 2025
@sureshanaparti sureshanaparti moved this to In Progress in Apache CloudStack 4.21.0 Jul 4, 2025
@apache apache deleted a comment from blueorangutan Jul 4, 2025
@apache apache deleted a comment from blueorangutan Jul 4, 2025
@GutoVeronezi
Contributor

Great initiative @vishesh92; do you have any spec or documentation about it?

@vishesh92
Member Author

> Great initiative @vishesh92; do you have any spec or documentation about it?

I am still working on it.

@vishesh92 vishesh92 force-pushed the integrate-gpu branch 2 times, most recently from f6945ef to 9bc8518 on July 4, 2025 12:19
@vishesh92
Member Author

@blueorangutan package

@apache apache deleted a comment from blueorangutan Jul 4, 2025
@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@GutoVeronezi
Contributor

> > Great initiative @vishesh92; do you have any spec or documentation about it?
>
> I am still working on it.

Just to clarify: do you already have the spec/documentation and are working on the PR, or do you not have it yet and plan to create it?

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14034

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 14410

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 14414

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14415

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13937)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 59647 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11143-t13937-kvm-ol8.zip
Smoke tests completed. 145 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@vishesh92
Member Author

@blueorangutan package

@apache apache deleted a comment from blueorangutan Jul 29, 2025
@apache apache deleted a comment from blueorangutan Jul 29, 2025
@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14435

@sureshanaparti sureshanaparti merged commit f6ad184 into apache:main Jul 29, 2025
23 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.21.0 Jul 29, 2025
@sureshanaparti sureshanaparti deleted the integrate-gpu branch July 29, 2025 08:19
@blueorangutan

[SF] Trillian test result (tid-13924)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 313818 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11143-t13924-vmware-70u3.zip
Smoke tests completed. 131 look OK, 10 have errors, 1 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_vpn_usage | Error | 1.14 | test_usage.py |
| test_01_deploy_vm_on_specific_host | Error | 1.34 | test_vm_deployment_planner.py |
| test_02_deploy_vm_on_specific_cluster | Error | 3604.42 | test_vm_deployment_planner.py |
| test_03_deploy_vm_on_specific_pod | Error | 1.38 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 4.42 | test_vm_deployment_planner.py |
| test_05_deploy_vm_on_cluster_override_pod | Error | 4.40 | test_vm_deployment_planner.py |
| test_02_offline_migrate_VM_with_two_data_disks | Error | 3676.68 | test_vm_life_cycle.py |
| test_03_live_migrate_VM_with_two_data_disks | Error | 3672.78 | test_vm_life_cycle.py |
| test_04_migrate_detached_volume | Error | 1865.53 | test_vm_life_cycle.py |
| test_09_expunge_vm | Failure | 427.76 | test_vm_life_cycle.py |
| test_11_destroy_vm_and_volumes | Error | 1868.86 | test_vm_life_cycle.py |
| test_12_start_vm_multiple_volumes_allocated | Error | 1954.34 | test_vm_life_cycle.py |
| test_01_unmanage_vm_cycle | Error | 1951.47 | test_vm_lifecycle_unmanage_import.py |
| ContextSuite context=TestUnmanageVM>:teardown | Error | 1951.59 | test_vm_lifecycle_unmanage_import.py |
| test_01_migrate_vm_strict_tags_success | Error | 0.94 | test_vm_strict_host_tags.py |
| test_02_migrate_vm_strict_tags_failure | Error | 0.91 | test_vm_strict_host_tags.py |
| test_01_restore_vm_strict_tags_success | Error | 1.05 | test_vm_strict_host_tags.py |
| test_02_restore_vm_strict_tags_failure | Error | 1.16 | test_vm_strict_host_tags.py |
| test_01_scale_vm_strict_tags_success | Error | 0.99 | test_vm_strict_host_tags.py |
| test_02_scale_vm_strict_tags_failure | Error | 1.84 | test_vm_strict_host_tags.py |
| test_01_deploy_vm_on_specific_host_without_strict_tags | Error | 1.38 | test_vm_strict_host_tags.py |
| test_02_deploy_vm_on_any_host_without_strict_tags | Error | 0.92 | test_vm_strict_host_tags.py |
| test_03_deploy_vm_on_specific_host_with_strict_tags_success | Error | 0.89 | test_vm_strict_host_tags.py |
| test_04_deploy_vm_on_any_host_with_strict_tags_success | Error | 0.98 | test_vm_strict_host_tags.py |
| test_05_deploy_vm_on_specific_host_with_strict_tags_failure | Failure | 1.20 | test_vm_strict_host_tags.py |
| test_06_deploy_vm_on_any_host_with_strict_tags_failure | Failure | 1.22 | test_vm_strict_host_tags.py |
| ContextSuite context=TestCreateVolume>:teardown | Error | 2234.30 | test_volumes.py |
| test_13_migrate_volume_and_change_offering | Error | 2412.37 | test_volumes.py |
| ContextSuite context=TestVolumes>:teardown | Error | 1822.16 | test_volumes.py |
| test_01_verify_ipv6_vpc | Error | 4254.72 | test_vpc_ipv6.py |
| test_01_verify_ipv6_vpc | Error | 4255.05 | test_vpc_ipv6.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 3696.55 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 3698.46 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 3618.14 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 3702.09 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 1813.18 | test_vpc_redundant.py |
| test_01_redundant_vpc_site2site_vpn | Failure | 3607.97 | test_vpc_vpn.py |
| test_01_redundant_vpc_site2site_vpn | Error | 3608.14 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Failure | 3692.65 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Error | 3692.85 | test_vpc_vpn.py |
| test_01_vpc_remote_access_vpn | Failure | 3610.14 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Failure | 3609.25 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Error | 3609.41 | test_vpc_vpn.py |
| test_01_cancel_host_maintenace_with_no_migration_jobs | Error | 0.02 | test_host_maintenance.py |
| test_02_cancel_host_maintenace_with_migration_jobs | Error | 0.01 | test_host_maintenance.py |
| test_03_cancel_host_maintenace_with_migration_jobs_failure | Error | 0.01 | test_host_maintenance.py |
| ContextSuite context=TestHostMaintenanceAgents>:setup | Error | 0.02 | test_host_maintenance.py |
| all_test_hostha_kvm | Skipped | --- | test_hostha_kvm.py |

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 6, 2025
This PR allows attaching of GPU devices via PCI, mdev or VF to an Instance for KVM.

It allows the operator to discover the GPU devices on the KVM host and create a Compute Offering with GPU support based on the available GPU devices on the host. Once the operator has created the Compute offering, it can be used by users to launch Instances with GPU devices.
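
The flow described above (discover devices per host, define an offering against them, then place Instances) can be pictured with a small in-memory model. This is a sketch under stated assumptions, not CloudStack's allocator: `GpuDevice`, `free_gpus_by_host`, and `hosts_that_fit` are hypothetical names, and the host and card labels are example data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuDevice:
    """A GPU device discovered on a host (hypothetical model)."""
    host: str
    card: str
    allocated: bool = False


def free_gpus_by_host(devices, card):
    """Count unallocated devices of the given card on each host."""
    counts = Counter()
    for device in devices:
        if device.card == card and not device.allocated:
            counts[device.host] += 1
    return counts


def hosts_that_fit(devices, card, gpu_count):
    """Return hosts with at least gpu_count free devices of the card, sorted by name."""
    free = free_gpus_by_host(devices, card)
    return sorted(host for host, n in free.items() if n >= gpu_count)


if __name__ == "__main__":
    # Example inventory only.
    inventory = [
        GpuDevice("kvm1", "RTX 3050"),
        GpuDevice("kvm1", "RTX 3050", allocated=True),
        GpuDevice("kvm2", "RTX 3050"),
        GpuDevice("kvm2", "RTX 3050"),
    ]
    print(hosts_that_fit(inventory, "RTX 3050", 2))
```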
Successfully merging this pull request may close these issues.

Add NVidia Datacenter (V100/A100/H100) GPU assignment to CloudStack guests