54 changes: 47 additions & 7 deletions roles/vgpu/README.md
@@ -1,24 +1,57 @@
# stackhpc.linux.vgpu

This role can configure vGPUs or Multi Instance GPU (MIG) on NVIDIA cards.

## Prerequisites

- [Download Nvidia GRID driver](https://docs.nvidia.com/grid/latest/grid-software-quick-start-guide/index.html#redeeming-pak-and-downloading-grid-software) (This requires a login).
- The location of this file can be customised with the `vgpu_driver_url` variable:
* e.g. to use an artifact uploaded to an HTTP server:
`vgpu_driver_url: http://seed/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.85.07-525.85.05-528.24.zip`
* e.g. to use a file on the control host:
`vgpu_driver_url: "{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.85.07-525.85.05-528.24.zip"`
### Multi Instance GPU (MIG)

When creating MIG devices with no vGPU instances layered on top, there are no special requirements.

### VGPUs:

- Enable IOMMU
- Make sure the related options are enabled in the BIOS
- Intel CPUs require the `intel_iommu=on` kernel command line argument
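
The kernel command line requirement can be checked from Ansible before running the role. The task below is a minimal sketch and not part of this role: it assumes facts have been gathered, that the host has an Intel CPU, and that the argument is passed as `intel_iommu=on`.

```yaml
# Sketch only: not shipped with the role. Requires gathered facts.
- name: Check that the IOMMU is enabled on the kernel command line
  ansible.builtin.assert:
    # ansible_facts.cmdline holds the parsed contents of /proc/cmdline
    that: "ansible_facts.cmdline.intel_iommu | default('') == 'on'"
    fail_msg: "intel_iommu=on was not found on the kernel command line"
```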

## Enabling SR-IOV on Dell hardware

#### Enabling SR-IOV on Dell hardware

```
/opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled
/opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1
```

## Drivers

The role will attempt to install a driver from ``vgpu_driver_url``. Currently this only works with
data centre drivers such as the
[NVIDIA GRID drivers](https://docs.nvidia.com/grid/latest/grid-software-quick-start-guide/index.html#redeeming-pak-and-downloading-grid-software)
or the [AI Enterprise drivers](https://www.nvidia.com/en-gb/data-center/products/ai-enterprise/),
both of which can be obtained from the NVIDIA licensing portal. The data centre drivers are not required
if you only want to use MIG without vGPUs.

The location of this file can be customised with the `vgpu_driver_url` variable, e.g. to use an artifact uploaded to an HTTP server:

```
vgpu_driver_url: http://seed/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.85.07-525.85.05-528.24.zip
```

Or, to use a file on the control host:

```
vgpu_driver_url: "{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.85.07-525.85.05-528.24.zip"
```

Currently, the role only supports zip archives. Future work may add support for other packaging formats such as .deb, .rpm, and .run.

To install the driver by some other means, set the ``vgpu_nvidia_driver_install_enabled`` configuration option to `false`, e.g.:
```
---
vgpu_nvidia_driver_install_enabled: false
```

This will cause the role to assume that the driver is already installed.

## Running the role

@@ -72,6 +105,13 @@ vgpu_definitions:
index: 0
- mdev_type: nvidia-697
index: 1
# Configuring a MIG without creating VGPUs. You may also want to set
# vgpu_nvidia_driver_install_enabled: false if you have installed the nvidia
# driver by some other means.
- pci_address: "0000:17:00.0"
mig_devices:
"1g.10gb": 1
"2g.20gb": 3
```


7 changes: 5 additions & 2 deletions roles/vgpu/defaults/main.yml
@@ -1,4 +1,7 @@
---
# Whether to install the nvidia driver. Set to false if you want to install the driver
# via some other means.
vgpu_nvidia_driver_install_enabled: true
vgpu_driver_url: ""
vgpu_driver_force_install: false
vgpu_driver_dkms: false
@@ -13,5 +16,5 @@ vgpu_mig_definitions: []
vgpu_definitions: "{{ vgpu_mig_definitions }}"

# Packages providing nvidia-mig-manager
vgpu_nvidia_mig_manager_rpm_url: https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm
vgpu_nvidia_mig_manager_deb_url: https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb
vgpu_nvidia_mig_manager_rpm_url: https://github.com/NVIDIA/mig-parted/releases/download/v0.12.1/nvidia-mig-manager-0.12.1-1.x86_64.rpm
vgpu_nvidia_mig_manager_deb_url: https://github.com/NVIDIA/mig-parted/releases/download/v0.12.1/nvidia-mig-manager_0.12.1-1_amd64.deb
11 changes: 10 additions & 1 deletion roles/vgpu/tasks/configure-gpu.yml
@@ -10,11 +10,20 @@
become: true
when: vgpu_definition.mig_devices is defined

- name: Collect mig status
ansible.builtin.command: nvidia-smi -i {{ vgpu_definition.pci_address }} --query-gpu='mig.mode.current' --format csv,noheader
changed_when: false
register: mig_status_result
when:
- vgpu_definition.mig_devices is defined

- name: Enable mig mode
ansible.builtin.command: nvidia-smi -i {{ vgpu_definition.pci_address }} -mig 1
changed_when: false
become: true
when: vgpu_definition.mig_devices is defined
when:
- vgpu_definition.mig_devices is defined
- mig_status_result.stdout != "Enabled"

- name: Template nvidia-sriov service
ansible.builtin.template:
1 change: 1 addition & 0 deletions roles/vgpu/tasks/install.yml
@@ -22,6 +22,7 @@
filename: "{{ vgpu_driver_url_components.path | basename }}"
install_script: "{{ find_result.files.0.path }}"
ansible_become: true
when: vgpu_nvidia_driver_install_enabled | bool
block:
- name: Ensure target directory exists
ansible.builtin.file:
1 change: 1 addition & 0 deletions roles/vgpu/tasks/validate.yml
@@ -3,3 +3,4 @@
ansible.builtin.assert:
that: vgpu_driver_url | length > 0
fail_msg: "Please ensure you set the variable: vgpu_driver_url"
when: vgpu_nvidia_driver_install_enabled | bool
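
Taken together, the `install.yml` and `validate.yml` changes mean a MIG-only host should no longer need a driver archive. The variables below are a hedged sketch of such a configuration, reusing the PCI address and MIG profiles from the README example; the file location is hypothetical, and the sketch assumes the NVIDIA driver has already been installed by some other means.

```yaml
# Hypothetical group_vars file for MIG-only hosts (name and location are assumptions).
# Skip the driver install block and the vgpu_driver_url assertion.
vgpu_nvidia_driver_install_enabled: false
# vgpu_driver_url can stay at its default ("") because validation is skipped.
vgpu_definitions:
  - pci_address: "0000:17:00.0"
    mig_devices:
      "1g.10gb": 1
      "2g.20gb": 3
```

With these values, only the MIG configuration tasks are expected to act on the host.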