Commit 0a9d81e

dgarcia18 authored and prisorue committed
F OpenNebula/engineering#408 AI Factories: New Configuration and Deployment page (#362)
Signed-off-by: Daniel <[email protected]>
Co-authored-by: Priscill Orue <[email protected]>
(cherry picked from commit 9c8d895)
1 parent d283ae5 commit 0a9d81e

1 file changed: 146 additions, 0 deletions

---
title: "Configuration and Deployment"
date: "2025-10-21"
weight: 3
---

Here you will find the details to deploy and configure an AI-ready OpenNebula cloud using the [OneDeploy](https://github.com/OpenNebula/one-deploy) tool. This guide focuses on a local environment, preparing it for demanding AI workloads by leveraging PCI passthrough for GPUs like the NVIDIA H100 and L40S.

Machine Learning (ML) training and inference are resource-intensive tasks that often require the full power of a dedicated GPU. PCI passthrough allows a Virtual Machine to have exclusive access to a physical GPU, delivering bare-metal performance for the most demanding AI workloads.

## Prerequisites

Before you begin, ensure your environment meets the following requirements.

### Hardware Requirements

The virtualization hosts (hypervisors) must support IOMMU virtualization:

* **Intel CPUs**: Must support **VT-d**.
* **AMD CPUs**: Must support **AMD-Vi**.

You must enable this feature in your server's BIOS/UEFI. Refer to your hardware vendor's documentation for instructions.
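
Whether the firmware has actually exposed the IOMMU to the kernel can usually be confirmed from the boot log. The exact messages vary by platform, so treat the following as an illustrative check rather than a definitive test:

```shell
# Search the kernel log for IOMMU-related messages
# (Intel VT-d typically reports "DMAR", AMD-Vi reports "AMD-Vi")
sudo dmesg | grep -iE 'dmar|amd-vi|iommu'
```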

### Kernel Configuration (Manual Step)

The OneDeploy tool automates many aspects of the configuration, but you must manually enable IOMMU support in the kernel on each hypervisor node. This is a critical step that OneDeploy does not perform automatically.

Before modifying the kernel parameters, check whether IOMMU is already active by inspecting the `/sys/kernel/iommu_groups/` directory on the hypervisor.

```shell
ls /sys/kernel/iommu_groups/
```

If this directory exists and contains subdirectories (e.g., `0/`, `1/`, etc.), IOMMU is likely active. An empty or non-existent directory indicates that IOMMU is not correctly enabled in your kernel or BIOS/UEFI.
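
To see which PCI devices the kernel has placed in each group, you can also list the device links under this directory; for clean passthrough it helps if the GPU does not share its group with unrelated devices that the host still needs. A quick sketch:

```shell
# Each IOMMU group exposes its member devices as symlinks named by PCI address
find /sys/kernel/iommu_groups/ -type l
```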

If IOMMU is not active, add the appropriate parameter to the kernel's boot command line:

* For Intel CPUs: `intel_iommu=on`
* For AMD CPUs: `amd_iommu=on`

For a detailed guide on how to perform this kernel configuration, refer to the [NVIDIA GPU Passthrough documentation]({{% relref "product/cluster_configuration/hosts_and_clusters/nvidia_gpu_passthrough.md" %}}).
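
The exact procedure depends on your distribution and bootloader; as an illustration only, on a GRUB-based Intel host the change typically looks like the sketch below (the detailed, supported steps are in the guide linked above):

```shell
# Illustrative example for a GRUB-based Intel host; use amd_iommu=on on AMD,
# and adapt the update command to your distribution
sudo nano /etc/default/grub   # add intel_iommu=on to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub              # on RHEL-family systems: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```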

### Hypervisor Preparation

For PCI passthrough to work correctly with NVIDIA GPUs, the hypervisor nodes must start from a clean state with respect to NVIDIA drivers.

Avoid pre-installing NVIDIA drivers on the hypervisor nodes before running the OneDeploy playbook. An active proprietary NVIDIA driver will claim the GPU and prevent other drivers, such as `vfio-pci`, from binding to the device, which will block the PCI passthrough configuration from succeeding.
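
A quick way to confirm that a node is in this clean state is to check that no NVIDIA (or nouveau) kernel module is currently loaded; a minimal check:

```shell
# No output means no NVIDIA or nouveau driver module is currently loaded
lsmod | grep -Ei 'nvidia|nouveau'
```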
47+
48+
## Deployment with OneDeploy
49+
50+
Use OneDeploy to automate the deployment of our OpenNebula cloud with PCI passthrough configured for our GPUs.
51+
52+
### Setting Up OneDeploy
53+
54+
The OneDeploy tool is a collection of Ansible playbooks that streamline the installation of OpenNebula. Before running this collection, prepare your control node which is the machine where you will execute the Ansible commands.
55+
56+
1. **Clone the repository**:
57+
```shell
58+
git clone https://github.com/OpenNebula/one-deploy.git
59+
cd one-deploy
60+
```
61+
2. **Install dependencies**:
62+
OneDeploy requires Ansible and a few other Python libraries. For detailed system requirements and setup instructions, follow the [Platform Notes](https://github.com/OpenNebula/one-deploy/wiki/sys_reqs) in the official wiki.
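
As an illustration of what step 2 can look like, the following assumes a Python virtual environment on the control node; the package name and the `requirements.yml` path are assumptions, so follow the Platform Notes for the supported procedure:

```shell
# Assumed example: an isolated Python environment for Ansible on the control node
python3 -m venv .venv && source .venv/bin/activate
pip install ansible-core
# Install the Ansible collections the playbooks depend on
# (requirements.yml assumed to live at the repository root)
ansible-galaxy collection install -r requirements.yml
```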

For guidance on how to execute the playbooks in different cloud architectures, see the [Playbook Usage Guide](https://github.com/OpenNebula/one-deploy/wiki/sys_use).

### Step 1: Configure the Inventory for PCI Passthrough

For this configuration, use a dedicated inventory file to define the overall cloud architecture and to specify the PCI devices for passthrough.

Here is an example inventory file, based on the `inventory/pci_passthrough.yml` file in the `one-deploy` repository, which you can adapt to your environment. It is a basic example: adjust it to match your specific cloud architecture, including your Front-end and node IP addresses, network configuration (`vn`), and datastore setup (`ds`). For more details on the `pci_passthrough` roles, refer to the [PCI Passthrough wiki page](https://github.com/OpenNebula/one-deploy/wiki/pci_passthrough); for information on configuring OneDeploy for other architectures, such as shared or Ceph-based storage, refer to the official [OneDeploy wiki](https://github.com/OpenNebula/one-deploy/wiki).

```yaml
---
all:
  vars:
    ansible_user: root
    one_version: '7.0'
    one_pass: opennebulapass
    ds:
      mode: ssh
    vn:
      admin_net:
        managed: true
        template:
          VN_MAD: bridge
          BRIDGE: br0
          AR:
            TYPE: IP4
            IP: 192.168.122.100
            SIZE: 48
          NETWORK_ADDRESS: 192.168.122.0
          NETWORK_MASK: 255.255.255.0
          GATEWAY: 192.168.122.1
          DNS: 1.1.1.1

frontend:
  hosts:
    f1: { ansible_host: 192.168.122.2 }

node:
  hosts:
    h100-node:
      ansible_host: 192.168.122.3
      pci_passthrough_enabled: true
      pci_devices:
        - address: "0000:09:00.0" # NVIDIA H100 GPU
    l40s-node:
      ansible_host: 192.168.122.4
      pci_passthrough_enabled: true
      pci_devices:
        - address: "0000:0a:00.0" # NVIDIA L40S GPU
    standard-node:
      ansible_host: 192.168.122.5
      pci_passthrough_enabled: false
```

Key configuration parameters to set up:

* `pci_passthrough_enabled: true`: this boolean flag enables the PCI passthrough configuration for a specific node.
* `pci_devices`: the list of PCI devices to be configured for passthrough on that node.
* `address`: the full PCI address of the device (e.g., `"0000:09:00.0"`). Find this address by running the `lspci -D` command on the hypervisor, as shown in the example below. Note that you must provide the full address; short addresses are not supported by this OneDeploy feature.
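
For example, on a hypervisor with an NVIDIA GPU the lookup could look like this (the address and device name will differ on your hardware):

```shell
# -D prints the full, domain-qualified PCI address expected by the inventory
lspci -D | grep -i nvidia
# Example output (illustrative):
# 0000:09:00.0 3D controller: NVIDIA Corporation ...
```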
123+
124+
### Step 2: Run the Deployment
125+
126+
Once your inventory file is ready (e.g., saved as `inventory/ai_factory.yml`), run OneDeploy to provision your OpenNebula cloud.
127+
128+
```shell
129+
make I=inventory/ai_factory.yml
130+
```
131+
132+
When you enable the PCI passthrough feature in your inventory, OneDeploy handles all the necessary configuration steps. On each hypervisor node, OneDeploy prepares the specified GPUs for passthrough by binding them to the required `vfio-pci` driver. It also ensures the correct permissions are set so that OpenNebula manages the devices.
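
To double-check a node after the playbook run, you can inspect which kernel driver is bound to the GPU; a sketch using the address from the example inventory:

```shell
# The "Kernel driver in use" line should report vfio-pci for a passthrough-ready GPU
lspci -nnk -s 0000:09:00.0
```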

At the same time, on the OpenNebula Front-end, OneDeploy configures the monitoring system to recognize these GPUs and updates each Host's template accordingly. This ensures that the GPUs are always correctly identified by OpenNebula, even if hardware addresses change, providing a stable and reliable passthrough setup.

## Post-Deployment Validation

After the deployment is complete, verify that the GPUs are correctly configured and available to OpenNebula by checking the Host information in Sunstone:

1. Log in to your OpenNebula Sunstone GUI.
2. Navigate to **Infrastructure -> Hosts**.
3. Select one of the hypervisors you configured for passthrough (e.g., `h100-node`).
4. Go to the **PCI** tab.
5. You will see your GPU listed as an available PCI device.

If the device is visible here, your AI-ready OpenNebula cloud is correctly configured. The H100 and L40S GPUs are now ready to be passed through to Virtual Machines for high-performance AI and ML tasks.
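
You can also verify this from the command line on the Front-end; for example, with the host name used in the example inventory:

```shell
# The host details include a PCI DEVICES section listing devices available for passthrough
onehost show h100-node
```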
