F OpenNebula/engineering#408 AI Factories: New Configuration and Deployment page #362
Merged: rsmontero merged 32 commits into `OpenNebula:one-7.0-maintenance` from `dgarcia18:one-7.0-maintenance` on Oct 30, 2025.
File: `...tions/deployment_blueprints/ai-ready_opennebula/configuration_and_deployment.md` (148 additions, 0 deletions)
---
title: "Configuration and Deployment"
date: "2025-10-21"
# description: "Learn how to configure and deploy an AI-ready OpenNebula cloud with PCI passthrough for GPUs using one-deploy."
weight: 3
---
This guide details the process of deploying and configuring an AI-ready OpenNebula cloud using the [one-deploy](https://github.com/OpenNebula/one-deploy) tool. We will focus on a local environment, preparing it for demanding AI workloads by leveraging PCI passthrough for GPUs such as the NVIDIA H100 and L40S.

Machine Learning (ML) training and inference are resource-intensive tasks that often require the full power of a dedicated GPU. PCI passthrough gives a Virtual Machine exclusive access to a physical GPU, delivering bare-metal performance for the most demanding AI workloads.
## Requirements

Before you begin, ensure your environment meets the following requirements.
### Hardware Requirements

Your virtualization hosts (hypervisors) must support I/O MMU virtualization:

* **Intel CPUs**: must support **VT-d**.
* **AMD CPUs**: must support **AMD-Vi**.
This feature must be enabled in your server's BIOS/UEFI. Consult your hardware vendor's documentation for instructions.
### Kernel Configuration (Manual Step)

The `one-deploy` tool automates many aspects of the configuration, but **you must manually enable IOMMU support in the kernel** on each hypervisor node. This is a critical step that `one-deploy` does not perform automatically.
Before modifying your kernel parameters, check whether IOMMU is already active by inspecting the `/sys/kernel/iommu_groups/` directory on your hypervisor:

```shell
ls /sys/kernel/iommu_groups/
```

If this directory exists and contains subdirectories (e.g., `0/`, `1/`, etc.), IOMMU is likely active. An empty or non-existent directory indicates that IOMMU is not correctly enabled in your kernel or BIOS/UEFI.
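For a more detailed view than a bare `ls`, the sketch below pairs each PCI device with its IOMMU group number. The helper name and its overridable path argument are illustrative, not part of any OpenNebula tooling:

```shell
# List each PCI device together with its IOMMU group (illustrative helper;
# run on the hypervisor). The path argument only exists to make the function
# easy to test; it defaults to the real sysfs location.
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue
    group="${dev%/devices/*}"
    printf 'group %s: %s\n' "$(basename "$group")" "$(basename "$dev")"
  done
}

list_iommu_groups
```

No output at all means no IOMMU groups exist, which points back to the kernel or BIOS/UEFI configuration.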
If IOMMU is not active, add the appropriate parameter to your kernel's boot command line:

* For Intel CPUs: `intel_iommu=on`
* For AMD CPUs: `amd_iommu=on`

You may also need to add `iommu=pt` for passthrough-specific configurations. For a detailed guide on performing this kernel configuration, refer to the [NVIDIA GPU Passthrough documentation]({{% relref "product/cluster_configuration/hosts_and_clusters/nvidia_gpu_passthrough.md" %}}).
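As an illustration, on a GRUB-based distribution the change amounts to injecting the flags into `GRUB_CMDLINE_LINUX_DEFAULT` and regenerating the GRUB configuration. The helper below only demonstrates the text transformation (Intel flags assumed); consult the linked documentation for the full, distribution-specific procedure:

```shell
# Sketch of the GRUB edit (assumption: Intel host with a Debian/Ubuntu-style
# /etc/default/grub). add_iommu_flags only rewrites the line on stdin;
# applying it for real still requires root, update-grub (or grub2-mkconfig),
# and a reboot.
add_iommu_flags() {
  sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="\)/\1intel_iommu=on iommu=pt /'
}

echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' | add_iommu_flags
```

This prints `GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt quiet splash"`, i.e., the flags are prepended inside the existing quoted value.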
### Hypervisor Preparation

For PCI passthrough to work correctly with NVIDIA GPUs, start from a clean state on the hypervisor nodes with regard to NVIDIA drivers.

Avoid pre-installing NVIDIA drivers on the hypervisor nodes before running the `one-deploy` playbook. An active proprietary NVIDIA driver will claim the GPU and prevent other drivers, such as `vfio-pci`, from binding to the device, which blocks the PCI passthrough configuration from succeeding.
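A quick way to check for an already-active proprietary driver before running the playbook (assuming a standard Linux hypervisor with `lsmod` available):

```shell
# Any loaded nvidia* kernel module suggests the proprietary driver has
# claimed the GPU and will prevent vfio-pci from binding to it.
lsmod | grep -i '^nvidia' || echo "no NVIDIA modules loaded"
```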
## Deployment with one-deploy

We will use `one-deploy` to automate the deployment of our OpenNebula cloud with PCI passthrough configured for our GPUs.
### Setting Up one-deploy

The `one-deploy` tool is a collection of Ansible playbooks that streamline the installation of OpenNebula. Before you can use it, prepare your control node (the machine where you will run the Ansible commands).

1. **Clone the repository**:
   ```shell
   git clone https://github.com/OpenNebula/one-deploy.git
   cd one-deploy
   ```
2. **Install dependencies**: `one-deploy` requires Ansible and a few other Python libraries. For detailed system requirements and setup instructions, follow the [Platform Notes](https://github.com/OpenNebula/one-deploy/wiki/sys_reqs) in the official wiki.
For a general guide on executing the playbooks for different cloud architectures, see the [Playbook Usage Guide](https://github.com/OpenNebula/one-deploy/wiki/sys_use).
### Step 1: Configure the Inventory for PCI Passthrough

`one-deploy` uses an Ansible inventory file to define the hosts and their configurations. We'll use a dedicated inventory file to enable and specify the PCI devices for passthrough.

Here is an example inventory file, which you can adapt to your environment. It is based on the `inventory/pci_passthrough.yml` file in the `one-deploy` repository. For more details on the `pci_passthrough` roles, consult the [PCI Passthrough wiki page](https://github.com/OpenNebula/one-deploy/wiki/pci_passthrough).
```yaml
---
all:
  vars:
    ansible_user: root
    one_version: '7.0'
    one_pass: opennebulapass
    ds:
      mode: ssh
    vn:
      admin_net:
        managed: true
        template:
          VN_MAD: bridge
          BRIDGE: br0
          AR:
            TYPE: IP4
            IP: 192.168.122.100
            SIZE: 48
          NETWORK_ADDRESS: 192.168.122.0
          NETWORK_MASK: 255.255.255.0
          GATEWAY: 192.168.122.1
          DNS: 1.1.1.1

frontend:
  hosts:
    f1: { ansible_host: 192.168.122.2 }

node:
  hosts:
    h100-node:
      ansible_host: 192.168.122.3
      pci_passthrough_enabled: true
      pci_devices:
        - address: "0000:09:00.0" # NVIDIA H100 GPU
    l40s-node:
      ansible_host: 192.168.122.4
      pci_passthrough_enabled: true
      pci_devices:
        - address: "0000:0a:00.0" # NVIDIA L40S GPU
    standard-node:
      ansible_host: 192.168.122.5
      pci_passthrough_enabled: false
```
{{< alert title="Note" color="info" >}}
The inventory file shown above is a basic example. Adjust it to match your specific cloud architecture, including your frontend and node IP addresses, network configuration (`vn`), and datastore setup (`ds`). For more detailed information on configuring `one-deploy` for different architectures (such as shared or Ceph-based storage), refer to the official [one-deploy wiki](https://github.com/OpenNebula/one-deploy/wiki).
{{< /alert >}}
**Key Configuration Parameters:**

* `pci_passthrough_enabled: true`: this boolean flag enables the PCI passthrough configuration for a specific node.
* `pci_devices`: the list of PCI devices to configure for passthrough on that node.
* `address`: the full PCI address of the device (e.g., `"0000:09:00.0"`). You can find it with the `lspci -D` command on the hypervisor. Note that you must provide the full address; short addresses are not supported by this `one-deploy` feature.
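To collect the addresses, you can filter the `lspci -D` output. The small helper below is illustrative; it keeps only the address column of NVIDIA lines:

```shell
# nvidia_addresses reads `lspci -D` output on stdin and prints the full
# domain-qualified address (first column) of every NVIDIA device.
nvidia_addresses() {
  grep -i nvidia | awk '{print $1}'
}

# On the hypervisor:
#   lspci -D | nvidia_addresses
```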
### Step 2: Run the Deployment

Once your inventory file is ready (e.g., saved as `inventory/ai_factory.yml`), run `one-deploy` to provision your OpenNebula cloud:

```shell
make I=inventory/ai_factory.yml
```
The `one-deploy` tool deploys your entire OpenNebula cloud. When the PCI passthrough feature is enabled in your inventory, `one-deploy` handles all the necessary configuration steps automatically.

On each hypervisor node, it prepares the specified GPUs for passthrough by binding them to the required `vfio-pci` driver. It also ensures the correct permissions are set so that OpenNebula can manage the devices.
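You can confirm the binding on a node by resolving the device's driver symlink in sysfs. The helper below is an illustrative sketch (the second argument exists only to make it testable):

```shell
# driver_of prints the kernel driver currently bound to a PCI device by
# resolving its sysfs driver symlink; after deployment the passthrough GPUs
# should report "vfio-pci". The optional root argument is for testing only.
driver_of() {
  root="${2:-/sys/bus/pci/devices}"
  basename "$(readlink -f "$root/$1/driver")"
}

# On the hypervisor (address from the example inventory above):
#   driver_of 0000:09:00.0   # expected: vfio-pci
```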
Simultaneously, on the OpenNebula Front-end, it configures the monitoring system to recognize these GPUs and updates each Host's template accordingly. This ensures the GPUs remain correctly identified by OpenNebula even if hardware addresses change, providing a stable and reliable passthrough setup.
## Post-Deployment Validation

After the deployment is complete, verify that the GPUs are correctly configured and available to OpenNebula by checking the Host information in Sunstone.
Log in to your OpenNebula Sunstone GUI, navigate to **Infrastructure -> Hosts**, and select one of the hypervisors you configured for passthrough (e.g., `h100-node`). Go to the **PCI** tab; you should see your GPU listed as an available PCI device.
If the device is visible here, your AI-ready OpenNebula cloud is correctly configured. The H100 and L40S GPUs are now ready to be passed through to Virtual Machines for high-performance AI and ML tasks.
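If you prefer the CLI over Sunstone, the same information can be read from `onehost show` on the Front-end, assuming the standard CLI layout that prints a `PCI DEVICES` section. The small filter below is illustrative:

```shell
# pci_section prints everything from the "PCI DEVICES" header onward in
# `onehost show` output (assumption: standard OpenNebula CLI layout).
pci_section() {
  awk '/PCI DEVICES/ { found = 1 } found { print }'
}

# On the front-end, as oneadmin:
#   onehost show h100-node | pci_section
```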