| 1 | +# Deployment Guide |
| 2 | + |
| 3 | +This section describes how to deploy Azure GPU Supercomputing infrastructure, with options for CLI-based provisioning, infrastructure-as-code tools, and key networking considerations. |
| 4 | + |
| 5 | +## 1. Choose a Deployment Method |
| 6 | + |
| 7 | +Azure Supercomputing clusters can be deployed using the following options: |
| 8 | + |
| 9 | +- **Azure CLI** – for lightweight manual provisioning and testing |
| 10 | +- **Bicep or ARM templates** – for reproducible and auditable deployments |
| 11 | +- **Terraform** – popular among infrastructure teams for cloud-agnostic deployment |
| 12 | +- **AzHPC** – Microsoft-supported toolkit for deploying tightly coupled HPC clusters with InfiniBand |
| 13 | + |
| 14 | +> **Recommendation:** Use AzHPC for complex topologies or when IB tuning is required. |
| 15 | + |
| 16 | +## 2. Define Your Topology |
| 17 | + |
| 18 | +Define: |
| 19 | + |
| 20 | +- Desired VM SKU (NDv4 or NDv5 series, for example `Standard_ND96asr_v4`) |
| 21 | +- Number of nodes |
| 22 | +- InfiniBand network topology (e.g., flat, SHARP-enabled, non-SHARP) |
| 23 | +- Placement policies (e.g., proximity placement groups) |
| 24 | + |
| 25 | +Record these choices in the parameter or variable file that your tooling expects (for example, a Bicep parameters file or a Terraform `.tfvars` file). |
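As a rough illustration of capturing the choices above for a CLI-based workflow, the sketch below keeps them in shell variables and sanity-checks them before anything is provisioned. The variable names, the `validate_topology` helper, the topology labels, and the `myPPG` name are all hypothetical; they are not part of the Azure CLI or AzHPC.

```bash
# Hypothetical topology settings; every name and value here is illustrative.
VM_SKU="Standard_ND96asr_v4"   # NDv4 size used in the CLI example of this guide
NODE_COUNT=4                   # number of nodes
IB_TOPOLOGY="flat"             # label only: flat, sharp, or non-sharp
PPG_NAME="myPPG"               # proximity placement group name (assumed)

# validate_topology SKU COUNT IB_LABEL
# Returns non-zero and prints a reason if a setting looks unusable.
validate_topology() {
  case "$1" in
    Standard_ND*) ;;
    *) echo "unexpected SKU: $1" >&2; return 1 ;;
  esac
  case "$2" in
    ''|*[!0-9]*|0*) echo "node count must be a positive integer: $2" >&2; return 1 ;;
  esac
  case "$3" in
    flat|sharp|non-sharp) ;;
    *) echo "unknown IB topology label: $3" >&2; return 1 ;;
  esac
}

validate_topology "$VM_SKU" "$NODE_COUNT" "$IB_TOPOLOGY" \
  && echo "topology OK: ${NODE_COUNT} x ${VM_SKU} (${IB_TOPOLOGY}, PPG ${PPG_NAME})"
```

Failing fast on a mistyped SKU or node count is cheaper than discovering it after a partial deployment; the same checks could equally live in Bicep or Terraform variable validation.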
| 26 | + |
| 27 | +## 3. Configure Networking |
| 28 | + |
| 29 | +Ensure the following: |
| 30 | + |
| 31 | +- The VNet and subnet are provisioned with enough IP addresses for every planned node |
| 32 | +- Accelerated networking is enabled |
| 33 | +- NSGs allow SSH, telemetry, and any required workload ports |
| 34 | +- If using InfiniBand (IB), the partitioning is correct and the cluster aligns with the top-of-rack (ToR) topology |
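To make "sufficient IPs" concrete, the sketch below estimates how many addresses a subnet can actually hand out. It relies on the fact that Azure reserves five addresses in every subnet; the helper names `usable_ips` and `subnet_fits` are hypothetical, and the check assumes one IP address per node.

```bash
# usable_ips PREFIX_LEN
# Prints the number of assignable addresses in an Azure subnet of that
# prefix length (Azure reserves 5 addresses in every subnet).
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# subnet_fits PREFIX_LEN NODE_COUNT
# Succeeds if the subnet can hold one IP address per node.
subnet_fits() {
  [ "$(usable_ips "$1")" -ge "$2" ]
}

usable_ips 24    # prints 251
subnet_fits 27 32 || echo "a /27 is too small for 32 nodes"
```

Leave headroom beyond the node count if the same subnet will also host login nodes, schedulers, or future scale-out.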
| 35 | + |
| 36 | +## 4. Provision Resources |
| 37 | + |
| 38 | +Example Azure CLI steps that create a resource group and a single VM: |
| 39 | + |
| 40 | +```bash |
| 41 | +az group create -n myResourceGroup -l eastus |
| 42 | + |
| 43 | +az vm create \ |
| 44 | + --resource-group myResourceGroup \ |
| 45 | + --name myVM \ |
| 46 | + --image OpenLogic:CentOS-HPC:7_9:latest \ |
| 47 | + --size Standard_ND96asr_v4 \ |
| 48 | + --vnet-name myVNet \ |
| 49 | + --subnet mySubnet \ |
| 50 | + --admin-username azureuser \ |
| 51 | + --ssh-key-values ~/.ssh/id_rsa.pub |
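| 52 | +``` |
| 53 | +Replace the resource group name, region, VM SKU, and networking values with your own. Confirm that the image is still published and supported before relying on it: CentOS 7 reached end of life in June 2024. |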
| 54 | + |
| 55 | +## 5. Post-Deployment Validation |
| 56 | + |
| 57 | +After deployment, verify: |
| 58 | + |
| 59 | +- Node health (see the [Validation](validation.md) section) |
| 60 | +- IB topology and functionality (see the [Topology](topology.md) section) |
| 61 | +- Telemetry pipeline is functional (see the [Telemetry](telemetry.md) section) |
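The linked sections are the authoritative procedures. As a quick smoke test to run on a node beforehand, a sketch like the following could help; `run_check` is a hypothetical helper, and it assumes that `nvidia-smi` and `ibstat` (from the InfiniBand diagnostics tools) are present on the image. It looks only at exit status, so a PASS means the tool ran, not that the fabric is correctly tuned.

```bash
# run_check LABEL COMMAND [ARGS...]
# Prints PASS, FAIL, or SKIP for one check, based only on exit status.
run_check() {
  label="$1"; shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "SKIP $label ($1 not installed)"
  elif "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
  fi
}

run_check "GPUs visible" nvidia-smi -L
run_check "IB ports readable" ibstat
```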
| 62 | + |
| 63 | +## 6. Automation and Scaling |
| 64 | + |
| 65 | +We recommend running deployments from your CI/CD system so that they are reproducible. For scale-out, consider: |
| 66 | + |
| 67 | +- VM Scale Sets (VMSS) with a custom image |
| 68 | +- Azure CycleCloud |
| 69 | +- AzHPC scripts with looped host creation |
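For a sense of what looped host creation looks like, here is a minimal sketch using plain Azure CLI rather than AzHPC, VMSS, or CycleCloud. It reuses the placeholder values from the CLI example in this guide; the `run` and `create_nodes` helpers, the `DRY_RUN` and `NODE_COUNT` variables, and the `node-N` naming scheme are assumptions made for illustration. It defaults to dry-run mode, which prints each command instead of executing it.

```bash
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands only; 0 = execute them
NODE_COUNT="${NODE_COUNT:-3}"  # number of VMs to create

# run COMMAND [ARGS...]
# Prints the command in dry-run mode, otherwise executes it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# create_nodes COUNT
# Creates COUNT VMs named node-1 .. node-COUNT, one after another.
create_nodes() {
  i=1
  while [ "$i" -le "$1" ]; do
    run az vm create \
      --resource-group myResourceGroup \
      --name "node-$i" \
      --image OpenLogic:CentOS-HPC:7_9:latest \
      --size Standard_ND96asr_v4 \
      --vnet-name myVNet \
      --subnet mySubnet \
      --admin-username azureuser \
      --ssh-key-values ~/.ssh/id_rsa.pub
    i=$((i + 1))
  done
}

create_nodes "$NODE_COUNT"
```

Review the printed commands first, then rerun with `DRY_RUN=0` to provision. A sequential loop like this is easy to follow but slow for large clusters, which is one reason to prefer VMSS or CycleCloud at scale.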
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +Next: [VM SKU Reference](vm-skus.md) |