
Commit fe10936

Creating a new docs landing page
1 parent 9272b99 commit fe10936

File tree

13 files changed (+678, -0 lines)

.gitignore

Lines changed: 1 addition & 0 deletions

venv-docs

docs/appendices.md

Lines changed: 57 additions & 0 deletions

# Appendices

This section contains reference material, scripts, and diagnostic guidance to support users operating GPU supercomputing clusters on Azure.

## A. Diagnostic Scripts

### Node Health Check (AzHPC)

The AzHPC validation toolkit includes a modular node health check script:

```bash
git clone https://github.com/Azure/azhpc-validation
cd azhpc-validation
bash scripts/run-validation.sh
```

The script includes checks for:

- GPU enumeration and driver status
- ECC errors
- PCIe/NVLink/IB connectivity
- NCCL functionality
- Clock/thermal status

### NCCL Benchmark Scripts

Preconfigured NCCL benchmark wrappers can be found in the same repository or customized:

```bash
mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

## B. Common Issues and Signatures

| Symptom | Possible Cause | Tool |
|---------|----------------|------|
| Missing GPU | GPU failure, driver issue | `nvidia-smi`, NHC |
| Low NCCL bandwidth | SHARP off, job not packed | `all_reduce_perf`, ToRset |
| InfiniBand link down | Cable/NIC/switch issue | `ibstat`, `perfquery` |
| ECC error spike | Faulty GPU | `nvidia-smi -q`, DCGM |
| PCIe bus errors | NUMA misalignment, system misconfig | `lspci`, `dmesg` |

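As a quick triage pass against the table above, a short shell check can surface the most common signatures before deeper tooling is involved. This is a minimal sketch only: the expected GPU count of 8 assumes an ND96-class node, and field availability may vary by driver version.

```bash
#!/usr/bin/env bash
# Quick node triage: GPU count, uncorrected ECC errors, and IB link state.
# Assumes an 8-GPU ND96-class node; adjust EXPECTED_GPUS for other SKUs.
EXPECTED_GPUS=8

gpu_count=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
[ "$gpu_count" -eq "$EXPECTED_GPUS" ] || echo "WARN: only $gpu_count/$EXPECTED_GPUS GPUs visible"

# Report any GPU with volatile uncorrected ECC errors
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv,noheader \
  | awk -F', ' '$2 != "0" && $2 != "[N/A]" {print "WARN: GPU " $1 " uncorrected ECC errors: " $2}'

# Count active InfiniBand ports
active_links=$(ibstat | grep -c "State: Active")
echo "Active IB ports: $active_links"
```
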
## C. Reference Links

- [AzHPC GitHub](https://github.com/Azure/azhpc)
- [Moneo GitHub](https://github.com/Azure/moneo)
- [GHR API Docs](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/guest-health)
- [NVIDIA NCCL](https://developer.nvidia.com/nccl)

## D. Feedback & Contributions

This guide is open to customer feedback. If you notice outdated information or would like to contribute improvements, reach out to your Microsoft account team or submit a pull request if the guide is hosted on GitHub.

---

End of Guide.

docs/benchmarking.md

Lines changed: 59 additions & 0 deletions

# Benchmarking

This section describes how to benchmark your Azure supercomputing cluster to verify expected performance and identify potential bottlenecks. Benchmarks also serve as a pre-check for production readiness and support engagement.

## 1. Why Benchmark?

- Validate cluster configuration (e.g., topology, SHARP enablement)
- Establish performance baselines for detecting future regressions
- Identify underperforming nodes or links
- Support escalation by demonstrating hardware-level anomalies

## 2. NCCL Benchmarks

NCCL is the standard collective communication library for multi-GPU workloads using NVLink and InfiniBand.

Clone and build the tests:

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1
```

Then run:

```bash
mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

Set `-np` to the number of GPUs you have available, and ensure each rank maps to a separate GPU.

## 3. SHARP vs Non-SHARP Output

| Test Pattern | SHARP-enabled (NDv4) | Non-SHARP |
|--------------|----------------------|-----------|
| AllReduce 1GB | ~180 GB/s | ~120 GB/s |
| AllReduce 256MB | ~90–120 GB/s | ~60–80 GB/s |

Performance depends on node locality and job packing. Use ToRset information to diagnose.

## 4. Interpreting Results

- **Flat or low throughput** across sizes suggests topology misalignment or SHARP not engaged
- **One GPU consistently slower** can indicate a bad PCIe lane or thermal throttling
- **High variability** between runs usually points to a job placement issue

Plot and compare runs to a known-good benchmark from your team or Microsoft; a lightweight way to extract comparable numbers is shown below.

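To make run-to-run comparisons easier, the summary line that `nccl-tests` prints can be captured and parsed. The following is a minimal sketch; the log filename and the 150 GB/s threshold are illustrative assumptions, not fixed values.

```bash
# Run the benchmark, log the output, and extract the average bus bandwidth
# from the nccl-tests summary. Log path and threshold are assumptions.
LOG=allreduce_$(date +%Y%m%d_%H%M%S).log

mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 | tee "$LOG"

# nccl-tests ends with a line like: "# Avg bus bandwidth    : 185.3"
avg_bw=$(awk '/Avg bus bandwidth/ {print $NF}' "$LOG")
echo "Average bus bandwidth: ${avg_bw} GB/s"

# Compare against a known-good baseline before accepting the job placement
awk -v bw="$avg_bw" 'BEGIN { if (bw + 0 < 150) print "WARN: below expected SHARP-enabled baseline" }'
```
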
## 5. Additional Tests

- `ib_read_bw` / `ib_write_bw` – raw IB throughput per link (see the example below)
- `dcgmi dmon -e 1000` – GPU perf counters
- `nvidia-smi nvlink --status` – validate NVLink health

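For per-link verification with the perftest utilities, one node acts as the server and another as the client. This is a sketch under assumptions: the device name and message size are placeholders, so list your HCAs with `ibstat` and adjust.

```bash
# On the server node (device name is an assumption; check `ibstat` for yours)
ib_write_bw -d mlx5_0 -s 1048576 -F --report_gbits

# On the client node, point at the server's hostname or IP
ib_write_bw -d mlx5_0 -s 1048576 -F --report_gbits <server-node>
```
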
---

Next: [Telemetry & Observability](telemetry.md)

docs/deployment.md

Lines changed: 73 additions & 0 deletions

# Deployment Guide

This section describes how to deploy Azure GPU supercomputing infrastructure, with options for CLI-based provisioning, infrastructure-as-code tools, and key networking considerations.

## 1. Choose a Deployment Method

Azure supercomputing clusters can be deployed using the following options:

- **Azure CLI** – for lightweight manual provisioning and testing
- **Bicep or ARM templates** – for reproducible and auditable deployments
- **Terraform** – popular among infrastructure teams for cloud-agnostic deployment
- **AzHPC** – Microsoft-supported toolkit for deploying tightly coupled HPC clusters with InfiniBand

> **Recommendation:** Use AzHPC for complex topologies or when IB tuning is required.

## 2. Define Your Topology

Define:

- Desired VM SKU (NDv4 or NDv5)
- Number of nodes
- InfiniBand network topology (e.g., flat, SHARP-enabled, non-SHARP)
- Placement policies (e.g., proximity placement groups)

Use the appropriate parameter or variable files depending on your tooling.

## 3. Configure Networking

Ensure the following:

- The VNet and subnet are provisioned with sufficient IPs
- Accelerated networking is enabled
- NSGs allow SSH, telemetry, and any required workload ports
- If using IB, ensure correct partitioning and ToR topology alignment

A minimal CLI sketch for these network prerequisites is shown below.

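The following is a minimal sketch of the networking prerequisites; resource names, address prefixes, and the NSG rule are placeholder assumptions to adapt to your environment.

```bash
# Create the VNet and subnet (names and address prefixes are placeholders)
az network vnet create \
  --resource-group myResourceGroup \
  --name myVNet \
  --address-prefix 10.0.0.0/16 \
  --subnet-name mySubnet \
  --subnet-prefix 10.0.0.0/24

az network nsg create --resource-group myResourceGroup --name myNSG

# Allow inbound SSH; add further rules for telemetry or workload ports as needed
az network nsg rule create \
  --resource-group myResourceGroup \
  --nsg-name myNSG \
  --name AllowSSH \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 22
```
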
## 4. Provision Resources

Example CLI steps:

```bash
az group create -n myResourceGroup -l eastus

az vm create \
  --resource-group myResourceGroup \
  --name myVM \
  --image OpenLogic:CentOS-HPC:7_9:latest \
  --size Standard_ND96asr_v4 \
  --vnet-name myVNet \
  --subnet mySubnet \
  --admin-username azureuser \
  --ssh-key-values ~/.ssh/id_rsa.pub
```

Replace these values with your VM SKU, region, and networking details.

## 5. Post-Deployment Validation

After deployment, verify:

- Node health (see the [Validation](validation.md) section)
- IB topology and functionality (see the [Topology](topology.md) section)
- Telemetry pipeline functionality (see the [Telemetry](telemetry.md) section)

## 6. Automation and Scaling

We recommend integrating deployment pipelines into your CI/CD system for reproducibility. For scale-out, consider:

- VM Scale Sets (VMSS) with a custom image (see the sketch below)
- Azure CycleCloud
- AzHPC scripts with looped host creation

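If you standardize on VM Scale Sets, the uniform-node case can look roughly like the sketch below; the image reference, instance count, and resource names are assumptions carried over from the single-VM example above.

```bash
# Hypothetical VMSS variant of the single-VM example; adjust the image, SKU,
# and instance count for your deployment.
az vmss create \
  --resource-group myResourceGroup \
  --name myScaleSet \
  --image OpenLogic:CentOS-HPC:7_9:latest \
  --vm-sku Standard_ND96asr_v4 \
  --instance-count 4 \
  --vnet-name myVNet \
  --subnet mySubnet \
  --admin-username azureuser \
  --ssh-key-values ~/.ssh/id_rsa.pub
```
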
---

Next: [VM SKU Reference](vm-skus.md)

docs/getting-started.md

Lines changed: 42 additions & 0 deletions

# Getting Started

This section provides the essential setup steps to prepare for using Azure's AI Supercomputing infrastructure. It covers subscription readiness, quota configuration, access roles, and optional onboarding for advanced features like Guest Health Reporting (GHR).

## 1. Subscription Preparation

Before deploying GPU clusters, ensure the following:

- You have access to an Azure subscription in the correct region.
- Sufficient quota for the target VM SKU (NDv4 or NDv5) is available.
- Required resource providers are registered:
  - Microsoft.Network

Use the Azure CLI to validate quota and request increases if needed, as shown below.

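A quick way to check current usage against limits is `az vm list-usage`; the region below is a placeholder, and quota increases are requested through the portal's quota blade or a support request.

```bash
# Check current core usage and limits for the target region
az vm list-usage --location eastus --output table

# Narrow the output to the ND-series families relevant to NDv4/NDv5
az vm list-usage --location eastus --output table \
  --query "[?contains(name.value, 'ND')]"
```
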
## 2. Role Assignments and Access Control

Assign the following roles to the appropriate identities in your Azure subscription:

- **Contributor** or **Owner**: to deploy infrastructure and manage resources.
- **Impact Reporter**: for GHR operations (if enabled).
- **Reader**: for monitoring and telemetry dashboards.

Ensure your automation identities (e.g., Terraform, Bicep, AzHPC) have adequate permissions.

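Role assignments can be scripted as sketched below; the assignee and subscription ID are placeholders, and the Impact Reporter role only applies if GHR is enabled for your subscription.

```bash
# Grant roles at subscription scope (assignee and subscription ID are placeholders)
az role assignment create \
  --assignee "<user-or-service-principal-id>" \
  --role "Contributor" \
  --scope "/subscriptions/<subscription-id>"

# Only needed for identities that will call the GHR API
az role assignment create \
  --assignee "<reporting-identity-id>" \
  --role "Impact Reporter" \
  --scope "/subscriptions/<subscription-id>"
```
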
## 3. Register for Guest Health Reporting (Optional)

Guest Health Reporting (GHR) enables qualified customers to notify Azure about faulty hardware nodes.

To register:

1. Go to your Azure subscription.
2. In **Resource Providers**, register `Microsoft.Impact`.
3. In **Preview Features**, register `Allow Impact Reporting`.
4. Under **Access Control (IAM)**, assign the **Impact Reporter** role to the app or user that will report issues.
5. Fill out the [Onboarding Questionnaire](https://forms.office.com/Pages/DesignPageV2.aspx?origin=NeoPortalPage&subpage=design&id=v4j5cvGGr0GRqy180BHbR5TDsw2DhHZCkjVm4E5h1NNUNTZQMkRRWUw4S1ZOTUM1UlJIQkhXQ0czSi4u&analysis=false&topview=Preview).

See the [Guest Health Reporting section](ghr.md) for usage details. Steps 2 and 3 can also be scripted, as sketched below.

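A CLI sketch of the registration steps follows. The feature name passed to `az feature register` is an assumption based on the portal display name "Allow Impact Reporting" and may differ in your tenant.

```bash
# Register the resource provider required for GHR
az provider register --namespace Microsoft.Impact

# Register the preview feature (feature name is an assumption; verify in your tenant)
az feature register --namespace Microsoft.Impact --name AllowImpactReporting
az feature show --namespace Microsoft.Impact --name AllowImpactReporting --query properties.state
```
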
## 4. Next Steps

Once your subscription is ready and roles are assigned, proceed to the [Deployment Guide](deployment.md) to launch your supercomputing cluster.

docs/ghr.md

Lines changed: 71 additions & 0 deletions

# Guest Health Reporting (GHR)

Guest Health Reporting (GHR) is a mechanism that allows customers to notify Azure about suspected hardware issues with specific nodes. It is available to approved customers operating supported VM SKUs like NDv4 and NDv5.

## 1. What is GHR?

GHR enables external users to flag potentially faulty virtual machines to Microsoft. These reports contribute to Azure's hardware telemetry and support processes, accelerating detection and remediation of underlying issues.

## 2. Who Can Use GHR?

GHR is currently in preview and is only available to approved customers. To request access:

- Register the `Microsoft.Impact` resource provider
- Enable the preview feature `Allow Impact Reporting`
- Assign the `Impact Reporter` role to your reporting identity
- Complete the onboarding form (link in [Getting Started](getting-started.md))

## 3. How GHR Works

Once enabled:

1. You detect a node with a suspected fault (via validation, logs, repeated failures, etc.)
2. Your system (or you) sends a signed POST request to the GHR API with impact details
3. Azure logs and triages the report; correlated reports trigger deeper diagnostics or node removal

Reports are not immediate triggers; they are signals in a broader telemetry system.

## 4. Reporting an Impact

To report an impact, POST to the following endpoint:

```
https://impact.api.azure.com/impact/v1/report
```

With a body like:

```json
{
  "subscriptionId": "<your-subscription-id>",
  "resourceUri": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>",
  "impactedComponents": [
    {
      "impactCategory": "GPU",
      "impactType": "DegradedPerformance",
      "timestamp": "2024-04-10T22:30:00Z"
    }
  ]
}
```

Make sure your identity has the `Impact Reporter` role and your app is registered in Azure AD. An end-to-end request is sketched below.

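Putting the pieces together, a request might look like the following sketch. The token audience passed to `az account get-access-token` is an assumption for this preview API; confirm the correct resource with your Microsoft contact before relying on it.

```bash
# Sketch only: acquire a token and POST an impact report.
# The --resource value is an assumption and may need to change for the preview API.
TOKEN=$(az account get-access-token --resource "https://impact.api.azure.com" \
  --query accessToken -o tsv)

curl -X POST "https://impact.api.azure.com/impact/v1/report" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @impact-report.json   # JSON body shown above, saved to a file
```
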
## 5. Supported Impact Categories

| Category | Type | Example |
|----------|------|---------|
| GPU | DegradedPerformance | ECC errors, frequent resets |
| IB | Unreachable | Node fails NCCL or link tests |
| CPU | UnexpectedReboot | Node crashes during workload |
| PCIe | BandwidthThrottle | PCIe/NVLink bottleneck observed |

## 6. Best Practices

- Only report when you are confident the issue is hardware-related
- Include timestamps and context where possible
- Integrate reporting into automated diagnostic pipelines for scale

---

Next: [InfiniBand Topology](topology.md)

docs/index.md

Lines changed: 24 additions & 0 deletions

# Azure Supercomputer User Guide

### For AI/ML infrastructure teams operating large-scale GPU clusters

This guide helps infrastructure engineers, ops teams, and ML researchers deploy and operate GPU supercomputing environments on Azure. It focuses on the NDv4 and NDv5 VM SKUs and provides practical guidance for setup, validation, benchmarking, and performance optimization.

Whether you're bringing up your first NDv5 cluster or tuning NCCL for a 1,024-GPU run, this guide aims to make Azure's high-performance infrastructure accessible and operationally reliable.

Topics covered include:

- Subscription and quota preparation
- Deployment architectures and automation
- VM SKU selection and hardware topologies
- Node validation and health checks
- Performance benchmarking with NCCL
- Guest Health Reporting (GHR)
- InfiniBand topology tuning
- Telemetry and observability
- Diagnostic tools and troubleshooting workflows

> Contributions welcome. Reach out to your Microsoft team or open a PR if this site is hosted on GitHub.

---

docs/operations.md

Lines changed: 58 additions & 0 deletions

# Operations

This section covers best practices for day-to-day operations of Azure GPU supercomputing clusters, including workload monitoring, failure remediation, and using Guest Health Reporting (GHR).

## 1. Monitoring Jobs and Node Health

For job-level and cluster-level visibility:

- Use Prometheus and Grafana for GPU/CPU/memory metrics
- Monitor GPU utilization, thermal state, ECC errors, and memory usage via `nvidia-smi` or DCGM (see the sketch below)
- Track InfiniBand traffic and errors using `perfquery`, `ibdiagnet`, or `infiniband-exporter`
- Use AzHPC telemetry or Moneo if supported in your cluster

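For ad-hoc polling outside a full telemetry stack, `nvidia-smi` query mode covers the basics; the interval and field list below are examples to adapt to your scrape pipeline.

```bash
# Poll per-GPU utilization, temperature, and volatile uncorrected ECC counts
# every 30 seconds (interval and fields are examples).
nvidia-smi \
  --query-gpu=timestamp,index,utilization.gpu,temperature.gpu,ecc.errors.uncorrected.volatile.total \
  --format=csv -l 30
```
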
## 2. Common Failure Modes

Watch for:

- Node hangs or unresponsiveness
- Repeated job failures on specific nodes
- ECC or PCIe errors
- GPUs missing from `nvidia-smi`
- InfiniBand degradation or disconnections

Many of these are detected during pre-job NHC runs or post-failure diagnostics.

## 3. Failure Remediation

Steps:

1. **Drain the node** from your scheduler (e.g., `scontrol update NodeName=XXX State=DRAIN Reason="validation fail"`).
2. Run AzHPC NHC or custom diagnostic scripts.
3. Compare results with historical telemetry.
4. If the issue persists and GHR is enabled, report it.

Document steps and timestamps to correlate with Azure support logs if escalation is required.

## 4. Node Reallocation

If you observe flaky behavior (intermittent failures), consider:

- Manually deallocating and reallocating the node (see the sketch below)
- Reimaging the node with your base image
- Cross-validating in different jobs or under stress testing

Avoid building long-term automation around reallocation; it's a workaround, not a fix.

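A manual reallocation cycle can look like the sketch below; resource names are placeholders, and on restart the VM may be placed on different underlying hardware. Re-run validation (see the Appendices) before returning the node to the scheduler.

```bash
# Deallocate (releases the underlying host) and start the VM again.
# Resource group and VM name are placeholders.
az vm deallocate --resource-group myResourceGroup --name myVM
az vm start --resource-group myResourceGroup --name myVM

# Re-run node validation before undraining the node in the scheduler
bash scripts/run-validation.sh
```
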
## 5. Guest Health Reporting (GHR)

For supported customers, GHR enables impact reporting and tracking of hardware incidents.

- Register using the onboarding steps in [Getting Started](getting-started.md)
- For full usage, see [GHR](ghr.md)

GHR can be integrated with job failure detection systems to auto-report problematic nodes.

---

Next: [Guest Health Reporting (GHR)](ghr.md)
