
Commit fe10936

Creating a new docs landing page
1 parent 9272b99 commit fe10936

File tree

13 files changed (+678, -0 lines)

.gitignore

Lines changed: 1 addition & 0 deletions

venv-docs

docs/appendices.md

Lines changed: 57 additions & 0 deletions

# Appendices

This section contains reference material, scripts, and diagnostic guidance to support users operating GPU supercomputing clusters on Azure.

## A. Diagnostic Scripts

### Node Health Check (AzHPC)

The AzHPC validation toolkit includes a modular node health check script:

```bash
git clone https://github.com/Azure/azhpc-validation
cd azhpc-validation
bash scripts/run-validation.sh
```

The script includes checks for:

- GPU enumeration and driver status
- ECC errors
- PCIe/NVLink/IB connectivity
- NCCL functionality
- Clock/thermal status

### NCCL Benchmark Scripts

Preconfigured NCCL benchmark wrappers can be found in the same repository or customized:

```bash
mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

## B. Common Issues and Signatures

| Symptom | Possible Cause | Tool |
|---------|----------------|------|
| Missing GPU | GPU failure, driver issue | `nvidia-smi`, NHC |
| Low NCCL bandwidth | SHARP off, job not packed | `all_reduce_perf`, ToRset |
| InfiniBand link down | Cable/NIC/switch issue | `ibstat`, `perfquery` |
| ECC error spike | Faulty GPU | `nvidia-smi -q`, DCGM |
| PCIe bus errors | NUMA misalignment, system misconfig | `lspci`, `dmesg` |

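As a quick triage pass against the table above, a short shell check can surface the most common signatures before deeper tooling is involved. This is a minimal sketch only: the expected GPU count of 8 assumes an ND96-class node, and field availability may vary by driver version.

```bash
#!/usr/bin/env bash
# Quick node triage: GPU count, uncorrected ECC errors, and IB link state.
# Assumes an 8-GPU ND96-class node; adjust EXPECTED_GPUS for other SKUs.
EXPECTED_GPUS=8

gpu_count=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
[ "$gpu_count" -eq "$EXPECTED_GPUS" ] || echo "WARN: only $gpu_count/$EXPECTED_GPUS GPUs visible"

# Report any GPU with volatile uncorrected ECC errors
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv,noheader \
  | awk -F', ' '$2 != "0" && $2 != "[N/A]" {print "WARN: GPU " $1 " uncorrected ECC errors: " $2}'

# Count active InfiniBand ports
active_links=$(ibstat | grep -c "State: Active")
echo "Active IB ports: $active_links"
```
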
## C. Reference Links

- [AzHPC GitHub](https://github.com/Azure/azhpc)
- [Moneo GitHub](https://github.com/Azure/moneo)
- [GHR API Docs](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/guest-health)
- [NVIDIA NCCL](https://developer.nvidia.com/nccl)

## D. Feedback & Contributions

This guide is open to customer feedback. If you notice outdated information or would like to contribute improvements, reach out to your Microsoft account team or submit a pull request if the guide is hosted on GitHub.

---

End of Guide.

docs/benchmarking.md

Lines changed: 59 additions & 0 deletions

# Benchmarking

This section describes how to benchmark your Azure supercomputing cluster to verify expected performance and identify potential bottlenecks. Benchmarks also serve as a pre-check for production readiness and support engagement.

## 1. Why Benchmark?

- Validate cluster configuration (e.g., topology, SHARP enablement)
- Establish performance baselines for detecting future regressions
- Identify underperforming nodes or links
- Support escalation by demonstrating hardware-level anomalies

## 2. NCCL Benchmarks

NCCL is the standard collective communication library for multi-GPU workloads using NVLink and InfiniBand.

Clone and build the tests:

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1
```

Then run:

```bash
mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

Set `-np` to the number of GPUs you have available, and ensure each rank maps to a separate GPU.

## 3. SHARP vs Non-SHARP Output

| Test Pattern | SHARP-enabled (NDv4) | Non-SHARP |
|--------------|----------------------|-----------|
| AllReduce 1GB | ~180 GB/s | ~120 GB/s |
| AllReduce 256MB | ~90–120 GB/s | ~60–80 GB/s |

Performance depends on node locality and job packing. Use ToRset information to diagnose.

## 4. Interpreting Results

- **Flat or low throughput** across sizes suggests topology misalignment or SHARP not engaged
- **One GPU consistently slower** can indicate a bad PCIe lane or thermal throttling
- **High variability** between runs usually points to a job placement issue

Plot and compare runs to a known-good benchmark from your team or Microsoft; a lightweight way to extract comparable numbers is shown below.

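To make run-to-run comparisons easier, the summary line that `nccl-tests` prints can be captured and parsed. The following is a minimal sketch; the log filename and the 150 GB/s threshold are illustrative assumptions, not fixed values.

```bash
# Run the benchmark, log the output, and extract the average bus bandwidth
# from the nccl-tests summary. Log path and threshold are assumptions.
LOG=allreduce_$(date +%Y%m%d_%H%M%S).log

mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 | tee "$LOG"

# nccl-tests ends with a line like: "# Avg bus bandwidth    : 185.3"
avg_bw=$(awk '/Avg bus bandwidth/ {print $NF}' "$LOG")
echo "Average bus bandwidth: ${avg_bw} GB/s"

# Compare against a known-good baseline before accepting the job placement
awk -v bw="$avg_bw" 'BEGIN { if (bw + 0 < 150) print "WARN: below expected SHARP-enabled baseline" }'
```
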
## 5. Additional Tests

- `ib_read_bw` / `ib_write_bw` – raw IB throughput per link (see the example below)
- `dcgmi dmon -e 1000` – GPU perf counters
- `nvidia-smi nvlink --status` – validate NVLink health

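For per-link verification with the perftest utilities, one node acts as the server and another as the client. This is a sketch under assumptions: the device name and message size are placeholders, so list your HCAs with `ibstat` and adjust.

```bash
# On the server node (device name is an assumption; check `ibstat` for yours)
ib_write_bw -d mlx5_0 -s 1048576 -F --report_gbits

# On the client node, point at the server's hostname or IP
ib_write_bw -d mlx5_0 -s 1048576 -F --report_gbits <server-node>
```
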
---

Next: [Telemetry & Observability](telemetry.md)

docs/deployment.md

Lines changed: 73 additions & 0 deletions

# Deployment Guide

This section describes how to deploy Azure GPU supercomputing infrastructure, with options for CLI-based provisioning, infrastructure-as-code tools, and key networking considerations.

## 1. Choose a Deployment Method

Azure supercomputing clusters can be deployed using the following options:

- **Azure CLI** – for lightweight manual provisioning and testing
- **Bicep or ARM templates** – for reproducible and auditable deployments
- **Terraform** – popular among infrastructure teams for cloud-agnostic deployment
- **AzHPC** – Microsoft-supported toolkit for deploying tightly coupled HPC clusters with InfiniBand

> **Recommendation:** Use AzHPC for complex topologies or when IB tuning is required.

## 2. Define Your Topology

Define:

- Desired VM SKU (NDv4 or NDv5)
- Number of nodes
- InfiniBand network topology (e.g., flat, SHARP-enabled, non-SHARP)
- Placement policies (e.g., proximity placement groups)

Use the appropriate parameter or variable files depending on your tooling.

## 3. Configure Networking

Ensure the following:

- The VNet and subnet are provisioned with sufficient IPs
- Accelerated networking is enabled
- NSGs allow SSH, telemetry, and any required workload ports
- If using IB, ensure correct partitioning and ToR topology alignment

A minimal CLI sketch for these network prerequisites is shown below.

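The following is a minimal sketch of the networking prerequisites; resource names, address prefixes, and the NSG rule are placeholder assumptions to adapt to your environment.

```bash
# Create the VNet and subnet (names and address prefixes are placeholders)
az network vnet create \
  --resource-group myResourceGroup \
  --name myVNet \
  --address-prefix 10.0.0.0/16 \
  --subnet-name mySubnet \
  --subnet-prefix 10.0.0.0/24

az network nsg create --resource-group myResourceGroup --name myNSG

# Allow inbound SSH; add further rules for telemetry or workload ports as needed
az network nsg rule create \
  --resource-group myResourceGroup \
  --nsg-name myNSG \
  --name AllowSSH \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 22
```
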
## 4. Provision Resources

Example CLI steps:

```bash
az group create -n myResourceGroup -l eastus

az vm create \
  --resource-group myResourceGroup \
  --name myVM \
  --image OpenLogic:CentOS-HPC:7_9:latest \
  --size Standard_ND96asr_v4 \
  --vnet-name myVNet \
  --subnet mySubnet \
  --admin-username azureuser \
  --ssh-key-values ~/.ssh/id_rsa.pub
```

Replace these values with your VM SKU, region, and networking details.

## 5. Post-Deployment Validation

After deployment, verify:

- Node health (see the [Validation](validation.md) section)
- IB topology and functionality (see the [Topology](topology.md) section)
- Telemetry pipeline functionality (see the [Telemetry](telemetry.md) section)

## 6. Automation and Scaling

We recommend integrating deployment pipelines into your CI/CD system for reproducibility. For scale-out, consider:

- VM Scale Sets (VMSS) with a custom image (see the sketch below)
- Azure CycleCloud
- AzHPC scripts with looped host creation

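If you standardize on VM Scale Sets, the uniform-node case can look roughly like the sketch below; the image reference, instance count, and resource names are assumptions carried over from the single-VM example above.

```bash
# Hypothetical VMSS variant of the single-VM example; adjust the image, SKU,
# and instance count for your deployment.
az vmss create \
  --resource-group myResourceGroup \
  --name myScaleSet \
  --image OpenLogic:CentOS-HPC:7_9:latest \
  --vm-sku Standard_ND96asr_v4 \
  --instance-count 4 \
  --vnet-name myVNet \
  --subnet mySubnet \
  --admin-username azureuser \
  --ssh-key-values ~/.ssh/id_rsa.pub
```
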
---

Next: [VM SKU Reference](vm-skus.md)

docs/getting-started.md

Lines changed: 42 additions & 0 deletions

# Getting Started

This section provides the essential setup steps to prepare for using Azure's AI Supercomputing infrastructure. It covers subscription readiness, quota configuration, access roles, and optional onboarding for advanced features like Guest Health Reporting (GHR).

## 1. Subscription Preparation

Before deploying GPU clusters, ensure the following:

- You have access to an Azure subscription in the correct region.
- Sufficient quota for the target VM SKU (NDv4 or NDv5) is available.
- Required resource providers are registered:
  - Microsoft.Network

Use the Azure CLI to validate quota and request increases if needed, as shown below.

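A quick way to check current usage against limits is `az vm list-usage`; the region below is a placeholder, and quota increases are requested through the portal's quota blade or a support request.

```bash
# Check current core usage and limits for the target region
az vm list-usage --location eastus --output table

# Narrow the output to the ND-series families relevant to NDv4/NDv5
az vm list-usage --location eastus --output table \
  --query "[?contains(name.value, 'ND')]"
```
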
## 2. Role Assignments and Access Control

Assign the following roles to the appropriate identities in your Azure subscription:

- **Contributor** or **Owner**: to deploy infrastructure and manage resources.
- **Impact Reporter**: for GHR operations (if enabled).
- **Reader**: for monitoring and telemetry dashboards.

Ensure your automation identities (e.g., Terraform, Bicep, AzHPC) have adequate permissions.

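Role assignments can be scripted as sketched below; the assignee and subscription ID are placeholders, and the Impact Reporter role only applies if GHR is enabled for your subscription.

```bash
# Grant roles at subscription scope (assignee and subscription ID are placeholders)
az role assignment create \
  --assignee "<user-or-service-principal-id>" \
  --role "Contributor" \
  --scope "/subscriptions/<subscription-id>"

# Only needed for identities that will call the GHR API
az role assignment create \
  --assignee "<reporting-identity-id>" \
  --role "Impact Reporter" \
  --scope "/subscriptions/<subscription-id>"
```
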
## 3. Register for Guest Health Reporting (Optional)

Guest Health Reporting (GHR) enables qualified customers to notify Azure about faulty hardware nodes.

To register:

1. Go to your Azure subscription.
2. In **Resource Providers**, register `Microsoft.Impact`.
3. In **Preview Features**, register `Allow Impact Reporting`.
4. Under **Access Control (IAM)**, assign the **Impact Reporter** role to the app or user that will report issues.
5. Fill out the [Onboarding Questionnaire](https://forms.office.com/Pages/DesignPageV2.aspx?origin=NeoPortalPage&subpage=design&id=v4j5cvGGr0GRqy180BHbR5TDsw2DhHZCkjVm4E5h1NNUNTZQMkRRWUw4S1ZOTUM1UlJIQkhXQ0czSi4u&analysis=false&topview=Preview).

See the [Guest Health Reporting section](ghr.md) for usage details. Steps 2 and 3 can also be scripted, as sketched below.

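A CLI sketch of the registration steps follows. The feature name passed to `az feature register` is an assumption based on the portal display name "Allow Impact Reporting" and may differ in your tenant.

```bash
# Register the resource provider required for GHR
az provider register --namespace Microsoft.Impact

# Register the preview feature (feature name is an assumption; verify in your tenant)
az feature register --namespace Microsoft.Impact --name AllowImpactReporting
az feature show --namespace Microsoft.Impact --name AllowImpactReporting --query properties.state
```
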
## 4. Next Steps

Once your subscription is ready and roles are assigned, proceed to the [Deployment Guide](deployment.md) to launch your supercomputing cluster.

docs/ghr.md

Lines changed: 71 additions & 0 deletions

# Guest Health Reporting (GHR)

Guest Health Reporting (GHR) is a mechanism that allows customers to notify Azure about suspected hardware issues with specific nodes. It is available to approved customers operating supported VM SKUs like NDv4 and NDv5.

## 1. What is GHR?

GHR enables external users to flag potentially faulty virtual machines to Microsoft. These reports contribute to Azure's hardware telemetry and support processes, accelerating detection and remediation of underlying issues.

## 2. Who Can Use GHR?

GHR is currently in preview and is only available to approved customers. To request access:

- Register the `Microsoft.Impact` resource provider
- Enable the preview feature `Allow Impact Reporting`
- Assign the `Impact Reporter` role to your reporting identity
- Complete the onboarding form (link in [Getting Started](getting-started.md))

## 3. How GHR Works

Once enabled:

1. You detect a node with a suspected fault (via validation, logs, repeated failures, etc.)
2. Your system (or you) sends a signed POST request to the GHR API with impact details
3. Azure logs and triages the report; correlated reports trigger deeper diagnostics or node removal

Reports are not immediate triggers; they are signals in a broader telemetry system.

## 4. Reporting an Impact

To report an impact, POST to the following endpoint:

```
https://impact.api.azure.com/impact/v1/report
```

With a body like:

```json
{
  "subscriptionId": "<your-subscription-id>",
  "resourceUri": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>",
  "impactedComponents": [
    {
      "impactCategory": "GPU",
      "impactType": "DegradedPerformance",
      "timestamp": "2024-04-10T22:30:00Z"
    }
  ]
}
```

Make sure your identity has the `Impact Reporter` role and your app is registered in Azure AD. An end-to-end request is sketched below.

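Putting the pieces together, a request might look like the following sketch. The token audience passed to `az account get-access-token` is an assumption for this preview API; confirm the correct resource with your Microsoft contact before relying on it.

```bash
# Sketch only: acquire a token and POST an impact report.
# The --resource value is an assumption and may need to change for the preview API.
TOKEN=$(az account get-access-token --resource "https://impact.api.azure.com" \
  --query accessToken -o tsv)

curl -X POST "https://impact.api.azure.com/impact/v1/report" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @impact-report.json   # JSON body shown above, saved to a file
```
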
## 5. Supported Impact Categories

| Category | Type | Example |
|----------|------|---------|
| GPU | DegradedPerformance | ECC errors, frequent resets |
| IB | Unreachable | Node fails NCCL or link tests |
| CPU | UnexpectedReboot | Node crashes during workload |
| PCIe | BandwidthThrottle | PCIe/NVLink bottleneck observed |

## 6. Best Practices

- Only report when you are confident the issue is hardware-related
- Include timestamps and context where possible
- Integrate reporting into automated diagnostic pipelines for scale

---

Next: [InfiniBand Topology](topology.md)

docs/index.md

Lines changed: 24 additions & 0 deletions

# Azure Supercomputer User Guide

### For AI/ML infrastructure teams operating large-scale GPU clusters

This guide helps infrastructure engineers, ops teams, and ML researchers deploy and operate GPU supercomputing environments on Azure. It focuses on the NDv4 and NDv5 VM SKUs and provides practical guidance for setup, validation, benchmarking, and performance optimization.

Whether you're bringing up your first NDv5 cluster or tuning NCCL for a 1,024-GPU run, this guide aims to make Azure's high-performance infrastructure accessible and operationally reliable.

Topics covered include:

- Subscription and quota preparation
- Deployment architectures and automation
- VM SKU selection and hardware topologies
- Node validation and health checks
- Performance benchmarking with NCCL
- Guest Health Reporting (GHR)
- InfiniBand topology tuning
- Telemetry and observability
- Diagnostic tools and troubleshooting workflows

> Contributions welcome. Reach out to your Microsoft team or open a PR if this site is hosted on GitHub.

---

docs/operations.md

Lines changed: 58 additions & 0 deletions

# Operations

This section covers best practices for day-to-day operations of Azure GPU supercomputing clusters, including workload monitoring, failure remediation, and using Guest Health Reporting (GHR).

## 1. Monitoring Jobs and Node Health

For job-level and cluster-level visibility:

- Use Prometheus and Grafana for GPU/CPU/memory metrics
- Monitor GPU utilization, thermal state, ECC errors, and memory usage via `nvidia-smi` or DCGM (see the sketch below)
- Track InfiniBand traffic and errors using `perfquery`, `ibdiagnet`, or `infiniband-exporter`
- Use AzHPC telemetry or Moneo if supported in your cluster

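For ad-hoc polling outside a full telemetry stack, `nvidia-smi` query mode covers the basics; the interval and field list below are examples to adapt to your scrape pipeline.

```bash
# Poll per-GPU utilization, temperature, and volatile uncorrected ECC counts
# every 30 seconds (interval and fields are examples).
nvidia-smi \
  --query-gpu=timestamp,index,utilization.gpu,temperature.gpu,ecc.errors.uncorrected.volatile.total \
  --format=csv -l 30
```
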
## 2. Common Failure Modes

Watch for:

- Node hangs or unresponsiveness
- Repeated job failures on specific nodes
- ECC or PCIe errors
- GPUs missing from `nvidia-smi`
- InfiniBand degradation or disconnections

Many of these are detected during pre-job NHC runs or post-failure diagnostics.

## 3. Failure Remediation

Steps:

1. **Drain the node** from your scheduler (e.g., `scontrol update NodeName=XXX State=DRAIN Reason="validation fail"`).
2. Run AzHPC NHC or custom diagnostic scripts.
3. Compare results with historical telemetry.
4. If the issue persists and GHR is enabled, report it.

Document steps and timestamps to correlate with Azure support logs if escalation is required.

## 4. Node Reallocation

If you observe flaky behavior (intermittent failures), consider:

- Manually deallocating and reallocating the node (see the sketch below)
- Reimaging the node with your base image
- Cross-validating in different jobs or under stress testing

Avoid building long-term automation around reallocation; it's a workaround, not a fix.

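A manual reallocation cycle can look like the sketch below; resource names are placeholders, and on restart the VM may be placed on different underlying hardware. Re-run validation (see the Appendices) before returning the node to the scheduler.

```bash
# Deallocate (releases the underlying host) and start the VM again.
# Resource group and VM name are placeholders.
az vm deallocate --resource-group myResourceGroup --name myVM
az vm start --resource-group myResourceGroup --name myVM

# Re-run node validation before undraining the node in the scheduler
bash scripts/run-validation.sh
```
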
## 5. Guest Health Reporting (GHR)

For supported customers, GHR enables impact reporting and tracking of hardware incidents.

- Register using the onboarding steps in [Getting Started](getting-started.md)
- For full usage, see [GHR](ghr.md)

GHR can be integrated with job failure detection systems to auto-report problematic nodes.

---

Next: [Guest Health Reporting (GHR)](ghr.md)
