Health Check

This repository provides a pre-check recipe for validating GPU health before you deploy production or research workloads. It runs in both single-node and multi-node environments. The checks assess thermal behavior, power stability, and overall hardware reliability, so that issues such as thermal throttling, power irregularities, and GPU instability can be detected and addressed before a workload starts. Catching these problems early reduces the risk of unexpected downtime and performance degradation during long or critical runs.
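To make the idea of a thermal and power pre-check concrete, here is a minimal Python sketch. It is not this recipe's implementation. It assumes an NVIDIA driver with `nvidia-smi` on the `PATH` and uses its `--query-gpu` CSV output. The function names and both thresholds (85 °C, 105% of the power limit) are hypothetical values chosen for illustration, so adjust them for your hardware.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,temperature.gpu,power.draw,power.limit"


def _to_float(text):
    """Return a float, or None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        rows.append({
            "index": int(fields[0]),
            "temp_c": _to_float(fields[1]),
            "power_w": _to_float(fields[2]),
            "power_limit_w": _to_float(fields[3]),
        })
    return rows


def find_issues(rows, max_temp_c=85.0, max_power_fraction=1.05):
    """Return human-readable warnings. Thresholds are illustrative assumptions."""
    issues = []
    for row in rows:
        temp, power, limit = row["temp_c"], row["power_w"], row["power_limit_w"]
        if temp is not None and temp > max_temp_c:
            issues.append(f"GPU {row['index']}: temperature {temp:.0f} C exceeds {max_temp_c:.0f} C")
        if power is not None and limit is not None and power > limit * max_power_fraction:
            issues.append(f"GPU {row['index']}: power draw {power:.1f} W exceeds limit {limit:.1f} W")
    return issues


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)
```

On a GPU node, `find_issues(query_gpus())` returns an empty list when every GPU is within the chosen thresholds. A single snapshot like this only catches problems that are visible at idle. Detecting throttling under load would require sampling repeatedly while a workload runs.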

Pre-Filled Samples

| Title | Description |
| --- | --- |
| 2 A10 GPUs with dtype 16 | Deploys 2 A10 GPUs with dtype 16 on VM.GPU.A10.2 with 2 GPU(s). |
| 2 A10 GPUs with dtype 32 | Deploys 2 A10 GPUs with dtype 32 on VM.GPU.A10.2 with 2 GPU(s). |
| 8 H100 GPUs with dtype 16 | Deploys 8 H100 GPUs with dtype 16 on BM.GPU.H100.8 with 8 GPU(s). |