---
title: Benchmarking via onnxruntime_perf_test
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Now that you have set up and run the ONNX model (for example, SqueezeNet), you can benchmark its inference performance using Python-based timing or tools like **onnxruntime_perf_test**. This helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances.
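
Before building the native benchmarking tool, you can take a quick measurement with a short Python script. The following is a minimal sketch, assuming the `onnxruntime` and `numpy` packages are installed and that `squeezenet-int8.onnx` is in your current working directory; adjust the model path and input shape to match your setup.

```python
import time

import numpy as np
import onnxruntime as ort

# Load the model with the CPU execution provider (the model path is an assumption).
session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])

# SqueezeNet expects a 1x3x224x224 float32 input; adjust if your model differs.
input_name = session.get_inputs()[0].name
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm up so first-run overhead doesn't skew the numbers.
for _ in range(10):
    session.run(None, {input_name: input_data})

# Time repeated inferences.
runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: input_data})
elapsed = time.perf_counter() - start

print(f"Average inference time: {elapsed / runs * 1000:.4f} ms")
print(f"Throughput: {runs / elapsed:.2f} inferences/sec")
```

You can compare the average latency and throughput this script prints with the onnxruntime_perf_test results and the summary tables later in this section.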

You can also compare the inference time between Cobalt 100 (Arm64) and a similar D-series x86_64-based virtual machine on Azure.
As noted before, the benchmarking steps are the same whether you run in a Docker container or on a custom virtual machine.

## Run the performance tests using onnxruntime_perf_test
**onnxruntime_perf_test** is a performance benchmarking tool included in the ONNX Runtime source code. It measures the inference performance of ONNX models under different runtime conditions, such as the CPU, GPU, or other execution providers.

### Install Required Build Tools

```console
tdnf install -y cmake make gcc-c++ git
```
#### Install Protobuf

```console
tdnf install -y protobuf protobuf-devel
```
Then verify the installation:
```console
protoc --version
```
You should see output similar to:

```output
libprotoc 3.x.x
```
If installation through the package manager fails, or the installed version is too old for ONNX Runtime, install Protobuf from the prebuilt AArch64 zip artifact as described below.

#### Install Protobuf with a Prebuilt AArch64 ZIP Artifact

```console
wget https://github.com/protocolbuffers/protobuf/releases/download/v31.1/protoc-31.1-linux-aarch_64.zip -O protoc-31.1.zip
mkdir -p $HOME/tools/protoc-31.1
unzip protoc-31.1.zip -d $HOME/tools/protoc-31.1
echo 'export PATH="$HOME/tools/protoc-31.1/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

Then verify the installation:
```console
protoc --version
```
You should see output similar to:
```output
libprotoc x.x.x
```

### Clone and Build ONNX Runtime from Source

The benchmarking tool, **onnxruntime_perf_test**, isn't distributed as a pre-built binary for any platform, so you have to build it from source. Expect the build to take around 40-50 minutes.

Install the Protobuf packages (if you haven't already) and clone the ONNX Runtime repository:
```console
tdnf install -y protobuf protobuf-devel
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
```
Now build the benchmarking tool:

```console
./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
```
This builds the benchmarking tool at `./build/Linux/Release/onnxruntime_perf_test`.
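
To confirm that the build produced the benchmarking binary, you can check that it exists; this quick check is optional and not part of the build itself:

```console
ls -lh ./build/Linux/Release/onnxruntime_perf_test
```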

### Run the benchmark
Now that the benchmarking tool has been built, you can benchmark the **squeezenet-int8.onnx** model:

```console
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I <path-to-squeezenet-int8.onnx>
```

- **-e cpu**: Use the CPU execution provider (not the GPU or another backend).
- **-r 100**: Run 100 inferences.
- **-m times**: Use "repeat N times" mode.
- **-s**: Show detailed statistics.
- **-Z**: Disable intra-op thread spinning (reduces CPU usage when idle between runs).
- **-I**: Pass the ONNX model path directly, without using input/output test data.
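
The summary tables below also report CPU utilization and peak memory usage, which onnxruntime_perf_test does not print itself. One way to capture peak memory is to wrap the run with GNU time, assuming it is available on your system (you may need to install it first with tdnf); the `-v` flag reports the maximum resident set size:

```console
/usr/bin/time -v ./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I <path-to-squeezenet-int8.onnx>
```

Look for the `Maximum resident set size` line in the output.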

### Benchmark summary on x86_64

The following benchmark results are collected on two different x86_64 environments: a **Docker container running Azure Linux 3.0 hosted on a D4s_v6 Ubuntu-based Azure virtual machine**, and a **D4s_v4 Azure virtual machine created from the Azure Linux 3.0 image published by Ntegral Inc**.

| **Metric**                 | **Value on Docker Container** | **Value on Virtual Machine** |
|----------------------------|-------------------------------|------------------------------|
| **Average Inference Time** | 1.4713 ms                     | 1.8961 ms                    |
| **Throughput**             | 679.48 inferences/sec         | 527.25 inferences/sec        |
| **CPU Utilization**        | 100%                          | 95%                          |
| **Peak Memory Usage**      | 39.8 MB                       | 36.1 MB                      |
| **P50 Latency**            | 1.4622 ms                     | 1.8709 ms                    |
| **Max Latency**            | 2.3384 ms                     | 2.7826 ms                    |
| **Latency Consistency**    | Consistent                    | Consistent                   |

### Benchmark summary on Arm64

The following benchmark results are collected on two different Arm64 environments: a **Docker container running Azure Linux 3.0 hosted on a D4ps_v6 Ubuntu-based Azure virtual machine**, and a **D4ps_v6 Azure virtual machine created from the Azure Linux 3.0 custom image using the AArch64 ISO**.

| **Metric**                 | **Value on Docker Container** | **Value on Virtual Machine** |
|----------------------------|-------------------------------|------------------------------|
| **Average Inference Time** | 1.9183 ms                     | 1.9169 ms                    |
| **Throughput**             | 521.09 inferences/sec         | 521.41 inferences/sec        |
| **CPU Utilization**        | 98%                           | 100%                         |
| **Peak Memory Usage**      | 35.36 MB                      | 33.57 MB                     |
| **P50 Latency**            | 1.9165 ms                     | 1.9168 ms                    |
| **Max Latency**            | 2.0142 ms                     | 1.9979 ms                    |
| **Latency Consistency**    | Consistent                    | Consistent                   |

### Highlights from Azure Linux Arm64 Benchmarking (ONNX Runtime with SqueezeNet)
- **Low-Latency Inference:** Achieved consistent average inference times of ~1.92 ms across both Docker and virtual machine environments on Arm64.
- **Strong and Stable Throughput:** Sustained throughput of over 521 inferences/sec using the squeezenet-int8.onnx model on D4ps_v6 instances.
- **Lightweight Resource Footprint:** Peak memory usage stayed below 36 MB, with CPU utilization reaching ~98–100%, ideal for efficient edge or cloud inference.
- **Consistent Performance:** P50 and Max latency remained tightly bound across both setups, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure.