# Benchmark C++ Runtime
This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.
## Usage
Windows users: Follow the Windows installation instructions instead, and be sure to set DLL paths as specified in [Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).
### 2. Launch C++ benchmarking (Inflight/V1 batching)
#### Prepare dataset
Run a preprocessing script to prepare/generate a dataset into a JSON file that gptManagerBenchmark can consume later. The processed output JSON contains the *input tokens length, input token ids and output tokens length*.
This tool can be used in 2 different modes of traffic generation.
#### Prepare TensorRT-LLM engines
Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).
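As a rough illustration, a `trtllm-build` invocation for benchmarking might look like the sketch below; the checkpoint path, output path, and flag values are placeholders (not taken from this document), and the exact set of supported options depends on your TensorRT-LLM version, so consult `trtllm-build --help`.

```
# Hypothetical example: build an engine from an already-converted checkpoint.
# <path/to/checkpoint> and <path/to/engine_dir> are placeholders.
trtllm-build --checkpoint_dir <path/to/checkpoint> \
             --output_dir <path/to/engine_dir> \
             --gemm_plugin float16 \
             --max_batch_size 64
```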
#### Launch benchmarking
```
cd cpp/build
./benchmarks/gptManagerBenchmark --help
```
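A typical run might then look like the sketch below; the option names are assumptions based on the tool's conventions rather than this document, so confirm the exact names with `--help`.

```
# Hypothetical invocation: benchmark a prepared dataset against a built engine.
# Option names and paths are assumptions; verify them with --help.
./benchmarks/gptManagerBenchmark \
    --engine_dir <path/to/engine_dir> \
    --dataset <path/to/preprocessed_dataset.json>
```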
To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
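For example, a hedged sketch of such a run (the two emulation flags come from this section; the engine and dataset option names are assumptions, to be checked against `--help`):

```
# Collect up to 32 requests, or wait at most 100 ms, before submitting each batch.
# The emulation flags are from this README; the other option names are assumptions.
./benchmarks/gptManagerBenchmark \
    --engine_dir <path/to/engine_dir> \
    --dataset <path/to/dataset.json> \
    --static_emulated_batch_size 32 \
    --static_emulated_timeout 100
```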
Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
```
python prepare_dataset.py \
  --output tokens-fixed-lengths.json
```
### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)
#### Prepare TensorRT-LLM engine(s)
Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.
Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built previously; please see that [`document`](../python/README.md).
#### Launch benchmarking
For detailed usage, you can do the following:
```
cd cpp/build

# You can directly execute the binary for help information
./benchmarks/gptSessionBenchmark --help
```
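A hedged example of a fixed-shape run is sketched below; the option names follow the historical `gptSessionBenchmark` interface and are assumptions here, so verify them with `--help`.

```
# Hypothetical run with a fixed batch size and fixed input/output lengths.
# Option names and values are assumptions; check --help for your build.
./benchmarks/gptSessionBenchmark \
    --engine_dir <path/to/engine_dir> \
    --batch_size 8 \
    --input_output_len "128,128"
```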
If you want to obtain context and generation logits, you could build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` will enable both of them.
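For instance, a minimal sketch of such a build (the gather flags are from this paragraph; the checkpoint and output paths are placeholders, as in any other `trtllm-build` invocation):

```
# The gather flags are from this README; the paths are placeholders.
trtllm-build --checkpoint_dir <path/to/checkpoint> \
             --output_dir <path/to/engine_dir> \
             --gather_context_logits \
             --gather_generation_logits
```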
If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
*Please note that the expected outputs in that document are only for reference; specific performance numbers depend on the GPU you're using.*