
Commit bca9a33

kaiyux, ttim, MahmoudAshraf97, saeyoonoh, and hattizai authored
Update TensorRT-LLM (NVIDIA#2008)
* Update TensorRT-LLM

---------

Co-authored-by: Timur Abishev <[email protected]>
Co-authored-by: MahmoudAshraf97 <[email protected]>
Co-authored-by: Saeyoon Oh <[email protected]>
Co-authored-by: hattizai <[email protected]>
1 parent 5ddb6bf commit bca9a33

480 files changed, +328259 -6552 lines changed


.gitignore

+3
@@ -48,3 +48,6 @@ results_trt/
 
 # Generated files
 cpp/include/tensorrt_llm/executor/version.h
+
+# User config files
+CMakeUserPresets.json

README.md

+1-1
@@ -7,7 +7,7 @@ TensorRT-LLM
 [![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.4.1-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-10.1.0-green)](https://developer.nvidia.com/tensorrt)
+[![trt](https://img.shields.io/badge/TRT-10.2.0-green)](https://developer.nvidia.com/tensorrt)
 [![version](https://img.shields.io/badge/release-0.12.0.dev-green)](./tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

benchmarks/README.md

+11
@@ -0,0 +1,11 @@
+# TensorRT-LLM Benchmarks
+
+## Overview
+
+There are currently three workflows to benchmark TensorRT-LLM:
+* [C++ benchmarks](./cpp)
+  - The recommended workflow that uses the TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
+* [Python benchmarks](./python)
+  - The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
+* [The Python benchmarking suite](./suite)
+  - This benchmarking suite is currently a work in progress and is prone to large changes.

benchmarks/cpp/README.md

+60-70
@@ -1,7 +1,7 @@
-# Benchmark for C++ Runtime
+# Benchmark C++ Runtime
 
 This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.
 
 ## Usage
 
@@ -16,58 +16,11 @@ Windows users: Follow the
 instead, and be sure to set DLL paths as specified in
 [Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).
 
-### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
-
-#### Prepare TensorRT-LLM engine(s)
-
-Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
-
-Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
-
-#### Launch benchmarking
-
-For detailed usage, you can do the following
-```
-cd cpp/build
-
-# You can directly execute the binary for help information
-./benchmarks/gptSessionBenchmark --help
-./benchmarks/bertBenchmark --help
-```
-
-Take GPT-350M as an example for single GPU
-
-```
-./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_350m/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
-```
-Take GPT-175B as an example for multiple GPUs
-```
-mpirun -n 8 ./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_175b/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
-```
-
-If you want to obtain context and generation logits, you could build an enigne with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enable `--gather_all_token_logits` will enable both of them.
-
-If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
-
-*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
-
-### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
+### 2. Launch C++ benchmarking (Inflight/V1 batching)
 
 #### Prepare dataset
 
-Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*
+Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.
 
 This tool can be used in 2 different modes of traffic generation.
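As a rough illustration of the kind of file the preprocessing step produces (per-sample input token ids plus input and output token lengths), here is a minimal Python sketch that writes such a json. The field names (`samples`, `input_ids`, `input_len`, `output_len`) and the token values are assumptions for illustration only; the authoritative schema is whatever the `prepare_dataset.py` script in this repository actually emits.

```python
# Hypothetical sketch of a preprocessed benchmark dataset; the field names are
# assumptions, not necessarily the schema produced by prepare_dataset.py.
import json

dataset = {
    "samples": [
        # Each sample records its input token ids, the input length, and the
        # number of output tokens the benchmark should generate for it.
        {"input_ids": [101, 2009, 2003, 1037, 3231, 102], "input_len": 6, "output_len": 20},
        {"input_ids": [101, 7592, 2088, 102], "input_len": 4, "output_len": 16},
    ],
}

with open("preprocessed_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

In practice the file should be produced by the preprocessing script described above rather than written by hand.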

@@ -127,7 +80,8 @@ For `tokenizer`, specifying the path to the local tokenizer that have already be
 
 
 #### Prepare TensorRT-LLM engines
-Please make sure that the engines are built with argument `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching, for more details, please see the document in TensorRT-LLM examples.
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).
 
 #### Launch benchmarking

@@ -139,21 +93,10 @@ cd cpp/build
 ./benchmarks/gptManagerBenchmark --help
 ```
 
-Take GPT-350M as an example for single GPU V1 batching
-```
-./benchmarks/gptManagerBenchmark \
-    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type V1 \
-    --request_rate 10 \
-    --dataset ../../benchmarks/cpp/preprocessed_dataset.json
-    --max_num_samples 500
-```
-
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
-    --type IFB \
     --request_rate 10 \
     --dataset ../../benchmarks/cpp/preprocessed_dataset.json
     --max_num_samples 500
@@ -163,10 +106,11 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \
 
 #### Emulated static batching
 
-To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
+To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
+
 Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
 
-`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
+Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
 ```
 python prepare_dataset.py \
     --output tokens-fixed-lengths.json \
@@ -181,7 +125,6 @@ Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type IFB \
     --request-rate -1 \
     --static_emulated_batch_size 32 \
     --static_emulated_timeout 100 \
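To make the emulated static batching behaviour described above concrete, here is a minimal Python sketch of the wait-for-`n`-or-timeout submission loop, assuming a thread-safe request queue and a `submit_batch` callable. It illustrates the described logic only and is not the actual `gptManagerBenchmark` implementation.

```python
# Minimal sketch of emulated static batching: collect up to `batch_size`
# requests, but flush early once `timeout_ms` elapses. Illustrative only.
import queue
import time


def emulated_static_batching(requests: queue.Queue, batch_size: int, timeout_ms: float, submit_batch) -> None:
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + timeout_ms / 1000.0
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timeout reached: submit a smaller batch prematurely
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        # Submission is synchronous here, so the next batch is only collected
        # once the previous one has been processed completely.
        submit_batch(batch)
```

In the real benchmark, the batch size and timeout correspond to the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments shown above.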
@@ -239,7 +182,7 @@ ${HOME}/.local/bin/trtllm-build \
     --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate \
     --max_lora_rank ${MAX_LORA_RANK}
 
-NUM_LORAS=(8 16 24 32 64 128 256)
+NUM_LORAS=(8 16)
 NUM_REQUESTS=1024
 
 # Convert LoRA to cpp format
@@ -271,7 +214,7 @@ for nloras in ${NUM_LORAS[@]}; do
 done
 
 # Generate random lora weights for 256 adapters
-python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 256
+python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 16
 
 # perform benchmarking
 
@@ -284,7 +227,7 @@ mpirun -n ${TP} --output-filename ${EG_DIR}/log-base-lora \
     --dataset "${EG_DIR}/data/token-norm-dist.json" \
     --lora_host_cache_bytes 8589934592 \
     --lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
     --log_level info \
     --eos_id ${EOS_ID}
 
@@ -302,9 +245,56 @@ for nloras in ${NUM_LORAS[@]}; do
     --dataset "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
     --lora_host_cache_bytes 8589934592 \
     --lora_num_device_mod_layers $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
     --log_level info \
     --eos_id ${EOS_ID} \
     --lora_dir ${EG_DIR}/loras
 done
 ```
+
+### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)
+
+#### Prepare TensorRT-LLM engine(s)
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.
+
+Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
+
+#### Launch benchmarking
+
+For detailed usage, you can do the following
+```
+cd cpp/build
+
+# You can directly execute the binary for help information
+./benchmarks/gptSessionBenchmark --help
+./benchmarks/bertBenchmark --help
+```
+
+Take GPT-350M as an example for single GPU
+
+```
+./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_350m/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
+```
+Take GPT-175B as an example for multiple GPUs
+```
+mpirun -n 8 ./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_175b/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
+```
+
+If you want to obtain context and generation logits, you could build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` will enable both of them.
+
+If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
+
+*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*

benchmarks/cpp/gptManagerBenchmark.cpp

+14
@@ -155,6 +155,7 @@ struct BenchmarkParams
     std::optional<SizeType32> maxNumTokens{std::nullopt};
     int randomSeed = 430;
     std::optional<int> maxAttentionWindow{std::nullopt};
+    bool multiBlockMode{false};
 
     // lora / peft params
     std::optional<std::string> loraDir{std::nullopt};
@@ -820,6 +821,7 @@ class ExecutorServer
         executorConfig.setDecodingConfig(texec::DecodingConfig(
             benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
             std::nullopt, benchmarkParams.medusaChoices));
+        executorConfig.setMultiBlockMode(benchmarkParams.multiBlockMode);
 
         mExecutor = std::make_unique<texec::Executor>(trtEnginePath, texec::ModelType::kDECODER_ONLY, executorConfig);
 
@@ -1399,6 +1401,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
     optionalParams.decodingConfig = texec::DecodingConfig(
         benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
         std::nullopt, benchmarkParams.medusaChoices);
+    optionalParams.multiBlockMode = benchmarkParams.multiBlockMode;
 
     auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
     auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1439,6 +1442,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
     auto startLoraLoad = std::chrono::steady_clock::now();
     LoraLib loras(benchmarkParams.loraDir.value());
     SizeType32 reqId = 0;
+    gptServer->resetBatchDeadline();
     for (auto const& [taskId, p] : loras.getLoras())
     {
         reqId++;
@@ -1550,6 +1554,9 @@ void benchmarkExecutor(std::filesystem::path const& engineDir, TrtGptModelType m
     std::vector<texec::Request> requests;
     for (auto& [taskId, p] : loras.getLoras())
     {
+        // squeeze lora configs and weights since LoraConfig requires them to be 2D tensors
+        p.first->squeeze(0);
+        p.second->squeeze(0);
         texec::LoraConfig loraConfig(
             taskId, texec::detail::ofITensor(p.first), texec::detail::ofITensor(p.second));
         Sample s{std::vector<int32_t>{1, 2, 3, 4, 5}, 1, static_cast<int32_t>(taskId)};
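The comment in the added lines above notes that `LoraConfig` expects 2D tensors, which is why the per-task LoRA weights and config have their leading dimension squeezed away. Below is a small NumPy sketch of the same shape manipulation; the shapes are made up purely for illustration and are not the actual executor API.

```python
# Illustration of the squeeze(0) shape change applied to the LoRA tensors.
# Shapes are hypothetical; only the removal of the leading length-1 dimension matters.
import numpy as np

lora_weights = np.zeros((1, 8, 1024), dtype=np.float16)  # e.g. [1, num_modules, flattened weights]
lora_config = np.zeros((1, 8, 3), dtype=np.int32)        # e.g. [1, num_modules, config fields]

weights_2d = np.squeeze(lora_weights, axis=0)  # -> shape (8, 1024)
config_2d = np.squeeze(lora_config, axis=0)    # -> shape (8, 3)

assert weights_2d.ndim == 2 and config_2d.ndim == 2
```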
@@ -1771,6 +1778,10 @@ int main(int argc, char* argv[])
     options.add_options()(
         "medusa_choices", "Medusa choices in the format of [[0], [0, 1], [0, 0, 1]]", cxxopts::value<std::string>());
 
+    options.add_options()("multi_block_mode",
+        "Distribute the work across multiple CUDA thread-blocks on the GPU for masked MHA kernel",
+        cxxopts::value<bool>()->default_value("false"));
+
     auto result = options.parse(argc, argv);
 
     if (result.count("help"))
@@ -1922,6 +1933,9 @@ int main(int argc, char* argv[])
         benchmarkParams.medusaChoices = parseVectorOfVectors(result["medusa_choices"].as<std::string>());
     }
 
+    // Argument: multi_block_mode
+    benchmarkParams.multiBlockMode = result["multi_block_mode"].as<bool>();
+
     std::optional<TokenIdType> padId;
     // Argument: Padding token id
     if (result.count("pad_id"))

benchmarks/python/README.md

+5-2
@@ -1,7 +1,10 @@
-# Benchmark for Python Runtime
+# Benchmark Python Runtime
+
+> [!WARNING] The Python benchmark is not recommended for benchmarking; please use the C++ benchmark instead.
+> The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
 
 This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the Python runtime.
 
 ## Overview
 

benchmarks/python/all_reduce.py

-1
@@ -68,7 +68,6 @@ def allreduce_benchmark(dtype: str,
     ]:
         builder = tllm.Builder()
         net = builder.create_network()
-        net.plugin_config.set_nccl_plugin(dtype, use_custom_all_reduce=True)
         _buffers, workspace = current_all_reduce_helper(
         ).allocate_workspace(mapping, size * dtype_size)
 
