
Commit bca9a33

kaiyux, ttim, MahmoudAshraf97, saeyoonoh, and hattizai authored
Update TensorRT-LLM (NVIDIA#2008)
* Update TensorRT-LLM

---------

Co-authored-by: Timur Abishev <[email protected]>
Co-authored-by: MahmoudAshraf97 <[email protected]>
Co-authored-by: Saeyoon Oh <[email protected]>
Co-authored-by: hattizai <[email protected]>
1 parent 5ddb6bf commit bca9a33

480 files changed, +328259 -6552 lines changed


.gitignore

+3
@@ -48,3 +48,6 @@ results_trt/
 
 # Generated files
 cpp/include/tensorrt_llm/executor/version.h
+
+# User config files
+CMakeUserPresets.json

README.md

+1-1
@@ -7,7 +7,7 @@ TensorRT-LLM
 [![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.4.1-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-10.1.0-green)](https://developer.nvidia.com/tensorrt)
+[![trt](https://img.shields.io/badge/TRT-10.2.0-green)](https://developer.nvidia.com/tensorrt)
 [![version](https://img.shields.io/badge/release-0.12.0.dev-green)](./tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

benchmarks/README.md

+11
@@ -0,0 +1,11 @@
+# TensorRT-LLM Benchmarks
+
+## Overview
+
+There are currently three workflows to benchmark TensorRT-LLM:
+* [C++ benchmarks](./cpp)
+  - The recommended workflow that uses the TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
+* [Python benchmarks](./python)
+  - The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
+* [The Python benchmarking suite](./suite)
+  - This benchmarking suite is currently a work in progress and is prone to large changes.

benchmarks/cpp/README.md

+60-70
@@ -1,7 +1,7 @@
-# Benchmark for C++ Runtime
+# Benchmark C++ Runtime
 
 This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.
 
 ## Usage
 
@@ -16,58 +16,11 @@ Windows users: Follow the
 instead, and be sure to set DLL paths as specified in
 [Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).
 
-### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
-
-#### Prepare TensorRT-LLM engine(s)
-
-Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
-
-Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
-
-#### Launch benchmarking
-
-For detailed usage, you can do the following
-```
-cd cpp/build
-
-# You can directly execute the binary for help information
-./benchmarks/gptSessionBenchmark --help
-./benchmarks/bertBenchmark --help
-```
-
-Take GPT-350M as an example for single GPU
-
-```
-./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_350m/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
-```
-Take GPT-175B as an example for multiple GPUs
-```
-mpirun -n 8 ./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_175b/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
-```
-
-If you want to obtain context and generation logits, you could build an enigne with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enable `--gather_all_token_logits` will enable both of them.
-
-If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
-
-*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
-
-### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
+### 2. Launch C++ benchmarking (Inflight/V1 batching)
 
 #### Prepare dataset
 
-Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*
+Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.
 
 This tool can be used in 2 different modes of traffic generation.
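As a rough illustration of the kind of file the preprocessing step produces (per-sample input token ids plus input and output token lengths), here is a minimal Python sketch that writes such a json. The field names (`samples`, `input_ids`, `input_len`, `output_len`) and the token values are assumptions for illustration only; the authoritative schema is whatever the `prepare_dataset.py` script in this repository actually emits.

```python
# Hypothetical sketch of a preprocessed benchmark dataset; the field names are
# assumptions, not necessarily the schema produced by prepare_dataset.py.
import json

dataset = {
    "samples": [
        # Each sample records its input token ids, the input length, and the
        # number of output tokens the benchmark should generate for it.
        {"input_ids": [101, 2009, 2003, 1037, 3231, 102], "input_len": 6, "output_len": 20},
        {"input_ids": [101, 7592, 2088, 102], "input_len": 4, "output_len": 16},
    ],
}

with open("preprocessed_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

In practice the file should be produced by the preprocessing script described above rather than written by hand.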

@@ -127,7 +80,8 @@ For `tokenizer`, specifying the path to the local tokenizer that have already be
 
 
 #### Prepare TensorRT-LLM engines
-Please make sure that the engines are built with argument `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching, for more details, please see the document in TensorRT-LLM examples.
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).
 
 #### Launch benchmarking

@@ -139,21 +93,10 @@ cd cpp/build
 ./benchmarks/gptManagerBenchmark --help
 ```
 
-Take GPT-350M as an example for single GPU V1 batching
-```
-./benchmarks/gptManagerBenchmark \
-    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type V1 \
-    --request_rate 10 \
-    --dataset ../../benchmarks/cpp/preprocessed_dataset.json
-    --max_num_samples 500
-```
-
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
-    --type IFB \
     --request_rate 10 \
     --dataset ../../benchmarks/cpp/preprocessed_dataset.json
     --max_num_samples 500
@@ -163,10 +106,11 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \
 
 #### Emulated static batching
 
-To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
+To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
+
 Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
 
-`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
+Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
 ```
 python prepare_dataset.py \
     --output tokens-fixed-lengths.json \
@@ -181,7 +125,6 @@ Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type IFB \
     --request-rate -1 \
     --static_emulated_batch_size 32 \
     --static_emulated_timeout 100 \
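To make the emulated static batching behaviour described above concrete, here is a minimal Python sketch of the wait-for-`n`-or-timeout submission loop, assuming a thread-safe request queue and a `submit_batch` callable. It illustrates the described logic only and is not the actual `gptManagerBenchmark` implementation.

```python
# Minimal sketch of emulated static batching: collect up to `batch_size`
# requests, but flush early once `timeout_ms` elapses. Illustrative only.
import queue
import time


def emulated_static_batching(requests: queue.Queue, batch_size: int, timeout_ms: float, submit_batch) -> None:
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + timeout_ms / 1000.0
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timeout reached: submit a smaller batch prematurely
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        # Submission is synchronous here, so the next batch is only collected
        # once the previous one has been processed completely.
        submit_batch(batch)
```

In the real benchmark, the batch size and timeout correspond to the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments shown above.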
@@ -239,7 +182,7 @@ ${HOME}/.local/bin/trtllm-build \
     --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate \
     --max_lora_rank ${MAX_LORA_RANK}
 
-NUM_LORAS=(8 16 24 32 64 128 256)
+NUM_LORAS=(8 16)
 NUM_REQUESTS=1024
 
 # Convert LoRA to cpp format
@@ -271,7 +214,7 @@ for nloras in ${NUM_LORAS[@]}; do
 done
 
 # Generate random lora weights for 256 adapters
-python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 256
+python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 16
 
 # perform benchmarking
 
@@ -284,7 +227,7 @@ mpirun -n ${TP} --output-filename ${EG_DIR}/log-base-lora \
     --dataset "${EG_DIR}/data/token-norm-dist.json" \
     --lora_host_cache_bytes 8589934592 \
     --lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
     --log_level info \
     --eos_id ${EOS_ID}
 
@@ -302,9 +245,56 @@ for nloras in ${NUM_LORAS[@]}; do
     --dataset "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
     --lora_host_cache_bytes 8589934592 \
     --lora_num_device_mod_layers $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
     --log_level info \
     --eos_id ${EOS_ID} \
     --lora_dir ${EG_DIR}/loras
 done
 ```
+
+### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)
+
+#### Prepare TensorRT-LLM engine(s)
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.
+
+Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
+
+#### Launch benchmarking
+
+For detailed usage, you can do the following
+```
+cd cpp/build
+
+# You can directly execute the binary for help information
+./benchmarks/gptSessionBenchmark --help
+./benchmarks/bertBenchmark --help
+```
+
+Take GPT-350M as an example for single GPU
+
+```
+./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_350m/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
+```
+Take GPT-175B as an example for multiple GPUs
+```
+mpirun -n 8 ./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_175b/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
+```
+
+If you want to obtain context and generation logits, you could build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` will enable both of them.
+
+If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
+
+*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*

benchmarks/cpp/gptManagerBenchmark.cpp

+14
@@ -155,6 +155,7 @@ struct BenchmarkParams
     std::optional<SizeType32> maxNumTokens{std::nullopt};
     int randomSeed = 430;
     std::optional<int> maxAttentionWindow{std::nullopt};
+    bool multiBlockMode{false};
 
     // lora / peft params
     std::optional<std::string> loraDir{std::nullopt};
@@ -820,6 +821,7 @@ class ExecutorServer
         executorConfig.setDecodingConfig(texec::DecodingConfig(
             benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
             std::nullopt, benchmarkParams.medusaChoices));
+        executorConfig.setMultiBlockMode(benchmarkParams.multiBlockMode);
 
         mExecutor = std::make_unique<texec::Executor>(trtEnginePath, texec::ModelType::kDECODER_ONLY, executorConfig);
 
@@ -1399,6 +1401,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
     optionalParams.decodingConfig = texec::DecodingConfig(
         benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
         std::nullopt, benchmarkParams.medusaChoices);
+    optionalParams.multiBlockMode = benchmarkParams.multiBlockMode;
 
     auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
     auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1439,6 +1442,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
     auto startLoraLoad = std::chrono::steady_clock::now();
     LoraLib loras(benchmarkParams.loraDir.value());
     SizeType32 reqId = 0;
+    gptServer->resetBatchDeadline();
     for (auto const& [taskId, p] : loras.getLoras())
     {
         reqId++;
@@ -1550,6 +1554,9 @@ void benchmarkExecutor(std::filesystem::path const& engineDir, TrtGptModelType m
     std::vector<texec::Request> requests;
     for (auto& [taskId, p] : loras.getLoras())
     {
+        // squeeze lora configs and weights since LoraConfig requires them to be 2D tensors
+        p.first->squeeze(0);
+        p.second->squeeze(0);
         texec::LoraConfig loraConfig(
             taskId, texec::detail::ofITensor(p.first), texec::detail::ofITensor(p.second));
         Sample s{std::vector<int32_t>{1, 2, 3, 4, 5}, 1, static_cast<int32_t>(taskId)};
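The comment in the added lines above notes that `LoraConfig` expects 2D tensors, which is why the per-task LoRA weights and config have their leading dimension squeezed away. Below is a small NumPy sketch of the same shape manipulation; the shapes are made up purely for illustration and are not the actual executor API.

```python
# Illustration of the squeeze(0) shape change applied to the LoRA tensors.
# Shapes are hypothetical; only the removal of the leading length-1 dimension matters.
import numpy as np

lora_weights = np.zeros((1, 8, 1024), dtype=np.float16)  # e.g. [1, num_modules, flattened weights]
lora_config = np.zeros((1, 8, 3), dtype=np.int32)        # e.g. [1, num_modules, config fields]

weights_2d = np.squeeze(lora_weights, axis=0)  # -> shape (8, 1024)
config_2d = np.squeeze(lora_config, axis=0)    # -> shape (8, 3)

assert weights_2d.ndim == 2 and config_2d.ndim == 2
```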
@@ -1771,6 +1778,10 @@ int main(int argc, char* argv[])
     options.add_options()(
         "medusa_choices", "Medusa choices in the format of [[0], [0, 1], [0, 0, 1]]", cxxopts::value<std::string>());
 
+    options.add_options()("multi_block_mode",
+        "Distribute the work across multiple CUDA thread-blocks on the GPU for masked MHA kernel",
+        cxxopts::value<bool>()->default_value("false"));
+
     auto result = options.parse(argc, argv);
 
     if (result.count("help"))
@@ -1922,6 +1933,9 @@ int main(int argc, char* argv[])
         benchmarkParams.medusaChoices = parseVectorOfVectors(result["medusa_choices"].as<std::string>());
     }
 
+    // Argument: multi_block_mode
+    benchmarkParams.multiBlockMode = result["multi_block_mode"].as<bool>();
+
     std::optional<TokenIdType> padId;
     // Argument: Padding token id
     if (result.count("pad_id"))

benchmarks/python/README.md

+5-2
@@ -1,7 +1,10 @@
-# Benchmark for Python Runtime
+# Benchmark Python Runtime
+
+> [!WARNING] The Python benchmark is not recommended for benchmarking; please use the C++ benchmark instead.
+> The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
 
 This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the Python runtime.
 
 ## Overview
 

benchmarks/python/all_reduce.py

-1
@@ -68,7 +68,6 @@ def allreduce_benchmark(dtype: str,
     ]:
         builder = tllm.Builder()
         net = builder.create_network()
-        net.plugin_config.set_nccl_plugin(dtype, use_custom_all_reduce=True)
         _buffers, workspace = current_all_reduce_helper(
         ).allocate_workspace(mapping, size * dtype_size)
 
