TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture | Results | Examples | Documentation

Latest News

[Weekly] Check out @NVIDIAAIDev & NVIDIA AI LinkedIn for the latest updates!
[2024/02/06] 🚀 Speed up inference with SOTA quantization techniques in TRT-LLM
[2024/01/30] New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
[2023/12/04] Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
[2023/11/27] SageMaker LMI now supports TensorRT-LLM, improving throughput by 60% compared to the previous version
[2023/11/13] H200 achieves nearly 12,000 tok/sec on Llama2-13B
[2023/10/22] 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙
[2023/10/19] Getting Started Guide - Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
[2023/10/17] Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows

TensorRT-LLM Overview

TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
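To make the idea of Tensor Parallelism concrete, here is a minimal sketch in plain Python. It is not TensorRT-LLM code: the helper names (`split_columns`, `column_parallel_matmul`) and the `world_size` parameter are assumptions made for illustration, and the "ranks" are simulated in a single process. It shows that splitting a weight matrix by columns across ranks and concatenating the partial outputs gives the same result as the unsharded matrix multiplication.

```python
# Toy illustration of column-wise tensor parallelism (not TensorRT-LLM code).
# All helper names here are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, world_size):
    """Split the columns of w evenly across world_size simulated ranks."""
    cols = len(w[0])
    assert cols % world_size == 0, "columns must divide evenly across ranks"
    step = cols // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]

def column_parallel_matmul(x, w, world_size):
    """Each simulated rank multiplies x by its column shard of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, world_size)]
    # Stand-in for an all-gather: concatenate each rank's output columns.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)
sharded = column_parallel_matmul(x, w, world_size=2)
print(full == sharded)
```

In a real multi-GPU deployment each shard would live on a different device and the concatenation would be a collective communication step; the sketch only demonstrates the arithmetic equivalence.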

The TensorRT-LLM Python API architecture looks similar to the PyTorch API. It provides a functional module containing functions like einsum, softmax, matmul, or view. The layers module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP, or the entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.
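The three-tier layering described above (stateless functions, reusable layers, model-level composition) can be sketched in plain Python. This is only an analogy for the structure: `ToyMLP` and `ToyModel` are hypothetical names, and none of the functions below are TensorRT-LLM's actual signatures.

```python
import math

# --- "functional"-style helpers: stateless operations on plain lists ---
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# --- "layers"-style building block: holds weights, calls functional ops ---
class ToyMLP:
    def __init__(self, w_in, w_out):
        self.w_in = w_in
        self.w_out = w_out

    def __call__(self, x):
        return matmul(relu(matmul(x, self.w_in)), self.w_out)

# --- "models"-style composition: assembles layers into a full forward pass ---
class ToyModel:
    def __init__(self, mlp):
        self.mlp = mlp

    def __call__(self, x):
        logits = self.mlp(x)
        return [softmax(row) for row in logits]

mlp = ToyMLP(w_in=[[1.0, -1.0], [0.5, 2.0]], w_out=[[1.0, 0.0], [0.0, 1.0]])
probs = ToyModel(mlp)([[1.0, 2.0]])
print(probs)
```

The point of the split is that model definitions stay short: a model only wires layers together, and layers only wire functional operations together.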

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. Refer to the Support Matrix for a list of supported models.

To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (refer to the Support Matrix). TensorRT-LLM supports INT4 or INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
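The two quantization ideas mentioned above can be illustrated with a small plain-Python sketch. It is not TensorRT-LLM code, and every function name is hypothetical. The first part applies symmetric per-channel INT8 quantization to the weights only, keeping activations in floating point. The second part shows only the smoothing step of SmoothQuant, assuming the commonly cited per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with nonzero channel maxima; the subsequent INT8 quantization of activations and weights is omitted.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quantize_int8_per_channel(w):
    """Symmetric INT8 quantization with one scale per output column of w."""
    cols = list(zip(*w))
    # Fall back to a scale of 1.0 for an all-zero column.
    scales = [max(abs(v) for v in col) / 127.0 or 1.0 for col in cols]
    q = [[max(-127, min(127, round(v / s))) for v, s in zip(row, scales)] for row in w]
    return q, scales

def dequantize(q, scales):
    """Recover approximate floating-point weights from INT8 values."""
    return [[v * s for v, s in zip(row, scales)] for row in q]

def smoothquant_scales(x, w, alpha=0.5):
    """Per-input-channel smoothing factors (assumes nonzero channel maxima)."""
    act_max = [max(abs(v) for v in col) for col in zip(*x)]  # columns of x
    wt_max = [max(abs(v) for v in row) for row in w]         # rows of w
    return [a ** alpha / b ** (1 - alpha) for a, b in zip(act_max, wt_max)]

def smooth(x, w, s):
    """Divide activations and multiply weights by s, leaving x @ w unchanged."""
    x_s = [[v / sj for v, sj in zip(row, s)] for row in x]
    w_s = [[v * sj for v in row] for row, sj in zip(w, s)]
    return x_s, w_s

x = [[1.0, 50.0], [-2.0, 40.0]]       # second channel holds activation outliers
w = [[0.5, -1.27], [-0.25, 0.635]]

# Weight-only path: INT8 weights, floating-point activations.
q, scales = quantize_int8_per_channel(w)
y_ref = matmul(x, w)
y_weight_only = matmul(x, dequantize(q, scales))

# SmoothQuant smoothing step: shift outlier magnitude from activations to weights.
s = smoothquant_scales(x, w)
x_s, w_s = smooth(x, w, s)
y_smooth = matmul(x_s, w_s)
```

Smoothing leaves the product mathematically unchanged while shrinking the activation outliers, which is what makes the activations easier to quantize afterwards.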

Getting Started

To get started with TensorRT-LLM, visit our documentation.

