TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

KangCaijun/TensorRT-LLM

This branch is 64 commits behind NVIDIA/TensorRT-LLM:main.

Latest commit: 5d8ca2f · May 21, 2024 · 76 commits
TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture   |   Results   |   Examples   |   Documentation


Latest News

TensorRT-LLM Overview

TensorRT-LLM is an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. It contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system for serving LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
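To make the idea of Tensor Parallelism concrete, here is a minimal sketch in plain Python with no GPU and no TensorRT-LLM calls. It assumes a column-wise split of a weight matrix: each simulated device multiplies the input by its own shard, and the partial outputs are concatenated. The helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical and are not part of the TensorRT-LLM API.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    step = cols // parts
    return [[row[i:i + step] for row in w] for i in range(0, cols, step)]


def tensor_parallel_matmul(x, w, parts):
    """Compute x @ w by giving each simulated device one column shard of w."""
    shards = split_columns(w, parts)
    # Each entry of `partials` stands in for the work done on one device.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate the partial outputs row by row (the "gather" step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
```

In this sketch `sharded` equals `full`, which is the property that lets a large layer be spread across several GPUs without changing its result.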

The architecture of the TensorRT-LLM Python API resembles that of the PyTorch API. It provides a functional module containing functions such as einsum, softmax, matmul, and view. The layers module bundles useful building blocks for assembling LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, such as GPTAttention or BertAttention, can be found in the models module.
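As an illustration of how functional primitives such as matmul and softmax compose into a higher-level block, the following sketch builds single-head scaled dot-product attention in plain Python. It deliberately does not call TensorRT-LLM; every function here is a hypothetical stand-in written for this example, and the real layers (for instance GPTAttention) do considerably more.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


# One query that matches two identical keys equally well,
# so the output is the average of the two value rows.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0, 0.0], [4.0, 0.0]]
out = attention(q, k, v)
```

The layering mirrors the description above: small functional pieces at the bottom, a reusable block built from them on top.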

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. Refer to the Support Matrix for a list of supported models.

To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (refer to the Support Matrix). It supports INT4 or INT8 weights with FP16 activations (also known as INT4/INT8 weight-only quantization), as well as a complete implementation of the SmoothQuant technique.
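The sketch below shows the general idea behind INT8 weight-only quantization in plain Python. It assumes symmetric, per-output-column scaling, which is one common scheme and not necessarily the one TensorRT-LLM uses internally. Weights are stored as integers in [-127, 127] with one floating-point scale per column, and are converted back to floating point before the matrix multiply, while activations stay in floating point throughout. All function names are hypothetical.

```python
def quantize_int8(w):
    """Quantize matrix w (list of rows) to INT8 with one scale per column."""
    scales = []
    for col in zip(*w):
        peak = max(abs(v) for v in col)
        # Fall back to 1.0 for an all-zero column to avoid dividing by zero.
        scales.append(peak / 127.0 if peak > 0 else 1.0)
    q = [[max(-127, min(127, round(v / s))) for v, s in zip(row, scales)]
         for row in w]
    return q, scales


def dequantize(q, scales):
    """Recover approximate floating-point weights from INT8 values."""
    return [[v * s for v, s in zip(row, scales)] for row in q]


def weight_only_matmul(x, q, scales):
    """Multiply float activations x by INT8 weights q (dequantized on the fly)."""
    w = dequantize(q, scales)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


weights = [[0.5, -1.0], [0.25, 2.0]]
q_weights, col_scales = quantize_int8(weights)
approx = weight_only_matmul([[1.0, 1.0]], q_weights, col_scales)
```

Each stored weight takes one byte instead of two (FP16), and the rounding error per weight is at most half of its column's scale.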

Getting Started

To get started with TensorRT-LLM, visit our documentation:
