The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend **you need to compile** `engines` for the models you want to use.
Each `engine` must be compiled for a given set of:
- GPU architecture that you will use for inference (e.g. A100, L40, etc.; see the check below)
- Maximum batch size
- Maximum input length
- Maximum output length
- Maximum beam width
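
Because an engine only runs on the GPU architecture it was compiled for, it is worth confirming which GPU you are targeting before compiling anything. A minimal check, assuming `nvidia-smi` is available on the build machine:

```bash
# Print the GPU model(s) visible on this machine, e.g. "NVIDIA A100-SXM4-80GB"
nvidia-smi --query-gpu=name --format=csv,noheader
```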
## Supported models

Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models are supported.

## Compiling engines

You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you want to use.

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
DESTINATION="/tmp/engines/$MODEL_NAME"
HF_TOKEN="hf_xxx"

# Compile the engine using Optimum-NVIDIA
# This will create a compiled engine in the /tmp/engines/meta-llama/Llama-3.1-8B-Instruct
# directory for 1 GPU
docker run \
    --rm \
    -it \
    --gpus=1 \
    --shm-size=1g \
    -v "$DESTINATION":/engine \
    -e HF_TOKEN=$HF_TOKEN \
    -e HF_HUB_ENABLE_HF_TRANSFER=1 \
    huggingface/optimum-nvidia:v0.1.0b9-py310 \
    bash -c "optimum-cli export trtllm \
        --tp=1 \
        --pp=1 \
        --max-batch-size=64 \
        --max-input-length 4096 \
        --max-output-length 8192 \
        --max-beams-width=1 \
        --destination /tmp/engine \
        $MODEL_NAME && cp -rL /tmp/engine/* /engine/"
```

Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory, in a subfolder named after the GPU used to compile the model.
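
To find the name of that subfolder (you will need it when mounting the engine into TGI below), you can simply list the destination directory. A small sketch, assuming the variables from the previous block are still set in your shell:

```bash
# The entry printed here is the GPU-specific subfolder, referred to as
# <YOUR_GPU_ARCHITECTURE> in the next section.
ls "$DESTINATION"
```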

## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
DESTINATION="/tmp/engines/$MODEL_NAME"
HF_TOKEN="hf_xxx"
docker run \
    --gpus 1 \
    --shm-size=1g \
    -it \
    --rm \
    -p 3000:3000 \
    -e MODEL=$MODEL_NAME \
    -e PORT=3000 \
    -e HF_TOKEN=$HF_TOKEN \
    -v "$DESTINATION"/<YOUR_GPU_ARCHITECTURE>/engines:/data \
    ghcr.io/huggingface/text-generation-inference:latest-trtllm \
    --model-id /data/ \
    --tokenizer-name $MODEL_NAME
```
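
Once the container reports it is ready, you can send a request to check that generation works. A minimal smoke test against TGI's `/generate` endpoint (the prompt and parameters are only illustrative):

```bash
curl http://localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is NVIDIA TensorRT?", "parameters": {"max_new_tokens": 32}}'
```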

## Development

To develop the TRTLLM backend, you can use [dev containers](https://containers.dev/) with the following `.devcontainer.json` file:
```json
{
    "name": "CUDA",
    "build": {
        "dockerfile": "Dockerfile_trtllm",
        "context": ".."
    },
    "remoteEnv": {
        "PATH": "${containerEnv:PATH}:/usr/local/cuda/bin",
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64",
        "XLA_FLAGS": "--xla_gpu_cuda_data_dir=/usr/local/cuda"
    },
    "customizations": {
        "jetbrains": {
            "backend": "CLion"
        }
    }
}
```

and `Dockerfile_trtllm`:

```Dockerfile
ARG cuda_arch_list="75-real;80-real;86-real;89-real;90-real"
ARG build_type=release
ARG ompi_version=4.1.7

# CUDA dependent dependencies resolver stage
FROM nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04 AS cuda-builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    build-essential \
    cmake \
    curl \
    gcc-14 \
    g++-14 \
    git \
    git-lfs \
    lld \
    libssl-dev \
    libucx-dev \
    libasan8 \
    libubsan1 \
    ninja-build \
    pkg-config \
    pipx \
    python3 \
    python3-dev \
    python3-setuptools \
    tar \
    wget --no-install-recommends && \
    pipx ensurepath

ENV TGI_INSTALL_PREFIX=/usr/local/tgi
ENV TENSORRT_INSTALL_PREFIX=/usr/local/tensorrt

# Install OpenMPI
FROM cuda-builder AS mpi-builder
WORKDIR /opt/src/mpi

ARG ompi_version
ENV OMPI_VERSION=${ompi_version}
ENV OMPI_TARBALL_FILENAME=openmpi-${OMPI_VERSION}.tar.bz2
ADD --checksum=sha256:54a33cb7ad81ff0976f15a6cc8003c3922f0f3d8ceed14e1813ef3603f22cd34 \
    https://download.open-mpi.org/release/open-mpi/v4.1/${OMPI_TARBALL_FILENAME} .

RUN tar --strip-components=1 -xf ${OMPI_TARBALL_FILENAME} &&\
    ./configure --prefix=/usr/local/mpi --with-cuda=/usr/local/cuda --with-slurm && \
    make -j all && \
    make install && \
    rm -rf ${OMPI_TARBALL_FILENAME}/..

# Install TensorRT
FROM cuda-builder AS trt-builder
COPY backends/trtllm/scripts/install_tensorrt.sh /opt/install_tensorrt.sh
RUN chmod +x /opt/install_tensorrt.sh && \
    /opt/install_tensorrt.sh

# Build Backend
FROM cuda-builder AS tgi-builder
WORKDIR /usr/src/text-generation-inference

# Scoped global args reuse
ARG cuda_arch_list
ARG build_type
ARG sccache_gha_enabled
ARG actions_cache_url
ARG actions_runtime_token

# Install Rust
ENV PATH="/root/.cargo/bin:$PATH"
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | bash -s -- -y && \
    chmod -R a+w /root/.rustup && \
    chmod -R a+w /root/.cargo && \
    cargo install sccache --locked

ENV LD_LIBRARY_PATH="/usr/local/mpi/lib:$LD_LIBRARY_PATH"
ENV PKG_CONFIG_PATH="/usr/local/mpi/lib/pkgconfig"
ENV CMAKE_PREFIX_PATH="/usr/local/mpi:/usr/local/tensorrt"

ENV USE_LLD_LINKER=ON
ENV CUDA_ARCH_LIST=${cuda_arch_list}
```
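
If your editor does not integrate dev containers, you can also build the same image by hand and work inside it. A sketch, assuming you run it next to `Dockerfile_trtllm` with the repository root one level up (matching the `"context": ".."` setting above); the `tgi-trtllm-dev` tag is arbitrary:

```bash
# Build the development image defined by Dockerfile_trtllm, using the repo root as context.
docker build -f Dockerfile_trtllm -t tgi-trtllm-dev ..

# Open an interactive shell in the image with GPU access and the sources mounted in.
docker run --rm -it --gpus all \
    -v "$(pwd)/..":/usr/src/text-generation-inference \
    tgi-trtllm-dev bash
```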