Commit 81c9ad7

Merge branch 'main' into feature/usage-stats
2 parents: af661fd + dbb23fb

109 files changed, +4275 −3708 lines


.github/workflows/autodocs.yaml

Lines changed: 5 additions & 0 deletions
@@ -30,11 +30,16 @@ jobs:
         id: install-router
         run: cargo install --path router/

+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+
       - name: Set up Python
         uses: actions/setup-python@v2
         with:
           python-version: '3.x'

       - name: Check that documentation is up-to-date
         run: |
+          npm install -g swagger-cli
           python update_doc.py --check
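
The documentation check now depends on `swagger-cli`, which is why Node 22 is set up earlier in the job. As a rough sketch only (an assumption about what `update_doc.py --check` does internally, with a hypothetical spec path), the validation step amounts to something like:

```python
# Sketch only: assumes update_doc.py shells out to the globally installed
# swagger-cli to validate the generated OpenAPI spec; the path is hypothetical.
import subprocess

subprocess.run(
    ["swagger-cli", "validate", "docs/openapi.json"],
    check=True,  # a non-zero exit fails the CI job when the spec is invalid
)
```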

.github/workflows/build.yaml

Lines changed: 9 additions & 4 deletions
@@ -11,6 +11,11 @@ on:
         # - rocm
         # - intel
         required: true
+      release-tests:
+        description: "Run release integration tests"
+        required: true
+        default: false
+        type: boolean

 jobs:
   build-and-push:
@@ -23,7 +28,7 @@ jobs:
       group: ${{ github.workflow }}-build-and-push-image-${{ inputs.hardware }}-${{ github.head_ref || github.run_id }}
       cancel-in-progress: true
     # TODO see with @Glegendre to get CPU runner here instead
-    runs-on: [self-hosted, nvidia-gpu , multi-gpu, 4-a10, ci]
+    runs-on: [self-hosted, intel-cpu, 32-cpu, 256-ram, ci]
     permissions:
       contents: write
       packages: write
@@ -131,8 +136,8 @@ jobs:
           DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
         tags: ${{ steps.meta.outputs.tags || steps.meta-pr.outputs.tags }}
         labels: ${{ steps.meta.outputs.labels || steps.meta-pr.outputs.labels }}
-        cache-from: type=registry,ref=registry-push.github-runners.huggingface.tech/api-inference/community/text-generation-inference:cache${{ env.LABEL }},mode=min
-        cache-to: type=registry,ref=registry-push.github-runners.huggingface.tech/api-inference/community/text-generation-inference:cache${{ env.LABEL }},mode=min
+        cache-from: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
+        cache-to: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
       - name: Final
         id: final
         run: |
@@ -148,7 +153,7 @@ jobs:
     runs-on: ["self-hosted", "${{ needs.build-and-push.outputs.runs_on }}", "multi-gpu"]
     if: needs.build-and-push.outputs.runs_on != 'ubuntu-latest'
     env:
-      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main') && '--release' || '' }}
+      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main' || inputs.release-tests == true) && '--release' || '' }}
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4

.github/workflows/ci_build.yaml

Lines changed: 10 additions & 1 deletion
@@ -20,7 +20,14 @@ on:
       - "Dockerfile_amd"
       - "Dockerfile_intel"
     branches:
-      - 'main'
+      - "main"
+  workflow_dispatch:
+    inputs:
+      release-tests:
+        description: "Run release integration tests"
+        required: true
+        default: false
+        type: boolean

 jobs:
   build:
@@ -33,4 +40,6 @@ jobs:
     uses: ./.github/workflows/build.yaml # calls the one above ^
     with:
       hardware: ${{ matrix.hardware }}
+      # https://github.com/actions/runner/issues/2206
+      release-tests: ${{ inputs.release-tests == true }}
     secrets: inherit
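
With `workflow_dispatch` in place, the release tests can be requested on a manual run. A minimal sketch of triggering it through the GitHub CLI (assuming `gh` is installed and authenticated; the ref is just an example):

```python
# Sketch: dispatch ci_build.yaml with the new boolean input via the GitHub CLI.
# Assumes `gh` is installed and authenticated against the repository.
import subprocess

subprocess.run(
    [
        "gh", "workflow", "run", "ci_build.yaml",
        "--ref", "main",             # example ref
        "-f", "release-tests=true",  # forwarded to build.yaml
    ],
    check=True,
)
```

The `inputs.release-tests == true` comparison in the `with:` block appears to be the workaround for the issue linked in the comment (actions/runner#2206), presumably coercing the dispatched input into a real boolean before it reaches the reusable workflow's `type: boolean` input.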

Cargo.lock

Lines changed: 3 additions & 25 deletions
Some generated files are not rendered by default.

Dockerfile

Lines changed: 6 additions & 1 deletion
@@ -40,7 +40,9 @@ RUN cargo build --profile release-opt
 # Adapted from: https://github.com/pytorch/pytorch/blob/master/Dockerfile
 FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS pytorch-install

+# NOTE: When updating PyTorch version, beware to remove `pip install nvidia-nccl-cu12==2.22.3` below in the Dockerfile. Context: https://github.com/huggingface/text-generation-inference/pull/2099
 ARG PYTORCH_VERSION=2.3.0
+
 ARG PYTHON_VERSION=3.10
 # Keep in sync with `server/pyproject.toml
 ARG CUDA_VERSION=12.1
@@ -241,7 +243,10 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_cuda.txt && \
-    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
+    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
+    pip install nvidia-nccl-cu12==2.22.3
+
+ENV LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2

 # Deps before the binaries
 # The binaries change on every build given we burn the SHA into them
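
The NCCL pin plus `LD_PRELOAD` is meant to make PyTorch pick up the newer NCCL build instead of the one pulled in by the 2.3.0 wheel. A rough sanity check one could run inside the built image (a sketch, assuming the preloaded library is the one PyTorch actually ends up loading):

```python
# Rough sanity check for the NCCL pin: ask PyTorch which NCCL it loaded.
# Run inside the container; with the pin and LD_PRELOAD above this is
# expected (not guaranteed) to report 2.22.3.
import torch

print(torch.cuda.nccl.version())
```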

README.md

Lines changed: 14 additions & 13 deletions
@@ -20,19 +20,20 @@ to power Hugging Chat, the Inference API and Inference Endpoint.

 ## Table of contents

-- [Get Started](#get-started)
-  - [API Documentation](#api-documentation)
-  - [Using a private or gated model](#using-a-private-or-gated-model)
-  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
-  - [Distributed Tracing](#distributed-tracing)
-- [Local Install](#local-install)
-- [CUDA Kernels](#cuda-kernels)
-- [Optimized architectures](#optimized-architectures)
-- [Run Mistral](#run-a-model)
-  - [Run](#run)
-  - [Quantization](#quantization)
-- [Develop](#develop)
-  - [Testing](#testing)
+- [Get Started](#get-started)
+  - [Docker](#docker)
+  - [API documentation](#api-documentation)
+  - [Using a private or gated model](#using-a-private-or-gated-model)
+  - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
+  - [Distributed Tracing](#distributed-tracing)
+- [Architecture](#architecture)
+- [Local install](#local-install)
+- [Optimized architectures](#optimized-architectures)
+- [Run locally](#run-locally)
+  - [Run](#run)
+  - [Quantization](#quantization)
+- [Develop](#develop)
+  - [Testing](#testing)

 Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

clients/python/text_generation/types.py

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ class ChoiceDeltaToolCall(BaseModel):
 class ChoiceDelta(BaseModel):
     role: str
     content: Optional[str] = None
-    tool_calls: Optional[ChoiceDeltaToolCall]
+    tool_calls: Optional[ChoiceDeltaToolCall] = None


 class Choice(BaseModel):
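
The fix gives `tool_calls` an explicit default so that streaming deltas without tool calls still validate. A standalone illustration with a simplified field type (under Pydantic v2 semantics, where an `Optional` annotation alone does not make a field optional):

```python
# Simplified, self-contained illustration (not the actual TGI types): under
# Pydantic v2, Optional[...] without a default is still a *required* field.
from typing import Optional

from pydantic import BaseModel, ValidationError


class WithoutDefault(BaseModel):
    tool_calls: Optional[str]  # may be None, but must be provided explicitly


class WithDefault(BaseModel):
    tool_calls: Optional[str] = None  # may be omitted entirely


print(WithDefault())  # ok: tool_calls=None

try:
    WithoutDefault()  # field omitted
except ValidationError as err:
    print(err)  # reports "Field required" for tool_calls
```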

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,8 @@
     title: Using TGI with Intel Gaudi
   - local: installation_inferentia
     title: Using TGI with AWS Inferentia
+  - local: installation_intel
+    title: Using TGI with Intel GPUs
   - local: installation
     title: Installation from source
   - local: supported_models

docs/source/architecture.md

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ Several variants of the model server exist that are actively supported by Huggin

 - By default, the model server will attempt building [a server optimized for Nvidia GPUs with CUDA](https://huggingface.co/docs/text-generation-inference/installation_nvidia). The code for this version is hosted in the [main TGI repository](https://github.com/huggingface/text-generation-inference).
 - A [version optimized for AMD with ROCm](https://huggingface.co/docs/text-generation-inference/installation_amd) is hosted in the main TGI repository. Some model features differ.
+- A [version optimized for Intel GPUs](https://huggingface.co/docs/text-generation-inference/installation_intel) is hosted in the main TGI repository. Some model features differ.
 - The [version for Intel Gaudi](https://huggingface.co/docs/text-generation-inference/installation_gaudi) is maintained on a forked repository, often resynchronized with the main [TGI repository](https://github.com/huggingface/tgi-gaudi).
 - A [version for Neuron (AWS Inferentia2)](https://huggingface.co/docs/text-generation-inference/installation_inferentia) is maintained as part of [Optimum Neuron](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference).
 - A version for Google TPUs is maintained as part of [Optimum TPU](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference).

docs/source/installation_intel.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Using TGI with Intel GPUs
+
+TGI optimized models are supported on Intel Data Center GPU [Max1100](https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html), [Max1550](https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550/specifications.html), the recommended usage is through Docker.
+
+
+On a server powered by Intel GPUs, TGI can be launched with the following command:
+
+```bash
+model=teknium/OpenHermes-2.5-Mistral-7B
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --rm --privileged --cap-add=sys_nice \
+    --device=/dev/dri \
+    --ipc=host --shm-size 1g --net host -v $volume:/data \
+    ghcr.io/huggingface/text-generation-inference:latest-intel \
+    --model-id $model --cuda-graphs 0
+```
+
+The launched TGI server can then be queried from clients, make sure to check out the [Consuming TGI](./basic_tutorials/consuming_tgi) guide.
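
Once the container is up, a quick smoke test from Python might look like the sketch below, using the repository's own `text_generation` client. The URL assumes the server's default port 80 with `--net host`; the prompt and token count are arbitrary.

```python
# Minimal smoke test against a locally launched TGI server.
# Assumes the docker command above left the server listening on port 80
# (the launcher default with --net host); adjust the URL otherwise.
from text_generation import Client

client = Client("http://127.0.0.1:80")
response = client.generate("What is Deep Learning?", max_new_tokens=20)
print(response.generated_text)
```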
