Upgrade tgi to 2.3.1 #225

Merged 345 commits on Dec 19, 2024.

Commits
71b0189
fix FlashDecoding change's regression in intel platform (#2161)
sywangyi Jul 2, 2024
e913f3a
fix: use the base layers weight in mistral rocm (#2155)
drbh Jul 2, 2024
bc5a792
Fixing rocm. (#2164)
Narsil Jul 2, 2024
d580215
Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167)
Narsil Jul 2, 2024
233e464
feat: improve update_docs for openapi schema (#2169)
drbh Jul 3, 2024
b6c8984
Fixing missing `object` field for regular completions.
Narsil Jul 3, 2024
878491c
Revert "Fixing missing `object` field for regular completions."
Narsil Jul 3, 2024
64989f9
Fixing the dockerfile warnings. (#2173)
Narsil Jul 3, 2024
e93c830
Fixing missing `object` field for regular completions. (#2175)
Narsil Jul 3, 2024
74ddd12
Version 2.1.1
Narsil Jul 4, 2024
2e09ebe
Preparing patch release. (#2186)
Narsil Jul 4, 2024
835ad0a
Adding "longrope" for Phi-3 (#2172) (#2179)
amihalik Jul 5, 2024
1b434e8
Refactor dead code - Removing all `flash_xxx.py` files. (#2166)
Narsil Jul 5, 2024
e481a9b
Hotfixing after refactor.
Narsil Jul 5, 2024
1e7ce69
Fix Starcoder2 after refactor (#2189)
danieldk Jul 5, 2024
54c194d
GPTQ CI improvements (#2151)
danieldk Jul 5, 2024
508e308
Consistently take `prefix` in model constructors (#2191)
danieldk Jul 5, 2024
8e3d1e6
fix dbrx & opt model prefix bug (#2201)
icyxp Jul 8, 2024
f11fd69
hotfix: Fix number of KV heads (#2202)
danieldk Jul 8, 2024
1759491
Fix incorrect cache allocation with multi-query (#2203)
danieldk Jul 8, 2024
540e710
Falcon/DBRX: get correct number of key-value heads (#2205)
danieldk Jul 8, 2024
8dd9b2b
add doc for intel gpus (#2181)
sywangyi Jul 8, 2024
4a54e41
fix: python deserialization (#2178)
jaluma Jul 8, 2024
74edda9
update to metrics 0.23.0 or could work with metrics-exporter-promethe…
sywangyi Jul 8, 2024
48f1196
feat: use model name as adapter id in chat endpoints (#2128)
drbh Jul 8, 2024
eaaea91
Fix nccl regression on PyTorch 2.3 upgrade (#2099)
fxmarty Jul 8, 2024
591f9f7
Adding sanity check to openapi docs.
Narsil Jul 9, 2024
cc4fceb
Updating the self check (#2209)
Narsil Jul 9, 2024
2a6c3ca
Move quantized weight handling out of the `Weights` class (#2194)
danieldk Jul 9, 2024
85c3c5d
Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
danieldk Jul 11, 2024
5029e72
fix: append DONE message to chat stream (#2221)
drbh Jul 11, 2024
dedeb3c
Modifying base in yarn embedding (#2212)
SeongBeomLEE Jul 12, 2024
ee56266
Use symmetric quantization in the `quantize` subcommand (#2120)
danieldk Jul 12, 2024
619eede
feat: simple mistral lora integration tests (#2180)
drbh Jul 15, 2024
271ebb7
fix custom cache dir (#2226)
ErikKaum Jul 15, 2024
8a223eb
fix: Remove bitsandbytes installation when running cpu-only install (…
Hugoch Jul 15, 2024
e955f7b
Add support for AWQ-quantized Idefics2 (#2233)
danieldk Jul 16, 2024
7177da0
`server quantize`: expose groupsize option (#2225)
danieldk Jul 16, 2024
e0710cc
Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237)
danieldk Jul 16, 2024
118ee57
fix(server): fix cohere (#2249)
OlivierDehaene Jul 18, 2024
2dd680b
Improve the handling of quantized weights (#2250)
danieldk Jul 19, 2024
394f8c7
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255)
danieldk Jul 19, 2024
ba0dfb6
Hotfix: various GPT-based model fixes (#2256)
danieldk Jul 19, 2024
990ea79
Hotfix: fix MPT after recent refactor (#2257)
danieldk Jul 19, 2024
e658d95
Hotfix: pass through model revision in `VlmCausalLM` (#2258)
danieldk Jul 19, 2024
66f3de5
usage stats and crash reports (#2220)
ErikKaum Jul 19, 2024
8afc173
add usage stats to toctree (#2260)
ErikKaum Jul 19, 2024
898a892
fix: adjust default tool choice (#2244)
drbh Jul 19, 2024
c1638a5
Add support for Deepseek V2 (#2224)
danieldk Jul 19, 2024
50149c3
Add FP8 release test (#2261)
danieldk Jul 20, 2024
85f10ec
feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
OlivierDehaene Jul 20, 2024
d13215d
fix(server): fix deepseekv2 loading (#2266)
OlivierDehaene Jul 21, 2024
a5aee82
Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
icyxp Jul 22, 2024
758a8b8
legacy warning on text_generation client (#2271)
ErikKaum Jul 22, 2024
a7515b8
fix(server): fix fp8 weight loading (#2268)
OlivierDehaene Jul 22, 2024
568cc9f
Softcapping for gemma2. (#2273)
Narsil Jul 22, 2024
31eb03d
Fixing mistral nemo. (#2276)
Narsil Jul 23, 2024
919da25
fix(l4): fix fp8 logic on l4 (#2277)
OlivierDehaene Jul 23, 2024
26460f0
Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
danieldk Jul 23, 2024
69b67b7
Add support for Mistral-Nemo by supporting head_dim through config (#…
shaltielshmid Jul 23, 2024
5390973
Preparing for release. (#2285)
Narsil Jul 23, 2024
43f4914
Add support for Llama 3 rotary embeddings (#2286)
danieldk Jul 23, 2024
b1077b0
hotfix: pin numpy (#2289)
danieldk Jul 23, 2024
34c472b
chore: update to torch 2.4 (#2259)
OlivierDehaene Jul 23, 2024
a994f6a
hotfix: update nccl
OlivierDehaene Jul 23, 2024
2041421
fix crash in multi-modal (#2245)
sywangyi Jul 24, 2024
d939315
fix of use of unquantized weights in cohere GQA loading, also enable …
sywangyi Jul 24, 2024
457791f
Split up `layers.marlin` into several files (#2292)
danieldk Jul 24, 2024
7ebee37
fix: refactor adapter weight loading and mapping (#2193)
drbh Jul 24, 2024
69db13e
Using g6 instead of g5. (#2281)
Narsil Jul 25, 2024
64ffd64
Some small fixes for the Torch 2.4.0 update (#2304)
danieldk Jul 25, 2024
d5e0543
Fixing idefics on g6 tests. (#2306)
Narsil Jul 25, 2024
1674f44
Fix registry name (#2307)
XciD Jul 25, 2024
fc6d80f
Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313)
danieldk Jul 26, 2024
a87791d
feat: add ruff and resolve issue (#2262)
drbh Jul 26, 2024
2c1d280
Run ci api key (#2315)
ErikKaum Jul 29, 2024
23a3927
Install Marlin from standalone package (#2320)
danieldk Jul 29, 2024
a574381
fix: reject grammars without properties (#2309)
drbh Jul 29, 2024
b1d1d26
patch-error-on-invalid-grammar (#2282)
ErikKaum Jul 29, 2024
bafab73
fix: adjust test snapshots and small refactors (#2323)
drbh Jul 29, 2024
247a29f
server quantize: store quantizer config in standard format (#2299)
danieldk Jul 30, 2024
120d577
Rebase TRT-llm (#2331)
Narsil Jul 31, 2024
468e5c6
Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300)
danieldk Jul 31, 2024
c73d1d6
Pr 2290 ci run (#2329)
drbh Jul 31, 2024
3c4f816
refactor usage stats (#2339)
ErikKaum Jul 31, 2024
d70da59
enable HuggingFaceM4/idefics-9b in intel gpu (#2338)
sywangyi Aug 1, 2024
ccddb30
Fix cache block size for flash decoding (#2351)
danieldk Aug 1, 2024
48fec7b
Unify attention output handling (#2343)
danieldk Aug 1, 2024
688321b
fix: attempt forward on flash attn2 to check hardware support (#2335)
drbh Aug 5, 2024
8b0f5fe
feat: include local lora adapter loading docs (#2359)
drbh Aug 5, 2024
83d1f23
fix: return the out tensor rather then the functions return value (#2…
drbh Aug 6, 2024
88e07f1
feat: implement a templated endpoint for visibility into chat request…
drbh Aug 6, 2024
b4562e1
feat: prefer stop over eos_token to align with openai finish_reason (…
drbh Aug 6, 2024
5400c71
feat: return the generated text when parsing fails (#2353)
drbh Aug 6, 2024
db873be
fix: default num_ln_in_parallel_attn to one if not supplied (#2364)
drbh Aug 6, 2024
3ccde43
fix: prefer original layernorm names for 180B (#2365)
drbh Aug 6, 2024
11fab8a
fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
almersawi Aug 7, 2024
3ea8e8a
add gptj modeling in TGI #2366 (CI RUN) (#2372)
drbh Aug 8, 2024
9b1b545
Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
drbh Aug 8, 2024
06b638f
Pr 2374 ci branch (#2378)
drbh Aug 8, 2024
3893d00
fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
sywangyi Aug 8, 2024
1057f28
Pr 2337 ci branch (#2379)
drbh Aug 8, 2024
853fb96
fix: prefer hidden_activation over hidden_act in gemma2 (#2381)
drbh Aug 8, 2024
b1bc0ec
Update Quantization docs and minor doc fix. (#2368)
Vaibhavs10 Aug 8, 2024
6f2a468
Pr 2352 ci branch (#2382)
drbh Aug 9, 2024
4a16da5
Add FlashInfer support (#2354)
danieldk Aug 9, 2024
dc0fa60
Add experimental flake (#2384)
danieldk Aug 9, 2024
afa14b7
Using HF_HOME instead of CACHE to get token read in addition to model…
Narsil Aug 9, 2024
e9ba044
flake: add fmt and clippy (#2389)
danieldk Aug 9, 2024
1d4a35a
Update documentation for Supported models (#2386)
Vaibhavs10 Aug 9, 2024
df719fd
flake: use rust-overlay (#2390)
danieldk Aug 9, 2024
849bd93
Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385)
Narsil Aug 9, 2024
959add5
feat: add guideline to chat request and template (#2391)
drbh Aug 9, 2024
bb83338
Update flake for 9.0a capability in Torch (#2394)
danieldk Aug 9, 2024
197dd3a
nix: add router to the devshell (#2396)
danieldk Aug 12, 2024
8750dc8
Upgrade fbgemm (#2398)
Narsil Aug 12, 2024
fbe59c6
Adding launcher to build. (#2397)
Narsil Aug 12, 2024
1daaddd
Fixing import exl2 (#2399)
Narsil Aug 12, 2024
b8efd6d
Cpu dockerimage (#2367)
sywangyi Aug 12, 2024
f586cc7
Add support for prefix caching to the v3 router (#2392)
danieldk Aug 12, 2024
6393cde
Keeping the benchmark somewhere (#2401)
Narsil Aug 12, 2024
8e6bfa2
feat: validate template variables before apply and improve sliding wi…
drbh Aug 12, 2024
3079865
fix: allocate tmp based on sgmv kernel if available (#2345)
drbh Aug 12, 2024
96e8fa3
fix: improve completions to send a final chunk with usage details (#2…
drbh Aug 12, 2024
18d6be6
Updating the flake. (#2404)
Narsil Aug 12, 2024
1f8c0f8
Pr 2395 ci run (#2406)
drbh Aug 12, 2024
10b2be6
fix: include create_exllama_buffers and set_device for exllama (#2407)
drbh Aug 12, 2024
eb561bb
nix: incremental build of the launcher (#2410)
danieldk Aug 13, 2024
c5e4c18
Adding more kernels to flake. (#2411)
Narsil Aug 13, 2024
7a4d831
add numa to improve cpu inference perf (#2330)
sywangyi Aug 13, 2024
ffc8fb0
fix: adds causal to attention params (#2408)
drbh Aug 13, 2024
bae161a
nix: partial incremental build of the router (#2416)
danieldk Aug 14, 2024
4baa6ff
Upgrading exl2. (#2415)
Narsil Aug 14, 2024
c3401e0
More fixes trtllm (#2342)
mfuntowicz Aug 14, 2024
e5c39a5
nix: build router incrementally (#2422)
danieldk Aug 15, 2024
df6ea89
Fixing exl2 and other quanize tests again. (#2419)
Narsil Aug 15, 2024
f0181ed
Upgrading the tests to match the current workings. (#2423)
Narsil Aug 15, 2024
20ed7b5
nix: try to reduce the number of Rust rebuilds (#2424)
danieldk Aug 16, 2024
df0e650
Improve the Consuming TGI + Streaming docs. (#2412)
Vaibhavs10 Aug 16, 2024
85df9fc
Further fixes. (#2426)
Narsil Aug 16, 2024
11d25a4
FIxing the CI.
Narsil Aug 16, 2024
53fdbe6
doc: Add metrics documentation and add a 'Reference' section (#2230)
Hugoch Aug 16, 2024
cd208c5
All integration tests back everywhere (too many failed CI). (#2428)
Narsil Aug 16, 2024
ddba272
nix: update to CUDA 12.4 (#2429)
danieldk Aug 19, 2024
635dde8
Prefix caching (#2402)
Narsil Aug 20, 2024
516392d
nix: add pure server to flake, add both pure and impure devshells (#2…
danieldk Aug 20, 2024
a5af557
nix: add `text-generation-benchmark` to pure devshell (#2431)
danieldk Aug 21, 2024
6654c2d
Adding eetq to flake. (#2438)
Narsil Aug 21, 2024
b7d1adc
nix: add awq-inference-engine as server dependency (#2442)
danieldk Aug 21, 2024
92ac02e
nix: add default package (#2453)
danieldk Aug 23, 2024
7aebb95
Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
drbh Aug 26, 2024
73ebbd0
Pr 2451 ci branch (#2454)
drbh Aug 27, 2024
6793b72
Fixing CI. (#2462)
Narsil Aug 27, 2024
e80b2c2
fix: bump minijinja version and add test for llama 3.1 tools (#2463)
drbh Aug 27, 2024
08834e0
fix: improve regex expression (#2468)
drbh Aug 28, 2024
622c9c3
nix: build Torch against MKL and various other improvements (#2469)
danieldk Aug 29, 2024
4e1ca8d
Lots of improvements (Still 2 allocators) (#2449)
Narsil Aug 29, 2024
990478b
feat: add /v1/models endpoint (#2433)
drbh Aug 29, 2024
61b2f49
update doc with intel cpu part (#2420)
sywangyi Aug 29, 2024
a313355
Tied embeddings in MLP speculator. (#2473)
Narsil Aug 29, 2024
07c70e7
nix: improve impure devshell (#2478)
danieldk Sep 2, 2024
3e17cb7
nix: add punica-kernels (#2477)
danieldk Sep 2, 2024
be5cb0c
fix: enable chat requests in vertex endpoint (#2481)
drbh Sep 2, 2024
34a6399
feat: support lora revisions and qkv_proj weights (#2482)
drbh Sep 2, 2024
c7b495f
hotfix: avoid non-prefilled block use when using prefix caching (#2489)
danieldk Sep 5, 2024
556a870
Adding links to Adyen blogpost. (#2492)
Narsil Sep 5, 2024
d8610a6
Add two handy gitignores for Nix environments (#2484)
danieldk Sep 5, 2024
938a7f3
hotfix: fix regression of attention api change in intel platform (#2439)
sywangyi Sep 5, 2024
1e14a94
nix: add pyright/ruff for proper LSP in the impure devshell (#2496)
danieldk Sep 6, 2024
8ba790a
Fix incompatibility with latest `syrupy` and update in Poetry (#2497)
danieldk Sep 6, 2024
67f44cc
radix trie: add assertions (#2491)
danieldk Sep 6, 2024
0198db1
hotfix: add syrupy to the right subproject (#2499)
danieldk Sep 6, 2024
7c2ed55
Add links to Adyen blogpost (#2500)
martinigoyanes Sep 6, 2024
eb54d95
Fixing more correctly the invalid drop of the batch. (#2498)
Narsil Sep 6, 2024
b67a0cd
Add Directory Check to Prevent Redundant Cloning in Build Process (#2…
vamsivallepu Sep 7, 2024
510d1c7
Prefix test - Different kind of load test to trigger prefix test bugs…
Narsil Sep 11, 2024
c6b568b
Fix tokenization yi (#2507)
Narsil Sep 11, 2024
f32fa56
Fix truffle (#2514)
Narsil Sep 11, 2024
7be7ab7
nix: support Python tokenizer conversion in the router (#2515)
danieldk Sep 12, 2024
7d89718
Add nix test. (#2513)
Narsil Sep 12, 2024
5fc0e0c
fix: pass missing revision arg for lora adapter when loading multiple…
drbh Sep 12, 2024
cbfe9e5
hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
sywangyi Sep 12, 2024
afe5cae
Use `ratatui` not (deprecated) `tui` (#2521)
strickvl Sep 13, 2024
e8c3293
Add tests for Mixtral (#2520)
danieldk Sep 16, 2024
0110b83
Adding a test for FD. (#2516)
Narsil Sep 16, 2024
0ecbd61
nix: pure Rust check/fmt/clippy/test (#2525)
danieldk Sep 17, 2024
88b72c8
fix: metrics unbounded memory (#2528)
OlivierDehaene Sep 17, 2024
29a93b7
Move to moe-kernels package and switch to common MoE layer (#2511)
danieldk Sep 17, 2024
2d470c8
Stream options. (#2533)
Narsil Sep 19, 2024
c1a99e2
Update to moe-kenels 0.3.1 (#2535)
danieldk Sep 19, 2024
b6ef2bf
doc: clarify that `--quantize` is not needed for pre-quantized models…
danieldk Sep 19, 2024
3519398
hotfix: ipex fails since cuda moe kernel is not supported (#2532)
sywangyi Sep 20, 2024
bd9675c
fix: wrap python basic logs in debug assertion in launcher (#2539)
OlivierDehaene Sep 20, 2024
514a5a7
Preparing for release. (#2540)
Narsil Sep 20, 2024
14fdc4a
Add some missing modification of 2.3.0 because of conflict
yuanwu2017 Sep 25, 2024
bab529c
Make Gaudi adapt to the tgi 2.3.0
yuanwu2017 Sep 26, 2024
67ee45a
Pass the max_batch_total_tokens to causal_lm
yuanwu2017 Oct 10, 2024
8686a0f
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Oct 23, 2024
8ebe77b
Simplify the warmup
yuanwu2017 Oct 24, 2024
b590310
Add missing import package
yuanwu2017 Oct 25, 2024
9aed9d5
nix: remove unused `_server.nix` file (#2538)
danieldk Sep 23, 2024
73e6090
chore: Add old V2 backend (#2551)
OlivierDehaene Sep 24, 2024
79ac2b7
Micro cleanup. (#2555)
Narsil Sep 24, 2024
68cfc94
Hotfixing main (#2556)
Narsil Sep 24, 2024
32d50c2
Add support for scalar FP8 weight scales (#2550)
danieldk Sep 24, 2024
d4f995e
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537)
danieldk Sep 24, 2024
8c6d3e0
Update the link to the Ratatui organization (#2546)
orhun Sep 24, 2024
5247f89
Simplify crossterm imports (#2545)
orhun Sep 24, 2024
782130d
Adding note for private models in quick-tour document (#2548)
ariG23498 Sep 24, 2024
25e0edf
Hotfixing main. (#2562)
Narsil Sep 24, 2024
97d4bdd
Cleanup Vertex + Chat (#2553)
Narsil Sep 24, 2024
a684a81
More tensor cores. (#2558)
Narsil Sep 24, 2024
0817643
remove LORA_ADAPTERS_PATH (#2563)
nbroad1881 Sep 24, 2024
6976cf8
Add LoRA adapters support for Gemma2 (#2567)
alvarobartt Sep 26, 2024
bc28f86
Fix build with `--features google` (#2566)
alvarobartt Sep 26, 2024
653193a
Improve support for GPUs with capability < 8 (#2575)
danieldk Sep 27, 2024
f82a3f5
flashinfer: pass window size and dtype (#2574)
danieldk Sep 28, 2024
55fd281
Remove compute capability lazy cell (#2580)
danieldk Sep 30, 2024
6808b2d
Update architecture.md (#2577)
ulhaqi12 Sep 30, 2024
ff905ae
Update ROCM libs and improvements (#2579)
mht-sharma Sep 30, 2024
288bcb0
Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
danieldk Sep 30, 2024
bdc4739
feat: support phi3.5 moe (#2479)
drbh Sep 30, 2024
692f8dd
Move flake back to tgi-nix `main` (#2586)
danieldk Sep 30, 2024
775e5f4
MoE Marlin: support `desc_act` for `groupsize != -1` (#2590)
danieldk Sep 30, 2024
fa964f8
nix: experimental support for building a Docker container (#2470)
danieldk Oct 1, 2024
51506aa
Mllama flash version (#2585)
Narsil Oct 2, 2024
967e671
Max token capacity metric (#2595)
Narsil Oct 2, 2024
7664d2e
CI (2592): Allow LoRA adapter revision in server launcher (#2602)
drbh Oct 2, 2024
902f526
Unroll notify error into generate response (#2597)
drbh Oct 2, 2024
34e98b1
New release 2.3.1 (#2604)
Narsil Oct 3, 2024
7e282b4
V2.3.1
Narsil Oct 3, 2024
372e071
Fix the issues of tgi-gaudi for v.2.3.1
yuanwu2017 Oct 27, 2024
c23584f
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Oct 27, 2024
4c9856f
Add missing package
yuanwu2017 Oct 28, 2024
fcf2e3a
Fix the prefill warmup issue
yuanwu2017 Nov 1, 2024
c345c73
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Nov 1, 2024
636cdb4
Fix startcode issue
yuanwu2017 Nov 26, 2024
b83419a
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Nov 28, 2024
4586325
Fix the starCode warmup issue
yuanwu2017 Nov 26, 2024
0228bd0
Doesn't run the prefill warmup when limit_hpu_graph=true
yuanwu2017 Dec 1, 2024
253a992
Remove the CI workflows we don't currently support
yuanwu2017 Dec 2, 2024
9f356ce
Refine the warmup process
yuanwu2017 Dec 7, 2024
73e6e3b
Remove the error log
yuanwu2017 Dec 8, 2024
1b65978
Add the no-deps in pip install
yuanwu2017 Dec 8, 2024
c6f023a
Use optimum-habana v1.15-release branch
yuanwu2017 Dec 8, 2024
c922ef9
Fix the warmup issue of llama2-7B.
yuanwu2017 Dec 9, 2024
c3b8899
Revert "Use optimum-habana v1.15-release branch"
yuanwu2017 Dec 11, 2024
15de6c9
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Dec 17, 2024
eaeef6e
Remove the useless modifications
yuanwu2017 Dec 17, 2024
8e2e5d8
Fix benchmark build error
yuanwu2017 Dec 17, 2024
Empty file added: .devcontainer/Dockerfile.trtllm
Empty file added: .devcontainer/devcontainer.json
3 changes: 3 additions & 0 deletions .dockerignore
@@ -2,3 +2,6 @@ aml
target
server/transformers
server/flash-attention
cmake-build-debug/
cmake-build-release/
Dockerfile*
9 changes: 8 additions & 1 deletion .gitignore
@@ -3,9 +3,12 @@ target
router/tokenizer.json
*__pycache__*

backends/v2/src/client/pb
backends/v3/src/client/pb

# ROCm auto-generated files
*.hip
server/exllamav2_kernels/exllamav2_kernels/hip/
server/exllamav2
server/exllama_kernels/exllama_kernels/hip/
server/exllama_kernels/exllama_kernels/hip_func/
*_hip.cuh
@@ -14,3 +17,7 @@ server/exllama_kernels/exllama_kernels/exllama_ext_hip.cpp

data/
load_tests/*.json
server/fbgemmm

.direnv/
.venv/
9 changes: 7 additions & 2 deletions .pre-commit-config.yaml
@@ -5,14 +5,19 @@ repos:
       - id: check-yaml
       - id: end-of-file-fixer
       - id: trailing-whitespace
-        exclude: docs/source/basic_tutorials/launcher.md
+        exclude: docs/source/reference/launcher.md
   - repo: https://github.com/psf/black
     rev: 24.2.0
     hooks:
       - id: black
   - repo: https://github.com/doublify/pre-commit-rust
     rev: v1.0
     hooks:
       - id: fmt
-      - id: cargo-check
+      - id: clippy
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.3.0
+    hooks:
+      - id: ruff
+        args: [--fix, --exit-non-zero-on-fix]
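
With this configuration in place, the same checks can be run locally before pushing. A minimal sketch, assuming `pre-commit` is installed from PyPI and the commands are run from the repository root:

```bash
# Install the hook runner and register it with git (one-time setup).
pip install pre-commit
pre-commit install

# Run every configured hook (check-yaml, end-of-file-fixer,
# trailing-whitespace, black, rust fmt/clippy, ruff) once across the tree.
pre-commit run --all-files
```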
82 changes: 82 additions & 0 deletions .redocly.lint-ignore.yaml
@@ -0,0 +1,82 @@
# This file instructs Redocly's linter to ignore the rules contained for specific parts of your API.
# See https://redoc.ly/docs/cli/ for more information.
docs/openapi.json:
  no-empty-servers:
    - '#/openapi'
  spec:
    - >-
      #/components/schemas/GenerateParameters/properties/best_of/exclusiveMinimum
    - >-
      #/components/schemas/GenerateParameters/properties/frequency_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/grammar/nullable'
    - >-
      #/components/schemas/GenerateParameters/properties/repetition_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/seed/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/temperature/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_k/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/top_n_tokens/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_p/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/typical_p/exclusiveMinimum
    - '#/components/schemas/GenerateResponse/properties/details/nullable'
    - '#/components/schemas/StreamResponse/properties/details/nullable'
    - '#/components/schemas/ChatRequest/properties/response_format/nullable'
    - '#/components/schemas/ChatRequest/properties/stream_options/nullable'
    - '#/components/schemas/ChatRequest/properties/tool_choice/nullable'
    - '#/components/schemas/ToolChoice/nullable'
    - '#/components/schemas/ChatCompletionComplete/properties/logprobs/nullable'
    - '#/components/schemas/ChatCompletionChunk/properties/usage/nullable'
    - '#/components/schemas/ChatCompletionChoice/properties/logprobs/nullable'
  no-invalid-media-type-examples:
    - '#/paths/~1/post/responses/422/content/application~1json/example'
    - '#/paths/~1/post/responses/424/content/application~1json/example'
    - '#/paths/~1/post/responses/429/content/application~1json/example'
    - '#/paths/~1/post/responses/500/content/application~1json/example'
    - '#/paths/~1generate/post/responses/422/content/application~1json/example'
    - '#/paths/~1generate/post/responses/424/content/application~1json/example'
    - '#/paths/~1generate/post/responses/429/content/application~1json/example'
    - '#/paths/~1generate/post/responses/500/content/application~1json/example'
    - >-
      #/paths/~1generate_stream/post/responses/422/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/424/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/429/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/500/content/text~1event-stream/example
    - '#/paths/~1tokenize/post/responses/404/content/application~1json/example'
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/500/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/500/content/application~1json/example
  operation-4xx-response:
    - '#/paths/~1health/get/responses'
    - '#/paths/~1info/get/responses'
    - '#/paths/~1metrics/get/responses'
  no-unused-components:
    - '#/components/schemas/Completion'
  security-defined:
    - '#/paths/~1/post'
    - '#/paths/~1generate/post'
    - '#/paths/~1generate_stream/post'
    - '#/paths/~1health/get'
    - '#/paths/~1info/get'
    - '#/paths/~1metrics/get'
    - '#/paths/~1tokenize/post'
    - '#/paths/~1v1~1chat~1completions/post'
    - '#/paths/~1v1~1completions/post'
    - '#/paths/~1v1~1models/get'
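
The ignore file above is consumed by Redocly's linter (see the docs link in the file header) when it validates `docs/openapi.json`. A sketch of a local lint run, assuming the Redocly CLI is fetched via npm:

```bash
# Lint the OpenAPI schema; findings listed in .redocly.lint-ignore.yaml
# are suppressed rather than reported.
npx @redocly/cli lint docs/openapi.json
```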
133 changes: 133 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,133 @@

# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
[email protected].
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of
actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations
120 changes: 120 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,120 @@
<!---
Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Contribute to text-generation-inference

Everyone is welcome to contribute, and we value everybody's contribution. Code
contributions are not the only way to help the community. Answering questions, helping
others, and improving the documentation are also immensely valuable.

It also helps us if you spread the word! Reference the library in blog posts
about the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply ⭐️ the repository to say thank you.

However you choose to contribute, please be mindful and respect our
[code of conduct](https://github.com/huggingface/text-generation-inference/blob/main/CODE_OF_CONDUCT.md).

**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**

## Ways to contribute

There are several ways you can contribute to text-generation-inference.

* Fix outstanding issues with the existing code.
* Submit issues related to bugs or desired new features.
* Contribute to the examples or to the documentation.

> All contributions are equally valuable to the community. 🥰

## Fixing outstanding issues

If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) and open
a Pull Request!

## Submitting a bug-related issue or feature request

Do your best to follow these guidelines when submitting a bug-related issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.

### Did you find a bug?

The text-generation-inference library is robust and reliable thanks to users who report the problems they encounter.

Before you report an issue, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the
library itself, and not your code.

Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so
we can quickly resolve it:

* Your **OS type and version**, as well as your environment versions (versions of rust, python, and dependencies).
* A short, self-contained code snippet that allows us to reproduce the bug (see the sketch after this list).
* The *full* traceback if an exception is raised.
* Attach any other information, like screenshots, that you think may help.
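
For TGI, such a snippet is often just the request that triggers the failure. A minimal sketch, assuming a server is already running locally on port 8080; the prompt and parameters are placeholders:

```bash
# Send one /generate request that reproduces the problem.
curl -s http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'
```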

To get the OS and software versions automatically, you can re-run the launcher with the `--env` flag:

```bash
text-generation-launcher --env
```

This prints information about your environment ahead of the model launch. We recommend pasting it into your issue report.

### Do you want a new feature?

If there is a new feature you'd like to see in text-generation-inference, please open an issue and describe:

1. What is the *motivation* behind this feature? Is it related to a problem or frustration with the library? Is it
a feature related to something you need for a project? Is it something you worked on and think it could benefit
the community?

Whatever it is, we'd love to hear about it!

2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better
we'll be able to help you.
3. Provide a *code snippet* that demonstrates the feature's usage.
4. If the feature is related to a paper, please include a link.

If your issue is well written, we're already 80% of the way there by the time you create it.

We have added [templates](https://github.com/huggingface/text-generation-inference/tree/main/.github/ISSUE_TEMPLATE)
to help you get started with your issue.

## Do you want to implement a new model?

New models are constantly released and if you want to implement a new model, please provide the following information:

* A short description of the model and a link to the paper.
* Link to the implementation if it is open-sourced.
* Link to the model weights if they are available.

If you are willing to contribute the model yourself, let us know so we can help you add it to text-generation-inference!

## Do you want to add documentation?

We're always looking for improvements that make the documentation clearer and more accurate. Please let us know about typos and any content that is missing, unclear, or inaccurate. We'll be happy to make the changes or to help you contribute if you're interested!

## I want to become a maintainer of the project. How do I get there?

TGI is a project led and managed by Hugging Face as it powers our internal services. However, we are happy to have
motivated individuals from other organizations join us as maintainers with the goal of making TGI the best inference
service.

If you are such an individual (or organization), please reach out to us and let's collaborate.