System Info

text-generation-inference version: 2.0.4, running the standard Docker image ghcr.io/huggingface/text-generation-inference, configured via environment variables.

Reproduction

Deploy the image with the model microsoft/Phi-3-small-128k-instruct on an A100-80GB GPU.

Expected behavior

Given how TGI handles the warmup phase, we expect VRAM usage to be close to its maximum (~75 GB) once the model is loaded. However, as soon as the server accepts any load, VRAM usage drops to around 28 GB and stays there (see the utilization graph below). This causes our model to perform badly in stress tests.

In contrast, the same deployment scheme with microsoft/Phi-3-mini-128k-instruct and Phi-3-medium-128k-instruct keeps GPU memory saturated constantly, keeping throughput and latency within the expected bounds. This raises two questions:

Why is the GPU unable to keep its VRAM saturated with Phi-3-small? We know that the model bypasses the automatic setting of the batch prefill token limit during warmup, but that alone should not cause GPU memory utilization to plummet under request load.

Why does the Phi-3-small model crash at such a low MAX_BATCH_PREFILL_TOKENS value during warmup? It is currently set to 60000, and any higher value causes the model to crash during warmup. In contrast, the Phi-3-mini and Phi-3-medium deployments have that value set to >140000 (mini) and ~100000 (medium), so we expected the small (7B parameters, similar architecture) to fall somewhere between those limits.
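For context, a minimal sketch of the kind of stress test described above, assuming a TGI endpoint at http://localhost:8080 (placeholder) and using huggingface_hub's InferenceClient; prompt length and concurrency are illustrative only:

```python
# Hedged sketch: endpoint URL, prompt length, and concurrency are placeholders.
# Sends many long-prefill requests to the TGI server; GPU memory can be watched
# separately (e.g. with nvidia-smi) while this runs.
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint

def one_request(_: int) -> int:
    text = client.text_generation(
        "Summarize the following: " + "lorem ipsum " * 2000,  # long prompt to exercise prefill
        max_new_tokens=128,
    )
    return len(text)

with ThreadPoolExecutor(max_workers=32) as pool:
    outputs = list(pool.map(one_request, range(256)))

print(f"completed {len(outputs)} requests")
```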
The reason for this difference in behaviour is that Phi-3 small is not natively supported, whereas mini and medium are. Phi-3 small therefore uses the AutoModel implementation, which requires a lot more memory and does not have feature parity with native models (no flash/paged attention, padding, ...).
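A quick way to see the difference (a minimal check, not TGI code; it assumes transformers is installed and trust_remote_code is acceptable) is to look at the architecture class each checkpoint declares in its config:

```python
# Minimal check (assumes transformers is installed and the checkpoints are
# reachable): prints the architecture class each checkpoint declares. The
# mini/medium checkpoints declare an architecture TGI implements natively,
# while Phi-3 small declares a custom class (hence the AutoModel fallback).
from transformers import AutoConfig

for repo in (
    "microsoft/Phi-3-mini-128k-instruct",
    "microsoft/Phi-3-medium-128k-instruct",
    "microsoft/Phi-3-small-128k-instruct",
):
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    print(repo, cfg.architectures)
```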
Phi3 small is really different from the other flavours, from what I could find so far:
Uses block sparse attention with dense attention every n layers.
Does not use the standard softmax scaling of 1/sqrt(head_size).
Does not use regular gating in the MLP layers; instead it uses the gegelu activation, which does the gating internally with a linear gate. A big difference is that the gate and non-gate parameters are interleaved (a_gelu, a_linear = input[..., ::2], input[..., 1::2]); see the sketch after this list.
Residual + layernorm (not RMS) is arranged differently than in Llama.
Column-packed QKV with GQA
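To illustrate the interleaved gating mentioned above, here is a simplified sketch of a gegelu-style activation; the actual Phi-3 small implementation has additional details not shown here, and this is not TGI code:

```python
import torch
import torch.nn.functional as F

def gegelu_interleaved(x: torch.Tensor) -> torch.Tensor:
    """Simplified gegelu-style gating with interleaved parameters.

    The up-projection output interleaves gate and linear channels, so they are
    de-interleaved with strided slicing rather than split into two halves as in
    a Llama-style gated MLP. The real Phi-3 small activation has further
    details (e.g. clamping) that are omitted here.
    """
    a_gelu, a_linear = x[..., ::2], x[..., 1::2]  # de-interleave gate / linear
    return F.gelu(a_gelu) * a_linear              # gate the linear half

# Contrast: a Llama-style gated MLP splits separate gate/up projections,
# e.g. gate, up = x.chunk(2, dim=-1); out = F.silu(gate) * up
```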
We are considering adding native support for it but have not done it yet.
On a side note, Phi-3 mini is affected by #2055, solved in #2060.