GPU memory not saturated using microsoft/Phi-3-small-128k-instruct #2040

Closed · calwoo opened this issue Jun 7, 2024 · 2 comments

calwoo commented Jun 7, 2024

System Info

text-generation-inference version: 2.0.4
Using the standard Docker image ghcr.io/huggingface/text-generation-inference with the following environment variables:

VALIDATION_WORKERS: '2'
SHARDED: 'false'
TRUST_REMOTE_CODE: 'true'
MAX_CONCURRENT_REQUESTS: '300'
MAX_BEST_OF: '1'
MAX_STOP_SEQUENCES: '4'
MAX_INPUT_LENGTH: '60000'
MAX_TOTAL_TOKENS: '60001'
DISABLE_CUSTOM_KERNELS: 'false'
WAITING_SERVED_RATIO: '1.2'
MAX_BATCH_PREFILL_TOKENS: '60000'
MAX_WAITING_TOKENS: '20'

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Deploy image and model microsoft/Phi-3-small-128k-instruct on an A100-80GB GPU.
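For context, the VRAM drop described below only shows up once the server is handling traffic, so something like the following client-side load generator can reproduce it. This is a sketch: the endpoint URL, prompt, and concurrency level are placeholders, not values from the original report.

```python
# Hypothetical load generator: fires concurrent requests at a local TGI
# endpoint to reproduce the drop in VRAM usage under load.
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed endpoint

def one_request(i: int) -> str:
    return client.text_generation(
        f"Request {i}: summarize the history of GPUs.",  # placeholder prompt
        max_new_tokens=512,
    )

# 64 concurrent requests is an arbitrary choice; the report says the drop
# appears "once accepting any load".
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(one_request, range(64)))
```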

Expected behavior

Given how TGI handles the warmup phase, we expect VRAM usage to be close to maxed out (~75 GB) once the model is loaded. However, as soon as the server starts accepting load, VRAM usage drops to around 28 GB and stays there (see the utilization graph below).

[Screenshot: GPU memory utilization graph, 2024-06-07]

This causes the model to perform badly under stress tests.

In contrast, the same deployment scheme using microsoft/Phi-3-mini-128k-instruct and Phi-3-medium-128k-instruct keeps GPU memory saturated constantly, and throughput and latency stay within expected bounds. The two questions are:

  1. Why is the GPU unable to keep its VRAM saturated with Phi-3-small? We know the model bypasses the automatic setting of batch prefill token limits during warmup, but I don't think that should cause GPU memory utilization to plummet under request load.

  2. Why does the Phi-3-small model crash at such a low MAX_BATCH_PREFILL_TOKENS value during warmup? It is currently set to 60000, and any higher value causes a crash at warmup. In contrast, the Phi-3-mini and Phi-3-medium deployments have that value set to >140000 and ~100000 respectively, so we expected the small (7B parameters, similar architecture) to fall between those limits.

@OlivierDehaene (Contributor) commented
The reason for this difference in behaviour is that Phi-3 small is not natively supported, whereas mini and medium are. Phi-3 small therefore falls back to the AutoModel implementation, which requires a lot more memory and is not feature-equivalent to native models (no flash/paged attention, padding, ...).
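To make the memory gap concrete, here is a rough back-of-the-envelope sketch of KV-cache sizing under padded versus paged allocation. All model dimensions below are illustrative assumptions, not values confirmed in this thread.

```python
# Rough KV-cache arithmetic for a padded (AutoModel) batch versus
# token-granular paged allocation. All model dimensions below are
# illustrative assumptions, not confirmed Phi-3-small values.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # fp16/bf16
max_total_tokens = 60_001  # from MAX_TOTAL_TOKENS above

# Bytes of KV cache per token: K and V, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Padded attention reserves the full window for every sequence...
padded = 16 * max_total_tokens * kv_bytes_per_token  # batch of 16
# ...while paged attention only allocates for tokens actually present.
paged = 16 * 2_000 * kv_bytes_per_token  # e.g. ~2k live tokens each

print(f"per token: {kv_bytes_per_token / 2**20:.3f} MiB")   # ~0.125 MiB
print(f"padded batch of 16: {padded / 2**30:.1f} GiB")      # ~117 GiB
print(f"paged, ~2k tokens each: {paged / 2**30:.2f} GiB")   # ~3.9 GiB
```

Under these assumed dimensions, a padded batch at the 60k window blows far past an A100-80GB, which is consistent with the warmup crashes described in question 2.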

Phi3 small is really different from the other flavours, from what I could find so far:

  • Uses block sparse attention with dense attention every n layers.
  • Does not use the standard 1/sqrt(head_size) softmax scaling.
  • Does not use regular (Llama-style) gating in the MLP layers; it uses the gegelu activation instead, which gates internally with a linear branch. A big difference is that the gate and non-gate parameters are interleaved (a_gelu, a_linear = input[..., ::2], input[..., 1::2]); see the sketch after this list.
  • Residual + layernorm (not RMS) is arranged differently than in Llama.
  • Column-packed QKV with GQA
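A minimal PyTorch sketch of that interleaved gating, to make the layout concrete. The reference implementation may additionally clamp the activations or offset the linear branch; this only illustrates the split and gating described above.

```python
import torch
import torch.nn.functional as F

def gegelu(x: torch.Tensor) -> torch.Tensor:
    # Gate and non-gate channels are interleaved along the last dim,
    # unlike Llama-style MLPs, where the two halves come from separate
    # (or contiguously packed) projections.
    a_gelu, a_linear = x[..., ::2], x[..., 1::2]
    # The GELU branch gates the linear branch; output has half the channels.
    return F.gelu(a_gelu) * a_linear

x = torch.randn(2, 16, 8)  # (batch, seq, 2 * intermediate)
print(gegelu(x).shape)     # torch.Size([2, 16, 4])
```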

We are considering adding native support for it but have not done it yet.

On a sidenote, phi3 mini is affected by #2055, solved in #2060.

github-actions bot commented

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions added the Stale label on Jul 13, 2024, and closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 18, 2024.