We provide a set of examples to help you serve large language models. By default, we use vLLM as the backend.
- Deploy models from Huggingface
- Deploy models from ModelScope
- Deploy models from ObjectStore
- Deploy models via SGLang
- Deploy models via llama.cpp
- Deploy models via text-generation-inference
- Deploy models via ollama
- Speculative Decoding with vLLM
- Deploy multi-host inference
- Deploy host models
Deploy models hosted on Hugging Face; see example here.
Note: if your model needs a Hugging Face token to download weights, create the secret beforehand by running
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>
In theory, we support models of any size. In practice, download bandwidth is the limiting factor. For example, the llama2-7B model takes about 15 GB; with 200 Mbps of bandwidth, downloading it takes about 10 minutes, so bandwidth plays a vital role here.
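The estimate above is simple arithmetic and can be reproduced with a minimal sketch. The function name `download_minutes` is hypothetical and not part of this project; it assumes decimal units (1 GB = 8000 megabits) and ignores protocol overhead, so real downloads will be somewhat slower.

```python
def download_minutes(model_size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate download time in minutes (hypothetical helper, ignores overhead)."""
    # Convert gigabytes to megabits using decimal units: 1 GB = 1000 MB = 8000 Mb.
    size_megabits = model_size_gb * 1000 * 8
    # Megabits divided by megabits-per-second gives seconds.
    seconds = size_megabits / bandwidth_mbps
    return seconds / 60


# llama2-7B (about 15 GB) over a 200 Mbps link.
print(f"{download_minutes(15, 200):.1f} min")  # prints "10.0 min"
```

The same helper works for any model size and link speed, which makes it easy to check whether a given bandwidth is acceptable before deploying.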
Deploy models hosted on ModelScope; see example here. It works much like the other backends.
Deploy models stored in object stores. We support various providers; see the full list below.
In theory, loading the Qwen2-7B model, which occupies about 14.2 GB, over an intranet link of about 800 Mbps takes about 2 to 3 minutes. However, the intranet bandwidth can be improved.
- Alibaba Cloud OSS, see example here
Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running
kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>
By default, we use vLLM as the inference backend. If you want to use another backend such as SGLang, see example here.
llama.cpp can serve models on a wide variety of hardware, such as CPUs; see example here.
text-generation-inference is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints. See example here.
ollama, which is based on llama.cpp, targets local deployment. See example here.
Speculative decoding can improve inference performance; see example here.
Model sizes keep growing. Llama 3.1 405B in FP16 requires more than 750 GB of GPU memory for the weights alone, before accounting for the KV cache. Even a host with 8 x NVIDIA H100 GPUs, each with 80 GB of HBM (640 GB in total), cannot hold the model, so a multi-host deployment is required; see example here.
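The sizing argument above can be checked with a rough sketch. The helpers `weight_memory_gb` and `min_hosts` are hypothetical names used only for illustration. The result is a lower bound for the weights alone: it ignores the KV cache, activations, and runtime overhead, so a real deployment may need more capacity.

```python
import math


def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory in decimal GB (FP16 uses 2 bytes per parameter)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB.
    return num_params_billion * bytes_per_param


def min_hosts(weights_gb: float, gpus_per_host: int = 8, gpu_memory_gb: int = 80) -> int:
    """Lower bound on host count, assuming weights are sharded evenly across GPUs."""
    host_capacity_gb = gpus_per_host * gpu_memory_gb  # 8 x 80 GB = 640 GB
    return math.ceil(weights_gb / host_capacity_gb)


weights = weight_memory_gb(405.0)  # Llama 3.1 405B in FP16
print(f"weights: {weights:.0f} GB, hosts needed: {min_hosts(weights)}")
```

With these assumptions the weights come to 810 GB (about 754 GiB), which exceeds the 640 GB available on a single 8 x H100 host, so at least two hosts are needed.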
Models can be preloaded onto the hosts in advance, which is especially useful for extremely large models; see example to serve local models.