Slipstream tool that stands up an OpenAI-compatible inference endpoint for any HuggingFace model on sglang or vllm, on a user-specified GPU shape. TP/DP/EP are computed automatically from model size + GPU spec.
Looking for the why? See docs/architecture.md for the request flow and docs/adr/ for the decisions behind every non-obvious design choice.
| Name | Description |
|---|---|
TARGET_MODEL |
HuggingFace model ID, e.g. Qwen/Qwen2.5-1.5B-Instruct, meta-llama/Llama-3.1-70B-Instruct. |
ACCELERATORS |
SkyPilot GPU spec, e.g. L4:1, A100:8, H100:4. Cost and availability are user-controlled — we don't auto-pick. |
| Name | Default | Description |
|---|---|---|
ENGINE |
sglang |
sglang or vllm. |
SERVED_MODEL_NAME |
basename of TARGET_MODEL |
Alias clients use. |
PORT |
8000 |
Same default for both engines. |
MAX_MODEL_LEN |
(engine default) | sglang --context-length; vllm --max-model-len (auto to fit KV cache). |
DTYPE |
(engine default) | auto, bfloat16, float16, float32. |
QUANTIZATION |
none | sglang: fp8 or torchao methods (int4wo-128, int8wo, int8dq). vllm: fp8, bitsandbytes, torchao. |
MEM_FRACTION |
0.85 |
Fraction of GPU memory the server may use. |
HF_TOKEN |
empty | Required for gated models. |
SPEC_ENABLED |
false |
Speculative decoding on/off. |
SPEC_DRAFT_MODEL |
empty | HuggingFace ID of the speculator, e.g. RedHatAI/Qwen3-14B-speculator.eagle3. |
SPEC_METHOD |
eagle3 |
sglang: EAGLE/EAGLE3/STANDALONE. vllm: eagle/eagle3. |
SPEC_NUM_TOKENS |
3 |
Tokens drafted per step. |
This tool runs through slipstream in production. To test it directly via baby-shark on GCP, follow these steps.
# GCS bucket (existing or new) — used by baby-shark for pipeline state
gcloud storage buckets create gs://YOUR-BUCKET --location=us-west1
# ADC auth for SkyPilot
gcloud auth application-default login
# baby-shark with ${VAR} interpolation (PR tails-mpt/baby-shark#110)
git clone https://github.com/tails-mpt/baby-shark.git
cd baby-shark
git checkout feat/shark-toml-env-interpolation # remove once PR lands
uv sync
# Sanity check
uv run shark sky checkcd inference-endpoint
# Required
export BUCKET="gs://YOUR-BUCKET"
export TARGET_MODEL="Qwen/Qwen2.5-1.5B-Instruct"
export ACCELERATORS="L4:1"
# Optional — defaults match schema.toml
export ENGINE="sglang"
export HF_TOKEN="hf_..." # only for gated models
uv run --project /path/to/baby-shark \
shark pipeline run --pipeline-file .slipstream/pipeline/shark_pipeline.tomlThe pipeline's [stage.env] block uses ${VAR} interpolation to pull these
exports into the VM's environment, so the same shark.toml works under
slipstream (which uses global_env) and via direct shark pipeline run.
- SkyPilot provisions an L4 spot VM in GCP (~3 min).
run.shon the VM logs:=== Computing parallelism ===→TP=1, DP=1, EP=1=== Installing engine: sglang ===(~2 min)=== Launching server ===Endpoint live at http://<external-ip>:8000
- From your laptop:
curl http://<external-ip>:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{"model":"Qwen2.5-1.5B-Instruct", "messages":[{"role":"user","content":"Hi"}], "max_tokens":20}'
The server runs forever by design (teardown_cluster = false). Tear it down
explicitly when you're done:
uv run --project /path/to/baby-shark shark instance list
uv run --project /path/to/baby-shark shark instance destroy --name <cluster-name>| GPU spec | Approx. spot $/hr | Cold start to live endpoint |
|---|---|---|
L4:1 |
$0.20 | ~6 min |
A100:1 |
$0.50 | ~7 min |
H100:8 |
$20 | ~10 min |
Spot can be preempted. For demos, set use_spot = false in shark.toml.
Each entry exercises a different code path. Swap the exports below into the launch block above.
Smallest "real" test. No HF token needed.
export TARGET_MODEL="Qwen/Qwen2.5-7B-Instruct"
export ACCELERATORS="A100:1"Expected: TP=1, DP=1, EP=1. ~7 min cold start, ~$0.50/hr.
Forces TP > 1. Validates the parallelism math and HF auth.
# https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct — accept the license first
export HF_TOKEN="hf_..."
export TARGET_MODEL="meta-llama/Llama-3.1-70B-Instruct"
export ACCELERATORS="A100:4"Expected: TP=4, DP=1, EP=1. ~12-15 min cold start, ~$5/hr.
Forces EP > 1. Validates the MoE branch.
export TARGET_MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
export ACCELERATORS="A100-80GB:2"Expected: TP=2, DP=1, EP=2. ~15 min cold start.
Validates the speculative path on a public draft.
export TARGET_MODEL="Qwen/Qwen3-14B-FP8"
export ACCELERATORS="H100:1"
export SPEC_ENABLED="true"
export SPEC_DRAFT_MODEL="RedHatAI/Qwen3-14B-speculator.eagle3"
export SPEC_METHOD="eagle3" # or 'EAGLE3' for sglangA100/H100 spot quota is per-region in GCP and often zero by default. If
sky launch retries forever:
gcloud compute regions describe us-west1 --format="value(quotas)" \
| tr ',' '\n' | grep -i nvidiaIf the quota you want is zero, request it in the GCP console (Compute Engine → Quotas) or fall back to a smaller GPU spec.
src/parallelism.py follows ai-factory's heuristic:
- Pull weight count from HF (safetensors metadata → file sizes → param count).
- Add 20% overhead for CUDA context, activations, KV cache bootstrap.
- Apply quantization reduction if
QUANTIZATIONis set (fp8 → 0.5×, int4wo → 0.25×). - TP = smallest power of 2 such that
weight_mem / TP ≤ vram_per_gpu, then validated to dividenum_attention_heads. - DP =
gpu_count // TP. - EP =
TPif the model is MoE (config hasnum_local_experts), else 1.
If TP exceeds the requested GPU count, the job fails fast with an actionable message.
src/gpu_specs.py maps SkyPilot accelerator names to per-GPU VRAM:
| GPU | VRAM | GPU | VRAM | |
|---|---|---|---|---|
T4 |
16 GB | A100 |
40 GB | |
L4 |
24 GB | A100-80GB |
80 GB | |
A10 / A10G |
24 GB | H100 |
80 GB | |
V100 |
16 GB | H200 |
141 GB | |
V100-32GB |
32 GB | B200 |
192 GB |
Add to VRAM_GB in gpu_specs.py if SkyPilot supports a GPU not listed here.
inference-endpoint/
├── README.md
├── pyproject.toml
├── src/
│ ├── gpu_specs.py # ACCELERATORS string → GPUInventory
│ ├── parallelism.py # vendored from ai-factory; HF intro + TP/DP/EP math
│ └── launch.py # engine-aware argv builder, execvp's the server
└── .slipstream/
├── schema.toml # user-facing inputs
└── pipeline/
├── shark_pipeline.toml
└── jobs/inference-endpoint/
├── shark.toml # accelerators = "${ACCELERATORS}"
├── environment.yml # base env (engine installed at runtime)
└── run.sh # parallelism check → engine install → launch + health
shark.toml uses accelerators = "${ACCELERATORS}" — driven by the slipstream
[[input]]. This requires baby-shark with ${VAR} interpolation in TOML
loading (PR tails-mpt/baby-shark#110). Without it, the launch fails with a
literal-string accelerator value.
- TP/PP weight sharding is approximated — actual layouts vary between engines.
- Activation memory is folded into a flat 20% overhead.
- Quantized checkpoints (
gguf,awq) are detected by file naming only; explicitQUANTIZATIONis more reliable. - KV cache pre-allocation differs between sglang (
mem-fraction-static) and vllm (gpu-memory-utilization); we passMEM_FRACTIONto both.
Things this v0.1 doesn't do yet. Listed roughly by priority — the security & cost items are the riskiest gaps if anyone uses this beyond personal testing.
- API-key auth. Currently anyone with the external IP can hit the
endpoint. sglang accepts
--api-key; vllm accepts--api-key. AddAPI_KEYschema input and thread it into both launchers. - TLS / HTTPS. Endpoint is raw HTTP — credentials and prompts go cleartext. Either terminate TLS at a reverse proxy (caddy / nginx) on the same VM, or front with GCP HTTPS Load Balancer.
- Firewall scoping. SkyPilot's
portsopens 8000 to0.0.0.0/0. Restrict to known CIDRs viaports+ custom GCP firewall rules. - HF token leakage.
HF_TOKENis currently passed via[stage.env]in plaintext into the SkyPilot YAML. Switch to a secret-manager pull (GCP Secret Manager / GitHub Actions secret) at run.sh time.
- Auto-shutdown on idle. Server runs forever unless explicitly
destroyed. Add idle-timeout: poll request rate, shut down after N
minutes of zero traffic. Optional schema input
IDLE_TIMEOUT_MIN. - Budget cap. No hard ceiling. Add
MAX_RUN_HOURSthat schedules agcloud compute instances delete --max-run-duration=Nhon launch (already used by ai-factory's find_gpu.sh). - Spot preemption recovery. Today, preemption = endpoint dies. Move
to SkyPilot managed jobs (
sky.jobs.launch()) for auto-recovery. - Persistent HF cache. Every fresh cluster re-downloads weights
(~10 min for 70B). Mount a persistent disk at
~/.cache/huggingfaceacross launches in the same project.
- Per-request token counting. Both engines emit Prometheus metrics
at
/metrics(sglang:--enable-metricsalready on; vllm: built-in). Scrape into a dashboard, attribute by API key once auth lands. - Request log to GCS. No durable record of who asked what. Stream
access logs from sglang/vllm to a GCS bucket via
--log-requests+ a fluentbit sidecar. - Cost-per-1k-tokens reporting. Compute from GPU $/hr × utilization × throughput. Useful as an output artifact.
- Latency percentiles in MLflow. Mirror number-generator's MLflow pattern for benchmark runs.
- vllm engine. Argv builder exists but never run end-to-end. Smoke
test on
Qwen/Qwen2.5-1.5B-Instruct/L4:1. - Speculative decoding. Both engines wired up; not validated.
Smoke test: sglang EAGLE3 with
RedHatAI/Qwen3-14B-speculator.eagle3. - MoE / EP path. Math validated locally (Mixtral on H100:4 → EP=2); live launch never attempted.
- Multi-cloud. Only GCP tested. SkyPilot supports AWS, Azure,
Lambda, RunPod — should work via just changing
cloudin shark.toml, not verified.
- Pipeline parallelism (PP). We compute TP and DP, never PP. Models too large for TP-on-one-node need PP across nodes.
- Auto-pick fp8 vs int8wo by architecture. Today the user has to know fp8 is Hopper+. Could auto-substitute int8wo on Ampere when fp8 is requested, with a warning.
- Activation memory modeling at long context. Flat 20% overhead is
wrong for
MAX_MODEL_LEN=128k. Account for sequence length × hidden dim × layers in the estimate. - Honor
num_key_value_heads(GQA) in TP validation. Currently we validate againstnum_attention_heads; for GQA models,num_kv_headsis the tighter constraint.
- Tests. Zero tests in this repo. At minimum: unit tests for
parse_accelerators,_next_power_of_two,calculate_parallelism(we have local smoke tests but not committed). - CI. No GitHub Actions. Add lint + unit-test workflow.
- De-vendor parallelism math.
src/parallelism.pyis copied from ai-factory. Either depend on ai-factory as a package (after it addspyproject.toml) or extract a sharedtails-mpt/ai-inference-utilslibrary. See discussion in commit history. - More GPUs in
gpu_specs.py. Add MI300X, RTX 6000 Ada, etc. as SkyPilot adds them. - GGUF / AWQ / GPTQ paths. Currently only fp8 + torchao. Llama.cpp / AWQ could run much smaller models on cheaper GPUs.
- Local-only mode. No way to run sglang/vllm against a local model directory. Useful for testing draft-model pairs developed in-house.
- Baby-shark
${VAR}interpolation: still on PR tails-mpt/baby-shark#110; without merge, this tool depends on the feature branch. - fp8 quantization requires Hopper+ (guarded in
parallelism.py).