Description
The GKE nightly benchmark (ci-nighly-benchmark-gke.yaml) consistently fails at the "Standup target cloud (standalone)" step. The vLLM standalone pod for meta-llama/Llama-3.2-1B is created, reaches the Running state, and then crashes within ~30 seconds.
Reproduction
Every run of ci-nighly-benchmark-gke.yaml fails at this step:
- Run 21994195293 (Feb 13, 2026)
- Run 21992642881 (Feb 13, 2026)
- Run 21969640089 (Feb 13, 2026)
Log Evidence
```
16:28:56 - Deploying model "meta-llama/Llama-3.2-1B" (from files at /tmp/cicd/)
16:28:57 - Model deployed
16:28:57 - Waiting for pods to be Running (timeout=900s)
16:52:48 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod created
16:52:53 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod running
16:52:53 - Waiting for pods to be Ready (timeout=900s)
16:53:24 - ❌ Crashed container in pod: vllm-standalone-llama-3-1b-845945945-dljsw
```
The pod reaches Running and then crashes after ~30 seconds, which suggests the failure happens during vLLM startup or model loading rather than scheduling.
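If the pod still exists when diagnosing, the container's last termination state usually carries the exit code and reason directly. A minimal check, assuming the llmdbenchcicd namespace used later in this issue and the pod name from the log above:

```bash
# Print exit code/reason/message of the most recent container termination
kubectl get pod vllm-standalone-llama-3-1b-845945945-dljsw -n llmdbenchcicd \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```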
Configuration
Scenario: gke_H100_fb
```
LLMDBENCH_VLLM_COMMON_AFFINITY=cloud.google.com/gke-accelerator:nvidia-h100-80gb
LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=1
LLMDBENCH_LLMD_IMAGE_TAG=auto   # resolved via skopeo
```
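For context, these presumably render into the vLLM pod spec roughly as below. This is only a sketch assuming the affinity variable becomes a nodeSelector and the accelerator count a GPU resource limit; the actual deployer templates may differ:

```yaml
# Hypothetical rendering of the variables above into the pod spec
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-h100-80gb  # LLMDBENCH_VLLM_COMMON_AFFINITY
resources:
  limits:
    nvidia.com/gpu: 1  # LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR
```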
Also observed early in logs:
```
Error: flags cannot be placed before plugin name: --kubeconfig
```
This appears twice and is likely a non-fatal kubectl plugin invocation issue: kubectl refuses global flags that come before a plugin name.
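A minimal reproduction of that error class, assuming some kubectl plugin is invoked by the harness (the plugin name here is a placeholder; the actual invocation is not visible in the logs):

```bash
# Rejected: kubectl sees "my-plugin" as a plugin and errors on the preceding flag
kubectl --kubeconfig "$HOME/.kube/config" my-plugin status
# Accepted: flags placed after the plugin name are passed through to the plugin
kubectl my-plugin status --kubeconfig "$HOME/.kube/config"
```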
Possible Causes
- GPU driver mismatch: the vLLM CUDA image may be incompatible with the H100 GPU driver on GKE
- Image tag resolution: the auto tag might resolve to an incompatible version (quick checks sketched after this list)
- vLLM model loading failure: despite being a small 1B model, something in startup fails
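Two quick sanity checks for the first two hypotheses, sketched under the assumption of cluster and registry access (the image reference is a placeholder for whatever the workflow's skopeo resolution inspects):

```bash
# Confirm H100 nodes matching the affinity label exist and are schedulable
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-80gb -o wide

# List available tags at the source registry to see what "auto" could resolve to
skopeo list-tags docker://<registry>/<llm-d-image>
```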
Needed for Diagnosis
Pod-level logs from vllm-standalone-llama-3-1b-* before it crashes would pinpoint the exact error. Consider adding a kubectl logs <pod> --previous step to the workflow on failure.
Suggestion
Add a failure diagnostics step to capture pod logs and describe output when standup fails:
```yaml
- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=vllm-standalone || true
    kubectl logs -n llmdbenchcicd -l app=vllm-standalone --previous --tail=100 || true
```
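Note that --previous only returns output if the container actually restarted; if the deployment is torn down before the step runs, the describe output and cluster events are the fallback. An extra line such as the following (same assumed namespace) would capture recent events as well:

```bash
kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true
```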