
🐛 GKE nightly: vLLM standalone pod crashes during Llama-3.2-1B deployment #668

@clubanderson

Description

The GKE nightly benchmark (ci-nighly-benchmark-gke.yaml) consistently fails at the "Standup target cloud (standalone)" step. The vLLM standalone pod for meta-llama/Llama-3.2-1B gets created, reaches Running state, then crashes within ~30 seconds.

Reproduction

Every run of ci-nighly-benchmark-gke.yaml fails at this step:

  • Run 21994195293 (Feb 13, 2026)
  • Run 21992642881 (Feb 13, 2026)
  • Run 21969640089 (Feb 13, 2026)

Log Evidence

16:28:56 - Deploying model "meta-llama/Llama-3.2-1B" (from files at /tmp/cicd/)
16:28:57 - Model deployed
16:28:57 - Waiting for pods to be Running (timeout=900s)
16:52:48 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod created
16:52:53 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod running
16:52:53 - Waiting for pods to be Ready (timeout=900s)
16:53:24 - ❌ Crashed container in pod: vllm-standalone-llama-3-1b-845945945-dljsw

Pod runs for ~30 seconds before crashing — suggests vLLM startup/model loading failure.
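Before full logs are available, the container's last terminated state can narrow this down. A minimal check (pod name from the failing run above; namespace taken from the Suggestion section below):

kubectl get pod vllm-standalone-llama-3-1b-845945945-dljsw -n llmdbenchcicd \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

An exit code of 1 usually means a startup/model-loading error, while 137 would point to an OOM kill instead.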

Configuration

Scenario: gke_H100_fb

  • LLMDBENCH_VLLM_COMMON_AFFINITY=cloud.google.com/gke-accelerator:nvidia-h100-80gb
  • LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=1
  • LLMDBENCH_LLMD_IMAGE_TAG=auto (resolved via skopeo)
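To verify what auto resolves to, the skopeo lookup can be reproduced by hand. A sketch; the registry/repository placeholders are hypothetical, since the actual image reference is not shown in this issue:

# List the tags that "auto" could have resolved to
skopeo list-tags docker://<registry>/<llm-d-image>
# Inspect the digest and labels of the resolved tag
skopeo inspect docker://<registry>/<llm-d-image>:<resolved-tag>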

Also observed early in logs:

Error: flags cannot be placed before plugin name: --kubeconfig

This appears twice and is likely non-fatal: kubectl refuses global flags placed before a plugin name, so some wrapper is passing --kubeconfig in the wrong position.
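For reference, this is kubectl's standard plugin-dispatch behavior (the plugin name below is hypothetical):

# Fails with "flags cannot be placed before plugin name"
kubectl --kubeconfig "$KUBECONFIG" some-plugin get-status
# Works: global flags go after the plugin name
kubectl some-plugin get-status --kubeconfig "$KUBECONFIG"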

Possible Causes

  1. GPU driver mismatch — The vLLM CUDA image may be incompatible with the H100 GPU driver on GKE
  2. Image tag resolution — the auto tag might resolve to an incompatible version
  3. vLLM model loading failure — Despite being a small 1B model, something in the startup fails
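For cause 1, the node-side driver version can be checked without the crashing pod. A sketch, assuming a GKE COS node image (the nvidia-smi path varies by node image):

# Find the H100 node(s) behind the affinity label from the Configuration section
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-80gb -o name
# Run nvidia-smi on the node via a debug pod (binary path assumes a COS node image)
kubectl debug node/<node-name> -it --image=busybox -- \
  chroot /host /home/kubernetes/bin/nvidia/bin/nvidia-smi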

Needed for Diagnosis

Pod-level logs from vllm-standalone-llama-3-1b-* before it crashes would pinpoint the exact error. Consider adding a kubectl logs <pod> --previous step to the workflow on failure.

Suggestion

Add a failure diagnostics step to capture pod logs and describe output when standup fails:

- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=vllm-standalone || true
    kubectl logs -n llmdbenchcicd -l app=vllm-standalone --previous --tail=100 || true
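If helpful, the same step could also capture namespace events, which surface image-pull failures and OOM kills even when the container log is empty; a possible extra line:

kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true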
