
🐛 GKE nightly: vLLM standalone pod crashes during Llama-3.2-1B deployment #668

@clubanderson

Description

The GKE nightly benchmark (ci-nighly-benchmark-gke.yaml) consistently fails at the "Standup target cloud (standalone)" step. The vLLM standalone pod for meta-llama/Llama-3.2-1B gets created, reaches Running state, then crashes within ~30 seconds.

Reproduction

Every run of ci-nighly-benchmark-gke.yaml fails at this step:

  • Run 21994195293 (Feb 13, 2026)
  • Run 21992642881 (Feb 13, 2026)
  • Run 21969640089 (Feb 13, 2026)

Log Evidence

16:28:56 - Deploying model "meta-llama/Llama-3.2-1B" (from files at /tmp/cicd/)
16:28:57 - Model deployed
16:28:57 - Waiting for pods to be Running (timeout=900s)
16:52:48 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod created
16:52:53 - "vllm-standalone-llama-3-1b-845945945-dljsw" pod running
16:52:53 - Waiting for pods to be Ready (timeout=900s)
16:53:24 - ❌ Crashed container in pod: vllm-standalone-llama-3-1b-845945945-dljsw

Pod runs for ~30 seconds before crashing — suggests vLLM startup/model loading failure.
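Before full logs are available, the container's last terminated state can narrow this down. A minimal check (pod name from the failing run above; namespace taken from the Suggestion section below):

kubectl get pod vllm-standalone-llama-3-1b-845945945-dljsw -n llmdbenchcicd \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

An exit code of 1 usually means a startup/model-loading error, while 137 would point to an OOM kill instead.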

Configuration

Scenario: gke_H100_fb

  • LLMDBENCH_VLLM_COMMON_AFFINITY=cloud.google.com/gke-accelerator:nvidia-h100-80gb
  • LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=1
  • LLMDBENCH_LLMD_IMAGE_TAG=auto (resolved via skopeo)
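To verify what auto resolves to, the skopeo lookup can be reproduced by hand. A sketch; the registry/repository placeholders are hypothetical, since the actual image reference is not shown in this issue:

# List the tags that "auto" could have resolved to
skopeo list-tags docker://<registry>/<llm-d-image>
# Inspect the digest and labels of the resolved tag
skopeo inspect docker://<registry>/<llm-d-image>:<resolved-tag>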

Also observed early in logs:

Error: flags cannot be placed before plugin name: --kubeconfig

This appears twice and is likely non-fatal: kubectl refuses global flags placed before a plugin name, so some wrapper is passing --kubeconfig in the wrong position.
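For reference, this is kubectl's standard plugin-dispatch behavior (the plugin name below is hypothetical):

# Fails with "flags cannot be placed before plugin name"
kubectl --kubeconfig "$KUBECONFIG" some-plugin get-status
# Works: global flags go after the plugin name
kubectl some-plugin get-status --kubeconfig "$KUBECONFIG"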

Possible Causes

  1. GPU driver mismatch — The vLLM CUDA image may be incompatible with the H100 GPU driver on GKE
  2. Image tag resolution — the auto tag might resolve to an incompatible version
  3. vLLM model loading failure — Despite being a small 1B model, something in the startup fails
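For cause 1, the node-side driver version can be checked without the crashing pod. A sketch, assuming a GKE COS node image (the nvidia-smi path varies by node image):

# Find the H100 node(s) behind the affinity label from the Configuration section
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-80gb -o name
# Run nvidia-smi on the node via a debug pod (binary path assumes a COS node image)
kubectl debug node/<node-name> -it --image=busybox -- \
  chroot /host /home/kubernetes/bin/nvidia/bin/nvidia-smi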

Needed for Diagnosis

Pod-level logs from vllm-standalone-llama-3-1b-* before it crashes would pinpoint the exact error. Consider adding a kubectl logs <pod> --previous step to the workflow on failure.

Suggestion

Add a failure diagnostics step to capture pod logs and describe output when standup fails:

- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=vllm-standalone || true
    kubectl logs -n llmdbenchcicd -l app=vllm-standalone --previous --tail=100 || true
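If helpful, the same step could also capture namespace events, which surface image-pull failures and OOM kills even when the container log is empty; a possible extra line:

kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true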
