
🐛 OCP nightly: harness launcher pod crashes in modelservice E2E (simulator mode) #669

@clubanderson

Description

The OCP nightly benchmark (ci-nighly-benchmark-ocp.yaml) consistently fails at the "E2E target cloud (modelservice, inference-perf)" step. The standalone benchmarks all pass, and the modelservice deployment + smoketest succeed, but the harness launcher pod crashes within ~2 minutes.

Reproduction

Every run of ci-nighly-benchmark-ocp.yaml fails at the modelservice E2E step:

  • Run 21994517553 (Feb 13, 2026)
  • Run 21993819179 (Feb 13, 2026)
  • Run 21993408435 (Feb 13, 2026)

Log Evidence

Simulator Mode Activated

The workflow detects fewer than 20 available GPUs and activates simulator mode:

if [[ $(echo "$(kubectl-view-allocations...) - 20.00" | bc ...) -lt 0 ]]; then
  echo "LLM-D SIMULATOR"; sed -i 's^####^^g' scenarios/cicd/ocp_fb.sh
fi
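
For reference, a rough equivalent of that check, summing allocatable nvidia.com/gpu across nodes with plain kubectl, might look like the sketch below. This is a hypothetical stand-in: the workflow itself uses kubectl-view-allocations, whose exact flags are elided above.

# Hypothetical stand-in for the kubectl-view-allocations call in the workflow:
# sum allocatable nvidia.com/gpu across all nodes.
total_gpus=$(kubectl get nodes \
  -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
  | awk '{s += $1} END {print s + 0}')

# Fewer than 20 GPUs available: switch the scenario to simulator mode by
# stripping the "####" prefix that comments out the override lines.
if (( total_gpus < 20 )); then
  echo "LLM-D SIMULATOR"
  sed -i 's^####^^g' scenarios/cicd/ocp_fb.sh
fi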

This uncomments the simulator overrides in ocp_fb.sh (a sketch of the commented-out section follows the list):

  • LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
  • LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
  • LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=0
  • LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=0
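
For illustration, the relevant section of ocp_fb.sh presumably looks something like this before the sed runs; the "####" prefix and the export form are assumptions, while the variable names and values are taken from the list above:

# scenarios/cicd/ocp_fb.sh (sketch) -- simulator overrides stay disabled by the
# "####" prefix until the workflow strips it with sed
####export LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
####export LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
####export LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=0
####export LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=0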

Capacity Planner Warnings

⚠️ TP=0 is invalid. Please select from these options ([1, 2, 3, 4, 6, 12]) for facebook/opt-125m
⚠️ Max model length = 16384 exceeds the acceptable for facebook/opt-125m (max 2048)
Each model replica requires 0 GPUs, total available GPU memory = 0.0 GB

Deployment Succeeds but Harness Crashes

16:50:15 - Stack model detected is "facebook/opt-125m", matches requested
16:50:17 - Starting 1 harness-launcher pod(s) for model "facebook/opt-125m"
16:50:17 - 1 harness-launcher pod(s) started
16:50:19 - Waiting for pods to complete (timeout=3600s)
16:52:14 - ❌ Found some pods are in an error state

The harness-launcher pod goes from started (16:50:17) to an error state (16:52:14) in roughly two minutes.

What Works

  • Standalone deployment with inference-sim works fine (all 3 standalone benchmarks pass)
  • Modelservice deployment succeeds (pods healthy, smoketest passes)
  • The inference endpoint responds correctly

Possible Causes

  1. Simulator mode incompatibility — the inference-sim image may not fully support the modelservice path
  2. Harness pod configuration — the error may be in the harness launcher pod itself (not the inference pods)
  3. TP=0 validation — the "TP=0 is invalid" warning points to a misconfiguration that could also affect harness behavior
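
One way to narrow down cause 1 versus cause 2 is to replay a request against the modelservice endpoint directly, bypassing the harness. The service name, port, and request shape below are placeholders, not the workflow's actual values:

# Port-forward to the modelservice gateway/service (name and port are placeholders)
kubectl -n llmdbenchcicd port-forward svc/<modelservice-gateway> 8000:80 &

# Send a minimal OpenAI-style completion request, roughly what the harness would send.
# If this succeeds while the harness pod still crashes, the problem is more likely
# in the harness launcher configuration than in the simulator itself.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 8}'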

Suggestion

Add pod-level failure diagnostics:

- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -l app=llmdbench-harness-launcher -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=llmdbench-harness-launcher || true
    kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --tail=200 || true
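
If the launcher container has already restarted by the time the step runs, the previous container's logs and the namespace events may be more informative; the namespace and label selector below are copied from the step above:

    kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --previous --tail=200 || true
    kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true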
