Description
The OCP nightly benchmark (ci-nighly-benchmark-ocp.yaml) consistently fails at the "E2E target cloud (modelservice, inference-perf)" step. The standalone benchmarks all pass, and the modelservice deployment + smoketest succeed, but the harness launcher pod crashes within ~2 minutes.
Reproduction
Every run of ci-nighly-benchmark-ocp.yaml fails at the modelservice E2E step:
- Run 21994517553 (Feb 13, 2026)
- Run 21993819179 (Feb 13, 2026)
- Run 21993408435 (Feb 13, 2026)
Log Evidence
Simulator Mode Activated
The workflow detects < 20 available GPUs and activates simulator mode:
```bash
if [[ $(echo "$(kubectl-view-allocations...) - 20.00" | bc ...) -lt 0 ]]; then
  echo "LLM-D SIMULATOR"; sed -i 's^####^^g' scenarios/cicd/ocp_fb.sh
fi
```
This uncomments the simulator overrides in ocp_fb.sh:
```bash
LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=0
LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=0
```
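For reference, a minimal sketch of how the sed line flips these overrides, assuming the dormant lines in scenarios/cicd/ocp_fb.sh carry a "####" prefix (the exact file contents are an assumption here):
```bash
# Assumed dormant form inside scenarios/cicd/ocp_fb.sh:
#   ####LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
# Stripping the "####" prefix makes the assignment live; verify afterwards with:
grep -n 'LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim' scenarios/cicd/ocp_fb.sh \
  && echo "simulator overrides are active" \
  || echo "simulator overrides still commented out"
```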
Capacity Planner Warnings
```
⚠️ TP=0 is invalid. Please select from these options ([1, 2, 3, 4, 6, 12]) for facebook/opt-125m
⚠️ Max model length = 16384 exceeds the acceptable for facebook/opt-125m (max 2048)
Each model replica requires 0 GPUs, total available GPU memory = 0.0 GB
```
Deployment Succeeds but Harness Crashes
```
16:50:15 - Stack model detected is "facebook/opt-125m", matches requested
16:50:17 - Starting 1 harness-launcher pod(s) for model "facebook/opt-125m"
16:50:17 - 1 harness-launcher pod(s) started
16:50:19 - Waiting for pods to complete (timeout=3600s)
16:52:14 - ❌ Found some pods are in an error state
```
The harness pod crashes in ~2 minutes.
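To see why the launcher exits (not just that it is in an error state), the container termination status can be read directly; a sketch, reusing the namespace and label selector from the suggestion below and assuming they match the actual deployment:
```bash
# Print each harness-launcher pod's name, termination reason, and exit code.
kubectl get pods -n llmdbenchcicd -l app=llmdbench-harness-launcher \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.terminated.reason}{"\t"}{.status.containerStatuses[*].state.terminated.exitCode}{"\n"}{end}'
```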
What Works
- Standalone deployment with inference-sim works fine (all 3 standalone benchmarks pass)
- Modelservice deployment succeeds (pods healthy, smoketest passes)
- The inference endpoint responds correctly
Possible Causes
- Simulator mode incompatibility — The inference-sim image may not fully support the modelservice path
- Harness pod configuration — The harness launcher pod itself may have an error (not the inference pods)
- TP=0 validation — The "TP=0 is invalid" warning suggests a misconfiguration that could affect harness behavior (see the sketch after this list)
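On the third point, a possible mitigation is sketched below, assuming the capacity planner's constraints also apply in simulator mode; the clamp is not an existing workflow step, and the variable names are taken from ocp_fb.sh above (1 is the smallest tensor-parallelism value the planner lists as valid for facebook/opt-125m):
```bash
# Hypothetical simulator-mode overrides: keep the accelerator count at 0 but set
# tensor parallelism to 1, a value the capacity planner accepts, instead of 0.
export LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
export LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=1
export LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=1
```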
Suggestion
Add pod-level failure diagnostics:
```yaml
- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -l app=llmdbench-harness-launcher -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=llmdbench-harness-launcher || true
    kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --tail=200 || true
```
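If the launcher container restarts or is cleaned up before the step runs, the namespace events and the previous container's logs are often more informative; a possible extension of the same step, under the same namespace and label assumptions:
```bash
# Additional diagnostics that could be appended to the run: block above.
kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true
kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --previous --tail=200 || true
```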