
🐛 OCP nightly: harness launcher pod crashes in modelservice E2E (simulator mode) #669

@clubanderson

Description

The OCP nightly benchmark (ci-nighly-benchmark-ocp.yaml) consistently fails at the "E2E target cloud (modelservice, inference-perf)" step. The standalone benchmarks all pass, and the modelservice deployment + smoketest succeed, but the harness launcher pod crashes within ~2 minutes.

Reproduction

Every run of ci-nighly-benchmark-ocp.yaml fails at the modelservice E2E step:

  • Run 21994517553 (Feb 13, 2026)
  • Run 21993819179 (Feb 13, 2026)
  • Run 21993408435 (Feb 13, 2026)

Log Evidence

Simulator Mode Activated

The workflow detects fewer than 20 available GPUs and activates simulator mode:

if [[ $(echo "$(kubectl-view-allocations...) - 20.00" | bc ...) -lt 0 ]]; then
  echo "LLM-D SIMULATOR"; sed -i 's^####^^g' scenarios/cicd/ocp_fb.sh
fi
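
For reference, a rough equivalent of that check, summing allocatable nvidia.com/gpu across nodes with plain kubectl, might look like the sketch below. This is a hypothetical stand-in: the workflow itself uses kubectl-view-allocations, whose exact flags are elided above.

# Hypothetical stand-in for the kubectl-view-allocations call in the workflow:
# sum allocatable nvidia.com/gpu across all nodes.
total_gpus=$(kubectl get nodes \
  -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
  | awk '{s += $1} END {print s + 0}')

# Fewer than 20 GPUs available: switch the scenario to simulator mode by
# stripping the "####" prefix that comments out the override lines.
if (( total_gpus < 20 )); then
  echo "LLM-D SIMULATOR"
  sed -i 's^####^^g' scenarios/cicd/ocp_fb.sh
fi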

This uncomments the simulator overrides in ocp_fb.sh (a sketch of the commented-out section follows the list):

  • LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
  • LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
  • LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=0
  • LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=0
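
For illustration, the relevant section of ocp_fb.sh presumably looks something like this before the sed runs; the "####" prefix and the export form are assumptions, while the variable names and values are taken from the list above:

# scenarios/cicd/ocp_fb.sh (sketch) -- simulator overrides stay disabled by the
# "####" prefix until the workflow strips it with sed
####export LLMDBENCH_VLLM_STANDALONE_IMAGE_NAME=llm-d-inference-sim
####export LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR=0
####export LLMDBENCH_VLLM_MODELSERVICE_PREFILL_TENSOR_PARALLELISM=0
####export LLMDBENCH_VLLM_MODELSERVICE_DECODE_TENSOR_PARALLELISM=0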

Capacity Planner Warnings

⚠️ TP=0 is invalid. Please select from these options ([1, 2, 3, 4, 6, 12]) for facebook/opt-125m
⚠️ Max model length = 16384 exceeds the acceptable for facebook/opt-125m (max 2048)
Each model replica requires 0 GPUs, total available GPU memory = 0.0 GB

Deployment Succeeds but Harness Crashes

16:50:15 - Stack model detected is "facebook/opt-125m", matches requested
16:50:17 - Starting 1 harness-launcher pod(s) for model "facebook/opt-125m"
16:50:17 - 1 harness-launcher pod(s) started
16:50:19 - Waiting for pods to complete (timeout=3600s)
16:52:14 - ❌ Found some pods are in an error state

The harness-launcher pod goes from started (16:50:17) to an error state (16:52:14) in roughly two minutes.

What Works

  • Standalone deployment with inference-sim works fine (all 3 standalone benchmarks pass)
  • Modelservice deployment succeeds (pods healthy, smoketest passes)
  • The inference endpoint responds correctly

Possible Causes

  1. Simulator mode incompatibility — the inference-sim image may not fully support the modelservice path
  2. Harness pod configuration — the error may be in the harness launcher pod itself (not the inference pods)
  3. TP=0 validation — the "TP=0 is invalid" warning points to a misconfiguration that could also affect harness behavior
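
One way to narrow down cause 1 versus cause 2 is to replay a request against the modelservice endpoint directly, bypassing the harness. The service name, port, and request shape below are placeholders, not the workflow's actual values:

# Port-forward to the modelservice gateway/service (name and port are placeholders)
kubectl -n llmdbenchcicd port-forward svc/<modelservice-gateway> 8000:80 &

# Send a minimal OpenAI-style completion request, roughly what the harness would send.
# If this succeeds while the harness pod still crashes, the problem is more likely
# in the harness launcher configuration than in the simulator itself.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 8}'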

Suggestion

Add pod-level failure diagnostics:

- name: Collect failure diagnostics
  if: failure()
  run: |
    kubectl get pods -n llmdbenchcicd -l app=llmdbench-harness-launcher -o wide || true
    kubectl describe pods -n llmdbenchcicd -l app=llmdbench-harness-launcher || true
    kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --tail=200 || true
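
If the launcher container has already restarted by the time the step runs, the previous container's logs and the namespace events may be more informative; the namespace and label selector below are copied from the step above:

    kubectl logs -n llmdbenchcicd -l app=llmdbench-harness-launcher --previous --tail=200 || true
    kubectl get events -n llmdbenchcicd --sort-by=.lastTimestamp | tail -n 50 || true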
