
Orphaned 'running'/'ongoing' state after controller dies: no liveness / reconciliation in evals scheduler #222

@kamran-rapidfireAI

Description

Summary

When the controller process hosting experiment.run_evals(...) dies mid-run (kernel crash, OOM, nbconvert per-cell timeout killing the kernel, power loss, etc.), rapidfire_evals.db is left with experiments.status = 'running' and pipelines.status = 'ongoing' indefinitely. Nothing in the dispatcher or scheduler detects that the Ray actors / controller are gone, so from the dispatcher API's perspective the experiment is still active forever — which poisons everything downstream (RapidFire dashboard, Converge Assistant's autopilot loop, any tool polling /dispatcher/get-all-runs).

Reproduction (deterministic, ~35 min)

source ~/.venv/bin/activate
cd ~/tutorial_notebooks/rag-contexteng
# Force the kernel to die mid-run_evals:
jupyter nbconvert --to notebook --execute rf-tutorial-rag-fiqa.ipynb \
  --output rf-tutorial-rag-fiqa.executed.ipynb \
  --ExecutePreprocessor.timeout=1800          # too short on purpose

At ~34 min the run_evals cell hits the 30-min per-cell timeout, nbconvert kills the kernel, Ray actors go with it. Then:

sqlite3 ~/rapidfireai/db/rapidfire_evals.db \
  "SELECT experiment_id, status FROM experiments;
   SELECT pipeline_id, status, shards_completed, current_shard_id FROM pipelines;"

Observed on my run (9 hours after the kernel died, nothing running):

1|exp1-fiqa-rag|running

1|completed|4|4
2|completed|4|4
3|completed|4|4
4|completed|4|4
5|ongoing|3|3     ← actor died mid-vLLM-reinit for shard 3
6|ongoing|3|3     ← actor never got to shard 3
7|ongoing|3|3
8|ongoing|3|3

There is no live RapidFire process for this experiment (pgrep -fa 'ray::|ipykernel|rapidfireai.evals' returns empty) and no one ever flips any of this to failed.

Why this matters

  • Dashboard / UI treats the experiment as active and keeps polling.
  • Converge Assistant's autopilot (via rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py, which polls GET http://127.0.0.1:8851/dispatcher/get-all-runs) keeps being told runs are ongoing and hallucinates progress messages like "Trials 5–8 are currently ongoing at approximately 75% completion" that will never be true.
  • Next rapidfireai start / next experiment has no way to know which rapidfire_evals.db rows are stale and which are live. There's no "owner PID" / session marker on experiments or pipelines, so we can't even heuristically detect staleness.

Relevant code

State transitions today (OSS rapidfireai 0.15.3rc5+):

  • rapidfireai/evals/scheduling/controller.py:1195, db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED) — only runs when shards_completed >= num_shards; never executes if the controller dies.
  • rapidfireai/evals/scheduling/controller.py:1320, db.set_pipeline_status(pipeline_id, PipelineStatus.FAILED) — only on in-process exception.
  • rapidfireai/evals/utils/experiment_utils.py:147, set_experiment_status(..., ExperimentStatus.COMPLETED) — normal happy path.
  • rapidfireai/evals/utils/experiment_utils.py:183, set_experiment_status(..., ExperimentStatus.CANCELLED) — intentional cancellation path.

None of these execute on process death. There is no watchdog on the dispatcher side either (the dispatcher at port 8851 just serves whatever the DB says).

Fix hints

Cheapest (reactive, detects on next start)

At rapidfireai start / on Experiment(...) construction, scan rapidfire_evals.db:

  • For every experiments.status = 'running' with no live PID in rapidfire_pids.txt + no listening Ray GCS, flip to ExperimentStatus.FAILED with error = 'controller died without cleanup (detected at next startup)'.
  • Cascade: flip all pipelines.status = 'ongoing' under that experiment to PipelineStatus.FAILED.
  • Same for MLflow: for each pipeline's metric_run_id still in RUNNING, call MlflowClient.set_terminated(run_id, status='KILLED', end_time=<last actor_tasks.completed_at or now>).

Keeps the code local to OSS startup; doesn't require long-running daemons. Probably a 50-line patch.
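A minimal sketch of the startup scan, under loud assumptions: the schema (experiments/pipelines tables, an error column), raw status strings standing in for the ExperimentStatus/PipelineStatus enums, and an owner_pids mapping of experiment_id to controller PID are all illustrative, since the real DB layer lives behind rapidfireai's own db helpers:

```python
import os
import sqlite3

def pid_alive(pid: int) -> bool:
    """Probe process existence with signal 0 (delivers nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True

def reconcile_stale_runs(db_path: str, owner_pids: dict[int, int]) -> int:
    """Flip 'running' experiments with no live owner PID to 'failed',
    cascading their 'ongoing' pipelines. Returns experiments flipped."""
    conn = sqlite3.connect(db_path)
    flipped = 0
    try:
        rows = conn.execute(
            "SELECT experiment_id FROM experiments WHERE status = 'running'"
        ).fetchall()
        for (exp_id,) in rows:
            pid = owner_pids.get(exp_id)
            if pid is not None and pid_alive(pid):
                continue  # controller is still up; leave it alone
            conn.execute(
                "UPDATE experiments SET status = 'failed', "
                "error = 'controller died without cleanup (detected at next startup)' "
                "WHERE experiment_id = ?",
                (exp_id,),
            )
            conn.execute(
                "UPDATE pipelines SET status = 'failed' "
                "WHERE experiment_id = ? AND status = 'ongoing'",
                (exp_id,),
            )
            flipped += 1
        conn.commit()
    finally:
        conn.close()
    return flipped
```

The MLflow cascade would follow the same loop: for each flipped pipeline's metric_run_id still in RUNNING, call MlflowClient.set_terminated(run_id, status='KILLED').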

Medium (active, reconciles every N seconds)

Add a reconciler thread inside the dispatcher (port 8851) that runs every 30s:

  • For each active experiment row, check whether the owning controller session is still up:
    • Probe the recorded Ray session directly (e.g. check that its GCS address is still listening — ray.is_initialized() only reflects the local process, so the check has to reach the session itself), or
    • Track a heartbeat column experiments.last_heartbeat that the controller's main loop updates every few seconds. If now() - last_heartbeat > 60s, mark FAILED.
  • Requires a new column + one UPDATE per cycle in the controller — low overhead.

Robust (authoritative)

Record the controller's PID + host + ray session ID in experiments when a run starts. The dispatcher checks kill -0 <pid> periodically. Cascades to MLflow as above. Also lets a new controller reclaim or fence off an orphaned experiment safely.
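A sketch of the claim/fence pair, assuming three new columns (owner_pid, owner_host, ray_session_id) that don't exist today; the same-host restriction is deliberate, since kill -0 can't probe a remote PID:

```python
import os
import socket
import sqlite3

def record_owner(conn: sqlite3.Connection, experiment_id: int, ray_session_id: str) -> None:
    """Stamp the controller's identity on the experiment row at run start."""
    conn.execute(
        "UPDATE experiments SET owner_pid = ?, owner_host = ?, ray_session_id = ? "
        "WHERE experiment_id = ?",
        (os.getpid(), socket.gethostname(), ray_session_id, experiment_id),
    )
    conn.commit()

def is_orphaned(conn: sqlite3.Connection, experiment_id: int) -> bool:
    """True only if the recorded owner is provably gone (same-host check)."""
    row = conn.execute(
        "SELECT owner_pid, owner_host FROM experiments WHERE experiment_id = ?",
        (experiment_id,),
    ).fetchone()
    if row is None or row[0] is None:
        return False  # never claimed; nothing to decide
    pid, host = row
    if host != socket.gethostname():
        return False  # can't probe a remote PID; fall back to heartbeats
    try:
        os.kill(pid, 0)  # signal 0: existence probe, delivers nothing
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # PID exists under another user
    return False
```

A new controller would then refuse to attach to an experiment whose is_orphaned() is False and whose owner isn't itself, which is the fencing half of the reclaim-or-fence behavior.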

API-side UX

While stale rows exist, GET /dispatcher/get-all-runs could annotate pipelines with a stale: true flag based on last_heartbeat staleness, so Converge and the dashboard can at least show them differently from genuinely in-flight pipelines even before reconciliation finishes.

Related

  • RapidFireAI/rapidfireai-pro#37 — converge's separate converge_pids.txt shares the same "no lifecycle reconciliation" family of issues.
  • The Converge Assistant visibly hallucinates on stale data ("Trials 5–8 are currently ongoing at approximately 75% completion" — 9 hours after the kernel died). That's a Pro-side symptom of this OSS bug; will file separately.

Environment

  • Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
  • Python 3.12.3, venv at /home/kamran/.venv
  • rapidfireai 0.15.3rc5 + rapidfireai-pro 0.15.3rc7
  • Mode: --evals, experiment exp1-fiqa-rag, 8 grid configs × 4 shards
  • GPU: NVIDIA L4 (driver 580.95.05)
