
Orphaned 'running'/'ongoing' state after controller dies: no liveness / reconciliation in evals scheduler #222

@kamran-rapidfireAI

Description

Summary

When the controller process hosting experiment.run_evals(...) dies mid-run (kernel crash, OOM, nbconvert per-cell timeout killing the kernel, power loss, etc.), rapidfire_evals.db is left with experiments.status = 'running' and pipelines.status = 'ongoing' indefinitely. Nothing in the dispatcher or scheduler detects that the Ray actors / controller are gone, so from the dispatcher API's perspective the experiment is still active forever — which poisons everything downstream (RapidFire dashboard, Converge Assistant's autopilot loop, any tool polling /dispatcher/get-all-runs).

Reproduction (deterministic, ~35 min)

source ~/.venv/bin/activate
cd ~/tutorial_notebooks/rag-contexteng
# Force the kernel to die mid-run_evals:
jupyter nbconvert --to notebook --execute rf-tutorial-rag-fiqa.ipynb \
  --output rf-tutorial-rag-fiqa.executed.ipynb \
  --ExecutePreprocessor.timeout=1800          # too short on purpose

At ~34 min the run_evals cell hits the 30-min per-cell timeout, nbconvert kills the kernel, Ray actors go with it. Then:

sqlite3 ~/rapidfireai/db/rapidfire_evals.db \
  "SELECT experiment_id, status FROM experiments;
   SELECT pipeline_id, status, shards_completed, current_shard_id FROM pipelines;"

Observed on my run (9 hours after the kernel died, nothing running):

1|exp1-fiqa-rag|running

1|completed|4|4
2|completed|4|4
3|completed|4|4
4|completed|4|4
5|ongoing|3|3     ← actor died mid-vLLM-reinit for shard 3
6|ongoing|3|3     ← actor never got to shard 3
7|ongoing|3|3
8|ongoing|3|3

There is no live RapidFire process for this experiment (pgrep -fa 'ray::|ipykernel|rapidfireai.evals' returns empty) and no one ever flips any of this to failed.

Why this matters

  • Dashboard / UI treats the experiment as active and keeps polling.
  • Converge Assistant's autopilot (via rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py, which polls GET http://127.0.0.1:8851/dispatcher/get-all-runs) keeps being told runs are ongoing and hallucinates progress messages like "Trials 5–8 are currently ongoing at approximately 75% completion" that will never be true.
  • Next rapidfireai start / next experiment has no way to know which rapidfire_evals.db rows are stale and which are live. There's no "owner PID" / session marker on experiments or pipelines, so we can't even heuristically detect staleness.

Relevant code

State transitions today (OSS rapidfireai 0.15.3rc5+):

  • rapidfireai/evals/scheduling/controller.py:1195, db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED) — only runs when shards_completed >= num_shards; never executes if the controller dies.
  • rapidfireai/evals/scheduling/controller.py:1320, db.set_pipeline_status(pipeline_id, PipelineStatus.FAILED) — only on in-process exception.
  • rapidfireai/evals/utils/experiment_utils.py:147, set_experiment_status(..., ExperimentStatus.COMPLETED) — normal happy path.
  • rapidfireai/evals/utils/experiment_utils.py:183, set_experiment_status(..., ExperimentStatus.CANCELLED) — intentional cancellation path.

None of these execute on process death. There is no watchdog on the dispatcher side either (the dispatcher at port 8851 just serves whatever the DB says).

Fix hints

Cheapest (reactive, detects on next start)

At rapidfireai start / on Experiment(...) construction, scan rapidfire_evals.db:

  • For every experiments.status = 'running' with no live PID in rapidfire_pids.txt + no listening Ray GCS, flip to ExperimentStatus.FAILED with error = 'controller died without cleanup (detected at next startup)'.
  • Cascade: flip all pipelines.status = 'ongoing' under that experiment to PipelineStatus.FAILED.
  • Same for MLflow: for each pipeline's metric_run_id still in RUNNING, call MlflowClient.set_terminated(run_id, status='KILLED', end_time=<last actor_tasks.completed_at or now>).

Keeps the code local to OSS startup; doesn't require long-running daemons. Probably a 50-line patch.
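A minimal sketch of the startup scan, under loud assumptions: the schema (experiments/pipelines tables, an error column), raw status strings standing in for the ExperimentStatus/PipelineStatus enums, and an owner_pids mapping of experiment_id to controller PID are all illustrative, since the real DB layer lives behind rapidfireai's own db helpers:

```python
import os
import sqlite3

def pid_alive(pid: int) -> bool:
    """Probe process existence with signal 0 (delivers nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True

def reconcile_stale_runs(db_path: str, owner_pids: dict[int, int]) -> int:
    """Flip 'running' experiments with no live owner PID to 'failed',
    cascading their 'ongoing' pipelines. Returns experiments flipped."""
    conn = sqlite3.connect(db_path)
    flipped = 0
    try:
        rows = conn.execute(
            "SELECT experiment_id FROM experiments WHERE status = 'running'"
        ).fetchall()
        for (exp_id,) in rows:
            pid = owner_pids.get(exp_id)
            if pid is not None and pid_alive(pid):
                continue  # controller is still up; leave it alone
            conn.execute(
                "UPDATE experiments SET status = 'failed', "
                "error = 'controller died without cleanup (detected at next startup)' "
                "WHERE experiment_id = ?",
                (exp_id,),
            )
            conn.execute(
                "UPDATE pipelines SET status = 'failed' "
                "WHERE experiment_id = ? AND status = 'ongoing'",
                (exp_id,),
            )
            flipped += 1
        conn.commit()
    finally:
        conn.close()
    return flipped
```

The MLflow cascade would follow the same loop: for each flipped pipeline's metric_run_id still in RUNNING, call MlflowClient.set_terminated(run_id, status='KILLED').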

Medium (active, reconciles every N seconds)

Add a reconciler thread inside the dispatcher (port 8851) that runs every 30s:

  • For each active experiment row, check whether the owning controller session is still up:
    • Probe the recorded Ray session directly (e.g. check that its GCS address is still listening — ray.is_initialized() only reflects the local process, so the check has to reach the session itself), or
    • Track a heartbeat column experiments.last_heartbeat that the controller's main loop updates every few seconds. If now() - last_heartbeat > 60s, mark FAILED.
  • Requires a new column + one UPDATE per cycle in the controller — low overhead.

Robust (authoritative)

Record the controller's PID + host + ray session ID in experiments when a run starts. The dispatcher checks kill -0 <pid> periodically. Cascades to MLflow as above. Also lets a new controller reclaim or fence off an orphaned experiment safely.
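A sketch of the claim/fence pair, assuming three new columns (owner_pid, owner_host, ray_session_id) that don't exist today; the same-host restriction is deliberate, since kill -0 can't probe a remote PID:

```python
import os
import socket
import sqlite3

def record_owner(conn: sqlite3.Connection, experiment_id: int, ray_session_id: str) -> None:
    """Stamp the controller's identity on the experiment row at run start."""
    conn.execute(
        "UPDATE experiments SET owner_pid = ?, owner_host = ?, ray_session_id = ? "
        "WHERE experiment_id = ?",
        (os.getpid(), socket.gethostname(), ray_session_id, experiment_id),
    )
    conn.commit()

def is_orphaned(conn: sqlite3.Connection, experiment_id: int) -> bool:
    """True only if the recorded owner is provably gone (same-host check)."""
    row = conn.execute(
        "SELECT owner_pid, owner_host FROM experiments WHERE experiment_id = ?",
        (experiment_id,),
    ).fetchone()
    if row is None or row[0] is None:
        return False  # never claimed; nothing to decide
    pid, host = row
    if host != socket.gethostname():
        return False  # can't probe a remote PID; fall back to heartbeats
    try:
        os.kill(pid, 0)  # signal 0: existence probe, delivers nothing
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # PID exists under another user
    return False
```

A new controller would then refuse to attach to an experiment whose is_orphaned() is False and whose owner isn't itself, which is the fencing half of the reclaim-or-fence behavior.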

API-side UX

While stale rows exist, GET /dispatcher/get-all-runs could annotate pipelines with a stale: true flag based on last_heartbeat staleness, so Converge and the dashboard can at least show them differently from genuinely in-flight pipelines even before reconciliation finishes.

Related

  • RapidFireAI/rapidfireai-pro#37 — converge's separate converge_pids.txt shares the same "no lifecycle reconciliation" family of issues.
  • The Converge Assistant visibly hallucinates on stale data ("Trials 5–8 are currently ongoing at approximately 75% completion" — 9 hours after the kernel died). That's a Pro-side symptom of this OSS bug; will file separately.

Environment

  • Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
  • Python 3.12.3, venv at /home/kamran/.venv
  • rapidfireai 0.15.3rc5 + rapidfireai-pro 0.15.3rc7
  • Mode: --evals, experiment exp1-fiqa-rag, 8 grid configs × 4 shards
  • GPU: NVIDIA L4 (driver 580.95.05)
