Summary
When the controller process hosting experiment.run_evals(...) dies mid-run (kernel crash, OOM, nbconvert per-cell timeout killing the kernel, power loss, etc.), rapidfire_evals.db is left with experiments.status = 'running' and pipelines.status = 'ongoing' indefinitely. Nothing in the dispatcher or scheduler detects that the Ray actors / controller are gone, so from the dispatcher API's perspective the experiment is still active forever — which poisons everything downstream (RapidFire dashboard, Converge Assistant's autopilot loop, any tool polling /dispatcher/get-all-runs).
Reproduction (deterministic, ~35 min)
```bash
source ~/.venv/bin/activate
cd ~/tutorial_notebooks/rag-contexteng

# Force the kernel to die mid-run_evals:
jupyter nbconvert --to notebook --execute rf-tutorial-rag-fiqa.ipynb \
  --output rf-tutorial-rag-fiqa.executed.ipynb \
  --ExecutePreprocessor.timeout=1800   # too short on purpose
```
At ~34 min of wall clock the run_evals cell hits the 30-min per-cell timeout: nbconvert kills the kernel and the Ray actors die with it. Then:
```bash
sqlite3 ~/rapidfireai/db/rapidfire_evals.db \
  "SELECT experiment_id, status FROM experiments;
   SELECT pipeline_id, status, shards_completed, current_shard_id FROM pipelines;"
```
Observed on my run (9 hours after the kernel died, nothing running):
```text
1|exp1-fiqa-rag|running

1|completed|4|4
2|completed|4|4
3|completed|4|4
4|completed|4|4
5|ongoing|3|3    ← actor died mid-vLLM-reinit for shard 3
6|ongoing|3|3    ← actor never got to shard 3
7|ongoing|3|3
8|ongoing|3|3
```
There is no live RapidFire process for this experiment (pgrep -fa 'ray::|ipykernel|rapidfireai.evals' returns empty) and no one ever flips any of this to failed.
Why this matters
- Dashboard / UI treats the experiment as active and keeps polling.
- Converge Assistant's autopilot (via rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py → GET http://127.0.0.1:8851/dispatcher/get-all-runs) keeps being told runs are ongoing and hallucinates progress messages like "Trials 5–8 are currently ongoing at approximately 75% completion" that won't ever be true.
- The next rapidfireai start / next experiment has no way to know which rapidfire_evals.db rows are stale and which are live. There's no "owner PID" / session marker on experiments or pipelines, so we can't even heuristically detect staleness.
Relevant code
State transitions today (OSS rapidfireai 0.15.3rc5+):
rapidfireai/evals/scheduling/controller.py:1195 — db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED) — only runs when shards_completed >= num_shards. Never executes if the controller dies.
rapidfireai/evals/scheduling/controller.py:1320 — db.set_pipeline_status(pipeline_id, PipelineStatus.FAILED) — only on in-process exception.
rapidfireai/evals/utils/experiment_utils.py:147 — set_experiment_status(..., ExperimentStatus.COMPLETED) — normal happy path.
rapidfireai/evals/utils/experiment_utils.py:183 — set_experiment_status(..., ExperimentStatus.CANCELLED) — intentional cancellation path.
None of these execute on process death. There is no watchdog on the dispatcher side either (the dispatcher at port 8851 just serves whatever the DB says).
Fix hints
Cheapest (reactive, detects on next start)
At rapidfireai start / on Experiment(...) construction, scan rapidfire_evals.db:
- For every experiments.status = 'running' row with no live PID in rapidfire_pids.txt and no listening Ray GCS, flip it to ExperimentStatus.FAILED with error = 'controller died without cleanup (detected at next startup)'.
- Cascade: flip all pipelines.status = 'ongoing' rows under that experiment to PipelineStatus.FAILED.
- Same for MLflow: for each pipeline's metric_run_id still in RUNNING, call MlflowClient.set_terminated(run_id, status='KILLED', end_time=<last actor_tasks.completed_at or now>).
Keeps the code local to OSS startup; doesn't require long-running daemons. Probably a 50-line patch.
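A minimal sketch of what that startup pass could look like, assuming direct sqlite access, a pipelines.experiment_id foreign key, and an experiments.error column; the PID-file path and the helper names are hypothetical, and the MLflow set_terminated cascade is omitted for brevity:

```python
# Hypothetical startup reconciliation pass (sketch, not existing rapidfireai API).
import os
import sqlite3

DB_PATH = os.path.expanduser("~/rapidfireai/db/rapidfire_evals.db")
PID_FILE = os.path.expanduser("~/rapidfireai/rapidfire_pids.txt")  # assumed location

def _pid_alive(pid: int) -> bool:
    """kill -0 semantics: check existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # PID exists but is owned by another user
    return True

def reconcile_stale_experiments() -> None:
    pids = []
    if os.path.exists(PID_FILE):
        with open(PID_FILE) as f:
            pids = [int(line) for line in f if line.strip().isdigit()]
    if any(_pid_alive(p) for p in pids):
        return  # a recorded RapidFire process is still up; leave the rows alone

    conn = sqlite3.connect(DB_PATH)
    try:
        stale = conn.execute(
            "SELECT experiment_id FROM experiments WHERE status = 'running'"
        ).fetchall()
        for (exp_id,) in stale:
            conn.execute(
                "UPDATE experiments SET status = 'failed', "
                "error = 'controller died without cleanup (detected at next startup)' "
                "WHERE experiment_id = ?",
                (exp_id,),
            )
            conn.execute(
                "UPDATE pipelines SET status = 'failed' "
                "WHERE experiment_id = ? AND status = 'ongoing'",
                (exp_id,),
            )
        conn.commit()
    finally:
        conn.close()
```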
Medium (active, reconciles every N seconds)
Add a reconciler thread inside the dispatcher (port 8851) that runs every 30s:
- For each active experiment row, check whether the owning controller session is still up:
  - ping ray.is_initialized() on the recorded session, or
  - track a heartbeat column experiments.last_heartbeat that the controller's main loop updates every few seconds; if now() - last_heartbeat > 60s, mark the experiment FAILED.
- Requires a new column + one UPDATE per cycle in the controller — low overhead.
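A rough sketch of that reconciler, assuming the dispatcher gains a last_heartbeat column on experiments and touches the sqlite file directly (both assumptions; the function and thread names are made up):

```python
# Hypothetical dispatcher-side reconciler thread (sketch; last_heartbeat is a new column).
import sqlite3
import threading
import time

HEARTBEAT_TIMEOUT_S = 60
RECONCILE_INTERVAL_S = 30

def _reconcile_once(db_path: str) -> None:
    now = time.time()
    conn = sqlite3.connect(db_path)
    try:
        # Fail experiments whose controller stopped heartbeating.
        conn.execute(
            "UPDATE experiments SET status = 'failed', "
            "error = 'controller heartbeat lost' "
            "WHERE status = 'running' AND ? - last_heartbeat > ?",
            (now, HEARTBEAT_TIMEOUT_S),
        )
        # Cascade to their still-'ongoing' pipelines.
        conn.execute(
            "UPDATE pipelines SET status = 'failed' "
            "WHERE status = 'ongoing' AND experiment_id IN "
            "(SELECT experiment_id FROM experiments WHERE status = 'failed')"
        )
        conn.commit()
    finally:
        conn.close()

def start_reconciler(db_path: str) -> threading.Thread:
    """Daemon thread the dispatcher could start next to its HTTP app."""
    def _loop() -> None:
        while True:
            _reconcile_once(db_path)
            time.sleep(RECONCILE_INTERVAL_S)
    t = threading.Thread(target=_loop, name="stale-run-reconciler", daemon=True)
    t.start()
    return t
```

On the controller side, the heartbeat is then a single UPDATE of experiments.last_heartbeat per scheduling-loop iteration.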
Robust (authoritative)
Record the controller's PID + host + ray session ID in experiments when a run starts. The dispatcher checks kill -0 <pid> periodically. Cascades to MLflow as above. Also lets a new controller reclaim or fence off an orphaned experiment safely.
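The liveness check itself is tiny; a sketch (the owner_pid / owner_host columns don't exist yet and the helper name is made up):

```python
# Hypothetical owner-liveness check, equivalent to `kill -0 <pid>` on the local host.
import os
import socket

def controller_is_alive(owner_pid: int, owner_host: str) -> bool:
    if owner_host != socket.gethostname():
        return True  # can't probe a remote host from here; fall back to heartbeat/fencing
    try:
        os.kill(owner_pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # PID exists but belongs to another user
    return True
```

A new controller starting against the same experiment can compare the stored (PID, host, Ray session ID) triple with its own to decide whether to reclaim the row or fence it off.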
API-side UX
While stale rows exist, GET /dispatcher/get-all-runs could annotate pipelines with a stale: true flag based on last_heartbeat staleness, so Converge and the dashboard can at least show them differently from genuinely in-flight pipelines even before reconciliation finishes.
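For illustration, the annotation could be as small as this (sketch; the last_heartbeat input and the response shape are assumptions):

```python
# Hypothetical post-processing step in the /dispatcher/get-all-runs handler.
import time

STALE_AFTER_S = 60

def annotate_staleness(runs: list[dict], last_heartbeat: float | None) -> list[dict]:
    """Flag still-'ongoing' pipelines as stale when the controller heartbeat is old."""
    is_stale = last_heartbeat is None or (time.time() - last_heartbeat) > STALE_AFTER_S
    for run in runs:
        if run.get("status") == "ongoing":
            run["stale"] = is_stale
    return runs
```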
Related
- RapidFireAI/rapidfireai-pro#37 — converge's separate converge_pids.txt shares the same "no lifecycle reconciliation" family of issues.
- The Converge Assistant visibly hallucinates on stale data ("Trials 5–8 are currently ongoing at approximately 75% completion" — 9 hours after the kernel died). That's a Pro-side symptom of this OSS bug; will file separately.
Environment
- Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
- Python 3.12.3, venv at /home/kamran/.venv
- rapidfireai 0.15.3rc5 + rapidfireai-pro 0.15.3rc7
- Mode: --evals, experiment exp1-fiqa-rag, 8 grid configs × 4 shards
- GPU: NVIDIA L4 (driver 580.95.05)