Bug Description
controller.run_multi_pipeline_inference hangs forever in the busy-actor branch. ray reports 0% gpu/cpu, the scheduler keeps returning {pipeline_id: -1}, and no progress is made on any pipeline.
To Reproduce
scheduler-only repro (no ray needed, runs in <1s):
from rapidfireai.evals.scheduling.pipeline_scheduler import PipelineScheduler
s = PipelineScheduler(pipeline_ids=[1, 2], num_actors=1, num_shards=1)
s.schedule() # actor 0 marked busy with pipeline 1
# simulate a controller dispatch failure that forgets to call remove_pipeline
for _ in range(20):
    print(s.schedule())  # always {pipeline_id: -1, actor_id: -1, shard_id: -1}
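for contrast, the same scheduler recovers as soon as the controller-side cleanup actually runs. a minimal continuation of the snippet above, assuming remove_pipeline(pipeline_id) is the call that frees the owning actor (signature assumed):
s.remove_pipeline(1)  # the cleanup the failed dispatch path forgot to call
print(s.schedule())   # expected to hand actor 0 to pipeline 2 instead of returning -1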
end-to-end repro on a real workload:
- start a 10-config evals grid via Experiment.run_evals(...)
- each cfg uses RFGridSearch([cfg_dict]) with num_shards=1 and batch_size=8
- configs are openai-api shaped via RFOpenAIAPIModelConfig
- first dispatch of every cfg wedges at questions_done=0
a regression test for this exact wedge is at tests/test_pipeline_scheduler.py::TestPipelineSchedulerBookkeeping::test_actor_leaks_busy_when_neither_completion_nor_removal_called on PR #238
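a rough sketch of what that test exercises, reconstructed from the test name and the scheduler repro above (the actual assertions on PR #238 may differ):
from rapidfireai.evals.scheduling.pipeline_scheduler import PipelineScheduler

def test_actor_leaks_busy_when_neither_completion_nor_removal_called():
    s = PipelineScheduler(pipeline_ids=[1, 2], num_actors=1, num_shards=1)
    assert s.schedule()["pipeline_id"] != -1  # actor 0 goes busy with pipeline 1
    # no completion callback and no remove_pipeline call in between
    assert s.schedule()["pipeline_id"] == -1  # the lone actor is leaked as busy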
Expected Behavior
on dispatch failure or actor death the scheduler frees the actor and continues with surviving pipelines. when all actors are legitimately busy the loop polls for completions instead of sleeping blindly.
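a hedged sketch of the polling we'd expect in the busy-actor branch, replacing time.sleep(0.5) plus ray.wait(futures, timeout=0). active_tasks and remove_pipeline come from the report; the single-future-per-actor shape, the dict keys, and complete_task are assumed names for illustration, not the actual controller code:
import ray
from ray.exceptions import RayActorError

def reap_or_wait(active_tasks, scheduler, timeout_s=0.5):
    # block up to timeout_s on any completion instead of sleeping blindly
    futures = [task["future"] for task in active_tasks.values()]
    if not futures:
        return
    ready, _ = ray.wait(futures, num_returns=1, timeout=timeout_s)
    for future in ready:
        actor_id = next(a for a, t in active_tasks.items() if t["future"] is future)
        task = active_tasks.pop(actor_id)
        try:
            ray.get(future)  # raises RayActorError if the actor died after dispatch
        except RayActorError:
            # dead actor: drop its pipeline so the scheduler stops offering it work
            scheduler.remove_pipeline(task["pipeline_id"])
            continue
        # normal completion: free the actor so schedule() can assign it new work
        scheduler.complete_task(actor_id, task["pipeline_id"])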
Screenshots
n/a, backend hang.
Environment
- OS: Ubuntu 24.04 (kernel 6.17.0, aarch64)
- Python version: 3.13.12
- RapidFire AI version: 0.12.8 (same wedge line confirmed on 0.15.2 main at controller.py:1403)
- Ray version: 2.55.1
- Hardware: 20 cpu cores, 121 GiB ram, nvidia gpu present at /dev/nvidia0 (nvidia-smi not in PATH during the run)
Additional Context
two independent triggers in run_multi_pipeline_inference:
- exception during the batch dispatch / bookkeeping block.
actor.initialize_for_pipeline.remote(...) is wrapped by an existing try/except, but the subsequent actor.process_batch.remote(...) loop, the active_tasks[actor_id] = {...} assignment, and the db.set_actor_task_* writes are not. any synchronous exception there leaks the actor's busy state inside pipeline_scheduler.schedule() (see the guard sketch after this list).
- actor death after dispatch.
completion reaping is ray.wait(futures, timeout=0), a non-blocking poll, and the busy-loop is time.sleep(0.5) with no health check. when an actor dies (oom, segfault) its futures never surface, so the loop spins forever.
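for the first trigger, a hedged sketch of the guard we'd expect around the full dispatch/bookkeeping block. remove_pipeline, active_tasks, initialize_for_pipeline, process_batch, and the db.set_actor_task_* family come from the report; the function shape, pipeline_config, batches, and the exact db method name are placeholders, not the code on PR #238:
def dispatch_to_actor(actor, actor_id, pipeline_id, pipeline_config, batches,
                      active_tasks, scheduler, db):
    # sketch only: cover the whole dispatch + bookkeeping block with the try/except,
    # not just initialize_for_pipeline, so a synchronous failure cannot leak a busy actor
    try:
        actor.initialize_for_pipeline.remote(pipeline_config)
        futures = [actor.process_batch.remote(batch) for batch in batches]
        active_tasks[actor_id] = {"pipeline_id": pipeline_id, "futures": futures}
        db.set_actor_task_started(actor_id, pipeline_id)  # stand-in for the db.set_actor_task_* writes
    except Exception:
        # undo the scheduler bookkeeping so actor_id is not stuck busy forever
        active_tasks.pop(actor_id, None)
        scheduler.remove_pipeline(pipeline_id)
        return False  # let the controller move on to surviving pipelines
    return True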
deterministic on our box: 100% repro across 5+ attempts, and we once left it running 90+ minutes before killing it. confirmed it was a wedge rather than slow progress via three signals:
- ray status: 0.0/1.0 gpu, 0.0/20.0 cpu
- zero outbound api connections from actor processes
- grid_status.json mtime fresh (controller heartbeat alive) but last_question_id stuck at null
PR with fix: #238
Error Logs
py-spy stack from the wedged process (PID 1733772, captured via sudo py-spy since ptrace_scope=1 blocks user-level py-spy):
Process 1733772: .venv/bin/python experiments/run_full_grid.py --skip-indexes
Python v3.13.12
Thread 1733772 (idle): "MainThread"
    run_multi_pipeline_inference (rapidfireai/evals/scheduling/controller.py:1081)
    run_evals (rapidfireai/experiment.py:377)
logs/grid_status.json snapshot during the wedge:
{
  "current_config": "cfg1_default",
  "questions_done": 0,
  "questions_total": 45,
  "last_question_id": null,
  "last_dt_seconds": null,
  "cfg_done": false
}
ray status during the wedge: 0.0/1.0 gpu, 0.0/20.0 cpu