Bug Description
controller.run_multi_pipeline_inference hangs forever in the busy-actor branch. ray reports 0% gpu/cpu, the scheduler keeps returning {pipeline_id: -1}, and no progress is made on any pipeline.
To Reproduce
scheduler-only repro (no ray needed, runs in <1s):
from rapidfireai.evals.scheduling.pipeline_scheduler import PipelineScheduler
s = PipelineScheduler(pipeline_ids=[1, 2], num_actors=1, num_shards=1)
s.schedule() # actor 0 marked busy with pipeline 1
# simulate a controller dispatch failure that forgets to call remove_pipeline
for _ in range(20):
    print(s.schedule())  # always {pipeline_id: -1, actor_id: -1, shard_id: -1}
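for contrast, the same scheduler recovers as soon as the controller-side cleanup actually runs. a minimal continuation of the snippet above, assuming remove_pipeline(pipeline_id) is the call that frees the owning actor (signature assumed):
s.remove_pipeline(1)  # the cleanup the failed dispatch path forgot to call
print(s.schedule())   # expected to hand actor 0 to pipeline 2 instead of returning -1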
end-to-end repro on a real workload:
- start a 10-config evals grid via Experiment.run_evals(...)
- each cfg uses RFGridSearch([cfg_dict]) with num_shards=1 and batch_size=8
- configs are openai-api shaped via RFOpenAIAPIModelConfig
- first dispatch of every cfg wedges at questions_done=0
a regression test for this exact wedge is at tests/test_pipeline_scheduler.py::TestPipelineSchedulerBookkeeping::test_actor_leaks_busy_when_neither_completion_nor_removal_called on PR #238
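a rough sketch of what that test exercises, reconstructed from the test name and the scheduler repro above (the actual assertions on PR #238 may differ):
from rapidfireai.evals.scheduling.pipeline_scheduler import PipelineScheduler

def test_actor_leaks_busy_when_neither_completion_nor_removal_called():
    s = PipelineScheduler(pipeline_ids=[1, 2], num_actors=1, num_shards=1)
    assert s.schedule()["pipeline_id"] != -1  # actor 0 goes busy with pipeline 1
    # no completion callback and no remove_pipeline call in between
    assert s.schedule()["pipeline_id"] == -1  # the lone actor is leaked as busy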
Expected Behavior
on dispatch failure or actor death the scheduler frees the actor and continues with surviving pipelines. when all actors are legitimately busy the loop polls for completions instead of sleeping blindly.
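a hedged sketch of the polling we'd expect in the busy-actor branch, replacing time.sleep(0.5) plus ray.wait(futures, timeout=0). active_tasks and remove_pipeline come from the report; the single-future-per-actor shape, the dict keys, and complete_task are assumed names for illustration, not the actual controller code:
import ray
from ray.exceptions import RayActorError

def reap_or_wait(active_tasks, scheduler, timeout_s=0.5):
    # block up to timeout_s on any completion instead of sleeping blindly
    futures = [task["future"] for task in active_tasks.values()]
    if not futures:
        return
    ready, _ = ray.wait(futures, num_returns=1, timeout=timeout_s)
    for future in ready:
        actor_id = next(a for a, t in active_tasks.items() if t["future"] is future)
        task = active_tasks.pop(actor_id)
        try:
            ray.get(future)  # raises RayActorError if the actor died after dispatch
        except RayActorError:
            # dead actor: drop its pipeline so the scheduler stops offering it work
            scheduler.remove_pipeline(task["pipeline_id"])
            continue
        # normal completion: free the actor so schedule() can assign it new work
        scheduler.complete_task(actor_id, task["pipeline_id"])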
Screenshots
n/a, backend hang.
Environment
- OS: Ubuntu 24.04 (kernel 6.17.0, aarch64)
- Python version: 3.13.12
- RapidFire AI version: 0.12.8 (same wedge line confirmed on 0.15.2 main at controller.py:1403)
- Ray version: 2.55.1
- Hardware: 20 cpu cores, 121 GiB ram, nvidia gpu present at /dev/nvidia0 (nvidia-smi not in PATH during the run)
Additional Context
two independent triggers in run_multi_pipeline_inference:
- exception during the batch dispatch / bookkeeping block.
actor.initialize_for_pipeline.remote(...) is wrapped by an existing try/except, but the subsequent actor.process_batch.remote(...) loop, the active_tasks[actor_id] = {...} assignment, and the db.set_actor_task_* writes are not. any synchronous exception there leaks the actor's busy state inside pipeline_scheduler.schedule() (see the guard sketch after this list).
- actor death after dispatch.
completion reaping is ray.wait(futures, timeout=0), a non-blocking poll, and the busy-loop is time.sleep(0.5) with no health check. when an actor dies (oom, segfault) its futures never surface, so the loop spins forever.
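for the first trigger, a hedged sketch of the guard we'd expect around the full dispatch/bookkeeping block. remove_pipeline, active_tasks, initialize_for_pipeline, process_batch, and the db.set_actor_task_* family come from the report; the function shape, pipeline_config, batches, and the exact db method name are placeholders, not the code on PR #238:
def dispatch_to_actor(actor, actor_id, pipeline_id, pipeline_config, batches,
                      active_tasks, scheduler, db):
    # sketch only: cover the whole dispatch + bookkeeping block with the try/except,
    # not just initialize_for_pipeline, so a synchronous failure cannot leak a busy actor
    try:
        actor.initialize_for_pipeline.remote(pipeline_config)
        futures = [actor.process_batch.remote(batch) for batch in batches]
        active_tasks[actor_id] = {"pipeline_id": pipeline_id, "futures": futures}
        db.set_actor_task_started(actor_id, pipeline_id)  # stand-in for the db.set_actor_task_* writes
    except Exception:
        # undo the scheduler bookkeeping so actor_id is not stuck busy forever
        active_tasks.pop(actor_id, None)
        scheduler.remove_pipeline(pipeline_id)
        return False  # let the controller move on to surviving pipelines
    return True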
deterministic on our box: 100% repro across 5+ attempts, and we once left it running 90+ minutes before killing it. confirmed it was a wedge rather than slow progress via three signals:
- ray status: 0.0/1.0 gpu, 0.0/20.0 cpu
- zero outbound api connections from actor processes
- grid_status.json mtime fresh (controller heartbeat alive) but last_question_id stuck at null
PR with fix: #238
Error Logs
py-spy stack from the wedged process (PID 1733772, captured via sudo py-spy since ptrace_scope=1 blocks user-level py-spy):
Process 1733772: .venv/bin/python experiments/run_full_grid.py --skip-indexes
Python v3.13.12
Thread 1733772 (idle): "MainThread"
    run_multi_pipeline_inference (rapidfireai/evals/scheduling/controller.py:1081)
    run_evals (rapidfireai/experiment.py:377)
logs/grid_status.json snapshot during the wedge:
{
  "current_config": "cfg1_default",
  "questions_done": 0,
  "questions_total": 45,
  "last_question_id": null,
  "last_dt_seconds": null,
  "cfg_done": false
}
ray status during the wedge: 0.0/1.0 gpu, 0.0/20.0 cpu