Fix/eval scheduler actor wedge #238
Conversation
Two changes to evals/scheduling/controller.run_multi_pipeline_inference
that close a deterministic hang we hit on a 10-config grid run.
Symptom: MainThread parks in `time.sleep(0.5)` at the "Check if all
actors busy" branch while Ray reports 0% GPU/CPU usage. py-spy stack
shows the loop spinning; the bookkeeping says actors are busy but Ray
says they are idle. The hang is permanent.
Root cause has two triggers:
(1) Exception during the batch-submission block. The init step
(`actor.initialize_for_pipeline.remote(...)`) is already wrapped
by an existing try/except that calls `scheduler.remove_pipeline`,
but the subsequent `actor.process_batch.remote(...)` loop, the
`active_tasks[actor_id] = {...}` assignment, and the three
`db.set_actor_task_*` writes are not. If any of these raise
synchronously (serialization error, transient DB hiccup, etc.),
the actor, already marked busy by `scheduler.schedule()`, leaks
its busy state: the controller never calls `set_completed_task`
or `remove_pipeline` for it.
(2) Actor death after dispatch. Once batches are submitted and tracked
in `active_tasks[actor_id]["futures"]`, the only completion-reaping
path is `ray.wait(futures, timeout=0)` -- a non-blocking poll. If
the actor process dies (OOM-kill, segfault from a broken native
dep, etc.) the futures become orphaned: Ray will eventually surface
them as failed, but `ray.wait(timeout=0)` only catches what is
already ready in this Python tick. The controller falls into the
"all actors busy" branch, sleeps 0.5s with no health check, and
repeats forever.
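For context, a minimal standalone illustration of the `ray.wait` behavior both triggers hinge on. This is a sketch, not repo code: it assumes a local Ray runtime, and `slow_batch` is just a stand-in for a long-running batch task.

```python
import time
import ray

ray.init()  # local Ray runtime

@ray.remote
def slow_batch():
    time.sleep(5)   # stands in for a long-running process_batch call
    return "done"

ref = slow_batch.remote()

# timeout=0 is a non-blocking snapshot: it only reports refs that are already
# ready in this Python tick, so a result (or failure) that lands later goes
# unnoticed until some later poll happens to run after it.
ready, pending = ray.wait([ref], timeout=0)
assert ready == [] and pending == [ref]   # the task is still sleeping

# A positive timeout actually blocks, and returns early as soon as any ref
# resolves -- successfully or with an error such as RayActorError.
ready, pending = ray.wait([ref], num_returns=1, timeout=0.5)
```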
Fixes:
- Wrap the batch-submission + bookkeeping block in try/except that
mirrors the existing init-failure handler: mark the pipeline FAILED,
drop any partial active_tasks entry, call `scheduler.remove_pipeline`
so the actor is freed, and continue.
- Replace the bare `time.sleep(0.5)` in the busy-loop with
`ray.wait(all_pending, num_returns=1, timeout=0.5)`. ray.wait
surfaces both successful and failed futures (including
RayActorError from dead actors), so the next iteration's reap path
at the "Check for completed tasks" block frees the actor via the
existing exception handler. Falls back to time.sleep when no futures
are dispatched yet (initial loop entry).
The two changes are independent and complementary: either alone fixes
a subset of triggers; together they close the wedge.
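A minimal sketch of both changes in controller-loop shape. The helper names (`dispatch_batches`, `wait_for_capacity`), the `db.set_pipeline_status` / `db.set_actor_task_started` calls, and the exact arguments are placeholders rather than the repo's identifiers; only `process_batch.remote`, `active_tasks[...]["futures"]`, `scheduler.remove_pipeline`, the FAILED status, and the `ray.wait`/`time.sleep` fallback come from the description above. The busy-branch call shown here is the initial version; a later review comment in this thread refines it.

```python
import time
import ray

def dispatch_batches(scheduler, db, active_tasks, actor, actor_id, pipeline_id, batches):
    # Fix 1: anything that raises after schedule() marked the actor busy must
    # free it again, mirroring the existing init-failure handler.
    try:
        futures = [actor.process_batch.remote(batch) for batch in batches]
        active_tasks[actor_id] = {"pipeline_id": pipeline_id, "futures": futures}
        db.set_actor_task_started(actor_id, pipeline_id)  # stands in for the db.set_actor_task_* writes
    except Exception:
        db.set_pipeline_status(pipeline_id, "FAILED")     # placeholder for the FAILED bookkeeping
        active_tasks.pop(actor_id, None)                   # drop any partial entry
        scheduler.remove_pipeline(pipeline_id)             # release the busy actor

def wait_for_capacity(active_tasks):
    # Fix 2: block on in-flight futures instead of sleeping blindly, so a
    # failed future (e.g. RayActorError from a dead actor) wakes the loop.
    all_pending = [f for task in active_tasks.values() for f in task["futures"]]
    if all_pending:
        ray.wait(all_pending, num_returns=1, timeout=0.5)
    else:
        time.sleep(0.5)   # nothing dispatched yet (initial loop entry)
```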
Adds tests/test_pipeline_scheduler.py with 6 unit tests for the busy/free
transitions in PipelineScheduler. They document the invariant the
controller's scheduling loop relies on: an actor is marked busy by
schedule() and MUST be freed by either set_completed_task() (success)
or remove_pipeline() (failure). If the controller dispatch path fails
without calling one of those, the actor leaks busy state and the loop
wedges.
Also adds tools/repro_scheduler_wedge.py, a Ray-free standalone
reproduction of the bug. Scenario 1 simulates the buggy controller
path (dispatch fails, no remove_pipeline call) and detects the wedge
by counting consecutive "all actors busy" returns from schedule().
Scenario 2 demonstrates recovery via remove_pipeline. Useful for
maintainers to confirm the bookkeeping contract without spinning up
the full evals stack.
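For readers without the repo handy, a compressed, Ray-free stand-in of the same contract -- a toy scheduler, not the real `PipelineScheduler`, and only the `remove_pipeline` half of the invariant -- showing scenario 1's wedge detection (counting consecutive all-busy returns) and scenario 2's recovery:

```python
class ToyScheduler:
    """Ray-free stand-in for the busy/free bookkeeping only (illustrative)."""

    def __init__(self, actor_ids):
        self.free = set(actor_ids)
        self.busy = {}                       # actor_id -> pipeline_id

    def schedule(self, pipeline_id):
        if not self.free:
            return None                      # "all actors busy"
        actor_id = self.free.pop()
        self.busy[actor_id] = pipeline_id
        return actor_id

    def remove_pipeline(self, pipeline_id):
        for actor_id, pid in list(self.busy.items()):
            if pid == pipeline_id:
                del self.busy[actor_id]
                self.free.add(actor_id)


def run(recover_on_failure, max_idle_polls=10):
    sched = ToyScheduler(["actor-0"])
    assert sched.schedule("pipeline-0") is not None

    # Simulate dispatch failing synchronously right after schedule().
    if recover_on_failure:
        sched.remove_pipeline("pipeline-0")  # scenario 2: the fix frees the actor

    # Scenario 1's wedge detector: count consecutive "all actors busy" polls.
    for poll in range(max_idle_polls):
        if sched.schedule("pipeline-1") is not None:
            return f"recovered after {poll} polls"
    return "WEDGED: actor never freed"


if __name__ == "__main__":
    print("buggy path ->", run(recover_on_failure=False))
    print("fixed path ->", run(recover_on_failure=True))
```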
Existing tests on this branch:
pytest tests/test_pipeline_scheduler.py -> 6 passed
python tools/repro_scheduler_wedge.py -> exits 0, prints
"BUG REPRODUCED + FIX VERIFIED"
Tighten the new comments in controller.py, drop narrative docstrings from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py output. No behavior change.
Drop the standalone tools/repro_scheduler_wedge.py and add the wedge scenario as a regression test in test_pipeline_scheduler.py. The unit test encodes the same invariant deterministically and runs in CI, so the standalone script no longer pulls its weight.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6e7e220.
…ompletion

The previous busy-branch passed every in-flight batch future to ray.wait(num_returns=1, timeout=0.5). ray.wait is a snapshot of readiness, not an event consumer -- once any single batch resolved (which happens early in a multi-batch task), that already-ready ref permanently satisfied num_returns=1 and the 0.5s timeout never engaged. Combined with the reap path only draining when all of an actor's futures are ready, the all-actors-busy branch hot-spun the IC poll and scheduler.schedule() at thousands of iterations per second instead of the original two per second.

The fix: partition first via a non-blocking ray.wait and only block on the not-yet-ready set. Dead-actor detection is preserved: a RayActorError-bearing future is "ready" to ray.wait, lands in the ready partition, and gets drained by the next iteration's reap path via the existing handler at controller.py:1312.
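A sketch of that partition-then-block shape, revising the busy-branch sketch above (names are still placeholders, not the repo's identifiers):

```python
import time
import ray

def wait_for_capacity(active_tasks):
    all_futures = [f for task in active_tasks.values() for f in task["futures"]]
    if not all_futures:
        time.sleep(0.5)                      # nothing dispatched yet
        return
    # Non-blocking snapshot: split already-ready refs from in-flight ones.
    ready, pending = ray.wait(all_futures, num_returns=len(all_futures), timeout=0)
    if pending:
        # Block only on refs that are not yet ready, so an already-resolved
        # batch cannot permanently satisfy num_returns=1 and bypass the wait.
        ray.wait(pending, num_returns=1, timeout=0.5)
    # A RayActorError-bearing ref counts as ready, ends up in `ready`, and is
    # drained by the next iteration's reap path.
```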
@ekirimlioglu Thanks for the detailed diagnosis and patch! I verified that both failure modes are real and your fix follows the existing init-failure handler pattern, which is the right shape. Before the RapidFire AI team can proceed with reviewing and merging this, please open a GitHub Issue first describing the bugs, following the repo's Issue conventions. Specifically:
Then please reference that Issue in this PR as the one being resolved. This will help the QA team reproduce the bugs independently and validate the fix. It will also keep the project's bug-tracking conventions consistent. Overall, thanks for the great contribution. First-time PRs of this depth are rare!
Filed #239 with the repro, environment, and end-to-end validation notes. Small suggestion: CONTRIBUTING.md doesn't currently mention the file-an-issue-first step. Adding a short note under "Submitting a pull request" could help future first-time contributors.

Changes
a. An exception during the batch-submission/bookkeeping block left the actor marked busy by PipelineScheduler.schedule() with no path to free it. The block is now wrapped in a try/except that mirrors the existing init-failure handler — marks the pipeline FAILED, drops the partial
active_tasks entry, and calls scheduler.remove_pipeline so the actor is released.
b. Actor death after dispatch (OOM, segfault) left orphaned futures and the controller's busy-branch was a blind time.sleep(0.5). Replaced with ray.wait(all_pending, num_returns=1, timeout=0.5) so failed futures surface and the existing reap path frees the actor on the next iteration.
Screenshots (if applicable)
N/A — backend-only change.
Note
Medium Risk
Touches the core multi-pipeline scheduling loop and Ray waiting behavior; mistakes could reintroduce hangs or change throughput under load, though changes are targeted to failure paths.
Overview
Prevents Controller.run_multi_pipeline_inference from wedging when an actor dies after dispatch or when batch dispatch/bookkeeping throws synchronously. The busy-actor branch now blocks on ray.wait over pending (not-yet-ready) futures instead of time.sleep, so failed futures surface promptly. Batch submission and task bookkeeping are wrapped in a try/except that marks the pipeline/task as FAILED and calls scheduler.remove_pipeline(...) to ensure the actor is freed. Adds unit tests asserting PipelineScheduler's busy/free invariants and a regression case showing how a missing remove_pipeline call can leave actors stuck busy.

Reviewed by Cursor Bugbot for commit 5728903.