Fix/eval scheduler actor wedge #238

Open

ekirimlioglu wants to merge 5 commits into RapidFireAI:main from ekirimlioglu:fix/eval-scheduler-actor-wedge

Conversation

@ekirimlioglu commented May 1, 2026

Changes

  • Fixes a deterministic wedge in Controller.run_multi_pipeline_inference that caused multi-pipeline inference to hang at 0% GPU/CPU. Two independent triggers were closed:
    a. An exception during the batch-submission/bookkeeping block left the actor marked busy by PipelineScheduler.schedule() with no path to free it. The block is now wrapped in a try/except that mirrors the existing init-failure handler: it marks the pipeline FAILED, drops the partial active_tasks entry, and calls scheduler.remove_pipeline so the actor is released (sketched below).
    b. Actor death after dispatch (OOM, segfault) left orphaned futures, and the controller's busy branch was a blind time.sleep(0.5). Replaced with ray.wait(all_pending, num_returns=1, timeout=0.5) so failed futures surface and the existing reap path frees the actor on the next iteration.
  • No API or behavior changes for healthy runs. No breaking changes.
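
A minimal sketch of the dispatch guard in fix (a). The function boundary, the active_tasks shape, and the db helpers shown here are assumptions for illustration, not the actual controller internals:

    def dispatch_batches(actor, actor_id, pipeline_id, batches,
                         active_tasks, scheduler, db):
        # Hedged sketch of fix (a); every name here is an assumption.
        try:
            futures = [actor.process_batch.remote(b) for b in batches]
            active_tasks[actor_id] = {"pipeline_id": pipeline_id,
                                      "futures": futures}
            db.set_actor_task_started(actor_id, pipeline_id)  # hypothetical write
        except Exception as exc:
            # Mirror the existing init-failure handler: mark FAILED, drop the
            # partial entry, and free the actor so the loop can continue.
            db.set_pipeline_failed(pipeline_id, str(exc))  # hypothetical write
            active_tasks.pop(actor_id, None)
            scheduler.remove_pipeline(pipeline_id)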

Changelog Content

Additions

  • Unit tests covering PipelineScheduler busy/free bookkeeping invariants, including a regression test for the actor-leak wedge (tests/test_pipeline_scheduler.py).

Changes

  • Busy-loop in evals controller now blocks on ray.wait instead of time.sleep, so dead-actor futures are reaped promptly.

Fixes

  • Prevent indefinite hang in evals multi-pipeline inference when batch dispatch raises synchronously or an actor dies after dispatch.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this change manually
  • I have tested this change in the following environments:
    • Local development
    • Docker environment
    • Other: 10-config grid run that previously wedged on a multi-GPU host

Screenshots (if applicable)

N/A — backend-only change.

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (no user-facing docs affected)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Note

Medium Risk
Touches the core multi-pipeline scheduling loop and Ray waiting behavior; mistakes could reintroduce hangs or change throughput under load, though changes are targeted to failure paths.

Overview
Prevents Controller.run_multi_pipeline_inference from wedging when an actor dies after dispatch or batch dispatch/bookkeeping throws synchronously.

The busy-actor branch now blocks on ray.wait over pending (not-yet-ready) futures instead of time.sleep, so failed futures surface promptly. Batch submission + task bookkeeping is wrapped in a try/except that marks the pipeline/task as FAILED and calls scheduler.remove_pipeline(...) to ensure the actor is freed.

Adds unit tests asserting PipelineScheduler’s busy/free invariants and a regression case showing how missing remove_pipeline can leave actors stuck busy.

Reviewed by Cursor Bugbot for commit 5728903.

Two changes to evals/scheduling/controller.run_multi_pipeline_inference
that close a deterministic hang we hit on a 10-config grid run.

Symptom: MainThread parks in `time.sleep(0.5)` at the "Check if all
actors busy" branch while Ray reports 0% GPU/CPU usage. py-spy stack
shows the loop spinning; the bookkeeping says actors are busy but Ray
says they are idle. The hang is permanent.

Root cause has two triggers:

(1) Exception during the batch-submission block. The init step
    (`actor.initialize_for_pipeline.remote(...)`) is already wrapped
    by an existing try/except that calls `scheduler.remove_pipeline`,
    but the subsequent `actor.process_batch.remote(...)` loop, the
    `active_tasks[actor_id] = {...}` assignment, and the three
    `db.set_actor_task_*` writes are not. If any of these raise
    synchronously (serialization error, transient DB hiccup, etc.),
    the actor, already marked busy by `scheduler.schedule()`, leaks
    busy state: the controller never calls `set_completed_task` or
    `remove_pipeline` for it.

(2) Actor death after dispatch. Once batches are submitted and tracked
    in `active_tasks[actor_id]["futures"]`, the only completion-reaping
    path is `ray.wait(futures, timeout=0)` -- a non-blocking poll. If
    the actor process dies (OOM-kill, segfault from a broken native
    dep, etc.) the futures become orphaned: Ray will eventually surface
    them as failed, but `ray.wait(timeout=0)` only catches what is
    already ready in this Python tick. The controller falls into the
    "all actors busy" branch, sleeps 0.5s with no health check, and
    repeats forever.
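
For illustration, a sketch of the reap path's semantics as described above (shapes assumed; `on_actor_death` stands in for the controller's existing failure handler). A dead actor's refs eventually become "ready" to ray.wait, and ray.get on them raises RayActorError, but a timeout=0 poll only sees what has already resolved in this tick:

    import ray
    from ray.exceptions import RayActorError

    def reap_ready(futures, on_actor_death):
        # Non-blocking snapshot: only refs already resolved right now.
        ready, pending = ray.wait(futures, num_returns=len(futures), timeout=0)
        for ref in ready:
            try:
                ray.get(ref)  # a dead actor's ref is "ready" and raises here
            except RayActorError:
                on_actor_death(ref)  # e.g. mark FAILED + remove_pipeline
        return pending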

Fixes:

- Wrap the batch-submission + bookkeeping block in try/except that
  mirrors the existing init-failure handler: mark the pipeline FAILED,
  drop any partial active_tasks entry, call `scheduler.remove_pipeline`
  so the actor is freed, and continue.

- Replace the bare `time.sleep(0.5)` in the busy-loop with
  `ray.wait(all_pending, num_returns=1, timeout=0.5)`. ray.wait
  surfaces both successful and failed futures (including
  RayActorError from dead actors), so the next iteration's reap path
  at the "Check for completed tasks" block frees the actor via the
  existing exception handler. Falls back to time.sleep when no futures
  are dispatched yet (initial loop entry).

The two changes are independent and complementary: either alone fixes
a subset of triggers; together they close the wedge.

Adds tests/test_pipeline_scheduler.py with 6 unit tests for the busy/free
transitions in PipelineScheduler. They document the invariant the
controller's scheduling loop relies on: an actor is marked busy by
schedule() and MUST be freed by either set_completed_task() (success)
or remove_pipeline() (failure). If the controller dispatch path fails
without calling one of those, the actor leaks busy state and the loop
wedges.
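
A condensed sketch of that invariant as a test; the constructor, import path, and all-actors-busy return convention shown here are assumptions based on this description, not the real PipelineScheduler API:

    from rapidfireai.evals.scheduling import PipelineScheduler  # module path assumed

    def test_remove_pipeline_frees_actor_after_failed_dispatch():
        scheduler = PipelineScheduler(actor_ids=["a0"],
                                      pipeline_ids=["p0", "p1"])
        actor_id, pipeline_id = scheduler.schedule()  # marks a0 busy
        assert actor_id == "a0"

        # Buggy path: dispatch fails and the controller never calls
        # set_completed_task() or remove_pipeline().
        assert scheduler.schedule() is None  # wedged: all actors busy

        scheduler.remove_pipeline(pipeline_id)   # the fix: frees a0
        assert scheduler.schedule() is not None  # scheduling resumes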

Also adds tools/repro_scheduler_wedge.py, a Ray-free standalone
reproduction of the bug. Scenario 1 simulates the buggy controller
path (dispatch fails, no remove_pipeline call) and detects the wedge
by counting consecutive "all actors busy" returns from schedule().
Scenario 2 demonstrates recovery via remove_pipeline. Useful for
maintainers to confirm the bookkeeping contract without spinning up
the full evals stack.

Existing tests on this branch:
  pytest tests/test_pipeline_scheduler.py  ->  6 passed
  python tools/repro_scheduler_wedge.py    ->  exits 0, prints
    "BUG REPRODUCED + FIX VERIFIED"
Tighten the new comments in controller.py, drop narrative docstrings
from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py
output. No behavior change.

Drop the standalone tools/repro_scheduler_wedge.py and add the wedge
scenario as a regression test in test_pipeline_scheduler.py. The unit
test encodes the same invariant deterministically and runs in CI, so
the standalone script no longer pulls its weight.
@ekirimlioglu marked this pull request as ready for review May 1, 2026 05:40

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6e7e220.

Comment thread on rapidfireai/evals/scheduling/controller.py (outdated)
…ompletion

The previous busy-branch passed every in-flight batch future to
ray.wait(num_returns=1, timeout=0.5). ray.wait is a snapshot of
readiness, not an event consumer -- once any single batch resolved
(which happens early in a multi-batch task), that already-ready ref
permanently satisfied num_returns=1 and the 0.5s timeout never
engaged. Combined with the reap path only draining when all of an
actor's futures are ready, the all-actors-busy branch hot-spun the
IC poll and scheduler.schedule() at thousands of iterations per second
instead of the original two per second.

Partition first via a non-blocking ray.wait and only block on the
not-yet-ready set. Dead-actor detection is preserved: a
RayActorError-bearing future is "ready" to ray.wait, lands in the
ready partition, and gets drained by the next iteration's reap path
via the existing handler at controller.py:1312.
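
A sketch of the partition-first wait this commit describes (the active_tasks shape and function boundary are assumptions):

    import time
    import ray

    def wait_for_progress(active_tasks, timeout_s=0.5):
        all_futures = [f for task in active_tasks.values()
                       for f in task["futures"]]
        if not all_futures:
            time.sleep(timeout_s)  # nothing dispatched yet: old fallback
            return
        # Non-blocking snapshot partitions ready vs not-yet-ready refs.
        _ready, pending = ray.wait(all_futures,
                                   num_returns=len(all_futures), timeout=0)
        if pending:
            # Block only on not-yet-ready refs, so an already-resolved batch
            # cannot satisfy num_returns=1 instantly and turn the 0.5s wait
            # into a hot spin. A RayActorError-bearing ref resolves here too;
            # the next iteration's reap path drains it.
            ray.wait(pending, num_returns=1, timeout=timeout_s)
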
@arun-rfai
Collaborator

@ekirimlioglu Thanks for the detailed diagnosis and patch! I verified that both failure modes are real and your fix follows the existing init-failure handler pattern, which is the right shape.

Before the RapidFire AI team can proceed with reviewing and merging this, please open a GitHub Issue first describing the bugs, following the repo's Issue conventions. Specifically:

  • Repro steps (minimal sequence to trigger the wedge on main)
  • Minimal set of configs that reliably reproduces it
  • Machine environment (OS, Python version, Ray version, hardware — CPU/GPU, cores, memory)
  • Workload characteristics (which model API calls, which embedding model, batch sizes, num actors, etc.)
  • Observed symptoms (logs, stdout, how you confirmed the wedge vs. slow progress)
  • Whether the wedge is deterministic or probabilistic on your setup
  • Whether your patched branch resolved the wedge end-to-end on the original workload, or only addresses related failure modes

Then please reference that Issue in this PR as the one being resolved. This will help the QA team reproduce the bugs independently and validate the fix. It will also keep the project's bug-tracking conventions consistent.

Overall, thanks for the great contribution. First-time PRs of this depth are rare!

@ekirimlioglu
Author

Filed #239 with the repro, environment, and end-to-end validation notes.

Small suggestion: CONTRIBUTING.md doesn't currently mention the file-an-issue-first step. Adding a short note under "Submitting a pull request" could help future first-time contributors.
