Fix/eval scheduler actor wedge #238

Open

ekirimlioglu wants to merge 5 commits into RapidFireAI:main from ekirimlioglu:fix/eval-scheduler-actor-wedge

Conversation

@ekirimlioglu commented May 1, 2026

Changes

  • Fixes a deterministic wedge in Controller.run_multi_pipeline_inference that caused multi-pipeline inference to hang at 0% GPU/CPU. Two independent triggers were closed:
    a. An exception during the batch-submission/bookkeeping block left the actor marked busy by PipelineScheduler.schedule() with no path to free it. The block is now wrapped in a try/except that mirrors the existing init-failure handler: it marks the pipeline FAILED, drops the partial active_tasks entry, and calls scheduler.remove_pipeline so the actor is released (sketched below).
    b. Actor death after dispatch (OOM, segfault) left orphaned futures, and the controller's busy branch was a blind time.sleep(0.5). Replaced with ray.wait(all_pending, num_returns=1, timeout=0.5) so failed futures surface and the existing reap path frees the actor on the next iteration.
  • No API or behavior changes for healthy runs. No breaking changes.
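
A minimal sketch of the dispatch guard in fix (a). The function boundary, the active_tasks shape, and the db helpers shown here are assumptions for illustration, not the actual controller internals:

    def dispatch_batches(actor, actor_id, pipeline_id, batches,
                         active_tasks, scheduler, db):
        # Hedged sketch of fix (a); every name here is an assumption.
        try:
            futures = [actor.process_batch.remote(b) for b in batches]
            active_tasks[actor_id] = {"pipeline_id": pipeline_id,
                                      "futures": futures}
            db.set_actor_task_started(actor_id, pipeline_id)  # hypothetical write
        except Exception as exc:
            # Mirror the existing init-failure handler: mark FAILED, drop the
            # partial entry, and free the actor so the loop can continue.
            db.set_pipeline_failed(pipeline_id, str(exc))  # hypothetical write
            active_tasks.pop(actor_id, None)
            scheduler.remove_pipeline(pipeline_id)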

Changelog Content

Additions

  • Unit tests covering PipelineScheduler busy/free bookkeeping invariants, including a regression test for the actor-leak wedge (tests/test_pipeline_scheduler.py).

Changes

  • Busy-loop in evals controller now blocks on ray.wait instead of time.sleep, so dead-actor futures are reaped promptly.

Fixes

  • Prevent indefinite hang in evals multi-pipeline inference when batch dispatch raises synchronously or an actor dies after dispatch.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this change manually
  • I have tested this change in the following environments:
    • Local development
    • Docker environment
    • Other: 10-config grid run that previously wedged on a multi-GPU host

Screenshots (if applicable)

N/A — backend-only change.

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (no user-facing docs affected)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Note

Medium Risk
Touches the core multi-pipeline scheduling loop and Ray waiting behavior; mistakes could reintroduce hangs or change throughput under load, though changes are targeted to failure paths.

Overview
Prevents Controller.run_multi_pipeline_inference from wedging when an actor dies after dispatch or batch dispatch/bookkeeping throws synchronously.

The busy-actor branch now blocks on ray.wait over pending (not-yet-ready) futures instead of time.sleep, so failed futures surface promptly. Batch submission + task bookkeeping is wrapped in a try/except that marks the pipeline/task as FAILED and calls scheduler.remove_pipeline(...) to ensure the actor is freed.

Adds unit tests asserting PipelineScheduler’s busy/free invariants and a regression case showing how missing remove_pipeline can leave actors stuck busy.

Reviewed by Cursor Bugbot for commit 5728903.

Two changes to evals/scheduling/controller.run_multi_pipeline_inference
that close a deterministic hang we hit on a 10-config grid run.

Symptom: MainThread parks in `time.sleep(0.5)` at the "Check if all
actors busy" branch while Ray reports 0% GPU/CPU usage. py-spy stack
shows the loop spinning; the bookkeeping says actors are busy but Ray
says they are idle. The hang is permanent.

Root cause has two triggers:

(1) Exception during the batch-submission block. The init step
    (`actor.initialize_for_pipeline.remote(...)`) is already wrapped
    by an existing try/except that calls `scheduler.remove_pipeline`,
    but the subsequent `actor.process_batch.remote(...)` loop, the
    `active_tasks[actor_id] = {...}` assignment, and the three
    `db.set_actor_task_*` writes are not. If any of these raise
    synchronously (serialization error, transient DB hiccup, etc.),
    the actor, already marked busy by `scheduler.schedule()`, leaks
    busy state: the controller never calls `set_completed_task` or
    `remove_pipeline` for it.

(2) Actor death after dispatch. Once batches are submitted and tracked
    in `active_tasks[actor_id]["futures"]`, the only completion-reaping
    path is `ray.wait(futures, timeout=0)` -- a non-blocking poll. If
    the actor process dies (OOM-kill, segfault from a broken native
    dep, etc.) the futures become orphaned: Ray will eventually surface
    them as failed, but `ray.wait(timeout=0)` only catches what is
    already ready in this Python tick. The controller falls into the
    "all actors busy" branch, sleeps 0.5s with no health check, and
    repeats forever.
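
For illustration, a sketch of the reap path's semantics as described above (shapes assumed; `on_actor_death` stands in for the controller's existing failure handler). A dead actor's refs eventually become "ready" to ray.wait, and ray.get on them raises RayActorError, but a timeout=0 poll only sees what has already resolved in this tick:

    import ray
    from ray.exceptions import RayActorError

    def reap_ready(futures, on_actor_death):
        # Non-blocking snapshot: only refs already resolved right now.
        ready, pending = ray.wait(futures, num_returns=len(futures), timeout=0)
        for ref in ready:
            try:
                ray.get(ref)  # a dead actor's ref is "ready" and raises here
            except RayActorError:
                on_actor_death(ref)  # e.g. mark FAILED + remove_pipeline
        return pending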

Fixes:

- Wrap the batch-submission + bookkeeping block in try/except that
  mirrors the existing init-failure handler: mark the pipeline FAILED,
  drop any partial active_tasks entry, call `scheduler.remove_pipeline`
  so the actor is freed, and continue.

- Replace the bare `time.sleep(0.5)` in the busy-loop with
  `ray.wait(all_pending, num_returns=1, timeout=0.5)`. ray.wait
  surfaces both successful and failed futures (including
  RayActorError from dead actors), so the next iteration's reap path
  at the "Check for completed tasks" block frees the actor via the
  existing exception handler. Falls back to time.sleep when no futures
  are dispatched yet (initial loop entry).

The two changes are independent and complementary: either alone fixes
a subset of triggers; together they close the wedge.

Adds tests/test_pipeline_scheduler.py with 6 unit tests for the busy/free
transitions in PipelineScheduler. They document the invariant the
controller's scheduling loop relies on: an actor is marked busy by
schedule() and MUST be freed by either set_completed_task() (success)
or remove_pipeline() (failure). If the controller dispatch path fails
without calling one of those, the actor leaks busy state and the loop
wedges.
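
A condensed sketch of that invariant as a test; the constructor, import path, and all-actors-busy return convention shown here are assumptions based on this description, not the real PipelineScheduler API:

    from rapidfireai.evals.scheduling import PipelineScheduler  # module path assumed

    def test_remove_pipeline_frees_actor_after_failed_dispatch():
        scheduler = PipelineScheduler(actor_ids=["a0"],
                                      pipeline_ids=["p0", "p1"])
        actor_id, pipeline_id = scheduler.schedule()  # marks a0 busy
        assert actor_id == "a0"

        # Buggy path: dispatch fails and the controller never calls
        # set_completed_task() or remove_pipeline().
        assert scheduler.schedule() is None  # wedged: all actors busy

        scheduler.remove_pipeline(pipeline_id)   # the fix: frees a0
        assert scheduler.schedule() is not None  # scheduling resumes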

Also adds tools/repro_scheduler_wedge.py, a Ray-free standalone
reproduction of the bug. Scenario 1 simulates the buggy controller
path (dispatch fails, no remove_pipeline call) and detects the wedge
by counting consecutive "all actors busy" returns from schedule().
Scenario 2 demonstrates recovery via remove_pipeline. Useful for
maintainers to confirm the bookkeeping contract without spinning up
the full evals stack.

Existing tests on this branch:
  pytest tests/test_pipeline_scheduler.py  ->  6 passed
  python tools/repro_scheduler_wedge.py    ->  exits 0, prints
    "BUG REPRODUCED + FIX VERIFIED"
Tighten the new comments in controller.py, drop narrative docstrings
from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py
output. No behavior change.

Drop the standalone tools/repro_scheduler_wedge.py and add the wedge
scenario as a regression test in test_pipeline_scheduler.py. The unit
test encodes the same invariant deterministically and runs in CI, so
the standalone script no longer pulls its weight.
@ekirimlioglu marked this pull request as ready for review May 1, 2026 05:40

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6e7e220.

Comment thread on rapidfireai/evals/scheduling/controller.py (outdated)
…ompletion

The previous busy-branch passed every in-flight batch future to
ray.wait(num_returns=1, timeout=0.5). ray.wait is a snapshot of
readiness, not an event consumer -- once any single batch resolved
(which happens early in a multi-batch task), that already-ready ref
permanently satisfied num_returns=1 and the 0.5s timeout never
engaged. Combined with the reap path only draining when all of an
actor's futures are ready, the all-actors-busy branch hot-spun the
IC poll and scheduler.schedule() at thousands of iterations per second
instead of the original two per second.

Partition first via a non-blocking ray.wait and only block on the
not-yet-ready set. Dead-actor detection is preserved: a
RayActorError-bearing future is "ready" to ray.wait, lands in the
ready partition, and gets drained by the next iteration's reap path
via the existing handler at controller.py:1312.
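
A sketch of the partition-first wait this commit describes (the active_tasks shape and function boundary are assumptions):

    import time
    import ray

    def wait_for_progress(active_tasks, timeout_s=0.5):
        all_futures = [f for task in active_tasks.values()
                       for f in task["futures"]]
        if not all_futures:
            time.sleep(timeout_s)  # nothing dispatched yet: old fallback
            return
        # Non-blocking snapshot partitions ready vs not-yet-ready refs.
        _ready, pending = ray.wait(all_futures,
                                   num_returns=len(all_futures), timeout=0)
        if pending:
            # Block only on not-yet-ready refs, so an already-resolved batch
            # cannot satisfy num_returns=1 instantly and turn the 0.5s wait
            # into a hot spin. A RayActorError-bearing ref resolves here too;
            # the next iteration's reap path drains it.
            ray.wait(pending, num_returns=1, timeout=timeout_s)
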
@arun-rfai
Collaborator

@ekirimlioglu Thanks for the detailed diagnosis and patch! I verified that both failure modes are real and your fix follows the existing init-failure handler pattern, which is the right shape.

Before the RapidFire AI team can proceed with reviewing and merging this, please open a GitHub Issue first describing the bugs, following the repo's Issue conventions. Specifically:

  • Repro steps (minimal sequence to trigger the wedge on main)
  • Minimal set of configs that reliably reproduces it
  • Machine environment (OS, Python version, Ray version, hardware — CPU/GPU, cores, memory)
  • Workload characteristics (which model API calls, which embedding model, batch sizes, num actors, etc.)
  • Observed symptoms (logs, stdout, how you confirmed the wedge vs. slow progress)
  • Whether the wedge is deterministic or probabilistic on your setup
  • Whether your patched branch resolved the wedge end-to-end on the original workload, or only addresses related failure modes

Then please reference that Issue in this PR as the one being resolved. This will help the QA team reproduce the bugs independently and validate the fix. It will also keep the project's bug-tracking conventions consistent.

Overall, thanks for the great contribution. First-time PRs of this depth are rare!

@ekirimlioglu
Author

Filed #239 with the repro, environment, and end-to-end validation notes.

Small suggestion: CONTRIBUTING.md doesn't currently mention the file-an-issue-first step. Adding a short note under "Submitting a pull request" could help future first-time contributors.
