Make rollout engine state machine by fzyzcjy · Pull Request #940 · radixark/miles

fzyzcjy · 2026-04-07T09:47:49Z

To support starting engines in the background and waiting for them in the main loop, we have to differentiate "engine exists but has not yet finished initialization" from "engine is ready for a weight update". As a first step, this PR turns the engine into a naive state machine.

This reverts commit 48bdb00. # Conflicts: # miles/ray/rollout/rollout_server.py

gemini-code-assist

Code Review

This pull request introduces a ServerEngine class to manage the state of rollout engines, replacing the previous use of raw Ray actor handles and None values with a structured state machine. While this improves type safety and formalizes state transitions, the transition from Optional[ActorHandle] to ServerEngine objects has introduced potential runtime crashes. Specifically, accessing the actor_handle property on an unallocated engine triggers an AssertionError. Feedback highlights that the current implementation in rollout_manager.py and health_monitor.py fails to account for this, as previous null checks are now ineffective against the always-present ServerEngine instances.
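To make the failure mode concrete, here is a minimal sketch of what such a state machine could look like. Only the names `ServerEngine`, `actor_handle` and `is_allocated` come from this PR; the `EngineState` enum, the `is_ready` property and the `allocate` / `mark_ready` / `stop` methods are assumptions for illustration and may not match the actual implementation.

```python
from enum import Enum, auto
from typing import Any, Optional


class EngineState(Enum):
    # Hypothetical state names; the PR's real states may differ.
    STOPPED = auto()    # no actor exists
    ALLOCATED = auto()  # actor exists but has not finished initialization
    READY = auto()      # actor is initialized and can take weight updates


class ServerEngine:
    """Illustrative stand-in for the PR's ServerEngine, not the real class."""

    def __init__(self) -> None:
        self._state = EngineState.STOPPED
        self._actor_handle: Optional[Any] = None

    @property
    def is_allocated(self) -> bool:
        # Assumption: "allocated" covers both initializing and ready engines.
        return self._state is not EngineState.STOPPED

    @property
    def is_ready(self) -> bool:
        return self._state is EngineState.READY

    @property
    def actor_handle(self) -> Any:
        # Mirrors the behavior the review describes: reading the handle of
        # an unallocated engine fails loudly instead of returning None.
        if not self.is_allocated:
            raise AssertionError("engine is not allocated")
        return self._actor_handle

    def allocate(self, handle: Any) -> None:
        if self._state is not EngineState.STOPPED:
            raise AssertionError(f"cannot allocate from {self._state}")
        self._actor_handle = handle
        self._state = EngineState.ALLOCATED

    def mark_ready(self) -> None:
        if self._state is not EngineState.ALLOCATED:
            raise AssertionError(f"cannot mark ready from {self._state}")
        self._state = EngineState.READY

    def stop(self) -> None:
        self._actor_handle = None
        self._state = EngineState.STOPPED
```

Under this sketch the object itself is never `None`, so callers that used to test `engine is None` must test `engine.is_allocated` before touching `engine.actor_handle`.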

gemini-code-assist · 2026-04-07T09:50:09Z

miles/ray/rollout/rollout_manager.py

        """Return engines eligible for weight updates."""
        srv = self._get_updatable_server()
-        engines = srv.engines if srv else []
+        engines = [e.actor_handle for e in srv.engines] if srv else []


This list comprehension will raise an AssertionError if any engine in srv.engines is not in the Allocated state (e.g., if it has been stopped due to a health check failure). The previous implementation allowed None values in this list, and other properties like engine_gpu_counts and engine_gpu_offsets in RolloutServer still return values for all engines in the group. To maintain parallelism between these lists and avoid runtime crashes, you should return None for unallocated engines.

Suggested change

-        engines = [e.actor_handle for e in srv.engines] if srv else []
+        engines = [e.actor_handle if e.is_allocated else None for e in srv.engines] if srv else []
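The point about keeping the lists parallel can be illustrated with a small, self-contained sketch. `FakeEngine`, `handles_or_none`, the string handles and the GPU counts are all made up for this example; only the `actor_handle` / `is_allocated` pair and the list comprehension come from the suggestion above.

```python
from typing import Any, List, Optional


class FakeEngine:
    """Hypothetical stand-in exposing the two properties the review relies on."""

    def __init__(self, handle: Optional[Any] = None) -> None:
        self._handle = handle

    @property
    def is_allocated(self) -> bool:
        return self._handle is not None

    @property
    def actor_handle(self) -> Any:
        if self._handle is None:
            raise AssertionError("engine is not allocated")
        return self._handle


def handles_or_none(engines: List[FakeEngine]) -> List[Optional[Any]]:
    # Same shape as the suggested change: one slot per engine, None for
    # engines that are not allocated, so indices still line up.
    return [e.actor_handle if e.is_allocated else None for e in engines]


# The middle engine is unallocated, e.g. stopped after a failed health check.
engines = [FakeEngine("actor-0"), FakeEngine(), FakeEngine("actor-2")]
gpu_counts = [8, 8, 8]  # made-up per-engine values, indexed like `engines`

handles = handles_or_none(engines)

# Because both lists keep one entry per engine, zip() pairs them correctly
# and callers can filter out the None slots afterwards.
live = [(h, n) for h, n in zip(handles, gpu_counts) if h is not None]
```

Dropping unallocated engines from the list instead would also avoid the crash, but it would shift indices relative to the per-engine GPU lists, which is the mismatch the review warns about.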

gemini-code-assist · 2026-04-07T09:50:09Z

miles/utils/health_monitor.py


        try:
-            ray.get(engine.health_generate.remote(timeout=self._check_timeout))
+            ray.get(engine.actor_handle.health_generate.remote(timeout=self._check_timeout))


Accessing engine.actor_handle here will cause an AssertionError if the engine is not allocated. Since self._server_group.engines now returns a list of ServerEngine instances instead of Optional[ActorHandle], the existing check if engine is None at line 147 (not shown in this hunk) is no longer effective as the engine object itself is never None. You should update the health check logic to verify engine.is_allocated instead of checking for None.
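A minimal sketch of the guarded health check follows, written without Ray so it runs on its own. `FakeEngine`, `check_engines` and the `probe` callable are hypothetical; `probe` stands in for the `ray.get(engine.actor_handle.health_generate.remote(...))` call from the hunk, and the real monitor's control flow may differ.

```python
from typing import Any, Callable, List, Optional


class FakeEngine:
    """Hypothetical stand-in exposing is_allocated and actor_handle."""

    def __init__(self, handle: Optional[Any] = None) -> None:
        self._handle = handle

    @property
    def is_allocated(self) -> bool:
        return self._handle is not None

    @property
    def actor_handle(self) -> Any:
        if self._handle is None:
            raise AssertionError("engine is not allocated")
        return self._handle


def check_engines(
    engines: List[FakeEngine], probe: Callable[[Any], None]
) -> List[int]:
    """Return the indices of allocated engines whose probe raised."""
    failed = []
    for index, engine in enumerate(engines):
        # Replaces the old `if engine is None` test, which can no longer
        # fire because the wrapper object always exists.
        if not engine.is_allocated:
            continue
        try:
            probe(engine.actor_handle)
        except Exception:
            failed.append(index)
    return failed


def probe(handle: Any) -> None:
    # Made-up probe: pretend the engine with handle "bad" is unhealthy.
    if handle == "bad":
        raise RuntimeError("health check failed")


failed_indices = check_engines(
    [FakeEngine("ok"), FakeEngine(), FakeEngine("bad")], probe
)
```

Checking `is_allocated` before the `try` block keeps the `AssertionError` from being mistaken for a failed health check on an engine that was never running.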

fzyzcjy added 4 commits April 7, 2026 17:39

Revert "temp revert"

604bf1e

This reverts commit 48bdb00. # Conflicts: # miles/ray/rollout/rollout_server.py

more

14cf104

fmt

c9fe216

simp

1535dd3

fzyzcjy requested review from guapisolo, maocheng23 and yueming-yuan as code owners April 7, 2026 09:47

more

bae9929

gemini-code-assist bot reviewed Apr 7, 2026

fzyzcjy wants to merge 5 commits into rollout_ft/22 from rollout_ft/23
