Skip to content

Make rollout engine state machine#940

Open
fzyzcjy wants to merge 5 commits intorollout_ft/22from
rollout_ft/23
Open

Make rollout engine state machine#940
fzyzcjy wants to merge 5 commits intorollout_ft/22from
rollout_ft/23

Conversation

@fzyzcjy
Copy link
Copy Markdown
Collaborator

@fzyzcjy fzyzcjy commented Apr 7, 2026

to support starting in background and waiting in main loop, have to differentiate "engine exists but not yet finished initialization" vs "engine is ready to update-weight"; this pr firstly makes it a naive state machine

fzyzcjy added 4 commits April 7, 2026 17:39
This reverts commit 48bdb00.

# Conflicts:
#	miles/ray/rollout/rollout_server.py
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a ServerEngine class to manage the state of rollout engines, replacing the previous use of raw Ray actor handles and None values with a structured state machine. While this improves type safety and formalizes state transitions, the transition from Optional[ActorHandle] to ServerEngine objects has introduced potential runtime crashes. Specifically, accessing the actor_handle property on an unallocated engine triggers an AssertionError. Feedback highlights that the current implementation in rollout_manager.py and health_monitor.py fails to account for this, as previous null checks are now ineffective against the always-present ServerEngine instances.

"""Return engines eligible for weight updates."""
srv = self._get_updatable_server()
engines = srv.engines if srv else []
engines = [e.actor_handle for e in srv.engines] if srv else []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This list comprehension will raise an AssertionError if any engine in srv.engines is not in the Allocated state (e.g., if it has been stopped due to a health check failure). The previous implementation allowed None values in this list, and other properties like engine_gpu_counts and engine_gpu_offsets in RolloutServer still return values for all engines in the group. To maintain parallelism between these lists and avoid runtime crashes, you should return None for unallocated engines.

Suggested change
engines = [e.actor_handle for e in srv.engines] if srv else []
engines = [e.actor_handle if e.is_allocated else None for e in srv.engines] if srv else []


try:
ray.get(engine.health_generate.remote(timeout=self._check_timeout))
ray.get(engine.actor_handle.health_generate.remote(timeout=self._check_timeout))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Accessing engine.actor_handle here will cause an AssertionError if the engine is not allocated. Since self._server_group.engines now returns a list of ServerEngine instances instead of Optional[ActorHandle], the existing check if engine is None at line 147 (not shown in this hunk) is no longer effective as the engine object itself is never None. You should update the health check logic to verify engine.is_allocated instead of checking for None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant