Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a ServerEngine class to manage the state of rollout engines, replacing the previous use of raw Ray actor handles and None values with a structured state machine. While this improves type safety and formalizes state transitions, the transition from Optional[ActorHandle] to ServerEngine objects has introduced potential runtime crashes. Specifically, accessing the actor_handle property on an unallocated engine triggers an AssertionError. Feedback highlights that the current implementation in rollout_manager.py and health_monitor.py fails to account for this, as previous null checks are now ineffective against the always-present ServerEngine instances.
| """Return engines eligible for weight updates.""" | ||
| srv = self._get_updatable_server() | ||
| engines = srv.engines if srv else [] | ||
| engines = [e.actor_handle for e in srv.engines] if srv else [] |
There was a problem hiding this comment.
This list comprehension will raise an AssertionError if any engine in srv.engines is not in the Allocated state (e.g., if it has been stopped due to a health check failure). The previous implementation allowed None values in this list, and other properties like engine_gpu_counts and engine_gpu_offsets in RolloutServer still return values for all engines in the group. To maintain parallelism between these lists and avoid runtime crashes, you should return None for unallocated engines.
| engines = [e.actor_handle for e in srv.engines] if srv else [] | |
| engines = [e.actor_handle if e.is_allocated else None for e in srv.engines] if srv else [] |
|
|
||
| try: | ||
| ray.get(engine.health_generate.remote(timeout=self._check_timeout)) | ||
| ray.get(engine.actor_handle.health_generate.remote(timeout=self._check_timeout)) |
There was a problem hiding this comment.
Accessing engine.actor_handle here will cause an AssertionError if the engine is not allocated. Since self._server_group.engines now returns a list of ServerEngine instances instead of Optional[ActorHandle], the existing check if engine is None at line 147 (not shown in this hunk) is no longer effective as the engine object itself is never None. You should update the health check logic to verify engine.is_allocated instead of checking for None.
to support starting in background and waiting in main loop, have to differentiate "engine exists but not yet finished initialization" vs "engine is ready to update-weight"; this pr firstly makes it a naive state machine