[fix] Skip flush_cache in in_place mode and add fully async example #974
maocheng23 wants to merge 2 commits into main
…mple In fully async (in_place) mode, flush_cache is unnecessary and can hang because the engine never becomes fully idle while paused. Skip it when pause_generation_mode is "in_place". Also adds an example script for Qwen3-30B-A3B fully async training with configurable pause-generation and weight-transfer modes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
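The control flow described in this commit message can be sketched without Ray. This is a minimal illustration, not the PR's code: `FakeEngine` and `pause_before_weight_update` are hypothetical names, and the real implementation dispatches `pause_generation` and `flush_cache` through `ray.get(...remote())` on actual rollout engines.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEngine:
    """Hypothetical stand-in for a rollout engine; it only records calls."""

    calls: list = field(default_factory=list)

    def pause_generation(self, mode: str) -> None:
        self.calls.append(f"pause_generation({mode})")

    def flush_cache(self) -> None:
        self.calls.append("flush_cache")


def pause_before_weight_update(engines: list, mode: str) -> None:
    """Pause every engine, then flush caches unless the mode is in_place."""
    for engine in engines:
        engine.pause_generation(mode)
    # In in_place mode the engine is assumed never to become fully idle,
    # so waiting on a flush could block indefinitely; skip it.
    if mode != "in_place":
        for engine in engines:
            engine.flush_cache()


in_place_engine = FakeEngine()
retract_engine = FakeEngine()
pause_before_weight_update([in_place_engine], "in_place")
pause_before_weight_update([retract_engine], "retract")
print(in_place_engine.calls)
print(retract_engine.calls)
```

Only the `retract` engine records a `flush_cache` call, which matches the behaviour the test plan checks for.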
Code Review
This pull request introduces a new script for running Qwen3-30B-A3B in a fully asynchronous mode and modifies the weight update logic to skip cache flushing when using in-place generation pausing. Feedback focuses on improving code quality by using idiomatic string comparisons, moving imports to the top level per PEP 8, and correctly implementing dynamic default values in dataclasses using field(default_factory=...) to prevent shared state across instances.
| mode = self.args.pause_generation_mode | ||
| ray.get([engine.pause_generation.remote(mode=mode) for engine in self.rollout_engines]) | ||
| ray.get([engine.flush_cache.remote() for engine in self.rollout_engines]) | ||
| if mode not in ("in_place"): |
The expression mode not in ("in_place") is evaluated as a substring check, because ("in_place") is a parenthesized string literal, not a tuple (a one-element tuple would need a trailing comma: ("in_place",)). It gives the right answer for the current mode values, but any substring of "in_place", such as "in" or the empty string, would also be treated as in_place and skip the flush. A direct inequality check is clearer and more robust.
| if mode not in ("in_place"): | |
| if mode != "in_place": |
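A short check makes the difference concrete. The two helper functions below are hypothetical names introduced only for this comparison; `"retract"` comes from the PR's test plan, while `"in"` and `""` are made-up inputs chosen to expose the substring behaviour.

```python
def should_flush_substring(mode: str) -> bool:
    # ("in_place") is just the string "in_place", so this is str containment.
    return mode not in ("in_place")


def should_flush_equality(mode: str) -> bool:
    # Direct comparison: only the exact value "in_place" skips the flush.
    return mode != "in_place"


for mode in ["in_place", "retract", "in", ""]:
    print(repr(mode), should_flush_substring(mode), should_flush_equality(mode))
```

Both versions agree on `"in_place"` and `"retract"`, but the substring version also skips the flush for `"in"` and `""`, because each of those is a substring of `"in_place"`.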
| from dataclasses import dataclass | ||
| from typing import Literal |
Move the os import to the top level and include field from dataclasses to support dynamic default values for dataclass fields.
| from dataclasses import dataclass | |
| from typing import Literal | |
| from dataclasses import dataclass, field | |
| import os | |
| from typing import Literal |
References
- PEP 8: Imports go at the top of the file, just after any module comments and docstrings, and before module globals and constants.
| @dataclass | ||
| class ScriptArgs(U.ExecuteTrainConfig): | ||
| mode: Literal["normal", "debug_minimal"] = "normal" | ||
| run_id: str = U.create_run_id() |
In Python dataclasses, dynamic default values should be defined using field(default_factory=...). Using a function call directly in the class definition assigns the result of that call at module load time, meaning all instances of ScriptArgs will share the same run_id generated when the script is first imported. Using a factory ensures a fresh ID is generated upon instantiation.
| run_id: str = U.create_run_id() | |
| run_id: str = field(default_factory=U.create_run_id) |
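The difference can be reproduced in isolation. In this sketch `create_run_id` is a hypothetical counter-based stand-in for `U.create_run_id` (whose real implementation is not shown in the PR), and the two dataclasses are illustrative names rather than the script's `ScriptArgs`.

```python
import itertools
from dataclasses import dataclass, field

_counter = itertools.count()


def create_run_id() -> str:
    """Hypothetical stand-in for U.create_run_id: a new ID on every call."""
    return f"run-{next(_counter)}"


@dataclass
class SharedDefault:
    # The call runs once, when the class body is executed at import time.
    run_id: str = create_run_id()


@dataclass
class FreshDefault:
    # The factory is called each time an instance is created without run_id.
    run_id: str = field(default_factory=create_run_id)


a, b = SharedDefault(), SharedDefault()
c, d = FreshDefault(), FreshDefault()
print(a.run_id, b.run_id)
print(c.run_id, d.run_id)
```

Here `a` and `b` share one ID while `c` and `d` each get their own. For a script that builds a single `ScriptArgs` per process the visible effect may be small, but the factory form avoids surprises if the class is imported or instantiated more than once.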
| import os | ||
|
|
||
| fully_async_dir = os.path.join(os.path.dirname(os.path.abspath(__file__))) |
Summary
- Skip the flush_cache call in in_place pause_generation mode: in fully async mode, flush is unnecessary and hangs because the engine never becomes fully idle while paused (the waiting queue still holds requests)
- Add run_qwen3_30b_a3b_fully_async.py for Qwen3-30B-A3B fully async training with configurable pause-generation and weight-transfer modes
Test plan
- Run with --pause-generation-mode in_place and verify no hang during weight update
- Run with --pause-generation-mode retract to confirm flush_cache still executes
🤖 Generated with Claude Code