Skip to content

[fix] Skip flush_cache in in_place mode and add fully async example#974

Open
maocheng23 wants to merge 2 commits intomainfrom
fix/fully_async_no_hang
Open

[fix] Skip flush_cache in in_place mode and add fully async example#974
maocheng23 wants to merge 2 commits intomainfrom
fix/fully_async_no_hang

Conversation

@maocheng23
Copy link
Copy Markdown
Contributor

Summary

  • Skip flush_cache call in in_place pause_generation mode — in fully async mode, flush is unnecessary and hangs because the engine never becomes fully idle while paused (the waiting queue still holds requests)
  • Add example script run_qwen3_30b_a3b_fully_async.py for Qwen3-30B-A3B fully async training with configurable pause-generation and weight-transfer modes

Test plan

  • Run fully async training with --pause-generation-mode in_place and verify no hang during weight update
  • Run with --pause-generation-mode retract to confirm flush_cache still executes

🤖 Generated with Claude Code

…mple

In fully async (in_place) mode, flush_cache is unnecessary and can hang
because the engine never becomes fully idle while paused. Skip it when
pause_generation_mode is "in_place".

Also adds an example script for Qwen3-30B-A3B fully async training with
configurable pause-generation and weight-transfer modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new script for running Qwen3-30B-A3B in a fully asynchronous mode and modifies the weight update logic to skip cache flushing when using in-place generation pausing. Feedback focuses on improving code quality by using idiomatic string comparisons, moving imports to the top level per PEP 8, and correctly implementing dynamic default values in dataclasses using field(default_factory=...) to prevent shared state across instances.

mode = self.args.pause_generation_mode
ray.get([engine.pause_generation.remote(mode=mode) for engine in self.rollout_engines])
ray.get([engine.flush_cache.remote() for engine in self.rollout_engines])
if mode not in ("in_place"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The expression mode not in ("in_place") is evaluated as a substring check because ("in_place") is a string literal, not a tuple. While this works for the current literal values, it is non-idiomatic and potentially confusing. Using a direct inequality check is clearer and more robust.

Suggested change
if mode not in ("in_place"):
if mode != "in_place":

Comment on lines +1 to +2
from dataclasses import dataclass
from typing import Literal
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Move the os import to the top level and include field from dataclasses to support dynamic default values for dataclass fields.

Suggested change
from dataclasses import dataclass
from typing import Literal
from dataclasses import dataclass, field
import os
from typing import Literal
References
  1. PEP 8: Imports should be at the top of the file, before any other code except module docstrings. (link)

@dataclass
class ScriptArgs(U.ExecuteTrainConfig):
mode: Literal["normal", "debug_minimal"] = "normal"
run_id: str = U.create_run_id()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Python dataclasses, dynamic default values should be defined using field(default_factory=...). Using a function call directly in the class definition assigns the result of that call at module load time, meaning all instances of ScriptArgs will share the same run_id generated when the script is first imported. Using a factory ensures a fresh ID is generated upon instantiation.

Suggested change
run_id: str = U.create_run_id()
run_id: str = field(default_factory=U.create_run_id)

Comment on lines +152 to +154
import os

fully_async_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The local import of os is unnecessary if moved to the top level, and os.path.join is redundant when called with a single argument.

Suggested change
import os
fully_async_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)))
fully_async_dir = os.path.dirname(os.path.abspath(__file__))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant