[WIP] [feature] single controller: add train controller #414

daihaowz · 2025-10-02T06:28:44Z

No description provided.

gemini-code-assist · 2025-10-02T06:29:02Z

Summary of Changes

Hello @daihaowz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for a distributed training system by introducing a new DistributedTrainController. This controller manages the lifecycle of distributed training jobs, from allocating resources to initializing and communicating with individual training engines. To support this, core API definitions for scheduling and engine configuration have been updated to be more flexible and expressive. Furthermore, new utility functions have been implemented to enhance the system's reliability in handling distributed operations, particularly concerning connection stability and concurrent task execution.

Highlights

New Distributed Training Controller: Introduced the DistributedTrainController class, which is responsible for orchestrating distributed training workflows, including worker allocation, engine initialization, and RPC communication among workers.
Scheduler API Refactoring: scheduler_api.py has undergone significant changes, including renaming SchedulingConfig to Job and updating it to accept a list of Scheduling tasks. The Worker dataclass now explicitly defines serve_port and extra_ports, and ScheduleStrategy uses Literal for stricter type hinting of strategy types.
Engine API Enhancements: The Scheduling dataclass in engine_api.py now includes port_count and an optional cmd field. Additionally, the get_scheduling_config method has been updated to return a list of Scheduling configurations, allowing for more complex resource definitions.
Robust Utility Functions: New utility functions create_engine_with_retry and rpc_call have been added. create_engine_with_retry ensures resilient engine initialization by retrying connections, while rpc_call facilitates concurrent remote procedure calls to workers, leveraging a new wait_future_ordered function for ordered result processing and robust exception handling.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a DistributedTrainController and related utilities, which is a significant step towards a single controller architecture. The changes also include refactoring the scheduler and engine APIs for better clarity and consistency. My review focuses on improving the robustness and maintainability of the new components. I've identified a critical bug related to uninitialized state, some high-severity design issues in error handling that could lead to unexpected process termination, and several medium-severity suggestions to improve code consistency and clarity through type hints and documentation.

areal/controller/train_controller.py

gemini-code-assist · 2025-10-02T06:30:46Z

areal/controller/utils.py

+            for worker in workers
+        ]
+        try:
+            results = wait_future_ordered(futures, exit_on_exception=True)


The call to wait_future_ordered uses exit_on_exception=True, which will cause the entire process to terminate via os.kill if any of the RPC calls fail. This is an aggressive error handling strategy that prevents any cleanup or higher-level exception management. Furthermore, it makes the surrounding try...except block misleading, as the except block will never be reached upon an exception within a future. It's better to let exceptions propagate by setting exit_on_exception=False (which is the default) and allow the caller of rpc_call to handle failures gracefully.

Suggested change

-            results = wait_future_ordered(futures, exit_on_exception=True)
+            results = wait_future_ordered(futures)
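
To illustrate the point about letting the caller handle failures, here is a minimal sketch of a concurrent RPC helper that propagates exceptions instead of terminating the process. It is not the PR's implementation: `FakeScheduler` and `safe_initialize` are hypothetical names, the helper takes plain worker IDs rather than worker objects, and `RuntimeError` is only an assumed failure type.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeScheduler:
    """Hypothetical stand-in for the real scheduler; fails for worker 'w1'."""

    def call_engine(self, worker_id, method, *args, **kwargs):
        if worker_id == "w1":
            raise RuntimeError(f"{method} failed on {worker_id}")
        return f"{worker_id}:{method}"


def rpc_call(scheduler, worker_ids, method, *args, **kwargs):
    """Call `method` on every worker concurrently; results keep worker order.

    Any exception raised by a worker call propagates to the caller.
    """
    # ThreadPoolExecutor rejects max_workers <= 0, so guard the empty case.
    with ThreadPoolExecutor(max_workers=max(1, len(worker_ids))) as pool:
        futures = [
            pool.submit(scheduler.call_engine, wid, method, *args, **kwargs)
            for wid in worker_ids
        ]
        # Future.result() re-raises the exception from the worker thread.
        return [future.result() for future in futures]


def safe_initialize(scheduler, worker_ids):
    """Caller-side handling: report the failure instead of killing the process."""
    try:
        return rpc_call(scheduler, worker_ids, "initialize")
    except RuntimeError as exc:
        # Cleanup (for example, releasing workers) could happen here.
        return f"handled: {exc}"
```

With this shape, the `except` branch in the caller is actually reachable, which is what the review asks for.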

gemini-code-assist · 2025-10-02T06:30:46Z

areal/utils/http.py

+            logger.warning(traceback.format_exc())
+            if exit_on_exception:
+                logger.info("Exiting due to exception in future.")
+                os.kill(os.getpid(), signal.SIGTERM)


Using os.kill(os.getpid(), signal.SIGTERM) for exception handling is a very aggressive approach. It abruptly terminates the entire process, which can lead to resource leaks, prevent proper cleanup, and make the function difficult to reuse safely in different contexts. It's generally better to propagate exceptions to the caller, allowing them to decide on the appropriate error handling strategy. The exit_on_exception flag provides an option, but terminating the process should be avoided in a library/utility function.
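
One way to follow this advice is sketched below: a version of `wait_future_ordered` that logs the traceback, cancels work that has not started, and re-raises. Only the logging call and the function name come from the diff; the rest of the body is an assumption about how such a helper could look, not the code in areal/utils/http.py.

```python
import logging
import traceback
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, List

logger = logging.getLogger(__name__)


def wait_future_ordered(futures: List[Future]) -> List[Any]:
    """Return results in submission order; re-raise the first failure."""
    results = []
    for future in futures:
        try:
            results.append(future.result())
        except Exception:
            logger.warning(traceback.format_exc())
            # Cancel futures that have not started; running ones are unaffected.
            for pending in futures:
                pending.cancel()
            # Propagate so the caller chooses between retry, cleanup, or exit.
            raise
    return results
```

A caller that really does want to stop the process can still do so in its own `except` block, after releasing resources.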

gemini-code-assist · 2025-10-02T06:30:46Z

areal/api/scheduler_api.py


 class Scheduler(abc.ABC):
-    def create_workers(self, worker_key, scheduler_config, *args, **kwargs) -> str:
+    def create_workers(self, job: Job, *args, **kwargs):


The create_workers method is missing a return type annotation. Based on its usage in areal/controller/train_controller.py (where its return value is assigned to self.uid), it is expected to return a string representing the job ID. Adding the -> str type hint improves code clarity and enables better static analysis.

Suggested change

-    def create_workers(self, job: Job, *args, **kwargs):
+    def create_workers(self, job: Job, *args, **kwargs) -> str:
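
As a rough illustration of the annotated interface, the sketch below pairs the abstract method with a trivial subclass. The `Job` fields, the `@abc.abstractmethod` decorator, and `LocalScheduler` are assumptions made for the example; the thread does not show the real definitions.

```python
import abc
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """Hypothetical stand-in; the real Job fields are not shown in this thread."""

    name: str = "train"
    tasks: List[str] = field(default_factory=list)


class Scheduler(abc.ABC):
    @abc.abstractmethod
    def create_workers(self, job: Job, *args, **kwargs) -> str:
        """Create workers for `job` and return the job ID."""


class LocalScheduler(Scheduler):
    """Hypothetical implementation used only to exercise the return type."""

    def create_workers(self, job: Job, *args, **kwargs) -> str:
        return f"{job.name}-0"
```

With the `-> str` hint in place, a static checker can verify assignments such as `self.uid = scheduler.create_workers(job)`.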

gemini-code-assist · 2025-10-02T06:30:46Z

areal/controller/train_controller.py

+        self.uid: str
+        self.workers: List[Worker]
+
+        # todo: delete this method


This todo comment is vague. It's unclear if the method create_process_group should be deleted or if there's a task related to it. Please clarify the intent of this todo by adding more details or remove it if it's obsolete to improve code clarity.

gemini-code-assist · 2025-10-02T06:30:46Z

areal/controller/utils.py

+def create_engine_with_retry(
+    create_engine_func, max_retries=60, retry_delay=10, *args, **kwargs
+):


This function is missing type hints for its parameters. Adding type hints for create_engine_func, max_retries, and retry_delay would improve readability and allow for static analysis. You will need to add from typing import Any, Callable.

Suggested change

-def create_engine_with_retry(
-    create_engine_func, max_retries=60, retry_delay=10, *args, **kwargs
-):
+def create_engine_with_retry(
+    create_engine_func: Callable[..., Any], max_retries: int = 60, retry_delay: int = 10, *args, **kwargs
+) -> Any:
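
A typed retry helper along these lines might look like the sketch below. It deliberately differs from the PR's signature: `max_retries` and `retry_delay` are keyword-only here, because in the original ordering any extra positional argument would bind to `max_retries` before reaching `*args`. Treating `ConnectionError` as the retryable error is an assumption, since the function body is not shown in this thread.

```python
import time
from typing import Any, Callable


def create_engine_with_retry(
    create_engine_func: Callable[..., Any],
    *args: Any,
    max_retries: int = 60,
    retry_delay: float = 10,
    **kwargs: Any,
) -> Any:
    """Call `create_engine_func(*args, **kwargs)` until it succeeds."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return create_engine_func(*args, **kwargs)
        except ConnectionError as exc:  # assumed retryable error type
            last_exc = exc
            # Sleep only when another attempt will follow.
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise RuntimeError(
        f"engine creation failed after {max_retries} attempts"
    ) from last_exc
```

Raising after the final attempt, rather than exiting, keeps the decision about what to do next with the caller.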

gemini-code-assist · 2025-10-02T06:30:46Z

areal/controller/utils.py

+    """
+    工具方法：并发RPC调用
+
+    :param scheduler: 调度器对象, 必须有 call_engine(worker_id, method, *args, **kwargs)
+    :param workers: 可遍历的worker列表，每个worker应有 worker.id 属性
+    :param method: 方法名字符串
+    :param args: 传递给call_engine的*args
+    :param kwargs: 传递给call_engine的**kwargs
+    :return: results
+    """


The docstring for this function is in Chinese, while the rest of the codebase is in English. For consistency and to make it accessible to all contributors, please translate the docstring to English.

Suggested change

-    """
-    工具方法：并发RPC调用
-
-    :param scheduler: 调度器对象, 必须有 call_engine(worker_id, method, *args, **kwargs)
-    :param workers: 可遍历的worker列表，每个worker应有 worker.id 属性
-    :param method: 方法名字符串
-    :param args: 传递给call_engine的*args
-    :param kwargs: 传递给call_engine的**kwargs
-    :return: results
-    """
+    """
+    Utility method for concurrent RPC calls.
+
+    :param scheduler: Scheduler object, must have a `call_engine(worker_id, method, *args, **kwargs)` method.
+    :param workers: An iterable list of workers, where each worker should have a `worker.id` attribute.
+    :param method: The name of the method to call as a string.
+    :param args: Positional arguments to pass to `call_engine`.
+    :param kwargs: Keyword arguments to pass to `call_engine`.
+    :return: A list of results from the RPC calls.
+    """

github-actions · 2025-10-17T01:13:40Z

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

Saingsophea · 2025-10-17T01:27:29Z

> This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!
How to start this comment

Saingsophea · 2025-10-17T01:28:05Z

0x7e43d3a147f66a953979e4272f0368dac3a5c826>

daihaowz had a problem deploying to AReaL-unittests October 2, 2025 06:28 — with GitHub Actions Error

gemini-code-assist bot reviewed Oct 2, 2025


single controller: add train controller

008bd7a

daihaowz force-pushed the main branch from 28af941 to 008bd7a Compare October 2, 2025 10:00

daihaowz had a problem deploying to AReaL-unittests October 2, 2025 10:00 — with GitHub Actions Error

github-actions bot added the stale label Oct 17, 2025

Saingsophea mentioned this pull request Oct 17, 2025

## Summary of Changes ton-blockchain/wallet-contract#431

Closed

github-actions bot removed the stale label Oct 18, 2025

daihaowz closed this Oct 29, 2025
