Collaborator

@kilinchange kilinchange commented Oct 14, 2025

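Implements the process launcher infini_run and the communication-group abstraction ProcessGroup, providing the foundation for multi-node distributed training and multi-dimensional parallel training.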

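  • Process launcher infini_run: similar to PyTorch's torchrun, it launches multi-process training jobs and configures their environment variables. Command-line arguments specify the number of nodes, the number of processes per node, the node rank, and the rendezvous address. For example:
    ./infini_run --nnodes=1 --nproc_per_node=1 --node_rank=0 --rdzv_endpoint=127.0.0.1:29500 -- ./llama3 --arg1 .. --arg2 ...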

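  • Communication backend abstraction ProcessGroup: adds a foundational distributed-communication component that wraps initialization and management of the NCCL backend behind a single interface.

    • Added the ProcessGroup class and its helper functions;
    • Implemented the ProcessGroupFactory factory class, which provides thread-safe creation and lookup of process groups as well as a default process group;
    • Added a generic Rank class that abstracts the hierarchical rank information of processes and threads;
    • Moved the communicator (comm) management logic from the Device class into ProcessGroup; Device now keeps only its own rank information;
    • Communication operators are now invoked through ProcessGroup member functions instead of the dispatcher, so the communication context can be specified explicitly or implicitly (via the default group);
    • Added a global class that centrally manages and exposes environment variables, providing global context for communication initialization;
    • Updated main.cc to fit the ProcessGroup-based distributed training flow.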
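
To make the launcher bullet concrete, here is a minimal sketch of how the flags shown above could be turned into per-process environment variables. It is not taken from the PR: `LaunchConfig` and `BuildChildEnv` are hypothetical names, and the variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) follow torchrun conventions as an assumption; infini_run may export different ones.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical configuration mirroring the flags in the example command.
struct LaunchConfig {
    int nnodes = 1;
    int nproc_per_node = 1;
    int node_rank = 0;
    std::string rdzv_endpoint = "127.0.0.1:29500";
};

// Builds the environment for the child process with the given local rank.
// Variable names are assumed (torchrun-style), not confirmed by the PR.
std::map<std::string, std::string> BuildChildEnv(const LaunchConfig &cfg, int local_rank) {
    if (local_rank < 0 || local_rank >= cfg.nproc_per_node) {
        throw std::out_of_range("local_rank must be in [0, nproc_per_node)");
    }
    // Split "host:port" at the last colon.
    const auto colon = cfg.rdzv_endpoint.rfind(':');
    if (colon == std::string::npos) {
        throw std::invalid_argument("rdzv_endpoint must have the form host:port");
    }
    const int world_size = cfg.nnodes * cfg.nproc_per_node;
    // Processes on earlier nodes occupy the lower global ranks.
    const int global_rank = cfg.node_rank * cfg.nproc_per_node + local_rank;
    return {
        {"WORLD_SIZE", std::to_string(world_size)},
        {"RANK", std::to_string(global_rank)},
        {"LOCAL_RANK", std::to_string(local_rank)},
        {"MASTER_ADDR", cfg.rdzv_endpoint.substr(0, colon)},
        {"MASTER_PORT", cfg.rdzv_endpoint.substr(colon + 1)},
    };
}
```

A launcher along these lines would call `BuildChildEnv` once per local rank before spawning the training binary, so that each child can read its rank information from the environment.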

@kilinchange kilinchange force-pushed the feature/distributed-launch branch 5 times, most recently from 18b2aec to 1b5edc1 Compare October 16, 2025 15:44
…rocess group, and update communication operators to be dispatched via ProcessGroup member functions.
- Add helper functions for ProcessGroup management
- Update main.cc to support ProcessGroup-based training workflow
- Move NCCL communicator from Device to ProcessGroup; maintain rank information in Device
- Update communication operators to call ProcessGroup member functions; pg parameter is optional (uses default group if null)
- Introduce a generic Rank class for rank abstraction
- Add global class to manage and access environment variables
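
The factory and the optional `pg` parameter described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the `ProcessGroup` here is a stand-in without any NCCL state, and `Instance`, `GetOrCreate`, `GetDefault`, the group name `"default"`, and `ParticipantCount` are hypothetical names.

```cpp
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the real ProcessGroup, which would own the NCCL communicators.
class ProcessGroup {
public:
    explicit ProcessGroup(std::vector<int> ranks) : ranks_(std::move(ranks)) {}
    int Size() const { return static_cast<int>(ranks_.size()); }

private:
    std::vector<int> ranks_;
};

// Hypothetical factory: a mutex guards the registry so that concurrent
// callers asking for the same name receive the same group.
class ProcessGroupFactory {
public:
    static ProcessGroupFactory &Instance() {
        static ProcessGroupFactory factory;
        return factory;
    }

    const ProcessGroup *GetOrCreate(const std::string &name, const std::vector<int> &ranks) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = groups_.find(name);
        if (it == groups_.end()) {
            it = groups_.emplace(name, std::make_unique<ProcessGroup>(ranks)).first;
        }
        return it->second.get();
    }

    // Returns the group registered as "default", or nullptr if none exists yet.
    const ProcessGroup *GetDefault() {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = groups_.find("default");
        return it == groups_.end() ? nullptr : it->second.get();
    }

private:
    ProcessGroupFactory() = default;
    std::mutex mutex_;
    std::unordered_map<std::string, std::unique_ptr<ProcessGroup>> groups_;
};

// Shows the "pg is optional" convention: a null pg falls back to the default group.
int ParticipantCount(const ProcessGroup *pg = nullptr) {
    const ProcessGroup *group = pg ? pg : ProcessGroupFactory::Instance().GetDefault();
    if (group == nullptr) {
        throw std::runtime_error("no default process group has been created");
    }
    return group->Size();
}
```

In this sketch a collective called without a `pg` argument operates on the default group, while passing a specific group restricts it to that group's ranks.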
@kilinchange
Collaborator Author

llama3 ddp:
image
gpt2 ddp:
image

@kilinchange
Collaborator Author

Launching llama3 ddp with infini_run:
image
Launching gpt2 ddp with infini_run:
image

@kilinchange kilinchange force-pushed the feature/distributed-launch branch from 1b5edc1 to f157739 Compare October 16, 2025 16:47
@kilinchange kilinchange changed the title from "[WIP] introduce infini_run and ProcessGroup" to "introduce infini_run and ProcessGroup" Oct 16, 2025
@kilinchange kilinchange changed the title from "introduce infini_run and ProcessGroup" to "feat: introduce infini_run and ProcessGroup" Oct 16, 2025
@kilinchange kilinchange force-pushed the feature/distributed-launch branch from a2d90ba to 13865dc Compare October 16, 2025 17:34