feat: introduce infini_run and ProcessGroup #80
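Implements the process launcher infini_run and the communication-group abstraction ProcessGroup, providing the foundation for multi-node distributed training and multi-dimensional parallel training.
Process launcher infini_run: similar to PyTorch's torchrun, it launches multi-process training jobs and configures their environment variables. Command-line arguments specify the number of nodes, the number of processes per node, the node rank, the rendezvous endpoint, and related settings. For example: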
./infini_run --nnodes=1 --nproc_per_node=1 --node_rank=0 --rdzv_endpoint=127.0.0.1:29500 -- ./llama3 --arg1 .. --arg2 ...
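To illustrate what "configuring environment variables" can mean for a torchrun-style launcher, here is a minimal Python sketch that derives per-process settings from the four flags in the command above. It is not the infini_run implementation: the function name `build_worker_envs` is hypothetical, and the variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) follow torchrun's convention, which this PR description does not confirm for infini_run.

```python
from typing import Dict, List


def build_worker_envs(nnodes: int, nproc_per_node: int, node_rank: int,
                      rdzv_endpoint: str) -> List[Dict[str, str]]:
    """Return one environment mapping per local process on this node.

    Hypothetical helper: variable names mirror torchrun's convention and
    are an assumption, not infini_run's documented behavior.
    """
    # Split "host:port" on the last colon.
    host, _, port = rdzv_endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {rdzv_endpoint!r}")
    if nnodes < 1 or nproc_per_node < 1:
        raise ValueError("nnodes and nproc_per_node must be positive")
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")

    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            # Global rank: processes are numbered node by node.
            "RANK": str(node_rank * nproc_per_node + local_rank),
            "LOCAL_RANK": str(local_rank),
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": host,
            "MASTER_PORT": port,
        })
    return envs
```

With the flags from the example (`--nnodes=1 --nproc_per_node=1 --node_rank=0`), this sketch yields a single process with global rank 0 in a world of size 1. A launcher would then start the training binary once per mapping, with that mapping merged into the child's environment.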
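Communication backend abstraction ProcessGroup: a new basic component for distributed communication that provides a unified wrapper around the initialization and management of the NCCL communication backend.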
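The role of a communication group in multi-dimensional parallelism can be sketched in Python as follows. This is a conceptual illustration only, not the ProcessGroup added by this PR: the names `ProcessGroupSketch` and `build_groups` are hypothetical, the rank layout (consecutive ranks form a tensor-parallel group) is an assumed convention, and no NCCL calls are made. In a real backend, each group would own a communicator created with NCCL functions such as `ncclGetUniqueId` and `ncclCommInitRank`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProcessGroupSketch:
    """Hypothetical stand-in for a communication group (no NCCL involved)."""
    name: str
    ranks: Tuple[int, ...]  # global ranks that belong to this group

    def size(self) -> int:
        return len(self.ranks)

    def group_rank(self, global_rank: int) -> int:
        # Position of a global rank inside this group; raises ValueError
        # if the rank is not a member.
        return self.ranks.index(global_rank)


def build_groups(world_size: int, tp_size: int):
    """Split the world into tensor-parallel and data-parallel groups.

    Assumed layout: consecutive ranks share a tensor-parallel group, and
    ranks with the same offset inside their tensor-parallel group share a
    data-parallel group.
    """
    if tp_size < 1 or world_size % tp_size != 0:
        raise ValueError("world_size must be a positive multiple of tp_size")
    dp_size = world_size // tp_size
    tp_groups: List[ProcessGroupSketch] = [
        ProcessGroupSketch(f"tp{i}", tuple(range(i * tp_size, (i + 1) * tp_size)))
        for i in range(dp_size)
    ]
    dp_groups: List[ProcessGroupSketch] = [
        ProcessGroupSketch(f"dp{j}", tuple(range(j, world_size, tp_size)))
        for j in range(tp_size)
    ]
    return tp_groups, dp_groups
```

For example, `build_groups(8, 2)` produces four tensor-parallel groups of two ranks each and two data-parallel groups of four ranks each, so every rank belongs to exactly one group along each dimension.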