[Miles] Update P2P RDMA weight sync with flashinfer compatibility and NIC affinity #964
wduan-hai wants to merge 1 commit into radixark:main
Code Review
This pull request enhances the P2P weight transfer mechanism by introducing GPU-affine InfiniBand NIC binding, increasing the transfer timeout, and improving CPU-side model loading to prevent OOM issues. It also adds necessary weight post-processing steps and robust parameter mapping. Regarding the review feedback, I recommend verifying the idempotency of post_process_weights to ensure that calling it in both _pause_and_prepare_engines and _finalize_and_resume_engines is safe, and considering whether silent skipping of unmapped parameters in _get_transfer_ready_params should be replaced with a warning or error to prevent silent model state inconsistencies.
    if dist.get_rank() == 0:
        post_process_weights(
            rollout_engines=self.rollout_engines,
            restore_weights_before_load=True,
        )
post_process_weights is called in _pause_and_prepare_engines and again in _finalize_and_resume_engines. That is fine if post_process_weights is idempotent, but if it performs stateful operations, the second call could lead to unexpected behavior. Please confirm that it is safe to call multiple times, or consider whether the call in _pause_and_prepare_engines is needed given the one in _finalize_and_resume_engines.
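One way to check the idempotency concern is a small property test: apply the operation once, apply it again, and compare the resulting state. The sketch below is illustrative only. The helper is_idempotent and the two stand-in functions are hypothetical and do not reflect the real post_process_weights, whose behavior is not shown in this PR excerpt.

```python
import copy


def is_idempotent(fn, state):
    """Return True if applying fn twice gives the same state as applying it once."""
    once = copy.deepcopy(state)
    fn(once)
    twice = copy.deepcopy(once)
    fn(twice)
    return once == twice


# Hypothetical stand-ins for a weight post-processing step (assumptions, not the real API).
def restore_layout(state):
    # Idempotent: always sets the same fixed value.
    state["layout"] = "original"


def rescale(state):
    # Not idempotent: each call changes the state further.
    state["scale"] *= 2


example_state = {"layout": "packed", "scale": 1.0}
restore_is_safe = is_idempotent(restore_layout, example_state)
rescale_is_safe = is_idempotent(rescale, example_state)
```

If the real function turns out to behave like rescale here, calling it in both the pause and the finalize path would need a guard or one of the calls removed.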
    if mapped_result is None:
        continue
The added check, if mapped_result is None: continue, handles parameters whose mapping fails. Make sure that skipping them silently cannot leave weight updates incomplete or the model state inconsistent. If a parameter is expected to be mapped, logging a warning or raising an error would be safer.
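A minimal sketch of the alternative suggested above, assuming a loop shaped like the one in _get_transfer_ready_params. The function collect_mapped_params, its strict flag, and the map_name callable are hypothetical names introduced for illustration; they are not part of this PR.

```python
import logging

logger = logging.getLogger(__name__)


def collect_mapped_params(named_params, map_name, strict=False):
    """Map parameter names, recording (not silently dropping) the ones that fail.

    named_params: iterable of (name, tensor) pairs.
    map_name: callable returning the mapped name, or None if no mapping exists.
    strict: if True, raise on the first unmapped parameter instead of warning.
    """
    mapped, skipped = [], []
    for name, tensor in named_params:
        mapped_result = map_name(name)
        if mapped_result is None:
            if strict:
                raise KeyError(f"No mapping for parameter {name!r}")
            logger.warning("No mapping for parameter %s; skipping", name)
            skipped.append(name)
            continue
        mapped.append((mapped_result, tensor))
    return mapped, skipped


# Usage with a toy mapping table (placeholder names and values).
table = {"layer.0.weight": "model.layers.0.weight"}
mapped, skipped = collect_mapped_params(
    [("layer.0.weight", 1), ("layer.0.extra", 2)], table.get
)
```

Returning the skipped names lets the caller decide whether an incomplete update is acceptable, rather than hiding it inside the loop.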
You are making a lot of changes at once.
All my previous experiments were run on H100, so I do expect issues to come up for GB.
Update P2P RDMA weight sync with flashinfer compatibility and NIC affinity.
Tested on 4 nodes with Qwen3-30B-A3B, 16 nodes with Qwen3-235B-A22B over RoCE.
This depends on sgl-project/sglang#22468