
[Miles] Update P2P RDMA weight sync with flashinfer compatibility and NIC affinity#964

Open
wduan-hai wants to merge 1 commit into radixark:main from wduan-hai:wduan/p2p-flashinfer-restore

Conversation


@wduan-hai wduan-hai commented Apr 9, 2026

Update P2P RDMA weight sync with flashinfer compatibility and NIC affinity.

Tested on 4 nodes with Qwen3-30B-A3B and on 16 nodes with Qwen3-235B-A22B over RoCE.

This depends on sgl-project/sglang#22468

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the P2P weight transfer mechanism by introducing GPU-affine InfiniBand NIC binding, increasing the transfer timeout, and improving CPU-side model loading to prevent OOM issues. It also adds necessary weight post-processing steps and robust parameter mapping. Regarding the review feedback, I recommend verifying the idempotency of post_process_weights to ensure that calling it in both _pause_and_prepare_engines and _finalize_and_resume_engines is safe, and considering whether silent skipping of unmapped parameters in _get_transfer_ready_params should be replaced with a warning or error to prevent silent model state inconsistencies.
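The GPU-affine NIC binding mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `pick_nic_for_gpu` and its locality rule are assumptions, and a real implementation would derive the GPU-to-NIC mapping from the node's PCIe topology (e.g. via `nvidia-smi topo -m` or hwloc) rather than from a fixed index formula.

```python
def pick_nic_for_gpu(local_gpu_rank: int, nics: list[str], gpus_per_node: int = 8) -> str:
    """Map a local GPU rank to an RDMA NIC by a locality-preserving index.

    Hypothetical helper: assumes NICs are enumerated in the same PCIe order
    as the GPUs, so neighboring GPUs share the NIC closest to them.
    """
    if not nics:
        raise RuntimeError("no RDMA NICs discovered on this node")
    # e.g. 8 GPUs and 4 NICs -> GPUs 0,1 use mlx5_0, GPUs 2,3 use mlx5_1, ...
    gpus_per_nic = max(1, gpus_per_node // len(nics))
    return nics[(local_gpu_rank // gpus_per_nic) % len(nics)]
```

In practice the chosen device would be exported per rank (for example through `NCCL_IB_HCA` or the transfer backend's NIC option) so each GPU's RDMA traffic stays on its nearest PCIe switch.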

Comment on lines +121 to +125
if dist.get_rank() == 0:
    post_process_weights(
        rollout_engines=self.rollout_engines,
        restore_weights_before_load=True,
    )

medium

The post_process_weights call is performed inside _pause_and_prepare_engines and again in _finalize_and_resume_engines. If post_process_weights is idempotent, this is fine, but if it performs stateful operations, this could lead to unexpected behavior. Please ensure that the logic is safe to call multiple times or consider if the call in _pause_and_prepare_engines is necessary given the one in _finalize_and_resume_engines.
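One way to make the double call safe is to guard the restore step with a per-engine flag. This is a minimal sketch of the idempotency pattern being suggested, with hypothetical `RolloutEngine` and `post_process_weights` stand-ins, not the PR's actual implementation.

```python
class RolloutEngine:
    """Toy stand-in for a rollout engine that tracks restore state."""

    def __init__(self):
        self.weights_restored = False
        self.restore_count = 0

    def restore_weights(self):
        # Stand-in for the real (expensive, stateful) restore step.
        self.restore_count += 1
        self.weights_restored = True


def post_process_weights(rollout_engines, restore_weights_before_load=False):
    # Guarding on the flag makes repeated calls (pause path + resume path)
    # run the stateful restore at most once per sync cycle.
    for engine in rollout_engines:
        if restore_weights_before_load and not engine.weights_restored:
            engine.restore_weights()
```

With a guard like this, calling `post_process_weights` from both `_pause_and_prepare_engines` and `_finalize_and_resume_engines` becomes harmless; the flag would need to be reset at the start of each new weight-sync cycle.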

Comment on lines +335 to +336
if mapped_result is None:
    continue

medium

The check if mapped_result is None: continue is added to handle cases where mapping fails. Ensure that this silent skipping of parameters does not lead to incomplete weight updates or inconsistencies in the model state, as it might be safer to log a warning or raise an error if a parameter is expected to be mapped.
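The suggested warning could look roughly like this. The function and `map_param` callback are illustrative stand-ins for the PR's `_get_transfer_ready_params` and its mapping logic, not the actual code: the point is only that skipped names get collected and reported instead of vanishing silently.

```python
import logging

logger = logging.getLogger(__name__)


def get_transfer_ready_params(named_params, map_param):
    """Collect mapped params; warn (rather than silently drop) on failures."""
    ready, skipped = [], []
    for name, tensor in named_params:
        mapped = map_param(name, tensor)
        if mapped is None:
            skipped.append(name)
            continue
        ready.append(mapped)
    if skipped:
        # Surfacing the names makes an incomplete weight update debuggable;
        # raising here instead would be the stricter option.
        logger.warning("Skipped %d unmapped params: %s", len(skipped), skipped)
    return ready
```

Whether to warn or raise depends on whether unmapped parameters are ever expected; if every parameter should map, raising is the safer default.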

Contributor

JD-ETH commented Apr 10, 2026

You are making a lot of changes at once.

  • flashinfer double processing: can you verify this is still needed? We added a post-processing call now.
  • the _get_transfer_ready_params skip is unsafe
  • it's unclear to me why you are only calling some of the processing on rank 0
  • the GPU affinity looks reasonable, though

All my previous experiments were run on H100, so I do expect issues to come up for GB.
Can you separate the issues and maybe start by getting a config that works for GB?
