To address this challenge, we have developed **a novel weight-loading framework: Tensor R-Fork.**
The core concept of Tensor R-Fork is to **leverage GPU-Direct RDMA for constructing a peer-to-peer (P2P) weight storage architecture.**
The performance of data transfer using the traditional method is low, because there is always a bottleneck somewhere along the path whose bandwidth is much smaller than that of InfiniBand.
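To put rough numbers on this claim (illustrative link speeds and model size, not measurements from this post), compare a direct InfiniBand path against one capped by a host-side PCIe hop:

```python
# Back-of-the-envelope comparison with assumed, illustrative link speeds:
# an NDR InfiniBand link carries 400 Gb/s; a PCIe 4.0 x16 hop tops out near 32 GB/s.
ib_bw_gBps = 400 / 8     # ~50 GB/s per InfiniBand NIC
pcie_bw_gBps = 32        # ~32 GB/s for a PCIe 4.0 x16 hop
weights_gB = 140         # e.g. a 70B-parameter model in bf16 (2 bytes/param)

print(f"direct GPU-to-GPU RDMA: {weights_gB / ib_bw_gBps:.1f} s")    # ~2.8 s
print(f"path capped by PCIe   : {weights_gB / pcie_bw_gBps:.1f} s")  # ~4.4 s
```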
From the data flow analysis, we observe that weight tensors are stored on each GPU and can be transmitted directly between nodes via GPU-Direct RDMA.
To maximize the utilization of the InfiniBand NIC's bandwidth, we design a per-GPU-pair data transfer strategy: each local GPU transfers data directly to/from its paired remote GPU. This design effectively bypasses the PCIe bottleneck between GPU and CPU, enabling high-throughput communication without relying on the CPU or host memory.
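A minimal sketch of this pairing, written against `torch.distributed` point-to-point calls with an NCCL backend (the rank layout, tensor list, and function name below are illustrative assumptions, not the actual Tensor R-Fork API):

```python
import torch
import torch.distributed as dist

def transfer_weights_pairwise(weights: list[torch.Tensor], src_world: int):
    """Each GPU talks only to its paired GPU on the other instance, so every
    InfiniBand NIC carries its own stream. Assumes ranks 0..src_world-1 form
    the source instance and ranks src_world..2*src_world-1 the destination,
    with a group already set up via dist.init_process_group(backend="nccl")."""
    rank = dist.get_rank()
    for t in weights:                          # tensors must live on this rank's GPU
        if rank < src_world:
            dist.send(t, dst=rank + src_world)     # source: push shard to its peer
        else:
            dist.recv(t, src=rank - src_world)     # destination: receive in place
```

Because each rank pins exactly one peer, aggregate bandwidth scales with the number of GPU/NIC pairs instead of being serialized through a single host-side hop.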
When initializing the destination instance:
| Deployment Complexity | ✅ No additional dependency. | ❌ The additional library `mooncake` is needed. |
| Overhead of Transfer Setup | ✅ Building communication groups takes hundreds of milliseconds. | ➖ Registering memory regions to the RDMA channel may take several seconds, but this can be overlapped with other initialization phases (sketched below). |
| Non-disturbing to GPU workload | ❌ Tensor transfer launches CUDA kernels. | ✅ No CUDA kernels are launched for transferring weights. |
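The ➖ entry above suggests hiding the one-time registration cost behind other startup work. Here is a minimal sketch of that overlap, where `engine.register_mr(buffer)` is a hypothetical stand-in for the real TransferEngine registration call:

```python
from concurrent.futures import ThreadPoolExecutor

def start_destination_instance(engine, weight_buffers, init_runtime):
    """Kick off the slow RDMA memory registration in the background and run
    the remaining initialization phases (model construction, scheduler, ...)
    in the meantime. `engine.register_mr` is a hypothetical stand-in."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        reg = pool.submit(lambda: [engine.register_mr(b) for b in weight_buffers])
        init_runtime()   # other initialization overlaps the registration
        reg.result()     # block only if registration is still in flight
```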
## How to Use
Known limitation in the current TransferEngine implementation:
* **Memory registration (`register_mr`) is slow**: <u>This is due to the RDMA driver</u>. If you have any insights or solutions to this issue, we would be truly grateful to hear from you. We value diverse perspectives and are keen to explore innovative approaches together.