
Commit c767f43

Revise blog on Tensor R-Fork typos
1 parent cb05445 commit c767f43

1 file changed (+4 −3 lines changed)

blog/2025-12-03-rfork.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -39,7 +39,7 @@ To address this challenge, we have developed **a novel weight-loading framework
 
 The core concept of Tensor R-Fork is to **leverage GPU-Direct RDMA for constructing a peer-to-peer (P2P) weight storage architecture.**
 
-The performance of data transfer using tranditional method is low, because there is always bottleneck in the entire path, whose bandwidth is much smaller than InfiniBand.
+The performance of data transfer using traditional method is low, because there is always bottleneck in the entire path, whose bandwidth is much smaller than InfiniBand.
 From the data flow analysis, we observe that weight tensors are stored on each GPU and can be transmitted directly between nodes via GPU-direct RDMA.
 
 To maximize the utilization of InfiniBand NIC's bandwidth, we design a per GPU-pair data transfer strategy: a local GPU directly transfers data to/from its paired remote GPU. This design effectively bypasses the PCIe bottleneck between GPU and CPU, enabling high-throughput communication without relying on CPU or host memory.
```
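
For context, here is a minimal sketch of the per GPU-pair strategy this hunk describes, assuming one worker process per GPU. `TensorMeta` and `post_rdma_write` are hypothetical placeholders for illustration, not the actual Tensor R-Fork or TransferEngine API:

```python
# Illustrative sketch only: local GPU i is paired with remote GPU i, and each
# pair gets its own direct GPU-Direct RDMA path, so no transfer crosses the
# CPU/host-memory PCIe bottleneck.
from dataclasses import dataclass

@dataclass
class TensorMeta:
    name: str
    local_addr: int   # device pointer on the local GPU
    remote_addr: int  # registered device pointer on the paired remote GPU
    nbytes: int

def post_rdma_write(remote_host: str, gpu_id: int, meta: TensorMeta) -> None:
    # Hypothetical stand-in: a real implementation posts an RDMA WRITE on the
    # NIC serving `gpu_id`; no CUDA kernel runs on either side.
    print(f"GPU{gpu_id} -> {remote_host}:GPU{gpu_id} {meta.name}: {meta.nbytes} bytes")

def push_weights(remote_host: str, gpu_id: int, tensors: list[TensorMeta]) -> None:
    # One worker per GPU pair; pairs run in parallel across the cluster, so
    # aggregate bandwidth scales with the number of InfiniBand NICs.
    for meta in tensors:
        post_rdma_write(remote_host, gpu_id, meta)
```

Pairing GPU *i* with remote GPU *i* keeps every byte on the GPU–NIC path, which is exactly the PCIe bypass the paragraph describes.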
```diff
@@ -85,8 +85,8 @@ When initializing the destination instance:
 | | NCCL | TransferEngine |
 |----------------------|--------------------------------|----------------|
 | Deployment Complexity| ✅ No additional dependency. |❌ Additional library `mooncake` is needed. |
-|Overhead of Transfer Setup | ✅ Building communication groups takes hundreds of miliseconds | ➖ Registering memory regions to RDMA channel may take several seconds, but can be overlapped with other initialization phases.|
-|Non-disturbing to GPU workload | ❌ Tensor transfer will launch CUDA kernels. | ✅ No CUDA kernels launched for transfering weights. |
+|Overhead of Transfer Setup | ✅ Building communication groups takes hundreds of milliseconds | ➖ Registering memory regions to RDMA channel may take several seconds, but can be overlapped with other initialization phases.|
+|Non-disturbing to GPU workload | ❌ Tensor transfer will launch CUDA kernels. | ✅ No CUDA kernels launched for transferring weights. |
 
 ## How to Use
 
```
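
For reference, the NCCL column in this table corresponds to the usual `torch.distributed` pattern sketched below: the process group is built up front (the hundreds-of-milliseconds setup cost), and every broadcast launches CUDA kernels on the participating GPUs. The function name and `src_rank` default are illustrative:

```python
import torch
import torch.distributed as dist

def nccl_pull_weights(weights: list[torch.Tensor], src_rank: int = 0) -> None:
    # Building the communication group: the setup cost cited in the table
    # (rank and world size are read from the environment).
    dist.init_process_group(backend="nccl")
    try:
        for w in weights:
            # Each broadcast launches CUDA kernels on sender and receivers,
            # which can disturb a co-located GPU workload.
            dist.broadcast(w, src=src_rank)
    finally:
        dist.destroy_process_group()
```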

```diff
@@ -164,3 +164,4 @@ Key advantages provided by TransferEngine[5]:
 Known limitation in the current TransferEngine implementation:
 * **Memory registration (register_mr) is slow**: <u>This is due to the RDMA driver</u>. If you have any insights or solutions to this issue, we would be truly grateful to hear from you. We value diverse perspectives and are keen to explore innovative approaches together.
 
+
```
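
On the `register_mr` limitation, the comparison table earlier in the post notes that registration can be overlapped with other initialization phases. Below is a minimal sketch of that mitigation, with `register_all_weights` and `load_engine_config` as hypothetical placeholders:

```python
import threading

def register_all_weights() -> None:
    # Hypothetical: register each weight buffer with the RDMA NIC
    # (the slow register_mr path described above).
    ...

def load_engine_config() -> None:
    # Hypothetical: initialization work that does not need the memory
    # regions yet (config parsing, tokenizer loading, etc.).
    ...

def initialize_destination_instance() -> None:
    reg = threading.Thread(target=register_all_weights)
    reg.start()            # start the slow registration early
    load_engine_config()   # overlap: do unrelated init meanwhile
    reg.join()             # regions must be ready before any transfer
```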
