To address this challenge, we have developed **a novel weight-loading framework: Tensor R-Fork.**
The core concept of Tensor R-Fork is to **leverage GPU-Direct RDMA for constructing a peer-to-peer (P2P) weight storage architecture.**
The performance of data transfer using the traditional method is low, because there is always a bottleneck somewhere along the path whose bandwidth is much smaller than that of InfiniBand.
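To put rough numbers on this claim (illustrative link speeds and model size, not measurements from this post), compare a direct InfiniBand path against one capped by a host-side PCIe hop:

```python
# Back-of-the-envelope comparison with assumed, illustrative link speeds:
# an NDR InfiniBand link carries 400 Gb/s; a PCIe 4.0 x16 hop tops out near 32 GB/s.
ib_bw_gBps = 400 / 8     # ~50 GB/s per InfiniBand NIC
pcie_bw_gBps = 32        # ~32 GB/s for a PCIe 4.0 x16 hop
weights_gB = 140         # e.g. a 70B-parameter model in bf16 (2 bytes/param)

print(f"direct GPU-to-GPU RDMA: {weights_gB / ib_bw_gBps:.1f} s")    # ~2.8 s
print(f"path capped by PCIe   : {weights_gB / pcie_bw_gBps:.1f} s")  # ~4.4 s
```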
From the data flow analysis, we observe that weight tensors are stored on each GPU and can be transmitted directly between nodes via GPU-Direct RDMA.
To maximize the utilization of the InfiniBand NIC's bandwidth, we design a per-GPU-pair data transfer strategy: each local GPU transfers data directly to/from its paired remote GPU. This design effectively bypasses the PCIe bottleneck between GPU and CPU, enabling high-throughput communication without relying on the CPU or host memory.
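A minimal sketch of this pairing, written against `torch.distributed` point-to-point calls with an NCCL backend (the rank layout, tensor list, and function name below are illustrative assumptions, not the actual Tensor R-Fork API):

```python
import torch
import torch.distributed as dist

def transfer_weights_pairwise(weights: list[torch.Tensor], src_world: int):
    """Each GPU talks only to its paired GPU on the other instance, so every
    InfiniBand NIC carries its own stream. Assumes ranks 0..src_world-1 form
    the source instance and ranks src_world..2*src_world-1 the destination,
    with a group already set up via dist.init_process_group(backend="nccl")."""
    rank = dist.get_rank()
    for t in weights:                          # tensors must live on this rank's GPU
        if rank < src_world:
            dist.send(t, dst=rank + src_world)     # source: push shard to its peer
        else:
            dist.recv(t, src=rank - src_world)     # destination: receive in place
```

Because each rank pins exactly one peer, aggregate bandwidth scales with the number of GPU/NIC pairs instead of being serialized through a single host-side hop.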
When initializing the destination instance:
| Deployment Complexity | ✅ No additional dependency. | ❌ The additional library `mooncake` is needed. |
| Overhead of Transfer Setup | ✅ Building communication groups takes hundreds of milliseconds. | ➖ Registering memory regions to the RDMA channel may take several seconds, but this can be overlapped with other initialization phases (sketched below). |
| Non-disturbing to GPU workload | ❌ Tensor transfer launches CUDA kernels. | ✅ No CUDA kernels are launched for transferring weights. |
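The ➖ entry above suggests hiding the one-time registration cost behind other startup work. Here is a minimal sketch of that overlap, where `engine.register_mr(buffer)` is a hypothetical stand-in for the real TransferEngine registration call:

```python
from concurrent.futures import ThreadPoolExecutor

def start_destination_instance(engine, weight_buffers, init_runtime):
    """Kick off the slow RDMA memory registration in the background and run
    the remaining initialization phases (model construction, scheduler, ...)
    in the meantime. `engine.register_mr` is a hypothetical stand-in."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        reg = pool.submit(lambda: [engine.register_mr(b) for b in weight_buffers])
        init_runtime()   # other initialization overlaps the registration
        reg.result()     # block only if registration is still in flight
```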
## How to Use
Known limitation in the current TransferEngine implementation:
* **Memory registration (`register_mr`) is slow**: <u>This is due to the RDMA driver</u>. If you have any insights or solutions to this issue, we would be truly grateful to hear from you. We value diverse perspectives and are keen to explore innovative approaches together.