What is the communication protocol for two GPUs within the same node? #90
Thank you for your response. Unfortunately, I did not find any error messages in the Transfer Engine. I ran the sample proxy.py file; normally, when a request succeeds, the proxy_server returns a 200 status, but when the service hangs, each request shows the following:

Additionally, I would like to confirm something: I have two 4090D GPUs on the same node, connected via PCIe, so they can communicate locally within the machine. Currently, when using Mooncake + vLLM, is TCP the only option? For intra-node communication, do we need to wait for the next update?
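As a diagnostic sketch (my own, not from the Mooncake repo), a probe with an explicit timeout can tell a hung service apart from a slow 200; the URL, port, and payload below are assumed placeholders for the proxy_server:

```python
# Hypothetical probe: the URL, port, and model name are placeholders,
# not Mooncake defaults -- adjust them to your proxy_server setup.
import requests

URL = "http://localhost:8000/v1/completions"
payload = {"model": "your-model", "prompt": "hello", "max_tokens": 16}

for i in range(100):
    try:
        r = requests.post(URL, json=payload, timeout=30)
        print(i, r.status_code)  # 200 means the request completed
    except requests.exceptions.Timeout:
        print(i, "timed out -- service appears hung")
        break
```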
#77 is a PR that adds support for intra-node communication via shared memory. We will merge it into main after it has been tested sufficiently.
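For readers unfamiliar with what shared-memory intra-node transfer buys you, here is an illustrative sketch (not Mooncake's implementation): on a single node, torch.multiprocessing hands CUDA tensors between processes via CUDA IPC handles instead of copying the data over a socket.

```python
# Illustration of intra-node GPU memory sharing via CUDA IPC; not Mooncake code.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()  # arrives as an IPC handle to the same device memory, no socket copy
    print(t.device, t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    src = torch.ones(1024, device="cuda:0")
    q.put(src)  # shares the underlying CUDA allocation with the consumer
    p.join()    # keep src alive until the consumer is done with it
```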
In the provided example, it is mentioned that communication can be handled over TCP or RDMA. In my case, however, the two GPUs reside in the same node, where RDMA is typically unnecessary. To achieve optimal performance in this scenario, should I still use TCP, or is there a better communication method to configure?
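For reference, my understanding is that the transfer protocol is selected in the JSON config pointed to by MOONCAKE_CONFIG_PATH; the field names and addresses below are assumptions from my reading of the integration docs and may differ across versions, but the intent is the "protocol" field set to "tcp":

```json
{
    "prefill_url": "127.0.0.1:13003",
    "decode_url": "127.0.0.1:13004",
    "metadata_server": "127.0.0.1:2379",
    "protocol": "tcp",
    "device_name": ""
}
```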
Additionally, I have another issue. My communication protocol is TCP. After deploying Mooncake with vLLM for disaggregated prefill and decode, I ran a stress test. It worked correctly at first, but after a very short time the entire service hung. The decode side reported the following metrics:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
However, with the same configuration on vLLM alone (without Mooncake's disaggregation), this problem does not occur. Could the issue be caused by my use of TCP for communication?
My environment:
vllm: 0.6.6.post1
gcc: 11.4.0
cmake: 3.29.0
GPU: 2 × 4090D (same node)
model size: 7B
Thank you very much for your open-source contribution. I understand you are very busy, but I would be extremely grateful if you could provide some insights into this issue. Thank you!