
What is the communication protocol for two GPUs within the same node? #90

Open
Rookie-Kai opened this issue Jan 22, 2025 · 3 comments

@Rookie-Kai

The provided example mentions that communication can be handled using TCP or RDMA. In my case, however, the two GPUs reside within the same node, where RDMA is typically not necessary. To achieve optimal performance in this scenario, should I still use TCP, or is there another communication method I should configure for the best results?
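
For context, the transport is selected in Mooncake's transfer configuration rather than in vLLM itself. The sketch below shows one way to write such a config for a single-node TCP setup; the field names follow the mooncake.json example in the integration guide and are assumptions here, so they may differ across versions.

```python
import json

# Illustrative mooncake.json for a single-node prefill/decode deployment.
# Field names mirror the example in the Mooncake vLLM integration guide;
# treat them as assumptions -- they may differ in your version.
config = {
    "prefill_url": "127.0.0.1:13003",     # prefill vLLM instance (same host)
    "decode_url": "127.0.0.1:13004",      # decode vLLM instance (same host)
    "metadata_server": "127.0.0.1:2379",  # etcd used by the Transfer Engine
    "protocol": "tcp",                    # "rdma" would require an RDMA-capable NIC
    "device_name": "",                    # RDMA device name; left empty for TCP
}

with open("mooncake.json", "w") as f:
    json.dump(config, f, indent=4)
```

Switching protocol to "rdma" (and filling in device_name) selects RDMA instead; for two GPUs on one node without an RDMA NIC, TCP is the option discussed in the replies below.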

Additionally, I have another issue. My communication protocol is TCP. After deploying Mooncake with vLLM for disaggregated prefill and decode, I ran a stress test. It initially functioned correctly, but after a very short period the entire service hung. The decode side displayed the following metrics:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

However, when using the same configuration with only vLLM (without Mooncake's separation), this problem does not occur. Could this issue be caused by my use of TCP for communication?

My environment:
vllm: 0.6.6.post1
gcc: 11.4.0
cmake: 3.29.0
GPU: 4090D × 2 (same node)
model size: 7B

Thank you very much for your open-source contribution. I understand you are very busy, but I would be extremely grateful if you could provide some insights into this issue. Thank you!

@alogfans
Collaborator

  1. Currently, Mooncake supports copying data with cudaMalloc/malloc if the destination is in the same process (see the sketch after this list for what such a same-process copy looks like). We are investigating how to optimize transfers between two processes on the same machine.

  2. The problem seems to be caused by an error in the TCP transport. Did you find any error messages from the Transfer Engine?
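
As a rough illustration of point 1, a same-process copy between two GPUs on one node is just a device-to-device copy (a cudaMemcpy over PCIe, or peer-to-peer where the driver allows it) and involves no TCP or RDMA transport. The sketch below uses PyTorch and is not Mooncake's internal code path.

```python
import torch

# Same-process GPU-to-GPU copy on one node: a device-to-device cudaMemcpy
# under the hood (PCIe, or peer-to-peer if enabled). Illustration only;
# this is not Mooncake's internal transfer path.
src = torch.randn(1024, 1024, device="cuda:0")  # e.g. a KV-cache block on GPU 0
dst = torch.empty_like(src, device="cuda:1")    # destination buffer on GPU 1

dst.copy_(src, non_blocking=True)  # async D2D copy; no network transport involved
torch.cuda.synchronize(device=1)   # wait for the copy to finish

assert torch.equal(src.cpu(), dst.cpu())
```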

@Rookie-Kai
Author

Rookie-Kai commented Jan 23, 2025

  1. Currently, Mooncake supports copying data with cudaMalloc/malloc if the destination is in the same process. We are investigating how to optimize transfers between two processes on the same machine.
  2. The problem seems to be caused by an error in the TCP transport. Did you find any error messages from the Transfer Engine?

Thank you for your response. Unfortunately, I did not find any error messages in the Transfer Engine. I ran the sample proxy.py file, and normally, when a request is successful, the proxy_server returns a 200 status. However, when the service hangs, each request shows the following --

Additionally, I would like to confirm something. I have two 4090D GPUs on the same node, connected via PCIe, and they can communicate over the local network. Currently, when using Mooncake + vLLM, is TCP the only option? For intra-node communication, do we need to wait for a future update?

@stmatengss
Collaborator


#77 is a PR trying to support intra-node communication with shared memory. We will merge it into main after sufficient testing.
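
For readers following along, the general idea behind a shared-memory path is that two processes on the same machine can map the same segment and exchange data without touching the network. A minimal standard-library sketch of that idea follows; it is not the implementation in the PR above, just the technique.

```python
from multiprocessing import Process, shared_memory

import numpy as np

# Generic shared-memory handoff between two processes on one node:
# the producer writes into a named segment, the consumer maps the same
# segment and reads it with no TCP/RDMA copy involved.

def producer(name: str, nbytes: int) -> None:
    shm = shared_memory.SharedMemory(name=name)
    buf = np.ndarray((nbytes,), dtype=np.uint8, buffer=shm.buf)
    buf[:] = 42  # stand-in for serialized KV-cache bytes
    shm.close()

def consumer(name: str, nbytes: int) -> None:
    shm = shared_memory.SharedMemory(name=name)
    buf = np.ndarray((nbytes,), dtype=np.uint8, buffer=shm.buf)
    assert buf[0] == 42  # data is visible without a network hop
    shm.close()

if __name__ == "__main__":
    nbytes = 1 << 20
    segment = shared_memory.SharedMemory(create=True, size=nbytes)
    try:
        for worker in (producer, consumer):
            p = Process(target=worker, args=(segment.name, nbytes))
            p.start()
            p.join()
    finally:
        segment.close()
        segment.unlink()
```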
