What is the communication protocol for two GPUs within the same node? #90
Thank you for your response. Unfortunately, I did not find any error messages in the Transfer Engine. I ran the sample proxy.py file; normally, when a request succeeds, the proxy_server returns a 200 status, but when the service hangs, each request shows the following:

Additionally, I would like to confirm something: I have two 4090D GPUs on the same node, connected via PCIe, so they can communicate locally within the machine. Currently, when using Mooncake + vLLM, is TCP the only option? For intra-node communication, do we need to wait for the next update?
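As a diagnostic sketch (my own, not from the Mooncake repo), a probe with an explicit timeout can tell a hung service apart from a slow 200; the URL, port, and payload below are assumed placeholders for the proxy_server:

```python
# Hypothetical probe: the URL, port, and model name are placeholders,
# not Mooncake defaults -- adjust them to your proxy_server setup.
import requests

URL = "http://localhost:8000/v1/completions"
payload = {"model": "your-model", "prompt": "hello", "max_tokens": 16}

for i in range(100):
    try:
        r = requests.post(URL, json=payload, timeout=30)
        print(i, r.status_code)  # 200 means the request completed
    except requests.exceptions.Timeout:
        print(i, "timed out -- service appears hung")
        break
```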
#77 is a PR that adds support for intra-node communication via shared memory. We will merge it into main after it has been tested sufficiently.
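For readers unfamiliar with what shared-memory intra-node transfer buys you, here is an illustrative sketch (not Mooncake's implementation): on a single node, torch.multiprocessing hands CUDA tensors between processes via CUDA IPC handles instead of copying the data over a socket.

```python
# Illustration of intra-node GPU memory sharing via CUDA IPC; not Mooncake code.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()  # arrives as an IPC handle to the same device memory, no socket copy
    print(t.device, t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    src = torch.ones(1024, device="cuda:0")
    q.put(src)  # shares the underlying CUDA allocation with the consumer
    p.join()    # keep src alive until the consumer is done with it
```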
In the provided example, it is mentioned that communication can be handled over TCP or RDMA. In my case, however, the two GPUs reside in the same node, where RDMA is typically unnecessary. To achieve optimal performance in this scenario, should I still use TCP, or is there a better communication method to configure?
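For reference, my understanding is that the transfer protocol is selected in the JSON config pointed to by MOONCAKE_CONFIG_PATH; the field names and addresses below are assumptions from my reading of the integration docs and may differ across versions, but the intent is the "protocol" field set to "tcp":

```json
{
    "prefill_url": "127.0.0.1:13003",
    "decode_url": "127.0.0.1:13004",
    "metadata_server": "127.0.0.1:2379",
    "protocol": "tcp",
    "device_name": ""
}
```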
Additionally, I have another issue. My communication protocol is TCP. After deploying Mooncake with vLLM for disaggregated prefill and decode, I ran a stress test. It worked correctly at first, but after a very short time the entire service hung. The decode side reported the following metrics:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
However, with the same configuration on vLLM alone (without Mooncake's disaggregation), this problem does not occur. Could the issue be caused by my use of TCP for communication?
My environment:
vllm: 0.6.6.post1
gcc: 11.4.0
cmake: 3.29.0
GPU: 2 × 4090D (same node)
model size: 7B
Thank you very much for your open-source contribution. I understand you are very busy, but I would be extremely grateful if you could provide some insights into this issue. Thank you!