rpc : do not wait for response when sending RPC_CMD_SET_TENSOR #12943
@steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.
Quick test looks like a 2.5% to 5% boost, definitely noticeable and consistent on a 1 Gb/s local LAN:
Llama 4 Scout 108B Q2_K_M, NGL 40/49, 3x 4070 (2 RPC), cuda backend, Llama 3.2 1b spec
QwQ 32B IQ4_XS, NGL 65/65, 2x 4070 (1 RPC), cuda backend, DS R1 1.5B spec
DS R1 32B IQ4_XS, NGL 65/65, 2x 4070 (1 RPC), cuda backend, DS R1 1.5B spec
PP = prompt processing
Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it. As this would be yet another breaking change for the RPC protocol, I am going to add |
Nice. Any speedup appreciated!
RPC_CMD_SET_TENSOR always returns an empty response, and we send this command 4 times per token. We can improve TG (token generation) speed if we don't wait for this empty response. The performance impact of this change depends on the network latency.
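To illustrate the idea, here is a minimal sketch of a client that can send a command with or without waiting for a reply. It is not the actual ggml RPC code: `FakeTransport`, `send_cmd`, `upload_inputs_for_one_token`, the command values, and the framing (1-byte command, 8-byte size, payload) are all assumptions made for the example, and the socket is replaced by an in-memory buffer so the snippet stays self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory stand-in for a socket. It records the bytes written
// and counts how many times the caller blocked waiting for a reply.
struct FakeTransport {
    std::vector<uint8_t> sent;
    int blocking_reads = 0;

    void write(const void * data, size_t n) {
        const uint8_t * p = static_cast<const uint8_t *>(data);
        sent.insert(sent.end(), p, p + n);
    }

    // A real client would block here for roughly one network round trip.
    void wait_for_reply() { blocking_reads++; }
};

// Illustrative command IDs, not the real protocol values.
enum Cmd : uint8_t { CMD_SET_TENSOR = 1, CMD_GET_TENSOR = 2 };

// Send one command using the assumed framing: command byte, payload size, payload.
// The caller blocks only when expect_reply is true.
void send_cmd(FakeTransport & t, Cmd cmd, const std::vector<uint8_t> & payload, bool expect_reply) {
    const uint8_t  c    = cmd;
    const uint64_t size = payload.size();
    t.write(&c, sizeof(c));
    t.write(&size, sizeof(size));
    t.write(payload.data(), payload.size());
    if (expect_reply) {
        t.wait_for_reply();
    }
}

// Upload four tensors, mirroring the "4 times per token" figure above.
// Returns how many blocking waits the upload cost.
int upload_inputs_for_one_token(FakeTransport & t, bool wait_on_set_tensor) {
    const std::vector<uint8_t> tensor_bytes(16, 0); // dummy tensor data
    for (int i = 0; i < 4; i++) {
        send_cmd(t, CMD_SET_TENSOR, tensor_bytes, wait_on_set_tensor);
    }
    return t.blocking_reads;
}
```

In this sketch the bytes on the wire are identical in both modes; the only difference is that the fire-and-forget mode skips four blocking waits per token. The server must also stop sending the empty reply, which is why both sides need to be rebuilt.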
I did some performance testing in four configurations: RPC v1.0.0 over LAN, RPC v1.0.0 over WiFi, RPC v2.0.0 over LAN, and RPC v2.0.0 over WiFi.
There is a 1.04x TG speedup for low-latency connections (which is consistent with @steampunque's results) and a 1.24x TG speedup for higher-latency connections such as WiFi.
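A rough way to see why the gain grows with latency is a back-of-the-envelope model in which each removed wait costs about one network round trip per token. The function below is only a sketch under that assumption; `tg_speedup` and its parameters are hypothetical names, and any values passed to it are illustrative rather than measurements from this thread.

```cpp
// Simplified model: before the change, each token pays its base time plus one
// round trip for every SET_TENSOR reply it waits on; after the change it pays
// only the base time. Overlap between network and compute is ignored.
double tg_speedup(double base_ms_per_token, double rtt_ms, int waits_removed_per_token) {
    const double before = base_ms_per_token + waits_removed_per_token * rtt_ms;
    const double after  = base_ms_per_token;
    return before / after;
}
```

Under this model the speedup is 1 + (waits removed × round-trip time) / base time, so for a fixed model and hardware the benefit scales linearly with the round-trip time, which matches the direction of the LAN versus WiFi results.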