rpc : do not wait for response when sending RPC_CMD_SET_TENSOR #12943
| @steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients. | 
| 
 Quick test looks like a 2.5% to 5% boost, definitely noticeable and consistent on a 1 Gb/s local LAN:
 - Llama 4 Scout 108B Q2_K_M, NGL 40/49, 3x 4070 (2 RPC), CUDA backend, Llama 3.2 1B spec
 - QwQ 32B IQ4_XS, NGL 65/65, 2x 4070 (1 RPC), CUDA backend, DS R1 1.5B spec
 - DS R1 32B IQ4_XS, NGL 65/65, 2x 4070 (1 RPC), CUDA backend, DS R1 1.5B spec
 PP = prompt processing | 
| Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it. As this would be yet another breaking change for the RPC protocol, I am going to add  | 
| 
 Nice. Any speedup appreciated! | 
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency.
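To make the idea concrete, here is a minimal sketch of the difference between waiting and not waiting for the empty reply. It is not the actual ggml RPC code: `FakeLink`, `send_command`, `token_blocked_ms`, and the `"COMPUTE"` command are hypothetical names used only for illustration, and the link merely counts simulated blocking time instead of using a real socket. The only figure taken from the description above is that `SET_TENSOR` is sent 4 times per token.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLink:
    """Hypothetical in-memory link; it only counts simulated blocking time."""
    rtt_ms: float                 # assumed round-trip time of the connection
    blocked_ms: float = 0.0       # total time the client spent waiting
    sent: list = field(default_factory=list)

    def send(self, cmd: str) -> None:
        # Sending is modeled as free for the caller: the data is simply in flight.
        self.sent.append(cmd)

    def recv_reply(self) -> None:
        # Blocking until the reply arrives costs one full round trip.
        self.blocked_ms += self.rtt_ms


def send_command(link: FakeLink, cmd: str, wait_for_reply: bool = True) -> None:
    link.send(cmd)
    if wait_for_reply:
        link.recv_reply()


def token_blocked_ms(link: FakeLink, wait_on_set_tensor: bool) -> float:
    """Simulated blocking time for one generated token."""
    start = link.blocked_ms
    for _ in range(4):  # SET_TENSOR is sent 4 times per token
        send_command(link, "SET_TENSOR", wait_for_reply=wait_on_set_tensor)
    # Assumption: one further command per token still needs its reply.
    send_command(link, "COMPUTE")
    return link.blocked_ms - start


old_ms = token_blocked_ms(FakeLink(rtt_ms=1.0), wait_on_set_tensor=True)
new_ms = token_blocked_ms(FakeLink(rtt_ms=1.0), wait_on_set_tensor=False)
print(old_ms, new_ms)
```

In this model the client blocks for five round trips per token when it waits on every `SET_TENSOR`, and for one round trip when it does not. Skipping the wait is safe only if the transport delivers commands in order, which a single TCP connection does.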
| I did some performance testing with  RPC v1.0.0 over LAN
 RPC v1.0.0 over WiFi
 RPC v2.0.0 over LAN
 RPC v2.0.0 over WiFi
 There is a 1.04x TG speedup for low-latency connections (consistent with @steampunque's results) and a 1.24x TG speedup for higher-latency connections such as WiFi. | 
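One rough way to see why the gain grows with latency is a simple per-token time model: compute time plus one round trip for every reply the client waits on. The sketch below assumes that model; `tg_speedup` and all of its parameter values are hypothetical and are not derived from the measurements above.

```python
def tg_speedup(compute_ms: float, rtt_ms: float,
               waits_removed: int = 4, waits_kept: int = 1) -> float:
    """Estimated TG speedup from dropping `waits_removed` blocking replies per token.

    Assumes token time = compute time + (number of blocking replies) * RTT.
    """
    before = compute_ms + (waits_kept + waits_removed) * rtt_ms
    after = compute_ms + waits_kept * rtt_ms
    return before / after


# Hypothetical inputs: 50 ms of compute per token, two different round-trip times.
low_latency = tg_speedup(compute_ms=50.0, rtt_ms=0.5)
high_latency = tg_speedup(compute_ms=50.0, rtt_ms=5.0)
print(f"{low_latency:.3f} {high_latency:.3f}")
```

Under this model the speedup is exactly 1 when the round-trip time is zero and rises as latency takes up a larger share of each token's time, which matches the direction of the LAN versus WiFi results.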