
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR #12943


Merged (1 commit, Apr 25, 2025)

Conversation

rgerganov (Collaborator)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response.

The performance impact of this change depends on the network latency.
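The change can be sketched as follows. This is a minimal illustration in Python, not the actual ggml-rpc C++ code; the wire format, command id, and helper names (`send_msg`, `recv_msg`, `set_tensor_v1`/`v2`) are all assumptions for the sake of the example.

```python
import struct

# Assumed wire format: 1-byte command id + 8-byte little-endian payload
# size + payload. The real ggml-rpc framing may differ.
RPC_CMD_SET_TENSOR = 6  # hypothetical command id

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def send_msg(sock, cmd, payload):
    sock.sendall(struct.pack("<BQ", cmd, len(payload)) + payload)

def recv_msg(sock):
    (size,) = struct.unpack("<Q", recv_exact(sock, 8))
    return recv_exact(sock, size)

# Before: every set_tensor paid a full network round trip.
def set_tensor_v1(sock, payload):
    send_msg(sock, RPC_CMD_SET_TENSOR, payload)
    recv_msg(sock)  # blocks on an always-empty response

# After: fire-and-forget. TCP's in-order delivery guarantees the server
# processes the tensor data before any later command that depends on it.
def set_tensor_v2(sock, payload):
    send_msg(sock, RPC_CMD_SET_TENSOR, payload)
```

Since the response carried no information, dropping the wait changes nothing semantically; it only removes one round-trip latency per `SET_TENSOR` call.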

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 14, 2025
@rgerganov (Collaborator, Author)

@steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

@steampunque

> @steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

A quick test shows a 2.5% to 5% boost, definitely noticeable and consistent on a 1 Gb/s local LAN:

Llama 4 Scout 108B Q2_K_M NGL 40/49 3x 4070 (2 RPC) cuda backend Llama 3.2 1b spec
PN=416 PP=73.79487586830689 TG=11.329969567048092 DN=531 DA=239
PN=416 PP=76.66702870541333 TG=11.901360834784338 DN=531 DA=239 with patch
PP X 1.04 TG X 1.05

QwQ 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=1384 PP=248.3264016115168 TG=31.242259273143873 DN=1780 DA=939
PN=1384 PP=254.4311950437515 TG=32.02353229834696 DN=1780 DA=939 with patch
PP X 1.025 TG X 1.026

DS R1 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=632 PP=304.26701414841614 TG=40.78197895972528 DN=680 DA=462
PN=632 PP=311.0518059036731 TG=41.87847766962705 DN=680 DA=462 with patch
PP X 1.022 TG X 1.026

PP = prompt processing
TG = token gen
PN = Predicted tokens
DN = Drafted tokens
DA = Accepted draft tokens
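The PP/TG multipliers quoted above can be reproduced directly from the raw t/s figures:

```python
# Recompute the reported speedup ratios from the raw t/s numbers
# (baseline, with-patch) for each of the three test configurations.
runs = {
    "Llama 4 Scout": {"pp": (73.79487586830689, 76.66702870541333),
                      "tg": (11.329969567048092, 11.901360834784338)},
    "QwQ 32B":       {"pp": (248.3264016115168, 254.4311950437515),
                      "tg": (31.242259273143873, 32.02353229834696)},
    "DS R1 32B":     {"pp": (304.26701414841614, 311.0518059036731),
                      "tg": (40.78197895972528, 41.87847766962705)},
}

for name, r in runs.items():
    pp_x = r["pp"][1] / r["pp"][0]
    tg_x = r["tg"][1] / r["tg"][0]
    print(f"{name}: PP x {pp_x:.3f}, TG x {tg_x:.3f}")
```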

@rgerganov (Collaborator, Author)

Thank you for these measurements. It's a small improvement, but the code changes are also small, so I think it's worth it.

As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.
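A version handshake along those lines could look like the following. This is a sketch only; the actual `RPC_CMD_HELLO` payload format and version numbers in ggml-rpc are assumptions here.

```python
import struct

# Hypothetical protocol version advertised by this build.
RPC_PROTO_MAJOR, RPC_PROTO_MINOR, RPC_PROTO_PATCH = 2, 0, 0

def pack_hello():
    # Assumed HELLO payload: three unsigned bytes (major, minor, patch).
    return struct.pack("<BBB", RPC_PROTO_MAJOR, RPC_PROTO_MINOR, RPC_PROTO_PATCH)

def check_hello(payload):
    major, minor, patch = struct.unpack("<BBB", payload)
    # Breaking changes (like dropping the SET_TENSOR response) bump the
    # major version, so mismatched majors must refuse to interoperate.
    if major != RPC_PROTO_MAJOR:
        raise RuntimeError(f"incompatible RPC protocol {major}.{minor}.{patch}, "
                           f"expected {RPC_PROTO_MAJOR}.x.y")
    return (major, minor, patch)
```

With a handshake like this, an old client that still waits for the `SET_TENSOR` response fails fast at connect time instead of silently desynchronizing mid-stream.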

@steampunque

> Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it.
>
> As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.

Nice. Any speedup is appreciated!

@rgerganov (Collaborator, Author)

I did some performance testing with rpc-server running on Steam Deck and using both LAN and WiFi:

RPC v1.0.0 over LAN

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 147.15 ± 0.17 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 11.14 ± 0.04 |

RPC v1.0.0 over WiFi

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 141.74 ± 0.37 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 7.00 ± 0.06 |

RPC v2.0.0 over LAN

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 147.20 ± 0.38 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 11.68 ± 0.01 |

RPC v2.0.0 over WiFi

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 141.55 ± 1.23 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 8.72 ± 0.04 |

There is a 1.04x TG speedup for low-latency connections (which is consistent with @steampunque's results) and a 1.24x TG speedup for higher-latency connections such as WiFi.
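Since SET_TENSOR is sent 4 times per token, the per-token time saved lets us back out an implied round-trip latency from the WiFi numbers above. This is a rough model that attributes all of the TG improvement to the removed response waits:

```python
# Rough latency model: each token formerly waited on 4 empty SET_TENSOR
# responses, so the time per generated token drops by ~4 round trips.
tg_before = 7.00   # t/s, RPC v1.0.0 over WiFi (from the table above)
tg_after  = 8.72   # t/s, RPC v2.0.0 over WiFi

saved_per_token = 1 / tg_before - 1 / tg_after   # seconds saved per token
rtt_ms = saved_per_token / 4 * 1000              # implied round-trip latency

print(f"time saved per token: {saved_per_token * 1000:.1f} ms")
print(f"implied round-trip latency: {rtt_ms:.1f} ms")  # ~7 ms, plausible for WiFi
```

The same arithmetic applied to the LAN numbers yields a sub-millisecond round trip, which is why the LAN speedup is so much smaller.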

@rgerganov rgerganov marked this pull request as ready for review April 24, 2025 08:05
@rgerganov rgerganov merged commit 553a5c3 into ggml-org:master Apr 25, 2025
48 checks passed