
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR #12943

Merged: rgerganov merged 1 commit into ggml-org:master from rgerganov:rpc-noresp on Apr 25, 2025

Conversation

@rgerganov (Member)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response.

The performance impact of this change depends on the network latency.

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Apr 14, 2025
@rgerganov (Member, Author)

@steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

@steampunque

> @steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

Quick test looks like a 2.5% to 5% boost, definitely noticeable and consistent on a 1 Gb/s local LAN:

Llama 4 Scout 108B Q2_K_M NGL 40/49 3x 4070 (2 RPC) cuda backend Llama 3.2 1b spec
PN=416 PP=73.79487586830689 TG=11.329969567048092 DN=531 DA=239
PN=416 PP=76.66702870541333 TG=11.901360834784338 DN=531 DA=239 with patch
PP X 1.04 TG X 1.05

QwQ 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=1384 PP=248.3264016115168 TG=31.242259273143873 DN=1780 DA=939
PN=1384 PP=254.4311950437515 TG=32.02353229834696 DN=1780 DA=939 with patch
PP X 1.025 TG X 1.026

DS R1 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=632 PP=304.26701414841614 TG=40.78197895972528 DN=680 DA=462
PN=632 PP=311.0518059036731 TG=41.87847766962705 DN=680 DA=462 with patch
PP X 1.022 TG X 1.026

PP = prompt processing
TG = token gen
PN = Predicted tokens
DN = Drafted tokens
DA = Accepted draft tokens

@rgerganov (Member, Author)

Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it.

As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.

@steampunque

> Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it.
>
> As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.

Nice. Any speedup appreciated!

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
@rgerganov (Member, Author)

I did some performance testing with rpc-server running on Steam Deck and using both LAN and WiFi:

RPC v1.0.0 over LAN

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 147.15 ± 0.17 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 11.14 ± 0.04  |

RPC v1.0.0 over WiFi

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 141.74 ± 0.37 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 7.00 ± 0.06   |

RPC v2.0.0 over LAN

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 147.20 ± 0.38 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 11.68 ± 0.01  |

RPC v2.0.0 over WiFi

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 141.55 ± 1.23 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 8.72 ± 0.04   |

There is a 1.04× TG speedup for low-latency connections (which is consistent with @steampunque's results) and a 1.24× TG speedup for higher-latency connections such as WiFi.

@rgerganov rgerganov marked this pull request as ready for review April 24, 2025 08:05
@rgerganov rgerganov merged commit 553a5c3 into ggml-org:master Apr 25, 2025
48 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
…org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
timwu pushed a commit to timwu/llama.cpp that referenced this pull request Dec 20, 2025
…org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (ggml-org#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (ggml-org#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (ggml-org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (ggml-org#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <[email protected]>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <[email protected]>

---------

Signed-off-by: Ville Vesilehto <[email protected]>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (ggml-org#13188)

Signed-off-by: xiaofei <[email protected]>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (ggml-org#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <[email protected]>
Signed-off-by: xiaofei <[email protected]>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <[email protected]>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <[email protected]>
Co-authored-by: xiaofei <[email protected]>
Co-authored-by: Justin Santa Barbara <[email protected]>