
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR #12943

Merged: rgerganov merged 1 commit into ggml-org:master from rgerganov:rpc-noresp on Apr 25, 2025

Conversation

@rgerganov (Member)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response.

The performance impact of this change depends on the network latency.

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Apr 14, 2025
@rgerganov (Member, Author)

@steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

@steampunque

> @steampunque Could you please test this change on your setup and see if there is any noticeable improvement? You need to rebuild both server and clients.

Quick test looks like a 2.5% to 5% boost, definitely noticeable and consistent on a 1 Gb/s local LAN:

Llama 4 Scout 108B Q2_K_M NGL 40/49 3x 4070 (2 RPC) cuda backend Llama 3.2 1b spec
PN=416 PP=73.79487586830689 TG=11.329969567048092 DN=531 DA=239
PN=416 PP=76.66702870541333 TG=11.901360834784338 DN=531 DA=239 with patch
PP X 1.04 TG X 1.05

QwQ 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=1384 PP=248.3264016115168 TG=31.242259273143873 DN=1780 DA=939
PN=1384 PP=254.4311950437515 TG=32.02353229834696 DN=1780 DA=939 with patch
PP X 1.025 TG X 1.026

DS R1 32B IQ4_XS NGL 65/65 2x 4070 (1 RPC) cuda backend DS R1 1.5B spec
PN=632 PP=304.26701414841614 TG=40.78197895972528 DN=680 DA=462
PN=632 PP=311.0518059036731 TG=41.87847766962705 DN=680 DA=462 with patch
PP X 1.022 TG X 1.026

PP = prompt processing
TG = token gen
PN = Predicted tokens
DN = Drafted tokens
DA = Accepted draft tokens

@rgerganov (Member, Author)

Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it.

As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.

@steampunque

> Thank you for these measurements. It's a small improvement but code changes are also small, so I think it's worth it.
>
> As this would be yet another breaking change for the RPC protocol, I am going to add RPC_CMD_HELLO first and introduce some protocol versioning.

Nice. Any speedup appreciated!

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
@rgerganov (Member, Author)

I did some performance testing with rpc-server running on Steam Deck and using both LAN and WiFi:

RPC v1.0.0 over LAN

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 147.15 ± 0.17 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 11.14 ± 0.04  |

RPC v1.0.0 over WiFi

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 141.74 ± 0.37 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 7.00 ± 0.06   |

RPC v2.0.0 over LAN

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 147.20 ± 0.38 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 11.68 ± 0.01  |

RPC v2.0.0 over WiFi

| model          | size     | params | backend | ngl | test  | t/s           |
| -------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | pp512 | 141.55 ± 1.23 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC     | 99  | tg128 | 8.72 ± 0.04   |

There is a 1.04× TG speedup for low-latency connections (which is consistent with @steampunque's results) and a 1.24× TG speedup for higher-latency connections such as WiFi.

@rgerganov rgerganov marked this pull request as ready for review April 24, 2025 08:05
@rgerganov rgerganov merged commit 553a5c3 into ggml-org:master Apr 25, 2025
48 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
…org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
timwu pushed a commit to timwu/llama.cpp that referenced this pull request Dec 20, 2025
…org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (ggml-org#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (ggml-org#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (ggml-org#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (ggml-org#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <[email protected]>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <[email protected]>

---------

Signed-off-by: Ville Vesilehto <[email protected]>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (ggml-org#13188)

Signed-off-by: xiaofei <[email protected]>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (ggml-org#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <[email protected]>
Signed-off-by: xiaofei <[email protected]>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <[email protected]>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <[email protected]>
Co-authored-by: xiaofei <[email protected]>
Co-authored-by: Justin Santa Barbara <[email protected]>