server : support unified cache across slots #16736

ggerganov · 2025-10-23T09:31:48Z

Current logic in this PR (subject to change):

When using unified KV cache with -kvu, share the entire context -c N among all parallel slots of the server -np N
When we run out of space, try to free some by purging old sequences from idle slots
If we still run out of space, terminate all active slots at once
The -np N argument is still utilized to control the max number of parallel jobs, but it is no longer used to change the per-slot context
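
The policy described above can be sketched in a few lines. This is only an illustration under stated assumptions: `slot_sketch`, `room_result` and `make_room` are hypothetical names, not the server's actual data structures, and idle slots are purged in slot order here rather than by age.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified slot model (not the server's real structures).
struct slot_sketch {
    bool     active;   // slot is currently processing a task
    uint32_t n_cached; // cells this slot's sequence occupies in the shared cache
};

enum class room_result { ok, ok_after_purge, terminated_all };

// Try to make room for n_need more cells in a shared cache of n_ctx cells:
//   1. use free space if there is enough
//   2. otherwise purge cached sequences from idle slots
//   3. otherwise terminate all active slots at once
static room_result make_room(std::vector<slot_sketch> & slots, uint32_t n_ctx, uint32_t n_need) {
    uint32_t n_used = 0;
    for (const auto & s : slots) {
        n_used += s.n_cached;
    }
    if (n_used + n_need <= n_ctx) {
        return room_result::ok;
    }

    // purge idle slots one at a time until the request fits
    for (auto & s : slots) {
        if (!s.active && s.n_cached > 0) {
            n_used -= s.n_cached;
            s.n_cached = 0;
            if (n_used + n_need <= n_ctx) {
                return room_result::ok_after_purge;
            }
        }
    }

    // still out of space: drop every slot, active ones included
    for (auto & s : slots) {
        s.active   = false;
        s.n_cached = 0;
    }
    return room_result::terminated_all;
}
```

With four slots holding 3000, 4000, 1000 and 0 cells in an 8192-cell cache, where only the first and third are active, a request for 500 cells succeeds after the idle 4000-cell slot is purged. A request for 5000 cells after that cannot be satisfied by purging and ends in step 3.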

Example:

llama-server -m model.gguf -c 8192 --jinja -kvu -np 4
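
To make the effect of the example command concrete, here is a minimal sketch of the per-slot context arithmetic. It assumes that `-c` maps to `n_ctx` and `-np` to `n_seq_max`; `n_ctx_per_seq_sketch` is a hypothetical standalone helper that mirrors the ternary in the `n_ctx_per_seq()` change discussed later in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Standalone mirror of the n_ctx_per_seq() expression from this PR.
// kv_unified == true : every sequence may use up to the whole context
// kv_unified == false: the context is split evenly between sequences
static uint32_t n_ctx_per_seq_sketch(uint32_t n_ctx, uint32_t n_seq_max, bool kv_unified) {
    return kv_unified ? n_ctx : n_ctx / n_seq_max;
}
```

For `-c 8192 -np 4`, the split case gives each slot 2048 tokens, while with `-kvu` any single slot may grow up to 8192 tokens as long as the others leave room.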

TODO:

When we run out of space, terminate the active slots one-by-one and keep trying
Consider moving the slot into the host-memory cache instead of purging it. This may not really be needed, thanks to the existing logic from server : host-memory prompt caching #16391
Update logic for starting a new task to check that it has some extra room for generation
Add tests

slaren · 2025-10-23T13:46:10Z

src/llama-context.cpp


 uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;


Should this value be capped when using unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or add a parameter to allow the user to change it.

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change

-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);

> We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.
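
The concern about wasted memory can be illustrated with a small calculation. This is a sketch under the assumption that a split cache gives each of the `n_seq_max` streams exactly `n_ctx / n_seq_max` cells; `cap_report` and `cap_split_cache` are hypothetical names used only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct cap_report {
    uint32_t n_ctx_per_seq; // per-sequence context after capping
    uint32_t n_wasted;      // allocated cells that no sequence can ever use
};

// Split (non-unified) cache: cap the per-sequence context to n_ctx_train
// only at query time, without shrinking the allocation itself.
static cap_report cap_split_cache(uint32_t n_ctx, uint32_t n_seq_max, uint32_t n_ctx_train) {
    const uint32_t per_stream = n_ctx / n_seq_max;
    const uint32_t capped     = std::min(n_ctx_train, per_stream);
    return { capped, (per_stream - capped) * n_seq_max };
}
```

With `n_ctx = 32768`, `n_seq_max = 4` and a model trained on 4096 tokens, each stream is allocated 8192 cells but could only use 4096 of them, leaving 16384 cells allocated and unreachable. This is why the cap is better applied when the context is created.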

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

    const auto padding = llama_kv_cache::get_padding(cparams);

    uint32_t n_ctx_per_stream = cparams.n_ctx;

    if (!cparams.kv_unified) {
        n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
        n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

        cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
    } else {
        n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

        cparams.n_ctx = n_ctx_per_stream;
    }

    LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);

Since we no longer need the padding logic (as of #16148 and related) we should simplify this.

I'll push a separate PR for this and then will come back to polishing this one.
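
For reference, the effect of the padding logic quoted above can be reproduced in isolation. This is a sketch, not the library code: `pad_to` is assumed to round the same way as `GGML_PAD` (round up to a multiple of a power-of-two `n`), `ctx_sizes` and `padded_ctx` are hypothetical names, and the padding value of 256 used in the note below is purely illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n (n must be a power of two).
// Assumed to match the rounding done by GGML_PAD.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

struct ctx_sizes {
    uint32_t n_ctx;
    uint32_t n_ctx_per_stream;
};

// Standalone mirror of the quoted llama-model.cpp logic.
static ctx_sizes padded_ctx(uint32_t n_ctx, uint32_t n_seq_max, bool kv_unified, uint32_t padding) {
    uint32_t n_ctx_per_stream = n_ctx;

    if (!kv_unified) {
        // split cache: ceil-divide between streams, pad each stream, then recompute the total
        n_ctx_per_stream = (n_ctx + n_seq_max - 1) / n_seq_max;
        n_ctx_per_stream = pad_to(n_ctx_per_stream, padding);
        n_ctx = n_ctx_per_stream * n_seq_max;
    } else {
        // unified cache: a single stream holds the whole padded context
        n_ctx_per_stream = pad_to(n_ctx_per_stream, padding);
        n_ctx = n_ctx_per_stream;
    }
    return { n_ctx, n_ctx_per_stream };
}
```

With `n_ctx = 8192`, three sequences and a padding of 256, the split case yields 2816 cells per stream and a total of 8448, so the final `n_ctx` can end up larger than what the user requested. The unified case leaves 8192 unchanged because it is already a multiple of 256.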

server : support unified context across slots

1fa44f4

github-actions bot added examples server labels Oct 23, 2025

ggerganov added 4 commits October 23, 2025 14:33

cont : fix speculative decoding initialization

02d1011

context : fix n_ctx_per_seq computation

4d197ed

server : purge slots one by one

e7cec95

tests : add unified cache server tests

7863fcc

slaren reviewed Oct 23, 2025

github-actions bot added the python python script changes label Oct 23, 2025

wip [no ci]

7a25d4b
