
Conversation

@ggerganov (Member) commented on Oct 23, 2025

ref #4130 (reply in thread)

Current logic in this PR (subject to change):

  • When using the unified KV cache (-kvu), share the entire context (-c N) among all parallel slots of the server (-np N)
  • When we run out of space, try to free some by purging old sequences from idle slots
  • If we still run out of space, terminate all active slots at once
  • The -np N argument still controls the max number of parallel jobs, but it no longer changes the per-slot context

Example:

llama-server -m model.gguf -c 8192 --jinja -kvu -np 4
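
With this command, all 4 slots share the single 8192-token context, instead of each slot being pinned to 8192/4 = 2048 tokens as with the split cache. Below is a minimal, self-contained sketch of the purge-then-terminate fallback described above; all types and helper names are hypothetical stand-ins, not the actual llama-server code:

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch only -- not the actual llama-server implementation.
    // It illustrates the order of operations described above:
    //   1) purge sequences of idle slots to free space in the shared KV cache
    //   2) if that is still not enough, terminate the active slots (all at once for now)
    struct slot_t {
        bool     active = false; // currently generating
        uint32_t n_kv   = 0;     // tokens this slot occupies in the shared cache
    };

    struct shared_kv_t {
        uint32_t n_ctx = 0;        // total shared context (-c N with -kvu)
        std::vector<slot_t> slots; // -np N slots

        uint32_t n_used() const {
            uint32_t n = 0;
            for (const auto & s : slots) {
                n += s.n_kv;
            }
            return n;
        }

        bool ensure_room(uint32_t n_needed) {
            // step 1: purge idle slots until there is enough free space
            for (auto & s : slots) {
                if (n_ctx - n_used() >= n_needed) {
                    return true;
                }
                if (!s.active) {
                    s.n_kv = 0; // drop the cached sequence of this idle slot
                }
            }
            if (n_ctx - n_used() >= n_needed) {
                return true;
            }
            // step 2: still out of space -> terminate all active slots at once
            // (the TODO below is to do this one-by-one instead)
            for (auto & s : slots) {
                s.active = false;
                s.n_kv   = 0;
            }
            return n_ctx - n_used() >= n_needed;
        }
    };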

TODO:

  • When we run out of space, terminate the active slots one-by-one and keep trying
  • Consider moving the slot into the host-memory cache instead of purging it. Not sure this is really needed, thanks to the existing logic from server : host-memory prompt caching #16391
  • Update the logic for starting a new task to check that there is some extra room for generation
  • Add tests


uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
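
(For context: with the unified cache, n_ctx_per_seq() now reports the full shared context, so a single slot can in principle grow up to the whole -c N instead of being limited to n_ctx / n_seq_max.)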
Member:

Should this value be capped when using the unified cache, to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or a parameter could be added to let the user change it.

Member Author (@ggerganov):

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change
-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);
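
For example, assuming a model with n_ctx_train = 4096 and a server started with -c 8192 -kvu, the capped value would be min(4096, 8192) = 4096 per sequence; in the split case with -np 4 it would be min(4096, 8192/4) = 2048, i.e. unchanged.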

Member:

> We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way the split cache works, but my assumption is that those slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.

Member Author (@ggerganov):

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

    const auto padding = llama_kv_cache::get_padding(cparams);

    uint32_t n_ctx_per_stream = cparams.n_ctx;

    if (!cparams.kv_unified) {
        // split cache: each sequence gets its own stream of n_ctx/n_seq_max cells (rounded up)
        n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
        n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

        cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
    } else {
        // unified cache: a single stream shared by all sequences
        n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

        cparams.n_ctx = n_ctx_per_stream;
    }

    LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);

Since we no longer need the padding logic (as of #16148 and related), we should simplify this.
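
For illustration, a rough sketch of what the simplified cap-at-creation logic could look like once the padding is gone (hypothetical -- the actual change is left for the follow-up PR; n_ctx_train is assumed to come from the model hparams):

    // hypothetical sketch, in the llama_context constructor
    uint32_t n_ctx_per_stream = cparams.kv_unified ?
        cparams.n_ctx :
        (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;

    // cap the per-sequence context to the model's training context
    n_ctx_per_stream = std::min(n_ctx_per_stream, hparams.n_ctx_train);

    cparams.n_ctx = cparams.kv_unified ?
        n_ctx_per_stream :
        n_ctx_per_stream*cparams.n_seq_max;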

I'll push a separate PR for this and then come back to polishing this one.

@github-actions bot added the python (python script changes) label on Oct 23, 2025