Replies: 4 comments 13 replies
- #15602 is relevant.
- Consider also that Q, K, and V need to have the same type for batched GEMM.
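  For illustration, a minimal sketch of the precondition this implies before attempting the fusion; the helper and the tensor names `wq`/`wk`/`wv` are hypothetical, not existing llama.cpp code:

  ```cpp
  #include "ggml.h"

  // Hypothetical check: the fusion is only valid when the three projection
  // weights share one ggml type (same quantization) and, with no GQA, the
  // same number of output rows, so they can feed a single GEMM.
  static bool can_fuse_qkv(const ggml_tensor * wq,
                           const ggml_tensor * wk,
                           const ggml_tensor * wv) {
      return wq->type  == wk->type  && wk->type  == wv->type  &&
             wq->ne[1] == wk->ne[1] && wk->ne[1] == wv->ne[1];
  }
  ```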
- An additional opportunity for fusion would be to make the K and V matrices write their results directly to the KV cache, but then you would need to somehow define output pointers and strides per channel.
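  As a point of reference, a rough sketch of the extra copy that a direct-to-cache GEMM output would eliminate; the function and parameter names here are assumptions for illustration, not actual llama.cpp code:

  ```cpp
  #include "ggml.h"

  // Today the K (and similarly V) projection result is written into the cache
  // by an explicit copy into a strided view of the cache tensor. Writing the
  // GEMM result directly to the cache would instead require the matmul to know
  // this per-layer destination offset and stride.
  static void store_k_to_cache(ggml_context * ctx, ggml_cgraph * gf,
                               ggml_tensor * Kcur,     // [n_embd_k, n_tokens]
                               ggml_tensor * k_cache,  // flat per-layer cache tensor
                               int64_t n_embd_k, int64_t n_tokens, int64_t cache_head) {
      // view of the cache slots this batch should fill
      ggml_tensor * k_view = ggml_view_1d(ctx, k_cache, n_tokens*n_embd_k,
                                          cache_head*n_embd_k*ggml_element_size(k_cache));
      // the copy that a fused "GEMM writes to cache" op would remove
      ggml_build_forward_expand(gf, ggml_cpy(ctx, Kcur, k_view));
  }
  ```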
- While loading a model you don't know if there are going to be LoRAs, and it would break mmap. It may be better to do this fusion in the backends.
In llama-graph.cpp, the QKV projection is typically built as three separate matrix multiplications of the activations against the Q, K, and V weight tensors. If there are no LoRA adapters and no GQA (i.e. Q, K, and V all have the same dims), this could instead be a single GEMM between the activations A and the combined Q, K, V weights. I think the only requirement would be that the Q, K, and V weights are allocated contiguously. From llama-model::load_tensors, is it possible to change the loading so that the three weights end up in one contiguous (fused) tensor? We could get a rough performance benefit in the cases where there is no GQA with this simple change; a sketch of the idea is below.
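A minimal sketch of the graph-side change, assuming no LoRA and no GQA and a hypothetical contiguously allocated fused weight `wqkv` of shape [n_embd, 3*n_embd]; the helper and its names are illustrative, not existing llama.cpp code:

```cpp
#include "ggml.h"

// Current pattern (simplified): three separate GEMMs against the same activations,
//   Qcur = ggml_mul_mat(ctx, wq, cur);
//   Kcur = ggml_mul_mat(ctx, wk, cur);
//   Vcur = ggml_mul_mat(ctx, wv, cur);
// Proposed pattern: one GEMM against a fused weight, then three views into the
// result, similar to how models that already ship a fused QKV tensor are built.
static void build_fused_qkv(ggml_context * ctx,
                            ggml_tensor * wqkv,  // [n_embd, 3*n_embd], Q|K|V rows contiguous
                            ggml_tensor * cur,   // [n_embd, n_tokens] activations
                            int64_t n_embd, int64_t n_tokens,
                            ggml_tensor ** Qcur, ggml_tensor ** Kcur, ggml_tensor ** Vcur) {
    ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur);  // [3*n_embd, n_tokens]

    const size_t es = ggml_element_size(qkv);
    *Qcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 0*n_embd*es));
    *Kcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 1*n_embd*es));
    *Vcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 2*n_embd*es));
}
```

On the loading side this assumes llama-model::load_tensors can either place the three weights in one contiguous allocation or expose them as a single fused tensor, which is where the LoRA and mmap concerns raised above come in.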