Replies: 4 comments 13 replies
- #15602 is relevant.
- Consider also that Q, K, and V need to have the same type for batched GEMM.
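  For illustration, a minimal sketch of the precondition this implies before attempting the fusion; the helper and the tensor names `wq`/`wk`/`wv` are hypothetical, not existing llama.cpp code:

  ```cpp
  #include "ggml.h"

  // Hypothetical check: the fusion is only valid when the three projection
  // weights share one ggml type (same quantization) and, with no GQA, the
  // same number of output rows, so they can feed a single GEMM.
  static bool can_fuse_qkv(const ggml_tensor * wq,
                           const ggml_tensor * wk,
                           const ggml_tensor * wv) {
      return wq->type  == wk->type  && wk->type  == wv->type  &&
             wq->ne[1] == wk->ne[1] && wk->ne[1] == wv->ne[1];
  }
  ```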
- An additional opportunity for fusion would be to make the K and V matrices write their results directly to the KV cache, but then you would need to somehow define output pointers and strides per channel.
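  As a point of reference, a rough sketch of the extra copy that a direct-to-cache GEMM output would eliminate; the function and parameter names here are assumptions for illustration, not actual llama.cpp code:

  ```cpp
  #include "ggml.h"

  // Today the K (and similarly V) projection result is written into the cache
  // by an explicit copy into a strided view of the cache tensor. Writing the
  // GEMM result directly to the cache would instead require the matmul to know
  // this per-layer destination offset and stride.
  static void store_k_to_cache(ggml_context * ctx, ggml_cgraph * gf,
                               ggml_tensor * Kcur,     // [n_embd_k, n_tokens]
                               ggml_tensor * k_cache,  // flat per-layer cache tensor
                               int64_t n_embd_k, int64_t n_tokens, int64_t cache_head) {
      // view of the cache slots this batch should fill
      ggml_tensor * k_view = ggml_view_1d(ctx, k_cache, n_tokens*n_embd_k,
                                          cache_head*n_embd_k*ggml_element_size(k_cache));
      // the copy that a fused "GEMM writes to cache" op would remove
      ggml_build_forward_expand(gf, ggml_cpy(ctx, Kcur, k_view));
  }
  ```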
- While loading a model you don't know if there are going to be LoRAs, and it would break mmap. It may be better to do this fusion in the backends.
In llama-graph.cpp, the QKV projection is typically built as three separate matrix multiplications of the activations against the Q, K, and V weight tensors. If there are no LoRA adapters and no GQA (i.e. Q, K, and V all have the same dims), this could instead be a single GEMM between the activations A and the combined Q, K, V weights. I think the only requirement would be that the Q, K, and V weights are allocated contiguously. From llama-model::load_tensors, is it possible to change the loading so that the three weights end up in one contiguous (fused) tensor? We could get a rough performance benefit in the cases where there is no GQA with this simple change; a sketch of the idea is below.
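A minimal sketch of the graph-side change, assuming no LoRA and no GQA and a hypothetical contiguously allocated fused weight `wqkv` of shape [n_embd, 3*n_embd]; the helper and its names are illustrative, not existing llama.cpp code:

```cpp
#include "ggml.h"

// Current pattern (simplified): three separate GEMMs against the same activations,
//   Qcur = ggml_mul_mat(ctx, wq, cur);
//   Kcur = ggml_mul_mat(ctx, wk, cur);
//   Vcur = ggml_mul_mat(ctx, wv, cur);
// Proposed pattern: one GEMM against a fused weight, then three views into the
// result, similar to how models that already ship a fused QKV tensor are built.
static void build_fused_qkv(ggml_context * ctx,
                            ggml_tensor * wqkv,  // [n_embd, 3*n_embd], Q|K|V rows contiguous
                            ggml_tensor * cur,   // [n_embd, n_tokens] activations
                            int64_t n_embd, int64_t n_tokens,
                            ggml_tensor ** Qcur, ggml_tensor ** Kcur, ggml_tensor ** Vcur) {
    ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur);  // [3*n_embd, n_tokens]

    const size_t es = ggml_element_size(qkv);
    *Qcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 0*n_embd*es));
    *Kcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 1*n_embd*es));
    *Vcur = ggml_cont(ctx, ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 2*n_embd*es));
}
```

On the loading side this assumes llama-model::load_tensors can either place the three weights in one contiguous allocation or expose them as a single fused tensor, which is where the LoRA and mmap concerns raised above come in.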