Hi,
Thank you for the great repo! I learned a lot from the implementation. Have you tried combining SeerAttention and SeerAttention-R to enable both sparse prefill and sparse decoding? I imagine a simple approach would be to use separate linear layers and token budgets for prefill and decoding. I'm curious about any results you've seen or your thoughts on this.
Thank you!