Add attention sink to flash attention #2070

@RissyRan

Description

Hi team,

Could we add an attention sink mechanism to flash attention, to enable gpt-oss model support?

I think we should add something like the snippet below right after this qk_product (attn_weights = jnp.einsum(...)) here.

Reference implementation is here, and the main logic is:

sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)  # broadcast per-head sink logits to (batch, num_heads, q_len, 1)
combined_logits = torch.cat([attn_weights, sinks], dim=-1)  # append the sink as one extra column on the key axis
combined_logits = combined_logits - combined_logits.max(dim=-1, keepdim=True).values  # subtract the row max for a numerically stable softmax
probs = F.softmax(combined_logits, dim=-1, dtype=combined_logits.dtype)
scores = probs[..., :-1]  # we drop the sink here
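
For reference, a minimal JAX sketch of the same math (not the actual MaxText code; softmax_with_sink, attn_weights, and sinks are placeholder names, and the (batch, num_heads, q_len, kv_len) layout is an assumption):

import jax
import jax.numpy as jnp

def softmax_with_sink(attn_weights, sinks):
    # attn_weights: (batch, num_heads, q_len, kv_len) raw qk logits (assumed layout).
    # sinks: (num_heads,) learned per-head sink logits (hypothetical parameter name).
    b, h, q, _ = attn_weights.shape
    # Broadcast each head's sink logit into one extra "key" column.
    sink_col = jnp.broadcast_to(sinks.reshape(1, h, 1, 1), (b, h, q, 1))
    combined = jnp.concatenate([attn_weights, sink_col], axis=-1)
    # Numerically stable softmax over the keys plus the sink column.
    combined = combined - jnp.max(combined, axis=-1, keepdims=True)
    probs = jax.nn.softmax(combined, axis=-1)
    # Drop the sink column; each row now sums to less than 1, which is the
    # intended effect of the attention sink.
    return probs[..., :-1]

# Shape check with dummy inputs:
# softmax_with_sink(jnp.zeros((2, 4, 8, 16)), jnp.zeros((4,))).shape == (2, 4, 8, 16)

Note that inside the flash attention kernel itself, the sink logit would presumably need to be folded into the running max and softmax denominator of the block-wise (online) softmax rather than materialized as an extra column, but the sketch above shows the intended math.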
