
Conversation

nenad1002
Contributor

Description

Attention BFloat16 support for CUDA: extends the kernel implementations to accept BF16 input/output tensors.

Motivation and Context

We already have BFloat16 support for GQA (Group Query Attention), but not for the regular Attention operator, which many models (e.g. the visual encoder of Gemma 3) require for inference. BF16 offers FP32-like numerical stability at lower memory and compute cost.
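For illustration only, a minimal sketch (not this PR's actual kernel code) of what accepting BF16 tensors in an attention-style CUDA kernel can look like: inputs and outputs stay in `__nv_bfloat16`, while the dot-product accumulation is done in FP32. All names here (`QkDotKernel`, `LaunchQkDotBf16`) are hypothetical.

```cpp
// Hypothetical sketch, not the PR's actual kernels: a templated QK^T row
// kernel that also accepts BF16 input/output, accumulating in FP32.
#include <cuda_bf16.h>
#include <cuda_runtime.h>

__device__ inline float ToFloat(float v) { return v; }
__device__ inline float ToFloat(__nv_bfloat16 v) { return __bfloat162float(v); }

template <typename T>
__device__ inline T FromFloat(float v);
template <>
__device__ inline float FromFloat<float>(float v) { return v; }
template <>
__device__ inline __nv_bfloat16 FromFloat<__nv_bfloat16>(float v) { return __float2bfloat16(v); }

// scores[j] = scale * dot(q_row, k_row_j) for a single query row.
template <typename T>
__global__ void QkDotKernel(const T* q, const T* k, T* scores,
                            int seq_len, int head_size, float scale) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= seq_len) return;
  float acc = 0.0f;                       // accumulate in FP32 for stability
  for (int d = 0; d < head_size; ++d) {
    acc += ToFloat(q[d]) * ToFloat(k[j * head_size + d]);
  }
  scores[j] = FromFloat<T>(acc * scale);  // store back in the tensor's dtype
}

// Host-side launcher for the BF16 instantiation; FP16/FP32 paths look the same.
void LaunchQkDotBf16(const __nv_bfloat16* q, const __nv_bfloat16* k, __nv_bfloat16* scores,
                     int seq_len, int head_size, float scale, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (seq_len + threads - 1) / threads;
  QkDotKernel<__nv_bfloat16><<<blocks, threads, 0, stream>>>(q, k, scores,
                                                             seq_len, head_size, scale);
}
```

The real attention kernels of course fuse softmax and the V product; the sketch only shows the type-dispatch pattern (BF16 storage, FP32 accumulation) that "accept BF16 input/output tensors" implies at the kernel level.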

nenad1002 and others added 25 commits August 28, 2025 14:44
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@nenad1002 nenad1002 requested a review from tianleiwu September 10, 2025 15:53
Contributor

@tianleiwu tianleiwu left a comment


Could you try building the code for an older CUDA architecture like the GTX 1080 to see whether the compiler and the code behave properly? (It needs to fail nicely, e.g. show an error message that the GPU does not support BF16.)
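For reference, a hedged sketch of the kind of runtime guard being asked for here (hypothetical helper, not the PR's actual code): query the device's compute capability and return a clear error when BF16 is not supported. Native BF16 arithmetic generally requires compute capability 8.0 (Ampere) or newer, while a GTX 1080 is SM 6.1.

```cpp
// Hypothetical guard, not the PR's actual code: reject BF16 Attention on
// devices below SM 8.0 with a readable error instead of a kernel failure.
#include <cuda_runtime.h>
#include <cstdio>

bool DeviceSupportsBf16(int device_id, char* error_msg, size_t error_msg_size) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    snprintf(error_msg, error_msg_size,
             "Failed to query properties for CUDA device %d.", device_id);
    return false;
  }
  if (prop.major < 8) {
    snprintf(error_msg, error_msg_size,
             "BFloat16 Attention requires compute capability 8.0 or newer; "
             "device %d (%s) reports %d.%d.",
             device_id, prop.name, prop.major, prop.minor);
    return false;
  }
  return true;
}
```

On the compile side, kernel bodies can additionally be fenced with `#if __CUDA_ARCH__ >= 800` so the build still succeeds when targeting older architectures.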

@nenad1002 nenad1002 requested a review from tianleiwu September 12, 2025 18:10
@nenad1002 nenad1002 merged commit 8301eea into main Sep 15, 2025
92 of 98 checks passed
@nenad1002 nenad1002 deleted the nebanfic/attention-bf16-2 branch September 15, 2025 18:45