
Conversation

nenad1002
Contributor

Description

Attention BFloat16 support for CUDA: extends the kernel implementations to accept BF16 input/output tensors.

Motivation and Context

We already have BFloat16 support for GQA (Group Query Attention), but not for the regular Attention operator, which many models (e.g. the visual encoder of Gemma 3) require for inference. BF16 offers FP32-like numerical stability at lower memory and compute cost.
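For illustration only, a minimal sketch (not this PR's actual kernel code) of what accepting BF16 tensors in an attention-style CUDA kernel can look like: inputs and outputs stay in `__nv_bfloat16`, while the dot-product accumulation is done in FP32. All names here (`QkDotKernel`, `LaunchQkDotBf16`) are hypothetical.

```cpp
// Hypothetical sketch, not the PR's actual kernels: a templated QK^T row
// kernel that also accepts BF16 input/output, accumulating in FP32.
#include <cuda_bf16.h>
#include <cuda_runtime.h>

__device__ inline float ToFloat(float v) { return v; }
__device__ inline float ToFloat(__nv_bfloat16 v) { return __bfloat162float(v); }

template <typename T>
__device__ inline T FromFloat(float v);
template <>
__device__ inline float FromFloat<float>(float v) { return v; }
template <>
__device__ inline __nv_bfloat16 FromFloat<__nv_bfloat16>(float v) { return __float2bfloat16(v); }

// scores[j] = scale * dot(q_row, k_row_j) for a single query row.
template <typename T>
__global__ void QkDotKernel(const T* q, const T* k, T* scores,
                            int seq_len, int head_size, float scale) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= seq_len) return;
  float acc = 0.0f;                       // accumulate in FP32 for stability
  for (int d = 0; d < head_size; ++d) {
    acc += ToFloat(q[d]) * ToFloat(k[j * head_size + d]);
  }
  scores[j] = FromFloat<T>(acc * scale);  // store back in the tensor's dtype
}

// Host-side launcher for the BF16 instantiation; FP16/FP32 paths look the same.
void LaunchQkDotBf16(const __nv_bfloat16* q, const __nv_bfloat16* k, __nv_bfloat16* scores,
                     int seq_len, int head_size, float scale, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (seq_len + threads - 1) / threads;
  QkDotKernel<__nv_bfloat16><<<blocks, threads, 0, stream>>>(q, k, scores,
                                                             seq_len, head_size, scale);
}
```

The real attention kernels of course fuse softmax and the V product; the sketch only shows the type-dispatch pattern (BF16 storage, FP32 accumulation) that "accept BF16 input/output tensors" implies at the kernel level.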

nenad1002 and others added 25 commits August 28, 2025 14:44
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@nenad1002 nenad1002 requested a review from tianleiwu September 10, 2025 15:53
Contributor

@tianleiwu tianleiwu left a comment


Could you try building the code for an older CUDA architecture like the GTX 1080 to see whether the compiler and the code behave properly? (It needs to fail nicely, e.g. show an error message that the GPU does not support BF16.)
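For reference, a hedged sketch of the kind of runtime guard being asked for here (hypothetical helper, not the PR's actual code): query the device's compute capability and return a clear error when BF16 is not supported. Native BF16 arithmetic generally requires compute capability 8.0 (Ampere) or newer, while a GTX 1080 is SM 6.1.

```cpp
// Hypothetical guard, not the PR's actual code: reject BF16 Attention on
// devices below SM 8.0 with a readable error instead of a kernel failure.
#include <cuda_runtime.h>
#include <cstdio>

bool DeviceSupportsBf16(int device_id, char* error_msg, size_t error_msg_size) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    snprintf(error_msg, error_msg_size,
             "Failed to query properties for CUDA device %d.", device_id);
    return false;
  }
  if (prop.major < 8) {
    snprintf(error_msg, error_msg_size,
             "BFloat16 Attention requires compute capability 8.0 or newer; "
             "device %d (%s) reports %d.%d.",
             device_id, prop.name, prop.major, prop.minor);
    return false;
  }
  return true;
}
```

On the compile side, kernel bodies can additionally be fenced with `#if __CUDA_ARCH__ >= 800` so the build still succeeds when targeting older architectures.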

@nenad1002 nenad1002 requested a review from tianleiwu September 12, 2025 18:10
@nenad1002 nenad1002 merged commit 8301eea into main Sep 15, 2025
92 of 98 checks passed
@nenad1002 nenad1002 deleted the nebanfic/attention-bf16-2 branch September 15, 2025 18:45