
Commit 99f2b80

RyanUnderhill (Ryan Hill) and Ryan Hill authored
Fix cuda memory access violation in GQA FlashAttention (microsoft#24447)
### Description

The `zeros_` memory buffer was uninitialized, but it must be initialized to zero.

### Motivation and Context

A memory allocator change in GenAI started crashing in FlashAttention, and this was eventually tracked down as the cause; the allocator change itself was innocent. I'm not sure how this didn't fail previously, or, if it did, why we weren't getting reports about it.

Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
1 parent f267b7e commit 99f2b80

File tree

1 file changed (+1, -0 lines)


onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc

+1
```diff
@@ -63,6 +63,7 @@ GroupQueryAttention<T>::GroupQueryAttention(const OpKernelInfo& info)
 
   if (!disable_flash_attention_) {
     zeros_ = this->GetScratchBuffer<int>(kZerosCount, nullptr);
+    CUDA_CALL_THROW(cudaMemset(zeros_.get(), 0, kZerosCount * sizeof(int)));
   }
 }
 
```
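To illustrate the bug class outside of ONNX Runtime, here is a minimal standalone CUDA sketch. The `cudaMalloc` call and the `kZerosCount` constant below are stand-ins for the `GetScratchBuffer` allocation in the diff, not the actual ONNX Runtime code: memory returned by an allocator is not guaranteed to be zero-filled, so a buffer that a kernel reads as "all zeros" must be explicitly cleared with `cudaMemset` before use.

```cpp
// Standalone sketch of the bug class: uninitialized device scratch memory
// that downstream code assumes is zero. The fix is an explicit cudaMemset.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(expr)                                       \
  do {                                                         \
    cudaError_t err__ = (expr);                                \
    if (err__ != cudaSuccess) {                                \
      std::fprintf(stderr, "CUDA error: %s\n",                 \
                   cudaGetErrorString(err__));                 \
      std::exit(1);                                            \
    }                                                          \
  } while (0)

int main() {
  constexpr int kZerosCount = 256;  // stand-in for the fixed-size zeros_ buffer
  int* zeros = nullptr;

  // cudaMalloc returns uninitialized memory; treating it as "all zeros"
  // without the memset below is exactly the bug this commit fixes.
  CUDA_CHECK(cudaMalloc(&zeros, kZerosCount * sizeof(int)));
  CUDA_CHECK(cudaMemset(zeros, 0, kZerosCount * sizeof(int)));

  // Copy back and verify the buffer really is zeroed before a kernel uses it.
  int host[kZerosCount];
  CUDA_CHECK(cudaMemcpy(host, zeros, sizeof(host), cudaMemcpyDeviceToHost));
  std::printf("first element after memset: %d\n", host[0]);

  CUDA_CHECK(cudaFree(zeros));
  return 0;
}
```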
