
Commit 99f2b80

RyanUnderhill (Ryan Hill) and Ryan Hill authored
Fix cuda memory access violation in GQA FlashAttention (microsoft#24447)
### Description

The `zeros_` memory buffer was uninitialized, but it must be initialized to zero.

### Motivation and Context

A memory allocator change in GenAI started crashing in FlashAttention, and this was eventually tracked down as the cause; the allocator change itself was innocent. I'm not sure how this didn't fail previously, or, if it did, why we weren't getting reports about it.

Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
1 parent f267b7e commit 99f2b80

File tree

1 file changed (+1, -0 lines)


onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc

+1
```diff
@@ -63,6 +63,7 @@ GroupQueryAttention<T>::GroupQueryAttention(const OpKernelInfo& info)
 
   if (!disable_flash_attention_) {
     zeros_ = this->GetScratchBuffer<int>(kZerosCount, nullptr);
+    CUDA_CALL_THROW(cudaMemset(zeros_.get(), 0, kZerosCount * sizeof(int)));
   }
 }
 
```
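To illustrate the bug class outside of ONNX Runtime, here is a minimal standalone CUDA sketch. The `cudaMalloc` call and the `kZerosCount` constant below are stand-ins for the `GetScratchBuffer` allocation in the diff, not the actual ONNX Runtime code: memory returned by an allocator is not guaranteed to be zero-filled, so a buffer that a kernel reads as "all zeros" must be explicitly cleared with `cudaMemset` before use.

```cpp
// Standalone sketch of the bug class: uninitialized device scratch memory
// that downstream code assumes is zero. The fix is an explicit cudaMemset.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(expr)                                       \
  do {                                                         \
    cudaError_t err__ = (expr);                                \
    if (err__ != cudaSuccess) {                                \
      std::fprintf(stderr, "CUDA error: %s\n",                 \
                   cudaGetErrorString(err__));                 \
      std::exit(1);                                            \
    }                                                          \
  } while (0)

int main() {
  constexpr int kZerosCount = 256;  // stand-in for the fixed-size zeros_ buffer
  int* zeros = nullptr;

  // cudaMalloc returns uninitialized memory; treating it as "all zeros"
  // without the memset below is exactly the bug this commit fixes.
  CUDA_CHECK(cudaMalloc(&zeros, kZerosCount * sizeof(int)));
  CUDA_CHECK(cudaMemset(zeros, 0, kZerosCount * sizeof(int)));

  // Copy back and verify the buffer really is zeroed before a kernel uses it.
  int host[kZerosCount];
  CUDA_CHECK(cudaMemcpy(host, zeros, sizeof(host), cudaMemcpyDeviceToHost));
  std::printf("first element after memset: %d\n", host[0]);

  CUDA_CHECK(cudaFree(zeros));
  return 0;
}
```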
