SilvesterHsu
The scatter plugin copies data with cudaMemcpy and relies on its implicit synchronization to ensure that device_transform_coeff is populated before the kernel runs. That assumption breaks when TensorRT inference runs on a stream created with cudaStreamNonBlocking: such a stream does not synchronize implicitly with the default stream, so the copy and the kernel are no longer guaranteed to be ordered, and the plugin produces incorrect results. Switching to cudaMemcpyAsync on the same stream as the kernel restores the ordering and yields correct results.
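For reviewers, the ordering argument can be illustrated with a minimal, self-contained sketch. This is not the plugin's code: the kernel `apply_coeff`, the helper `scatter_demo_matches_reference`, the `CHECK_CUDA` macro, and the buffer names other than the idea of a coefficient buffer are hypothetical stand-ins. It only shows the pattern the fix relies on, namely enqueueing the coefficient copy with cudaMemcpyAsync on the same non-blocking stream as the kernel that reads it.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Sketch-level error handling: bail out on the first CUDA error.
// (Resources are not released on the error path.)
#define CHECK_CUDA(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical stand-in for the plugin kernel: reads two coefficients
// from device memory and applies out = in * coeff[0] + coeff[1].
__global__ void apply_coeff(const float* in, const float* coeff, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * coeff[0] + coeff[1];
}

// Runs the copy and the kernel on one non-blocking stream and checks the result.
bool scatter_demo_matches_reference() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    // A stream like the one that triggers the reported problem.
    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    // Pinned host buffer for the coefficients (an assumption of this sketch).
    float* h_coeff = nullptr;
    CHECK_CUDA(cudaMallocHost(&h_coeff, 2 * sizeof(float)));
    h_coeff[0] = 2.0f;
    h_coeff[1] = 1.0f;

    float *d_in = nullptr, *d_coeff = nullptr, *d_out = nullptr;
    CHECK_CUDA(cudaMalloc(&d_in, bytes));
    CHECK_CUDA(cudaMalloc(&d_coeff, 2 * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&d_out, bytes));

    // Copies and kernel share one stream, so they execute in issue order.
    CHECK_CUDA(cudaMemcpyAsync(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice, stream));
    CHECK_CUDA(cudaMemcpyAsync(d_coeff, h_coeff, 2 * sizeof(float),
                               cudaMemcpyHostToDevice, stream));
    apply_coeff<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_coeff, d_out, n);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaMemcpyAsync(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost, stream));

    // Wait for everything queued on the stream before reading h_out.
    CHECK_CUDA(cudaStreamSynchronize(stream));

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i] * 2.0f + 1.0f) { ok = false; break; }
    }

    cudaFree(d_in);
    cudaFree(d_coeff);
    cudaFree(d_out);
    cudaFreeHost(h_coeff);
    cudaStreamDestroy(stream);
    return ok;
}
```

Because every operation is issued to `stream`, the kernel cannot start until the coefficient copy ahead of it in that stream has finished, regardless of what the default stream is doing. The helper can be called from a small `main` and its return value checked.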

@kevinch-nv kevinch-nv requested a review from a team as a code owner July 9, 2025 17:14
@kevinch-nv kevinch-nv requested review from LeoZDong and kevinch-nv and removed request for a team July 9, 2025 17:14