Commit 40d0115

[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size
Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>
1 parent cab198d commit 40d0115

1 file changed

Lines changed: 9 additions & 0 deletions

cpp/tensorrt_llm/kernels/quantization.cuh

@@ -897,6 +897,15 @@ quantize_with_block_size(
 }
 }
 }
+// Fix for nvbugs/5970614 (https://nvbugspro.nvidia.com/bug/5970614).
+// PDL completion is reported when every CTA has either exited or called
+// this function at least once (per CUDA Programming Guide). Without a
+// CTA-wide barrier, an early-finishing warp can trigger completion while
+// other warps in the same CTA are still writing sf_out / out, allowing the
+// downstream NVF4 GEMM consumer to read partial data once
+// wait_on_dependent_grids returns. Drain the CTA's stores before trigger.
+__syncthreads();
+__threadfence();
 cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }
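
The ordering introduced by this change can be sketched in isolation. The example below is a minimal illustration, not the TensorRT-LLM kernel: `producer`, `consumer`, the buffer names, and the sizes are all hypothetical, and the consumer uses `cudaGridDependencySynchronize()` as a stand-in for the `wait_on_dependent_grids` call mentioned in the comment. It assumes a GPU with compute capability 9.0 or newer and a build along the lines of `nvcc -arch=sm_90`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(err_));                      \
            std::exit(1);                                                \
        }                                                                \
    } while (0)

// Hypothetical producer: each thread writes one element, then the CTA
// drains its stores before signalling programmatic launch completion.
__global__ void producer(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * static_cast<float>(i);
    }
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    __syncthreads();  // every thread in the CTA has issued its stores
    __threadfence();  // order those stores before the trigger, device-wide
    cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical consumer: waits on the producer grid before reading.
__global__ void consumer(const float* in, float* result, int n)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    cudaGridDependencySynchronize();
#endif
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] = in[i] + 1.0f;
    }
}

int main()
{
    const int n = 1 << 16;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float* d_a = nullptr;
    float* d_b = nullptr;
    CHECK(cudaMalloc(&d_a, n * sizeof(float)));
    CHECK(cudaMalloc(&d_b, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    producer<<<blocks, threads, 0, stream>>>(d_a, n);
    CHECK(cudaGetLastError());

    // Opt the consumer launch into programmatic dependent launch (PDL).
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(blocks);
    cfg.blockDim = dim3(threads);
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = attr;
    cfg.numAttrs = 1;

    CHECK(cudaLaunchKernelEx(&cfg, consumer,
                             static_cast<const float*>(d_a), d_b, n));
    CHECK(cudaStreamSynchronize(stream));

    std::vector<float> h_b(n);
    CHECK(cudaMemcpy(h_b.data(), d_b, n * sizeof(float),
                     cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_b[i] != 2.0f * static_cast<float>(i) + 1.0f) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    return mismatches == 0 ? 0 : 1;
}
```

The `__syncthreads()` sits outside the bounds check so that every thread in the CTA reaches the barrier, including threads that had nothing to write. A clean run of this toy case does not show that the barrier and fence are needed, since the race described in the commit depends on timing; it only shows where the calls go relative to the trigger.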
