[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size#14668
[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size#14668tianyuxbear wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughThe PR addresses a CUDA graph execution issue in NVIDIA's TensorRT-LLM by adding memory-ordering synchronization to the quantization kernel and adjusting the DeepSeek-R1 NVFP4 test configuration, then removing the corresponding test waiver. ChangesNVFP4 DeepSeek-R1 Throughput Stabilization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/quantization.cuh (1)
2-2:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate the copyright header year on this modified source file.
This file was modified, but the NVIDIA header still ends at 2023. Please update it to include the latest modification year.
Suggested fix
- * Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines:
**/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/quantization.cuh` at line 2, Update the NVIDIA copyright header at the top of the modified source file by changing the year range from "2019-2023" to include the current modification year (e.g. "2019-2026"); edit the top-of-file header comment in quantization.cuh (the initial copyright comment block) so the year range reflects the latest modification year.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/quantization.cuh`:
- Line 2: Update the NVIDIA copyright header at the top of the modified source
file by changing the year range from "2019-2023" to include the current
modification year (e.g. "2019-2026"); edit the top-of-file header comment in
quantization.cuh (the initial copyright comment block) so the year range
reflects the latest modification year.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 04887fc1-1f9e-4ec2-8b75-a6af13d6921c
📒 Files selected for processing (3)
cpp/tensorrt_llm/kernels/quantization.cuhtests/integration/defs/accuracy/test_llm_api_pytorch.pytests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
|
/bot run --disable-fail-fast |
|
PR_Github #53276 [ run ] triggered by Bot. Commit: |
|
PR_Github #53276 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53316 [ run ] triggered by Bot. Commit: |
|
PR_Github #53316 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53440 [ run ] triggered by Bot. Commit: |
|
PR_Github #53440 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53533 [ run ] triggered by Bot. Commit: |
|
PR_Github #53533 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53601 [ run ] triggered by Bot. Commit: |
|
PR_Github #53601 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53784 [ run ] triggered by Bot. Commit: |
|
PR_Github #53784 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53854 [ run ] triggered by Bot. Commit: |
|
PR_Github #53854 [ run ] completed with state
|
…_with_block_size Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>
…e_with_block_size Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #54115 [ run ] triggered by Bot. Commit: |
|
PR_Github #54115 [ run ] completed with state
|
Summary
Fixes a PDL (Programmatic Dependent Launch) race in
quantize_with_block_sizethat intermittently corrupts NVFP4 GEMM outputs and degrades GSM8K accuracy for DeepSeek-R1 NVFP4 on GB300 + PP=4 + MTP (nvbug 5970614).cudaTriggerProgrammaticLaunchCompletion()only signals that the CTA has reached the trigger point — it does not flush prior stores to global memory. Memory visibility for the secondary kernel must be provided either by the producer (a fence before the trigger) or by the consumer (wait_on_dependent_grids()before its first dependent load). In the current NVFP4 path neither side does so: the producer lacks a fence, and the sm103 blockscaled GEMM'smain_sf_loadwarp branch is missing the correspondingwait_on_dependent_grids()(tracked separately in NVIDIA/cutlass#3279).Compounding this, PDL completion is reported per-CTA at-least-once: a single warp reaching the trigger marks its whole CTA as "trigger reached", even if peer warps in the same CTA are still writing
sf_out/out. Once every CTA has been marked, the driver launches the secondary kernel, which TMA-loads partial data — NaNs then propagate through the DeepSeek-R1 forward and corrupt output tokens.Fix: insert
__syncthreads(); __threadfence();immediately before the trigger, so all warps in the CTA reach the same program point and all their stores are made globally visible before PDL completion is signaled.In-tree precedent
This mirrors an existing pattern in
cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.cuh:865-867, which uses__syncthreads(); membar.gl;immediately before the samecudaTriggerProgrammaticLaunchCompletion()call.__threadfence()is the CUDA intrinsic formembar.gl, so the two forms are semantically equivalent.Evidence
The race was characterized on
release/1.2, where the timing window reproduces it deterministically enough to measure. A cache-bypassld.global.cv.u8probe on the SF buffer detects0x7fpoison fill (from thecudaMallocAsyncpool) as a direct race indicator:On
mainthe race is latent: the GSM8K test no longer fails even without this fix (10/10 PASS, mean 94.95 ± 0.25 with autotuner off; with the fix mean 95.14 ± 0.19). Code/scheduler changes sincerelease/1.2appear to have narrowed the producer-to-consumer window enough that the race no longer trips the GSM8K threshold — but the underlying defect is unchanged, and any future change that widens the window (cutlass bump, scheduler reordering, kernel fusion) can re-expose it.The full investigation branch — probes, race-rate measurement, per-run logs — lives on my fork at
tianyuxbear/TensorRT-LLM:fix/5970614-bak, which is based onrelease/1.2(where the race is deterministically reproducible).The probe code used to measure the race is not part of this PR; it was a one-off diagnostic inside the cutlass consumer.
Relationship to the test-case waive on main
On
mainthe affected test case (TestDeepSeekR1::test_nvfp4_multi_gpus[throughput_pp4_mtp]) is currently waived for an unrelated OOM issue (nvbug 6018046); that is why the deterministic accuracy gap above was measured onrelease/1.2instead.This PR is standalone with respect to that waive: the change is a CTA-internal
__syncthreads(); __threadfence();immediately beforecudaTriggerProgrammaticLaunchCompletion(), with no shared code surface with the OOM root cause. It can be merged without waiting for nvbug 6018046 to be resolved, and once that waive is eventually lifted, this fence will already be in place to keep the race from re-tripping the test should any future change widen the producer-to-consumer window again.Companion cutlass fix
The consumer-side fix at NVIDIA/cutlass#3279 addresses the missing
wait_on_dependent_grids()in the sm103 blockscaled GEMMmain_sf_loadwarp branch (the other half of the race described in Summary). Either fix alone closes the race; both are correct individually:quantize_with_block_size, regardless of which cutlass revision trtllm pulls in.Risk
quantize_with_block_size), covering NVFP4 / FP8 / MXFP8 paths.__syncthreads()+ one__threadfence()at the very end of the kernel (cold path). No effect on PDL launch overlap — the trigger still happens; it just happens after the CTA has drained.