Sync dsv4-fp4-b300-trt recipes with B300 agg frontier config#1703
Oseltamivir wants to merge 6 commits into
B300 analog of PR #1699 (B200). Apply the same TensorRT-LLM recipe sync to dsv4_fp4_b300_trt.sh (MTP0) and dsv4_fp4_b300_trt_mtp.sh (MTP), and bump the dsv4-fp4-b300-trt / -mtp images to feat-deepseek_v4-c185066.

Recipe changes (both):
- Worker envs (overridable): TRTLLM_SERVER_DISABLE_GC, TRTLLM_WORKER_DISABLE_GC, NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
- kv_cache free_gpu_memory_fraction: 0.9 (no DP-attn) / 0.7 non-MTP, 0.6 MTP (DP-attn), was 0.50.
- attention_dp_config batching_wait_iters 0 -> 30, drop timeout_iters.
- stream_interval 10 -> 100; moe_config.use_low_precision_moe_combine: true.
- MOE_BACKEND overridable, switches to MEGAMOE_DEEPGEMM at high conc on 1k ISL.
- max_num_tokens drops the OSL term.

MTP additionally: max_draft_len (was num_nextn_predict_layers), default draft 3 stepping to 2 at high conc on 8k ISL, enable_lm_head_tp_in_adp on DP-attn.

B300-specific bits preserved: MODEL_PATH download block, TRTLLM_MHC_ENABLE_FUSED_HC=1, trtllm-serve "$MODEL_PATH".

B300 search space left as-is (already covers the high-concurrency frontier the recipe changes target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
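The "overridable" worker envs listed above can be sketched with Bash default-value expansion. This is a minimal illustration, not the recipe's actual code: it assumes "overridable" means a value already present in the caller's environment wins over the default, the function name `apply_worker_env_defaults` is hypothetical, and the `=1` values for the two GC variables are taken from the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a caller-supplied value takes precedence over the default.
# apply_worker_env_defaults is a hypothetical helper name, not from the recipe.
apply_worker_env_defaults() {
  # ${VAR:-default} expands to the default only when VAR is unset or empty.
  export TRTLLM_SERVER_DISABLE_GC="${TRTLLM_SERVER_DISABLE_GC:-1}"
  export TRTLLM_WORKER_DISABLE_GC="${TRTLLM_WORKER_DISABLE_GC:-1}"
  export NCCL_GRAPH_MIXING_SUPPORT="${NCCL_GRAPH_MIXING_SUPPORT:-0}"
  export MIMALLOC_PURGE_DELAY="${MIMALLOC_PURGE_DELAY:-0}"
  export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
}

apply_worker_env_defaults
```

Under this pattern, running a recipe as `MIMALLOC_PURGE_DELAY=5 bash dsv4_fp4_b300_trt.sh` would keep the caller's value, while an unset variable falls back to the default listed in the commit message.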
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27297153576
Cap the 8k1k tp8/ep8 DP-attn sweep at conc 256 (was 256-1024) for dsv4-fp4-b300-trt. trt-mtp and the 1k1k sweep are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27324968855
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d8c3caa.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27325008715
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27325043819
What
B300 analog of #1699 (which did this for B200). Sync the DeepSeek-V4-Pro aggregated frontier configs into the single-node TensorRT-LLM B300 recipes and bump the feature image. The non-MTP recipe carries the MTP0 settings; the MTP recipe carries the MTP settings.

Changes
Image (.github/configs/nvidia-master.yaml)
- dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp image bumped feat-deepseek_v4-9aa3715 → feat-deepseek_v4-c185066.

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh (MTP0)
- Worker envs (overridable): TRTLLM_SERVER_DISABLE_GC=1, TRTLLM_WORKER_DISABLE_GC=1, NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
- kv_cache_config.free_gpu_memory_fraction: 0.9 (TP / no DP-attn) / 0.7 (DP-attn), was 0.50.
- attention_dp_config: batching_wait_iters 0 → 30, drop timeout_iters.
- stream_interval 10 → 100; moe_config.use_low_precision_moe_combine: true.
- max_num_tokens drops the OSL term: ISL + 256.
- MOE_BACKEND made overridable (default TRTLLM; MEGAMOE_DEEPGEMM at high conc on 1k ISL).

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh (MTP)
Same as above, plus:
- DP-attn free_gpu_memory_fraction = 0.6.
- enable_lm_head_tp_in_adp: true on the DP-attn path.
- speculative_config uses max_draft_len; default level 2 → 3 (overridable via TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS), stepping back to 2 at high conc on 8k ISL.
- max_num_tokens = ISL + (draft+1)*batch + 256 (drops OSL; keeps the speculative-verification headroom).

Deliberate non-changes
- B300-specific bits preserved: MODEL_PATH download block, TRTLLM_MHC_ENABLE_FUSED_HC=1, and trtllm-serve "$MODEL_PATH".
- Search space left as-is: the B300 fixed-seq-len sweeps already cover the high-concurrency regime the recipe changes target (1k up to 2048, 8k up to 1024), so no conc-end edit is needed.
- cuda_graph_config / max_batch_size left CONC-derived.
- max_seq_len kept floored at ≥ 8192.

Validation
- bash -n passes on both recipes.
- The recipe config (including enable_lm_head_tp_in_adp and max_draft_len) parses as valid YAML.

🤖 Generated with Claude Code
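The sizing rules in the description reduce to a little arithmetic, sketched below in Bash. The function and argument names are hypothetical and not taken from the recipes; `batch` stands in for whatever CONC-derived batch size the script actually uses, and the 0.9 no-DP-attn fraction is assumed to apply to both recipes, as the commit message suggests.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and argument order are illustrative assumptions.

# Non-MTP (MTP0) recipe: max_num_tokens = ISL + 256 (no OSL term).
max_num_tokens_mtp0() {
  local isl=$1
  echo $(( isl + 256 ))
}

# MTP recipe: max_num_tokens = ISL + (draft + 1) * batch + 256.
# The (draft + 1) * batch term is the speculative-verification headroom.
max_num_tokens_mtp() {
  local isl=$1 draft=$2 batch=$3
  echo $(( isl + (draft + 1) * batch + 256 ))
}

# free_gpu_memory_fraction: 0.9 without DP-attn; with DP-attn, 0.7 (non-MTP) or 0.6 (MTP).
kv_cache_fraction() {
  local dp_attn=$1 mtp=$2
  if [ "$dp_attn" != "true" ]; then
    echo 0.9
  elif [ "$mtp" = "true" ]; then
    echo 0.6
  else
    echo 0.7
  fi
}
```

For example, `max_num_tokens_mtp 8192 3 64` evaluates 8192 + 4*64 + 256, and `kv_cache_fraction true true` selects the MTP DP-attn fraction.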
Note
Low Risk
Benchmark/CI recipe and environment tuning only; no application auth or production serving paths changed.
Overview
B300 DeepSeek-V4-Pro TensorRT-LLM benchmark recipes are aligned with the aggregated frontier settings (B300 follow-on to B200 PR #1699): both dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp use image feat-deepseek_v4-c185066, and the non-MTP 8k/1k tp8/ep8 DP-attn sweep conc-end is reduced from 1024 → 256.

The dsv4_fp4_b300_trt.sh and dsv4_fp4_b300_trt_mtp.sh scripts add default runtime env (GC off, NCCL graph mixing off, alloc tweaks), raise KV cache fractions by DP path, set stream_interval 100, use_low_precision_moe_combine, and batching_wait_iters 30 (MTP drops timeout_iters). max_num_tokens no longer includes the OSL term; MoE backend switches to MEGAMOE_DEEPGEMM at high concurrency on short ISL. MTP uses max_draft_len, variable draft length defaults, and enable_lm_head_tp_in_adp on DP-attn.

perf-changelog.yaml documents the above for both config keys.

Reviewed by Cursor Bugbot for commit a5b4fd4. Bugbot is set up for automated code reviews on this repo.
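The concurrency-dependent switches summarized above (MoE backend on 1k ISL, draft length on 8k ISL) might look roughly like the following Bash sketch. The PR text does not state the concurrency cut-offs, so the threshold is passed in as an explicit argument; the function names are hypothetical, and the sketch assumes an explicit MOE_BACKEND or TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS override always wins.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the threshold argument are hypothetical;
# the actual cut-off values are not stated in the PR text.

# MoE backend: default TRTLLM, MEGAMOE_DEEPGEMM at high concurrency on 1k ISL.
select_moe_backend() {
  local isl=$1 conc=$2 high_conc_threshold=$3
  if [ -n "${MOE_BACKEND:-}" ]; then
    echo "$MOE_BACKEND"            # assumed: explicit override wins
  elif [ "$isl" -eq 1024 ] && [ "$conc" -ge "$high_conc_threshold" ]; then
    echo "MEGAMOE_DEEPGEMM"
  else
    echo "TRTLLM"
  fi
}

# MTP draft length: default 3, stepping back to 2 at high concurrency on 8k ISL.
select_draft_len() {
  local isl=$1 conc=$2 high_conc_threshold=$3
  if [ -n "${TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS:-}" ]; then
    echo "$TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS"   # assumed: explicit override wins
  elif [ "$isl" -eq 8192 ] && [ "$conc" -ge "$high_conc_threshold" ]; then
    echo 2
  else
    echo 3
  fi
}
```

Keeping the threshold as a parameter avoids baking in a value the source does not give; the real recipes may compare against CONC differently.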