Commit d137832
feat(speculative): add typical-acceptance verify mode for Eagle3 draft
Squeeze pipeline Track B B1: drop-in flag-gated alternative to strict
rejection-sampling verification. Adds two server-args flags to ServerArgs
in sglang/srt/server_args.py:
--speculative-verify-mode {rejection_sampling, typical_acceptance}
--speculative-typical-acceptance-alpha FLOAT (default 0.8)
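A minimal sketch of how the two flags could be declared and validated, using a stand-alone argparse parser rather than the real ServerArgs plumbing. `build_parser` and `parse_verify_args` are hypothetical names, and the range check is an assumption based on the 0 < alpha <= 1 condition stated in this message, not something the commit is known to enforce.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser; the real flags are fields on ServerArgs.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--speculative-verify-mode",
        choices=["rejection_sampling", "typical_acceptance"],
        default="rejection_sampling",
    )
    parser.add_argument(
        "--speculative-typical-acceptance-alpha",
        type=float,
        default=0.8,
    )
    return parser


def parse_verify_args(argv):
    args = build_parser().parse_args(argv)
    alpha = args.speculative_typical_acceptance_alpha
    # Assumed validation: alpha must lie in (0, 1].
    if not 0.0 < alpha <= 1.0:
        raise ValueError(f"alpha must be in (0, 1], got {alpha}")
    return args
```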
When `speculative_verify_mode == "typical_acceptance"`, the Eagle3
verification path in `sglang/srt/speculative/eagle_info.py` overrides
both `threshold_single` and `threshold_acc` with the alpha value before
calling the existing `tree_speculative_sampling_target_only` kernel.
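The override described above can be sketched as a small pure function. `VerifyConfig` and `resolve_thresholds` are hypothetical names for illustration; the field names mirror the CLI flags, and the 1.0 defaults for the existing thresholds are an assumption, not copied from eagle_info.py.

```python
from dataclasses import dataclass


@dataclass
class VerifyConfig:
    # Hypothetical container mirroring the relevant server-args flags.
    speculative_verify_mode: str = "rejection_sampling"
    speculative_typical_acceptance_alpha: float = 0.8
    # Assumed defaults for the pre-existing strict thresholds.
    speculative_accept_threshold_single: float = 1.0
    speculative_accept_threshold_acc: float = 1.0


def resolve_thresholds(cfg: VerifyConfig) -> tuple[float, float]:
    """Return (threshold_single, threshold_acc) to pass to the kernel."""
    if cfg.speculative_verify_mode == "typical_acceptance":
        # Typical acceptance: both thresholds collapse to alpha.
        alpha = cfg.speculative_typical_acceptance_alpha
        return alpha, alpha
    # Default mode: the existing per-threshold flags pass through unchanged.
    return (
        cfg.speculative_accept_threshold_single,
        cfg.speculative_accept_threshold_acc,
    )
```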
The kernel acceptance condition
    if (coin <= prob_acc / threshold_acc || target_prob_single >= threshold_single) {
        // accept token
    }
(in `sgl-kernel/csrc/speculative/speculative_sampling.cuh:80`) is the
Medusa typical-acceptance formula when threshold_single == threshold_acc
== alpha and 0 < alpha <= 1. So the kernel math is already correct;
this commit just exposes the alpha knob.
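For intuition, the acceptance condition can be mirrored in plain Python. This is a scalar re-statement of the predicate quoted above, not the kernel itself; `accepts` and `accepts_typical` are hypothetical helper names.

```python
def accepts(
    coin: float,
    prob_acc: float,
    target_prob_single: float,
    threshold_single: float,
    threshold_acc: float,
) -> bool:
    # Same boolean expression as the quoted kernel condition.
    return (
        coin <= prob_acc / threshold_acc
        or target_prob_single >= threshold_single
    )


def accepts_typical(
    coin: float, prob_acc: float, target_prob_single: float, alpha: float
) -> bool:
    # Typical-acceptance mode sets both thresholds to alpha.
    return accepts(coin, prob_acc, target_prob_single, alpha, alpha)
```

With alpha = 1.0 the first clause reduces to `coin <= prob_acc`, which is the strict case. Any alpha below 1.0 widens both clauses, so a token accepted at alpha = 1.0 is also accepted at a smaller alpha for the same inputs.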
Defaults preserve existing behavior: rejection_sampling is the default
mode and the existing `--speculative-accept-threshold-{single,acc}` flags
continue to work unchanged. alpha=1.0 in typical_acceptance mode also
reproduces strict rejection sampling.
Scope intentionally narrow per the squeeze B1 preflight at
`experiments/MiniMax-M2.5/squeeze/relaxed/B1-typical-acceptance/preflight.md`:
- Eagle3 path only (eagle_info.py). ngram_info.py and dflash_utils.py also
call tree_speculative_sampling_target_only but are not in the squeeze
experiment scope; they continue to use the strict thresholds.
- Global server-args flag, not per-request. Avoids mixed-mode KV-cache
state. Per-request override deferred to a future revision if needed.
To use:
    python -m sglang.launch_server \
        --model-path <target> \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path thoughtworks/<Model>-Eagle3 \
        --speculative-verify-mode typical_acceptance \
        --speculative-typical-acceptance-alpha 0.8 \
        ...
Squeeze B1 alpha-sweep protocol: alpha in {0.7, 0.8, 0.9}, with alpha=1.0
as a control reproducing rejection-sampling baseline. Per-dataset quality
must stay within 3% of the lossless Exp F baseline at every concurrency
point per the squeeze plan §187 quality floor.
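A minimal sketch of the quality-floor check for the sweep. Everything here is hypothetical scaffolding: the function names, the dictionary layout, and the reading of "within 3%" as a relative drop on a higher-is-better score are assumptions, not taken from the squeeze plan.

```python
SWEEP_ALPHAS = (0.7, 0.8, 0.9)
CONTROL_ALPHA = 1.0  # expected to reproduce the rejection-sampling baseline
QUALITY_FLOOR = 0.03  # assumed to mean a relative drop of at most 3%


def within_floor(score: float, baseline: float, floor: float = QUALITY_FLOOR) -> bool:
    # Assumes higher scores are better.
    return score >= baseline * (1.0 - floor)


def failing_points(results: dict, baseline: dict) -> list:
    """List the sweep points that fall below the quality floor.

    results:  {(alpha, dataset, concurrency): score}
    baseline: {(dataset, concurrency): score} from the lossless run
    """
    return sorted(
        key
        for key, score in results.items()
        if not within_floor(score, baseline[(key[1], key[2])])
    )
```

An alpha passes only if `failing_points` returns no entry for it across every dataset and concurrency point.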
Branch sits on top of `fix/llama-eagle3-fp8-aux-dtype-cast` (commit
71e0bf0) so it can run end-to-end on FP8 targets like
MiniMaxAI/MiniMax-M2.5 immediately.
1 parent ea6c448 commit d137832
2 files changed
Lines changed: 49 additions & 2 deletions