
fix(mem_cache): alloc_req_slots matches upstream ReqToTokenPool.alloc signature #1

Closed
bhaktatejas922 wants to merge 180 commits into morph/main from morph/fix-alloc-reqs-upstream-signature

Conversation

@bhaktatejas922

Summary

  • Drops the stale `num_reqs` positional from `req_to_token_pool.alloc(num_reqs, reqs)` in `mem_cache/common.py`, matching upstream v0.5.10.post1's `alloc(self, reqs: list[Req])` (see the sketch after this list).
  • Without this, every EAGLE3 request hits `TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given` at `common.py:316` → `schedule_batch.prepare_for_extend` → scheduler SIGQUIT.
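For illustration, a minimal sketch of the mismatch and the one-line fix, using stand-in classes (the real `ReqToTokenPool` and `Req` live in sglang's `mem_cache`; only the call shape matters here):

```python
# Stand-in for upstream v0.5.10.post1's ReqToTokenPool: alloc takes the
# request list only, with no separate slot-count argument.
from typing import Any


class ReqToTokenPool:
    def alloc(self, reqs: list[Any]) -> list[int]:
        # One slot per Req; illustrative allocation.
        return list(range(len(reqs)))


pool = ReqToTokenPool()
reqs = ["req0", "req1"]

# Stale fork caller (pre-patch) raises:
#   TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given
# pool.alloc(len(reqs), reqs)

# Fixed caller, matching the upstream reqs-only signature:
indices = pool.alloc(reqs)
assert indices == [0, 1]
```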

Why this happened

During the morph-v0.1 rebase onto upstream v0.5.10.post1, the fork's paired patches for this API change landed asymmetrically: the caller-side change (commit 3dd41aeb) was kept, but the signature-side change (commit 60877f1d6, which added `reqs=None` to `ReqToTokenPool.alloc`) was correctly skipped because upstream had already moved to a reqs-mandatory signature. That left the caller passing two positional arguments into a method that now accepts only one.

Test plan

  • Engine cold-boots on `morphllm-sglang:morph-v0.1` + flashinfer nightly against MiniMax-M2.7 NVFP4 + MiniMax-M2.5-Eagle3 draft (`--speculative-algorithm EAGLE3 --speculative-draft-model-quantization unquant`).
  • First completion request served successfully (was SIGQUITing on first request pre-patch).
  • Profile: B=1 ~110 tok/s, B=8 ~705 tok/s agg, accept rate ~34%.

hnyls2002 and others added 30 commits April 4, 2026 02:38
…alistic perf and auto-discover ut (#22086)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…2067)

Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…… (#21213)

Co-authored-by: RoyWang <RoyWang@amd.com>
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
…ree (#22062)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: linjianyu77@foxmail.com
…21649)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
lujangus and others added 12 commits April 18, 2026 03:23
…inel

The -1 sentinel in full_to_swa_index_mapping caused illegal memory access
during CUDA graph capture/replay (negative index → OOB GPU pointer).

Fix: Track allocation state in a separate _allocated_mask boolean tensor.
The mapping itself always contains valid indices (0 for unallocated),
so CUDA graph capture sees valid memory references.

Result: EAGLE3 + CUDA graphs now works! 68.2 tok/s (1.40x speedup).
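A rough sketch of the masked-allocation scheme described here, assuming torch tensors (tensor names follow the commit message; the surrounding allocator logic is elided):

```python
# Sketch: keep allocation state in a boolean mask so the mapping tensor never
# holds a -1 sentinel that CUDA graph capture could turn into an OOB pointer.
import torch

size = 1024
device = "cpu"  # "cuda" in the real allocator

# The mapping always contains valid indices; 0 doubles as the placeholder
# for unallocated slots, so graph capture only ever sees valid memory.
full_to_swa_index_mapping = torch.zeros(size, dtype=torch.int64, device=device)
# Allocation state lives here instead of in a -1 sentinel.
_allocated_mask = torch.zeros(size, dtype=torch.bool, device=device)


def alloc(full_idx: int, swa_idx: int) -> None:
    full_to_swa_index_mapping[full_idx] = swa_idx
    _allocated_mask[full_idx] = True


def free(full_idx: int) -> None:
    # Reset to 0, a valid index, rather than -1.
    full_to_swa_index_mapping[full_idx] = 0
    _allocated_mask[full_idx] = False


alloc(5, 42)
assert bool(_allocated_mask[5])
free(5)
assert int(full_to_swa_index_mapping[5]) == 0  # never negative
```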
model_config.py reads server_args.hybrid_kvcache_ratio but it was
missing from the ServerArgs dataclass after the merge. Default None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
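A minimal sketch of the restored field, assuming a plain dataclass (the real `ServerArgs` defines many more options):

```python
# Sketch: re-add the field that model_config.py reads; illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgs:
    hybrid_kvcache_ratio: Optional[float] = None  # dropped during the merge


# Reads like server_args.hybrid_kvcache_ratio no longer raise AttributeError.
assert ServerArgs().hybrid_kvcache_ratio is None
```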
- get_rope_config() was missing after merge — needed by all model
  files that use RoPE (qwen3_next, glm4, deepseek, etc.)
- is_piecewise_cuda_graph_disabled_model was missing from ModelConfig,
  use getattr with default False to avoid AttributeError

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
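A sketch of the defensive read for the second item, with a stand-in `ModelConfig` (the attribute name comes from the commit message):

```python
# Sketch: getattr with a default tolerates configs that predate the field.
class ModelConfig:
    pass  # is_piecewise_cuda_graph_disabled_model may be absent post-merge


config = ModelConfig()
disabled = getattr(config, "is_piecewise_cuda_graph_disabled_model", False)
assert disabled is False  # no AttributeError, defaults to "not disabled"
```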
alloc(reqs) should be alloc(num_reqs, reqs) — the first arg is an int
(number of slots), second is the list of Req objects for mamba state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When called from SpecForge's training pipeline, layer can be None.
Fall back to kwargs['layer_id'] like the hybrid backend does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
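A minimal sketch of the fallback, assuming the backend receives the layer object plus a `layer_id` kwarg the way the hybrid backend does (names are illustrative):

```python
# Sketch: recover layer_id from kwargs when SpecForge passes layer=None.
class Layer:
    def __init__(self, layer_id: int):
        self.layer_id = layer_id


def resolve_layer_id(layer=None, **kwargs) -> int:
    # Fall back to kwargs['layer_id'], mirroring the hybrid backend.
    return layer.layer_id if layer is not None else kwargs["layer_id"]


assert resolve_layer_id(Layer(7)) == 7
assert resolve_layer_id(layer=None, layer_id=3) == 3
```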
…oe.py

The _LayerModeComputationContext dataclass requires is_next_layer_sparse
but Glm4MoeDecoderLayer wasn't providing it, causing TypeError during
model initialization for EAGLE3 training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
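A minimal repro of the `TypeError` and its fix, using a stand-in for the real dataclass:

```python
# Sketch: required dataclass fields must be supplied at construction time.
from dataclasses import dataclass


@dataclass
class _LayerModeComputationContext:
    layer_id: int
    is_next_layer_sparse: bool  # required; omitting it raises TypeError


# Before: _LayerModeComputationContext(layer_id=0)
#   -> TypeError: missing required argument 'is_next_layer_sparse'
# After: the decoder layer supplies the flag explicitly.
ctx = _LayerModeComputationContext(layer_id=0, is_next_layer_sparse=True)
```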
backup_state()/restore_state() only saved the two sub-allocator free
lists, not full_to_swa_index_mapping. After EAGLE3 speculation restored
state, the mapping was stale — causing SWA pool drift of ~1 slot per
request until CUDA OOM at ~25 requests.

Fix: clone the mapping in backup_state(), copy_() it back in
restore_state(). Uses copy_() to preserve the shared reference chain
with attention backends via self._kvcache.

Adds test_swa_backup_restore_eagle3: 20 backup/restore cycles with zero
pool drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
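A sketch of the clone/copy_ pairing, assuming torch tensors shared by reference with the attention backends (simplified to module-level state; the real code hangs off `self._kvcache`):

```python
# Sketch: snapshot with clone(), restore in place with copy_() so the shared
# reference that attention backends hold stays valid.
import torch

full_to_swa_index_mapping = torch.arange(8)
backend_view = full_to_swa_index_mapping  # shared reference, as via self._kvcache


def backup_state() -> torch.Tensor:
    return full_to_swa_index_mapping.clone()


def restore_state(saved: torch.Tensor) -> None:
    # copy_() writes in place; rebinding the name would orphan backend_view.
    full_to_swa_index_mapping.copy_(saved)


snapshot = backup_state()
full_to_swa_index_mapping[0] = 99  # speculative mutation during EAGLE3
restore_state(snapshot)
assert int(backend_view[0]) == 0  # backend sees the restored mapping; no drift
```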
SpecForge's sglang_backend/patch.py passes pynccl_use_current_stream
to init_model_parallel_group for PD-Multiplexing prefill group setup.
Add the parameter as accepted (no-op for now; single-GPU training
always passes duplicate_tp_group=False so this has no effect).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
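A sketch of the accepted-but-unused parameter (the real `init_model_parallel_group` takes more arguments; only the new keyword is drawn from the commit message):

```python
# Sketch: accept pynccl_use_current_stream so SpecForge's patch can pass it;
# deliberately a no-op for now.
def init_model_parallel_group(
    group_ranks,
    local_rank,
    duplicate_tp_group=False,
    pynccl_use_current_stream=False,  # accepted, currently unused
):
    # Single-GPU training always passes duplicate_tp_group=False, so the
    # new flag cannot change behavior on that path.
    return (group_ranks, local_rank, duplicate_tp_group)


init_model_parallel_group([[0]], 0, pynccl_use_current_stream=True)
```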
SpecForge's eagle3_target_model.py imports SWATokenToKVPoolAllocator and
uses it for isinstance checks to enable SWA-aware memory management.
Our fork only had TokenToKVPoolAllocator. Adding SWATokenToKVPoolAllocator
as a subclass — for non-SWA models (Qwen2.5, Llama) the check always
returns False so no behavior changes; SWA models (Gemma) would use it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
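A minimal sketch of the subclass relationship (bodies elided; only the `isinstance` behavior SpecForge relies on is shown):

```python
# Sketch: SWATokenToKVPoolAllocator as a plain subclass of the existing
# allocator, so SpecForge's isinstance checks resolve correctly.
class TokenToKVPoolAllocator:
    pass


class SWATokenToKVPoolAllocator(TokenToKVPoolAllocator):
    """SWA-aware variant; for non-SWA models it is simply never instantiated."""


allocator = TokenToKVPoolAllocator()
# Non-SWA models (Qwen2.5, Llama): check is False, behavior unchanged.
assert not isinstance(allocator, SWATokenToKVPoolAllocator)
```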
set_eagle3_layers_to_capture converts layer_ids=[0,14,27] to
layers_to_capture=[1,15,28] (+1 offset to capture each layer's output
as the pre-input of the next layer). For a 28-layer model, index 28 is
self.end_layer — it's never reached in range(start_layer, end_layer),
so the final aux hidden state was never captured (only 2/3 collected).

Fix: after the main loop, check if self.end_layer is in layers_to_capture
and append the pre-norm hidden state if so. This gives 3 aux hidden states
with correct hidden_size*3 concatenated dimension.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
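A minimal model of the off-by-one and the post-loop fix, using the 28-layer numbers from the commit message:

```python
# Sketch: layers_to_capture carries the +1 offset, so the final entry equals
# end_layer and is skipped by range(start_layer, end_layer).
start_layer, end_layer = 0, 28
layers_to_capture = {1, 15, 28}  # from layer_ids [0, 14, 27]

captured = []
for i in range(start_layer, end_layer):  # i never reaches 28
    if i in layers_to_capture:
        captured.append(f"hidden_after_layer_{i}")

# Post-loop fix: capture the pre-norm hidden state for end_layer.
if end_layer in layers_to_capture:
    captured.append("pre_norm_hidden_state")

assert len(captured) == 3  # hidden_size * 3 when concatenated
```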
Same bug as qwen2.py: set_eagle3_layers_to_capture converts layer_ids
[0, mid, N-1] to layers_to_capture [1, mid+1, N] (+1 offset). The loop
runs range(start, end_layer), so layer N = end_layer is never reached.

Applied post-loop capture to all affected models:
- llama.py (Llama-3.1-8B)
- glm4_moe.py (GLM-4.7-Flash)
- minimax_m2.py (MiniMax-M2.5)
- deepseek_v2.py
- gpt_oss.py
- apertus.py
- bailing_moe.py

qwen2.py (Qwen2.5-1.5B, Qwen3 via inheritance) already fixed in prior commit.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…only

Upstream sglang refactored ReqToTokenPool.alloc to take a single
mandatory `reqs: list[Req]` argument (not `(need_size, reqs=None)`).
The fork's earlier pair of patches adjusted both the signature and the
caller, but only the caller landed during the rebase onto v0.5.10.post1
— the signature change was dropped because upstream had already changed
it.

Result: every EAGLE3 request hits
  TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given
at mem_cache/common.py:316 → schedule_batch.prepare_for_extend → scheduler SIGQUIT.

Fix: drop the `num_reqs` positional and match upstream's reqs-only API.

Repros: any request with --speculative-algorithm EAGLE3 once the engine
finishes cuda graph capture. Confirmed fixed on morphllm-sglang:morph-v0.1
against MiniMax-M2.7 NVFP4 + MiniMax-M2.5-Eagle3 draft.
