fix(mem_cache): alloc_req_slots matches upstream ReqToTokenPool.alloc signature #1
Closed
bhaktatejas922 wants to merge 180 commits into
Conversation
…ch and canonical dataset format (#21736)
…alistic perf and auto-discover ut (#22086) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…2067) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…… (#21213) Co-authored-by: RoyWang <RoyWang@amd.com>
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
…ree (#22062) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: linjianyu77@foxmail.com
…21649) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…inel The -1 sentinel in full_to_swa_index_mapping caused illegal memory access during CUDA graph capture/replay (negative index → OOB GPU pointer). Fix: Track allocation state in a separate _allocated_mask boolean tensor. The mapping itself always contains valid indices (0 for unallocated), so CUDA graph capture sees valid memory references. Result: EAGLE3 + CUDA graphs now works! 68.2 tok/s (1.40x speedup).
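A minimal sketch of the pattern this commit describes, using a hypothetical simplified class (only `full_to_swa_index_mapping` and `_allocated_mask` come from the commit text; the real allocator carries much more state):

```python
import torch


class SwaIndexMappingSketch:
    """Hypothetical, pared-down allocator state: allocation is tracked in a
    boolean mask so the index mapping never holds a -1 sentinel that CUDA
    graph capture/replay could dereference as an out-of-bounds pointer."""

    def __init__(self, num_slots: int, device: str = "cpu"):
        # Mapping always holds valid indices; 0 doubles as the "unallocated" placeholder.
        self.full_to_swa_index_mapping = torch.zeros(num_slots, dtype=torch.int64, device=device)
        # Allocation state lives here instead of in a -1 sentinel.
        self._allocated_mask = torch.zeros(num_slots, dtype=torch.bool, device=device)

    def allocate(self, full_idx: int, swa_idx: int) -> None:
        self.full_to_swa_index_mapping[full_idx] = swa_idx
        self._allocated_mask[full_idx] = True

    def free(self, full_idx: int) -> None:
        self.full_to_swa_index_mapping[full_idx] = 0  # valid placeholder, not -1
        self._allocated_mask[full_idx] = False
```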
model_config.py reads server_args.hybrid_kvcache_ratio but it was missing from the ServerArgs dataclass after the merge. Default None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
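For reference, a hedged sketch of the restored field (the surrounding dataclass is heavily simplified; only the field name and its None default come from the commit):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgs:  # simplified stand-in for sglang's ServerArgs
    # Restored after the merge; model_config.py reads server_args.hybrid_kvcache_ratio.
    hybrid_kvcache_ratio: Optional[float] = None
```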
- get_rope_config() was missing after merge — needed by all model files that use RoPE (qwen3_next, glm4, deepseek, etc.)
- is_piecewise_cuda_graph_disabled_model was missing from ModelConfig, use getattr with default False to avoid AttributeError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
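A small illustration of the getattr fallback from the second bullet above (the config object is a bare stand-in, not the real ModelConfig):

```python
class ModelConfigStandIn:
    """Stand-in config that predates the new attribute."""


model_config = ModelConfigStandIn()

# Older configs don't define the attribute; default to False instead of
# raising AttributeError at model load time.
is_disabled = getattr(model_config, "is_piecewise_cuda_graph_disabled_model", False)
assert is_disabled is False
```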
alloc(reqs) should be alloc(num_reqs, reqs) — the first arg is an int (number of slots), second is the list of Req objects for mamba state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When called from SpecForge's training pipeline, layer can be None. Fall back to kwargs['layer_id'] like the hybrid backend does. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
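A minimal sketch of the fallback, with illustrative names (only the layer=None case and the kwargs['layer_id'] fallback come from the commit; the function itself is hypothetical):

```python
def resolve_layer_id(layer=None, **kwargs) -> int:
    """Prefer the layer object's id; fall back to an explicit layer_id kwarg
    when the caller (e.g. a training pipeline) passes layer=None."""
    if layer is not None:
        return layer.layer_id
    return kwargs["layer_id"]


# SpecForge-style call with no layer object, only a layer_id kwarg.
assert resolve_layer_id(layer=None, layer_id=3) == 3
```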
…oe.py The _LayerModeComputationContext dataclass requires is_next_layer_sparse but Glm4MoeDecoderLayer wasn't providing it, causing TypeError during model initialization for EAGLE3 training. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
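The failure mode in miniature, using a simplified stand-in dataclass (fields other than is_next_layer_sparse are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class _LayerModeComputationContextSketch:
    layer_id: int
    is_next_layer_sparse: bool  # required field the decoder layer wasn't passing


# Omitting is_next_layer_sparse raises:
#   TypeError: __init__() missing 1 required positional argument: 'is_next_layer_sparse'
ctx = _LayerModeComputationContextSketch(layer_id=0, is_next_layer_sparse=True)
```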
backup_state()/restore_state() only saved the two sub-allocator free lists, not full_to_swa_index_mapping. After EAGLE3 speculation restored state, the mapping was stale — causing SWA pool drift of ~1 slot per request until CUDA OOM at ~25 requests. Fix: clone the mapping in backup_state(), copy_() it back in restore_state(). Uses copy_() to preserve the shared reference chain with attention backends via self._kvcache. Adds test_swa_backup_restore_eagle3: 20 backup/restore cycles with zero pool drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
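A minimal sketch of the backup/restore shape described above (a simplified stand-in; the points that come from the commit are cloning the mapping on backup and restoring it with copy_()):

```python
import torch


class SwaAllocatorBackupSketch:
    """Hypothetical, pared-down allocator showing the fix: include
    full_to_swa_index_mapping in backup_state() and restore it in place
    so the tensor shared with attention backends keeps the same storage."""

    def __init__(self, num_slots: int):
        self.free_slots = list(range(num_slots))
        self.full_to_swa_index_mapping = torch.zeros(num_slots, dtype=torch.int64)

    def backup_state(self):
        # Previously only the free lists were saved; the mapping went stale
        # after speculation rollback, leaking roughly one SWA slot per request.
        return list(self.free_slots), self.full_to_swa_index_mapping.clone()

    def restore_state(self, state):
        free_slots, mapping = state
        self.free_slots = list(free_slots)
        # In-place copy preserves the shared reference chain (self._kvcache in
        # the real allocator); rebinding the attribute would break it.
        self.full_to_swa_index_mapping.copy_(mapping)
```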
SpecForge's sglang_backend/patch.py passes pynccl_use_current_stream to init_model_parallel_group for PD-Multiplexing prefill group setup. Add the parameter as accepted (no-op for now; single-GPU training always passes duplicate_tp_group=False so this has no effect). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
SpecForge's eagle3_target_model.py imports SWATokenToKVPoolAllocator and uses it for isinstance checks to enable SWA-aware memory management. Our fork only had TokenToKVPoolAllocator. Adding SWATokenToKVPoolAllocator as a subclass — for non-SWA models (Qwen2.5, Llama) the check always returns False so no behavior changes; SWA models (Gemma) would use it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
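A sketch of the isinstance gating this enables (both classes are reduced to stubs here):

```python
class TokenToKVPoolAllocator:
    """Stub for the fork's existing allocator."""


class SWATokenToKVPoolAllocator(TokenToKVPoolAllocator):
    """Stub subclass added so SpecForge's isinstance check can detect SWA."""


def uses_swa_memory_management(allocator: TokenToKVPoolAllocator) -> bool:
    # Non-SWA models (Qwen2.5, Llama) get the base class, so this stays False
    # and nothing changes; SWA models (Gemma) would construct the subclass.
    return isinstance(allocator, SWATokenToKVPoolAllocator)


assert uses_swa_memory_management(TokenToKVPoolAllocator()) is False
assert uses_swa_memory_management(SWATokenToKVPoolAllocator()) is True
```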
set_eagle3_layers_to_capture converts layer_ids=[0,14,27] to layers_to_capture=[1,15,28] (+1 offset to capture each layer's output as the pre-input of the next layer). For a 28-layer model, index 28 is self.end_layer — it's never reached in range(start_layer, end_layer), so the final aux hidden state was never captured (only 2/3 collected). Fix: after the main loop, check if self.end_layer is in layers_to_capture and append the pre-norm hidden state if so. This gives 3 aux hidden states with correct hidden_size*3 concatenated dimension. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
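A self-contained sketch of the off-by-one and the post-loop fix (the function and loop are illustrative; the +1 offset and end_layer behavior mirror the commit text):

```python
def captured_layers(layer_ids: list[int], num_layers: int) -> list[int]:
    """Each captured index i means 'the hidden state fed into layer i',
    i.e. layer i-1's output, hence the +1 offset."""
    layers_to_capture = [i + 1 for i in layer_ids]
    captured = []
    for layer_idx in range(0, num_layers):  # end_layer == num_layers is never reached
        if layer_idx in layers_to_capture:
            captured.append(layer_idx)
    # Post-loop fix: the last requested index equals end_layer, so capture the
    # pre-norm hidden state once the loop has finished.
    if num_layers in layers_to_capture:
        captured.append(num_layers)
    return captured


# 28-layer model: without the post-loop step only [1, 15] would be collected (2/3).
assert captured_layers([0, 14, 27], 28) == [1, 15, 28]
```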
Same bug as qwen2.py: set_eagle3_layers_to_capture converts layer_ids [0, mid, N-1] to layers_to_capture [1, mid+1, N] (+1 offset). The loop runs range(start, end_layer), so layer N = end_layer is never reached. Applied post-loop capture to all affected models:
- llama.py (Llama-3.1-8B)
- glm4_moe.py (GLM-4.7-Flash)
- minimax_m2.py (MiniMax-M2.5)
- deepseek_v2.py
- gpt_oss.py
- apertus.py
- bailing_moe.py
qwen2.py (Qwen2.5-1.5B, Qwen3 via inheritance) already fixed in prior commit. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…only Upstream sglang refactored ReqToTokenPool.alloc to take a single mandatory `reqs: list[Req]` argument (not `(need_size, reqs=None)`). The fork's earlier pair of patches adjusted both the signature and the caller, but only the caller landed during the rebase onto v0.5.10.post1 — the signature change was dropped because upstream had already changed it. Result: every EAGLE3 request hits TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given at mem_cache/common.py:316 → schedule_batch.prepare_for_extend → scheduler. Fix: drop the `num_reqs` positional and match upstream's reqs-only API. Repros: any request with --speculative-algorithm EAGLE3 once the engine finishes cuda graph capture. Confirmed fixed on morphllm-sglang:morph-v0.1 against MiniMax-M2.7 NVFP4 + MiniMax-M2.5-Eagle3 draft.
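A before/after sketch of the call shape (Req and the pool are stand-ins; the reqs-only signature follows the commit's description of upstream, not a verified copy of it):

```python
class Req:
    """Stand-in for sglang's Req."""


class ReqToTokenPool:
    """Stand-in pool with the upstream-style, reqs-only alloc signature."""

    def __init__(self, num_slots: int):
        self.free_slots = list(range(num_slots))

    def alloc(self, reqs: list[Req]) -> list[int]:
        need_size = len(reqs)
        out, self.free_slots = self.free_slots[:need_size], self.free_slots[need_size:]
        return out


pool = ReqToTokenPool(8)
reqs = [Req(), Req()]

# Before (fork-era caller in alloc_req_slots):
#     pool.alloc(len(reqs), reqs)
#     -> TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given
# After (this PR): pass the Req list only.
req_pool_indices = pool.alloc(reqs)
assert len(req_pool_indices) == len(reqs)
```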
Summary
Why this happened
During the morph-v0.1 rebase onto upstream v0.5.10.post1, the fork's paired patches for this API change landed asymmetrically: the caller-side change (commit 3dd41aeb) was kept, but the signature-side change (commit 60877f1d6 adding `reqs=None` to `ReqToTokenPool.alloc`) was correctly skipped because upstream had already moved to a reqs-mandatory signature. That left the caller out of sync.
Test plan