
fix(mem_cache): alloc_req_slots matches upstream ReqToTokenPool.alloc signature #1

Closed
bhaktatejas922 wants to merge 180 commits into morph/main from morph/fix-alloc-reqs-upstream-signature

Conversation

@bhaktatejas922

Summary

  • Drops the stale `num_reqs` positional from `req_to_token_pool.alloc(num_reqs, reqs)` in `mem_cache/common.py`, matching upstream v0.5.10.post1's `alloc(self, reqs: list[Req])` (see the sketch after this list).
  • Without this, every EAGLE3 request hits `TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given` at `common.py:316` → `schedule_batch.prepare_for_extend` → scheduler SIGQUIT.
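For illustration, a minimal sketch of the mismatch and the one-line fix, using stand-in classes (the real `ReqToTokenPool` and `Req` live in sglang's `mem_cache`; only the call shape matters here):

```python
# Stand-in for upstream v0.5.10.post1's ReqToTokenPool: alloc takes the
# request list only, with no separate slot-count argument.
from typing import Any


class ReqToTokenPool:
    def alloc(self, reqs: list[Any]) -> list[int]:
        # One slot per Req; illustrative allocation.
        return list(range(len(reqs)))


pool = ReqToTokenPool()
reqs = ["req0", "req1"]

# Stale fork caller (pre-patch) raises:
#   TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given
# pool.alloc(len(reqs), reqs)

# Fixed caller, matching the upstream reqs-only signature:
indices = pool.alloc(reqs)
assert indices == [0, 1]
```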

Why this happened

During the morph-v0.1 rebase onto upstream v0.5.10.post1, the fork's paired patches for this API change landed asymmetrically: the caller-side change (commit 3dd41aeb) was kept, but the signature-side change (commit 60877f1d6, which added `reqs=None` to `ReqToTokenPool.alloc`) was correctly skipped because upstream had already moved to a reqs-mandatory signature. That left the caller passing two positional arguments into a method that now accepts only one.

Test plan

  • Engine cold-boots on `morphllm-sglang:morph-v0.1` + flashinfer nightly against MiniMax-M2.7 NVFP4 + MiniMax-M2.5-Eagle3 draft (`--speculative-algorithm EAGLE3 --speculative-draft-model-quantization unquant`).
  • First completion request served successfully (was SIGQUITing on first request pre-patch).
  • Profile: B=1 ~110 tok/s, B=8 ~705 tok/s agg, accept rate ~34%.

hnyls2002 and others added 30 commits April 4, 2026 02:38
…alistic perf and auto-discover ut (#22086)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…2067)

Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…… (#21213)

Co-authored-by: RoyWang <RoyWang@amd.com>
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
…ree (#22062)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: linjianyu77@foxmail.com
…21649)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
lujangus and others added 12 commits April 18, 2026 03:23
…inel

The -1 sentinel in full_to_swa_index_mapping caused illegal memory access
during CUDA graph capture/replay (negative index → OOB GPU pointer).

Fix: Track allocation state in a separate _allocated_mask boolean tensor.
The mapping itself always contains valid indices (0 for unallocated),
so CUDA graph capture sees valid memory references.

Result: EAGLE3 + CUDA graphs now works! 68.2 tok/s (1.40x speedup).
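A rough sketch of the masked-allocation scheme described here, assuming torch tensors (tensor names follow the commit message; the surrounding allocator logic is elided):

```python
# Sketch: keep allocation state in a boolean mask so the mapping tensor never
# holds a -1 sentinel that CUDA graph capture could turn into an OOB pointer.
import torch

size = 1024
device = "cpu"  # "cuda" in the real allocator

# The mapping always contains valid indices; 0 doubles as the placeholder
# for unallocated slots, so graph capture only ever sees valid memory.
full_to_swa_index_mapping = torch.zeros(size, dtype=torch.int64, device=device)
# Allocation state lives here instead of in a -1 sentinel.
_allocated_mask = torch.zeros(size, dtype=torch.bool, device=device)


def alloc(full_idx: int, swa_idx: int) -> None:
    full_to_swa_index_mapping[full_idx] = swa_idx
    _allocated_mask[full_idx] = True


def free(full_idx: int) -> None:
    # Reset to 0, a valid index, rather than -1.
    full_to_swa_index_mapping[full_idx] = 0
    _allocated_mask[full_idx] = False


alloc(5, 42)
assert bool(_allocated_mask[5])
free(5)
assert int(full_to_swa_index_mapping[5]) == 0  # never negative
```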
model_config.py reads server_args.hybrid_kvcache_ratio but it was
missing from the ServerArgs dataclass after the merge. Default None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
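A minimal sketch of the restored field, assuming a plain dataclass (the real `ServerArgs` defines many more options):

```python
# Sketch: re-add the field that model_config.py reads; illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgs:
    hybrid_kvcache_ratio: Optional[float] = None  # dropped during the merge


# Reads like server_args.hybrid_kvcache_ratio no longer raise AttributeError.
assert ServerArgs().hybrid_kvcache_ratio is None
```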
- get_rope_config() was missing after merge — needed by all model
  files that use RoPE (qwen3_next, glm4, deepseek, etc.)
- is_piecewise_cuda_graph_disabled_model was missing from ModelConfig,
  use getattr with default False to avoid AttributeError

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
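A sketch of the defensive read for the second item, with a stand-in `ModelConfig` (the attribute name comes from the commit message):

```python
# Sketch: getattr with a default tolerates configs that predate the field.
class ModelConfig:
    pass  # is_piecewise_cuda_graph_disabled_model may be absent post-merge


config = ModelConfig()
disabled = getattr(config, "is_piecewise_cuda_graph_disabled_model", False)
assert disabled is False  # no AttributeError, defaults to "not disabled"
```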
alloc(reqs) should be alloc(num_reqs, reqs) — the first arg is an int
(number of slots), second is the list of Req objects for mamba state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When called from SpecForge's training pipeline, layer can be None.
Fall back to kwargs['layer_id'] like the hybrid backend does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
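A minimal sketch of the fallback, assuming the backend receives the layer object plus a `layer_id` kwarg the way the hybrid backend does (names are illustrative):

```python
# Sketch: recover layer_id from kwargs when SpecForge passes layer=None.
class Layer:
    def __init__(self, layer_id: int):
        self.layer_id = layer_id


def resolve_layer_id(layer=None, **kwargs) -> int:
    # Fall back to kwargs['layer_id'], mirroring the hybrid backend.
    return layer.layer_id if layer is not None else kwargs["layer_id"]


assert resolve_layer_id(Layer(7)) == 7
assert resolve_layer_id(layer=None, layer_id=3) == 3
```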
…oe.py

The _LayerModeComputationContext dataclass requires is_next_layer_sparse
but Glm4MoeDecoderLayer wasn't providing it, causing TypeError during
model initialization for EAGLE3 training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
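A minimal repro of the `TypeError` and its fix, using a stand-in for the real dataclass:

```python
# Sketch: required dataclass fields must be supplied at construction time.
from dataclasses import dataclass


@dataclass
class _LayerModeComputationContext:
    layer_id: int
    is_next_layer_sparse: bool  # required; omitting it raises TypeError


# Before: _LayerModeComputationContext(layer_id=0)
#   -> TypeError: missing required argument 'is_next_layer_sparse'
# After: the decoder layer supplies the flag explicitly.
ctx = _LayerModeComputationContext(layer_id=0, is_next_layer_sparse=True)
```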
backup_state()/restore_state() only saved the two sub-allocator free
lists, not full_to_swa_index_mapping. After EAGLE3 speculation restored
state, the mapping was stale — causing SWA pool drift of ~1 slot per
request until CUDA OOM at ~25 requests.

Fix: clone the mapping in backup_state(), copy_() it back in
restore_state(). Uses copy_() to preserve the shared reference chain
with attention backends via self._kvcache.

Adds test_swa_backup_restore_eagle3: 20 backup/restore cycles with zero
pool drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
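A sketch of the clone/copy_ pairing, assuming torch tensors shared by reference with the attention backends (simplified to module-level state; the real code hangs off `self._kvcache`):

```python
# Sketch: snapshot with clone(), restore in place with copy_() so the shared
# reference that attention backends hold stays valid.
import torch

full_to_swa_index_mapping = torch.arange(8)
backend_view = full_to_swa_index_mapping  # shared reference, as via self._kvcache


def backup_state() -> torch.Tensor:
    return full_to_swa_index_mapping.clone()


def restore_state(saved: torch.Tensor) -> None:
    # copy_() writes in place; rebinding the name would orphan backend_view.
    full_to_swa_index_mapping.copy_(saved)


snapshot = backup_state()
full_to_swa_index_mapping[0] = 99  # speculative mutation during EAGLE3
restore_state(snapshot)
assert int(backend_view[0]) == 0  # backend sees the restored mapping; no drift
```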
SpecForge's sglang_backend/patch.py passes pynccl_use_current_stream
to init_model_parallel_group for PD-Multiplexing prefill group setup.
Add the parameter as accepted (no-op for now; single-GPU training
always passes duplicate_tp_group=False so this has no effect).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
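A sketch of the accepted-but-unused parameter (the real `init_model_parallel_group` takes more arguments; only the new keyword is drawn from the commit message):

```python
# Sketch: accept pynccl_use_current_stream so SpecForge's patch can pass it;
# deliberately a no-op for now.
def init_model_parallel_group(
    group_ranks,
    local_rank,
    duplicate_tp_group=False,
    pynccl_use_current_stream=False,  # accepted, currently unused
):
    # Single-GPU training always passes duplicate_tp_group=False, so the
    # new flag cannot change behavior on that path.
    return (group_ranks, local_rank, duplicate_tp_group)


init_model_parallel_group([[0]], 0, pynccl_use_current_stream=True)
```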
SpecForge's eagle3_target_model.py imports SWATokenToKVPoolAllocator and
uses it for isinstance checks to enable SWA-aware memory management.
Our fork only had TokenToKVPoolAllocator. Adding SWATokenToKVPoolAllocator
as a subclass — for non-SWA models (Qwen2.5, Llama) the check always
returns False so no behavior changes; SWA models (Gemma) would use it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
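A minimal sketch of the subclass relationship (bodies elided; only the `isinstance` behavior SpecForge relies on is shown):

```python
# Sketch: SWATokenToKVPoolAllocator as a plain subclass of the existing
# allocator, so SpecForge's isinstance checks resolve correctly.
class TokenToKVPoolAllocator:
    pass


class SWATokenToKVPoolAllocator(TokenToKVPoolAllocator):
    """SWA-aware variant; for non-SWA models it is simply never instantiated."""


allocator = TokenToKVPoolAllocator()
# Non-SWA models (Qwen2.5, Llama): check is False, behavior unchanged.
assert not isinstance(allocator, SWATokenToKVPoolAllocator)
```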
set_eagle3_layers_to_capture converts layer_ids=[0,14,27] to
layers_to_capture=[1,15,28] (+1 offset to capture each layer's output
as the pre-input of the next layer). For a 28-layer model, index 28 is
self.end_layer — it's never reached in range(start_layer, end_layer),
so the final aux hidden state was never captured (only 2/3 collected).

Fix: after the main loop, check if self.end_layer is in layers_to_capture
and append the pre-norm hidden state if so. This gives 3 aux hidden states
with correct hidden_size*3 concatenated dimension.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
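A minimal model of the off-by-one and the post-loop fix, using the 28-layer numbers from the commit message:

```python
# Sketch: layers_to_capture carries the +1 offset, so the final entry equals
# end_layer and is skipped by range(start_layer, end_layer).
start_layer, end_layer = 0, 28
layers_to_capture = {1, 15, 28}  # from layer_ids [0, 14, 27]

captured = []
for i in range(start_layer, end_layer):  # i never reaches 28
    if i in layers_to_capture:
        captured.append(f"hidden_after_layer_{i}")

# Post-loop fix: capture the pre-norm hidden state for end_layer.
if end_layer in layers_to_capture:
    captured.append("pre_norm_hidden_state")

assert len(captured) == 3  # hidden_size * 3 when concatenated
```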
Same bug as qwen2.py: set_eagle3_layers_to_capture converts layer_ids
[0, mid, N-1] to layers_to_capture [1, mid+1, N] (+1 offset). The loop
runs range(start, end_layer), so layer N = end_layer is never reached.

Applied post-loop capture to all affected models:
- llama.py (Llama-3.1-8B)
- glm4_moe.py (GLM-4.7-Flash)
- minimax_m2.py (MiniMax-M2.5)
- deepseek_v2.py
- gpt_oss.py
- apertus.py
- bailing_moe.py

qwen2.py (Qwen2.5-1.5B, Qwen3 via inheritance) already fixed in prior commit.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…only

Upstream sglang refactored ReqToTokenPool.alloc to take a single
mandatory `reqs: list[Req]` argument (not `(need_size, reqs=None)`).
The fork's earlier pair of patches adjusted both the signature and the
caller, but only the caller landed during the rebase onto v0.5.10.post1
— the signature change was dropped because upstream had already changed
it.

Result: every EAGLE3 request hits
  TypeError: ReqToTokenPool.alloc() takes 2 positional arguments but 3 were given
at mem_cache/common.py:316 → schedule_batch.prepare_for_extend → scheduler SIGQUIT.

Fix: drop the `num_reqs` positional and match upstream's reqs-only API.

Repros: any request with --speculative-algorithm EAGLE3 once the engine
finishes cuda graph capture. Confirmed fixed on morphllm-sglang:morph-v0.1
against MiniMax-M2.7 NVFP4 + MiniMax-M2.5-Eagle3 draft.
