fix: Use persistent cache for incremental weight loading in DeepseekV2ForCausalLM #14486
base: main
Conversation
Summary of Changes
Hello @moehanabi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the DeepseekV2ForCausalLM model by implementing a persistent caching mechanism for incremental weight loading. The core change allows load_weights to be called multiple times with partial weight sets: q_a_proj and kv_a_proj_with_mqa weights are now cached across calls and fused once both halves are available.
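To make the incremental-loading behavior concrete, here is a caller-side sketch. The weight names follow the q_a_proj / kv_a_proj_with_mqa convention used by DeepseekV2ForCausalLM, but the layer index, shapes, constructor call, and the two-call split are hypothetical and only illustrate why the cache must outlive a single load_weights call.

```python
import torch

# Hypothetical usage sketch: model construction details omitted; assumes the
# fuse_qkv_a_proj path is enabled so q_a_proj and kv_a_proj_with_mqa must be
# concatenated into one fused parameter before they can be loaded.
model = DeepseekV2ForCausalLM(config=config)

# First call delivers only the q_a_proj shard. It cannot be fused yet, so it
# is parked in the persistent self._cached_a_proj instead of a local dict.
model.load_weights([
    ("model.layers.0.self_attn.q_a_proj.weight", torch.randn(1536, 5120)),
])

# A later call delivers the matching kv_a_proj_with_mqa shard. Because the
# cache survived the first call, the pair is found and fused; with the old
# local cache the first shard would have been dropped when that call returned.
model.load_weights([
    ("model.layers.0.self_attn.kv_a_proj_with_mqa.weight", torch.randn(576, 5120)),
])
```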
Pull request overview
This PR enhances DeepseekV2ForCausalLM to support incremental weight loading by introducing a persistent cache for q_a_proj and kv_a_proj_with_mqa weight fusion across multiple load_weights calls.
Key Changes
- Introduced a self._cached_a_proj instance variable to persist the weight cache across load_weights invocations
- Replaced the local cached_a_proj variable with the persistent instance cache to enable incremental weight loading (a simplified sketch of the fusion path follows below)
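For reviewers unfamiliar with the fusion path, the following is a simplified sketch of how load_weights can use the persistent cache. It is based only on the names visible in this PR (q_a_proj, kv_a_proj_with_mqa, self._cached_a_proj, a fused qkv_a_proj parameter); params_dict, default_weight_loader, and torch are assumed to be available from the surrounding function, and the exact control flow in the real model file may differ.

```python
# Simplified sketch of the fusion branch inside load_weights (not the actual
# model code): partial shards are parked in self._cached_a_proj until both
# halves of a q_a_proj / kv_a_proj_with_mqa pair are available.
for name, loaded_weight in weights:
    if self.fuse_qkv_a_proj and (
        "q_a_proj" in name or "kv_a_proj_with_mqa" in name
    ):
        self._cached_a_proj[name] = loaded_weight

        q_name = (
            name if "q_a_proj" in name
            else name.replace("kv_a_proj_with_mqa", "q_a_proj")
        )
        kv_name = (
            name if "kv_a_proj_with_mqa" in name
            else name.replace("q_a_proj", "kv_a_proj_with_mqa")
        )

        # Fuse only once both halves have been seen, possibly across
        # different load_weights calls.
        if q_name in self._cached_a_proj and kv_name in self._cached_a_proj:
            fused = torch.cat(
                [self._cached_a_proj[q_name], self._cached_a_proj[kv_name]],
                dim=0,
            )
            fused_name = q_name.replace("q_a_proj", "fused_qkv_a_proj_with_mqa")
            param = params_dict[fused_name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, fused)
            # Drop both halves so the cache only ever holds unpaired shards.
            self._cached_a_proj.pop(q_name)
            self._cached_a_proj.pop(kv_name)
```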
# Persistent cache for incremental weight loading (q_a_proj + kv_a_proj_with_mqa fusion)
# This allows load_weights to be called multiple times with partial weights
self._cached_a_proj = {} if self.fuse_qkv_a_proj else None
Copilot AI · Dec 5, 2025
The persistent cache self._cached_a_proj could cause a memory leak if load_weights is called multiple times with incomplete weight sets. When only one of q_a_proj or kv_a_proj_with_mqa is loaded in a call, that weight remains cached indefinitely. The cache only pops weights when both are present (lines 3830-3831).
Consider adding cache cleanup logic, either:
- Clear the cache at the end of load_weights (if partial weights should not persist across calls; a hypothetical helper along these lines is sketched below)
- Add a timeout or size limit to the cache
- Document that load_weights must be called with complete weight pairs for proper memory management
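One way to act on the first suggestion without breaking incremental loading could be an explicit, caller-invoked cleanup hook rather than an unconditional clear at the end of every call. The method name and logger below are hypothetical and only illustrate the idea.

```python
import logging

logger = logging.getLogger(__name__)

def clear_a_proj_cache(self) -> None:
    """Hypothetical helper: drop any unpaired cached a_proj shards.

    Meant to be called by the weight loader once it knows every shard has
    been delivered, so stale entries do not live for the model's lifetime
    while partial weights can still persist between load_weights calls.
    """
    if self._cached_a_proj:
        logger.warning(
            "Dropping %d unpaired a_proj weight(s): %s",
            len(self._cached_a_proj),
            sorted(self._cached_a_proj),
        )
        self._cached_a_proj.clear()
```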
Code Review
This pull request correctly addresses an issue with incremental weight loading in DeepseekV2ForCausalLM by introducing a persistent cache for weights that need to be fused. The change from a local dictionary within load_weights to an instance attribute _cached_a_proj ensures that weights from multiple calls to load_weights can be correctly accumulated and fused. The implementation is clean, minimal, and effectively resolves the bug. The changes look good to me.
Motivation
fix #14178
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist