fix: Use persistent cache for incremental weight loading in DeepseekV2ForCausalLM #14486
base: main
Conversation
Summary of Changes
Hello @moehanabi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the DeepseekV2ForCausalLM model by implementing a persistent caching mechanism for incremental weight loading. The core change allows load_weights to be called multiple times with partial weight sets: q_a_proj and kv_a_proj_with_mqa weights are now cached across calls and fused once both halves are available.
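To make the incremental-loading behavior concrete, here is a caller-side sketch. The weight names follow the q_a_proj / kv_a_proj_with_mqa convention used by DeepseekV2ForCausalLM, but the layer index, shapes, constructor call, and the two-call split are hypothetical and only illustrate why the cache must outlive a single load_weights call.

```python
import torch

# Hypothetical usage sketch: model construction details omitted; assumes the
# fuse_qkv_a_proj path is enabled so q_a_proj and kv_a_proj_with_mqa must be
# concatenated into one fused parameter before they can be loaded.
model = DeepseekV2ForCausalLM(config=config)

# First call delivers only the q_a_proj shard. It cannot be fused yet, so it
# is parked in the persistent self._cached_a_proj instead of a local dict.
model.load_weights([
    ("model.layers.0.self_attn.q_a_proj.weight", torch.randn(1536, 5120)),
])

# A later call delivers the matching kv_a_proj_with_mqa shard. Because the
# cache survived the first call, the pair is found and fused; with the old
# local cache the first shard would have been dropped when that call returned.
model.load_weights([
    ("model.layers.0.self_attn.kv_a_proj_with_mqa.weight", torch.randn(576, 5120)),
])
```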
Pull request overview
This PR enhances DeepseekV2ForCausalLM to support incremental weight loading by introducing a persistent cache for q_a_proj and kv_a_proj_with_mqa weight fusion across multiple load_weights calls.
Key Changes
- Introduced a self._cached_a_proj instance variable to persist the weight cache across load_weights invocations
- Replaced the local cached_a_proj variable with the persistent instance cache to enable incremental weight loading (a simplified sketch of the fusion path follows below)
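For reviewers unfamiliar with the fusion path, the following is a simplified sketch of how load_weights can use the persistent cache. It is based only on the names visible in this PR (q_a_proj, kv_a_proj_with_mqa, self._cached_a_proj, a fused qkv_a_proj parameter); params_dict, default_weight_loader, and torch are assumed to be available from the surrounding function, and the exact control flow in the real model file may differ.

```python
# Simplified sketch of the fusion branch inside load_weights (not the actual
# model code): partial shards are parked in self._cached_a_proj until both
# halves of a q_a_proj / kv_a_proj_with_mqa pair are available.
for name, loaded_weight in weights:
    if self.fuse_qkv_a_proj and (
        "q_a_proj" in name or "kv_a_proj_with_mqa" in name
    ):
        self._cached_a_proj[name] = loaded_weight

        q_name = (
            name if "q_a_proj" in name
            else name.replace("kv_a_proj_with_mqa", "q_a_proj")
        )
        kv_name = (
            name if "kv_a_proj_with_mqa" in name
            else name.replace("q_a_proj", "kv_a_proj_with_mqa")
        )

        # Fuse only once both halves have been seen, possibly across
        # different load_weights calls.
        if q_name in self._cached_a_proj and kv_name in self._cached_a_proj:
            fused = torch.cat(
                [self._cached_a_proj[q_name], self._cached_a_proj[kv_name]],
                dim=0,
            )
            fused_name = q_name.replace("q_a_proj", "fused_qkv_a_proj_with_mqa")
            param = params_dict[fused_name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, fused)
            # Drop both halves so the cache only ever holds unpaired shards.
            self._cached_a_proj.pop(q_name)
            self._cached_a_proj.pop(kv_name)
```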
# Persistent cache for incremental weight loading (q_a_proj + kv_a_proj_with_mqa fusion)
# This allows load_weights to be called multiple times with partial weights
self._cached_a_proj = {} if self.fuse_qkv_a_proj else None
Copilot AI · Dec 5, 2025
The persistent cache self._cached_a_proj could cause a memory leak if load_weights is called multiple times with incomplete weight sets. When only one of q_a_proj or kv_a_proj_with_mqa is loaded in a call, that weight remains cached indefinitely. The cache only pops weights when both are present (lines 3830-3831).
Consider adding cache cleanup logic, either:
- Clear the cache at the end of load_weights (if partial weights should not persist across calls; a hypothetical helper along these lines is sketched below)
- Add a timeout or size limit to the cache
- Document that load_weights must be called with complete weight pairs for proper memory management
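One way to act on the first suggestion without breaking incremental loading could be an explicit, caller-invoked cleanup hook rather than an unconditional clear at the end of every call. The method name and logger below are hypothetical and only illustrate the idea.

```python
import logging

logger = logging.getLogger(__name__)

def clear_a_proj_cache(self) -> None:
    """Hypothetical helper: drop any unpaired cached a_proj shards.

    Meant to be called by the weight loader once it knows every shard has
    been delivered, so stale entries do not live for the model's lifetime
    while partial weights can still persist between load_weights calls.
    """
    if self._cached_a_proj:
        logger.warning(
            "Dropping %d unpaired a_proj weight(s): %s",
            len(self._cached_a_proj),
            sorted(self._cached_a_proj),
        )
        self._cached_a_proj.clear()
```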
Code Review
This pull request correctly addresses an issue with incremental weight loading in DeepseekV2ForCausalLM by introducing a persistent cache for weights that need to be fused. The change from a local dictionary within load_weights to an instance attribute _cached_a_proj ensures that weights from multiple calls to load_weights can be correctly accumulated and fused. The implementation is clean, minimal, and effectively resolves the bug. The changes look good to me.
Motivation
fix #14178
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist