fix: stream save_model to prevent OOM on large MoE models #18

Open

0xClandestine wants to merge 1 commit into Blaizzy:pc/add-deepseekv4flash-model from 0xClandestine:fix/ds4-quantize-oom

Conversation

@0xClandestine

Summary

  • When converting DeepSeek V4 Flash (256 experts × 43 layers) with -q, the process gets OOM-killed during save_model: the lazy computation graph from dequant → stack → quantize creates enormous BF16 intermediates that all materialize at once.
  • Refactors save_model to build and save shards incrementally: pop weights from the dict as each shard is constructed, explicitly mx.eval before writing, then free (see the sketch after this list). This bounds peak memory to roughly one shard (~5 GB) plus one evaluation intermediate (~4 GB), rather than the entire model's lazy graph.
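A minimal sketch of the streaming approach, assuming a flat weights dict that maps parameter names to lazily computed mx.array values; the function name save_model_streaming, the 5 GB budget constant, and the shard filename pattern are illustrative placeholders, not the PR's actual code:

```python
import mlx.core as mx

MAX_SHARD_BYTES = 5 * 2**30  # illustrative ~5 GB per-shard budget


def save_model_streaming(weights: dict, out_dir: str) -> None:
    """Save weights (name -> lazy mx.array) one shard at a time."""
    shard, shard_bytes, shard_idx = {}, 0, 0

    def flush():
        nonlocal shard, shard_bytes, shard_idx
        if not shard:
            return
        # Evaluate only this shard's slice of the lazy graph.
        mx.eval(*shard.values())
        mx.save_safetensors(f"{out_dir}/model-{shard_idx:05d}.safetensors", shard)
        # Drop our references so the evaluated buffers can be freed.
        shard, shard_bytes = {}, 0
        shard_idx += 1

    for name in list(weights):
        # Pop so the weights dict no longer pins the array in memory.
        w = weights.pop(name)
        shard[name] = w
        shard_bytes += w.nbytes  # shape/dtype are known without evaluating
        if shard_bytes >= MAX_SHARD_BYTES:
            flush()
    flush()  # write the final, possibly partial, shard
```

Popping each weight out of the dict before evaluating is the key design point: if the dict kept its references, the buffers could not be freed after a shard is written, and peak memory would grow with the number of shards.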

Test plan

  • All 6 test_utils.py tests pass
  • Run mlx_lm convert --hf-path deepseek-ai/DeepSeek-V4-Flash -q on a machine with sufficient disk space

fix: stream save_model to prevent OOM on large MoE models

When converting DeepSeek V4 Flash (256 experts × 43 layers) with -q,
the process gets OOM-killed during save. The lazy computation graph
from dequant → stack → quantize creates enormous BF16 intermediates
that all materialize at once when saving.

Build and save shards incrementally: pop weights from the dict as each
shard is constructed, explicitly mx.eval before writing, then free.
This bounds peak memory to ~one shard + one evaluation intermediate
instead of the entire model's lazy graph.
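To make the failure mode concrete, here is a toy sketch of that lazy pattern; quantized_experts, the shapes, the expert count, and the file name are hypothetical stand-ins, with sizes far smaller than the real model's:

```python
import mlx.core as mx

# Hypothetical stand-in: per-expert 4-bit weights as (w_q, scales, biases).
quantized_experts = [
    mx.quantize(mx.random.normal((1024, 1024)), group_size=64, bits=4)
    for _ in range(256)
]

# Each step below only records nodes in the lazy graph; nothing runs yet.
dequantized = [
    mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
    for (w_q, scales, biases) in quantized_experts
]
stacked = mx.stack(dequantized)
w_q, scales, biases = mx.quantize(stacked, group_size=64, bits=4)

# Saving is the first point that forces evaluation: the full-precision
# dequantized tensor for every expert materializes at the same time,
# which is what triggers the OOM at real model scale.
mx.save_safetensors("experts.safetensors",
                    {"w": w_q, "scales": scales, "biases": biases})
```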
@0xClandestine
Author

PR: ml-explore#1192
Issue: ml-explore#1192 (comment)
