forked from ml-explore/mlx-lm
perf: optimize DeepSeek-V4 #13
Merged: Blaizzy merged 43 commits into Blaizzy:pc/add-deepseekv4flash-model from 0xClandestine:perf/optimize-ds4 on Apr 26, 2026 (+428 −174).
Commits (43)
bb6978a  Optimize HC sinkhorn Metal kernel with float4 SIMD and bounds guard (0xClandestine)
4d33a92  Round grid up to 256-multiple for Metal dispatch safety (0xClandestine)
1740f5c  Fix HC sinkhorn kernel: drop dead #if preprocessor guard (0xClandestine)
0c76952  Fix sinkhorn kernel bugs and add split sparse attention for prefill (0xClandestine)
fbad2b7  Fix DS4 sanitize: stack grouped wo_a, remap quant metadata keys (0xClandestine)
b2e0456  Fix sanitize for pre-quantized DS4: stacked experts, biases, group_size (0xClandestine)
43b8430  Fix tokenizer loading for custom model types with rope_scaling (0xClandestine)
e7f62b6  Replace metal::fast::recip with 1/x for mlx 0.31 compatibility (0xClandestine)
6755555  Add fused Metal sparse attention kernel for DS4 prefill (0xClandestine)
e4687ca  Fix fused sparse attn kernel: address space and reference cast bugs (0xClandestine)
71e337b  Fix sanitize: don't drop expert .biases keys needed by QuantizedSwitc… (0xClandestine)
fef630b  Fix fused sparse attention kernel: update address space for topk_idxs… (Blaizzy)
d64e3fb  Fix fused sparse attn: use device int32_t* for topk_idxs pointer (0xClandestine)
0597549  Fix fused sparse attn: topk_idxs is constant address space, not device (0xClandestine)
9bd7cd8  Fix fused sparse attention Metal kernel: address space, dispatch, out… (0xClandestine)
981f2fa  Fix generate_step crash: flatten extra batch dims in model __call__ (0xClandestine)
40341ea  Fix decode regression: skip indexer for L==1, use full pooled KV dire… (0xClandestine)
5ee3d96  Add fused partial RoPE Metal kernel to eliminate split/concat interme… (0xClandestine)
40b1c02  Optimize decode path: q-norm kernel, overlap_transform, HC fn bf16 (0xClandestine)
cd74c13  Defer Compressor wkv/wgate GEMVs until a full window is ready (0xClandestine)
11c5397  Three more decode optimizations: collapse/expand kernels + MoE gate bf16 (0xClandestine)
7579a71  Revert "Three more decode optimizations: collapse/expand kernels + Mo… (0xClandestine)
afdbaf1  Fix ModelArgs: __post_init__ references quantization_config not quant… (0xClandestine)
09c0874  Remove unused quantization_config from ModelArgs and update __post_in… (Blaizzy)
2cf48d2  Add @mx.compile decorator to fused_sparse_attention and _split_sparse… (Blaizzy)
8726c2e  format (Blaizzy)
3737312  Fix prefill slowdown: skip fused sparse attn kernel for L > 1 (0xClandestine)
ea99374  Revert rope kernel and cache x-buffering per maintainer feedback (0xClandestine)
9126d7c  Address review feedback: revert n_rows guard, fused attn kernel, HC f… (0xClandestine)
eecab43  Vectorize HC sinkhorn kernel: float4 loads, unroll inner loops (0xClandestine)
a3c04af  Restore cast_predicate HC exclusions accidentally dropped in review c… (0xClandestine)
da913ff  Replace _split_sparse_attention with standard SDPA for prefill (0xClandestine)
de4c25c  Revert SwitchGLU, HyperConnection, and compress path to match upstream (0xClandestine)
1e7fcf8  Revert README to match upstream (0xClandestine)
a0bcde7  Remove dead code: q-norm kernel and _split_sparse_attention (0xClandestine)
c55bc53  Compile MoE expert selection + skip empty pooled tensor processing (0xClandestine)
fba00da  Fuse HyperHead into single compiled graph (0xClandestine)
28a5a1f  Cache repeated reshape/cast ops in attention hot path (0xClandestine)
ea8ea24  Fix _ensure_cached None check before dtype comparison (0xClandestine)
49e37b0  Fuse HC compute_weights into single compiled graph (0xClandestine)
9929953  Fuse sinkhorn + collapse into single Metal kernel dispatch (0xClandestine)
9a69e7b  Use explicit adds and ILP-friendly layout in sinkhorn kernels (0xClandestine)
245f6cb  Branchless sinkhorn + native bfloat4 loads in fused collapse kernel (0xClandestine)
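Several of the commits above touch hand-written Metal kernels dispatched from Python: float4/bfloat4 vectorized loads, an explicit bounds guard, and rounding the launch grid up to a multiple of the 256-wide threadgroup. The sketch below shows only that dispatch pattern via MLX's mx.fast.metal_kernel, with a trivial, hypothetical element-wise kernel standing in for the PR's actual sinkhorn/collapse kernels.

```python
import mlx.core as mx

# Hypothetical stand-in kernel body; the real kernels in this PR (HC
# sinkhorn, fused collapse) are far more involved. `inp` and `n` are the
# named inputs declared below, `out` is the named output. Assumes a
# float32 input array.
_SOURCE = """
    uint i = thread_position_in_grid.x;
    if (i >= n[0]) return;          // bounds guard: the grid is padded
    out[i] = inp[i] * 2.0f;
"""

_kernel = mx.fast.metal_kernel(
    name="double_elems",
    input_names=["inp", "n"],
    output_names=["out"],
    source=_SOURCE,
)

def double_elems(x, threadgroup=256):
    # Round the grid up to a multiple of the threadgroup width so the
    # dispatch is always valid; the guard above skips the padding threads.
    n = x.size
    grid = ((n + threadgroup - 1) // threadgroup) * threadgroup
    (out,) = _kernel(
        inputs=[x, mx.array(n, dtype=mx.uint32)],
        grid=(grid, 1, 1),
        threadgroup=(threadgroup, 1, 1),
        output_shapes=[x.shape],
        output_dtypes=[x.dtype],
    )
    return out
```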
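Later commits lean on mx.compile to collapse small pieces of the decode path (MoE expert selection, HyperHead, HC compute_weights) into single compiled graphs. A minimal, hypothetical illustration of that pattern for top-k expert gating follows; the function name, TOP_K value, and gating math are placeholders, not the model's actual routing code.

```python
import mlx.core as mx

TOP_K = 8  # hypothetical number of active experts per token

@mx.compile  # fuse the gating ops into one compiled graph per call
def select_experts(gate_logits):
    # Softmax over the expert dimension, then pick the top-k per token.
    probs = mx.softmax(gate_logits, axis=-1)
    idx = mx.argpartition(-probs, kth=TOP_K - 1, axis=-1)[..., :TOP_K]
    weights = mx.take_along_axis(probs, idx, axis=-1)
    return idx, weights
```

Compiling the whole routine, rather than evaluating each op eagerly, is what avoids launching a series of tiny kernels per decoded token.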
Review comment: Let’s remove the readme changes for now.