Merged

Commits (43)

- bb6978a: Optimize HC sinkhorn Metal kernel with float4 SIMD and bounds guard (0xClandestine, Apr 24, 2026)
- 4d33a92: Round grid up to 256-multiple for Metal dispatch safety (0xClandestine, Apr 24, 2026)
- 1740f5c: Fix HC sinkhorn kernel: drop dead #if preprocessor guard (0xClandestine, Apr 24, 2026)
- 0c76952: Fix sinkhorn kernel bugs and add split sparse attention for prefill (0xClandestine, Apr 24, 2026)
- fbad2b7: Fix DS4 sanitize: stack grouped wo_a, remap quant metadata keys (0xClandestine, Apr 24, 2026)
- b2e0456: Fix sanitize for pre-quantized DS4: stacked experts, biases, group_size (0xClandestine, Apr 24, 2026)
- 43b8430: Fix tokenizer loading for custom model types with rope_scaling (0xClandestine, Apr 24, 2026)
- e7f62b6: Replace metal::fast::recip with 1/x for mlx 0.31 compatibility (0xClandestine, Apr 24, 2026)
- 6755555: Add fused Metal sparse attention kernel for DS4 prefill (0xClandestine, Apr 25, 2026)
- e4687ca: Fix fused sparse attn kernel: address space and reference cast bugs (0xClandestine, Apr 25, 2026)
- 71e337b: Fix sanitize: don't drop expert .biases keys needed by QuantizedSwitc… (0xClandestine, Apr 25, 2026)
- fef630b: Fix fused sparse attention kernel: update address space for topk_idxs… (Blaizzy, Apr 25, 2026)
- d64e3fb: Fix fused sparse attn: use device int32_t* for topk_idxs pointer (0xClandestine, Apr 25, 2026)
- 0597549: Fix fused sparse attn: topk_idxs is constant address space, not device (0xClandestine, Apr 25, 2026)
- 9bd7cd8: Fix fused sparse attention Metal kernel: address space, dispatch, out… (0xClandestine, Apr 25, 2026)
- 981f2fa: Fix generate_step crash: flatten extra batch dims in model __call__ (0xClandestine, Apr 25, 2026)
- 40341ea: Fix decode regression: skip indexer for L==1, use full pooled KV dire… (0xClandestine, Apr 25, 2026)
- 5ee3d96: Add fused partial RoPE Metal kernel to eliminate split/concat interme… (0xClandestine, Apr 25, 2026)
- 40b1c02: Optimize decode path: q-norm kernel, overlap_transform, HC fn bf16 (0xClandestine, Apr 25, 2026)
- cd74c13: Defer Compressor wkv/wgate GEMVs until a full window is ready (0xClandestine, Apr 25, 2026)
- 11c5397: Three more decode optimizations: collapse/expand kernels + MoE gate bf16 (0xClandestine, Apr 25, 2026)
- 7579a71: Revert "Three more decode optimizations: collapse/expand kernels + Mo… (0xClandestine, Apr 25, 2026)
- afdbaf1: Fix ModelArgs: __post_init__ references quantization_config not quant… (0xClandestine, Apr 25, 2026)
- 09c0874: Remove unused quantization_config from ModelArgs and update __post_in… (Blaizzy, Apr 25, 2026)
- 2cf48d2: Add @mx.compile decorator to fused_sparse_attention and _split_sparse… (Blaizzy, Apr 25, 2026)
- 8726c2e: format (Blaizzy, Apr 25, 2026)
- 3737312: Fix prefill slowdown: skip fused sparse attn kernel for L > 1 (0xClandestine, Apr 25, 2026)
- ea99374: Revert rope kernel and cache x-buffering per maintainer feedback (0xClandestine, Apr 25, 2026)
- 9126d7c: Address review feedback: revert n_rows guard, fused attn kernel, HC f… (0xClandestine, Apr 25, 2026)
- eecab43: Vectorize HC sinkhorn kernel: float4 loads, unroll inner loops (0xClandestine, Apr 26, 2026)
- a3c04af: Restore cast_predicate HC exclusions accidentally dropped in review c… (0xClandestine, Apr 26, 2026)
- da913ff: Replace _split_sparse_attention with standard SDPA for prefill (0xClandestine, Apr 26, 2026)
- de4c25c: Revert SwitchGLU, HyperConnection, and compress path to match upstream (0xClandestine, Apr 26, 2026)
- 1e7fcf8: Revert README to match upstream (0xClandestine, Apr 26, 2026)
- a0bcde7: Remove dead code: q-norm kernel and _split_sparse_attention (0xClandestine, Apr 26, 2026)
- c55bc53: Compile MoE expert selection + skip empty pooled tensor processing (0xClandestine, Apr 26, 2026)
- fba00da: Fuse HyperHead into single compiled graph (0xClandestine, Apr 26, 2026)
- 28a5a1f: Cache repeated reshape/cast ops in attention hot path (0xClandestine, Apr 26, 2026)
- ea8ea24: Fix _ensure_cached None check before dtype comparison (0xClandestine, Apr 26, 2026)
- 49e37b0: Fuse HC compute_weights into single compiled graph (0xClandestine, Apr 26, 2026)
- 9929953: Fuse sinkhorn + collapse into single Metal kernel dispatch (0xClandestine, Apr 26, 2026)
- 9a69e7b: Use explicit adds and ILP-friendly layout in sinkhorn kernels (0xClandestine, Apr 26, 2026)
- 245f6cb: Branchless sinkhorn + native bfloat4 loads in fused collapse kernel (0xClandestine, Apr 26, 2026)

README.md (74 additions, 0 deletions)
**Review comment (Owner):** Let’s remove the readme changes for now

@@ -257,6 +257,80 @@ model, tokenizer = load(
)
```

### DeepSeek V4 / DeepSeek V4 Flash

This fork adds native support for **DeepSeek-V4** and **DeepSeek-V4-Flash** on Apple Silicon, including full Metal kernel acceleration.

#### Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("path/to/deepseek-v4-flash")
text = generate(model, tokenizer, prompt="Explain attention sinks.", verbose=True)
```

Or from the command line:

```bash
mlx_lm.generate --model path/to/deepseek-v4-flash --prompt "Explain attention sinks."
```

The model type is `deepseek_v4`. Pre-quantized checkpoints (FP8 or FP4 experts) are loaded and dequantized automatically — no manual conversion step is required.

#### Architecture

DeepSeek V4 Flash introduces several architectural innovations that this implementation fully supports:

**HyperConnection** replaces standard residual connections. Each layer maintains `hc_mult=4` parallel hidden streams that are combined through a learned Sinkhorn-normalized mixing matrix. A custom Metal kernel (`_make_hc_split_sinkhorn_kernel`) computes the 4×4 doubly-stochastic combination weights using float4 SIMD and online Sinkhorn iterations entirely on-GPU.
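
For intuition, here is a pure-MLX sketch of the Sinkhorn normalisation the kernel performs; the float4 and online-iteration details live in Metal, and the iteration count below is an assumption:

```python
import mlx.core as mx

def sinkhorn(logits: mx.array, n_iters: int = 8) -> mx.array:
    # Alternately normalise rows and columns of exp(logits); the result
    # converges to a doubly-stochastic matrix (rows and columns sum to 1).
    p = mx.exp(logits - logits.max())  # softmax-style stabilisation
    for _ in range(n_iters):
        p = p / p.sum(axis=-1, keepdims=True)  # rows sum to 1
        p = p / p.sum(axis=-2, keepdims=True)  # columns sum to 1
    return p

# Mixing weights for hc_mult=4 parallel streams:
w = sinkhorn(mx.random.normal((4, 4)))
```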

**Compressed attention (Compressor + Indexer)** provides long-range context without quadratic cost. At every layer with `compress_ratio > 0`, hidden states are pooled into a compressed KV sequence (ratio 4 with overlap, or ratio 128 for long range). During decode, an x-buffer defers the expensive `wkv`/`wgate` projections until a full compression window is ready, skipping those GEMVs on `(ratio−1)/ratio` of decode steps. An Indexer then selects the top-k most relevant compressed KV entries per query head using learned index projections.
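
A hypothetical sketch of the x-buffer idea (class name and buffer layout invented for illustration):

```python
import mlx.core as mx

class XBuffer:
    def __init__(self, ratio: int, wkv: mx.array, wgate: mx.array):
        self.ratio, self.wkv, self.wgate = ratio, wkv, wgate
        self.buf = []  # hidden states awaiting a full window

    def push(self, x: mx.array):
        # x: [B, 1, D] hidden state for the newly decoded token
        self.buf.append(x)
        if len(self.buf) < self.ratio:
            return None  # window incomplete: defer, no GEMV this step
        window = mx.concatenate(self.buf, axis=1)  # [B, ratio, D]
        self.buf.clear()
        # run both projections once for the whole window
        return window @ self.wkv, window @ self.wgate
```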

**Sparse attention paths** handle prefill and decode differently:
- Prefill (`L > 1`): a fused Metal kernel (`ds4_fused_sparse_attn`) computes online softmax over the local sliding-window KV and the top-k sparse compressed KV in a single pass, avoiding materialising the `[B, L, topk, D]` gather intermediate (see the merge sketch after this list).
- Decode (`L = 1`): the indexer is skipped (compressed pool fits within `index_topk` anyway); standard SDPA is used over `[local KV ∥ pooled KV]`.
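
The single-pass trick rests on the standard online-softmax merge: partial results from the two KV segments can be combined exactly. A simplified sketch of the maths (not the kernel itself), where each segment contributes an un-normalised output `o`, running max `m`, and denominator `l`:

```python
import mlx.core as mx

def merge(o1, m1, l1, o2, m2, l2):
    # oX: un-normalised sum of exp(score - mX) * value for segment X
    # mX: running max score; lX: running softmax denominator
    m = mx.maximum(m1, m2)
    a1, a2 = mx.exp(m1 - m), mx.exp(m2 - m)  # rescale to the shared max
    return (a1 * o1 + a2 * o2) / (a1 * l1 + a2 * l2)
```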

**Attention sinks** add to every attention layer a learnable virtual token whose score is a per-head bias and whose value contribution is zero, stabilising attention distributions over long contexts.
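
A minimal MLX sketch of the mechanism, with shapes and names assumed:

```python
import mlx.core as mx

def attend_with_sink(scores: mx.array, v: mx.array, sink_bias: mx.array):
    # scores: [B, H, L, S] pre-softmax logits; v: [B, H, S, D]
    # sink_bias: [H] learnable per-head sink score
    sink = mx.broadcast_to(
        sink_bias.reshape(1, -1, 1, 1), (*scores.shape[:-1], 1)
    )
    # the sink enters the softmax denominator...
    w = mx.softmax(mx.concatenate([scores, sink], axis=-1), axis=-1)
    # ...but contributes no value: drop its column before the weighted sum
    return w[..., :-1] @ v
```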

**Mixture of Experts** uses 256 routed experts plus 1 shared expert per MoE layer. Expert routing uses `sqrtsoftplus` scoring with auxiliary-loss-free top-k selection (`noaux_tc`). The first `num_hash_layers` layers use hash-based routing. `LimitedSwiGLU` clamps gate and up projections to prevent activation overflow.
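
A hypothetical sketch of the scoring and selection, assuming `sqrtsoftplus` means `sqrt(softplus(x))` and that a correction bias (`e_bias` here, name invented) biases selection only, in the `noaux_tc` style; `k=8` is also an assumption:

```python
import mlx.core as mx

def route(x, router_w, e_bias, k=8):
    logits = x @ router_w                        # [tokens, n_experts]
    scores = mx.sqrt(mx.logaddexp(logits, 0.0))  # sqrt(softplus(x))
    # the bias affects which experts are *selected*, not the output weights
    idx = mx.argpartition(-(scores + e_bias), kth=k - 1, axis=-1)[..., :k]
    weights = mx.take_along_axis(scores, idx, axis=-1)
    return idx, weights / weights.sum(axis=-1, keepdims=True)
```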

**Grouped output projection** splits the large O-projection into 8 groups (`o_groups=8`), each with its own low-rank A matrix (`wo_a`), reducing peak memory during the projection step.
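
A hypothetical sketch of the shape bookkeeping (`wo_b`, the shapes, and the summation are assumptions, not the fork's exact code):

```python
import mlx.core as mx

def grouped_o_proj(x, wo_a, wo_b, o_groups=8):
    # x:    [B, L, o_groups, d_group]  attention output, split into groups
    # wo_a: [o_groups, d_group, rank]  per-group low-rank A
    # wo_b: [o_groups, rank, d_model]  per-group B (name assumed)
    out = None
    for g in range(o_groups):
        # low-rank path: only [*, rank]-sized intermediates live per group
        y = (x[..., g, :] @ wo_a[g]) @ wo_b[g]
        out = y if out is None else out + y
    return out
```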

**Partial RoPE** applies rotary embeddings only to the `qk_rope_head_dim`-sized suffix of each head, leaving the `nope` prefix unrotated. A dedicated Metal kernel (`ds4_partial_rope`) performs the rotation in place, eliminating the split/concat intermediates a naive implementation would require.
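
A pure-MLX reference for what the kernel fuses; the half-split rotation convention is an assumption:

```python
import mlx.core as mx

def partial_rope(q, cos, sin, rope_dim):
    # rotate only the trailing rope_dim features; the "nope" prefix passes through
    nope, rope = q[..., :-rope_dim], q[..., -rope_dim:]
    x1, x2 = mx.split(rope, 2, axis=-1)
    rotated = mx.concatenate(
        [x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1
    )
    # this split/concat is exactly what the fused kernel avoids
    return mx.concatenate([nope, rotated], axis=-1)
```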

**Per-head query RMS norm** (`ds4_q_norm`) normalises each query head in a fused Metal kernel before the RoPE step, replacing the `mx.rsqrt` + elementwise pattern.
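
The unfused equivalent, as a sketch (`eps` assumed):

```python
import mlx.core as mx

def q_rms_norm(q, weight, eps=1e-6):
    # q: [B, H, L, D]; normalise each head's query vector independently
    var = mx.mean(q.astype(mx.float32) ** 2, axis=-1, keepdims=True)
    return (q * mx.rsqrt(var + eps)).astype(q.dtype) * weight
```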

#### Metal Kernels

| Kernel | Purpose |
|---|---|
| `ds4_partial_rope` | Fused partial RoPE — eliminates split/concat intermediate |
| `ds4_q_norm` | Per-head query RMS normalisation |
| `_make_hc_split_sinkhorn_kernel` | float4 SIMD Sinkhorn for HyperConnection mixing weights |
| `ds4_fused_sparse_attn` | Online-softmax prefill over local + sparse KV + attention sink |
| `_split_sparse_attention` | MLX fallback for `ds4_fused_sparse_attn` on CPU or older Metal |

All kernels fall back gracefully to pure MLX when Metal is unavailable.

#### Quantization Support

The `sanitize` method handles pre-quantized checkpoints transparently:

- **FP8 weights** (E4M3/E5M2, block-scaled) are dequantized to BF16 on load (see the sketch after this list).
- **FP4/MXFP4 expert weights** are unpacked from the 4-bit lookup table and dequantized to BF16, then re-quantized using MLX's native group-quantized matmul format.
- The safetensors loader is extended to reinterpret the `F8_E8M0` dtype used by some HuggingFace checkpoints that standard MLX cannot parse.
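
A hedged sketch of the block-scaled dequantisation referenced above, assuming a square block of 128 and weights already upcast to BF16; the bit-level FP8 decode and `F8_E8M0` handling are elided:

```python
import mlx.core as mx

def dequant_blockwise(w, scales, block=128):
    # w:      [out, in] weight values already upcast from FP8 to bfloat16
    # scales: [out // block, in // block], one scale per (block, block) tile
    s = mx.repeat(mx.repeat(scales.astype(w.dtype), block, axis=0), block, axis=1)
    return w * s
```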

Precision-sensitive parameters (attention sinks, HyperConnection base/scale, expert correction biases) are excluded from any subsequent `cast` operations via `cast_predicate`.

#### Cache

`DeepseekV4Cache` wraps `RotatingKVCache` for the local sliding window and adds two parallel state buffers (compressor and indexer). It implements the full `BatchRotatingKVCache` interface — supporting `extract`, `extend`, `merge`, `filter`, and `trim` — so batch generation and prompt caching work out of the box.
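
A structural sketch only (constructor arguments and buffer names assumed; method bodies elided):

```python
from mlx_lm.models.cache import RotatingKVCache

class DeepseekV4CacheSketch:
    """Structural sketch; the real class also implements extract, extend,
    merge, filter, and trim to match the BatchRotatingKVCache interface."""

    def __init__(self, window_size: int):
        # sliding-window KV for local attention
        self.local = RotatingKVCache(max_size=window_size)
        # two parallel state buffers, updated alongside the local cache
        self.compressor_state = None  # pooled / compressed KV
        self.indexer_state = None     # indexer key projections
```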

#### Infrastructure Changes

- **`tokenizer_utils.py`**: adds a fallback to `PreTrainedTokenizerFast` when `AutoTokenizer` raises `AttributeError` on custom model types whose config triggers transformers' `rope_scaling` standardisation before `max_position_embeddings` is available (sketched after this list).
- **`utils.py`**: adds `_load_safetensors` which patches `F8_E8M0` dtype headers in-place before loading, allowing FP8-quantized checkpoints to be loaded without a separate conversion step.
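
A minimal sketch of the tokenizer fallback mentioned above (helper name invented):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def load_tokenizer_with_fallback(model_path):
    # AutoTokenizer can raise AttributeError while standardising rope_scaling
    # for custom model types; the fast tokenizer skips that config resolution.
    try:
        return AutoTokenizer.from_pretrained(model_path)
    except AttributeError:
        return PreTrainedTokenizerFast.from_pretrained(model_path)
```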

### Large Models

> [!NOTE]