Pipeline parallelism patches for Llama, Qwen2, and Mixtral (with benchmark data) #1051
guruswami-ai started this conversation in Show and tell
We have working pipeline parallelism implementations for three models that currently only support tensor parallelism (or no distributed inference at all):
- Llama (PP via `PipelineMixin`)
- Qwen2 (PP via `PipelineMixin`)
- Mixtral (PP via `PipelineMixin`)

The patches follow the established patterns from DeepSeek V3 and Ministral3 in the codebase: TP uses `shard_linear`; PP uses `PipelineMixin` with `send`/`recv` at layer boundaries.

Full source files: patches/ (written against mlx-lm 0.30.8)
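To make the send/recv-at-layer-boundaries structure concrete, here is a minimal sketch of a pipeline stage. This is an illustration, not the patch itself: it assumes MLX's `mx.distributed` `send`/`recv_like` primitives, and the `PipelineStage` class, the even layer split, and all attribute names are hypothetical. The real implementations live in patches/.

```python
import mlx.core as mx
from mlx.core import distributed as dist


class PipelineStage:
    """Illustrative pipeline stage owning a contiguous slice of decoder layers."""

    def __init__(self, layers, group: dist.Group):
        self.group = group
        self.rank = group.rank()
        self.size = group.size()
        # Evenly split the layer stack; each rank keeps only its slice.
        # (Hypothetical policy -- the actual patches may split differently.)
        per_rank = len(layers) // self.size
        start = self.rank * per_rank
        self.layers = layers[start : start + per_rank]

    def __call__(self, h: mx.array) -> mx.array:
        # Every rank except the first blocks on one recv per forward pass --
        # the only sync point PP introduces at this boundary.
        if self.rank > 0:
            h = dist.recv_like(h, self.rank - 1, group=self.group)
        for layer in self.layers:
            h = layer(h)
        # Every rank except the last forwards its activations downstream.
        # Note: MLX communication ops are lazy; the returned array must be
        # evaluated (e.g. via mx.eval) for the transfer to actually happen.
        if self.rank < self.size - 1:
            h = dist.send(h, self.rank + 1, group=self.group)
        return h
```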
Why PP matters
PP preserves nearly all single-node generation speed because it has far fewer sync points than TP: TP needs an all-reduce after every attention block and every MLP block (two full-cluster syncs per layer, per token), while PP exchanges activations only once per stage boundary per token.
For models that fit on a single node but need more memory headroom for long context, PP across two nodes doubles your memory budget with minimal speed penalty. TP is still needed for models that genuinely require multi-node compute (Llama 405B, DeepSeek V3, Kimi K2.5).
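A back-of-the-envelope count shows the size of the gap. The numbers below are illustrative (a hypothetical 80-layer dense model on two nodes), not taken from the benchmarks in this post, and assume Megatron-style TP with two all-reduces per layer:

```python
# Per-token sync counts during decode (illustrative assumptions: 80 decoder
# layers, 2 nodes, one all-reduce after attention and one after the MLP).
layers, nodes = 80, 2

tp_syncs_per_token = 2 * layers   # 160 all-reduces, each a full-cluster sync
pp_syncs_per_token = nodes - 1    # 1 point-to-point send/recv at the stage boundary

print(tp_syncs_per_token, pp_syncs_per_token)  # 160 1
```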
Important caveat: PP hits Metal's ~60-second GPU timeout on large dense models. Llama 405B PP2 (63 layers per node) crashes. PP works for models up to ~200B dense parameters. Above that, TP is required.
Benchmark data
These patches were validated across 290 benchmark configurations on a 5-node M3 Ultra TB5 cluster.
Happy to submit these as PRs if there is interest. They are complete, tested, and follow existing codebase conventions.