feat: add modular architecture system supporting all DFlash models by 0xClandestine · Pull Request #12 · bstnxbt/dflash-mlx

0xClandestine · 2026-04-25T21:12:45Z

Summary

Add a modular architecture system to support all DFlash models from z-lab and RedHatAI.

Changes

New archs/ directory with pluggable architecture system:
- base.py - Protocols (DFlashAttention, DFlashMLP, etc.), DFlashArgs, ArchitectureRegistry
- qwen3.py - Qwen3 implementation (qwen3, qwen3_moe, kimi architectures)
- llama.py - Llama implementation (llama, gemma, gemma4 architectures)
Config parsing - Handles both standard config and Gemma-style speculator config (with transformer_layer_config)
Updated DRAFT_REGISTRY with 16 models from official sources:
- z-lab: Qwen3.5-4B/9B/27B/35B-A3B/122B-A10B, Qwen3-4B/8B, Qwen3.6-27B/35B-A3B, Qwen3-Coder-Next/30B-A3B, Kimi-K2.5, Llama-3.1-8B-Instruct, GPT-OSS-20B/120B
- RedHatAI: Gemma-4-31B-it
Backward compatibility maintained via model.py wrapper re-exporting old class names

Architecture System

The system uses protocols to define interfaces that each architecture must implement:

DFlashAttention - cross-attention to target hidden states
DFlashMLP - feed-forward network
DFlashNorm - normalization layer
DFlashRope - rotary positional embeddings
DFlashCache - KV cache handling
DFlashModel - full draft model

Key differences between architectures:

Qwen3: Uses Q/K normalization in attention
Llama/Gemma: No Q/K normalization, uses SwiGLU MLP

Related: Based on SwiftLM architecture from https://github.com/SharpAI/SwiftLM

mlx-lm 0.31.3 requires mlx>=0.31.2 on Darwin per its published metadata. Bump the lower bounds to match what's actually needed at runtime. Installed versions unchanged (mlx 0.31.2, mlx-lm 0.31.3). Test suite: 43 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously dflash-benchmark would overwrite benchmark/results/<chip>/<name>.json on every run. Append UTC YYYYMMDDTHHMMSSZ to the basename so repeated runs never lose data and deltas across branches/commits stay traceable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- New archs/ directory with pluggable architecture system - Supports Qwen3 (dense + MoE), Llama, Gemma architectures - Handles both standard config and Gemma-style speculator config - Updated DRAFT_REGISTRY with 16 models from z-lab and RedHatAI - Backward compatibility maintained via model.py wrapper Models now supported: - z-lab: Qwen3.5-4B/9B/27B/35B-A3B/122B-A10B, Qwen3-4B/8B, Qwen3.6-27B/35B-A3B, Qwen3-Coder-Next/30B-A3B, Kimi-K2.5, Llama-3.1-8B-Instruct, GPT-OSS-20B/120B - RedHatAI: Gemma-4-31B-it

hankbobtheresearchoor · 2026-04-25T22:29:28Z

Benchmark Results — M3 Ultra 256GB

Ran all 16 models from the updated DRAFT_REGISTRY on commit a4650ce.

Config: max_tokens=512, block_tokens=16, repeat=3, cooldown=60s

Successful (11/16)

Model	Base tok/s	DFlash tok/s	Speedup	Acceptance
🟢 Qwen3.6-35B-A3B	105.7	260.9	2.46×	89.8%
🟢 Qwen3.5-27B	39.4	80.6	2.04×	87.7%
🟢 Qwen3.5-35B-A3B	109.2	179.0	1.64×	85.0%
🟢 Qwen3-4B	180.5	258.6	1.43×	86.3%
🟢 Qwen3-8B	122.7	170.4	1.39×	86.7%
🟢 Qwen3.5-9B	115.3	156.2	1.36×	86.7%
🟢 Qwen3.5-122B-A10B	60.4	76.2	1.26×	81.4%
🟢 Qwen3-Coder-30B-A3B	107.2	135.1	1.26×	81.2%
🟢 Qwen3.5-4B	169.9	183.4	1.08×	82.4%
🔴 Meta-Llama-3.1-8B	131.1	70.9	0.54×	70.9%
🔴 gpt-oss-120b	96.5	23.4	0.24×	0.0%

Failed (5/16)

Model	Error
🔴 Qwen3.6-27B	`GatedRepoError` — draft model access restricted on HF
🔴 Kimi-K2.5	`ValueError` — 2558 weight params not in model (MoE target mismatch)
🔴 Qwen3-Coder-Next	`AttributeError` — `Qwen3NextGatedDeltaNet` has no `in_proj_qkv` (GatedDeltaNet arch unsupported)
🔴 gpt-oss-20b	`RepositoryNotFoundError` — `mlx-community/gpt-oss-20b-4bit` doesn't exist
🔴 Gemma-4-31B-it	`ValueError` — 4 extra params not in model: `d2t`, `embed_tokens.weight`, `lm_head.weight`, `t2d` (speculator config mismatch)

Key Findings

No regression vs PR perf: branchless Metal kernels + fused draft KV projections #10 on previously-benchmarked models — the modular arch system is neutral-to-slightly-positive
MoE models see biggest gains — Qwen3.6-35B-A3B at 2.46× is the standout
5 bugs identified — gated repos, missing targets, unsupported architectures (GatedDeltaNet, Gemma speculator, Kimi MoE)

Full results: https://gist.github.com/hankbobtheresearchoor/d18e938646583a5beb97759a0bebc7a1

Removed models with known issues: - Qwen3.6-27B: Gated repo (requires HF auth) - Kimi-K2.5: MLA not supported yet - GPT-OSS models: Target architecture not in mlx-lm Kept 12 verified working models.

bstnxbt and others added 3 commits April 24, 2026 12:32

fix: remove problematic models from registry

6f0aae8

Removed models with known issues: - Qwen3.6-27B: Gated repo (requires HF auth) - Kimi-K2.5: MLA not supported yet - GPT-OSS models: Target architecture not in mlx-lm Kept 12 verified working models.

bstnxbt force-pushed the engine-v2 branch from ad2433a to bc24ab0 Compare April 27, 2026 12:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add modular architecture system supporting all DFlash models#12

feat: add modular architecture system supporting all DFlash models#12
0xClandestine wants to merge 4 commits into
bstnxbt:engine-v2from
0xClandestine:feat/more-models

0xClandestine commented Apr 25, 2026

Uh oh!

hankbobtheresearchoor commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

0xClandestine commented Apr 25, 2026

Summary

Changes

Architecture System

Uh oh!

hankbobtheresearchoor commented Apr 25, 2026

Benchmark Results — M3 Ultra 256GB

Successful (11/16)

Failed (5/16)

Key Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants