Skip to content

feat: add modular architecture system supporting all DFlash models#12

Open
0xClandestine wants to merge 4 commits into
bstnxbt:engine-v2from
0xClandestine:feat/more-models
Open

feat: add modular architecture system supporting all DFlash models#12
0xClandestine wants to merge 4 commits into
bstnxbt:engine-v2from
0xClandestine:feat/more-models

Conversation

@0xClandestine
Copy link
Copy Markdown
Contributor

Summary

Add a modular architecture system to support all DFlash models from z-lab and RedHatAI.

Changes

  • New archs/ directory with pluggable architecture system:

    • base.py - Protocols (DFlashAttention, DFlashMLP, etc.), DFlashArgs, ArchitectureRegistry
    • qwen3.py - Qwen3 implementation (qwen3, qwen3_moe, kimi architectures)
    • llama.py - Llama implementation (llama, gemma, gemma4 architectures)
  • Config parsing - Handles both standard config and Gemma-style speculator config (with transformer_layer_config)

  • Updated DRAFT_REGISTRY with 16 models from official sources:

    • z-lab: Qwen3.5-4B/9B/27B/35B-A3B/122B-A10B, Qwen3-4B/8B, Qwen3.6-27B/35B-A3B, Qwen3-Coder-Next/30B-A3B, Kimi-K2.5, Llama-3.1-8B-Instruct, GPT-OSS-20B/120B
    • RedHatAI: Gemma-4-31B-it
  • Backward compatibility maintained via model.py wrapper re-exporting old class names

Architecture System

The system uses protocols to define interfaces that each architecture must implement:

  • DFlashAttention - cross-attention to target hidden states
  • DFlashMLP - feed-forward network
  • DFlashNorm - normalization layer
  • DFlashRope - rotary positional embeddings
  • DFlashCache - KV cache handling
  • DFlashModel - full draft model

Key differences between architectures:

  • Qwen3: Uses Q/K normalization in attention
  • Llama/Gemma: No Q/K normalization, uses SwiGLU MLP

Related: Based on SwiftLM architecture from https://github.com/SharpAI/SwiftLM

bstnxbt and others added 3 commits April 24, 2026 12:32
mlx-lm 0.31.3 requires mlx>=0.31.2 on Darwin per its published
metadata. Bump the lower bounds to match what's actually needed
at runtime. Installed versions unchanged (mlx 0.31.2, mlx-lm 0.31.3).
Test suite: 43 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously dflash-benchmark would overwrite benchmark/results/<chip>/<name>.json
on every run. Append UTC YYYYMMDDTHHMMSSZ to the basename so repeated runs
never lose data and deltas across branches/commits stay traceable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New archs/ directory with pluggable architecture system
- Supports Qwen3 (dense + MoE), Llama, Gemma architectures
- Handles both standard config and Gemma-style speculator config
- Updated DRAFT_REGISTRY with 16 models from z-lab and RedHatAI
- Backward compatibility maintained via model.py wrapper

Models now supported:
- z-lab: Qwen3.5-4B/9B/27B/35B-A3B/122B-A10B, Qwen3-4B/8B,
  Qwen3.6-27B/35B-A3B, Qwen3-Coder-Next/30B-A3B, Kimi-K2.5,
  Llama-3.1-8B-Instruct, GPT-OSS-20B/120B
- RedHatAI: Gemma-4-31B-it
@hankbobtheresearchoor
Copy link
Copy Markdown

Benchmark Results — M3 Ultra 256GB

Ran all 16 models from the updated DRAFT_REGISTRY on commit a4650ce.

Config: max_tokens=512, block_tokens=16, repeat=3, cooldown=60s

PR #12 Summary Chart

Successful (11/16)

Model Base tok/s DFlash tok/s Speedup Acceptance
🟢 Qwen3.6-35B-A3B 105.7 260.9 2.46× 89.8%
🟢 Qwen3.5-27B 39.4 80.6 2.04× 87.7%
🟢 Qwen3.5-35B-A3B 109.2 179.0 1.64× 85.0%
🟢 Qwen3-4B 180.5 258.6 1.43× 86.3%
🟢 Qwen3-8B 122.7 170.4 1.39× 86.7%
🟢 Qwen3.5-9B 115.3 156.2 1.36× 86.7%
🟢 Qwen3.5-122B-A10B 60.4 76.2 1.26× 81.4%
🟢 Qwen3-Coder-30B-A3B 107.2 135.1 1.26× 81.2%
🟢 Qwen3.5-4B 169.9 183.4 1.08× 82.4%
🔴 Meta-Llama-3.1-8B 131.1 70.9 0.54× 70.9%
🔴 gpt-oss-120b 96.5 23.4 0.24× 0.0%

Failed (5/16)

Model Error
🔴 Qwen3.6-27B GatedRepoError — draft model access restricted on HF
🔴 Kimi-K2.5 ValueError — 2558 weight params not in model (MoE target mismatch)
🔴 Qwen3-Coder-Next AttributeErrorQwen3NextGatedDeltaNet has no in_proj_qkv (GatedDeltaNet arch unsupported)
🔴 gpt-oss-20b RepositoryNotFoundErrormlx-community/gpt-oss-20b-4bit doesn't exist
🔴 Gemma-4-31B-it ValueError — 4 extra params not in model: d2t, embed_tokens.weight, lm_head.weight, t2d (speculator config mismatch)

Key Findings

  1. No regression vs PR perf: branchless Metal kernels + fused draft KV projections #10 on previously-benchmarked models — the modular arch system is neutral-to-slightly-positive
  2. MoE models see biggest gains — Qwen3.6-35B-A3B at 2.46× is the standout
  3. 5 bugs identified — gated repos, missing targets, unsupported architectures (GatedDeltaNet, Gemma speculator, Kimi MoE)

Full results: https://gist.github.com/hankbobtheresearchoor/d18e938646583a5beb97759a0bebc7a1

Removed models with known issues:
- Qwen3.6-27B: Gated repo (requires HF auth)
- Kimi-K2.5: MLA not supported yet
- GPT-OSS models: Target architecture not in mlx-lm

Kept 12 verified working models.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants