feat: add modular architecture system supporting all DFlash models #12
Open
0xClandestine wants to merge 4 commits into
Conversation
mlx-lm 0.31.3 requires mlx>=0.31.2 on Darwin per its published metadata. Bump the lower bounds to match what's actually needed at runtime. Installed versions unchanged (mlx 0.31.2, mlx-lm 0.31.3). Test suite: 43 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
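The bumped lower bounds could be expressed with a PEP 508 environment marker so the mlx floor only applies on macOS. This is a hypothetical sketch of the relevant `pyproject.toml` fragment; the project's actual dependency table and any extra entries are assumptions, only the `mlx>=0.31.2` / `mlx-lm` versions come from the commit message.

```toml
[project]
dependencies = [
    # mlx-lm 0.31.3 declares mlx>=0.31.2 on Darwin, so mirror that floor here
    "mlx>=0.31.2; sys_platform == 'darwin'",
    "mlx-lm>=0.31.3",
]
```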
Previously dflash-benchmark would overwrite benchmark/results/<chip>/<name>.json on every run. Append UTC YYYYMMDDTHHMMSSZ to the basename so repeated runs never lose data and deltas across branches/commits stay traceable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
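The timestamping scheme described above can be sketched as follows. The helper name, the `-` separator before the stamp, and the exact path layout are assumptions for illustration; only the `benchmark/results/<chip>/<name>.json` convention and the UTC `YYYYMMDDTHHMMSSZ` format come from the commit message.

```python
from datetime import datetime, timezone
from pathlib import Path

def timestamped_results_path(chip: str, name: str) -> Path:
    """Build a results path with a UTC timestamp suffix so reruns never clobber old data."""
    # strftime codes: %Y%m%d -> date, T literal, %H%M%S -> time, Z literal for UTC
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return Path("benchmark/results") / chip / f"{name}-{stamp}.json"

print(timestamped_results_path("m3-ultra", "qwen3-4b"))
# e.g. benchmark/results/m3-ultra/qwen3-4b-20260101T120000Z.json
```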
- New archs/ directory with pluggable architecture system
- Supports Qwen3 (dense + MoE), Llama, Gemma architectures
- Handles both standard config and Gemma-style speculator config
- Updated DRAFT_REGISTRY with 16 models from z-lab and RedHatAI
- Backward compatibility maintained via model.py wrapper

Models now supported:
- z-lab: Qwen3.5-4B/9B/27B/35B-A3B/122B-A10B, Qwen3-4B/8B, Qwen3.6-27B/35B-A3B, Qwen3-Coder-Next/30B-A3B, Kimi-K2.5, Llama-3.1-8B-Instruct, GPT-OSS-20B/120B
- RedHatAI: Gemma-4-31B-it
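The PR does not show the shape of DRAFT_REGISTRY, so here is a hypothetical sketch of how a registry mapping draft models to architecture keys might look. The repo paths, dict layout, and `resolve_arch` helper are all illustrative assumptions; only the architecture keys (qwen3, qwen3_moe, llama, gemma4) and the org names come from the PR text.

```python
# Hypothetical sketch -- the real DRAFT_REGISTRY's entries and field names
# are not shown in the PR, so everything below is illustrative.
DRAFT_REGISTRY = {
    "z-lab/Qwen3-4B-DFlash": {"arch": "qwen3"},
    "z-lab/Qwen3.5-35B-A3B-DFlash": {"arch": "qwen3_moe"},
    "z-lab/Llama-3.1-8B-Instruct-DFlash": {"arch": "llama"},
    "RedHatAI/Gemma-4-31B-it-DFlash": {"arch": "gemma4"},
}

def resolve_arch(repo_id: str) -> str:
    """Look up the architecture key for a draft model, failing loudly on unknowns."""
    try:
        return DRAFT_REGISTRY[repo_id]["arch"]
    except KeyError:
        raise ValueError(f"unknown draft model: {repo_id}") from None
```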
Benchmark Results — M3 Ultra 256GB
Ran all 16 models from the updated registry. Config: max_tokens=512, block_tokens=16, repeat=3, cooldown=60s
Successful (11/16)
Failed (5/16)
Key Findings
Full results: https://gist.github.com/hankbobtheresearchoor/d18e938646583a5beb97759a0bebc7a1
Removed models with known issues:
- Qwen3.6-27B: gated repo (requires HF auth)
- Kimi-K2.5: MLA not supported yet
- GPT-OSS models: target architecture not in mlx-lm

Kept 12 verified working models.

Summary
Add a modular architecture system to support all DFlash models from z-lab and RedHatAI.
Changes
- New archs/ directory with pluggable architecture system:
  - base.py - Protocols (DFlashAttention, DFlashMLP, etc.), DFlashArgs, ArchitectureRegistry
  - qwen3.py - Qwen3 implementation (qwen3, qwen3_moe, kimi architectures)
  - llama.py - Llama implementation (llama, gemma, gemma4 architectures)
- Config parsing - handles both standard config and Gemma-style speculator config (with transformer_layer_config)
- Updated DRAFT_REGISTRY with 16 models from official sources
- Backward compatibility maintained via model.py wrapper re-exporting old class names
Architecture System
The system uses protocols to define interfaces that each architecture must implement:
- DFlashAttention - cross-attention to target hidden states
- DFlashMLP - feed-forward network
- DFlashNorm - normalization layer
- DFlashRope - rotary positional embeddings
- DFlashCache - KV cache handling
- DFlashModel - full draft model

Key differences between architectures:
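The protocol-based interface layer can be sketched with `typing.Protocol`, which lets an architecture satisfy an interface structurally without inheriting from it. The method signatures below are assumptions (the PR only names the protocols, not their methods); `SimpleMLP` is a toy class for illustration.

```python
# Sketch of a protocol-based interface layer; signatures are illustrative,
# since the PR names the protocols but not their methods.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class DFlashMLP(Protocol):
    def __call__(self, x: Any) -> Any: ...

@runtime_checkable
class DFlashNorm(Protocol):
    def __call__(self, x: Any) -> Any: ...

class SimpleMLP:
    """Toy implementation: matches DFlashMLP structurally, no inheritance needed."""
    def __call__(self, x):
        return x  # identity stand-in for a real feed-forward network

# runtime_checkable lets a registry validate plugins at load time
assert isinstance(SimpleMLP(), DFlashMLP)
```

A registry can then verify each pluggable architecture implements the required protocols before registering it, instead of relying on a shared base class.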
Related: Based on SwiftLM architecture from https://github.com/SharpAI/SwiftLM