Summary
Running Qwen3.6-35B-A3B-8bit with DFlash enabled on Apple Silicon improves decode throughput, but the Python process repeatedly aborts during generation. The crash stack points into dflash_mlx/runtime.py.
This is happening through oMLX, but the fatal stack is in dflash_mlx, so I am filing here as well.
Environment
- Hardware: Mac Studio M2 Ultra 64 GB
- OS: macOS Sequoia
- Python: 3.13.12
- Runtime path in crash log: /Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py
- Target model: Qwen3.6-35B-A3B-8bit
- Draft model: Qwen3.6-35B-A3B-DFlash
Relevant config on the target model:
```json
{
  "specprefill_enabled": false,
  "dflash_enabled": true,
  "dflash_draft_model": "/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash",
  "is_pinned": true
}
```
What I observed
The setup does produce better throughput before failing. Excerpt from my local benchmark:

```text
=== 3.6-8bit+DFlash ===
short_50 mean: 37.1 tok/s
medium_200 mean: 38.2 tok/s
long_4k mean: 34.6 tok/s
```
But the Python process then aborts with a fatal error.
Crash trace
From ~/.omlx/logs/crash.log:
```text
Thread 0x000000016e807000 (most recent call first):
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py", line 1323 in generate_dflash_once
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/omlx/engine/dflash.py", line 381 in _run
  ...
Fatal Python error: Aborted
```
I also have multiple similar abort entries in the same file pointing at:
- dflash_mlx/runtime.py:1323 in generate_dflash_once
- dflash_mlx/runtime.py:1440 in generate_dflash_once
- dflash_mlx/runtime.py:171 in _eval_logits_and_captured
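The trace format matches CPython's faulthandler output. For the next repro run I can enable faulthandler explicitly so the abort dumps every thread's stack to a dedicated file; a minimal sketch (the log path is my choice, not anything oMLX-specific, and `PYTHONFAULTHANDLER=1` is the no-code alternative):

```python
# Sketch: capture richer abort traces on the next run.
# Assumption: this runs in the server process before the engine starts.
import faulthandler

trace_file = open("/tmp/dflash_abort_trace.log", "w")
faulthandler.enable(file=trace_file, all_threads=True)
```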
Repro shape
The failure showed up while running repeated chat completions against a local OpenAI-compatible endpoint with:
- short prompt / max_tokens=400
- medium prompt / max_tokens=400
- long prompt (around 5311 prompt tokens) / max_tokens=400
The issue did not look like an immediate model-load failure. DFlash initialized successfully first.
From server.log:
```text
DFlash enabled for Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash
DFlashEngine loaded: target=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash, max_ctx=4096, fallback=vlm
```
Expected
- DFlash should either run stably on this target/draft pair, or
- fail gracefully with a recoverable exception instead of aborting the Python process.
Actual
- decode throughput improves at first
- then the process aborts hard
- upstream server loses availability until restarted
Notes
After disabling dflash_enabled for Qwen3.6-35B-A3B-8bit, the same server came back up in plain VLM mode and passed health checks again.
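For reference, the flipped config that restored plain VLM mode (same file as the snippet above; only dflash_enabled changed):

```json
{
  "specprefill_enabled": false,
  "dflash_enabled": false,
  "dflash_draft_model": "/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash",
  "is_pinned": true
}
```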
If useful, I can provide a full reproducer script that just loops the same three prompts against the local endpoint; a minimal sketch of what it would look like follows (the endpoint URL and model id below reflect my local setup, adjust as needed).
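```python
# repro_dflash_crash.py — minimal reproducer sketch.
# Assumptions: the oMLX server exposes an OpenAI-compatible API at
# http://localhost:8080/v1 and serves the model under the folder name below.
import requests

BASE_URL = "http://localhost:8080/v1"  # assumption: default local endpoint
MODEL = "Qwen3.6-35B-A3B-8bit"         # assumption: served model id

PROMPTS = {
    "short_50": "Explain speculative decoding in one paragraph.",
    # The multipliers only approximate the prompt sizes from the benchmark;
    # tune the long one toward ~5311 prompt tokens for a faithful repro.
    "medium_200": "Explain speculative decoding in detail. " * 20,
    "long_4k": "Explain speculative decoding in detail. " * 700,
}

def run_once(name: str, prompt: str) -> None:
    # One chat completion with the same max_tokens used in the benchmark.
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 400,
        },
        timeout=600,
    )
    resp.raise_for_status()
    usage = resp.json().get("usage", {})
    print(f"{name}: ok, usage={usage}")

if __name__ == "__main__":
    # Loop the three prompt shapes until the server-side process aborts;
    # the crash shows up as a connection error / 5xx once the server dies.
    for i in range(100):
        for name, prompt in PROMPTS.items():
            run_once(name, prompt)
        print(f"--- iteration {i} complete ---")
```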