Qwen3.6-35B-A3B-8bit + DFlash improves tok/s but aborts Python process in dflash_mlx/runtime.py #9

@gizmax

Description

Summary

Qwen3.6-35B-A3B-8bit with DFlash enabled on Apple Silicon improves decode throughput, but repeatedly aborts the Python process during generation. The crash stack points into dflash_mlx/runtime.py.

This is happening through oMLX, but the fatal stack is in dflash_mlx, so I am filing here as well.

Environment

  • Hardware: Mac Studio M2 Ultra 64 GB
  • OS: macOS Sequoia
  • Python: 3.13.12
  • Runtime path in crash log: /Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py
  • Target model: Qwen3.6-35B-A3B-8bit
  • Draft model: Qwen3.6-35B-A3B-DFlash

Relevant config on the target model:

{
  "specprefill_enabled": false,
  "dflash_enabled": true,
  "dflash_draft_model": "/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash",
  "is_pinned": true
}

What I observed

The setup does produce better throughput before failing. Excerpt from my local benchmark:

=== 3.6-8bit+DFlash ===
short_50 mean: 37.1 tok/s
medium_200 mean: 38.2 tok/s
long_4k mean: 34.6 tok/s

But the Python process then aborts with a fatal error.

Crash trace

From ~/.omlx/logs/crash.log:

Thread 0x000000016e807000 (most recent call first):
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py", line 1323 in generate_dflash_once
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/omlx/engine/dflash.py", line 381 in _run
...
Fatal Python error: Aborted

I also have multiple similar abort entries in the same file pointing at:

  • dflash_mlx/runtime.py:1323 in generate_dflash_once
  • dflash_mlx/runtime.py:1440 in generate_dflash_once
  • dflash_mlx/runtime.py:171 in _eval_logits_and_captured

Repro shape

The failure showed up while running repeated chat completions against a local OpenAI-compatible endpoint with:

  • short prompt / max_tokens=400
  • medium prompt / max_tokens=400
  • long prompt around 5311 prompt tokens / max_tokens=400

The issue did not look like an immediate model-load failure. DFlash initialized successfully first.

From server.log:

DFlash enabled for Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash
DFlashEngine loaded: target=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash, max_ctx=4096, fallback=vlm

Expected

  • DFlash should either run stably on this target/draft pair, or
  • fail gracefully with a recoverable exception instead of aborting the Python process.

Actual

  • decode throughput improves at first
  • then the process aborts hard
  • upstream server loses availability until restarted
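Since a C-level abort (Fatal Python error: Aborted) cannot be caught with try/except inside the server process, one interim server-side mitigation would be to run each DFlash generation in a short-lived child process, so a hard abort only kills the child. This is a minimal sketch of that idea; run_isolated and its names are hypothetical and not part of oMLX or dflash_mlx:

```python
import multiprocessing as mp

def run_isolated(fn, timeout=300):
    """Run fn() in a child process and return its result.

    If the child dies from a signal (e.g. SIGABRT raised deep inside
    the runtime), only the child is lost: the caller gets a
    recoverable RuntimeError instead of a dead server.
    """
    # "fork" avoids having to pickle fn; fork plus Metal/MLX state can
    # be fragile on macOS, so a real server would likely use "spawn"
    # with a top-level worker function instead.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=lambda: q.put(fn()))
    p.start()
    p.join(timeout)
    if p.exitcode == 0:
        return q.get()
    p.kill()  # no-op if already dead; covers the timeout case
    raise RuntimeError(f"generation worker exited with code {p.exitcode}")
```

A caller could then fall back to plain VLM decode whenever run_isolated raises, which would match the recovery behavior described under Expected.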

Notes

After disabling dflash_enabled for Qwen3.6-35B-A3B-8bit, the same server came back up in plain VLM mode and passed health checks again.

If useful, I can provide a smaller reproducer script that just loops the same three prompts against the local endpoint.
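In the meantime, here is a rough sketch of that loop. The endpoint URL, port, and prompt contents are placeholders for my local setup, not values from the logs above; only max_tokens=400 and the three size classes match my runs:

```python
import json
import urllib.request

# Placeholders -- adjust to your local server and model id.
ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"
MODEL = "Qwen3.6-35B-A3B-8bit"

def build_payload(prompt, max_tokens=400):
    """OpenAI-compatible chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def run_once(prompt):
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Stand-in prompts approximating the three size classes.
    prompts = {
        "short_50": "Summarize speculative decoding in one sentence.",
        "medium_200": "Explain speculative decoding step by step. " * 30,
        "long_4k": "lorem ipsum dolor sit amet " * 1000,
    }
    for _ in range(20):  # loop until the server aborts
        for name, prompt in prompts.items():
            out = run_once(prompt)["choices"][0]["message"]["content"]
            print(name, len(out))
```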
