Qwen3.6-35B-A3B-8bit + DFlash improves tok/s but aborts Python process in dflash_mlx/runtime.py #9

@gizmax

Description

Summary

Qwen3.6-35B-A3B-8bit with DFlash enabled on Apple Silicon improves decode throughput, but repeatedly aborts the Python process during generation. The crash stack points into dflash_mlx/runtime.py.

This is happening through oMLX, but the fatal stack is in dflash_mlx, so I am filing here as well.

Environment

  • Hardware: Mac Studio M2 Ultra 64 GB
  • OS: macOS Sequoia
  • Python: 3.13.12
  • Runtime path in crash log: /Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py
  • Target model: Qwen3.6-35B-A3B-8bit
  • Draft model: Qwen3.6-35B-A3B-DFlash

Relevant config on the target model:

{
  "specprefill_enabled": false,
  "dflash_enabled": true,
  "dflash_draft_model": "/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash",
  "is_pinned": true
}

What I observed

The setup does produce better throughput before failing. Excerpt from my local benchmark:

=== 3.6-8bit+DFlash ===
short_50 mean: 37.1 tok/s
medium_200 mean: 38.2 tok/s
long_4k mean: 34.6 tok/s

But the Python process then aborts with a fatal error.

Crash trace

From ~/.omlx/logs/crash.log:

Thread 0x000000016e807000 (most recent call first):
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/dflash_mlx/runtime.py", line 1323 in generate_dflash_once
  File "/Users/jarvis/venvs/omlx/lib/python3.13/site-packages/omlx/engine/dflash.py", line 381 in _run
...
Fatal Python error: Aborted

I also have multiple similar abort entries in the same file pointing at:

  • dflash_mlx/runtime.py:1323 in generate_dflash_once
  • dflash_mlx/runtime.py:1440 in generate_dflash_once
  • dflash_mlx/runtime.py:171 in _eval_logits_and_captured

Repro shape

The failure showed up while running repeated chat completions against a local OpenAI-compatible endpoint with:

  • short prompt / max_tokens=400
  • medium prompt / max_tokens=400
  • long prompt around 5311 prompt tokens / max_tokens=400

The issue did not look like an immediate model-load failure. DFlash initialized successfully first.

From server.log:

DFlash enabled for Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash
DFlashEngine loaded: target=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-8bit, draft=/Users/jarvis/.omlx/models/Qwen3.6-35B-A3B-DFlash, max_ctx=4096, fallback=vlm

Expected

  • DFlash should either run stably on this target/draft pair, or
  • fail gracefully with a recoverable exception instead of aborting the Python process.

Actual

  • decode throughput improves at first
  • then the process aborts hard
  • upstream server loses availability until restarted
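Since a C-level abort (Fatal Python error: Aborted) cannot be caught with try/except inside the server process, one interim server-side mitigation would be to run each DFlash generation in a short-lived child process, so a hard abort only kills the child. This is a minimal sketch of that idea; run_isolated and its names are hypothetical and not part of oMLX or dflash_mlx:

```python
import multiprocessing as mp

def run_isolated(fn, timeout=300):
    """Run fn() in a child process and return its result.

    If the child dies from a signal (e.g. SIGABRT raised deep inside
    the runtime), only the child is lost: the caller gets a
    recoverable RuntimeError instead of a dead server.
    """
    # "fork" avoids having to pickle fn; fork plus Metal/MLX state can
    # be fragile on macOS, so a real server would likely use "spawn"
    # with a top-level worker function instead.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=lambda: q.put(fn()))
    p.start()
    p.join(timeout)
    if p.exitcode == 0:
        return q.get()
    p.kill()  # no-op if already dead; covers the timeout case
    raise RuntimeError(f"generation worker exited with code {p.exitcode}")
```

A caller could then fall back to plain VLM decode whenever run_isolated raises, which would match the recovery behavior described under Expected.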

Notes

After disabling dflash_enabled for Qwen3.6-35B-A3B-8bit, the same server came back up in plain VLM mode and passed health checks again.

If useful, I can provide a smaller reproducer script that just loops the same three prompts against the local endpoint.
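In the meantime, here is a rough sketch of that loop. The endpoint URL, port, and prompt contents are placeholders for my local setup, not values from the logs above; only max_tokens=400 and the three size classes match my runs:

```python
import json
import urllib.request

# Placeholders -- adjust to your local server and model id.
ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"
MODEL = "Qwen3.6-35B-A3B-8bit"

def build_payload(prompt, max_tokens=400):
    """OpenAI-compatible chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def run_once(prompt):
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Stand-in prompts approximating the three size classes.
    prompts = {
        "short_50": "Summarize speculative decoding in one sentence.",
        "medium_200": "Explain speculative decoding step by step. " * 30,
        "long_4k": "lorem ipsum dolor sit amet " * 1000,
    }
    for _ in range(20):  # loop until the server aborts
        for name, prompt in prompts.items():
            out = run_once(prompt)["choices"][0]["message"]["content"]
            print(name, len(out))
```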
