AX Engine is a Mac-first LLM inference runtime, local server, SDK layer, and benchmark toolkit for Apple Silicon.
It is not "AX MLX" as a product. MLX is the primary Apple Silicon execution
backend for supported model families, while AX Engine also exposes explicit
compatibility routes for upstream mlx-lm and llama.cpp so users can stay on
one AX surface while model coverage grows.
Requires macOS on Apple Silicon M4 or newer and Rust 1.85+.
Install the released command-line tools and verify the local runtime contract:
```
brew install defai-digital/ax-engine/ax-engine
ax-engine-bench doctor
ax-engine-server --help
```

This verifies the released AX Engine tools. Running inference requires choosing
a runtime path below: repo-owned MLX, delegated mlx-lm, or delegated
llama.cpp.
AX Engine gives local inference work a stable runtime contract:
- `ax-engine-server` exposes a local HTTP adapter over the runtime.
- `ax-engine-bench` records workload contracts, route identity, correctness, determinism, and performance evidence.
- `ax-engine-sdk`, the Python bindings, and the JavaScript preview client provide thin integration surfaces over the same backend-resolution rules.
- Repo-owned MLX execution is optimized for supported Qwen and Gemma families.
- Delegated `mlx_lm.server` and `llama.cpp` routes cover explicit compatibility cases without turning delegated results into AX-owned throughput claims.
mlx_lm and
mlx-swift-lm remain the canonical
MLX references. AX Engine compares against them, learns from them, and delegates
to mlx-lm for unsupported MLX text models when requested. The AX-owned value
is the runtime layer around supported workloads: request lifecycle, scheduling,
KV/cache policy, n-gram acceleration, and auditable benchmark artifacts.
For supported transformer families on Apple Silicon, the AX-owned runtime layer can produce higher effective throughput than the reference MLX runtimes on matching benchmark shapes:
- N-gram acceleration reaches up to 2.4x mlx_lm decode throughput on high-hit benchmark rows — with no second draft model and no model changes
- AX-owned request lifecycle provides deterministic, auditable scheduling, KV block management, and prefix reuse that upstream Python runtimes do not expose as stable contracts
- Workload-contract tooling (`ax-engine-bench`) validates correctness, determinism, route identity, and regression across checked-in manifests, not just throughput snapshots
The thesis is not "our MLX tensor ops are faster." MLX compiles and executes the same compute graph either way. The thesis is that AX's decode strategy above MLX — how tokens are speculated, how requests are scheduled, how KV state is materialized — produces measurably higher effective throughput on supported workloads.
| Path | Use it for | Current scope |
|---|---|---|
| Repo-owned MLX runtime | Supported Qwen/Gemma MLX model artifacts and repo-owned performance claims | Local Apple Silicon inference, token-based server/SDK requests, benchmarked direct and n-gram acceleration modes |
| `mlx_lm_delegated` | MLX text models that upstream mlx-lm supports before AX has a repo-owned graph | Blocking text generation through a user-provided `mlx_lm.server`; `/v1/generate` and OpenAI-compatible completion/chat text endpoints. Token streaming is not exposed because this route does not provide AX-owned token IDs or per-token deltas |
| `llama_cpp` | GGUF and non-MLX local inference | Delegated llama.cpp server/CLI compatibility; route-contract evidence, not repo-owned MLX throughput |
The runtime report exposes `selected_backend`, `support_tier`, and
`resolution_policy` so callers and benchmark artifacts can distinguish these
paths.
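As a rough illustration of how a caller might branch on that report, here is a minimal Python sketch. Only the three field names come from the text above; the endpoint path, the payload shape, and the `repo_owned` tier value are assumptions for the example, not a documented AX Engine contract.

```python
# Hedged sketch: inspect the runtime report before trusting benchmark numbers.
# The field names come from this README; the endpoint path and the rest of the
# payload shape are assumptions, not a documented AX Engine contract.
import json
import urllib.request

REPORT_URL = "http://127.0.0.1:8080/v1/runtime/report"  # hypothetical path

with urllib.request.urlopen(REPORT_URL) as resp:
    report = json.load(resp)

backend = report.get("selected_backend")   # repo-owned MLX vs delegated route
tier = report.get("support_tier")          # e.g. mlx_lm_delegated, llama_cpp
policy = report.get("resolution_policy")   # how the backend was chosen

if tier != "repo_owned":                   # tier value name is illustrative only
    print(f"Delegated route ({backend}, policy={policy}); "
          "do not read results as AX-owned MLX throughput.")
```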
For the exact OpenAI-shaped endpoint contract, including what is and is not
compatible today, see docs/API-COMPATIBILITY.md.
The repo-owned MLX path uses MLX directly for tensor operations via the official
mlx-c C API. Matrix multiply, quantized matmul, attention, RMSNorm, and RoPE
go through MLX's Apple-maintained Metal kernels. AX owns the runtime behavior
above that graph.
What AX Engine adds around model execution:
- N-gram acceleration: a bigram/trigram table built at runtime predicts up to 4 draft tokens per step. The target model verifies them in one forward pass over `[last_token, D1, …, D_n]`. An EMA accept-rate gate (α=0.1, threshold 0.5) disables acceleration after a bad sequence and re-enables it when the table recovers. No second draft model is required; a minimal sketch of the gate follows this list.
- Scheduler and KV manager: request lifecycle, batching, memory-blocked recovery, and execution planning live in `ax-engine-core`, which is deterministic, async-free, and free of framework dependencies.
- Chunked KV cache: keys and values grow in pre-allocated backing buffers via `slice_update`. Draft rollback is O(1): only the sequence-length pointer moves. After each decode step, all KV buffers are evaluated with the output token to flatten the lazy-eval graph and prevent O(N²) graph depth.
- Graph compilation: `mlx_enable_compile()` is called once at startup so Metal shader compilation and dispatch tables are reused across steps with the same shape, equivalent to `mx.compile()` in mlx_lm.
- GatedDelta linear attention: hybrid architectures (Qwen3.5, Qwen3-Next) use a custom SIMD-group Metal kernel for the recurrent GatedDelta state update. All other ops in the same models (dense attention, FFN, projections) delegate to MLX's hardware-optimized paths.
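To make the accept-rate gate concrete, here is a minimal Python sketch using the constants quoted above (α=0.1, threshold 0.5, up to 4 draft tokens). It is illustrative only; the real gate lives in `ax-engine-mlx` and its exact state handling is not documented here.

```python
# Illustrative sketch of the EMA accept-rate gate, not the ax-engine-mlx
# implementation. Constants come from this README; everything else is assumed.
class NgramGate:
    def __init__(self, alpha: float = 0.1, threshold: float = 0.5) -> None:
        self.alpha = alpha
        self.threshold = threshold
        self.accept_ema = 1.0   # start optimistic so acceleration begins enabled
        self.enabled = True

    def update(self, accepted: int, drafted: int) -> None:
        """Fold one decode step's accept ratio into the EMA and flip the gate."""
        if drafted == 0:
            return
        ratio = accepted / drafted
        self.accept_ema = (1 - self.alpha) * self.accept_ema + self.alpha * ratio
        # Disable after a bad stretch, re-enable once the table recovers.
        self.enabled = self.accept_ema >= self.threshold

    def draft_budget(self) -> int:
        """How many n-gram draft tokens to verify in the next forward pass."""
        return 4 if self.enabled else 0


gate = NgramGate()
gate.update(accepted=1, drafted=4)   # a bad step pulls the EMA toward 0.25
print(gate.accept_ema, gate.draft_budget())
```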
mlx_set_wired_limit(recommendedMaxWorkingSetSize) wires model weights into GPU
memory at startup, preventing Metal from paging them between requests. A
dedicated GPU stream avoids cross-stream synchronization on the shared default
stream.
| Family | Model | Architecture notes |
|---|---|---|
| Gemma 4 | gemma-4-e2b-it, gemma-4-e4b-it, gemma-4-26b-a4b-it, gemma-4-31b-it | Dense, per-layer embedding, and MoE variants; MLX affine 4/5/6/8-bit weights, sliding-window + full attention, K=V full-attention layers, logit softcapping |
| Qwen 3.5 | Qwen3.5-9B | Linear attention + MoE FFN, attn_output_gate per-head interleaving |
| Qwen 3.6 / Coder Next | Qwen3.6-35B-A3B 4/5/6/8-bit MLX, Qwen3-Coder-Next-4bit | qwen3_next architecture: GatedDelta linear attention (3 of every 4 layers) + full attention with per-head sigmoid gate (every 4th layer) + sparse top-k MoE with shared expert |
All models use MLX safetensors format with the AX model-manifest.json
descriptor. Each supported architecture has a hand-written forward pass in
ax-engine-mlx. Adding a new architecture means implementing the model graph,
not wiring up a generic loader.
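For orientation only, the descriptor might look roughly like the following Python dict. Every field name here is a hypothetical placeholder; the real `model-manifest.json` schema is defined by `ax-engine-mlx`, and only the safetensors format and the quantization settings appear elsewhere in this README.

```python
# Hypothetical shape of an AX model-manifest.json, expressed as a Python dict.
# All field names are assumptions for illustration; the authoritative schema
# lives in ax-engine-mlx.
import json

manifest = {
    "architecture": "gemma4",              # must map to a hand-written forward pass
    "weights_format": "mlx_safetensors",   # MLX safetensors, per the text above
    "quantization": {"bits": 4, "group_size": 64, "mode": "affine"},
    "tokenizer": "tokenizer.json",
}

print(json.dumps(manifest, indent=2))
```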
Recent community-model checks are tracked according to the evidence they have.
On 2026-05-06, mlx-community/GLM-4.7-Flash-4bit was promoted to a repo-owned
MLX runtime path after the GLM MLA attention, sigmoid router, and latent-KV
cache contracts landed and an AX server benchmark completed.
See
benchmarks/results/mlx-inference/2026-05-06/README.md for commands and
artifacts. Before promoting any additional architecture, run
`scripts/probe_mlx_model_support.py --model-dir <model-dir>`: GLM now reports
`repo_owned_runtime_ready` when the runtime-ready manifest and local reference
files are present.
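A small wrapper like the following sketch can turn the probe into a pre-promotion check. The script path and the `repo_owned_runtime_ready` status string come from this README; the assumption that the status appears on stdout is illustrative only.

```python
# Hedged sketch: run the support probe and check for the readiness status.
import subprocess
import sys

model_dir = sys.argv[1] if len(sys.argv) > 1 else "/path/to/local/mlx-model"

probe = subprocess.run(
    ["python3", "scripts/probe_mlx_model_support.py", "--model-dir", model_dir],
    capture_output=True,
    text=True,
    check=True,
)

if "repo_owned_runtime_ready" in probe.stdout:
    print(f"{model_dir}: ready for a repo-owned MLX runtime path")
else:
    print(f"{model_dir}: keep this model on a delegated route for now")
```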
- GatedDelta prefill: Qwen3.5 and Qwen3-Next linear-attention layers use a custom Metal kernel with a serial time loop. For long prompts (512+ tokens), this puts AX prefill behind mlx-swift-lm on those models; decode throughput is unaffected.
- Raw HuggingFace weights: ax-engine loads MLX community (pre-sanitized) weights. Raw HF checkpoints for hybrid models need norm-weight `+1.0` and conv1d `moveaxis(2, 1)` transformations that the converter does not apply; see the sketch after this list.
- N-gram acceleration rows: effective-throughput measurements, not raw model-kernel speedups. The n-gram hit rate is prompt/output-pattern dependent.
- TurboQuant KV compression: experimental and off by default. The `turboquant-shadow` and `turboquant-fused-experimental` modes are evidence and route-telemetry surfaces, not production support claims. Public support requires a passing long-context, model-level quality artifact and a decode-throughput gate; current Gemma 4 E2B local evidence reaches the fused compressed route with zero fallback but does not pass the promotion performance gate.
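The two raw-HF transformations named above look roughly like the following Python sketch, with NumPy arrays standing in for checkpoint tensors. Which parameter names need which transform is model-specific and not specified here; treat this as the shape of the fix, not a converter.

```python
# Illustrative sketch of the missing sanitization steps for raw HF hybrid
# checkpoints: +1.0 on norm weights and moveaxis(2, 1) on conv1d weights.
# The name-matching heuristics below are assumptions, not the real converter.
import numpy as np

def sanitize_hybrid_weight(name: str, w: np.ndarray) -> np.ndarray:
    if "norm" in name and w.ndim == 1:
        return w + 1.0                 # norm weights need +1.0, per the note above
    if "conv1d" in name and w.ndim == 3:
        return np.moveaxis(w, 2, 1)    # reorder conv1d axes to the expected layout
    return w

weights = {
    "model.layers.0.input_norm.weight": np.zeros(8, dtype=np.float32),
    "model.layers.0.conv1d.weight": np.zeros((16, 1, 4), dtype=np.float32),
}
sanitized = {k: sanitize_hybrid_weight(k, v) for k, v in weights.items()}
print({k: v.shape for k, v in sanitized.items()})
```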
Performance (methodology)
| Model | MLX quantization | Prompt tok | mlx_lm | mlx_swift_lm | ax engine | ax engine + n-gram accel |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | 4-bit · group=64 · affine | 128 | 197.5 | 192.4 (−2.6%) | 176.1 (−10.8%) | 524.5 (+165.6%) |
| | | 512 | 191.9 | 179.5 (−6.5%) | 169.7 (−11.6%) | 528.7 (+175.5%) |
| Gemma 4 E2B | 5-bit · group=64 · affine | 128 | 182.9 | 174.1 (−4.8%) | 156.4 (−14.5%) | 419.1 (+129.1%) |
| | | 512 | 178.1 | 167.0 (−6.2%) | 150.0 (−15.8%) | 414.1 (+132.5%) |
| Gemma 4 E2B | 6-bit · group=64 · affine | 128 | 161.3 | 153.0 (−5.1%) | 139.5 (−13.5%) | 389.3 (+141.4%) |
| | | 512 | 154.2 | 147.1 (−4.6%) | 134.6 (−12.7%) | 385.7 (+150.1%) |
| Gemma 4 E2B | 8-bit · group=64 · affine | 128 | 139.4 | 134.9 (−3.2%) | 123.7 (−11.3%) | 413.9 (+196.9%) |
| | | 512 | 134.5 | 130.8 (−2.8%) | 120.2 (−10.7%) | 413.2 (+207.1%) |
| Gemma 4 26B A4B | 4-bit · group=64 · affine | 128 | 118.3 | 109.4 (−7.5%) | 109.9 (−7.1%) | 243.1 (+105.6%) |
| | | 512 | 113.1 | 104.7 (−7.5%) | 107.0 (−5.4%) | 198.1 (+75.2%) |
| Gemma 4 31B | 4-bit · group=64 · affine | 128 | 26.2 | 24.8 (−5.5%) | 23.5 (−10.6%) | 57.7 (+119.9%) |
| | | 512 | 24.9 | 24.7 (−0.9%) | 22.3 (−10.6%) | 51.0 (+104.4%) |
| Qwen 3.5 9B | 4-bit · group=64 · affine | 128 | 96.5 | 93.7 (−2.9%) | 88.4 (−8.4%) | 82.9 (−14.1%) |
| | | 512 | 101.3 | 91.4 (−9.8%) | 88.9 (−12.3%) | 86.3 (−14.7%) |
| Qwen 3.6 35B A3B | UD-MLX 4-bit · group=64 · affine | 128 | 107.6 | 103.6 (−3.7%) | 108.0 (+0.4%) | 135.8 (+26.2%) |
| | | 512 | 103.3 | 101.4 (−1.9%) | 106.6 (+3.2%) | 118.4 (+14.6%) |
| Qwen 3.6 35B A3B | MLX 5-bit · group=64 · affine | 128 | 116.8 | 110.2 (−5.6%) | 114.8 (−1.7%) | 82.1 (−29.7%) |
| | | 512 | 113.7 | 108.7 (−4.4%) | 109.7 (−3.6%) | 107.5 (−5.5%) |
| Qwen 3.6 35B A3B | MLX 6-bit · group=64 · affine | 128 | 102.9 | 99.1 (−3.6%) | 101.8 (−1.1%) | 129.9 (+26.2%) |
| | | 512 | 101.1 | 98.0 (−3.1%) | 100.8 (−0.2%) | 114.6 (+13.4%) |
| Qwen 3.6 35B A3B | MLX 8-bit · group=64 · affine | 128 | 93.6 | 89.3 (−4.6%) | 92.2 (−1.5%) | 90.9 (−2.9%) |
| | | 512 | 91.4 | 89.1 (−2.6%) | 92.0 (+0.6%) | 114.0 (+24.7%) |
| Qwen Coder Next | 4-bit · group=64 · affine | 128 | 92.2 | 89.4 (−3.0%) | 90.7 (−1.5%) | 240.1 (+160.6%) |
| | | 512 | 90.4 | 89.2 (−1.3%) | 90.0 (−0.5%) | 239.0 (+164.5%) |
| GLM 4.7 Flash | 4-bit · group=64 · affine | 128 | 93.0 | 88.0 (−5.4%) | 91.2 (−1.9%) | 250.6 (+169.4%) |
| | | 512 | 90.4 | 84.5 (−6.6%) | 90.3 (−0.1%) | 244.8 (+170.8%) |
| Model | MLX quantization | Prompt tok | mlx_lm | mlx_swift_lm | ax engine |
|---|---|---|---|---|---|
| Gemma 4 E2B | 4-bit · group=64 · affine | 128 | 2,265.8 | 2,450.4 (+8.1%) | 3,134.9 (+38.4%) |
| | | 512 | 7,634.1 | 6,664.3 (−12.7%) | 7,426.4 (−2.7%) |
| Gemma 4 E2B | 5-bit · group=64 · affine | 128 | 2,267.5 | 2,393.9 (+5.6%) | 3,058.7 (+34.9%) |
| | | 512 | 8,405.7 | 6,742.6 (−19.8%) | 7,128.6 (−15.2%) |
| Gemma 4 E2B | 6-bit · group=64 · affine | 128 | 2,156.3 | 3,436.8 (+59.4%) | 3,008.3 (+39.5%) |
| | | 512 | 7,320.7 | 7,962.3 (+8.8%) | 7,071.6 (−3.4%) |
| Gemma 4 E2B | 8-bit · group=64 · affine | 128 | 1,911.7 | 3,082.0 (+61.2%) | 3,003.0 (+57.1%) |
| | | 512 | 6,582.8 | 6,758.1 (+2.7%) | 7,081.7 (+7.6%) |
| Gemma 4 26B A4B | 4-bit · group=64 · affine | 128 | 545.3 | 1,227.3 (+125.1%) | 1,196.0 (+119.3%) |
| | | 512 | 1,620.7 | 2,938.6 (+81.3%) | 2,695.9 (+66.3%) |
| Gemma 4 31B | 4-bit · group=64 · affine | 128 | 336.5 | 641.6 (+90.7%) | 504.1 (+49.8%) |
| | | 512 | 563.5 | 760.6 (+35.0%) | 645.9 (+14.6%) |
| Qwen 3.5 9B | 4-bit · group=64 · affine | 128 | 1,133.3 | 2,101.1 (+85.4%) | 1,845.3 (+62.8%) |
| | | 512 | 2,245.7 | 3,165.8 (+41.0%) | 2,613.9 (+16.4%) |
| Qwen 3.6 35B A3B | UD-MLX 4-bit · group=64 · affine | 128 | 531.7 | 963.2 (+81.1%) | 909.2 (+71.0%) |
| | | 512 | 1,594.2 | 2,546.5 (+59.7%) | 2,268.5 (+42.3%) |
| Qwen 3.6 35B A3B | MLX 5-bit · group=64 · affine | 128 | 474.4 | 861.8 (+81.7%) | 858.4 (+81.0%) |
| | | 512 | 1,484.5 | 2,416.7 (+62.8%) | 2,135.9 (+43.9%) |
| Qwen 3.6 35B A3B | MLX 6-bit · group=64 · affine | 128 | 420.0 | 762.4 (+81.5%) | 819.4 (+95.1%) |
| | | 512 | 1,377.9 | 2,350.6 (+70.6%) | 2,078.3 (+50.8%) |
| Qwen 3.6 35B A3B | MLX 8-bit · group=64 · affine | 128 | 393.1 | 617.7 (+57.1%) | 784.8 (+99.6%) |
| | | 512 | 1,202.2 | 2,305.2 (+91.7%) | 2,056.5 (+71.1%) |
| Qwen Coder Next | 4-bit · group=64 · affine | 128 | 267.1 | 384.9 (+44.1%) | 811.7 (+203.9%) |
| | | 512 | 815.4 | 1,417.0 (+73.8%) | 2,600.6 (+218.9%) |
| GLM 4.7 Flash | 4-bit · group=64 · affine | 128 | 502.9 | 1,045.0 (+107.8%) | 822.7 (+63.6%) |
| | | 512 | 1,584.7 | 2,588.8 (+63.4%) | 2,218.2 (+40.0%) |
For tagged macOS arm64 releases, install the preview command-line tools from the AutomatosX tap:
```
brew install defai-digital/ax-engine/ax-engine
```

This installs:

- `ax-engine-server`: local HTTP adapter over the SDK runtime
- `ax-engine-bench`: workload-contract, readiness, direct-generate, and benchmark-support CLI
- the Homebrew `mlx-c` runtime dependency required by the released binaries
Check the installed tools:
```
ax-engine-server --help
ax-engine-bench doctor
```

Homebrew is the quickest path for the released server and benchmark binaries.
If `ax-engine-bench doctor` fails with `Library not loaded: /opt/homebrew/opt/mlx-c/lib/libmlxc.dylib`, install or repair the runtime with
`brew install mlx-c` and `brew reinstall defai-digital/ax-engine/ax-engine`.
Use the source build when you need the full Rust workspace, Python extension,
local examples, or changes that have not been tagged yet.
The release archive attached to each GitHub release is the Homebrew formula payload. It is
not a standalone installer with bundled dynamic libraries. Use Homebrew unless
you are prepared to provide `mlx-c` and its dynamic library path yourself.
Development builds require Rust and the MLX C runtime on Apple Silicon:
```
brew install mlx-c
cargo build --workspace --release
```

Python bindings are built from source:

```
maturin develop
python -m unittest discover -s python/tests -v
```

The commands below use source-build paths. If you installed with Homebrew, use
`ax-engine-server` and `ax-engine-bench` directly instead of
`./target/release/...`.
```
# HTTP inference server (repo-owned MLX runtime)
./target/release/ax-engine-server \
  --mlx \
  --mlx-model-artifacts-dir /path/to/local/mlx-model \
  --port 8080
```

```
# Python bindings (after maturin develop)
python3 - <<'EOF'
import ax_engine
with ax_engine.Session(model_id='gemma4', mlx=True,
                       mlx_model_artifacts_dir='/path/to/local/mlx-model') as s:
    result = s.generate([1, 2, 3], max_output_tokens=32)
    print(result.output_tokens)
EOF
```

For an unsupported MLX text model that upstream mlx-lm can serve, keep AX
Engine as the CLI/server surface and delegate the model execution explicitly:
```
mlx_lm.server --model /path/to/local/mlx-model --host 127.0.0.1 --port 8090
./target/release/ax-engine-bench generate \
  --prompt "Hello from mlx-lm" \
  --support-tier mlx_lm_delegated \
  --mlx-lm-server-url http://127.0.0.1:8090
```

`mlx_lm_delegated` is a compatibility route, not a repo-owned MLX throughput
claim. It forwards text generation to upstream `mlx_lm.server`, preserves AX
sampling fields such as `temperature`, `top_p`, `top_k`, `repetition_penalty`,
and `seed`, and exposes blocking plus fake-SSE text surfaces through AX. Tool
calls and visual/multimodal inputs are not yet AX compatibility contracts.
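For the server-side version of the same route, a blocking request through the OpenAI-compatible surface might look like the sketch below. It assumes `ax-engine-server` is already running at 127.0.0.1:8080 and configured for the delegated route; the exact endpoint path and model identifier are assumptions, and only the sampling field names come from the paragraph above.

```python
# Hedged sketch: a blocking chat request through the AX surface while model
# execution is delegated to mlx_lm.server. Endpoint path and model name are
# assumptions; the sampling field names come from this README.
import json
import urllib.request

payload = {
    "model": "my-delegated-mlx-model",    # placeholder model identifier
    "messages": [{"role": "user", "content": "Hello from mlx-lm"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "seed": 42,
    "stream": False,                      # blocking surface; no per-token deltas
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```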
```
# Primary benchmark: AX vs mlx_lm vs mlx-swift-lm
python3 scripts/bench_mlx_inference_stack.py \
  --model-dir /path/to/local/mlx-model \
  --prompt-tokens 128,512 --generation-tokens 128 \
  --ax-compare-policies --repetitions 3 \
  --mlx-swift-lm-command './scripts/mlx-swift-bench/.build/release/mlx-swift-bench \
    --model {model} --prompt-token-ids {prompt_token_ids_path} \
    --generation-tokens {generation_tokens} --trials {trials} \
    --delay {delay} --prefill-step-size {prefill_step_size}' \
  --output benchmarks/results/mlx-inference/2026-05-04/gemma-4-e2b-it-4bit.json

# Secondary workload-contract benchmark
./target/release/ax-engine-bench scenario \
  --manifest benchmarks/manifests/scenario/chat_gemma4_e2b_short.json \
  --output-root benchmarks/results

# Smoke checks
bash scripts/check-server-preview.sh
bash scripts/check-python-preview.sh
```

```
crates/ax-engine-core     Engine state machine, scheduler, KV manager, sampler
crates/ax-engine-mlx      MLX model graph, n-gram acceleration, KV cache, runner
crates/mlx-sys            bindgen FFI over mlx-c; safe MlxArray RAII wrappers
crates/ax-engine-sdk      Session API, backend resolution (MLX, mlx-lm delegated, or llama.cpp)
crates/ax-engine-server   Axum HTTP/SSE adapter (OpenAI-compatible routes)
crates/ax-engine-bench    Manifest-driven workload-contract CLI
crates/ax-engine-py       PyO3 extension (ABI3, Python 3.10+)
```
Unsupported MLX text models can use the explicit delegated mlx_lm_delegated
route through a user-provided mlx_lm.server. Non-MLX inference routes through
the delegated llama.cpp contract.
```
cargo build --workspace                                    # build all crates
cargo test --quiet                                         # full Rust test suite
cargo clippy --all-targets --all-features -- -D warnings   # lint (CI gate)
cargo fmt                                                  # format
maturin develop                                            # rebuild Python extension
python -m unittest discover -s python/tests -v             # Python tests
```

Coverage is collected by the report-only GitHub Actions workflow in
.github/workflows/coverage.yml. It publishes Rust cargo llvm-cov and Python
coverage.py artifacts without enforcing a percentage threshold yet; add a gate
only after the project has a stable baseline across macOS, MLX, and PyO3 paths.
Public documentation is in docs/. Canonical benchmark manifests are in
benchmarks/manifests/.
AX Engine welcomes public contributions. See CONTRIBUTING.md for guidelines.
- Website: automatosx.com
- Discord: Join us
- Email: [email protected]
MIT License. See LICENSE for details.
Copyright (c) 2026 DEFAI Private Limited