
AX Engine


AX Engine is a Mac-first LLM inference runtime, local server, SDK layer, and benchmark toolkit for Apple Silicon.

It is not "AX MLX" as a product. MLX is the primary Apple Silicon execution backend for supported model families, while AX Engine also exposes explicit compatibility routes for upstream mlx-lm and llama.cpp so users can stay on one AX surface while model coverage grows.

Requires macOS on Apple Silicon (M4 or newer). Rust 1.85+ is required for source builds.

30-Second Setup

Install the released command-line tools and verify the local runtime contract:

brew install defai-digital/ax-engine/ax-engine
ax-engine-bench doctor
ax-engine-server --help

This verifies the released AX Engine tools. Running inference requires choosing a runtime path below: repo-owned MLX, delegated mlx-lm, or delegated llama.cpp.

Why AX Engine

AX Engine gives local inference work a stable runtime contract:

  • ax-engine-server exposes a local HTTP adapter over the runtime.
  • ax-engine-bench records workload contracts, route identity, correctness, determinism, and performance evidence.
  • ax-engine-sdk, Python bindings, and the JavaScript preview client provide thin integration surfaces over the same backend-resolution rules.
  • Repo-owned MLX execution is optimized for supported Qwen and Gemma families.
  • Delegated mlx_lm.server and llama.cpp routes cover explicit compatibility cases without turning delegated results into AX-owned throughput claims.

mlx_lm and mlx-swift-lm remain the canonical MLX references. AX Engine compares against them, learns from them, and delegates to mlx-lm for unsupported MLX text models when requested. The AX-owned value is the runtime layer around supported workloads: request lifecycle, scheduling, KV/cache policy, n-gram acceleration, and auditable benchmark artifacts.

For supported transformer families on Apple Silicon, the AX-owned runtime layer can produce higher effective throughput than the reference MLX runtimes on matching benchmark shapes:

  • N-gram acceleration reaches up to 2.4x mlx_lm decode throughput on high-hit benchmark rows — with no second draft model and no model changes
  • AX-owned request lifecycle provides deterministic, auditable scheduling, KV block management, and prefix reuse that upstream Python runtimes do not expose as stable contracts
  • Workload-contract tooling (ax-engine-bench) validates correctness, determinism, route identity, and regression across checked-in manifests, not just throughput snapshots

The thesis is not "our MLX tensor ops are faster." MLX compiles and executes the same compute graph either way. The thesis is that AX's decode strategy above MLX — how tokens are speculated, how requests are scheduled, how KV state is materialized — produces measurably higher effective throughput on supported workloads.
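
A back-of-the-envelope sketch of what "effective throughput" means under that decode strategy; every number below is invented for illustration, not a benchmark result:

# Hypothetical arithmetic only; real accept rates are prompt- and output-pattern dependent.
base_tok_s = 190.0        # plain one-token-per-forward-pass decode speed
drafts_per_step = 4       # the n-gram table proposes up to 4 draft tokens
accept_rate = 0.4         # fraction of drafted tokens the target model verifies as correct
verify_overhead = 1.10    # a pass over [last_token, D1..D4] costs a bit more than a 1-token step

tokens_per_step = 1 + drafts_per_step * accept_rate           # target token + accepted drafts
effective_tok_s = base_tok_s * tokens_per_step / verify_overhead
print(round(effective_tok_s, 1))                              # ~449 tok/s, roughly 2.4x in this made-up setting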

Runtime Paths

  • Repo-owned MLX runtime. Use it for: supported Qwen/Gemma MLX model artifacts and repo-owned performance claims. Current scope: local Apple Silicon inference, token-based server/SDK requests, benchmarked direct and n-gram acceleration modes.
  • mlx_lm_delegated. Use it for: MLX text models that upstream mlx-lm supports before AX has a repo-owned graph. Current scope: blocking text generation through a user-provided mlx_lm.server; /v1/generate and OpenAI-compatible completion/chat text endpoints. Token streaming is not exposed because this route does not provide AX-owned token IDs or per-token deltas.
  • llama_cpp. Use it for: GGUF and non-MLX local inference. Current scope: delegated llama.cpp server/CLI compatibility; route-contract evidence, not repo-owned MLX throughput.

The runtime report exposes selected_backend, support_tier, and resolution_policy so callers and benchmark artifacts can distinguish these paths.
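
A minimal sketch of how a caller might branch on those fields, assuming the report has already been fetched (via the SDK or the server) and parsed into a dict; only the three field names come from the runtime contract, the rest is illustrative:

def describe_route(report: dict) -> str:
    backend = report["selected_backend"]
    tier = report["support_tier"]
    policy = report["resolution_policy"]
    if tier == "mlx_lm_delegated":
        # Delegated text generation through a user-provided mlx_lm.server.
        return f"delegated to upstream mlx-lm (policy={policy})"
    if backend == "llama_cpp":
        return f"delegated llama.cpp route (policy={policy})"
    return f"repo-owned MLX runtime via {backend} (policy={policy})"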

For the exact OpenAI-shaped endpoint contract, including what is and is not compatible today, see docs/API-COMPATIBILITY.md.

Design

Execution Layer

The repo-owned MLX path uses MLX directly for tensor operations via the official mlx-c C API. Matrix multiply, quantized matmul, attention, RMSNorm, and RoPE go through MLX's Apple-maintained Metal kernels. AX owns the runtime behavior above that graph.

What AX Engine adds around model execution:

  • N-gram acceleration: a bigram/trigram table built at runtime predicts up to 4 draft tokens per step. The target model verifies them in one forward pass over [last_token, D1, …, D_n]. An EMA accept-rate gate (α=0.1, threshold 0.5) disables acceleration after a bad sequence and re-enables when the table recovers. No second draft model required; a sketch of this loop follows the list.
  • Scheduler and KV manager: request lifecycle, batching, memory-blocked recovery, and execution planning live in ax-engine-core — deterministic, async-free, no framework dependencies.
  • Chunked KV cache: keys and values grow in pre-allocated backing buffers via slice_update. Draft rollback is O(1) — only the sequence-length pointer moves. After each decode step, all KV buffers are evaluated with the output token to flatten the lazy-eval graph and prevent O(N²) graph depth.
  • Graph compilation: mlx_enable_compile() is called once at startup so Metal shader compilation and dispatch tables are reused across steps with the same shape — equivalent to mx.compile() in mlx_lm.
  • GatedDelta linear attention: hybrid architectures (Qwen3.5, Qwen3-Next) use a custom SIMD-group Metal kernel for the recurrent GatedDelta state update. All other ops in the same models (dense attention, FFN, projections) delegate to MLX's hardware-optimized paths.
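
A compact sketch of the draft/verify loop and accept-rate gate described in the n-gram and chunked-KV bullets above. Helper names (ngram_table.propose, model.forward_step, kv.truncate_to, ngram_table.update) are illustrative, not the crate's API, and the recovery path while the gate is closed is a simplification of the behavior described above:

ALPHA, THRESHOLD, MAX_DRAFTS = 0.1, 0.5, 4   # EMA gate parameters from the bullet above

def decode_step(model, kv, ngram_table, last_token, ema_accept):
    gate_open = ema_accept >= THRESHOLD
    drafts = ngram_table.propose(last_token, MAX_DRAFTS) if gate_open else []
    seq_len_before = kv.seq_len
    # One forward pass over [last_token, D1, ..., Dn] produces logits for every position.
    logits = model.forward_step(kv, [last_token, *drafts])
    accepted = []
    for i, draft in enumerate(drafts):
        if int(logits[i].argmax()) != draft:      # greedy (temp=0) verification
            break
        accepted.append(draft)
    bonus = int(logits[len(accepted)].argmax())   # token taken from the first non-accepted position
    # O(1) rollback: only the sequence-length pointer moves; rejected KV slots are overwritten later.
    kv.truncate_to(seq_len_before + 1 + len(accepted))
    # EMA gate update: disable drafting after a bad run, let it recover as the table improves.
    if drafts:
        step_rate = len(accepted) / len(drafts)
    else:
        step_rate = 1.0 if ngram_table.propose(last_token, 1) == [bonus] else 0.0
    ema_accept = ALPHA * step_rate + (1 - ALPHA) * ema_accept
    ngram_table.update(last_token, accepted + [bonus])
    return accepted + [bonus], ema_accept

In high-hit regimes several drafts are accepted per verification pass, which is where the up-to-2.4x decode rows in the performance tables come from.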

Memory Layer

mlx_set_wired_limit(recommendedMaxWorkingSetSize) wires model weights into GPU memory at startup, preventing Metal from paging them between requests. A dedicated GPU stream avoids cross-stream synchronization on the shared default stream.
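
For reference, a rough MLX Python analogue of that startup step (the repo itself does this through mlx-c; the Python names below, mx.metal.device_info() and mx.set_wired_limit(), are assumptions about the installed MLX version, so check your version's API):

import mlx.core as mx

# Wire roughly the recommended working set so Metal does not page model weights
# out between requests. Function and key names here are assumptions.
info = mx.metal.device_info()
mx.set_wired_limit(info["max_recommended_working_set_size"])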

Supported Models

  • Gemma 4 (gemma-4-e2b-it, gemma-4-e4b-it, gemma-4-26b-a4b-it, gemma-4-31b-it): dense, per-layer embedding, and MoE variants; MLX affine 4/5/6/8-bit weights, sliding-window + full attention, K=V full-attention layers, logit softcapping.
  • Qwen 3.5 (Qwen3.5-9B): linear attention + MoE FFN, attn_output_gate per-head interleaving.
  • Qwen 3.6 / Coder Next (Qwen3.6-35B-A3B 4/5/6/8-bit MLX, Qwen3-Coder-Next-4bit): qwen3_next architecture with GatedDelta linear attention (3 of every 4 layers), full attention with per-head sigmoid gate (every 4th layer), and sparse top-k MoE with shared expert.

All models use MLX safetensors format with the AX model-manifest.json descriptor. Each supported architecture has a hand-written forward pass in ax-engine-mlx. Adding a new architecture means implementing the model graph, not wiring up a generic loader.

Recent community-model checks are tracked according to the evidence behind them. On 2026-05-06, mlx-community/GLM-4.7-Flash-4bit was promoted to a repo-owned MLX runtime path after the GLM MLA attention, sigmoid router, and latent-KV cache contracts landed and an AX server benchmark completed. See benchmarks/results/mlx-inference/2026-05-06/README.md for commands and artifacts. Before promoting any additional architecture, run scripts/probe_mlx_model_support.py --model-dir <model-dir>; for GLM, the probe now reports repo_owned_runtime_ready when the runtime-ready manifest and local reference files are present.

Limitations

  • GatedDelta prefill: Qwen3.5 and Qwen3-Next linear-attention layers use a custom Metal kernel with a serial time loop. For long prompts (512+ tokens), this puts AX prefill behind mlx-swift-lm on those models; decode throughput is unaffected.
  • Raw HuggingFace weights: ax-engine loads MLX community (pre-sanitized) weights. Raw HF checkpoints for hybrid models need norm-weight +1.0 and conv1d moveaxis(2,1) transformations that the converter does not apply; a sketch of those transforms follows the list.
  • N-gram acceleration rows: effective-throughput measurements, not raw model-kernel speedups. n-gram hit rate is prompt/output-pattern dependent.
  • TurboQuant KV compression: experimental and off by default. The turboquant-shadow and turboquant-fused-experimental modes are evidence and route-telemetry surfaces, not production support claims. Public support requires a passing long-context, model-level quality artifact and decode throughput gate; current Gemma 4 E2B local evidence reaches the fused compressed route with zero fallback but does not pass the promotion performance gate.
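
A hypothetical sketch of the two sanitizer transforms mentioned in the raw-HuggingFace bullet above; the key-matching rules are illustrative guesses, not the converter's actual logic:

import numpy as np

def sanitize_hybrid_checkpoint(weights: dict) -> dict:
    out = {}
    for name, w in weights.items():
        if "norm" in name and name.endswith(".weight"):
            w = w + 1.0                  # raw HF norm weights need +1.0 to match pre-sanitized MLX-community exports
        if "conv1d" in name and name.endswith(".weight"):
            w = np.moveaxis(w, 2, 1)     # swap the last two conv1d axes, per the moveaxis(2,1) note above
        out[name] = w
    return out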

Performance (methodology)

Decode throughput (tok/s) — generation=128 tokens, temp=0; percentages vs mlx_lm

Model MLX quantization Prompt tok mlx_lm mlx_swift_lm ax engine ax engine + n-gram accel
Gemma 4 E2B 4-bit · group=64 · affine 128 197.5 192.4 (−2.6%) 176.1 (−10.8%) 524.5 (+165.6%)
512 191.9 179.5 (−6.5%) 169.7 (−11.6%) 528.7 (+175.5%)
Gemma 4 E2B 5-bit · group=64 · affine 128 182.9 174.1 (−4.8%) 156.4 (−14.5%) 419.1 (+129.1%)
512 178.1 167.0 (−6.2%) 150.0 (−15.8%) 414.1 (+132.5%)
Gemma 4 E2B 6-bit · group=64 · affine 128 161.3 153.0 (−5.1%) 139.5 (−13.5%) 389.3 (+141.4%)
512 154.2 147.1 (−4.6%) 134.6 (−12.7%) 385.7 (+150.1%)
Gemma 4 E2B 8-bit · group=64 · affine 128 139.4 134.9 (−3.2%) 123.7 (−11.3%) 413.9 (+196.9%)
512 134.5 130.8 (−2.8%) 120.2 (−10.7%) 413.2 (+207.1%)
Gemma 4 26B A4B 4-bit · group=64 · affine 128 118.3 109.4 (−7.5%) 109.9 (−7.1%) 243.1 (+105.6%)
512 113.1 104.7 (−7.5%) 107.0 (−5.4%) 198.1 (+75.2%)
Gemma 4 31B 4-bit · group=64 · affine 128 26.2 24.8 (−5.5%) 23.5 (−10.6%) 57.7 (+119.9%)
512 24.9 24.7 (−0.9%) 22.3 (−10.6%) 51.0 (+104.4%)
Qwen 3.5 9B 4-bit · group=64 · affine 128 96.5 93.7 (−2.9%) 88.4 (−8.4%) 82.9 (−14.1%)
512 101.3 91.4 (−9.8%) 88.9 (−12.3%) 86.3 (−14.7%)
Qwen 3.6 35B A3B UD-MLX 4-bit · group=64 · affine 128 107.6 103.6 (−3.7%) 108.0 (+0.4%) 135.8 (+26.2%)
512 103.3 101.4 (−1.9%) 106.6 (+3.2%) 118.4 (+14.6%)
Qwen 3.6 35B A3B MLX 5-bit · group=64 · affine 128 116.8 110.2 (−5.6%) 114.8 (−1.7%) 82.1 (−29.7%)
512 113.7 108.7 (−4.4%) 109.7 (−3.6%) 107.5 (−5.5%)
Qwen 3.6 35B A3B MLX 6-bit · group=64 · affine 128 102.9 99.1 (−3.6%) 101.8 (−1.1%) 129.9 (+26.2%)
512 101.1 98.0 (−3.1%) 100.8 (−0.2%) 114.6 (+13.4%)
Qwen 3.6 35B A3B MLX 8-bit · group=64 · affine 128 93.6 89.3 (−4.6%) 92.2 (−1.5%) 90.9 (−2.9%)
512 91.4 89.1 (−2.6%) 92.0 (+0.6%) 114.0 (+24.7%)
Qwen Coder Next 4-bit · group=64 · affine 128 92.2 89.4 (−3.0%) 90.7 (−1.5%) 240.1 (+160.6%)
512 90.4 89.2 (−1.3%) 90.0 (−0.5%) 239.0 (+164.5%)
GLM 4.7 Flash 4-bit · group=64 · affine 128 93.0 88.0 (−5.4%) 91.2 (−1.9%) 250.6 (+169.4%)
512 90.4 84.5 (−6.6%) 90.3 (−0.1%) 244.8 (+170.8%)

Prefill throughput (tok/s) — percentages vs mlx_lm

Model MLX quantization Prompt tok mlx_lm mlx_swift_lm ax engine
Gemma 4 E2B 4-bit · group=64 · affine 128 2,265.8 2,450.4 (+8.1%) 3,134.9 (+38.4%)
512 7,634.1 6,664.3 (−12.7%) 7,426.4 (−2.7%)
Gemma 4 E2B 5-bit · group=64 · affine 128 2,267.5 2,393.9 (+5.6%) 3,058.7 (+34.9%)
512 8,405.7 6,742.6 (−19.8%) 7,128.6 (−15.2%)
Gemma 4 E2B 6-bit · group=64 · affine 128 2,156.3 3,436.8 (+59.4%) 3,008.3 (+39.5%)
512 7,320.7 7,962.3 (+8.8%) 7,071.6 (−3.4%)
Gemma 4 E2B 8-bit · group=64 · affine 128 1,911.7 3,082.0 (+61.2%) 3,003.0 (+57.1%)
512 6,582.8 6,758.1 (+2.7%) 7,081.7 (+7.6%)
Gemma 4 26B A4B 4-bit · group=64 · affine 128 545.3 1,227.3 (+125.1%) 1,196.0 (+119.3%)
512 1,620.7 2,938.6 (+81.3%) 2,695.9 (+66.3%)
Gemma 4 31B 4-bit · group=64 · affine 128 336.5 641.6 (+90.7%) 504.1 (+49.8%)
512 563.5 760.6 (+35.0%) 645.9 (+14.6%)
Qwen 3.5 9B 4-bit · group=64 · affine 128 1,133.3 2,101.1 (+85.4%) 1,845.3 (+62.8%)
512 2,245.7 3,165.8 (+41.0%) 2,613.9 (+16.4%)
Qwen 3.6 35B A3B UD-MLX 4-bit · group=64 · affine 128 531.7 963.2 (+81.1%) 909.2 (+71.0%)
512 1,594.2 2,546.5 (+59.7%) 2,268.5 (+42.3%)
Qwen 3.6 35B A3B MLX 5-bit · group=64 · affine 128 474.4 861.8 (+81.7%) 858.4 (+81.0%)
512 1,484.5 2,416.7 (+62.8%) 2,135.9 (+43.9%)
Qwen 3.6 35B A3B MLX 6-bit · group=64 · affine 128 420.0 762.4 (+81.5%) 819.4 (+95.1%)
512 1,377.9 2,350.6 (+70.6%) 2,078.3 (+50.8%)
Qwen 3.6 35B A3B MLX 8-bit · group=64 · affine 128 393.1 617.7 (+57.1%) 784.8 (+99.6%)
512 1,202.2 2,305.2 (+91.7%) 2,056.5 (+71.1%)
Qwen Coder Next 4-bit · group=64 · affine 128 267.1 384.9 (+44.1%) 811.7 (+203.9%)
512 815.4 1,417.0 (+73.8%) 2,600.6 (+218.9%)
GLM 4.7 Flash 4-bit · group=64 · affine 128 502.9 1,045.0 (+107.8%) 822.7 (+63.6%)
512 1,584.7 2,588.8 (+63.4%) 2,218.2 (+40.0%)

Installation

Homebrew

For tagged macOS arm64 releases, install the preview command-line tools from the AutomatosX tap:

brew install defai-digital/ax-engine/ax-engine

This installs:

  • ax-engine-server: local HTTP adapter over the SDK runtime
  • ax-engine-bench: workload-contract, readiness, direct-generate, and benchmark-support CLI
  • the Homebrew mlx-c runtime dependency required by the released binaries

Check the installed tools:

ax-engine-server --help
ax-engine-bench doctor

Homebrew is the quickest path for the released server and benchmark binaries. If ax-engine-bench doctor fails with Library not loaded: /opt/homebrew/opt/mlx-c/lib/libmlxc.dylib, install or repair the runtime with brew install mlx-c and brew reinstall defai-digital/ax-engine/ax-engine. Use the source build when you need the full Rust workspace, Python extension, local examples, or changes that have not been tagged yet.

The release archive attached to each GitHub release is the Homebrew formula payload, not a standalone installer with bundled dynamic libraries. Use Homebrew unless you are prepared to provide mlx-c and its dynamic library path yourself.

Source

Development builds require Rust and the MLX C runtime on Apple Silicon:

brew install mlx-c
cargo build --workspace --release

Python bindings are built from source:

maturin develop
python -m unittest discover -s python/tests -v

Quick Start

The commands below use source-build paths. If you installed with Homebrew, use ax-engine-server and ax-engine-bench directly instead of ./target/release/....

# HTTP inference server (repo-owned MLX runtime)
./target/release/ax-engine-server \
  --mlx \
  --mlx-model-artifacts-dir /path/to/local/mlx-model \
  --port 8080

# Python bindings (after maturin develop)
python3 - <<'EOF'
import ax_engine
with ax_engine.Session(model_id='gemma4', mlx=True,
        mlx_model_artifacts_dir='/path/to/local/mlx-model') as s:
    result = s.generate([1, 2, 3], max_output_tokens=32)
    print(result.output_tokens)
EOF
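
With the server from the first block running on port 8080, a minimal request sketch; the path and body follow the usual OpenAI chat-completions shape, and the exact fields AX accepts are listed in docs/API-COMPATIBILITY.md:

import json, urllib.request

body = {
    "model": "gemma4",            # illustrative model id, matching the Session example above
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])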

For an unsupported MLX text model that upstream mlx-lm can serve, keep AX Engine as the CLI/server surface and delegate the model execution explicitly:

mlx_lm.server --model /path/to/local/mlx-model --host 127.0.0.1 --port 8090

./target/release/ax-engine-bench generate \
  --prompt "Hello from mlx-lm" \
  --support-tier mlx_lm_delegated \
  --mlx-lm-server-url http://127.0.0.1:8090

mlx_lm_delegated is a compatibility route, not a repo-owned MLX throughput claim. It forwards text generation to upstream mlx_lm.server, preserves AX sampling fields such as temperature, top_p, top_k, repetition_penalty, and seed, and exposes blocking plus fake-SSE text surfaces through AX. Tool calls and visual/multimodal inputs are not yet AX compatibility contracts.

# Primary benchmark: AX vs mlx_lm vs mlx-swift-lm
python3 scripts/bench_mlx_inference_stack.py \
  --model-dir /path/to/local/mlx-model \
  --prompt-tokens 128,512 --generation-tokens 128 \
  --ax-compare-policies --repetitions 3 \
  --mlx-swift-lm-command './scripts/mlx-swift-bench/.build/release/mlx-swift-bench \
    --model {model} --prompt-token-ids {prompt_token_ids_path} \
    --generation-tokens {generation_tokens} --trials {trials} \
    --delay {delay} --prefill-step-size {prefill_step_size}' \
  --output benchmarks/results/mlx-inference/2026-05-04/gemma-4-e2b-it-4bit.json

# Secondary workload-contract benchmark
./target/release/ax-engine-bench scenario \
  --manifest benchmarks/manifests/scenario/chat_gemma4_e2b_short.json \
  --output-root benchmarks/results

# Smoke checks
bash scripts/check-server-preview.sh
bash scripts/check-python-preview.sh

Workspace

crates/ax-engine-core    Engine state machine, scheduler, KV manager, sampler
crates/ax-engine-mlx     MLX model graph, n-gram acceleration, KV cache, runner
crates/mlx-sys           bindgen FFI over mlx-c; safe MlxArray RAII wrappers
crates/ax-engine-sdk     Session API, backend resolution (MLX, mlx-lm delegated, or llama.cpp)
crates/ax-engine-server  Axum HTTP/SSE adapter (OpenAI-compatible routes)
crates/ax-engine-bench   Manifest-driven workload-contract CLI
crates/ax-engine-py      PyO3 extension (ABI3, Python 3.10+)

Unsupported MLX text models can use the explicit delegated mlx_lm_delegated route through a user-provided mlx_lm.server. Non-MLX inference routes through the delegated llama.cpp contract.

Development

cargo build --workspace                                           # build all crates
cargo test --quiet                                                # full Rust test suite
cargo clippy --all-targets --all-features -- -D warnings         # lint (CI gate)
cargo fmt                                                         # format
maturin develop                                                   # rebuild Python extension
python -m unittest discover -s python/tests -v                   # Python tests

Coverage is collected by the report-only GitHub Actions workflow in .github/workflows/coverage.yml. It publishes Rust cargo llvm-cov and Python coverage.py artifacts without enforcing a percentage threshold yet; add a gate only after the project has a stable baseline across macOS, MLX, and PyO3 paths.

Public documentation is in docs/. Canonical benchmark manifests are in benchmarks/manifests/.

Contributing

AX Engine welcomes public contributions. See CONTRIBUTING.md for guidelines.

License

MIT License. See LICENSE for details.

Copyright (c) 2026 DEFAI Private Limited