Multi-Mac MLX inference over Thunderbolt RDMA — confirmed 8 GB/s on Apple M4 Pro.
A lightweight, production-ready toolkit for running distributed MLX inference across Apple Silicon Macs connected via Thunderbolt, using JACCL (RDMA over Thunderbolt) as the transport layer. Exposes an OpenAI-compatible API with a live dashboard.
Why this exists: exo is a much larger project (32K lines) that also targets TB RDMA, but its auto-discovery and planner logic failed to produce working RDMA on our M4 Pro Mac minis. This fork takes the opposite approach — explicit configuration, minimal code, proven RDMA. See docs/comparison-vs-exo.md for the full analysis.
| Feature | Details |
|---|---|
| 🔗 RDMA verified | 8.05 GB/s peak bandwidth · 25.5 µs latency over Thunderbolt |
| 🧠 Distributed inference | Tensor-parallel `mlx_lm.sharded_load` across 2+ Macs |
| 🌐 OpenAI-compatible API | `/v1/chat/completions` + `/v1/completions` with SSE streaming |
| 📊 Live dashboard | HTMX + SSE — real-time tok/s, latency, queue depth, sparkline, chat UI |
| 🔧 Makefile-driven | `make setup` → `make rdma-test` → `make server` — every operation is one command |
| 📦 Zero build toolchain | No Rust, no Node.js, no npm — pure Python + Bash, managed by uv |
| 🍎 Stock MLX | Uses official `mlx` from PyPI — no custom forks |
| | Mac 1 | Mac 2 |
|---|---|---|
| Model | Mac mini (Mac16,11) | Mac mini (Mac16,11) |
| Chip | Apple M4 Pro | Apple M4 Pro |
| Memory | 48 GB unified | 48 GB unified |
| macOS | 26.3 (25D125) | 26.3 (25D125) |
| RDMA device | rdma_en4 (PORT_ACTIVE) | rdma_en4 (PORT_ACTIVE) |
RDMA benchmark results (from make rdma-test):
| Tensor size | Avg bandwidth | Peak bandwidth | Avg latency |
|---|---|---|---|
| 4 KB | — | — | 25.5 µs |
| 256 KB | 1.82 GB/s | 2.14 GB/s | 0.12 ms |
| 4 MB | 6.71 GB/s | 7.38 GB/s | 0.56 ms |
| 64 MB | 7.94 GB/s | 8.05 GB/s | 7.60 ms |
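For orientation, here is a minimal sketch of what an `all_sum` bandwidth probe looks like with stock MLX, run under `mlx.launch` with the same hostfile. This is illustrative only: `scripts/rdma_test.py` is the real benchmark, and its GB/s accounting (warm-up, averaging over rounds) may differ from this naive bytes-over-wall-time figure.

```python
# probe.py: minimal sketch of an all_sum bandwidth probe (illustrative only;
# scripts/rdma_test.py is the real test with correctness checks and averaging).
import time

import mlx.core as mx

group = mx.distributed.init()  # picks up the distributed backend when run under mlx.launch
rank = group.rank()

mx.eval(mx.distributed.all_sum(mx.ones(8)))  # warm-up round (first call pays setup cost)

for n_bytes in (256 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    x = mx.ones(n_bytes // 4, dtype=mx.float32)  # float32 = 4 bytes per element
    mx.eval(x)
    t0 = time.perf_counter()
    mx.eval(mx.distributed.all_sum(x))  # block until the collective completes
    dt = time.perf_counter() - t0
    if rank == 0:
        print(f"{n_bytes / 2**20:6.1f} MiB  {n_bytes / dt / 1e9:5.2f} GB/s")
```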
mlx-jaccl-cluster/
├── Makefile # All operations as make targets
├── pyproject.toml # uv / pip dependency manifest
├── hostfiles/
│ ├── hosts-2node.json # Working 2-node hostfile
│ ├── hosts-1node.json # Single-node (local testing)
│ └── hosts.json.example # Template for custom setups
├── server/
│ ├── openai_cluster_server.py # OpenAI-compatible API (rank 0 HTTP, all ranks compute)
│ └── dashboard.py # HTMX + SSE live dashboard
├── scripts/
│ ├── setup.sh # One-shot node installer (uv + .venv + deps + fingerprint)
│ ├── bootstrap_node.sh # Remote node setup over SSH
│ ├── rdma_test.py # RDMA correctness + latency + bandwidth test
│ ├── jaccl_tps_bench.py # Distributed tokens/sec benchmark
│ ├── cluster_info.sh # Side-by-side node alignment report
│ ├── verify_cluster.sh # SSH + RDMA device checks
│ ├── sync_nodes.sh # git pull on all nodes in parallel
│ ├── run_openai_cluster_server.sh # Start the cluster server
│ └── stop_openai_cluster_server.sh # Stop the cluster server
└── docs/
├── architecture.md # Deep technical architecture reference
├── roadmap.md # Feature roadmap + gap analysis vs exo
├── from-scratch.md # Full setup guide (RDMA enable → running server)
├── comparison-vs-exo.md # Deep comparison with exo project
└── scripts-reference.md # All scripts + Makefile targets reference
- 2 Apple Silicon Macs connected via Thunderbolt cable
- RDMA enabled on both Macs (one-time, in macOS Recovery — see docs/from-scratch.md)
- SSH key-based auth between the Macs
- Homebrew installed on both
git clone https://github.com/omar-karray/mlx-jaccl-cluster.git
cd mlx-jaccl-cluster
make setup

This installs uv, creates a .venv, installs all Python dependencies, verifies imports, checks RDMA devices, and saves a hardware fingerprint.
REMOTE=mac2.local make bootstrap

This SSHes into Mac 2, installs Homebrew/git/uv if needed, clones the repo to the same path, and runs setup.sh — all in one command.
Edit hostfiles/hosts-2node.json with your actual hostnames and IPs:
[
{
"ssh": "mac1.local",
"ips": ["192.168.1.14"],
"rdma": [null, "rdma_en4"]
},
{
"ssh": "mac2.local",
"ips": [],
"rdma": ["rdma_en4", null]
}
]

Find your RDMA device name: `ibv_devinfo 2>/dev/null | grep -E "hca_id|state"` — look for PORT_ACTIVE.
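The `rdma` field reads like one row of a per-node connectivity matrix. Treating entry *j* as "the device this node uses to reach node *j*, with `null` on the diagonal" is an assumption of the following sketch, which is a hypothetical helper and not part of the repo:

```python
# check_hostfile.py: hypothetical sanity check, not part of the repo. It assumes
# rdma[i][j] is the device node i uses to reach node j (null for itself).
import json
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "hostfiles/hosts-2node.json"
with open(path) as f:
    nodes = json.load(f)

n = len(nodes)
for i, node in enumerate(nodes):
    row = node["rdma"]
    assert len(row) == n, f"node {i}: rdma row has {len(row)} entries, expected {n}"
    assert row[i] is None, f"node {i}: expected null for the node itself"
    for j, dev in enumerate(row):
        if i != j and dev is None:
            print(f"warning: node {i} lists no RDMA device toward node {j}")
print(f"hostfile shape looks consistent for {n} node(s)")
```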
make verify # SSH + RDMA device checks
make cluster-info # Side-by-side version/hardware alignment

make rdma-test

Expected output: correctness check → latency measurement → bandwidth sweep with GB/s readings. A healthy TB link shows > 5 GB/s.
# Download on Mac 1
source .venv/bin/activate
huggingface-cli download mlx-community/Qwen3-4B-Instruct-2507-4bit \
--local-dir ~/models_mlx/Qwen3-4B
# Sync to Mac 2
rsync -avz ~/models_mlx/Qwen3-4B/ mac2.local:~/models_mlx/Qwen3-4B/
# Start the cluster server
MODEL_DIR=~/models_mlx/Qwen3-4B make server

# Health check
make health
# Chat
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":64}'
# Dashboard
open http://localhost:8080/dashboard

Run `make help` for the full list. Key targets:
| Target | Description |
|---|---|
| `make setup` | Install deps on this node (uv + .venv) |
| `REMOTE=mac2 make bootstrap` | Set up a remote node over SSH |
| Target | Description |
|---|---|
| `make verify` | SSH + RDMA device checks on all nodes |
| `make cluster-info` | Side-by-side node alignment report |
| `make sync` | Pull latest code on all nodes |
| Target | Description | Duration |
|---|---|---|
| `make rdma-quick` | 5 rounds, small tensors — smoke test | ~10 s |
| `make rdma-test` | 20 rounds, 4 sizes — full benchmark | ~30 s |
| `make rdma-verbose` | Same as above with per-round timing | ~30 s |
| `make rdma-stress` | 100 rounds, large tensors — stability test | ~5 min |
Override defaults:
RDMA_ROUNDS=50 RDMA_SIZES=1048576,16777216,67108864 RDMA_VERBOSE=1 make rdma-test

| Target | Description |
|---|---|
| `MODEL_DIR=... make server` | Start the OpenAI-compatible cluster server |
| `make server-stop` | Stop the server on all nodes |
| `make server-restart` | Stop then start (requires MODEL_DIR) |
| `make health` | Check /health endpoint |
| `make models` | List served models |
| `make chat-test` | Send a test chat completion |
| `make queue` | Show request queue depth |
| `make dashboard` | Open the live dashboard in the default browser |
| `make metrics` | Show current metrics snapshot (JSON) |
| Target | Description |
|---|---|
| `MODEL=mlx-community/... make download` | Download a model from HuggingFace and rsync to all nodes |
| `make models-local` | List locally downloaded models with sizes |
| `MODEL_DIR=... make models-check` | Verify model exists on all nodes |
# Download and sync a model to the whole cluster in one command
MODEL=mlx-community/Qwen3-4B-Instruct-2507-4bit make download
# Then serve it
MODEL_DIR=~/models_mlx/Qwen3-4B-Instruct-2507-4bit make server

| Target | Description |
|---|---|
| `make status` | Full cluster snapshot: nodes, memory, RDMA, server, model |
| `make monitor` | Live-updating status (refreshes every 5 s, Ctrl+C to stop) |
| `make logs` | Tail the server log file |
| `make version` | Show version info for all components (mlx, Python, macOS, chip) |
| Target | Description |
|---|---|
| `make lint` | Syntax check (py_compile) + shellcheck on all code |
| `make test` | Full test suite: lint → RDMA quick → health check |
| `make loc` | Count lines of code by component |
| Target | Description |
|---|---|
| `make bench` | Distributed tokens/sec benchmark (requires MODEL_DIR) |
| `make kill-all` | Emergency stop — kill all MLX processes on all nodes |
| `make fingerprint` | Print this node's hardware/MLX info as JSON |
| `make clean` | Remove .venv locally |
| `make clean-all` | Remove .venv on all nodes |
The server (rank 0) exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | OpenAI chat completions (streaming + non-streaming) |
| `/v1/completions` | POST | OpenAI text completions (streaming + non-streaming) |
| `/v1/models` | GET | List served models |
| `/health` | GET | Cluster health (world size, queue depth) |
| `/queue` | GET | Request queue status |
| `/dashboard` | GET | Live HTMX dashboard |
| `/metrics/stream` | GET | SSE metrics stream (tok/s, latency, queue) |
| `/metrics/snapshot` | GET | Current metrics as JSON |
| `/docs` | GET | Auto-generated Swagger/OpenAPI docs |
| Variable | Default | Description |
|---|---|---|
| `MODEL_DIR` | (required) | Path to the MLX model directory |
| `HOSTFILE` | `hostfiles/hosts-2node.json` | Path to cluster hostfile |
| `MODEL_ID` | basename of `MODEL_DIR` | Model identifier for API responses |
| `HTTP_HOST` | `0.0.0.0` | HTTP server bind address |
| `HTTP_PORT` | `8080` | HTTP server port |
| `CTRL_HOST` | auto-detect from hostfile | Coordinator IP (rank 0 LAN IP) |
| `CTRL_PORT` | `18080` | Coordinator control-plane port |
| `QUEUE_MAX` | `8` | Max queued requests |
| `REQ_TIMEOUT` | `120` | Per-request timeout in seconds |
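As a sketch of how these defaults compose (names match the table above; the authoritative logic lives in `server/openai_cluster_server.py`):

```python
# Sketch of server configuration resolution; illustrative, not the repo's code.
import os

MODEL_DIR = os.environ["MODEL_DIR"]  # required, no default
HOSTFILE = os.environ.get("HOSTFILE", "hostfiles/hosts-2node.json")
MODEL_ID = os.environ.get("MODEL_ID", os.path.basename(MODEL_DIR.rstrip("/")))
HTTP_HOST = os.environ.get("HTTP_HOST", "0.0.0.0")
HTTP_PORT = int(os.environ.get("HTTP_PORT", "8080"))
# CTRL_HOST is auto-detected from the hostfile when unset (detection omitted here).
CTRL_PORT = int(os.environ.get("CTRL_PORT", "18080"))
QUEUE_MAX = int(os.environ.get("QUEUE_MAX", "8"))
REQ_TIMEOUT = float(os.environ.get("REQ_TIMEOUT", "120"))
```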
The server includes a built-in live dashboard at /dashboard:
- Cluster topology — rank, role, RDMA device, status for each node
- RDMA banner — Thunderbolt 5 / JACCL badge with peak bandwidth
- Live metrics — avg tok/s, peak tok/s, request count, latency (updated via SSE every 2s)
- Queue depth — visual bar with color coding (green → yellow → red)
- Sparkline — tok/s history over the last 40 generations
- Chat UI — full streaming chat interface, send messages directly from the dashboard
- Uptime — server uptime and total tokens generated
No build step. No Node.js. Pure HTMX + SSE served from Python.
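To consume the same metrics outside the browser, a small sketch that tails the SSE stream with `requests`; the exact payload fields are an assumption here, so compare against `/metrics/snapshot`:

```python
# metrics_tail.py: sketch of tailing the dashboard's SSE metrics stream.
# Assumes the default port and JSON payloads in "data:" lines; field names
# are not guaranteed here, check /metrics/snapshot for the real schema.
import json

import requests

with requests.get("http://localhost:8080/metrics/stream", stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if raw.startswith(b"data:"):
            event = json.loads(raw[len(b"data:"):].strip())
            print(event)
```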
These are passed to all nodes via mlx.launch --env:
| Variable | Description |
|---|---|
| `MLX_METAL_FAST_SYNCH=1` | Critical. Enables fast Metal synchronization. Without this, expect 5–6× slower inference. |
| `HF_HUB_OFFLINE=1` | Prevents HuggingFace from downloading models at runtime. |
| `TRANSFORMERS_OFFLINE=1` | Same for the transformers library. |
Why offline mode? In a distributed cluster, every node would attempt to download the model simultaneously — causing races, inconsistent states, and unpredictable startup. Always download once on rank 0, then rsync to other nodes.
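A condensed sketch of that download-once-then-sync flow (`make download` is the supported path; the model path and `mac2.local` hostname below are placeholders):

```python
# download_and_sync.py: condensed sketch; `make download` is the supported path.
import os
import subprocess

from huggingface_hub import snapshot_download

target = os.path.expanduser("~/models_mlx/Qwen3-4B-Instruct-2507-4bit")  # placeholder path
snapshot_download(
    repo_id="mlx-community/Qwen3-4B-Instruct-2507-4bit",
    local_dir=target,
)
# Ship the exact same bytes to every worker so all ranks load identical weights.
subprocess.run(["rsync", "-avz", f"{target}/", f"mac2.local:{target}/"], check=True)
```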
| Document | Description |
|---|---|
| docs/architecture.md | Deep technical architecture: data plane, control plane, process model, request lifecycle |
| docs/roadmap.md | Feature roadmap: gap analysis vs exo, dashboard v2 design, tool support plan, phased priorities |
| docs/from-scratch.md | Full setup guide: RDMA enablement → uv install → hostfile → RDMA test → model download → server |
| docs/comparison-vs-exo.md | Deep comparison with exo — architecture, failure modes, benchmarks, closing the gap |
| docs/scripts-reference.md | Complete reference for all scripts and Makefile targets |
┌─────────────────────────────────────────────────────────────┐
│ Mac 1 (rank 0) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ FastAPI + │ │ mlx_lm │ │ Dashboard │ │
│ │ uvicorn │──▶│ .generate() │ │ (HTMX + SSE) │ │
│ │ :8080 │ │ (rank 0 │ │ /dashboard │ │
│ │ │ │ shards) │ │ │ │
│ └──────────────┘ └──────┬───────┘ └────────────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ TCP ctrl-plane │ RDMA all_sum │ │
│ │ :18080 │ (JACCL/TB) │ │
│ │ {"type":"task"} │ ~8 GB/s │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
└────────────────────────────┼────────────────────────────────┘
Thunderbolt ════╪════ rdma_en4
┌────────────────────────────┼────────────────────────────────┐
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ TCP ctrl-plane │ RDMA all_sum │ │
│ │ worker_loop() │ (JACCL/TB) │ │
│ │ recv task → │ ~8 GB/s │ │
│ │ generate() → │ │ │
│ │ send done │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ ┌──────┴───────┐ │
│ │ mlx_lm │ │
│ │ .generate() │ │
│ │ (rank 1 │ │
│ │ shards) │ │
│ └──────────────┘ │
│ │
│ Mac 2 (rank 1) │
└─────────────────────────────────────────────────────────────┘
Data path: Tensor-parallel collective operations (all_sum) run over RDMA at ~8 GB/s.
Control path: Rank 0 broadcasts {"type":"task","prompt":"...","max_tokens":N} to workers over a simple TCP socket. Workers call generate() (which triggers the RDMA collectives), then reply {"type":"done"}. This is ~120 lines of code total.
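A condensed sketch of the worker side of that handshake, illustrative only: the real loop lives in `server/openai_cluster_server.py`, and the newline-delimited JSON framing is an assumption of this sketch.

```python
# worker_loop sketch: a rank > 0 node receives a task over TCP, runs the same
# generate() call as rank 0 (which fires the RDMA collectives), replies "done".
import json
import socket


def worker_loop(model, tokenizer, ctrl_host: str, ctrl_port: int = 18080) -> None:
    from mlx_lm import generate  # every rank must enter the same generate() call

    with socket.create_connection((ctrl_host, ctrl_port)) as sock:
        rfile = sock.makefile("r")
        wfile = sock.makefile("w")
        for line in rfile:  # one JSON message per line (framing assumed)
            msg = json.loads(line)
            if msg.get("type") != "task":
                continue
            # The tensor-parallel all_sum calls inside generate() synchronize
            # this worker with rank 0 over RDMA.
            generate(model, tokenizer, prompt=msg["prompt"],
                     max_tokens=msg["max_tokens"])
            wfile.write(json.dumps({"type": "done"}) + "\n")
            wfile.flush()
```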
- Confirm RDMA is enabled: boot into macOS Recovery → `rdma_ctl enable` → reboot
- Check devices: `ibv_devinfo` — look for `PORT_ACTIVE`
- Make sure `MLX_METAL_FAST_SYNCH=1` is set (without it, bandwidth is 5–6× lower)
- Re-seat the Thunderbolt cable
Both nodes must have matching environments. Run make cluster-info and check for yellow/red mismatches.
All ranks must enter generate() per request. Confirm workers connected:
# Check server logs for "all workers connected"
# If not, check CTRL_HOST and CTRL_PORT are reachable from workers

The server passes HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 to all nodes. If you're still seeing downloads, the model path may be wrong — confirm MODEL_DIR exists on all nodes.
make kill-all # kills all MLX processes on all nodes
make server-stop # just the server

make clean # or: make clean-all (all nodes)
make setup

We're actively closing the gap between our minimal-but-working approach and exo's richer feature set — without sacrificing reliability, simplicity, or the zero-build-toolchain promise.
Planned in upcoming phases:
| Phase | What | Status |
|---|---|---|
| 1 — Observability | Live RAM/memory per node, RDMA link probe, worker health detection, Prometheus `/metrics`, `make status` / `make monitor` | 🔜 Next |
| 2 — Dashboard v2 | Memory gauges, SVG topology graph, request history table, error log, responsive layout | Planned |
| 3 — Model Management | `make download MODEL=...` with progress + auto-sync to all nodes, model registry | ✅ Makefile targets shipped |
| 4 — Tool Support | Function calling (`tools=`), structured output (`response_format`), sampling params (`temperature`, `top_p`, `stop`) | Planned |
| 5 — API Parity | Ollama API compatibility (`/api/generate`, `/api/chat`), client SDK testing | Planned |
Full details, wireframes, and architecture decisions: docs/roadmap.md
Deep architecture reference: docs/architecture.md
- `mlx_lm.serve` is single-host only. This repo's server runs HTTP on rank 0 while all ranks participate in sharded compute.
- For 4 nodes, JACCL requires a fully connected Thunderbolt mesh (6 cables total).
- RDMA must be enabled in macOS Recovery on each Mac (`rdma_ctl enable`).
- The dashboard requires no build step — it's pure HTMX + SSE served from `dashboard.py`.
MIT