This guide sets up a multi‑Mac, fully connected Thunderbolt mesh using MLX JACCL (RDMA over Thunderbolt) and runs distributed jobs via mlx.launch --backend jaccl.
For 4 nodes, JACCL requires a fully connected mesh:
- 6 Thunderbolt cables total (every pair directly connected)
For 2 nodes:
- 1 Thunderbolt cable directly between the two Macs
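The cable counts above follow from the mesh requirement: a fully connected mesh needs one direct link per pair of nodes, which is n(n-1)/2 cables for n nodes. A minimal sketch that lists the required links (the function names and host names are illustrative, not part of this repo):

```python
from itertools import combinations


def mesh_links(hosts):
    """Return every unordered pair of hosts; each pair needs one direct cable."""
    return list(combinations(hosts, 2))


def cables_needed(n):
    """Number of cables for a fully connected mesh of n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical host names, used only to print a wiring checklist.
hosts = ["mac1", "mac2", "mac3", "mac4"]
for a, b in mesh_links(hosts):
    print(f"{a} <-> {b}")
print(cables_needed(len(hosts)), "cables")
```

Printing the pairs gives a checklist to tick off while cabling.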
RDMA over Thunderbolt must be enabled locally in macOS Recovery:
- Boot into Recovery
- Open Terminal
- Run:
rdma_ctl enable
- Reboot
- Verify:
ibv_devices
You should see rdma_en* devices (e.g. rdma_en3, rdma_en4, rdma_en5).
This project uses uv instead of conda — it is faster, lighter, and requires no extra tool beyond Homebrew.
Do this on each Mac:
brew install uv
From the repo root:
./scripts/setup.sh
This will:
- Create a .venv virtualenv in the repo root (Python 3.12)
- Install all dependencies: mlx, mlx-lm, fastapi, uvicorn, transformers, tokenizers, mistral_common, huggingface_hub
- Verify all packages are importable
- Check for RDMA devices
If you prefer to do it step by step:
# Create virtualenv
uv venv .venv --python 3.12
# Activate
source .venv/bin/activate
# Install dependencies
uv pip install "mlx>=0.30.4" "mlx-lm>=0.30.5"
uv pip install "fastapi>=0.110.0" "uvicorn[standard]>=0.29.0" "pydantic>=2.0"
uv pip install "transformers>=4.50.0" tokenizers mistral_common "huggingface_hub[cli]"
# Check installed versions (a venv created by uv has no pip module, so use uv pip)
source .venv/bin/activate
uv pip show mlx mlx-lm transformers | grep -E "Name|Version"
JACCL uses RDMA for the data path, but needs a TCP coordinator address that all nodes can reach.
On rank0, prefer Ethernet:
ipconfig getifaddr en0
Copy the appropriate template:
# For 2 nodes:
cp hostfiles/hosts-2node.json hostfiles/hosts.json
# For 4 nodes:
cp hostfiles/hosts.json.example hostfiles/hosts.json
Edit hostfiles/hosts.json:
- set ssh hostnames (e.g. mac1.local, mac2.local, …)
- set rank0 "ips": ["<rank0_lan_ip>"]
- keep the rdma matrix consistent with your wiring (use ibv_devices to find device names)
hostfiles/hosts.json is ignored by git.
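A quick consistency check on the edited hostfile can catch wiring typos before launch. The sketch below assumes a layout this guide does not spell out: a JSON list with one object per rank, each holding "ssh", "ips", and an "rdma" list with one slot per rank (null for the rank itself, a device name for every peer). If your template differs, adjust the checks. The function name and example values are hypothetical.

```python
def validate_hostfile(hosts):
    """Return a list of problems in a parsed hostfile (empty list = looks OK).

    Assumed layout: one dict per rank with "ssh", "ips", and an "rdma" list
    that has one slot per rank (None for self, a device name for each peer).
    """
    problems = []
    n = len(hosts)
    for rank, host in enumerate(hosts):
        name = host.get("ssh") or f"rank{rank}"
        if not host.get("ssh"):
            problems.append(f"rank {rank}: missing 'ssh' hostname")
        if rank == 0 and not host.get("ips"):
            problems.append("rank 0: missing coordinator 'ips'")
        rdma = host.get("rdma")
        if not isinstance(rdma, list) or len(rdma) != n:
            problems.append(f"{name}: 'rdma' must list {n} entries")
            continue
        for peer, dev in enumerate(rdma):
            if peer == rank and dev is not None:
                problems.append(f"{name}: own slot should be null")
            elif peer != rank and not (isinstance(dev, str) and dev.startswith("rdma_")):
                problems.append(f"{name}: no RDMA device for rank {peer}")
    return problems


# Hypothetical 2-node hostfile contents (as returned by json.load).
example = [
    {"ssh": "mac1.local", "ips": ["192.168.1.10"], "rdma": [None, "rdma_en3"]},
    {"ssh": "mac2.local", "ips": [], "rdma": ["rdma_en3", None]},
]
print(validate_hostfile(example) or "hostfile looks consistent")
```

To check a real file, pass the result of json.load on hostfiles/hosts.json instead of the inline example.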
HOSTFILE=hostfiles/hosts-2node.json ./scripts/verify_cluster.sh
This checks:
- SSH connectivity to each node
- RDMA devices present on each node (ibv_devices)
Note: this does not send data over RDMA. It only checks SSH + device presence.
Run the minimal RDMA test to confirm actual data flows over Thunderbolt between both Macs:
.venv/bin/mlx.launch --backend jaccl \
--hostfile hostfiles/hosts-2node.json \
--env MLX_METAL_FAST_SYNCH=1 -- \
scripts/rdma_test.py
Expected output on rank0:
- Phase 0: barrier smoke test (all ranks reached barrier)
- Phase 1: correctness check on all_sum results
- Phase 2: latency of a 1-element all_sum in µs
- Phase 3: bandwidth sweep across tensor sizes with GB/s readings
A healthy TB5 RDMA link should show > 5 GB/s peak bandwidth.
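For reading the Phase 2 and Phase 3 numbers, here is a sketch of how latency and bandwidth figures of this kind are commonly derived from raw timings. The exact formulas used by scripts/rdma_test.py are not shown in this guide, so treat the definitions below as assumptions; only the 5 GB/s threshold comes from the text above. All function names and sample timings are hypothetical.

```python
def latency_us(total_seconds, rounds):
    """Average time per operation in microseconds."""
    return total_seconds / rounds * 1e6


def bandwidth_gb_per_s(num_bytes, total_seconds, rounds):
    """Payload bytes moved per second, in GB/s (1 GB = 1e9 bytes).

    Assumption: counts the tensor payload once per round; a real benchmark
    may scale this differently for collective operations.
    """
    return num_bytes * rounds / total_seconds / 1e9


def looks_healthy(peak_gb_per_s, threshold=5.0):
    """Compare a peak reading against the > 5 GB/s guideline for TB5."""
    return peak_gb_per_s > threshold


# Hypothetical timing: 50 rounds of a 64 MiB tensor taking 0.5 s in total.
size = 64 * 1024 * 1024
bw = bandwidth_gb_per_s(size, 0.5, 50)
print(f"{bw:.2f} GB/s, healthy={looks_healthy(bw)}")
```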
Optional env vars:
RDMA_ROUNDS=50 RDMA_VERBOSE=1 \
.venv/bin/mlx.launch --backend jaccl \
--hostfile hostfiles/hosts-2node.json \
--env MLX_METAL_FAST_SYNCH=1 -- \
scripts/rdma_test.py
The same model path must exist on every node. Download once on rank0, then sync to other nodes.
Download a model from HuggingFace to a local directory:
# Activate the venv first
source .venv/bin/activate
huggingface-cli download mlx-community/Qwen3-4B-Instruct-2507-4bit \
--local-dir ~/models_mlx/Qwen3-4B-Instruct-2507-4bit
Sync to other nodes:
# Replace paths and hostnames with your actual values
ssh mac2.local "mkdir -p ~/models_mlx"
rsync -avz -e ssh ~/models_mlx/Qwen3-4B-Instruct-2507-4bit/ \
mac2.local:/Users/yourusername/models_mlx/Qwen3-4B-Instruct-2507-4bit/
Verify all nodes have the model:
# MODEL_DIR must point at the model path used above
MODEL_DIR=~/models_mlx/Qwen3-4B-Instruct-2507-4bit
HOSTS=$(python3 -c "import json; print(' '.join(h['ssh'] for h in json.load(open('hostfiles/hosts.json'))))")
for h in $HOSTS; do
echo -n "$h: "
ssh "$h" "test -d '$MODEL_DIR' && echo OK || echo MISSING"
done
Tip: For large models (100GB+), copying via an external SSD is much faster than rsync over the network.
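The per-host model check can also be sketched in Python, which makes the quoting of the remote path explicit. This is an illustrative alternative to the shell loop, not a script shipped with the repo: check_command and missing_hosts are hypothetical names, and the only hostfile detail relied on is the "ssh" key that the shell loop already reads.

```python
import shlex
import subprocess


def check_command(host, model_dir):
    """Build the ssh argv that tests whether model_dir exists on host.

    Pass an absolute path: shlex.quote also quotes '~', so the remote shell
    would not expand it.
    """
    return ["ssh", host, f"test -d {shlex.quote(model_dir)}"]


def missing_hosts(entries, model_dir, run=subprocess.run):
    """Return the ssh names of hosts where the check did not succeed.

    entries is the parsed hostfile (a list of dicts with an "ssh" key).
    run is injectable so the logic can be exercised without real ssh.
    Note: an ssh connection failure is also reported as missing.
    """
    missing = []
    for entry in entries:
        result = run(check_command(entry["ssh"], model_dir))
        if result.returncode != 0:
            missing.append(entry["ssh"])
    return missing


print(check_command("mac2.local", "/Users/yourusername/models_mlx/my model"))
```

To run it for real, load hostfiles/hosts.json with json.load and call missing_hosts(entries, "/absolute/path/to/model"); an empty list means every node has the directory.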
.venv/bin/mlx.launch --verbose --backend jaccl \
--hostfile hostfiles/hosts-2node.json \
--env MLX_METAL_FAST_SYNCH=1 \
--env HF_HUB_OFFLINE=1 \
--env TRANSFORMERS_OFFLINE=1 -- \
scripts/jaccl_tps_bench.py \
--model "$MODEL_DIR" \
--prompt "Write 5 sentences about Thunderbolt RDMA." \
--max-tokens 256
Rank0 prints prompt_tokens, gen_tokens, seconds, tokens_per_sec.
Start:
MODEL_DIR=~/models_mlx/your-model-name ./scripts/run_openai_cluster_server.sh
Or with custom settings:
MODEL_DIR=~/models_mlx/your-model-name \
HTTP_PORT=8000 \
HOSTFILE=hostfiles/hosts-2node.json \
./scripts/run_openai_cluster_server.sh
Stop:
./scripts/stop_openai_cluster_server.sh
Test (replace 8080 with your HTTP_PORT value if you changed it):
curl -s http://<rank0-host>:8080/v1/models
curl -s http://<rank0-host>:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"<MODEL_ID>","messages":[{"role":"user","content":"hello"}],"max_tokens":64}'
curl -s http://<rank0-host>:8080/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model":"<MODEL_ID>","prompt":"Hello","max_tokens":64}'
These environment variables are passed to all nodes via mlx.launch --env:
| Variable | Description |
|---|---|
| MLX_METAL_FAST_SYNCH=1 | Critical for performance. Enables fast Metal synchronization. Without this, you may see 5-6x slower inference speeds. |
| HF_HUB_OFFLINE=1 | Prevents automatic model downloads. |
| TRANSFORMERS_OFFLINE=1 | Prevents automatic model downloads. |
The HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 flags prevent HuggingFace from automatically downloading models. This is critical for distributed clusters because:
- All nodes would download simultaneously — wasteful and slow
- Nodes may have different network access — some might fail while others succeed
- Race conditions — nodes may end up with inconsistent model states
- Unpredictable startup times — downloading large models can take a long time
Best practice: Always download models once on rank0, then sync to all other nodes (see step 7 above), and run with offline mode enabled.
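To avoid forgetting one of these variables, the launch command can be assembled programmatically. The sketch below only uses mlx.launch flags that already appear in this guide (--backend, --hostfile, --env, and the -- separator); the helper name launch_command and the /path/to/model placeholder are hypothetical.

```python
import shlex

# The three variables from the table above.
DEFAULT_ENV = {
    "MLX_METAL_FAST_SYNCH": "1",
    "HF_HUB_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
}


def launch_command(hostfile, script, script_args=(), env=None,
                   launcher=".venv/bin/mlx.launch"):
    """Build the mlx.launch argv with one --env flag per variable."""
    env = DEFAULT_ENV if env is None else env
    argv = [launcher, "--backend", "jaccl", "--hostfile", hostfile]
    for key, value in env.items():
        argv += ["--env", f"{key}={value}"]
    # Everything after "--" is the script and its own arguments.
    argv += ["--", script, *script_args]
    return argv


cmd = launch_command(
    "hostfiles/hosts-2node.json",
    "scripts/jaccl_tps_bench.py",
    ["--model", "/path/to/model", "--max-tokens", "256"],  # placeholder path
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run, or printed with shlex.join and pasted into a terminal.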
For sharded distributed inference, all ranks must enter generate() per request.
- Confirm all nodes are running the server (rank0 + workers)
- Confirm the server control-plane port is reachable (CTRL_PORT)

- Confirm rdma_ctl enable was run in Recovery on both Macs
- Run ibv_devices on each Mac; you must see rdma_en* entries
- Confirm the Thunderbolt cable is seated properly
- Try MLX_METAL_FAST_SYNCH=1; without it bandwidth will appear 5-6x lower
Pass offline env vars via mlx.launch --env:
- HF_HUB_OFFLINE=1
- TRANSFORMERS_OFFLINE=1
./scripts/stop_openai_cluster_server.sh
# If needed, also kill any other MLX processes:
HOSTS=$(python3 -c "import json; print(' '.join(h['ssh'] for h in json.load(open('hostfiles/hosts.json'))))")
for h in $HOSTS; do
ssh "$h" 'pkill -f "python.*mlx" || true'
done
# Remove old venv and start fresh
rm -rf .venv
./scripts/setup.sh