Living document. Last updated: 2025-07-14
This document sketches what we have, what we're missing relative to exo, and the phased plan to close the gap — dashboard v2, tool support, model management, observability, and beyond.
- 1. Current State
- 2. Gap Analysis vs exo
- 3. Priority Matrix
- 4. Phase 1 — Observability & Monitoring
- 5. Phase 2 — Dashboard v2
- 6. Phase 3 — Model Management
- 7. Phase 4 — Tool Support & Structured Output
- 8. Phase 5 — API Parity & Ecosystem
- 9. Phase 6 — Advanced Inference
- 10. Non-Goals
- 11. Architecture Decisions
- 12. Dashboard v2 — Wireframe
- 13. Implementation Notes
- 14. Dependency Budget
- 15. Success Criteria
| Area | Status | Details |
|---|---|---|
| RDMA transport | ✅ Production-ready | 8.05 GB/s peak, 25.5 µs latency, stress-tested |
| Tensor-parallel inference | ✅ Working | mlx_lm.sharded_load across 2 nodes |
| OpenAI-compatible API | ✅ Working | /v1/chat/completions, /v1/completions, streaming SSE |
| Dashboard | ✅ Basic | HTMX+SSE, tok/s sparkline, queue depth, chat UI |
| Cluster tooling | ✅ Solid | Makefile, setup, bootstrap, verify, sync, benchmarks |
| RDMA test suite | ✅ Comprehensive | Correctness, latency, bandwidth, stress modes |
| Documentation | ✅ Good | Quickstart, from-scratch, comparison, scripts reference |
| Area | Status | Impact |
|---|---|---|
| Live RAM / memory monitoring | ❌ | Can't see if we're approaching OOM during inference |
| Live RDMA link health | ❌ | Static label only — no real-time bandwidth probe |
| Model download & management | ❌ | Manual huggingface-cli + rsync every time |
| Tool calls / function calling | ❌ | Can't use with agents, LangChain, OpenAI SDK tools |
| Structured output / JSON mode | ❌ | No response_format support |
| KV prefix cache | ❌ | Every request re-processes the full prompt |
| Ollama API compatibility | ❌ | Can't use with Ollama-native clients |
| Multi-model serving | ❌ | One model per server instance |
| Request logging / tracing | ❌ | No persistent logs, no request history |
| Prometheus / Grafana export | ❌ | No standard metrics format |
| Node failure detection | ❌ | Worker disconnect = silent hang |
| Image generation | ❌ | No Flux / image pipeline |
Detailed feature-by-feature comparison showing what exo has, what we have, and whether closing the gap makes sense for our use case.
| Feature | exo | Us (v0.1) | Gap | Priority |
|---|---|---|---|---|
| /v1/chat/completions | ✅ | ✅ | — | — |
| /v1/completions | ✅ | ✅ | — | — |
| SSE streaming | ✅ | ✅ | — | — |
| Tool calls / function calling | ✅ | ❌ | Big | 🔴 High |
| Structured output (response_format) | ✅ | ❌ | Big | 🔴 High |
| temperature, top_p, top_k | ✅ | ❌ | Medium | 🟡 Medium |
| stop sequences | ✅ | ❌ | Medium | 🟡 Medium |
| n (multiple completions) | ✅ | ❌ | Small | 🟢 Low |
| logprobs | ✅ | ❌ | Small | 🟢 Low |
| Token usage in streaming | ✅ | ❌ | Medium | 🟡 Medium |
| Ollama /api/generate | ✅ | ❌ | Medium | 🟡 Medium |
| Ollama /api/chat | ✅ | ❌ | Medium | 🟡 Medium |
| /v1/embeddings | ❌ | ❌ | — | Future |
| Feature | exo | Us (v0.1) | Gap | Priority |
|---|---|---|---|---|
| Live tok/s + sparkline | ✅ | ✅ | — | — |
| Cluster topology table | ✅ | ✅ | — | — |
| Chat UI (streaming) | ✅ | ✅ | — | — |
| Queue depth indicator | ✅ via inference | ✅ | — | — |
| RAM / unified memory usage | ❌ | ❌ | Both miss | 🔴 High |
| Live RDMA bandwidth probe | ❌ | ❌ static | We should own this | 🔴 High |
| Per-node GPU memory | ❌ | ❌ | Both miss | 🔴 High |
| D3.js topology graph | ✅ animated | ❌ | Medium | 🟡 Medium |
| Model download progress | ✅ | ❌ | Medium | 🟡 Medium |
| Token heatmap / attention | ✅ | ❌ | Small | 🟢 Low |
| Generation traces | ✅ | ❌ | Small | 🟢 Low |
| Error log viewer | ❌ | ❌ | Medium | 🟡 Medium |
| Request history table | ❌ | ❌ | Medium | 🟡 Medium |
| Prometheus /metrics | ❌ | ❌ | Medium | 🟡 Medium |
| Feature | exo | Us (v0.1) | Gap | Priority |
|---|---|---|---|---|
| Built-in model download | ✅ coordinator | ❌ manual CLI | Big | 🔴 High |
| Download progress tracking | ✅ per-shard | ❌ | Big | 🔴 High |
| Auto-sync to all nodes | ✅ via tasks | ❌ manual rsync | Medium | 🟡 Medium |
| Model registry / list | ✅ in code | ❌ | Medium | 🟡 Medium |
| Hot model swap | ❌ | ❌ | — | Future |
| Feature | exo | Us (v0.1) | Gap | Priority |
|---|---|---|---|---|
| Auto-discovery (libp2p) | ✅ | ❌ explicit | Not a gap | — |
| Leader election | ✅ | ❌ | Not needed (2 nodes) | — |
| Node health monitoring | Partial | ❌ | Medium | 🟡 Medium |
| Worker disconnect detection | ✅ | ❌ | Medium | 🟡 Medium |
| Graceful shutdown | Partial | ❌ | Medium | 🟡 Medium |
| Server logs (persistent) | ❌ | ❌ | Medium | 🟡 Medium |
| CI / automated tests | Partial | ❌ | Medium | 🟡 Medium |
| Advantage | Details |
|---|---|
| RDMA actually works | 8.05 GB/s proven; exo's auto-mapping is broken on M4 Pro |
| Deterministic startup | No race conditions, no election timeouts |
| Zero build toolchain | No Rust, no Node.js, no npm, no Swift |
| Debuggable | ~2K lines; any failure is traceable in minutes |
| Stock MLX | Uses official mlx from PyPI — no custom forks |
| 3-minute setup | make setup → make rdma-test → make server |
| Explicit configuration | Hostfile gives full control; no magic |
Quadrant view — Impact (value to daily use) vs Effort (implementation complexity).
                         HIGH IMPACT
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
           │ RAM monitoring   │ Tool calls       │
           │ RDMA live probe  │ Structured out   │
           │ Model download   │ Ollama compat    │
           │ Worker health    │ KV prefix cache  │
           │                  │                  │
LOW EFFORT ├──────────────────┼──────────────────┤ HIGH EFFORT
           │                  │                  │
           │ Sampling params  │ D3 topology      │
           │ Stop sequences   │ Token heatmap    │
           │ Error log view   │ Image generation │
           │ Request history  │ Pipeline parallel│
           │ Prometheus       │ Multi-model      │
           │                  │                  │
           └──────────────────┼──────────────────┘
                              │
                         LOW IMPACT
- Do first (top-left): high impact, low effort (observability, model management)
- Do next (top-right): high impact, high effort (tool calls, Ollama, KV cache)
- Do later (bottom-left): low impact, low effort (sampling params, logging)
- Probably never (bottom-right): low impact, high effort (image gen, pipeline parallel)
Goal: See everything happening in the cluster in real time. Effort: ~2–3 days. Impact: Transforms daily operations.
Problem: We have 48 GB unified memory per node but no visibility into usage during inference. A large model + long context can silently approach OOM and crash.
Design:
┌─────────────────────────────────────────────────────────┐
│ Memory (rank 0 — mac.home) │
│ ┌───────────────────────────────────────┐ 37.4 / 48 GB│
│ │████████████████████████████░░░░░░░░░░░│ 78% used │
│ └───────────────────────────────────────┘ │
│ Model: 14.2 GB │ KV cache: 2.1 GB │ OS: 21.1 GB │
│ │
│ Memory (rank 1 — mac2) │
│ ┌───────────────────────────────────────┐ 36.8 / 48 GB│
│ │███████████████████████████░░░░░░░░░░░░│ 77% used │
│ └───────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Data source — per-node memory probe:
# Runs on every rank, reports to rank 0 via control-plane
import resource

import mlx.core as mx


def memory_snapshot() -> dict:
    info = mx.device_info()
    return {
        "total_gb": round(info["memory_size"] / (1024**3), 1),
        "working_set_gb": round(info["max_recommended_working_set_size"] / (1024**3), 1),
        "cache_gb": round(mx.metal.get_cache_memory() / (1024**3), 2),
        "active_gb": round(mx.metal.get_active_memory() / (1024**3), 2),
        "peak_gb": round(mx.metal.get_peak_memory() / (1024**3), 2),
        # ru_maxrss is the peak resident set size; macOS reports it in bytes
        "rss_gb": round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024**3), 2),
    }
Transport: Workers periodically send memory snapshots to rank 0 over the existing TCP control-plane (new message type: {"type": "metrics", "memory": {...}}). No new connections needed.
Dashboard integration: New panel in dashboard, updated via SSE every 2 seconds. Color-coded bars (green < 70%, yellow 70–85%, red > 85%).
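The color thresholds above are simple enough to pin down in a few lines. The following is a minimal sketch; the function name `usage_color` is hypothetical, and the 70–85% band is treated as inclusive at both ends.

```python
def usage_color(used_gb: float, total_gb: float) -> str:
    """Map memory usage to the bar color described for the dashboard panel."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    pct = 100.0 * used_gb / total_gb
    if pct < 70.0:
        return "green"
    if pct <= 85.0:
        return "yellow"
    return "red"
```

With the figures from the mock-up, `usage_color(37.4, 48.0)` falls in the yellow band.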
Problem: The dashboard shows a static "~8 GB/s" label. We need to know if the link is degraded or down.
Design:
┌─────────────────────────────────────────────────────────┐
│ RDMA Link │
│ rdma_en4 ←→ rdma_en4 │
│ ┌────────────────────────────────┐ │
│ │████████████████████████████████│ 7.92 GB/s │
│ └────────────────────────────────┘ │
│ Latency: 26.1 µs │ Status: ● ACTIVE │ Last check: 3s │
└─────────────────────────────────────────────────────────┘
Implementation approach — lightweight background probe:
- A background thread on rank 0 performs a small all_sum (e.g., 4 KB) every 10 seconds
- Measures round-trip latency
- Every 60 seconds, performs a larger probe (4 MB) to estimate bandwidth
- Reports results to the dashboard via the existing MetricsStore
- Does NOT interfere with inference (runs only when queue is empty)
Key constraint: The probe must NOT run during active generation. Use a lock shared with the _queue_worker to ensure mutual exclusion.
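The probe cadence (latency every 10 seconds, bandwidth every 60 seconds, and nothing while requests are queued) could be decided by a small pure function that the background loop calls on each tick. This is only a sketch; `due_probes` and its arguments are hypothetical names, and the real loop would still need the shared lock described above.

```python
LATENCY_INTERVAL_S = 10.0    # small probe cadence from the design
BANDWIDTH_INTERVAL_S = 60.0  # large probe cadence from the design


def due_probes(now: float, last_latency_at: float, last_bandwidth_at: float,
               queue_empty: bool) -> list[str]:
    """Return which probes should run at time `now` (seconds, any monotonic clock)."""
    if not queue_empty:
        return []  # never compete with active generation
    due = []
    if now - last_latency_at >= LATENCY_INTERVAL_S:
        due.append("latency")
    if now - last_bandwidth_at >= BANDWIDTH_INTERVAL_S:
        due.append("bandwidth")
    return due
```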
# Pseudocode for RDMA health probe
import asyncio
import time

import mlx.core as mx


class RDMAProbe:
    def __init__(self, world, generation_lock: asyncio.Lock):
        self.world = world
        self.last_bw_gbps = 0.0
        self.last_latency_us = 0.0
        self.link_active = True
        self._generation_lock = generation_lock  # the same lock object the queue worker holds

    async def probe_latency(self):
        """4-byte all_sum: measures pure round-trip."""
        async with self._generation_lock:
            x = mx.ones(1)
            mx.eval(x)  # materialize the input before starting the timer
            t0 = time.perf_counter()
            y = mx.distributed.all_sum(x)
            mx.eval(y)  # MLX is lazy: evaluating the result is what runs the collective
            self.last_latency_us = (time.perf_counter() - t0) * 1e6

    async def probe_bandwidth(self):
        """4 MB all_sum: estimates sustained bandwidth."""
        async with self._generation_lock:
            x = mx.ones(1_048_576)  # 4 MB float32
            mx.eval(x)
            t0 = time.perf_counter()
            y = mx.distributed.all_sum(x)
            mx.eval(y)
            elapsed = time.perf_counter() - t0
            self.last_bw_gbps = (4.0 / 1024) / elapsed  # GB/s
            del x, y
            mx.clear_cache()
Problem: If a worker process dies or the TB cable is unplugged, rank 0 hangs forever in rank0_wait_done().
Design:
- TCP control-plane sockets get a heartbeat: workers send {"type": "heartbeat"} every 5 seconds
- Rank 0 tracks last heartbeat time per worker
- If no heartbeat for 15 seconds → mark worker as DISCONNECTED
- Dashboard shows per-node status: ACTIVE / DEGRADED / DISCONNECTED
- On disconnect, queued requests get a 503 error instead of hanging forever
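A minimal sketch of the rank 0 bookkeeping follows. The 5-second interval and 15-second disconnect timeout come from the design above. The DEGRADED threshold (two missed intervals) is an assumption, since the design does not define it, and `HeartbeatTracker` is a hypothetical name.

```python
import time

HEARTBEAT_INTERVAL_S = 5.0                   # from the design
DISCONNECT_AFTER_S = 15.0                    # from the design
DEGRADED_AFTER_S = 2 * HEARTBEAT_INTERVAL_S  # assumption: not specified in the design


class HeartbeatTracker:
    """Tracks the last heartbeat per worker rank and derives a status."""

    def __init__(self, ranks):
        self._last_seen = {rank: None for rank in ranks}

    def record(self, rank, now=None):
        """Call when a {"type": "heartbeat"} message arrives from `rank`."""
        self._last_seen[rank] = time.monotonic() if now is None else now

    def status(self, rank, now=None) -> str:
        now = time.monotonic() if now is None else now
        last = self._last_seen[rank]
        if last is None or now - last >= DISCONNECT_AFTER_S:
            return "DISCONNECTED"
        if now - last >= DEGRADED_AFTER_S:
            return "DEGRADED"
        return "ACTIVE"
```

The optional `now` argument exists only to make the logic testable without sleeping; the HTTP layer would map a DISCONNECTED worker to the 503 response.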
Problem: No way to see past requests, errors, or debug failed generations.
Design:
- Ring buffer of last 200 requests with: timestamp, kind, prompt (truncated), tokens, latency, status
- Ring buffer of last 50 errors with: timestamp, error type, message, traceback
- Exposed via:
  - GET /requests → JSON array of recent requests
  - GET /errors → JSON array of recent errors
  - Dashboard panel with scrollable table
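Both ring buffers map directly onto `collections.deque` with a `maxlen`. The sketch below assumes an 80-character prompt truncation (the design only says "truncated") and uses a hypothetical `RequestLog` class; the two accessors return newest-first lists suitable for the JSON endpoints.

```python
from collections import deque


class RequestLog:
    """Bounded in-memory history of requests and errors."""

    def __init__(self, max_requests=200, max_errors=50, prompt_chars=80):
        self._requests = deque(maxlen=max_requests)  # oldest entries drop automatically
        self._errors = deque(maxlen=max_errors)
        self._prompt_chars = prompt_chars            # truncation length is an assumption

    def add_request(self, timestamp, kind, prompt, tokens, latency_s, status):
        self._requests.append({
            "timestamp": timestamp,
            "kind": kind,
            "prompt": prompt[: self._prompt_chars],
            "tokens": tokens,
            "latency_s": latency_s,
            "status": status,
        })

    def add_error(self, timestamp, error_type, message, traceback_text):
        self._errors.append({
            "timestamp": timestamp,
            "error_type": error_type,
            "message": message,
            "traceback": traceback_text,
        })

    def recent_requests(self):
        return list(reversed(self._requests))  # newest first

    def recent_errors(self):
        return list(reversed(self._errors))
```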
Design:
GET /metrics
# HELP mlx_cluster_requests_total Total inference requests
# TYPE mlx_cluster_requests_total counter
mlx_cluster_requests_total 1423
# HELP mlx_cluster_tokens_generated_total Total tokens generated
# TYPE mlx_cluster_tokens_generated_total counter
mlx_cluster_tokens_generated_total 182947
# HELP mlx_cluster_tokens_per_second Current tokens per second
# TYPE mlx_cluster_tokens_per_second gauge
mlx_cluster_tokens_per_second 62.3
# HELP mlx_cluster_queue_depth Current queue depth
# TYPE mlx_cluster_queue_depth gauge
mlx_cluster_queue_depth 2
# HELP mlx_cluster_memory_used_bytes Unified memory used per rank
# TYPE mlx_cluster_memory_used_bytes gauge
mlx_cluster_memory_used_bytes{rank="0"} 40265318400
mlx_cluster_memory_used_bytes{rank="1"} 39528046592
# HELP mlx_cluster_rdma_bandwidth_gbps Last measured RDMA bandwidth
# TYPE mlx_cluster_rdma_bandwidth_gbps gauge
mlx_cluster_rdma_bandwidth_gbps 7.92
# HELP mlx_cluster_rdma_latency_us Last measured RDMA latency
# TYPE mlx_cluster_rdma_latency_us gauge
mlx_cluster_rdma_latency_us 25.8
No new dependency. Plain text Prometheus exposition format is trivial to generate.
make status # Full cluster status: nodes, memory, RDMA, queue, model
make logs # Tail server logs (rank 0)
make monitor # Watch mode: refresh status every 5s
make download MODEL=mlx-community/Qwen3-4B # Download + sync model (Phase 3)
Goal: A dashboard that rivals exo's SvelteKit UI, but still with zero build step. Effort: ~3–4 days. Impact: Professional-grade monitoring.
The dashboard v2 adds these panels to the existing layout:
| Panel | Data Source | Update Frequency |
|---|---|---|
| Memory gauges (per node) | Worker heartbeats via control-plane | Every 2s |
| RDMA link monitor | Background probe | Every 10s (latency), 60s (bandwidth) |
| Node health grid | Worker heartbeats | Every 5s |
| Request history table | Ring buffer | On each request |
| Error log | Ring buffer | On error |
| Model info card | Static at startup | Once |
| D3-lite topology | Hostfile + RDMA probe | Every 10s |
┌─────────────────────────────────────────────────────────────────────────┐
│ ⚡ mlx-jaccl-cluster │ Qwen3-4B-Instruct │ ● Online │ /docs │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─── Cluster Overview ──────────┐ ┌─── RDMA Link ──────────────────┐ │
│ │ │ │ │ │
│ │ Nodes: 2/2 online │ │ ● ACTIVE — rdma_en4 ↔ rdma_en4│
│ │ Model: Qwen3-4B (4-bit) │ │ Bandwidth: 7.92 GB/s │ │
│ │ World size: 2 │ │ Latency: 26.1 µs │ │
│ │ Uptime: 2h 14m │ │ Last probe: 3s ago │ │
│ │ Total requests: 1,423 │ │ │ │
│ │ Total tokens: 182,947 │ │ ┌──────────────────────────┐ │ │
│ │ │ │ │████████████████████████░░│ │ │
│ └───────────────────────────────┘ │ └──────────────────────────┘ │ │
│ │ 99% of theoretical max │ │
│ ┌─── Memory (rank 0) ──────────┐ └────────────────────────────────┘ │
│ │ ┌──────────────────────┐ │ │
│ │ │███████████████████░░░│ │ ┌─── Performance ─────────────────┐ │
│ │ └──────────────────────┘ │ │ │ │
│ │ 37.4 / 48 GB (78%) │ │ Avg tok/s (60s): 62.3 │ │
│ │ Active: 14.2 │ Cache: 2.1 │ │ Peak tok/s: 71.8 │ │
│ │ Peak: 16.3 GB │ │ Avg latency: 4.12s │ │
│ └──────────────────────────────┘ │ Queue: 1/8 ██░░░░░░ │ │
│ │ │ │
│ ┌─── Memory (rank 1) ──────────┐ │ ┌───── tok/s sparkline ─────┐ │ │
│ │ ┌──────────────────────┐ │ │ │ ╱╲ ╱╲ ╱╲ │ │ │
│ │ │██████████████████░░░░│ │ │ │ ╱ ╲_╱ ╲╱ ╲_╱╲ │ │ │
│ │ └──────────────────────┘ │ │ └───────────────────────────┘ │ │
│ │ 36.8 / 48 GB (77%) │ └─────────────────────────────────┘ │
│ │ Active: 14.0 │ Cache: 2.0 │ │
│ └──────────────────────────────┘ │
│ │
│ ┌─── Topology ─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────┐ rdma_en4 ┌──────────┐ │ │
│ │ │ rank 0 │ ◄════════════════════► │ rank 1 │ │ │
│ │ │ mac.home│ 8.05 GB/s │ mac2 │ │ │
│ │ │ coord │ 25.5 µs │ worker │ │ │
│ │ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Request History ──────────────────────────────────────────────┐ │
│ │ Time │ Kind │ Tokens │ Latency │ tok/s │ Status │ │
│ │─────────────┼──────┼────────┼─────────┼───────┼──────────────────│ │
│ │ 14:23:01 │ chat │ 128 │ 2.06s │ 62.1 │ ✅ ok │ │
│ │ 14:22:45 │ chat │ 256 │ 4.12s │ 62.1 │ ✅ ok │ │
│ │ 14:22:30 │ cmpl │ 64 │ 1.03s │ 62.1 │ ✅ ok │ │
│ │ 14:21:58 │ chat │ 32 │ 0.52s │ 61.5 │ ✅ ok │ │
│ │ 14:21:12 │ chat │ 512 │ 8.31s │ 61.6 │ ⚠️ slow │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Chat ──────────────────────────────────────────────────────────┐ │
│ │ (existing chat UI — keep as-is) │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Instead of importing D3.js (which exo does), we draw the topology with pure SVG updated via HTMX/SSE:
- Nodes rendered as rounded rectangles
- RDMA links as animated dashed lines (CSS animation, no JS library)
- Link color: green = healthy, yellow = degraded, red = down
- Bandwidth label on the link, updated from the RDMA probe
- Works with 2-node and 4-node topologies (reads from hostfile)
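To give a sense of scale for the "pure SVG" approach, the sketch below renders a two-node topology as an inline SVG string that an HTMX fragment could swap in. The function names, coordinates, and color keywords are illustrative assumptions; the CSS dash animation and the 4-node layout are left out.

```python
from html import escape

# Link colors from the list above; the status keys are hypothetical names.
LINK_COLORS = {"healthy": "green", "degraded": "yellow", "down": "red"}


def _node(x: int, label: str) -> str:
    """One node: a rounded rectangle with a centered label."""
    return (
        f'<rect x="{x}" y="20" width="120" height="60" rx="10" fill="none" stroke="black"/>'
        f'<text x="{x + 60}" y="55" text-anchor="middle">{escape(label)}</text>'
    )


def topology_svg(left: str, right: str, link_status: str, bandwidth_gbps: float) -> str:
    """Render two nodes joined by a dashed link colored by its status."""
    color = LINK_COLORS[link_status]
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 420 100">',
        _node(10, left),
        _node(290, right),
        f'<line x1="130" y1="50" x2="290" y2="50" stroke="{color}" '
        'stroke-width="3" stroke-dasharray="8 4"/>',
        f'<text x="210" y="40" text-anchor="middle">{bandwidth_gbps:.2f} GB/s</text>',
        "</svg>",
    ]
    return "".join(parts)
```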
Current SSE pushes a flat JSON blob. v2 adds structured sections:
{
"uptime_s": 8040,
"total_requests": 1423,
"total_tokens": 182947,
"avg_tps_60s": 62.3,
"peak_tps_60s": 71.8,
"avg_latency_60s": 4.12,
"queue_size": 1,
"queue_max": 8,
"history": [ ... ],
"memory": {
"0": { "active_gb": 14.2, "cache_gb": 2.1, "peak_gb": 16.3, "total_gb": 48.0 },
"1": { "active_gb": 14.0, "cache_gb": 2.0, "peak_gb": 16.1, "total_gb": 48.0 }
},
"rdma": {
"bandwidth_gbps": 7.92,
"latency_us": 26.1,
"link_active": true,
"last_probe_s": 3
},
"nodes": {
"0": { "status": "active", "hostname": "mac.home", "last_heartbeat_s": 0 },
"1": { "status": "active", "hostname": "mac2", "last_heartbeat_s": 2 }
}
}
Goal: Download, sync, and manage models without leaving the terminal (or the dashboard). Effort: ~2–3 days. Impact: Eliminates the most tedious manual step.
# Download a model from HuggingFace and sync to all nodes
make download MODEL=mlx-community/Qwen3-4B-Instruct-2507-4bit
# Download to a custom directory
make download MODEL=mlx-community/Qwen3-4B-Instruct-2507-4bit MODELS_DIR=~/models_mlx
# List downloaded models
make models-local
Implementation:
# scripts/download_model.sh
# 1. huggingface-cli download $MODEL --local-dir $MODELS_DIR/$MODEL_NAME
# 2. For each node in hostfile (except rank 0):
# ssh $node "mkdir -p $MODELS_DIR"
# rsync -avz --progress $LOCAL_PATH/ $node:$MODELS_DIR/$MODEL_NAME/
# 3. Verify all nodes have the model (checksum on config.json)
A simple JSON file tracking downloaded models:
// ~/.mlx-jaccl-cluster/models.json
{
"models": [
{
"id": "Qwen3-4B-Instruct-2507-4bit",
"source": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"path": "/Users/omar/models_mlx/Qwen3-4B-Instruct-2507-4bit",
"size_gb": 2.4,
"downloaded_at": "2025-07-14T10:23:00Z",
"synced_nodes": ["mac.home", "mac2"],
"quantization": "4-bit"
}
]
}
┌─── Models ─────────────────────────────────────────────────────────────┐
│ │
│ ● Active: Qwen3-4B-Instruct-2507-4bit (4-bit, 2.4 GB) │
│ │
│ Downloaded: │
│ ┌────────────────────────────────┬──────┬───────┬──────────────────┐ │
│ │ Model │ Size │ Quant │ Synced │ │
│ ├────────────────────────────────┼──────┼───────┼──────────────────┤ │
│ │ Qwen3-4B-Instruct-2507-4bit │ 2.4G │ 4-bit │ ✅ 2/2 nodes │ │
│ │ Llama-3.1-8B-Instruct-4bit │ 4.5G │ 4-bit │ ✅ 2/2 nodes │ │
│ │ Mistral-7B-v0.3-4bit │ 3.8G │ 4-bit │ ⚠️ 1/2 nodes │ │
│ └────────────────────────────────┴──────┴───────┴──────────────────┘ │
│ │
│ [Download New Model] input: ______________________________ [Go] │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When a download is in progress, the dashboard shows a progress bar:
Downloading: mlx-community/Qwen3-8B-Instruct-4bit
┌──────────────────────────────────────────────┐
│██████████████████████████░░░░░░░░░░░░░░░░░░░│ 62% — 1.5 / 2.4 GB — 45 MB/s
└──────────────────────────────────────────────┘
Syncing to mac2... waiting
Implementation: A background asyncio task wraps a huggingface-cli download subprocess, parses progress from stderr, and pushes updates via a new SSE event type.
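The parsing step might look like the sketch below. It assumes the CLI prints tqdm-style progress lines containing a percentage, which should be verified against the installed version; `parse_progress` is a hypothetical helper, and the sample line in the usage note is illustrative, not captured output.

```python
import re
from typing import Optional

_PERCENT = re.compile(r"(\d{1,3})%")


def parse_progress(line: str) -> Optional[int]:
    """Return the percentage (0-100) found in a progress line, or None if absent."""
    match = _PERCENT.search(line)
    if match is None:
        return None
    return min(int(match.group(1)), 100)
```

For example, `parse_progress("model.safetensors:  62%|██████    | 1.5G/2.4G")` yields 62. Because progress bars are typically redrawn with carriage returns, the reader task would probably need to split the stream on "\r" as well as "\n" before calling this.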
Goal: Support OpenAI function calling and JSON mode so agents and LangChain work. Effort: ~4–5 days. Impact: Unlocks the agent/tool ecosystem.
OpenAI tool calling request:
{
"model": "Qwen3-4B",
"messages": [
{"role": "user", "content": "What's the weather in Paris?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
}
Expected response:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"Paris\"}"
}
}]
},
"finish_reason": "tool_calls"
}]
}
Step 1: Prompt formatting. Convert tools + messages into a prompt the model understands. Most instruct models (Qwen, Llama, Mistral) have specific chat templates for tool use that tokenizer.apply_chat_template already handles when tools= is passed.
# In _build_chat_prompt(), add tools support:
def _build_chat_prompt(messages, tools=None):
    msgs = [{"role": m.role, "content": m.content} for m in messages]
    kwargs = {"tokenize": False, "add_generation_prompt": True}
    if tools:
        kwargs["tools"] = tools
    return _tok.apply_chat_template(msgs, **kwargs)
Step 2: Response parsing. After generation, detect if the output contains a tool call (model-specific format) and parse it into the OpenAI tool_calls structure.
# Tool call detection (model-dependent patterns)
# Qwen3: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
# Llama: <|python_tag|>{"name": "...", "parameters": {...}}
# Mistral: [TOOL_CALLS][{"name": "...", "arguments": {...}}]
def parse_tool_calls(text: str, model_family: str) -> list[dict] | None:
    """Extract tool calls from model output. Returns None if no tool call detected."""
    ...
Step 3: Schema updates. Extend ChatCompletionsReq and response schemas:
from typing import Optional, Union

from pydantic import BaseModel


class Tool(BaseModel):
    type: str = "function"
    function: dict  # {name, description, parameters}


class ChatCompletionsReq(BaseModel):
    model: Optional[str] = None
    messages: list[ChatMessage]  # ChatMessage is the existing message schema
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[list[Tool]] = None
    # "auto", "none", or {"type": "function", "function": {"name": "..."}}
    tool_choice: Optional[Union[str, dict]] = None
    response_format: Optional[dict] = None  # {"type": "json_object"}
Step 4: Streaming tool calls. SSE chunks must include tool_calls delta objects per the OpenAI spec.
When response_format: {"type": "json_object"} is set:
- Append "Respond with valid JSON." to the system prompt
- After generation, validate that the output is valid JSON
- If not valid, retry once with a stronger prompt
- Return finish_reason: "stop" only if valid JSON
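The validate-and-retry loop can be sketched independently of the model. Here `generate` is a stand-in callable (system prompt and user prompt in, text out), the wording of the stronger hint is an assumption, and what to return when both attempts fail is left to the caller via the boolean flag.

```python
import json
from typing import Callable

JSON_HINT = "Respond with valid JSON."
STRONG_HINT = "Respond ONLY with a single valid JSON value and no other text."  # assumed wording


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def generate_json(generate: Callable[[str, str], str],
                  system_prompt: str, user_prompt: str) -> tuple[str, bool]:
    """Try once with the normal hint, then once with the stronger hint.

    Returns (text, ok); ok is True only when the text parsed as JSON.
    """
    text = ""
    for hint in (JSON_HINT, STRONG_HINT):
        text = generate(f"{system_prompt}\n{hint}".strip(), user_prompt)
        if is_valid_json(text):
            return text, True
    return text, False
```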
Currently missing — easy wins:
| Parameter | Default | Notes |
|---|---|---|
| temperature | 1.0 | Pass to generate() / stream_generate() |
| top_p | 1.0 | Nucleus sampling |
| top_k | -1 | Top-k sampling (-1 = disabled) |
| repetition_penalty | 1.0 | Penalize repeated tokens |
| stop | [] | Stop sequences (check after each token) |
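The sampling parameters are already supported by mlx_lm (recent releases appear to expose them through sampler and logits-processor hooks rather than direct keyword arguments, so check the installed version). We mainly need to plumb them through from the HTTP request; stop sequences additionally need the per-token check noted in the table.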
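The stop-sequence check is the one piece that lives in our own loop. A minimal sketch, using a hypothetical helper `truncate_at_stop` that is run on the accumulated text after each token:

```python
def truncate_at_stop(text: str, stop: list[str]) -> tuple[str, bool]:
    """Cut `text` at the earliest stop sequence.

    Returns (possibly truncated text, whether a stop sequence was hit).
    The stop sequence itself is not included in the output.
    """
    cut = -1
    for seq in stop:
        if not seq:
            continue  # ignore empty strings, which would match at index 0
        idx = text.find(seq)
        if idx != -1 and (cut == -1 or idx < cut):
            cut = idx
    if cut == -1:
        return text, False
    return text[:cut], True
```

When streaming, a stop sequence can straddle two chunks, so the streaming path would also need to hold back the last few characters before emitting them.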
Goal: Drop-in replacement for more clients. Effort: ~3–4 days. Impact: Works with Ollama clients, LangChain, etc.
Many tools (Open WebUI, Continue.dev, etc.) speak Ollama's API:
| Endpoint | Method | Description |
|---|---|---|
| /api/generate | POST | Text generation (Ollama format) |
| /api/chat | POST | Chat (Ollama format) |
| /api/tags | GET | List models |
| /api/show | POST | Model info |
| /api/ps | GET | Running models |
Implementation: Thin adapter layer that translates Ollama requests → our internal format → Ollama responses. ~200 lines.
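As an illustration of how thin the adapter can be, the sketch below translates an Ollama /api/chat request into an OpenAI-style internal dict and wraps the result in Ollama's non-streaming response shape. The internal field names mirror our chat completions request and are assumptions; only a few Ollama options are mapped, and timing fields such as total_duration are omitted.

```python
from datetime import datetime, timezone


def ollama_chat_to_internal(req: dict) -> dict:
    """Translate an Ollama /api/chat body into our (assumed) internal request dict."""
    options = req.get("options") or {}
    internal = {
        "model": req.get("model"),
        "messages": req["messages"],
        "stream": req.get("stream", True),  # Ollama streams unless told otherwise
    }
    if "num_predict" in options:
        internal["max_tokens"] = options["num_predict"]
    for key in ("temperature", "top_p", "top_k"):
        if key in options:
            internal[key] = options[key]
    return internal


def internal_to_ollama_chat(model: str, content: str, done: bool = True) -> dict:
    """Wrap generated text in the Ollama chat response shape."""
    return {
        "model": model,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "message": {"role": "assistant", "content": content},
        "done": done,
    }
```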
| Endpoint | Effort | Notes |
|---|---|---|
| /v1/embeddings | Medium | Requires an embedding model or adapter |
| /v1/models/{id} | Trivial | Return model details |
| /v1/chat/completions with n > 1 | Medium | Multiple completions per request |
Verify against:
- OpenAI Python SDK (client.chat.completions.create(); the older openai.ChatCompletion.create() only exists in pre-1.0 SDKs)
- OpenAI Node SDK
- LangChain (ChatOpenAI)
- LlamaIndex
- Continue.dev (VS Code)
- Open WebUI
- Cursor / Cody (via OpenAI-compatible endpoint)
- curl (already tested)
Goal: Performance and capability improvements. Effort: ~1–2 weeks per feature. Impact: Competitive with production inference servers.
What it does: Caches the key-value tensors for shared prompt prefixes. If 10 users ask questions with the same system prompt, the KV cache for that prefix is computed once.
Impact: Dramatic latency reduction for chat applications with long system prompts.
Complexity: High — requires modifying how we call generate() and managing a cache eviction policy. May need mlx_lm updates.
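The lookup side of a prefix cache can be sketched without touching MLX: store an opaque cache state per token-ID prefix, and on each request find the longest stored prefix that matches. Everything here is an assumption for illustration (the class name, LRU eviction, and the linear scan); how the state is actually produced and consumed depends on mlx_lm.

```python
from collections import OrderedDict


class PrefixCache:
    """Maps token-ID prefixes to opaque KV states, with LRU eviction."""

    def __init__(self, max_entries: int = 8):
        self._entries = OrderedDict()  # tuple of token IDs -> cached state
        self._max = max_entries

    def put(self, tokens, state) -> None:
        key = tuple(tokens)
        self._entries[key] = state
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix of `tokens`."""
        tokens = tuple(tokens)
        best = None
        for key in self._entries:
            if len(key) <= len(tokens) and tokens[: len(key)] == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return 0, None
        self._entries.move_to_end(best)  # mark as recently used
        return len(best), self._entries[best]
```

A hit means only the tokens after `matched_length` need prompt processing.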
What it does: Instead of processing one request at a time (current behavior), interleave tokens from multiple requests.
Impact: Higher throughput under concurrent load. Currently our queue processes requests serially.
Complexity: Very high — requires rewriting the generation loop. The control-plane protocol would need significant changes since all ranks must agree on batch composition.
What it does: Use a small draft model to propose tokens, then verify with the large model in parallel.
Impact: 2–3× speedup for large models.
Complexity: High — requires loading two models and coordinating draft/verify cycles across ranks.
Things we deliberately choose NOT to implement:
| Feature | Reason |
|---|---|
| Auto-discovery | Explicit hostfile is simpler, more reliable, and correct for 1–4 node clusters |
| Leader election | Adds complexity; rank 0 is always the coordinator; 2-node clusters don't need it |
| Image generation | Different workload; use a dedicated tool (mflux, ComfyUI) |
| SvelteKit dashboard | Requires Node.js build toolchain; HTMX+SSE is sufficient and zero-build |
| Custom MLX fork | We use stock PyPI mlx; this is a core advantage |
| Rust / Swift components | Pure Python + Bash; zero build toolchain is a feature |
| N > 4 node scaling | JACCL requires fully connected TB mesh; 4 nodes = 6 cables, already impractical |
| Multi-tenant isolation | Single-user inference server; auth/isolation adds complexity for no benefit |
Context: exo uses SvelteKit + D3.js for a richer dashboard.
Decision: Stay with HTMX + SSE + inline HTML/CSS/JS.
Rationale:
- Zero build step is a core project value
- HTMX can handle all planned features (memory bars, topology, tables)
- SVG can replace D3.js for the topology graph
- SSE is already working and battle-tested
- Adding Node.js + npm + Svelte contradicts our "zero toolchain" promise
Context: Workers need to report memory/health to rank 0 for the dashboard.
Decision: Extend the existing TCP control-plane protocol with new message types (metrics, heartbeat).
Rationale:
- No new connections or ports needed
- Protocol is already framed JSON, easy to extend
- Workers already have an open socket to rank 0
- Alternative (HTTP from workers) would require each worker to run a server
Context: exo has a built-in DownloadCoordinator that downloads models via HTTP from a leader.
Decision: Use huggingface-cli download + rsync wrapped in a script.
Rationale:
- HuggingFace CLI handles auth, resume, checksums, LFS
- rsync is battle-tested for large file sync
- Building a download coordinator is high effort, low marginal value for 2 nodes
- Script approach is debuggable and composable
Context: Different model families use different formats for tool calls.
Decision: Implement a pluggable parser with model-family detection.
Rationale:
- Qwen, Llama, and Mistral all use different tool call formats
- A single regex won't work
- Auto-detect model family from config.json or tokenizer config
- Start with Qwen3 (our primary model), add others incrementally
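A sketch of what the pluggable parser could look like, starting with the Qwen3 `<tool_call>` format listed earlier. The registry layout, the `call_` ID scheme, and the decision to skip malformed JSON silently are assumptions; other families would register their own function under `PARSERS`.

```python
import json
import re
import uuid
from typing import Optional

# Qwen3 wraps each call as <tool_call>{"name": ..., "arguments": {...}}</tool_call>
_QWEN_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_qwen3(text: str) -> Optional[list]:
    calls = []
    for match in _QWEN_TOOL_CALL.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except ValueError:
            continue  # malformed JSON: treat as plain text
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON-encoded string
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls or None


PARSERS = {"qwen3": parse_qwen3}  # add "llama", "mistral", ... incrementally


def parse_tool_calls(text: str, model_family: str) -> Optional[list]:
    """Dispatch to the family-specific parser; None means no tool call detected."""
    parser = PARSERS.get(model_family)
    return parser(text) if parser else None
```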
For access from phones or narrow windows, the grid collapses to single column:
┌─────────────────────────┐
│ ⚡ mlx-jaccl-cluster │
│ Qwen3-4B │ ● Online │
├─────────────────────────┤
│ Cluster Overview │
│ 2/2 nodes │ 1,423 reqs │
├─────────────────────────┤
│ Performance │
│ 62.3 tok/s │ 4.12s lat │
│ ▁▃▅▇▅▃▅▇▅▃ (sparkline) │
├─────────────────────────┤
│ RDMA: 7.92 GB/s ● UP │
├─────────────────────────┤
│ Memory │
│ R0: ████████░░ 78% │
│ R1: ███████░░░ 77% │
├─────────────────────────┤
│ Queue: 1/8 ██░░░░░░ │
├─────────────────────────┤
│ Chat UI │
│ [message input] [Send] │
└─────────────────────────┘
No SPA routing needed. Single page with anchor links and collapsible sections:
[Overview] [Memory] [RDMA] [Requests] [Models] [Chat]
Each section is an HTMX fragment that auto-updates via SSE. No full page reloads.
Phase 1 — Observability:
| File | Change |
|---|---|
| server/openai_cluster_server.py | Add heartbeat protocol, memory probe, worker health tracking |
| server/dashboard.py | Add memory panel, RDMA panel, request history, error log |
| server/rdma_probe.py | New — Background RDMA health probe |
| server/prometheus.py | New — /metrics endpoint |
| Makefile | Add status, logs, monitor targets |
| scripts/cluster_status.sh | New — Full cluster status script |
Phase 2 — Dashboard v2:
| File | Change |
|---|---|
| server/dashboard.py | Major rewrite — new layout, panels, SSE v2 schema |
| server/openai_cluster_server.py | Pass new data sources to dashboard |
Phase 3 — Model Management:
| File | Change |
|---|---|
| scripts/download_model.sh | New — Download + sync script |
| Makefile | Add download, models-local, models-sync targets |
| server/openai_cluster_server.py | Add /models/download endpoint (optional) |
| server/dashboard.py | Add models panel |
Phase 4 — Tool Support:
| File | Change |
|---|---|
| server/openai_cluster_server.py | Tool calls in request/response, sampling params, stop sequences |
| server/tool_parser.py | New — Model-specific tool call parser |
| server/schemas.py | New — Extracted Pydantic models with tools support |
Phase 5 — API Parity:
| File | Change |
|---|---|
| server/ollama_compat.py | New — Ollama API adapter |
| server/openai_cluster_server.py | Mount Ollama routes |
Current message types:
→ worker→rank0: {"type": "hello", "rank": N}
→ rank0→worker: {"type": "task", "prompt": "...", "max_tokens": N}
→ worker→rank0: {"type": "done", "rank": N}
v2 additions:
→ worker→rank0: {"type": "heartbeat", "rank": N, "memory": {...}, "timestamp": T}
→ rank0→worker: {"type": "config", "probe_interval_s": 5}
→ worker→rank0: {"type": "metrics", "rank": N, "memory": {...}}
→ rank0→worker: {"type": "shutdown"} (graceful stop)
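The new message types only need to fit the existing framing. The roadmap does not spell out the frame format, so the sketch below assumes a 4-byte big-endian length prefix followed by UTF-8 JSON; if the real protocol differs (for example newline-delimited JSON), only these two hypothetical helpers would change.

```python
import json
import struct

_HEADER = struct.Struct(">I")  # assumption: 4-byte big-endian body length


def encode_frame(message: dict) -> bytes:
    """Serialize one control-plane message as length prefix + JSON body."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return _HEADER.pack(len(body)) + body


def decode_frames(buffer: bytes) -> tuple[list, bytes]:
    """Decode every complete frame in `buffer`.

    Returns (messages, remainder); the remainder holds any partial frame
    and should be prepended to the next chunk read from the socket.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= _HEADER.size:
        (length,) = _HEADER.unpack_from(buffer, offset)
        end = offset + _HEADER.size + length
        if end > len(buffer):
            break  # body not fully received yet
        body = buffer[offset + _HEADER.size:end]
        messages.append(json.loads(body.decode("utf-8")))
        offset = end
    return messages, buffer[offset:]
```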
We're strict about dependencies. Every new package must justify itself.
mlx >= 0.30.4
mlx-lm >= 0.30.5
fastapi >= 0.110.0
uvicorn[standard] >= 0.29.0
pydantic >= 2.0
transformers >= 4.50.0
tokenizers
mistral_common
huggingface_hub
| Package | Phase | Justification | Alternative Considered |
|---|---|---|---|
| psutil | Phase 1 | Cross-platform memory/CPU stats | /proc parsing (Linux only, we're macOS) |
| — | — | — | — |
That's it. One new dependency across all phases. Everything else is built with the standard library, MLX APIs, or inline code.
- Prometheus export: hand-written text format (no prometheus_client)
- Topology SVG: inline SVG (no D3.js)
- RDMA probe: mx.distributed.all_sum (already available)
- Download: huggingface-cli subprocess (already installed)
- Dashboard shows live RAM usage per node (auto-refreshing)
- Dashboard shows live RDMA bandwidth and latency (from actual probes)
- Worker disconnect is detected within 15 seconds and shown on dashboard
- make status prints a complete cluster snapshot in the terminal
- /metrics returns valid Prometheus exposition format
- Request history is visible in dashboard (last 50 requests)
- Dashboard has all panels from the wireframe
- SVG topology graph shows nodes and RDMA links with live status
- Dashboard works on mobile (responsive layout)
- SSE pushes v2 schema with memory + RDMA + node data
- Zero build step maintained (no npm, no bundler)
- make download MODEL=... downloads and syncs to all nodes
- make models-local lists all downloaded models with sizes
- Dashboard shows downloaded models and active model
- Download progress is visible (terminal and/or dashboard)
- OpenAI SDK tools= parameter works with Qwen3 models
- response_format: {"type": "json_object"} works
- temperature, top_p, stop parameters are plumbed through
- Streaming tool calls work per OpenAI spec
- LangChain ChatOpenAI with tools works against our server
- Open WebUI connects via Ollama API and works for chat
- Continue.dev (VS Code) works with our server as OpenAI backend
- All Ollama core endpoints return valid responses
Week 1-2: Phase 1 — Observability & Monitoring
Week 3-4: Phase 2 — Dashboard v2
Week 5: Phase 3 — Model Management
Week 6-7: Phase 4 — Tool Support
Week 8: Phase 5 — API Parity
Ongoing: Phase 6 — Advanced Inference (opportunistic)
For completeness, these exo features are not on our roadmap and why:
| exo Feature | Lines of Code | Why We Skip It |
|---|---|---|
| libp2p auto-discovery | ~2,000 | Explicit hostfile is more reliable for 1–4 nodes |
| Raft leader election | ~1,500 | 2-node cluster doesn't need it |
| DownloadCoordinator | ~800 | huggingface-cli + rsync is simpler and more robust |
| Topology-aware shard placement | ~600 | 2-node = trivial placement |
| Flux image generation | ~400 | Use dedicated tools (mflux, ComfyUI) |
| Swift SystemProfiler integration | ~300 | mlx.device_info() gives us what we need |
| Nix flake | ~200 | uv + pyproject.toml is sufficient |
| Ring-buffer P2P topology | ~500 | JACCL handles topology via hostfile |
| Custom MLX fork maintenance | ongoing | Using stock PyPI mlx is a core advantage |
Total lines we avoid maintaining: ~6,300+
This roadmap is a living document. Update it as phases are completed or priorities shift.