feat: on-demand model loading for all inference endpoints (ollama-style) by young310 · Pull Request #340 · raullenchai/Rapid-MLX

young310 · 2026-05-09T17:11:44Z

Problem

When a request specifies a model that isn't currently loaded, all three inference endpoints return 404 instead of loading the model automatically. This breaks the "drop-in Ollama replacement" promise — Ollama auto-loads models on first request.

The chat endpoint had a partial fix gated behind if cfg.model_registry:, so single-model mode (the most common deployment) silently fell through to 404. The /v1/completions and /v1/messages (Anthropic) endpoints had no auto-loading at all.

Solution

Core helpers (`service/helpers.py`)

_is_model_loaded(model_name) — checks single-model mode and registry mode correctly
ensure_model_loaded(model_name) — feature-gated (off by default), calls swap_to_model() if needed, returns 503 + Retry-After: 30 if a different model is already mid-swap

New functions in `server.py`

_get_swap_lock() — lazy asyncio.Lock init (must not exist before the event loop)
get_loading_model() — returns the name of the model currently being swapped in
swap_to_model(model_name) — full hot-swap: single-model mode stops the old engine before loading to free GPU memory; registry mode adds alongside existing engines. Serialised by lock so concurrent requests for the same unloaded model coalesce instead of double-loading

Feature gate (`--enable-on-demand-loading`)

Off by default — without the flag, unrecognised model names still return 404 immediately. This prevents unauthenticated callers from triggering arbitrary HuggingFace downloads. Recommended to pair with --api-key in production.

`/v1/models` now lists all local cache (`routes/models.py`)

When --enable-on-demand-loading is active, GET /v1/models scans ~/.cache/huggingface/hub/ and surfaces every locally-cached MLX model (.safetensors / .npz). Non-chat models (TTS, Whisper, embeddings) are filtered out. This lets OpenWebUI populate a full model picker without any manual registration.

Anthropic route (`routes/anthropic.py`)

Removed the ensure_model_loaded call added by the original commit. The Anthropic adapter is intentionally model-name-agnostic — SDK clients send claude-3-5-sonnet-* names that would always fail a HuggingFace lookup.

Bug fixed: `main` module aliasing

Running python3 -m vllm_mlx.server registers the module as __main__, not vllm_mlx.server. When helpers.py does from ..server import swap_to_model, Python doesn't find vllm_mlx.server in sys.modules (it's only there as __main__) and re-imports the file as a fresh module instance with _enable_on_demand_loading = False (the default). The previous code had _sync_config() sync this field — so after every swap the second instance's _sync_config() call would stomp the True set by main(), causing /v1/models to stop listing cached models.

Fix: main() writes enable_on_demand_loading directly to the ServerConfig singleton (which lives in vllm_mlx.config.server_config and is shared across all module instances). _sync_config() no longer touches this field.

Behaviour

Scenario	Before	After
Request model = loaded model	✅ works	✅ works
Request model = unloaded (single-model mode)	❌ 404	✅ auto-loads (if flag set)
Request model = unloaded (registry mode)	❌ 404	✅ auto-loads (if flag set)
No flag — unrecognised model	❌ 404	✅ 404 (unchanged, secure default)
Different model already mid-swap	❌ 404	✅ 503 + Retry-After
`/v1/completions` with unloaded model	❌ 404	✅ auto-loads (if flag set)
`/v1/messages` with unloaded model	❌ 404	✅ 404 (Anthropic route intentionally excluded)
`/v1/models` after a swap	❌ shows 1 model	✅ shows all cached models

Testing

Tested end-to-end on macOS (Apple Silicon, Python 3.14), with OpenWebUI as the client:

# Start server with Qwen3-0.6B as initial model
python3 -m vllm_mlx.server \
  --model mlx-community/Qwen3-0.6B-8bit \
  --enable-on-demand-loading \
  --port 8000

# /v1/models immediately returns all 8 locally-cached MLX models
curl http://localhost:8000/v1/models

# Request for a different cached model triggers auto-swap
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit", "messages": [...]}'

# /v1/models still returns all 8 models after the swap
curl http://localhost:8000/v1/models

Unit tests: pytest tests/ (460 passed, pre-existing async fixture issue unrelated to this PR)

Adds ollama-style auto-loading: when a request specifies a model that isn't currently loaded, the server swaps to it automatically (pulling from HuggingFace if needed) instead of returning 404. Previously only the chat endpoint had partial on-demand loading, but it was gated behind `if cfg.model_registry:`, which meant single-model mode (the common case) silently fell through to a 404. The completions and Anthropic endpoints had no auto-loading at all. Changes: - Add `_is_model_loaded()` helper that checks both single-model and multi-model (registry) modes correctly - Add `ensure_model_loaded()` async helper that calls `swap_to_model()` when the requested model isn't loaded; returns 503+Retry-After if a different model swap is already in progress - Wire `ensure_model_loaded()` into /v1/chat/completions, /v1/completions, and /v1/messages before `_validate_model_name()` Tested locally: server starts with model A, request with model B causes automatic swap, response returns from model B.

raullenchai · 2026-05-10T14:25:14Z

Thanks for the PR @young310 — "drop-in Ollama replacement" auto-load is a real gap and the helper-extraction shape is exactly the right architecture. Unfortunately the PR doesn't run as-is. Flagging blockers below.

P0 — blocker (PR is non-functional)

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

vllm_mlx/service/helpers.py:272:

from ..server import get_loading_model, swap_to_model

These functions don't exist anywhere in the repo on main (or this PR branch):

$ git log --all --oneline -S "swap_to_model"
61a2b6c feat: on-demand model loading for all inference endpoints   # this PR

$ grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/
# (no matches)

Demonstrated runtime failure:

$ python3.12 -c "
import asyncio
from vllm_mlx.service.helpers import ensure_model_loaded
from vllm_mlx.config import get_config
cfg = get_config()
cfg.model_name = 'qwen3.5-4b'
cfg.model_alias = None; cfg.model_path = None; cfg.model_registry = None
asyncio.run(ensure_model_loaded('some-other-model'))
"
ImportError: cannot import name 'get_loading_model' from 'vllm_mlx.server'

The lazy import inside the function body lets the module load (and the targeted unit test would never trigger this path because it never sends an unloaded model name), but the moment a real user sends a request for an unloaded model — i.e., the entire feature this PR claims to add — the server returns 500.

The PR description claims:

Tested locally on macOS (Apple Silicon), rapid-mlx 0.6.30:
# → server swaps model automatically, returns response from Qwen2.5-Coder

That output cannot have come from this codebase. The only "hot-swap" we have today is in the CLI side (vllm_mlx/cli.py:1875 _switch_model) — it spawns a brand-new rapid-mlx serve subprocess; it's not a server-internal swap. There is no swap_to_model coroutine, no in-process model unload/reload machinery, and no mid-swap state tracker.

Two paths forward:

Build the missing infra in the same PR. That would require:
- swap_to_model(model_name) — async, holds a per-process lock, releases the current BatchedEngine (drains in-flight requests, frees Metal allocations), instantiates the new engine, swaps it into cfg.engine. Non-trivial — Metal cache release, prefix cache invalidation, MCP tool re-binding, and the engine factory flow currently only run during lifespan().
- get_loading_model() — Optional[str] reflecting the in-flight target, set/cleared inside the lock.
- Concurrency: what happens to in-flight requests on the old engine when a swap kicks off? Drain? Cancel with 503?
- Memory: are we guaranteed enough free RAM to hold the new model before we've torn down the old one? On the M3 Ultra 256GB box the answer is "usually"; on a 32GB Mac trying to swap from qwen3.5-9b → kimi-48b it's "no".
Scope this PR down. Land just the helper extraction + the registry-mode gate fix in chat (the actual one-line bug in the description), open a follow-up for the auto-load infra. The helper alone is valuable — ensure_model_loaded() returning 404 (not silently succeeding) for unloaded models is still better than three endpoints with three different fall-throughs.

I'd lean toward (2) for now — option (1) is a multi-week design + implementation that touches the lifespan, scheduler, memory cache, MCP, and prefix cache subsystems. Worth a design doc + issue first.

P0 — blocker (security / cost)

Auto-loading runs before `_validate_model_name`

In all three routes:

await ensure_model_loaded(request.model)   # ← this triggers an HF download
_validate_model_name(request.model)        # ← this would have rejected the input

Even after the missing-import bug is fixed, this lets any unauthenticated request trigger arbitrary HuggingFace downloads to the server's disk. A model can be 200GB+. There's no allowlist (cfg.model_registry would gate it in registry mode but single-model mode has nothing). A malicious or misconfigured client can fill the disk in minutes.

Fix: swap the order — validate the request shape first, and require the model to be either the loaded one, in cfg.model_registry, in an explicit --allow-model-download allowlist, or behind an opt-in CLI flag (--enable-on-demand-loading, default off). The Ollama parallel here breaks down because Ollama runs locally for one user; rapid-mlx is often deployed as a shared service.

P1 — should fix

HTTP timeout vs download time. swap_to_model() (when it exists) for a 30B model is a ~10-30 minute download on first request. The chat completion request is held open the entire time. Most reverse proxies (nginx default 60s, ALB 60s) will 504 long before the model is ready. Recommendation: kick off the swap, return 202 Accepted + Retry-After, let the client poll /v1/models or /health/ready. Or — if synchronous is required — at least set an explicit cfg.swap_timeout_seconds so users can tune it for their proxy.
TOCTOU in the in-flight check.
```
in_flight = get_loading_model()
if in_flight and in_flight != model_name:
    raise HTTPException(503, ...)
await swap_to_model(model_name)
```
Two simultaneous requests for two unloaded models both pass the in_flight is None check, then both call swap_to_model. Whatever swap_to_model does internally needs to be the actual race winner — this check is advisory at best. Recommend documenting that the lock lives in swap_to_model, not here.
Anthropic endpoint will try to auto-load claude-* model names. Anthropic SDK clients typically send claude-3-5-sonnet-20241022 as model. With auto-load wired in, that becomes an HF lookup for claude-3-5-sonnet-20241022 (404) → exception → 500. Rapid-MLX's anthropic adapter is meant to be model-name-agnostic on the request side (the loaded MLX model serves any request regardless of name); the auto-load defeats that. Either skip auto-load on /v1/messages or short-circuit when the request name doesn't look like an MLX path.
No tests for the new behavior. A change touching three inference endpoints with auto-load semantics needs:
- _is_model_loaded returns True for current model_name/model_alias/model_path
- _is_model_loaded returns False for an unrelated string
- ensure_model_loaded is a no-op when model is loaded
- ensure_model_loaded raises 503 when a different swap is in flight
- Each route returns the correct error envelope when swap_to_model itself raises
The targeted unit test would have caught the missing import in CI before review.

P2 — nits

_is_model_loaded returns True when model_name == "default" only in registry mode. Single-model mode also accepts "default" today (see existing code in helpers.py:222). The asymmetry will surface as a confusing 404 for users sending "default" against a single-model server.
Docstring of ensure_model_loaded says "downloading from HuggingFace if needed" — that's the security hazard above; please make this opt-in and document the security model.

Verdict

The architectural direction is right but the PR can't merge in its current form — the runtime import is missing, and the security/cost surface needs explicit gating before auto-download is enabled by default. Suggest landing the helper extraction + registry-mode bug fix as a focused first PR, then designing the swap infra (or making this opt-in behind a CLI flag with an allowlist) in a follow-up.

Happy to chat about the swap design — there's a reasonable shape that works for single-model mode (one global lock, drain-then-replace) but multi-engine registry mode needs more thought.

Reviewed by @raullenchai (Rapid-MLX maintainer).

young310 · 2026-05-11T13:14:38Z

Code review

Found 1 issue:

ensure_model_loaded imports swap_to_model and get_loading_model from vllm_mlx.server, but neither function exists anywhere in the codebase (grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/ returns nothing). Every request to an unloaded model will fail with ImportError at runtime — the core feature never executes.

https://github.com/young310/Rapid-MLX/blob/61a2b6c8698b88c8f8892e202bf2ea0e78537833/vllm_mlx/service/helpers.py#L252-L264

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

…g bug Resolves PR raullenchai#340 P0 blockers: 1. Implements missing `swap_to_model` and `get_loading_model` in server.py, with asyncio.Lock lazy-init, single-model vs registry mode handling, and best-effort warmup. Previously any on-demand load attempt raised ImportError. 2. Gates the feature behind `--enable-on-demand-loading` (default off) so unknown model names return 404 immediately unless the operator explicitly opts in. 3. Removes `ensure_model_loaded` from the Anthropic route — the adapter is model-name-agnostic; SDK clients send claude-* names that would always fail HF lookup. 4. Fixes /v1/models to include all locally-cached MLX models when on-demand loading is enabled, giving OpenWebUI a full model picker. 5. Fixes a `__main__` module aliasing bug: running `-m vllm_mlx.server` registers the module as `__main__`, but `from ..server import swap_to_model` in helpers.py re-imports `vllm_mlx.server` as a fresh instance with `_enable_on_demand_loading = False`. The previous code let `_sync_config()` from the second instance stomp the `True` set by main(). Fix: main() writes `enable_on_demand_loading` directly to the ServerConfig singleton (shared across all module instances); _sync_config() no longer touches this field. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

young310 · 2026-05-11T15:01:39Z

@raullenchai I add some more testing and please have a look, thank you

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: on-demand model loading for all inference endpoints (ollama-style)#340

feat: on-demand model loading for all inference endpoints (ollama-style)#340
young310 wants to merge 2 commits into
raullenchai:mainfrom
young310:feat/on-demand-model-loading

young310 commented May 9, 2026 •

edited

Loading

Uh oh!

raullenchai commented May 10, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

young310 commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Core helpers (service/helpers.py)

New functions in server.py

Feature gate (--enable-on-demand-loading)

/v1/models now lists all local cache (routes/models.py)

Anthropic route (routes/anthropic.py)

Bug fixed: __main__ module aliasing

Behaviour

Testing

Uh oh!

raullenchai commented May 10, 2026

P0 — blocker (PR is non-functional)

vllm_mlx.server does not export swap_to_model or get_loading_model

P0 — blocker (security / cost)

Auto-loading runs before _validate_model_name

P1 — should fix

P2 — nits

Verdict

Uh oh!

young310 commented May 11, 2026

Code review

Uh oh!

young310 commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

young310 commented May 9, 2026 •

edited

Loading

Core helpers (`service/helpers.py`)

New functions in `server.py`

Feature gate (`--enable-on-demand-loading`)

`/v1/models` now lists all local cache (`routes/models.py`)

Anthropic route (`routes/anthropic.py`)

Bug fixed: `main` module aliasing

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

Auto-loading runs before `_validate_model_name`