feat: on-demand model loading for all inference endpoints (ollama-style) #340
Conversation
Adds ollama-style auto-loading: when a request specifies a model that isn't currently loaded, the server swaps to it automatically (pulling from HuggingFace if needed) instead of returning 404.

Previously only the chat endpoint had partial on-demand loading, but it was gated behind `if cfg.model_registry:`, which meant single-model mode (the common case) silently fell through to a 404. The completions and Anthropic endpoints had no auto-loading at all.

Changes:

- Add `_is_model_loaded()` helper that checks both single-model and multi-model (registry) modes correctly
- Add `ensure_model_loaded()` async helper that calls `swap_to_model()` when the requested model isn't loaded; returns 503 + Retry-After if a different model swap is already in progress
- Wire `ensure_model_loaded()` into `/v1/chat/completions`, `/v1/completions`, and `/v1/messages` before `_validate_model_name()`

Tested locally: server starts with model A, request with model B causes automatic swap, response returns from model B.
Thanks for the PR @young310 — "drop-in Ollama replacement" auto-load is a real gap and the helper-extraction shape is exactly the right architecture. Unfortunately the PR doesn't run as-is. Flagging blockers below.

P0 — blocker (PR is non-functional)
Code review

Found 1 issue:
…g bug

Resolves PR raullenchai#340

P0 blockers:

1. Implements missing `swap_to_model` and `get_loading_model` in server.py, with asyncio.Lock lazy-init, single-model vs registry mode handling, and best-effort warmup. Previously any on-demand load attempt raised ImportError.
2. Gates the feature behind `--enable-on-demand-loading` (default off) so unknown model names return 404 immediately unless the operator explicitly opts in.
3. Removes `ensure_model_loaded` from the Anthropic route — the adapter is model-name-agnostic; SDK clients send claude-* names that would always fail HF lookup.
4. Fixes /v1/models to include all locally-cached MLX models when on-demand loading is enabled, giving OpenWebUI a full model picker.
5. Fixes a `__main__` module aliasing bug: running `-m vllm_mlx.server` registers the module as `__main__`, but `from ..server import swap_to_model` in helpers.py re-imports `vllm_mlx.server` as a fresh instance with `_enable_on_demand_loading = False`. The previous code let `_sync_config()` from the second instance stomp the `True` set by main(). Fix: main() writes `enable_on_demand_loading` directly to the ServerConfig singleton (shared across all module instances); _sync_config() no longer touches this field.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
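The `__main__` aliasing bug in item 5 is a general Python pitfall, and it can be demonstrated without the project at all. The sketch below loads one file under two module names (mirroring `python -m pkg.server` plus a later `from ..server import …`) to show that a module-level flag set on one copy is invisible to the other, while a config object registered once in `sys.modules` is shared by both. All names here (`pkg_config`, `ENABLED`) are illustrative stand-ins, not the project's.

```python
# Demonstration of the module-aliasing bug class: the same file loaded under
# two names yields two module objects with independent globals.
import importlib.util
import os
import sys
import tempfile
import types

# Stand-in for vllm_mlx.config: created once and registered in sys.modules,
# so every copy of the server module that does `import pkg_config` shares it.
config = types.ModuleType("pkg_config")
config.enable_on_demand_loading = False
sys.modules["pkg_config"] = config

SRC = "import pkg_config\nENABLED = False\n"


def load_as(name, path):
    """Load the file at `path` under an arbitrary module name (uncached)."""
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "server.py")
    with open(path, "w") as f:
        f.write(SRC)
    main_copy = load_as("main_copy", path)     # plays the role of __main__
    import_copy = load_as("pkg.server", path)  # the re-imported instance

# Bug: the module-level flag lives separately in each copy...
main_copy.ENABLED = True  # import_copy.ENABLED stays False

# Fix: write to the shared singleton instead — both copies see the change.
main_copy.pkg_config.enable_on_demand_loading = True
```

Because both copies resolve `import pkg_config` through `sys.modules`, they end up holding the very same config object, which is why writing the flag there survives the double import.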
@raullenchai I added some more tests, please have a look, thank you
Problem
When a request specifies a model that isn't currently loaded, all three inference endpoints return 404 instead of loading the model automatically. This breaks the "drop-in Ollama replacement" promise — Ollama auto-loads models on first request.
The chat endpoint had a partial fix gated behind `if cfg.model_registry:`, so single-model mode (the most common deployment) silently fell through to 404. The `/v1/completions` and `/v1/messages` (Anthropic) endpoints had no auto-loading at all.

Solution
Core helpers (`service/helpers.py`)

- `_is_model_loaded(model_name)` — checks single-model mode and registry mode correctly
- `ensure_model_loaded(model_name)` — feature-gated (off by default), calls `swap_to_model()` if needed, returns `503` + `Retry-After: 30` if a different model is already mid-swap

New functions in `server.py`

- `_get_swap_lock()` — lazy `asyncio.Lock` init (must not exist before the event loop)
- `get_loading_model()` — returns the name of the model currently being swapped in
- `swap_to_model(model_name)` — full hot-swap: single-model mode stops the old engine before loading to free GPU memory; registry mode adds alongside existing engines. Serialised by the lock so concurrent requests for the same unloaded model coalesce instead of double-loading

Feature gate (`--enable-on-demand-loading`)

Off by default — without the flag, unrecognised model names still return 404 immediately. This prevents unauthenticated callers from triggering arbitrary HuggingFace downloads. Recommended to pair with `--api-key` in production.

`/v1/models` now lists all local cache (`routes/models.py`)

When `--enable-on-demand-loading` is active, `GET /v1/models` scans `~/.cache/huggingface/hub/` and surfaces every locally-cached MLX model (`.safetensors`/`.npz`). Non-chat models (TTS, Whisper, embeddings) are filtered out. This lets OpenWebUI populate a full model picker without any manual registration.

Anthropic route (`routes/anthropic.py`)

Removed the `ensure_model_loaded` call added by the original commit. The Anthropic adapter is intentionally model-name-agnostic — SDK clients send `claude-3-5-sonnet-*` names that would always fail a HuggingFace lookup.

Bug fixed: `__main__` module aliasing
python3 -m vllm_mlx.serverregisters the module as__main__, notvllm_mlx.server. Whenhelpers.pydoesfrom ..server import swap_to_model, Python doesn't findvllm_mlx.serverinsys.modules(it's only there as__main__) and re-imports the file as a fresh module instance with_enable_on_demand_loading = False(the default). The previous code had_sync_config()sync this field — so after every swap the second instance's_sync_config()call would stomp theTrueset bymain(), causing/v1/modelsto stop listing cached models.Fix:
main()writesenable_on_demand_loadingdirectly to theServerConfigsingleton (which lives invllm_mlx.config.server_configand is shared across all module instances)._sync_config()no longer touches this field.Behaviour
- `/v1/completions` with unloaded model
- `/v1/messages` with unloaded model
- `/v1/models` after a swap
Tested end-to-end on macOS (Apple Silicon, Python 3.14), with OpenWebUI as the client:
Unit tests: `pytest tests/` (460 passed; a pre-existing async fixture issue is unrelated to this PR)