feat(default): make Phi-3.5-mini the recommended default model (#66)

unamedkr · claude · web-flow · commit eb4f7d174997 · 2026-04-12T12:07:47.000+09:00
Phi-3 architecture support landed in #65 and was validated end-to-end as the best speed/quality combo we ship (vocab 32K + 3.8B params makes the lm_head matmul the fastest of any registered model). Promote it to the default everywhere.

## Code

- `_MODEL_REGISTRY` reordered with Phi-3.5-mini first; comment block marks it as the default and explains the reasoning
- `cmd_chat_default` (no-subcommand chat) now picks Phi-3.5-mini
- Module docstring + `Model.from_pretrained` example use Phi-3.5-mini
- CLI `--help` epilog: examples lead with `phi-3.5-mini`, and the backwards-compat block now mentions `smollm2` / `llama3.2:1b` as alternatives

## Docs

- README.md: Quick Start now names Phi-3.5-mini as the recommended default; CLI examples and Python `from_pretrained` example updated. Benchmark/perf sections still reference SmolLM2/Llama models because those are historical measurement data.
- README.ko.md: same changes mirrored in Korean.
- bindings/python/README.md (PyPI README): replaced "Basic question answering" with "Quick start (auto-download)" using `from_pretrained`. Added a multi-turn chat example using `m.chat()` + KV cache reuse, and API reference entries for `Model.chat()` and `Model.from_pretrained()`.

## Verified

- ctest --test-dir build → 35/35 passed
- Full build clean (no new warnings)
- Phi-3.5-mini end-to-end inference test still produces coherent multi-paragraph output ("Name three planets..." → Earth, Mars, Jupiter with descriptions)
- `available_models()` returns Phi-3.5-mini in the list
- `MODEL_ALIASES['phi-3.5-mini']` and friends resolve correctly
- `cmd_chat_default` source confirms `args.model = "Phi-3.5-mini"`
- `quantcpp --help` epilog reflects the new defaults

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
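The vocab-size argument in the message can be made concrete with a rough cost model. This is a minimal sketch, not a measurement: it assumes the `lm_head` projection for one token is a `(1 x hidden) @ (hidden x vocab)` product, rounds the vocab sizes quoted in this commit to 32K / 49K / 128K, and holds the hidden size at a hypothetical 2048 for every model so that only the vocab term varies. Real models differ in hidden size as well, so actual ratios will not match these numbers exactly. The name `lm_head_macs` is illustrative and is not part of quantcpp.

```python
def lm_head_macs(hidden_size: int, vocab_size: int) -> int:
    """Multiply-accumulates needed to produce one token's logits."""
    return hidden_size * vocab_size


# Hypothetical hidden size, held constant so only vocab size varies.
HIDDEN = 2048

# Rounded vocab sizes as quoted in the commit message and docstrings.
VOCABS = {"32K": 32_000, "49K": 49_000, "128K": 128_000}

costs = {name: lm_head_macs(HIDDEN, vocab) for name, vocab in VOCABS.items()}
baseline = costs["32K"]

for name, macs in costs.items():
    ratio = macs / baseline
    print(f"vocab {name}: {macs:,} MACs/token ({ratio:.2f}x the 32K cost)")
```

At a fixed hidden size the per-token cost grows linearly with vocab size, which is the effect the registry comments point to when they describe the 128K Llama-3 vocab as slower.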
diff --git a/README.ko.md b/README.ko.md
@@ -28,28 +28,30 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b               # HuggingFace에서 다운로드
-quantcpp run llama3.2:1b                # 대화형 채팅
-quantcpp serve llama3.2:1b -p 8080      # OpenAI 호환 HTTP 서버 (SSE 스트리밍)
+quantcpp pull phi-3.5-mini              # HuggingFace에서 다운로드 (~2.4 GB)
+quantcpp run phi-3.5-mini               # 대화형 채팅
+quantcpp serve phi-3.5-mini -p 8080     # OpenAI 호환 HTTP 서버 (SSE 스트리밍)
 quantcpp client "안녕"                   # 스트리밍 클라이언트 → :8080 서버
 quantcpp list                           # 캐시된 모델 목록
 ```
 
-짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다 — 클라이언트가 `"stream": true`를 보내면 SSE 토큰 단위 스트리밍, 생략하면 단일 JSON 응답. 내장 `quantcpp client`는 두 모드 모두 지원 (기본: 스트리밍, `--no-stream`: 단일 응답).
+추천 기본 모델: **Phi-3.5-mini** (3.8B params, vocab 32K). registry의 모든 모델 중 가장 작은 vocab(32K)이라 토큰당 `lm_head` matmul이 가장 빠릅니다 — 노트북에서 속도와 품질의 최적 조합입니다. 다른 별칭: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. `run`/`serve` 첫 실행 시 자동 다운로드.
+
+`serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다 — 클라이언트가 `"stream": true`를 보내면 SSE 토큰 단위 스트리밍, 생략하면 단일 JSON 응답. 내장 `quantcpp client`는 두 모드 모두 지원 (기본: 스트리밍, `--no-stream`: 단일 응답).
 
 **한 줄 질문:**
 ```bash
-quantcpp run llama3.2:1b "중력이란 무엇인가요?"
+quantcpp run phi-3.5-mini "중력이란 무엇인가요?"
 ```
 
 **Python API (3줄):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("중력이란 무엇인가요?"))
 ```
 
-API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
+API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. 지원되는 architecture와 모델 선택 가이드는 [`docs/supported_models.md`](docs/supported_models.md)를 참고하세요. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
 
diff --git a/README.md b/README.md
@@ -41,28 +41,30 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b               # download from HuggingFace
-quantcpp run llama3.2:1b                # interactive chat
-quantcpp serve llama3.2:1b -p 8080      # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull phi-3.5-mini              # download from HuggingFace (~2.4 GB)
+quantcpp run phi-3.5-mini               # interactive chat
+quantcpp serve phi-3.5-mini -p 8080     # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                    # streaming client → server on :8080
 quantcpp list                           # show cached models
 ```
 
-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
+Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run` / `serve`.
+
+The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
 
 **One-shot question:**
 ```bash
-quantcpp run llama3.2:1b "What is gravity?"
+quantcpp run phi-3.5-mini "What is gravity?"
 ```
 
 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("What is gravity?"))
 ```
 
-Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. See [`docs/supported_models.md`](docs/supported_models.md) for the architecture support matrix and model selection guide. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
 
diff --git a/bindings/python/README.md b/bindings/python/README.md
@@ -33,14 +33,30 @@ pip install .
 
 ## Usage
 
-### Basic question answering
+### Quick start (auto-download)
 
 ```python
 from quantcpp import Model
 
+m = Model.from_pretrained("Phi-3.5-mini")  # ~2.4 GB, downloaded once and cached
+print(m.ask("What is 2+2?"))
+```
+
+`from_pretrained` accepts any name from `quantcpp.available_models()`.
+**Phi-3.5-mini** is the recommended default — 3.8B params with the smallest
+vocab (32K) in the registry, which makes the per-token `lm_head` matmul
+the fastest of any model we ship. Other ready-to-use names:
+
+- `SmolLM2-1.7B` — lightweight all-rounder (1.7 GB, vocab 49K)
+- `Llama-3.2-1B` — smallest download (750 MB) but slower at inference
+- `SmolLM2-135M` — 138 MB demo model, low quality
+- `Qwen3.5-0.8B`
+
+You can also load any local GGUF file directly:
+
+```python
 m = Model("model.gguf")
-answer = m.ask("What is 2+2?")
-print(answer)
+print(m.ask("What is 2+2?"))
 ```
 
 ### Streaming generation
@@ -50,10 +66,30 @@ for token in m.generate("Once upon a time"):
     print(token, end="", flush=True)
 ```
 
+### Multi-turn chat with KV cache reuse
+
+```python
+m = Model.from_pretrained("Phi-3.5-mini")
+history = ""
+while True:
+    user = input("\nYou: ")
+    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
+    print("AI: ", end="", flush=True)
+    reply = ""
+    for tok in m.chat(history):
+        print(tok, end="", flush=True)
+        reply += tok
+    history += reply + "<|end|>\n"
+```
+
+`m.chat()` reuses the KV cache across turns — turn N's prefill cost is
+O(new tokens), not O(history). Catch `quantcpp.ChatContextOverflow` if
+the conversation exceeds the model's context window.
+
 ### Context manager
 
 ```python
-with Model("model.gguf") as m:
+with Model.from_pretrained("Phi-3.5-mini") as m:
     print(m.ask("Explain gravity in one sentence"))
 ```
 
@@ -92,6 +128,12 @@ Load a GGUF model file and create an inference context.
 - `n_threads` -- CPU thread count.
 - `kv_compress` -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).
 
+### `Model.from_pretrained(name) -> Model`
+
+Download a registered model from HuggingFace (cached at
+`~/.cache/quantcpp/`) and return an open Model. See
+`quantcpp.available_models()` for the registry.
+
 ### `Model.ask(prompt) -> str`
 
 Generate a complete response. Returns the full text.
@@ -100,6 +142,14 @@ Generate a complete response. Returns the full text.
 
 Stream tokens one at a time. Yields individual token strings.
 
+### `Model.chat(prompt) -> Iterator[str]`
+
+Stream tokens with KV cache reuse across calls — turn N pays only for
+the new bytes since turn N-1. Pass `prompt=None` (or call
+`Model.reset_chat()`) to start a fresh session. Raises
+`quantcpp.ChatContextOverflow` when the history exceeds the model's
+context window (the C side has already auto-reset by then).
+
 ### `Model.close()`
 
 Release resources. Called automatically via `with` or garbage collection.
diff --git a/bindings/python/quantcpp/__init__.py b/bindings/python/quantcpp/__init__.py
@@ -4,14 +4,20 @@
 Quick start:
 
     from quantcpp import Model
-    m = Model.from_pretrained("SmolLM2-1.7B")
+    m = Model.from_pretrained("Phi-3.5-mini")
     print(m.ask("What is gravity?"))
 
 Model selection guide:
-    SmolLM2-1.7B  (1.7 GB, vocab 49K)  — recommended. ~12 tok/s on Apple M3.
-    Llama-3.2-1B  (750 MB, vocab 128K) — smaller download but slower
+    Phi-3.5-mini   (2.4 GB, vocab 32K)  — DEFAULT. 3.8B params with the
+                                          smallest lm_head in the registry,
+                                          producing the best speed/quality
+                                          combo. Coherent multi-paragraph
+                                          output even at Q4_K_M.
+    SmolLM2-1.7B   (1.7 GB, vocab 49K)  — lightweight all-rounder. ~12 tok/s
+                                          on Apple M3, smaller download.
+    Llama-3.2-1B   (750 MB, vocab 128K) — smallest download but slower
                                           due to large vocab (~2 tok/s on M3).
-    SmolLM2-135M  (138 MB, vocab 49K)  — demo only, low quality output.
+    SmolLM2-135M   (138 MB, vocab 49K)  — demo only, low quality output.
 
 Larger vocab = slower lm_head matmul → smaller params with smaller vocab
 often beats larger params with larger vocab. See docs/supported_models.md
@@ -65,47 +71,48 @@ class ChatContextOverflow(RuntimeError):
 # Verify both fields against the actual HuggingFace listing before
 # adding new entries — there is no integrity check at runtime.
 _MODEL_REGISTRY = {
-    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
-    # model is too small to produce coherent output for general chat.
-    # Listed only so users can verify the install/load path quickly.
-    "SmolLM2-135M": (
-        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
-        "smollm2-135m-instruct-q8_0.gguf",
-        135,
+    # ── DEFAULT ──
+    # Phi-3.5-mini-instruct (3.8B params, vocab 32K). Set as default on
+    # 2026-04-12 after end-to-end Phi-3 architecture support landed
+    # (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab is the
+    # smallest of the registry, which makes the lm_head matmul the
+    # fastest per-token. Combined with 3.8B params it produces the
+    # best quality-per-token of any model we ship.
+    "Phi-3.5-mini": (
+        "bartowski/Phi-3.5-mini-instruct-GGUF",
+        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
+        2400,
     ),
-    # Recommended default for first-time users on Apple Silicon / typical
-    # laptops. vocab 49K keeps the lm_head matmul small, so even on a
-    # mid-range M-series chip we measure ~12 tok/s — comfortable for
-    # interactive chat. Same llama arch family as SmolLM2-135M, so it
-    # exercises the most-tested code path.
+    # Lightweight all-rounder for users who want a smaller download
+    # than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
+    # on a mid-range M-series chip we measure ~12 tok/s — comfortable
+    # for interactive chat. Same llama arch family as SmolLM2-135M.
     "SmolLM2-1.7B": (
         "bartowski/SmolLM2-1.7B-Instruct-GGUF",
         "SmolLM2-1.7B-Instruct-Q8_0.gguf",
         1700,
     ),
-    "Qwen3.5-0.8B": (
-        "unsloth/Qwen3.5-0.8B-GGUF",
-        "Qwen3.5-0.8B-Q4_K_M.gguf",
-        508,
-    ),
-    # Smaller download than SmolLM2-1.7B but slower at inference time
-    # because of the 128K Llama-3 vocab (~5x slower lm_head matmul on M3).
-    # Kept in the registry for users who specifically want a Llama model.
+    # Smallest download in the "actually usable" tier. Slower at
+    # inference time because of the 128K Llama-3 vocab (~5x slower
+    # lm_head matmul on M3). Kept in the registry for users who
+    # specifically want a Llama model.
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
         "llama-3.2-1b-instruct-q4_k_m.gguf",
         750,
     ),
-    # Phi-3.5-mini-instruct (3.8B params, vocab 32K).
-    # Added 2026-04-12 after end-to-end Phi-3 architecture support
-    # landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
-    # is the smallest of the registry, which makes the lm_head matmul
-    # the fastest per-token. Combined with 3.8B params it's the best
-    # quality-per-token model we ship.
-    "Phi-3.5-mini": (
-        "bartowski/Phi-3.5-mini-instruct-GGUF",
-        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
-        2400,
+    "Qwen3.5-0.8B": (
+        "unsloth/Qwen3.5-0.8B-GGUF",
+        "Qwen3.5-0.8B-Q4_K_M.gguf",
+        508,
+    ),
+    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
+    # model is too small to produce coherent output for general chat.
+    # Listed only so users can verify the install/load path quickly.
+    "SmolLM2-135M": (
+        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
+        "smollm2-135m-instruct-q8_0.gguf",
+        135,
     ),
 }
 
@@ -208,9 +215,9 @@ class Model:
 
     Examples
     --------
-    >>> m = Model.from_pretrained("SmolLM2-1.7B")
+    >>> m = Model.from_pretrained("Phi-3.5-mini")
     >>> m.ask("What is gravity?")
-    'Gravity is a force that attracts ...'
+    'Gravity is a fundamental force that attracts ...'
 
     >>> with Model("model.gguf") as m:
     ...     for tok in m.generate("Once upon a time"):
diff --git a/bindings/python/quantcpp/cli.py b/bindings/python/quantcpp/cli.py
@@ -337,13 +337,16 @@ def cmd_client(args):
 
 
 def cmd_chat_default(args):
-    """Backwards-compatible default: auto-download SmolLM2-1.7B and chat.
-
-    Default switched from Llama-3.2-1B to SmolLM2-1.7B (2026-04-12) after
-    user feedback that Llama-3.2-1B's 128K vocab makes it ~5x slower at
-    interactive chat than SmolLM2-1.7B's 49K vocab on Apple Silicon.
+    """Backwards-compatible default: auto-download Phi-3.5-mini and chat.
+
+    Default progression:
+      Llama-3.2-1B → SmolLM2-1.7B (2026-04-12, vocab fix)
+                   → Phi-3.5-mini (2026-04-12, after Phi-3 arch support
+                     landed). Phi-3.5-mini has the smallest vocab in
+                     the registry (32K) AND 3.8B params, giving the
+                     best speed/quality combo we ship.
     """
-    args.model = args.model or "SmolLM2-1.7B"
+    args.model = args.model or "Phi-3.5-mini"
     args.threads = getattr(args, "threads", 4)
     args.max_tokens = getattr(args, "max_tokens", 256)
     args.temperature = getattr(args, "temperature", 0.7)
@@ -367,19 +370,20 @@ def main():
   client PROMPT         Send a request to a running serve (default: SSE streaming)
 
 examples:
-  quantcpp pull smollm2              # recommended: small vocab → fast
+  quantcpp pull phi-3.5-mini         # recommended default (32K vocab → fast)
   quantcpp list
-  quantcpp run smollm2
-  quantcpp run smollm2 "What is gravity?"
-  quantcpp serve smollm2 --port 8080
+  quantcpp run phi-3.5-mini
+  quantcpp run phi-3.5-mini "What is gravity?"
+  quantcpp serve phi-3.5-mini --port 8080
   quantcpp client "What is gravity?"                  # streams from :8080
   quantcpp client "Hi" --url http://localhost:8081
   quantcpp client "Hi" --no-stream                    # single JSON response
 
 backwards-compat (no subcommand):
-  quantcpp                          # default chat with SmolLM2-1.7B
+  quantcpp                          # default chat with Phi-3.5-mini
   quantcpp "What is gravity?"       # one-shot
-  quantcpp --model llama3.2:1b      # different model
+  quantcpp --model smollm2          # lightweight alternative
+  quantcpp --model llama3.2:1b      # smallest download
 """,
     )
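The defaulting pattern used in `cmd_chat_default` can be exercised in isolation. The sketch below is a hypothetical stand-in, not the function from `cli.py`: `apply_chat_defaults` and `DEFAULT_MODEL` are names invented for illustration, and only the four assignments mirror the lines shown in the diff.

```python
from argparse import Namespace

DEFAULT_MODEL = "Phi-3.5-mini"  # value taken from the diff above


def apply_chat_defaults(args: Namespace) -> Namespace:
    """Hypothetical stand-in for the defaulting lines in cmd_chat_default."""
    # `or` falls back when model is None or an empty string.
    args.model = args.model or DEFAULT_MODEL
    # getattr supplies the default only when the attribute is missing.
    args.threads = getattr(args, "threads", 4)
    args.max_tokens = getattr(args, "max_tokens", 256)
    args.temperature = getattr(args, "temperature", 0.7)
    return args


no_flag = apply_chat_defaults(Namespace(model=None))
explicit = apply_chat_defaults(Namespace(model="smollm2", threads=8))

print(no_flag.model, no_flag.threads)
print(explicit.model, explicit.threads)
```

One design point worth noting: `getattr(args, "threads", 4)` does not replace an attribute that exists but is `None`, whereas `args.model or ...` does, so the two idioms behave differently if the parser sets unspecified options to `None`.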