
Commit eb4f7d1

unamedkr and claude authored
feat(default): make Phi-3.5-mini the recommended default model (#66)
Phi-3 architecture support landed in #65 and has been validated end-to-end as the best speed/quality combo we ship (vocab 32K + 3.8B params makes the lm_head matmul the fastest of any registered model). Promote it to the default everywhere.

## Code

- `_MODEL_REGISTRY` reordered with Phi-3.5-mini first; comment block marks it as the default and explains the reasoning
- `cmd_chat_default` (no-subcommand chat) now picks Phi-3.5-mini
- Module docstring + `Model.from_pretrained` example use Phi-3.5-mini
- CLI `--help` epilog: examples lead with `phi-3.5-mini` and the backwards-compat block mentions `smollm2` / `llama3.2:1b` as alternatives instead

## Docs

- README.md: Quick Start now names Phi-3.5-mini as the recommended default; CLI examples and Python `from_pretrained` example updated. Benchmark/perf sections still reference SmolLM2/Llama models because those are historical measurement data.
- README.ko.md: same changes mirrored in Korean.
- bindings/python/README.md (PyPI README): replaced "Basic question answering" with "Quick start (auto-download)" using `from_pretrained`. Added a multi-turn chat example using `m.chat()` + KV cache reuse, and API reference entries for `Model.chat()` and `Model.from_pretrained()`.

## Verified

- ctest --test-dir build → 35/35 passed
- Full build clean (no new warnings)
- Phi-3.5-mini end-to-end inference test still produces coherent multi-paragraph output ("Name three planets..." → Earth, Mars, Jupiter with descriptions)
- `available_models()` returns Phi-3.5-mini in the list
- `MODEL_ALIASES['phi-3.5-mini']` and friends resolve correctly
- `cmd_chat_default` source confirms `args.model = "Phi-3.5-mini"`
- `quantcpp --help` epilog reflects the new defaults

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1e1ea2c commit eb4f7d1

5 files changed

Lines changed: 131 additions & 66 deletions


README.ko.md

Lines changed: 9 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -28,28 +28,30 @@
2828
```bash
2929
pip install quantcpp
3030

31-
quantcpp pull llama3.2:1b # download from HuggingFace
32-
quantcpp run llama3.2:1b # interactive chat
33-
quantcpp serve llama3.2:1b -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
31+
quantcpp pull phi-3.5-mini # download from HuggingFace (~2.4 GB)
32+
quantcpp run phi-3.5-mini # interactive chat
33+
quantcpp serve phi-3.5-mini -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
3434
quantcpp client "Hi" # streaming client → server on :8080
3535
quantcpp list # list cached models
3636
```
3737

38-
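Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`. `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080: when the client sends `"stream": true` it streams token by token over SSE, and when that field is omitted it returns a single JSON response. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream`: single response).
38+
Recommended default model: **Phi-3.5-mini** (3.8B params, vocab 32K). It has the smallest vocab (32K) of any model in the registry, so its per-token `lm_head` matmul is the fastest, which makes it the best combination of speed and quality on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-downloads on first `run`/`serve`.
39+
40+
`serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080: when the client sends `"stream": true` it streams token by token over SSE, and when that field is omitted it returns a single JSON response. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream`: single response).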
3941

4042
**One-shot question:**
4143
```bash
42-
quantcpp run llama3.2:1b "What is gravity?"
44+
quantcpp run phi-3.5-mini "What is gravity?"
4345
```
4446

4547
**Python API (3 lines):**
4648
```python
4749
from quantcpp import Model
48-
m = Model.from_pretrained("Llama-3.2-1B")
50+
m = Model.from_pretrained("Phi-3.5-mini")
4951
print(m.ask("What is gravity?"))
5052
```
5153

52-
No API key. No GPU. No configuration. Models are cached at `~/.cache/quantcpp/`. [Try it in your browser →](https://quantumaikr.github.io/quant.cpp/) · [**How It Works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
54+
No API key. No GPU. No configuration. Models are cached at `~/.cache/quantcpp/`. See [`docs/supported_models.md`](docs/supported_models.md) for the supported architectures and the model selection guide. [Try it in your browser →](https://quantumaikr.github.io/quant.cpp/) · [**How It Works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
5355

5456
---
5557

README.md

Lines changed: 9 additions & 7 deletions
@@ -41,28 +41,30 @@
4141
```bash
4242
pip install quantcpp
4343

44-
quantcpp pull llama3.2:1b # download from HuggingFace
45-
quantcpp run llama3.2:1b # interactive chat
46-
quantcpp serve llama3.2:1b -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
44+
quantcpp pull phi-3.5-mini # download from HuggingFace (~2.4 GB)
45+
quantcpp run phi-3.5-mini # interactive chat
46+
quantcpp serve phi-3.5-mini -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
4747
quantcpp client "Hi" # streaming client → server on :8080
4848
quantcpp list # show cached models
4949
```
5050

51-
Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
51+
Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run` / `serve`.
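The vocab-size argument above can be made concrete with a rough cost estimate. The sketch below counts multiply-accumulates for the final hidden-state-to-logits projection as `hidden_size * vocab_size`. The vocab figures are the rounded ones quoted in this README; the hidden sizes are assumptions that do not appear in this commit, and the estimate ignores quantization format and memory bandwidth, so treat it as an illustration rather than a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    hidden_size: int  # assumption: not stated anywhere in this commit
    vocab_size: int   # rounded vocab figure quoted in the README


def lm_head_macs(shape: ModelShape) -> int:
    """Multiply-accumulates to project one token's hidden state to logits."""
    return shape.hidden_size * shape.vocab_size


# Hypothetical shape table for three of the registered models.
SHAPES = [
    ModelShape("Phi-3.5-mini", 3072, 32_000),
    ModelShape("SmolLM2-1.7B", 2048, 49_000),
    ModelShape("Llama-3.2-1B", 2048, 128_000),
]

if __name__ == "__main__":
    # Print the models from cheapest to most expensive lm_head.
    for shape in sorted(SHAPES, key=lm_head_macs):
        macs_millions = lm_head_macs(shape) / 1e6
        print(f"{shape.name:14s} {macs_millions:8.1f} M MACs per token")
```

Because the cost is a product, a wider hidden state can offset part of the gain from a smaller vocab, which is why measured tokens per second remain the better guide when picking a model.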
52+
53+
The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
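For readers who want to call the endpoint without the built-in client, here is a minimal sketch using only the Python standard library. It assumes the server follows the usual OpenAI streaming shape (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); the README only states that the endpoint is OpenAI-compatible, so the exact chunk layout and whether a `model` field is required are assumptions. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterator, Optional


def build_chat_request(prompt: str, stream: bool = True,
                       model: Optional[str] = None) -> bytes:
    """Encode a minimal OpenAI-style chat completion request body."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if stream:
        body["stream"] = True  # omit the field for a single JSON response
    if model is not None:
        body["model"] = model
    return json.dumps(body).encode("utf-8")


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # assumed end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or [{}]
    return choices[0].get("delta", {}).get("content")


def stream_chat(prompt: str,
                url: str = "http://localhost:8080/v1/chat/completions") -> Iterator[str]:
    """Yield text deltas from a running `quantcpp serve` instance."""
    request = urllib.request.Request(
        url,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            text = parse_sse_line(raw_line.decode("utf-8"))
            if text:
                yield text
```

With a server listening on port 8080, `for piece in stream_chat("Hi"): print(piece, end="", flush=True)` would print the reply as it arrives.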
5254

5355
**One-shot question:**
5456
```bash
55-
quantcpp run llama3.2:1b "What is gravity?"
57+
quantcpp run phi-3.5-mini "What is gravity?"
5658
```
5759

5860
**Python API (3 lines):**
5961
```python
6062
from quantcpp import Model
61-
m = Model.from_pretrained("Llama-3.2-1B")
63+
m = Model.from_pretrained("Phi-3.5-mini")
6264
print(m.ask("What is gravity?"))
6365
```
6466

65-
Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
67+
Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. See [`docs/supported_models.md`](docs/supported_models.md) for the architecture support matrix and model selection guide. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
6668

6769
---
6870

bindings/python/README.md

Lines changed: 54 additions & 4 deletions
@@ -33,14 +33,30 @@ pip install .
3333

3434
## Usage
3535

36-
### Basic question answering
36+
### Quick start (auto-download)
3737

3838
```python
3939
from quantcpp import Model
4040

41+
m = Model.from_pretrained("Phi-3.5-mini") # ~2.4 GB, downloaded once and cached
42+
print(m.ask("What is 2+2?"))
43+
```
44+
45+
`from_pretrained` accepts any name from `quantcpp.available_models()`.
46+
**Phi-3.5-mini** is the recommended default — 3.8B params with the smallest
47+
vocab (32K) in the registry, which makes the per-token `lm_head` matmul
48+
the fastest of any model we ship. Other ready-to-use names:
49+
50+
- `SmolLM2-1.7B` — lightweight all-rounder (1.7 GB, vocab 49K)
51+
- `Llama-3.2-1B` — smallest download (750 MB) but slower at inference
52+
- `SmolLM2-135M` — 138 MB demo model, low quality
53+
- `Qwen3.5-0.8B`
54+
55+
You can also load any local GGUF file directly:
56+
57+
```python
4158
m = Model("model.gguf")
42-
answer = m.ask("What is 2+2?")
43-
print(answer)
59+
print(m.ask("What is 2+2?"))
4460
```
4561

4662
### Streaming generation
@@ -50,10 +66,30 @@ for token in m.generate("Once upon a time"):
5066
    print(token, end="", flush=True)
5167
```
5268

69+
### Multi-turn chat with KV cache reuse
70+
71+
```python
72+
m = Model.from_pretrained("Phi-3.5-mini")
73+
history = ""
74+
while True:
75+
    user = input("\nYou: ")
76+
    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
77+
    print("AI: ", end="", flush=True)
78+
    reply = ""
79+
    for tok in m.chat(history):
80+
        print(tok, end="", flush=True)
81+
        reply += tok
82+
    history += reply + "<|end|>\n"
83+
```
84+
85+
`m.chat()` reuses the KV cache across turns — turn N's prefill cost is
86+
O(new tokens), not O(history). Catch `quantcpp.ChatContextOverflow` if
87+
the conversation exceeds the model's context window.
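One way to handle the overflow case is sketched below. The README says only that `quantcpp.ChatContextOverflow` is raised; the recovery policy here (drop the old history and retry with just the latest user message) is an assumption, and `format_user_turn` and `chat_turn` are hypothetical helpers, not part of the quantcpp API. The chat function is passed in as a parameter so the logic can be exercised without a model loaded.

```python
from typing import Callable, Iterable, Tuple, Type


def format_user_turn(history: str, user_text: str) -> str:
    """Append one user message using the same tags as the chat loop above."""
    return history + f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"


def chat_turn(
    chat_fn: Callable[[str], Iterable[str]],
    history: str,
    user_text: str,
    overflow_error: Type[Exception] = RuntimeError,
) -> Tuple[str, str]:
    """Run one turn and return (updated_history, reply)."""
    prompt = format_user_turn(history, user_text)
    try:
        reply = "".join(chat_fn(prompt))
    except overflow_error:
        # Assumed recovery policy: discard the old history and retry with
        # only the latest user message.
        prompt = format_user_turn("", user_text)
        reply = "".join(chat_fn(prompt))
    return prompt + reply + "<|end|>\n", reply
```

With a loaded model this would be called as `history, reply = chat_turn(m.chat, history, user, overflow_error=quantcpp.ChatContextOverflow)`. Passing the specific exception class matters, because the default `RuntimeError` would also catch unrelated failures.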
88+
5389
### Context manager
5490

5591
```python
56-
with Model("model.gguf") as m:
92+
with Model.from_pretrained("Phi-3.5-mini") as m:
5793
    print(m.ask("Explain gravity in one sentence"))
5894
```
5995

@@ -92,6 +128,12 @@ Load a GGUF model file and create an inference context.
92128
- `n_threads` -- CPU thread count.
93129
- `kv_compress` -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).
94130

131+
### `Model.from_pretrained(name) -> Model`
132+
133+
Download a registered model from HuggingFace (cached at
134+
`~/.cache/quantcpp/`) and return an open Model. See
135+
`quantcpp.available_models()` for the registry.
136+
95137
### `Model.ask(prompt) -> str`
96138

97139
Generate a complete response. Returns the full text.
@@ -100,6 +142,14 @@ Generate a complete response. Returns the full text.
100142

101143
Stream tokens one at a time. Yields individual token strings.
102144

145+
### `Model.chat(prompt) -> Iterator[str]`
146+
147+
Stream tokens with KV cache reuse across calls — turn N pays only for
148+
the new bytes since turn N-1. Pass `prompt=None` (or call
149+
`Model.reset_chat()`) to start a fresh session. Raises
150+
`quantcpp.ChatContextOverflow` when the history exceeds the model's
151+
context window (the C side has already auto-reset by then).
152+
103153
### `Model.close()`
104154

105155
Release resources. Called automatically via `with` or garbage collection.

bindings/python/quantcpp/__init__.py

Lines changed: 43 additions & 36 deletions
@@ -4,14 +4,20 @@
44
Quick start:
55
66
from quantcpp import Model
7-
m = Model.from_pretrained("SmolLM2-1.7B")
7+
m = Model.from_pretrained("Phi-3.5-mini")
88
print(m.ask("What is gravity?"))
99
1010
Model selection guide:
11-
SmolLM2-1.7B (1.7 GB, vocab 49K) — recommended. ~12 tok/s on Apple M3.
12-
Llama-3.2-1B (750 MB, vocab 128K) — smaller download but slower
11+
Phi-3.5-mini (2.4 GB, vocab 32K) — DEFAULT. 3.8B params with the
12+
smallest lm_head in the registry,
13+
producing the best speed/quality
14+
combo. Coherent multi-paragraph
15+
output even at Q4_K_M.
16+
SmolLM2-1.7B (1.7 GB, vocab 49K) — lightweight all-rounder. ~12 tok/s
17+
on Apple M3, smaller download.
18+
Llama-3.2-1B (750 MB, vocab 128K) — smallest download but slower
1319
due to large vocab (~2 tok/s on M3).
14-
SmolLM2-135M (138 MB, vocab 49K) — demo only, low quality output.
20+
SmolLM2-135M (138 MB, vocab 49K) — demo only, low quality output.
1521
1622
Larger vocab = slower lm_head matmul → smaller params with smaller vocab
1723
often beats larger params with larger vocab. See docs/supported_models.md
@@ -65,47 +71,48 @@ class ChatContextOverflow(RuntimeError):
6571
# Verify both fields against the actual HuggingFace listing before
6672
# adding new entries — there is no integrity check at runtime.
6773
_MODEL_REGISTRY = {
68-
# 138 MB demo model. Tokenizer + arch are llama-compatible but the
69-
# model is too small to produce coherent output for general chat.
70-
# Listed only so users can verify the install/load path quickly.
71-
"SmolLM2-135M": (
72-
"Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
73-
"smollm2-135m-instruct-q8_0.gguf",
74-
135,
74+
# ── DEFAULT ──
75+
# Phi-3.5-mini-instruct (3.8B params, vocab 32K). Set as default on
76+
# 2026-04-12 after end-to-end Phi-3 architecture support landed
77+
# (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab is the
78+
# smallest of the registry, which makes the lm_head matmul the
79+
# fastest per-token. Combined with 3.8B params it produces the
80+
# best quality-per-token of any model we ship.
81+
"Phi-3.5-mini": (
82+
"bartowski/Phi-3.5-mini-instruct-GGUF",
83+
"Phi-3.5-mini-instruct-Q4_K_M.gguf",
84+
2400,
7585
),
76-
# Recommended default for first-time users on Apple Silicon / typical
77-
# laptops. vocab 49K keeps the lm_head matmul small, so even on a
78-
# mid-range M-series chip we measure ~12 tok/s — comfortable for
79-
# interactive chat. Same llama arch family as SmolLM2-135M, so it
80-
# exercises the most-tested code path.
86+
# Lightweight all-rounder for users who want a smaller download
87+
# than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
88+
# on a mid-range M-series chip we measure ~12 tok/s — comfortable
89+
# for interactive chat. Same llama arch family as SmolLM2-135M.
8190
"SmolLM2-1.7B": (
8291
"bartowski/SmolLM2-1.7B-Instruct-GGUF",
8392
"SmolLM2-1.7B-Instruct-Q8_0.gguf",
8493
1700,
8594
),
86-
"Qwen3.5-0.8B": (
87-
"unsloth/Qwen3.5-0.8B-GGUF",
88-
"Qwen3.5-0.8B-Q4_K_M.gguf",
89-
508,
90-
),
91-
# Smaller download than SmolLM2-1.7B but slower at inference time
92-
# because of the 128K Llama-3 vocab (~5x slower lm_head matmul on M3).
93-
# Kept in the registry for users who specifically want a Llama model.
95+
# Smallest download in the "actually usable" tier. Slower at
96+
# inference time because of the 128K Llama-3 vocab (~5x slower
97+
# lm_head matmul on M3). Kept in the registry for users who
98+
# specifically want a Llama model.
9499
"Llama-3.2-1B": (
95100
"hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
96101
"llama-3.2-1b-instruct-q4_k_m.gguf",
97102
750,
98103
),
99-
# Phi-3.5-mini-instruct (3.8B params, vocab 32K).
100-
# Added 2026-04-12 after end-to-end Phi-3 architecture support
101-
# landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
102-
# is the smallest of the registry, which makes the lm_head matmul
103-
# the fastest per-token. Combined with 3.8B params it's the best
104-
# quality-per-token model we ship.
105-
"Phi-3.5-mini": (
106-
"bartowski/Phi-3.5-mini-instruct-GGUF",
107-
"Phi-3.5-mini-instruct-Q4_K_M.gguf",
108-
2400,
104+
"Qwen3.5-0.8B": (
105+
"unsloth/Qwen3.5-0.8B-GGUF",
106+
"Qwen3.5-0.8B-Q4_K_M.gguf",
107+
508,
108+
),
109+
# 138 MB demo model. Tokenizer + arch are llama-compatible but the
110+
# model is too small to produce coherent output for general chat.
111+
# Listed only so users can verify the install/load path quickly.
112+
"SmolLM2-135M": (
113+
"Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
114+
"smollm2-135m-instruct-q8_0.gguf",
115+
135,
109116
),
110117
}
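To illustrate how a `(repo, filename, size_mb)` entry could be turned into a download location, here is a small sketch. It copies two entries from the registry above and builds URLs with the public HuggingFace `resolve/main` pattern. The alias table and the `resolve` helper are hypothetical: the commit mentions `MODEL_ALIASES` but does not show its contents, and the real download code is not part of this diff.

```python
from typing import Dict, Tuple

# Two entries copied from the registry in this diff: (repo, filename, size in MB).
REGISTRY: Dict[str, Tuple[str, str, int]] = {
    "Phi-3.5-mini": (
        "bartowski/Phi-3.5-mini-instruct-GGUF",
        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
        2400,
    ),
    "SmolLM2-1.7B": (
        "bartowski/SmolLM2-1.7B-Instruct-GGUF",
        "SmolLM2-1.7B-Instruct-Q8_0.gguf",
        1700,
    ),
}

# Hypothetical alias table; the real MODEL_ALIASES mapping is not shown here.
ALIASES: Dict[str, str] = {
    "phi-3.5-mini": "Phi-3.5-mini",
    "smollm2": "SmolLM2-1.7B",
}


def resolve(name: str) -> Tuple[str, str, int]:
    """Map a registry name or alias to (canonical_name, download_url, size_mb)."""
    canonical = ALIASES.get(name.lower(), name)
    if canonical not in REGISTRY:
        raise KeyError(f"unknown model: {name!r}")
    repo, filename, size_mb = REGISTRY[canonical]
    url = f"https://huggingface.co/{repo}/resolve/main/{filename}"
    return canonical, url, size_mb
```

Since Python dictionaries preserve insertion order, placing Phi-3.5-mini first also makes it the first name seen by any code that iterates over the registry.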
111118

@@ -208,9 +215,9 @@ class Model:
208215
209216
Examples
210217
--------
211-
>>> m = Model.from_pretrained("SmolLM2-1.7B")
218+
>>> m = Model.from_pretrained("Phi-3.5-mini")
212219
>>> m.ask("What is gravity?")
213-
'Gravity is a force that attracts ...'
220+
'Gravity is a fundamental force that attracts ...'
214221
215222
>>> with Model("model.gguf") as m:
216223
... for tok in m.generate("Once upon a time"):

bindings/python/quantcpp/cli.py

Lines changed: 16 additions & 12 deletions
@@ -337,13 +337,16 @@ def cmd_client(args):
337337

338338

339339
def cmd_chat_default(args):
340-
"""Backwards-compatible default: auto-download SmolLM2-1.7B and chat.
341-
342-
Default switched from Llama-3.2-1B to SmolLM2-1.7B (2026-04-12) after
343-
user feedback that Llama-3.2-1B's 128K vocab makes it ~5x slower at
344-
interactive chat than SmolLM2-1.7B's 49K vocab on Apple Silicon.
340+
"""Backwards-compatible default: auto-download Phi-3.5-mini and chat.
341+
342+
Default progression:
343+
Llama-3.2-1B → SmolLM2-1.7B (2026-04-12, vocab fix)
344+
→ Phi-3.5-mini (2026-04-12, after Phi-3 arch support
345+
landed). Phi-3.5-mini has the smallest vocab in
346+
the registry (32K) AND 3.8B params, giving the
347+
best speed/quality combo we ship.
345348
"""
346-
args.model = args.model or "SmolLM2-1.7B"
349+
args.model = args.model or "Phi-3.5-mini"
347350
args.threads = getattr(args, "threads", 4)
348351
args.max_tokens = getattr(args, "max_tokens", 256)
349352
args.temperature = getattr(args, "temperature", 0.7)
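The defaulting pattern used in `cmd_chat_default` can be tried in isolation. The sketch below is a standalone approximation, not the project's code: `apply_chat_defaults` is a hypothetical name, and it reads `model` through `getattr` so that it also works on a namespace without that attribute. It shows the difference between the two idioms: `or` replaces a value that is present but empty or `None`, while the `getattr` default applies only when the attribute is missing entirely.

```python
import argparse

DEFAULT_MODEL = "Phi-3.5-mini"


def apply_chat_defaults(args: argparse.Namespace) -> argparse.Namespace:
    """Fill in chat defaults for options the user did not supply."""
    # `or` falls back when --model was omitted (None) or given as "".
    args.model = getattr(args, "model", None) or DEFAULT_MODEL
    # getattr's default is used only when the attribute does not exist.
    args.threads = getattr(args, "threads", 4)
    args.max_tokens = getattr(args, "max_tokens", 256)
    args.temperature = getattr(args, "temperature", 0.7)
    return args
```

For example, `apply_chat_defaults(argparse.Namespace(model=None))` ends up with `model` set to `"Phi-3.5-mini"`, while a namespace created with `model="smollm2"` keeps its value.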
@@ -367,19 +370,20 @@ def main():
367370
client PROMPT Send a request to a running serve (default: SSE streaming)
368371
369372
examples:
370-
quantcpp pull smollm2 # recommended: small vocab → fast
373+
quantcpp pull phi-3.5-mini # recommended default (32K vocab → fast)
371374
quantcpp list
372-
quantcpp run smollm2
373-
quantcpp run smollm2 "What is gravity?"
374-
quantcpp serve smollm2 --port 8080
375+
quantcpp run phi-3.5-mini
376+
quantcpp run phi-3.5-mini "What is gravity?"
377+
quantcpp serve phi-3.5-mini --port 8080
375378
quantcpp client "What is gravity?" # streams from :8080
376379
quantcpp client "Hi" --url http://localhost:8081
377380
quantcpp client "Hi" --no-stream # single JSON response
378381
379382
backwards-compat (no subcommand):
380-
quantcpp # default chat with SmolLM2-1.7B
383+
quantcpp # default chat with Phi-3.5-mini
381384
quantcpp "What is gravity?" # one-shot
382-
quantcpp --model llama3.2:1b # different model
385+
quantcpp --model smollm2 # lightweight alternative
386+
quantcpp --model llama3.2:1b # smallest download
383387
""",
384388
)
385389
