Skip to content

Commit 48a9b03

Browse files
committed
README: add model support table, hybrid SSM+Attention docs
Add tested model results (bitnet-2B, Qwen2.5-3B, Llama3-8B, Qwen3.5-9B) with performance and quality notes. Document hybrid SSM architecture support (Gated DeltaNet) for models like Qwen3.5.
1 parent 77c322e commit 48a9b03

1 file changed

Lines changed: 23 additions & 1 deletion

File tree

README.md

Lines changed: 23 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
A minimal, embeddable LLM inference engine in pure C11.
44

5-
Loads GGUF models and runs autoregressive text generation on CPU. Supports 20+ quantization formats — from ternary (I2_S, TQ1/TQ2) through k-quants (Q2–Q8) and imatrix (IQ2–IQ4) to unquantized (F16, BF16, F32). Inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c).
5+
Loads GGUF models and runs autoregressive text generation on CPU. Supports 20+ quantization formats — from ternary (I2_S, TQ1/TQ2) through k-quants (Q2–Q8) and imatrix (IQ2–IQ4) to unquantized (F16, BF16, F32). Handles both standard transformer and hybrid SSM+Attention architectures (Gated DeltaNet). Inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c).
66

77
Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~8,000 lines of modular C.
88

@@ -12,6 +12,7 @@ Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, an
1212
- **GGUF model loading** — loads any GGUF file with supported tensor types
1313
- **20+ quantization formats** — ternary, k-quants, imatrix codebook, and unquantized (see table below)
1414
- **Full transformer forward pass** — RoPE, GQA, RMSNorm, sub-norms, tied/untied embeddings
15+
- **Hybrid SSM + Attention** — Gated DeltaNet SSM layers (conv1d, SiLU, delta rule recurrence) alongside standard GQA attention layers
1516
- **Flash GQA attention** — online softmax with KV-head grouping, single-pass over KV cache
1617
- **Optional F16 KV cache**`--kv16` halves attention DRAM bandwidth with minimal precision loss
1718
- **4 SIMD backends** — ARM NEON/SDOT, AVX2, WASM SIMD128, scalar fallback (auto-selected at compile time)
@@ -222,6 +223,27 @@ Requires [Emscripten](https://emscripten.org/):
222223
# Open wasm/index.html in a browser
223224
```
224225

226+
## Model Support
227+
228+
Tested models with generation quality and performance on Apple M1 Max (8 P-cores, 32 GB), release build, greedy decoding, 8 threads:
229+
230+
| Model | Size | Quant | Architecture | tok/s | Quality |
231+
|-------|------|-------|--------------|-------|---------|
232+
| bitnet-b1.58-2B-4T | 1.1 GB | I2_S (ternary) | Transformer | **29–41** | Factual, correct code |
233+
| Qwen2.5-3B-Instruct | 1.7 GB | Q4_0 | Transformer | **23–25** | Coherent, instruct-following |
234+
| Llama3-8B-1.58 | 3.3 GB | TQ1_0 (ternary) | Transformer | **8–9** | Basic completion |
235+
| Qwen3.5-9B | 5.3 GB | Q4_K_M (mixed) | Hybrid SSM+Attention | **2.8–3.1** | Best quality, chain-of-thought |
236+
237+
Any GGUF model using supported weight types works — these are just the tested configurations.
238+
239+
### Hybrid SSM + Attention
240+
241+
bitnet.c supports hybrid architectures that mix SSM (state space model) layers with standard attention layers, such as Qwen3.5's Gated DeltaNet:
242+
243+
- **SSM layers**: Conv1d (kernel=4) → SiLU → L2-norm Q/K → delta rule recurrence → per-head RMSNorm → SiLU gate → output projection
244+
- **Attention layers**: Standard GQA with RoPE, flash attention, KV cache
245+
- **Mixed layout**: The model config specifies which layers use SSM vs attention
246+
225247
## Performance
226248

227249
Measured on Apple M1 Max (8 P-cores, 32 GB), PGO build, greedy decoding, 8 threads. llama.cpp b8320 via Homebrew.

0 commit comments

Comments
 (0)