README: add model support table, hybrid SSM+Attention docs
Add tested model results (bitnet-2B, Qwen2.5-3B, Llama3-8B, Qwen3.5-9B)
with performance and quality notes. Document hybrid SSM architecture
support (Gated DeltaNet) for models like Qwen3.5.
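
For readers unfamiliar with the Gated DeltaNet layers mentioned above, here is a minimal sketch of one recurrent step of a gated delta rule. It assumes one common formulation of that rule, S ← α·(S − β·(S k) kᵀ) + β·v kᵀ with output o = S q. The function name, argument layout, and single-head, unnormalized form are illustrative assumptions and are not taken from this repository; the engine's real kernel may differ.

```c
#include <stddef.h>

/*
 * Hypothetical sketch (not this project's code): one token step of a
 * gated delta rule for a single head.
 *
 *   S <- alpha * (S - beta * (S k) k^T) + beta * v k^T
 *   o  = S q
 *
 * S     : row-major d_v x d_k recurrent state, updated in place
 * q, k  : query and key vectors of length d_k
 * v     : value vector of length d_v
 * alpha : decay gate in [0, 1]; beta : write strength in [0, 1]
 * o     : output vector of length d_v
 */
void gated_delta_step_sketch(float *S, size_t d_v, size_t d_k,
                             const float *q, const float *k, const float *v,
                             float alpha, float beta, float *o)
{
    for (size_t i = 0; i < d_v; i++) {
        float *row = S + i * d_k;

        /* (S k)_i: what the state currently returns for this key. */
        float sk = 0.0f;
        for (size_t j = 0; j < d_k; j++) {
            sk += row[j] * k[j];
        }

        /* Erase the old association, decay, then write the new value. */
        for (size_t j = 0; j < d_k; j++) {
            row[j] = alpha * (row[j] - beta * sk * k[j]) + beta * v[i] * k[j];
        }

        /* Read out with the query against the updated state. */
        float acc = 0.0f;
        for (size_t j = 0; j < d_k; j++) {
            acc += row[j] * q[j];
        }
        o[i] = acc;
    }
}
```

Unlike attention, this step keeps a fixed-size state per head instead of a growing KV cache, which is why a hybrid model needs two kinds of per-layer memory.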
README.md: 23 additions & 1 deletion
@@ -2,7 +2,7 @@

 A minimal, embeddable LLM inference engine in pure C11.

-Loads GGUF models and runs autoregressive text generation on CPU. Supports 20+ quantization formats — from ternary (I2_S, TQ1/TQ2) through k-quants (Q2–Q8) and imatrix (IQ2–IQ4) to unquantized (F16, BF16, F32). Inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c).
+Loads GGUF models and runs autoregressive text generation on CPU. Supports 20+ quantization formats — from ternary (I2_S, TQ1/TQ2) through k-quants (Q2–Q8) and imatrix (IQ2–IQ4) to unquantized (F16, BF16, F32). Handles both standard transformer and hybrid SSM+Attention architectures (Gated DeltaNet). Inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c).

 Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~8,000 lines of modular C.

@@ -12,6 +12,7 @@ Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, an
 - **GGUF model loading** — loads any GGUF file with supported tensor types
 - **20+ quantization formats** — ternary, k-quants, imatrix codebook, and unquantized (see table below)
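
As a rough illustration of the "GGUF model loading" feature listed above, the sketch below validates the fixed-size GGUF file header from an in-memory buffer. It assumes the GGUF version 2 or later layout (4-byte magic `GGUF`, little-endian `uint32` version, then `uint64` tensor count and `uint64` metadata key/value count). The struct and function names are hypothetical and are not this project's API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for this sketch; not the engine's own struct. */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header_sketch;

/* Read little-endian integers byte by byte so host endianness does not matter. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t read_u64_le(const unsigned char *p)
{
    return (uint64_t)read_u32_le(p) | ((uint64_t)read_u32_le(p + 4) << 32);
}

/*
 * Parse the 24-byte GGUF header (assumed v2+ layout).
 * Returns 0 on success, -1 on a short buffer, bad magic, or version < 2
 * (version 1 used 32-bit counts, which this sketch does not handle).
 */
int gguf_parse_header_sketch(const unsigned char *buf, size_t len,
                             gguf_header_sketch *out)
{
    if (buf == NULL || out == NULL || len < 24) {
        return -1;
    }
    if (memcmp(buf, "GGUF", 4) != 0) {
        return -1;
    }
    uint32_t version = read_u32_le(buf + 4);
    if (version < 2) {
        return -1;
    }
    out->version = version;
    out->tensor_count = read_u64_le(buf + 8);
    out->metadata_kv_count = read_u64_le(buf + 16);
    return 0;
}
```

A real loader would go on to read the metadata key/value pairs and tensor descriptors that follow the header, and would reject tensors whose type is not among the supported quantization formats.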