Update docs: add inference pipeline reference, actualize roadmap and README
- Add docs/inference.md: mathematical description of the full inference
pipeline (tokenization, embedding, RMSNorm, ternary matvec, RoPE, KV
cache, GQA attention, FFN, logits, sampling) with code pointers
- Update roadmap: add Phase 6 (modular backends) and Phase 7 (extended
quant formats), update optimization history
- Update README: accurate line count (8K), add Q3_K/Q4_K/Q5_K/Q8_K to
format list, rewrite project structure tree for new module layout
- Remove stale docs/audit.md
README.md: 53 additions & 47 deletions
@@ -4,13 +4,13 @@ A minimal, embeddable LLM inference engine in pure C11.
Started as a clean-room inference engine for Microsoft's [BitNet b1.58](https://arxiv.org/abs/2402.17764) ternary models, inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c). Now supports standard GGUF quantization formats (Q4_0, Q8_0) alongside ternary (I2_S, TQ1_0, TQ2_0) — covering most small language models on HuggingFace.
- Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~4,500 lines of modular C.
+ Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~8,000 lines of modular C.
## Features
  - **Pure C11** — no C++, no frameworks, no dependencies beyond libc and libm
  - **GGUF model loading** — loads any GGUF file with supported tensor types
  - **Flash GQA attention** — online softmax with KV-head grouping, single-pass over KV cache
  - **Optional F16 KV cache** — `--kv16` halves attention DRAM bandwidth with minimal precision loss
@@ -41,7 +41,7 @@ Auto-selected at compile time based on target architecture.
|[Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF)| 0.5B | Q4_0 + Q8_0 | Working |
|[Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF)| 3B | Q4_0 + Q6_K | Working |
- Models must use only supported weight types (Q4_0, Q6_K, Q8_0, I2_S, TQ1_0, TQ2_0, F16). GGUF files with other k-quant types (Q4_K, Q5_K, etc.) are not yet supported.
+ Models must use only supported weight types (I2_S, TQ1_0, TQ2_0, Q3_K, Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, Q8_K, F16, F32).
## Quick Start
@@ -116,46 +116,48 @@ The `model/` directory is git-ignored — model files won't be committed.
@@ -287,26 +293,26 @@ llama.cpp supports dozens of quantization formats, model architectures, GPU back
bitnet.c exists because BitNet's ternary weights ({-1, 0, +1}) make most of that machinery irrelevant:
  - **No GPU needed.** Ternary matvec is memory-bandwidth-bound, not compute-bound. A single M1 Max CPU core sustains ~52 tok/s — there's no FLOPs deficit to offload. GPU dispatch overhead and PCIe transfers would add latency for zero throughput gain.
- - **No quantization format zoo.** The model has one weight type (I2_S ternary). llama.cpp's `ggml_compute_forward` dispatches through dozens of quant format×operation combinations. bitnet.c has one kernel.
+ - **Minimal format support.** bitnet.c supports ~10 quant formats vs llama.cpp's dozens of format x operation x backend combinations.
  - **No abstraction layers.** llama.cpp routes tensor operations through GGML's graph-based backend abstraction. bitnet.c calls the matvec kernel directly — the forward pass is a flat loop over layers with inline SIMD.
- - **Embeddable.** The entire engine is ~4,500 lines of C11 with zero dependencies. It compiles to WASM and runs in a browser. Try that with llama.cpp's Metal backend.
+ - **Embeddable.** The entire engine is ~8,000 lines of C11 with zero dependencies. It compiles to WASM and runs in a browser. Try that with llama.cpp's Metal backend.
Microsoft's own [BitNet inference framework](https://github.com/microsoft/BitNet) takes the opposite approach: it forks llama.cpp and patches in ternary kernel support. This inherits llama.cpp's full dependency tree (CMake, Python, conda) for a model that only needs addition and subtraction.
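The remark that a ternary model "only needs addition and subtraction" can be made concrete with a small sketch. The helper below is hypothetical and not taken from bitnet.c: it assumes weights stored unpacked, one `int8_t` per weight, whereas formats such as I2_S pack several weights per byte, and it leaves out the scale factor a real kernel would apply to the accumulated sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not bitnet.c code.
 * Dot product of one row of ternary weights with int8 activations.
 * Each weight is -1, 0, or +1, so no multiplication is needed:
 * +1 adds the activation, -1 subtracts it, 0 skips it. */
static int32_t ternary_dot(const int8_t *w, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        assert(w[i] >= -1 && w[i] <= 1); /* weights must be ternary */
        if (w[i] > 0)
            acc += x[i];
        else if (w[i] < 0)
            acc -= x[i];
    }
    return acc;
}
```

A matrix-vector product is this routine applied once per output row. The SIMD kernels the README describes would process many packed weights per instruction rather than branching per element.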
**When to use llama.cpp/Ollama instead:** if you need GPU inference, non-BitNet models, OpenAI-compatible API serving, or multi-model management. They're the right tools for general LLM deployment. bitnet.c is for when you want a single ternary model running as fast as possible with nothing else in the way.
### Why C
- **The inner loop is 8 NEON intrinsics.** The entire I2_S SDOT matvec kernel — the function that consumes >90% of runtime — is 40 lines of C with ARM intrinsics. There is no abstraction to manage, no object lifetime to track, no type system to appease. The code *is* the assembly, minus register allocation.
+ **The inner loop is 8 NEON intrinsics.** The I2_S SDOT matvec kernel — the function that dominates runtime on ternary models — is 40 lines of C with ARM intrinsics. There is no abstraction to manage, no object lifetime to track, no type system to appease. The code *is* the assembly, minus register allocation.
**Why not Rust?** Rust solves memory safety at the type system level, which is genuinely valuable at scale — large teams, deep dependency trees, long-lived codebases with contributor churn. For a small, focused inference engine with zero dependencies and one or two authors who understand every line, the tradeoffs don't pay off:
  - *The hot path is NEON intrinsics.* The matvec kernel, logits computation, and attention scoring are all hand-written SIMD. These are `unsafe` by definition in Rust — you get the borrow checker's overhead without its guarantees where performance matters.
  - *Zero-copy GGUF loading.* The model is mmap'd and weights are read directly from the mapped buffer as raw pointers into packed binary data. This is one `mmap` call in C. In Rust it's `unsafe` pointer arithmetic wrapped in lifetime-annotated structs, or a dependency on `memmap2` + `bytemuck` + `zerocopy`.
  - *No dependencies to protect against.* This project links libc, libm, and nothing else. No crates, no supply chain, no transitive dependencies. Cargo's safety story is about managing a dependency graph — there is no graph here.
- - *Build time.* Clean build: under 2 seconds. A comparable Rust project with `memmap2`, `half`, `rayon`: 20-40 seconds.
+ - *Build time.* Clean build: under 3 seconds. A comparable Rust project with `memmap2`, `half`, `rayon`: 20-40 seconds.
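As an illustration of the zero-copy loading bullet above, the following sketch maps a file read-only with POSIX `mmap` and checks the four-byte `GGUF` magic directly in the mapped bytes. It is an assumption-laden example rather than the project's loader: the `mapped_file` struct and the `map_file`, `has_gguf_magic`, and `unmap_file` helpers are hypothetical names, and a real loader would go on to parse the header and tensor table.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: a read-only view of a whole file. */
typedef struct {
    const uint8_t *data;
    size_t size;
} mapped_file;

/* Map the file at `path` read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, mapped_file *mf)
{
    assert(path != NULL && mf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    mf->data = p;
    mf->size = (size_t)st.st_size;
    return 0;
}

/* GGUF files begin with the ASCII bytes 'G','G','U','F'. */
static int has_gguf_magic(const mapped_file *mf)
{
    return mf->size >= 4 && memcmp(mf->data, "GGUF", 4) == 0;
}

/* Release the mapping and clear the view. */
static void unmap_file(mapped_file *mf)
{
    munmap((void *)mf->data, mf->size);
    mf->data = NULL;
    mf->size = 0;
}
```

Once the file is mapped, a weight tensor is just `mf.data + offset`, with no read into a separate buffer; the operating system pages the data in on first access.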
- **Why not C++?** Everything C gives you here but with a language that actively fights simplicity. The GGUF parser is a flat struct with pointer arithmetic. The thread pool is pthreads + condvar. The transformer forward pass is a `for` loop over layers. In C++ someone would reach for `std::variant` for tensor types, `std::jthread` with `std::barrier`, and a template-based layer abstraction. All objectively worse for 4,100 lines of inference code.
+ **Why not C++?** Everything C gives you here but with a language that actively fights simplicity. The GGUF parser is a flat struct with pointer arithmetic. The thread pool is pthreads + condvar. The transformer forward pass is a `for` loop over layers. In C++ someone would reach for `std::variant` for tensor types, `std::jthread` with `std::barrier`, and a template-based layer abstraction. All objectively worse for a focused inference engine.
**Why not Zig?** Zig's `comptime`, explicit allocators, and native SIMD vectors are genuinely appealing. For a new project not needing Emscripten WASM support, it would be a strong choice. But Zig's WASM target doesn't support the Emscripten features this project relies on (`EXPORTED_FUNCTIONS`, `MODULARIZE`, `SINGLE_FILE`), and the ecosystem isn't there yet for production inference workloads.