Commit 018a8a9

Update docs: add inference pipeline reference, actualize roadmap and README
- Add docs/inference.md: mathematical description of the full inference pipeline (tokenization, embedding, RMSNorm, ternary matvec, RoPE, KV cache, GQA attention, FFN, logits, sampling) with code pointers
- Update roadmap: add Phase 6 (modular backends) and Phase 7 (extended quant formats), update optimization history
- Update README: accurate line count (8K), add Q3_K/Q4_K/Q5_K/Q8_K to format list, rewrite project structure tree for new module layout
- Remove stale docs/audit.md
1 parent 5713dcb commit 018a8a9

4 files changed

Lines changed: 497 additions & 266 deletions

README.md

Lines changed: 53 additions & 47 deletions
@@ -4,13 +4,13 @@ A minimal, embeddable LLM inference engine in pure C11.

 Started as a clean-room inference engine for Microsoft's [BitNet b1.58](https://arxiv.org/abs/2402.17764) ternary models, inspired by Karpathy's [llama2.c](https://github.com/karpathy/llama2.c). Now supports standard GGUF quantization formats (Q4_0, Q8_0) alongside ternary (I2_S, TQ1_0, TQ2_0) — covering most small language models on HuggingFace.

-Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~4,500 lines of modular C.
+Zero dependencies beyond libc and libm, four SIMD backends, compiles to WASM, and fits in ~8,000 lines of modular C.

 ## Features

 - **Pure C11** — no C++, no frameworks, no dependencies beyond libc and libm
 - **GGUF model loading** — loads any GGUF file with supported tensor types
-- **Quantization formats** — I2_S, TQ1_0, TQ2_0 (ternary), Q4_0 (4-bit), Q6_K (6-bit k-quant), Q8_0 (8-bit)
+- **Quantization formats** — I2_S, TQ1_0, TQ2_0 (ternary), Q3_K, Q4_0, Q4_K, Q5_K, Q6_K (k-quants), Q8_0, Q8_K
 - **Full transformer forward pass** — RoPE, GQA, RMSNorm, sub-norms, tied embeddings
 - **Flash GQA attention** — online softmax with KV-head grouping, single-pass over KV cache
 - **Optional F16 KV cache** — `--kv16` halves attention DRAM bandwidth with minimal precision loss
@@ -41,7 +41,7 @@ Auto-selected at compile time based on target architecture.
 | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF) | 0.5B | Q4_0 + Q8_0 | Working |
 | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF) | 3B | Q4_0 + Q6_K | Working |

-Models must use only supported weight types (Q4_0, Q6_K, Q8_0, I2_S, TQ1_0, TQ2_0, F16). GGUF files with other k-quant types (Q4_K, Q5_K, etc.) are not yet supported.
+Models must use only supported weight types (I2_S, TQ1_0, TQ2_0, Q3_K, Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, Q8_K, F16, F32).

 ## Quick Start

@@ -116,46 +116,48 @@ The `model/` directory is git-ignored — model files won't be committed.
 ```
 bitnet.c/
 ├── include/
-│ ├── platform.h # Platform abstraction (mmap, timing)
-│ ├── gguf.h # GGUF v3 reader API
-│ ├── quant.h # Quantization: I2_S/TQ/Q4_0/Q8_0 dequant + matvec
-│ ├── model.h # Config, Weights, model loading
-│ ├── transformer.h # Forward pass API
-│ ├── tokenizer.h # BPE tokenizer API
-│ ├── sampler.h # Sampling strategies
-│ ├── threadpool.h # Persistent pthread thread pool
-│ ├── sh_arena.h # Arena allocator
-│ └── sh_log.h # Structured logging
+│ ├── platform.h # Platform abstraction (mmap, timing)
+│ ├── gguf.h # GGUF v3 reader API
+│ ├── quant.h # Public quant API: block structs, matvec, dequant
+│ ├── quant_internal.h # Quant backend context structs + range function decls
+│ ├── model.h # Config, Weights, model loading
+│ ├── transformer.h # Forward pass public API
+│ ├── transformer_internal.h # Transformer backend context structs + range function decls
+│ ├── tokenizer.h # BPE tokenizer API
+│ ├── sampler.h # Sampling strategies
+│ ├── threadpool.h # Persistent pthread thread pool
+│ ├── simd_helpers.h # Shared AVX2/WASM SIMD inline helpers
+│ ├── sh_arena.h # Arena allocator
+│ └── sh_log.h # Structured logging
 ├── src/
-│ ├── platform.c # mmap/fread/timing abstraction
-│ ├── gguf.c # GGUF binary format parser
-│ ├── quant.c # FP16 conversion, dequant + matvec for all quant formats
-│ ├── model.c # GGUF → Config/Weights mapping
-│ ├── transformer.c # Forward pass: flash attention, FFN, sub-norms
-│ ├── tokenizer.c # BPE encode/decode from GGUF vocab
-│ ├── sampler.c # Argmax, multinomial, top-p sampling
-│ ├── threadpool.c # Thread pool with condvar dispatch
-│ ├── sh_arena.c # Arena allocator implementation
-│ ├── sh_log.c # Structured logging implementation
-│ └── main.c # CLI entry point
-├── test/
-│ ├── test_gguf.c # GGUF parser tests
-│ ├── test_quant.c # Dequantization + matvec tests
-│ ├── test_transformer.c # RMSNorm, softmax, RoPE tests
-│ ├── test_tokenizer.c # BPE encode/decode tests
-│ ├── test_threadpool.c # Thread pool dispatch tests
-│ ├── test_safety.c # Safety/bounds-checking regression tests
-│ ├── test_prefill.c # Prefill vs sequential correctness test
-│ ├── test_kv_f16.c # F16 KV cache correctness test
-│ └── test_e2e.c # End-to-end greedy decode test
-├── wasm/
-│ ├── api.c # WASM-exported API wrapper
-│ ├── build.sh # Emscripten build script
-│ ├── worker.js # Web Worker for non-blocking inference
-│ └── index.html # Browser demo
+│ ├── platform.c # mmap/fread/timing abstraction
+│ ├── gguf.c # GGUF binary format parser
+│ ├── quant/ # Per-format per-backend matvec kernels
+│ │ ├── dispatch.c # Format dispatch + batch matvec
+│ │ ├── dequant.c # Dequantization functions
+│ │ ├── fp16.c # FP16 ↔ FP32 conversion
+│ │ ├── i2s_neon_sdot.c # I2_S SDOT kernel (ARM dotprod)
+│ │ ├── i2s_scalar.c # I2_S scalar fallback
+│ │ ├── q4_neon_sdot.c # Q4_0 SDOT kernel
+│ │ ├── q6k_neon.c # Q6_K NEON kernel
+│ │ └── ... # ~50 backend files total
+│ ├── transformer/ # Per-backend transformer kernels
+│ │ ├── rmsnorm_{neon,avx2,wasm,scalar}.c
+│ │ ├── gqa_{neon,avx2,wasm,scalar}.c
+│ │ └── logits_{neon,avx2,wasm,scalar}.c
+│ ├── model.c # GGUF → Config/Weights mapping
+│ ├── transformer.c # Forward pass: layer loop, FFN, dispatch
+│ ├── tokenizer.c # BPE encode/decode from GGUF vocab
+│ ├── sampler.c # Argmax, multinomial, top-p sampling
+│ ├── threadpool.c # Thread pool with condvar dispatch
+│ ├── sh_arena.c # Arena allocator implementation
+│ ├── sh_log.c # Structured logging implementation
+│ └── main.c # CLI entry point
+├── test/ # Assert-based unit tests (synthetic data, no model needed)
+├── wasm/ # Emscripten WASM build + browser demo
 ├── docs/
-│ ├── roadmap.md # Development roadmap
-│ └── audit.md # Security/correctness audit
+│ ├── inference.md # Inference pipeline: math, algorithms, code map
+│ └── roadmap.md # Development roadmap + optimization history
 └── Makefile
 ```

@@ -252,9 +254,13 @@ BitNet b1.58 is a transformer variant where all linear layer weights are constra
 | I2_S | 2.0 | 2-bit interleaved (4 values/byte) + per-tensor scale | 128 |
 | TQ1_0 | 1.6875 | Base-3 (5 values/byte) + residual | 256 |
 | TQ2_0 | 2.0625 | 2-bit fields (4 values/byte) | 256 |
+| Q3_K | 3.4375 | 3-bit quants (split ql/qh) + 6-bit sub-block scales | 256 |
 | Q4_0 | 4.5 | 4-bit nibbles (2 values/byte) + FP16 per-block scale | 32 |
-| Q6_K | 6.5625 | 6-bit quants (split ql/qh) + int8 sub-block scales + FP16 super-block scale | 256 |
+| Q4_K | 4.5 | 4-bit quants + 6-bit sub-block scales/mins | 256 |
+| Q5_K | 5.5 | 5-bit quants (split ql/qh) + 6-bit sub-block scales/mins | 256 |
+| Q6_K | 6.5625 | 6-bit quants (split ql/qh) + int8 sub-block scales | 256 |
 | Q8_0 | 8.5 | 8-bit values + FP16 per-block scale | 32 |
+| Q8_K | 9.125 | 8-bit values + float32 scale + int16 block sums | 256 |
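The bits-per-weight column in the table above follows from the block size: total bytes per block times 8, divided by the number of weights per block. The sketch below reproduces a few rows. The byte counts assume the standard ggml block layouts, and the helper name `bits_per_weight` and the enum constants are hypothetical, not taken from this repository.

```c
#include <stddef.h>

/* Effective bits per weight = block bytes * 8 / weights per block. */
double bits_per_weight(size_t block_bytes, size_t weights_per_block)
{
    return (double)block_bytes * 8.0 / (double)weights_per_block;
}

/* Block sizes in bytes, ASSUMING the ggml block layouts
 * (FP16 scale = 2 bytes, FP32 scale = 4 bytes). */
enum {
    TQ2_0_BYTES = 64 + 2,            /* 256 2-bit fields + FP16 scale */
    Q4_0_BYTES  = 2 + 16,            /* FP16 scale + 32 nibbles */
    Q8_0_BYTES  = 2 + 32,            /* FP16 scale + 32 int8 values */
    Q6_K_BYTES  = 128 + 64 + 16 + 2, /* ql + qh + 16 int8 scales + FP16 super-block scale */
    Q8_K_BYTES  = 4 + 256 + 16 * 2   /* FP32 scale + 256 int8 values + 16 int16 block sums */
};
```

For example, `bits_per_weight(Q4_0_BYTES, 32)` evaluates 18 × 8 / 32, which matches the 4.5 listed for Q4_0; the extra half bit is the FP16 scale amortized over 32 weights.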

 ### Memory Budget (bitnet-b1.58-2B-4T, 2048 context)

@@ -287,26 +293,26 @@ llama.cpp supports dozens of quantization formats, model architectures, GPU back
 bitnet.c exists because BitNet's ternary weights ({-1, 0, +1}) make most of that machinery irrelevant:

 - **No GPU needed.** Ternary matvec is memory-bandwidth-bound, not compute-bound. A single M1 Max CPU core sustains ~52 tok/s — there's no FLOPs deficit to offload. GPU dispatch overhead and PCIe transfers would add latency for zero throughput gain.
-- **No quantization format zoo.** The model has one weight type (I2_S ternary). llama.cpp's `ggml_compute_forward` dispatches through dozens of quant format×operation combinations. bitnet.c has one kernel.
+- **Minimal format support.** bitnet.c supports ~10 quant formats vs llama.cpp's dozens of format x operation x backend combinations.
 - **No abstraction layers.** llama.cpp routes tensor operations through GGML's graph-based backend abstraction. bitnet.c calls the matvec kernel directly — the forward pass is a flat loop over layers with inline SIMD.
-- **Embeddable.** The entire engine is ~4,500 lines of C11 with zero dependencies. It compiles to WASM and runs in a browser. Try that with llama.cpp's Metal backend.
+- **Embeddable.** The entire engine is ~8,000 lines of C11 with zero dependencies. It compiles to WASM and runs in a browser. Try that with llama.cpp's Metal backend.

 Microsoft's own [BitNet inference framework](https://github.com/microsoft/BitNet) takes the opposite approach: it forks llama.cpp and patches in ternary kernel support. This inherits llama.cpp's full dependency tree (CMake, Python, conda) for a model that only needs addition and subtraction.

 **When to use llama.cpp/Ollama instead:** if you need GPU inference, non-BitNet models, OpenAI-compatible API serving, or multi-model management. They're the right tools for general LLM deployment. bitnet.c is for when you want a single ternary model running as fast as possible with nothing else in the way.
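The claim that a ternary model "only needs addition and subtraction" can be made concrete with a scalar matvec sketch. The packing used here (4 weights per byte, 2 bits each, code 0 = -1, 1 = 0, 2 = +1) is a simplified assumption and is not the real interleaved I2_S layout; the function name `ternary_matvec` is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar ternary matvec sketch: y = scale * (W . x), with W in {-1, 0, +1}.
 * ASSUMED packing (not the real I2_S layout): 4 weights per byte, 2 bits
 * each, lowest bits first; code 0 = -1, 1 = 0, 2 = +1.
 * cols must be a multiple of 4. */
void ternary_matvec(const uint8_t *w, const float *x, float *y,
                    size_t rows, size_t cols, float scale)
{
    for (size_t r = 0; r < rows; r++) {
        const uint8_t *row = w + r * (cols / 4);
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            unsigned code = (row[c / 4] >> (2 * (c % 4))) & 3u;
            if (code == 2u)      acc += x[c]; /* weight +1: add */
            else if (code == 0u) acc -= x[c]; /* weight -1: subtract */
            /* code 1 (weight 0): contributes nothing */
        }
        y[r] = acc * scale; /* the only multiply: one scale per output row */
    }
}
```

Each weight costs 2 bits of memory traffic and at most one add or subtract, which is consistent with the kernel being bandwidth-bound rather than compute-bound. A SIMD kernel would process many packed weights per instruction instead of branching per element.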
297303

298304
### Why C
299305

300-
**The inner loop is 8 NEON intrinsics.** The entire I2_S SDOT matvec kernel — the function that consumes >90% of runtime — is 40 lines of C with ARM intrinsics. There is no abstraction to manage, no object lifetime to track, no type system to appease. The code *is* the assembly, minus register allocation.
306+
**The inner loop is 8 NEON intrinsics.** The I2_S SDOT matvec kernel — the function that dominates runtime on ternary models — is 40 lines of C with ARM intrinsics. There is no abstraction to manage, no object lifetime to track, no type system to appease. The code *is* the assembly, minus register allocation.
301307

302308
**Why not Rust?** Rust solves memory safety at the type system level, which is genuinely valuable at scale — large teams, deep dependency trees, long-lived codebases with contributor churn. For a small, focused inference engine with zero dependencies and one or two authors who understand every line, the tradeoffs don't pay off:
303309

304310
- *The hot path is NEON intrinsics.* The matvec kernel, logits computation, and attention scoring are all hand-written SIMD. These are `unsafe` by definition in Rust — you get the borrow checker's overhead without its guarantees where performance matters.
305311
- *Zero-copy GGUF loading.* The model is mmap'd and weights are read directly from the mapped buffer as raw pointers into packed binary data. This is one `mmap` call in C. In Rust it's `unsafe` pointer arithmetic wrapped in lifetime-annotated structs, or a dependency on `memmap2` + `bytemuck` + `zerocopy`.
306312
- *No dependencies to protect against.* This project links libc, libm, and nothing else. No crates, no supply chain, no transitive dependencies. Cargo's safety story is about managing a dependency graph — there is no graph here.
307-
- *Build time.* Clean build: under 2 seconds. A comparable Rust project with `memmap2`, `half`, `rayon`: 20-40 seconds.
313+
- *Build time.* Clean build: under 3 seconds. A comparable Rust project with `memmap2`, `half`, `rayon`: 20-40 seconds.
308314

309-
**Why not C++?** Everything C gives you here but with a language that actively fights simplicity. The GGUF parser is a flat struct with pointer arithmetic. The thread pool is pthreads + condvar. The transformer forward pass is a `for` loop over layers. In C++ someone would reach for `std::variant` for tensor types, `std::jthread` with `std::barrier`, and a template-based layer abstraction. All objectively worse for 4,100 lines of inference code.
315+
**Why not C++?** Everything C gives you here but with a language that actively fights simplicity. The GGUF parser is a flat struct with pointer arithmetic. The thread pool is pthreads + condvar. The transformer forward pass is a `for` loop over layers. In C++ someone would reach for `std::variant` for tensor types, `std::jthread` with `std::barrier`, and a template-based layer abstraction. All objectively worse for a focused inference engine.
310316

311317
**Why not Zig?** Zig's `comptime`, explicit allocators, and native SIMD vectors are genuinely appealing. For a new project not needing Emscripten WASM support, it would be a strong choice. But Zig's WASM target doesn't support the Emscripten features this project relies on (`EXPORTED_FUNCTIONS`, `MODULARIZE`, `SINGLE_FILE`), and the ecosystem isn't there yet for production inference workloads.
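The zero-copy loading point in the Rust comparison can be sketched with POSIX calls. This is an illustrative outline, not the code in `platform.c` or `gguf.c`: the names `map_file`, `gguf_header`, and `gguf_read_header` are hypothetical, the header layout is the fixed GGUF v2/v3 prefix (magic, version, tensor count, metadata count), and a little-endian host is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fixed-size fields that follow the 4-byte "GGUF" magic (v2/v3). */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header;

/* Hypothetical helper: map a whole file read-only. Returns NULL on failure. */
const void *map_file(const char *path, size_t *size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return p;
}

/* Read the header straight from the mapped bytes. memcpy avoids unaligned
 * loads; ASSUMES a little-endian host. Returns 0 on success, -1 if the
 * buffer is too short or does not start with "GGUF". */
int gguf_read_header(const void *buf, size_t size, gguf_header *h)
{
    const unsigned char *p = buf;
    if (size < 24 || memcmp(p, "GGUF", 4) != 0) return -1;
    memcpy(&h->version, p + 4, 4);
    memcpy(&h->tensor_count, p + 8, 8);
    memcpy(&h->metadata_kv_count, p + 16, 8);
    return 0;
}
```

After mapping, tensor data can be addressed as pointers into the same buffer, so the OS pages weights in on demand and nothing is copied into heap allocations.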