Optional F16 KV cache (--kv16): halve attention DRAM bandwidth
Store KV cache in FP16 instead of F32 when --kv16 is passed. Writes
K/V to temp F32 buffers, applies RoPE, then converts to F16 via NEON
vcvt_f16_f32 (scalar fallback for non-ARM). The read path converts F16
back to a stack F32 buffer before dot products; at 512 bytes, the
buffer fits in L1.
Both gqa_range and flash_gqa_range get the F16 treatment. Arena sizing
and allocation are conditional on the new kv_f16 config flag.
Benchmark (M1 Max, bitnet-b1.58-2B-4T, 128 tokens):
F32 KV: 40.8 tok/s
F16 KV: 47.9 tok/s (+17%)
Greedy argmax matches F32; generated text diverges slightly due to
accumulated F16 rounding but remains coherent and correct.
README.md: 14 additions & 6 deletions
@@ -4,7 +4,7 @@ A clean-room, pure C inference engine for [BitNet b1.58](https://arxiv.org/abs/2
Inspired by Andrej Karpathy's [llama2.c](https://github.com/karpathy/llama2.c) — a beautifully minimal LLaMA inference implementation in a single C file — **bitnet.c** takes the same philosophy and applies it to Microsoft's [BitNet](https://github.com/microsoft/BitNet) architecture with its 1.58-bit ternary weights.
-Where Microsoft's official BitNet inference framework depends on a modified llama.cpp fork (~100K+ lines of C++), bitnet.c delivers a complete inference pipeline in ~4,100 lines of modular, readable C.
+Where Microsoft's official BitNet inference framework depends on a modified llama.cpp fork (~100K+ lines of C++), bitnet.c delivers a complete inference pipeline in ~4,500 lines of modular, readable C.
## Features
@@ -13,6 +13,7 @@ Where Microsoft's official BitNet inference framework depends on a modified llam
- **I2_S, TQ1_0, & TQ2_0 formats** — native support for Microsoft's I2_S and GGML ternary quantization