Commit 5db67fe

Update benchmark numbers: Q4_0 20→25.4 tok/s (63% of llama.cpp)
1 parent f894f12 commit 5db67fe

2 files changed

Lines changed: 124 additions & 3 deletions

README.md

Lines changed: 4 additions & 3 deletions
@@ -224,14 +224,15 @@ Requires [Emscripten](https://emscripten.org/):
 ## Performance

-Measured on Apple M1 Max (8 P-cores, 32 GB), greedy decoding (`--temp 0`), 8 threads.
+Measured on Apple M1 Max (8 P-cores, 32 GB), PGO build, greedy decoding, 8 threads. llama.cpp b8320 via Homebrew.

 | Model | Size | Quant | bitnet.c | llama.cpp CPU | llama.cpp Metal |
 |-------|------|-------|----------|---------------|-----------------|
 | bitnet-b1.58-2B-4T | 620 MB | I2_S (ternary) | **52 tok/s** | - | - |
-| Qwen2.5-3B-Instruct | 1.7 GB | Q4_0 + Q6_K | **14 tok/s** | 25 tok/s | 112 tok/s |
+| Qwen2.5-3B-Instruct | 1.7 GB | Q4_0 | **25.4 tok/s** | 40 tok/s | 84 tok/s |
+| Llama3-8B-1.58 | 3.4 GB | TQ1_0 (ternary) | **14.5 tok/s** | 19 tok/s | - |

-bitnet.c is a pure CPU engine with no GPU backend. On bandwidth-bound models (ternary, small quants) it's competitive with anything. On standard quant models, llama.cpp's CPU path is ~1.8x faster due to weight repacking and more aggressive integer dot product kernels; its Metal path is ~8x faster via GPU offload.
+bitnet.c is a pure CPU engine with no GPU backend. On ternary models (TQ1_0) it reaches **76% of llama.cpp CPU**, close to parity. On standard quants (Q4_0) it reaches **63% of llama.cpp CPU**, with the remaining gap due to llama.cpp's Accelerate BLAS for attention matmuls. llama.cpp does not support TQ1_0 on Metal.

 ## Design Decisions

docs/benchmarks.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Benchmarks

Single-core and multi-core performance on Apple M1 Max (32 GB), macOS.

**Toolchain:** Apple Clang 17.0.0, Emscripten 4.0.1, Node.js 20.20.1

## Throughput (tok/s)

| Model | Format | Params | NEON 1T | NEON 8T | NEON 8T PGO | WASM 1T |
|-------|--------|--------|---------|---------|-------------|---------|
| BitNet b1.58 2B-4T | I2_S | 2B | 20.5 | 48.7 | 52.5 | 7.3 |
| Qwen2.5 3B Instruct | Q4_0 | 3B | 4.3 | 18.7 | 25.4 | 4.5 |
| Llama3 8B 1.58 | TQ1_0 | 8B | 3.2 | 10.4 | 14.5 | - |

Each PGO build is trained on its respective model (128 tokens). Q4_0 uses weight repacking (split scales/qs) for NEON SDOT. WASM is single-threaded (no SharedArrayBuffer in Node.js RAWFS mode).
Llama3 8B (3.3 GB) exceeds the WASM 4 GB address space once runtime allocations are included.
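
The "split scales/qs" repacking can be pictured as turning an array of interleaved Q4_0 blocks into two contiguous arrays. The sketch below assumes the standard GGUF Q4_0 block layout (a 2-byte FP16 scale followed by 16 bytes holding 32 packed 4-bit values). The function name `split_q4_0` and the exact output layout are illustrative assumptions, not bitnet.c's actual code.

```python
# Assumed Q4_0 block layout: 2-byte FP16 scale, then 16 bytes of packed nibbles.
Q4_0_SCALE_BYTES = 2
Q4_0_QS_BYTES = 16
Q4_0_BLOCK_BYTES = Q4_0_SCALE_BYTES + Q4_0_QS_BYTES


def split_q4_0(raw: bytes) -> tuple[bytes, bytes]:
    """Split interleaved Q4_0 blocks into (all scales, all packed nibbles)."""
    if len(raw) % Q4_0_BLOCK_BYTES != 0:
        raise ValueError("input is not a whole number of Q4_0 blocks")
    scales = bytearray()
    qs = bytearray()
    for off in range(0, len(raw), Q4_0_BLOCK_BYTES):
        # First 2 bytes of each block: the scale.
        scales += raw[off:off + Q4_0_SCALE_BYTES]
        # Remaining 16 bytes: the 4-bit quantized weights.
        qs += raw[off + Q4_0_SCALE_BYTES:off + Q4_0_BLOCK_BYTES]
    return bytes(scales), bytes(qs)
```

With the nibbles stored back to back, a SIMD kernel can stream 16-byte loads without stepping over a scale every 18 bytes, which is one plausible reason for repacking.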

## vs llama.cpp (b8320)

Measured with `llama-bench`, same hardware (M1 Max, 8 threads). Both use `-p 0 -n 256` (pure generation, no prompt).

| Model | Format | bitnet.c (PGO) | llama.cpp CPU | llama.cpp Metal | CPU ratio |
|-------|--------|----------------|---------------|-----------------|-----------|
| BitNet b1.58 2B-4T | I2_S | 52.5 | - | - | - |
| Qwen2.5 3B Instruct | Q4_0 | 25.4 | 40.2 | 84.4 | 63% |
| Llama3 8B 1.58 | TQ1_0 | 14.5 | 19.3 | N/A | 76% |

llama.cpp CPU uses Apple Accelerate (BLAS) for attention matmuls. Both engines now use weight-repacked NEON kernels for Q4_0 matvec. Metal is GPU offload (`-ngl 99`). TQ1_0 Metal is not implemented in llama.cpp b8320.

## Per-Kernel Bandwidth (GB/s)

### BitNet b1.58 2B-4T (I2_S)

| Kernel | Dims | NEON 1T | NEON 4T | WASM 1T | WASM/NEON |
|--------|------|---------|---------|---------|-----------|
| wq | 2560x2560 | 14.3 | 34.0 | 13.6 | 0.95x |
| wk | 640x2560 | 14.2 | 26.5 | 14.0 | 0.99x |
| wv | 640x2560 | 14.2 | 29.7 | 13.6 | 0.96x |
| wo | 2560x2560 | 14.1 | 30.7 | 13.8 | 0.98x |
| up | 6912x2560 | 12.1 | 33.3 | 13.7 | **1.13x** |
| down | 2560x6912 | 12.4 | 32.1 | 14.0 | **1.13x** |
| gate | 6912x2560 | 12.0 | 34.0 | 13.6 | **1.13x** |

WASM Relaxed SIMD SDOT achieves 95-113% of native NEON SDOT throughput on I2_S ternary matvec. The FFN kernels (up/down/gate) are slightly faster in WASM due to V8's superior instruction scheduling for large matrices.
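
For readers unfamiliar with SDOT, the following is a scalar reference model of a 4-lane signed 8-bit dot-product-accumulate, the operation that NEON `vdotq_s32` performs. It is a sketch for illustration only: the helper names `sdot_i8x16` and `ternary_dot` are hypothetical, and the real kernels operate on packed ternary data rather than Python lists. The WASM relaxed variant is assumed to behave the same way when one operand fits in 7 bits, as its name suggests.

```python
def sdot_i8x16(acc, a, b):
    """Scalar model of SDOT: lane i gets acc[i] + dot(a[4i:4i+4], b[4i:4i+4])."""
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulators and two 16-element vectors")
    if any(not -128 <= v <= 127 for v in (*a, *b)):
        raise ValueError("inputs must fit in signed 8 bits")
    return [
        acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
        for i in range(4)
    ]


def ternary_dot(weights, activations):
    """Dot product of ternary weights {-1, 0, 1} with int8 activations."""
    if len(weights) != len(activations) or len(weights) % 16 != 0:
        raise ValueError("lengths must match and be a multiple of 16")
    acc = [0, 0, 0, 0]
    for off in range(0, len(weights), 16):
        acc = sdot_i8x16(acc, weights[off:off + 16], activations[off:off + 16])
    # Horizontal add of the four 32-bit lanes.
    return sum(acc)
```

Because ternary weights only take the values -1, 0, and 1, the multiply inside each lane reduces to add, skip, or subtract, which is why these kernels tend to be limited by memory bandwidth rather than arithmetic.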

### Qwen2.5 3B Instruct (Q4_0)

| Kernel | Dims | NEON 1T | NEON 4T | WASM 1T | WASM/NEON |
|--------|------|---------|---------|---------|-----------|
| wq | 2048x2048 | 8.3 | 20.7 | 9.1 | **1.09x** |
| wk | 256x2048 | 8.2 | 7.3 | 9.3 | **1.13x** |
| wv | 256x2048 | 8.3 | 19.0 | 9.4 | **1.13x** |
| wo | 2048x2048 | 8.2 | 21.7 | 9.2 | **1.12x** |
| up | 11008x2048 | 6.7 | 27.9 | 9.1 | **1.36x** |
| down | 2048x11008 | 8.1 | 29.0 | 9.4 | **1.16x** |
| gate | 11008x2048 | 8.0 | 28.3 | 9.1 | **1.13x** |

WASM Q4_0 SDOT kernels consistently outperform single-threaded NEON by 9-36%, likely due to V8's SIMD JIT optimizations and more efficient register allocation for the Q4 dequant+dot product loop.
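
As a rough picture of what the "Q4 dequant+dot product loop" computes, here is a scalar sketch for a single 32-weight block. It assumes the standard GGUF Q4_0 encoding (FP16 scale, low nibbles holding weights 0-15, high nibbles holding weights 16-31, each stored with an offset of 8) and int8 activations that share one scale. The function name and signature are hypothetical; the real kernels vectorize this with SDOT.

```python
import struct


def q4_0_block_dot(block: bytes, acts: list[int], act_scale: float) -> float:
    """Dot product of one Q4_0 block (18 bytes) with 32 int8 activations."""
    if len(block) != 18 or len(acts) != 32:
        raise ValueError("expected an 18-byte block and 32 activations")
    (d,) = struct.unpack_from("<e", block, 0)  # FP16 block scale
    qs = block[2:]
    isum = 0
    for j in range(16):
        lo = (qs[j] & 0x0F) - 8  # weight j, recentred to [-8, 7]
        hi = (qs[j] >> 4) - 8    # weight j + 16
        isum += lo * acts[j] + hi * acts[j + 16]
    # Integer accumulation first, then a single floating-point rescale.
    return d * act_scale * isum
```

Keeping the inner loop in integers and applying both scales once per block is what lets an SDOT-style instruction do most of the work.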

### Llama3 8B 1.58 (TQ1_0), native only

| Kernel | Dims | NEON 1T | NEON 4T |
|--------|------|---------|---------|
| wq | 4096x4096 | 5.0 | 17.1 |
| wk | 1024x4096 | 5.3 | 16.7 |
| wv | 1024x4096 | 5.3 | 15.7 |
| wo | 4096x4096 | 5.0 | 16.6 |
| up | 14336x4096 | 5.0 | 18.4 |
| down | 4096x14336 | 5.1 | 18.3 |
| gate | 14336x4096 | 5.1 | 18.3 |
| logits (F16) | 128256x4096 | 12.1 | 46.5 |

## Per-Kernel Latency (us/call)

### BitNet b1.58 2B-4T (I2_S)

| Kernel | NEON 1T | NEON 4T | WASM 1T |
|--------|---------|---------|---------|
| wq | 115 | 49 | 121 |
| wk | 30 | 16 | 30 |
| wv | 30 | 14 | 31 |
| wo | 117 | 54 | 119 |
| up | 365 | 133 | 325 |
| down | 358 | 139 | 318 |
| gate | 369 | 130 | 327 |

### Qwen2.5 3B Instruct (Q4_0)

| Kernel | NEON 1T | NEON 4T | WASM 1T |
|--------|---------|---------|---------|
| wq | 284 | 114 | 260 |
| wk | 37 | 41 | 33 |
| wv | 36 | 16 | 32 |
| wo | 287 | 109 | 257 |
| up | 1898 | 454 | 1397 |
| down | 1574 | 440 | 1361 |
| gate | 1582 | 448 | 1402 |
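
The GB/s and us/call figures appear to be two views of the same measurement: packed weight bytes divided by time per call. That relationship is an inference from the numbers, not something the file states. The sketch below assumes 2 bits per weight for I2_S and the standard 18 bytes per 32 weights for Q4_0; the function name is hypothetical.

```python
# Assumed storage cost per weight, in bytes.
BYTES_PER_WEIGHT = {
    "I2_S": 2 / 8,    # assumption: 2 bits per ternary weight
    "Q4_0": 18 / 32,  # FP16 scale + 16 bytes of nibbles per 32 weights
}


def kernel_bandwidth_gbs(rows: int, cols: int, fmt: str, latency_us: float) -> float:
    """Effective weight bandwidth in GB/s for one matvec call."""
    weight_bytes = rows * cols * BYTES_PER_WEIGHT[fmt]
    seconds = latency_us * 1e-6
    return weight_bytes / seconds / 1e9


# I2_S wq, NEON 1T: 2560x2560 at 115 us per call
print(round(kernel_bandwidth_gbs(2560, 2560, "I2_S", 115), 1))
# Q4_0 up, NEON 1T: 11008x2048 at 1898 us per call
print(round(kernel_bandwidth_gbs(11008, 2048, "Q4_0", 1898), 1))
```

Under these assumptions the results land within rounding distance of the tabulated 14.3 and 6.7 GB/s, which suggests activations and outputs are not counted in the bandwidth figures.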

## Notes

**Backend details:**
- **ARM NEON + SDOT**: `vdotq_s32` for integer dot products (I2_S, TQ1, TQ2, Q4_0, Q8_0), `vmlaq_f32` FMA, native FP16 logits with `vfmaq_f16`
- **WASM Relaxed SIMD**: `i32x4.relaxed_dot_i8x16_i7x16_add` (SDOT equivalent), `f32x4.relaxed_madd` (FMA), vectorized F16 bit-manipulation for logits
- Multi-threading uses a persistent pthread pool with condvar dispatch (~2us overhead)
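
The persistent-pool pattern in the last bullet can be sketched with Python's `threading.Condition` standing in for a pthread condition variable. This is an analogy rather than bitnet.c's implementation: the class `PersistentPool`, the generation counter, and the row-partitioned `matvec` helper are all assumptions made for illustration.

```python
import threading


class PersistentPool:
    """Workers are created once and woken through a condition variable."""

    def __init__(self, n_threads: int):
        self.n = n_threads
        self.cond = threading.Condition()
        self.generation = 0   # bumped once per dispatched job
        self.job = None
        self.pending = 0      # workers still running the current job
        self.shutdown = False
        self.threads = [
            threading.Thread(target=self._worker, args=(i,), daemon=True)
            for i in range(n_threads)
        ]
        for t in self.threads:
            t.start()

    def _worker(self, tid: int):
        seen = 0
        while True:
            with self.cond:
                # Sleep until a new job generation appears or shutdown is set.
                while self.generation == seen and not self.shutdown:
                    self.cond.wait()
                if self.shutdown:
                    return
                seen = self.generation
                job = self.job
            job(tid, self.n)
            with self.cond:
                self.pending -= 1
                if self.pending == 0:
                    self.cond.notify_all()  # wake the dispatcher

    def dispatch(self, job):
        """Run job(tid, n_threads) on every worker and wait for completion."""
        with self.cond:
            self.job = job
            self.pending = self.n
            self.generation += 1
            self.cond.notify_all()
            while self.pending:
                self.cond.wait()

    def close(self):
        with self.cond:
            self.shutdown = True
            self.cond.notify_all()
        for t in self.threads:
            t.join()


def matvec(pool: PersistentPool, W, x):
    """Row-partitioned matrix-vector product using the pool."""
    out = [0.0] * len(W)

    def job(tid, n):
        lo, hi = tid * len(W) // n, (tid + 1) * len(W) // n
        for r in range(lo, hi):
            out[r] = sum(w * v for w, v in zip(W[r], x))

    pool.dispatch(job)
    return out
```

Reusing the same threads for every kernel call avoids paying thread creation cost per matvec, so only the wake-up and completion signalling remain on the critical path.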

**WASM limitations:**
- Single-threaded only (no `SharedArrayBuffer` in Node.js RAWFS mode)
- 4 GB address space limit (wasm32): models + runtime must fit in 4 GB
- Logits benchmark unreliable (V8 JIT eliminates unused results); tok/s numbers are authoritative

**Reproducing:**
```bash
# Native
make bench
./bench_kernels models/<model>.gguf --iters 100 --threads 4 --toks 32

# WASM (requires Emscripten + Node.js 20+)
bash bench/bench_wasm.sh
node --experimental-wasm-relaxed-simd bench/bench_wasm.js models/<model>.gguf --iters 100 --toks 32
```
