| 1 | +# Benchmarks |
| 2 | + |
| 3 | +Single-core and multi-core performance on Apple M1 Max (32 GB), macOS. |
| 4 | + |
| 5 | +**Toolchain:** Apple Clang 17.0.0, Emscripten 4.0.1, Node.js 20.20.1 |
| 6 | + |
| 7 | +## Throughput (tok/s) |
| 8 | + |
| 9 | +| Model | Format | Params | NEON 1T | NEON 8T | NEON 8T PGO | WASM 1T | |
| 10 | +|-------|--------|--------|---------|---------|-------------|---------| |
| 11 | +| BitNet b1.58 2B-4T | I2_S | 2B | 20.5 | 48.7 | 52.5 | 7.3 | |
| 12 | +| Qwen2.5 3B Instruct | Q4_0 | 3B | 4.3 | 18.7 | 25.4 | 4.5 | |
| 13 | +| Llama3 8B 1.58 | TQ1_0 | 8B | 3.2 | 10.4 | 14.5 | - | |
| 14 | + |
| 15 | +Each PGO build was trained on its respective model (128 tokens). Q4_0 uses weight repacking (split scales/qs) for NEON SDOT. WASM runs single-threaded because `SharedArrayBuffer` is unavailable in Node.js RAWFS mode. |
| 16 | +Llama3 8B (3.3 GB) has no WASM result: the model plus runtime allocations exceeds the 4 GB wasm32 address space. |
| 17 | + |
| 18 | +## vs llama.cpp (b8320) |
| 19 | + |
| 20 | +Measured with `llama-bench`, same hardware (M1 Max, 8 threads). Both use `-p 0 -n 256` (pure generation, no prompt). |
| 21 | + |
| 22 | +| Model | Format | bitnet.c (PGO) | llama.cpp CPU | llama.cpp Metal | CPU ratio | |
| 23 | +|-------|--------|----------------|---------------|-----------------|-----------| |
| 24 | +| BitNet b1.58 2B-4T | I2_S | 52.5 | — | — | — | |
| 25 | +| Qwen2.5 3B Instruct | Q4_0 | 25.4 | 40.2 | 84.4 | 63% | |
| 26 | +| Llama3 8B 1.58 | TQ1_0 | 14.5 | 19.3 | N/A | 76% | |
| 27 | + |
| 28 | +llama.cpp CPU uses Apple Accelerate (BLAS) for attention matmuls. Both engines now use weight-repacked NEON kernels for Q4_0 matvec. Metal is GPU offload (`-ngl 99`). TQ1_0 Metal is not implemented in llama.cpp b8320. CPU ratio is bitnet.c (PGO) throughput as a fraction of llama.cpp CPU throughput. |
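
The CPU ratio column can be recomputed from the two throughput columns. Below is a minimal Python sketch, assuming the ratio is bitnet.c (PGO) tok/s divided by llama.cpp CPU tok/s. The function name is illustrative and not part of either project. Because the table values are rounded to one decimal, a recomputed ratio can differ by about a percentage point from one derived from unrounded measurements.

```python
def cpu_ratio_percent(bitnet_tok_s: float, llamacpp_cpu_tok_s: float) -> float:
    """Return bitnet.c throughput as a percentage of llama.cpp CPU throughput."""
    if llamacpp_cpu_tok_s <= 0:
        raise ValueError("llama.cpp throughput must be positive")
    return 100.0 * bitnet_tok_s / llamacpp_cpu_tok_s


# Qwen2.5 3B Instruct (Q4_0) row from the comparison table.
qwen_ratio = cpu_ratio_percent(25.4, 40.2)
print(f"Qwen2.5 3B Q4_0: {qwen_ratio:.0f}% of llama.cpp CPU")
```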
| 29 | + |
| 30 | +## Per-Kernel Bandwidth (GB/s) |
| 31 | + |
| 32 | +### BitNet b1.58 2B-4T (I2_S) |
| 33 | + |
| 34 | +| Kernel | Dims | NEON 1T | NEON 4T | WASM 1T | WASM/NEON | |
| 35 | +|--------|------|---------|---------|---------|-----------| |
| 36 | +| wq | 2560x2560 | 14.3 | 34.0 | 13.6 | 0.95x | |
| 37 | +| wk | 640x2560 | 14.2 | 26.5 | 14.0 | 0.99x | |
| 38 | +| wv | 640x2560 | 14.2 | 29.7 | 13.6 | 0.96x | |
| 39 | +| wo | 2560x2560 | 14.1 | 30.7 | 13.8 | 0.98x | |
| 40 | +| up | 6912x2560 | 12.1 | 33.3 | 13.7 | **1.13x** | |
| 41 | +| down | 2560x6912 | 12.4 | 32.1 | 14.0 | **1.13x** | |
| 42 | +| gate | 6912x2560 | 12.0 | 34.0 | 13.6 | **1.13x** | |
| 43 | + |
| 44 | +WASM Relaxed SIMD SDOT achieves 95-113% of native NEON SDOT throughput on I2_S ternary matvec. The FFN kernels (up/down/gate) are slightly faster in WASM, likely due to V8's instruction scheduling for large matrices. |
| 45 | + |
| 46 | +### Qwen2.5 3B Instruct (Q4_0) |
| 47 | + |
| 48 | +| Kernel | Dims | NEON 1T | NEON 4T | WASM 1T | WASM/NEON | |
| 49 | +|--------|------|---------|---------|---------|-----------| |
| 50 | +| wq | 2048x2048 | 8.3 | 20.7 | 9.1 | **1.09x** | |
| 51 | +| wk | 256x2048 | 8.2 | 7.3 | 9.3 | **1.13x** | |
| 52 | +| wv | 256x2048 | 8.3 | 19.0 | 9.4 | **1.13x** | |
| 53 | +| wo | 2048x2048 | 8.2 | 21.7 | 9.2 | **1.12x** | |
| 54 | +| up | 11008x2048 | 6.7 | 27.9 | 9.1 | **1.36x** | |
| 55 | +| down | 2048x11008 | 8.1 | 29.0 | 9.4 | **1.16x** | |
| 56 | +| gate | 11008x2048 | 8.0 | 28.3 | 9.1 | **1.13x** | |
| 57 | + |
| 58 | +WASM Q4_0 SDOT kernels consistently outperform single-threaded NEON by 9-36%, likely due to V8's SIMD JIT optimizations and more efficient register allocation for the Q4 dequant+dot product loop. |
| 59 | + |
| 60 | +### Llama3 8B 1.58 (TQ1_0) — native only |
| 61 | + |
| 62 | +| Kernel | Dims | NEON 1T | NEON 4T | |
| 63 | +|--------|------|---------|---------| |
| 64 | +| wq | 4096x4096 | 5.0 | 17.1 | |
| 65 | +| wk | 1024x4096 | 5.3 | 16.7 | |
| 66 | +| wv | 1024x4096 | 5.3 | 15.7 | |
| 67 | +| wo | 4096x4096 | 5.0 | 16.6 | |
| 68 | +| up | 14336x4096 | 5.0 | 18.4 | |
| 69 | +| down | 4096x14336 | 5.1 | 18.3 | |
| 70 | +| gate | 14336x4096 | 5.1 | 18.3 | |
| 71 | +| logits (F16) | 128256x4096 | 12.1 | 46.5 | |
| 72 | + |
| 73 | +## Per-Kernel Latency (us/call) |
| 74 | + |
| 75 | +### BitNet b1.58 2B-4T (I2_S) |
| 76 | + |
| 77 | +| Kernel | NEON 1T | NEON 4T | WASM 1T | |
| 78 | +|--------|---------|---------|---------| |
| 79 | +| wq | 115 | 49 | 121 | |
| 80 | +| wk | 30 | 16 | 30 | |
| 81 | +| wv | 30 | 14 | 31 | |
| 82 | +| wo | 117 | 54 | 119 | |
| 83 | +| up | 365 | 133 | 325 | |
| 84 | +| down | 358 | 139 | 318 | |
| 85 | +| gate | 369 | 130 | 327 | |
| 86 | + |
| 87 | +### Qwen2.5 3B Instruct (Q4_0) |
| 88 | + |
| 89 | +| Kernel | NEON 1T | NEON 4T | WASM 1T | |
| 90 | +|--------|---------|---------|---------| |
| 91 | +| wq | 284 | 114 | 260 | |
| 92 | +| wk | 37 | 41 | 33 | |
| 93 | +| wv | 36 | 16 | 32 | |
| 94 | +| wo | 287 | 109 | 257 | |
| 95 | +| up | 1898 | 454 | 1397 | |
| 96 | +| down | 1574 | 440 | 1361 | |
| 97 | +| gate | 1582 | 448 | 1402 | |
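
The latency and bandwidth tables appear to be two views of the same measurement: bytes of weights streamed per call, divided by time per call. The sketch below checks this under assumptions that are not stated in this document. I2_S is taken as 2 bits per weight, Q4_0 as 18 bytes per block of 32 weights (one FP16 scale plus 16 bytes of 4-bit values), and 1 GB as 10^9 bytes. The helper name is illustrative.

```python
def bandwidth_gb_s(rows: int, cols: int, bits_per_weight: float, latency_us: float) -> float:
    """Effective weight bandwidth in GB/s (1 GB = 1e9 bytes) for one matvec call."""
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    weight_bytes = rows * cols * bits_per_weight / 8.0
    seconds = latency_us * 1e-6
    return weight_bytes / seconds / 1e9


I2_S_BITS = 2.0          # assumption: 2-bit packed ternary weights
Q4_0_BITS = 18 * 8 / 32  # assumption: 18-byte block per 32 weights = 4.5 bits

# BitNet wq, NEON 1T: 2560x2560 at 115 us/call
print(f"BitNet wq: {bandwidth_gb_s(2560, 2560, I2_S_BITS, 115):.1f} GB/s")
# Qwen2.5 wq, NEON 1T: 2048x2048 at 284 us/call
print(f"Qwen2.5 wq: {bandwidth_gb_s(2048, 2048, Q4_0_BITS, 284):.1f} GB/s")
```

Under these assumptions both results land within about 0.1 GB/s of the tabulated 14.3 and 8.3 GB/s. The small residual is consistent with latencies being rounded to whole microseconds.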
| 98 | + |
| 99 | +## Notes |
| 100 | + |
| 101 | +**Backend details:** |
| 102 | +- **ARM NEON + SDOT**: `vdotq_s32` for integer dot products (I2_S, TQ1, TQ2, Q4_0, Q8_0), `vmlaq_f32` FMA, native FP16 logits with `vfmaq_f16` |
| 103 | +- **WASM Relaxed SIMD**: `i32x4.relaxed_dot_i8x16_i7x16_add` (SDOT equivalent), `f32x4.relaxed_madd` (FMA), vectorized F16 bit-manipulation for logits |
| 104 | +- Multi-threading uses a persistent pthread pool with condvar dispatch (~2us overhead) |
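
As a reference for what the integer dot-product instructions named above compute, here is a scalar Python model of a 4-lane signed dot-product-accumulate. Each 32-bit lane accumulates the sum of four int8 products. It models the arithmetic of `vdotq_s32` only. It is not code from this project, and the function name is illustrative. As far as the Relaxed SIMD proposal specifies, the WASM instruction produces the same sums only while the second operand stays within 7 bits (0 to 127). Outside that range its result is implementation-defined.

```python
def sdot_s32(acc, a, b):
    """Scalar model of a 4-lane signed int8 dot-product-accumulate.

    acc: 4 int32 lanes; a, b: 16 int8 values each.
    Lane i receives acc[i] + sum(a[4*i + j] * b[4*i + j] for j in 0..3).
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulator lanes and 16 int8 values per operand")
    if any(not -128 <= v <= 127 for v in list(a) + list(b)):
        raise ValueError("operands must fit in int8")
    out = []
    for lane in range(4):
        total = acc[lane]
        for j in range(4):
            total += a[4 * lane + j] * b[4 * lane + j]
        # Wrap to signed 32 bits, as a hardware register would.
        total = (total + 2**31) % 2**32 - 2**31
        out.append(total)
    return out


# Ternary weights (-1, 0, +1) against int8 activations, as in a ternary matvec.
weights = [1, -1, 0, 1] * 4
activations = list(range(1, 17))
print(sdot_s32([0, 0, 0, 0], weights, activations))
```

With ternary weights every product is the activation, its negation, or zero, so the instruction reduces to signed additions packed four to a lane.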
| 105 | + |
| 106 | +**WASM limitations:** |
| 107 | +- Single-threaded only (no `SharedArrayBuffer` in Node.js RAWFS mode) |
| 108 | +- 4 GB address space limit (wasm32) — models + runtime must fit in 4 GB |
| 109 | +- Logits kernel benchmark is unreliable under WASM (V8's JIT eliminates computations whose results are unused); treat the tok/s numbers as authoritative |
| 110 | + |
| 111 | +**Reproducing:** |
| 112 | +```bash |
| 113 | +# Native |
| 114 | +make bench |
| 115 | +./bench_kernels models/<model>.gguf --iters 100 --threads 4 --toks 32 |
| 116 | + |
| 117 | +# WASM (requires Emscripten + Node.js 20+) |
| 118 | +bash bench/bench_wasm.sh |
| 119 | +node --experimental-wasm-relaxed-simd bench/bench_wasm.js models/<model>.gguf --iters 100 --toks 32 |
| 120 | +``` |