Instructions for Claude Code when working on this project.
bitnet.c is a C11 GGUF inference engine for dense, MoE, and hybrid SSM/attention LLMs. It is CPU-first, with scalar, ARM NEON/SDOT, x86 AVX2, x86 AVX512 BW/VNNI, and WASM SIMD CPU paths plus optional Metal and wgpu-native WebGPU backends. The current architecture separates model anatomy, quant formats, backend-resident state, transformer planning, CPU execution, GPU op emission, KV/logits helpers, and generation APIs.
make clean
make bitnet
make test
make debug
make asan
make avx2-check
make avx512-check
make test_avx512_quant
make fetch-wgpu
make BN_ENABLE_WEBGPU=1 bitnet test_gpu_wgpu
make BN_ENABLE_METAL=1 bitnet test_coherence

`BN_ENABLE_GPU=1` is a compatibility alias for WebGPU. Prefer `BN_ENABLE_WEBGPU=1` in new docs and commands.
Individual tests include:
make test_architecture
make test_backend_matrix
make test_model_matrix
make test_gguf
make test_quant
make test_tokenizer
make test_transformer
make test_generate
make test_session
make test_prompt_cache
make test_threadpool
make test_safety
make test_arena
make test_ssm
make test_gguf_fuzz
make test_moe
make test_qwen36
make test_gemma4
make test_turboquant
make test_gpu_backend

Coherence tests require a real GGUF model:
make BN_ENABLE_METAL=1 test_coherence
./test_coherence models/qwen2.5-3b-instruct-q4_0.gguf --metal
make BN_ENABLE_WEBGPU=1 test_coherence
./test_coherence models/model.gguf --webgpu
make test_coherence
./test_coherence models/model.gguf

Modules are organized to avoid circular dependencies and cross-product branches:

- `platform`: mmap/buffer abstraction, timing
- `gguf`: GGUF parser
- `quant`: format metadata, dequantization, CPU kernels, backend capability declarations
- `turboquant`: compressed KV support
- `model_arch`: model-family rules and tensor-role mapping
- `model`: config, immutable CPU-visible weights, model loading
- `backend_layout` / `backend_model`: backend-owned uploads, packed/fused layouts, backend session state
- `tokenizer`: BPE tokenizer
- `moe`: expert routing, loading, cache, and sparse FFN compute
- `session`: per-request mutable KV, activations, SSM, and MoE scratch
- `transformer`: planning and CPU/GPU execution
- `sampler`: token sampling
- `threadpool`: persistent pthread workers
- `bn_alloc`: allocator vtable
- `prompt_cache`: shared KV prefix cache
- `generate`: library API, prefill/generate/chat/SSE/logprobs/stop strings
- `gpu_wgpu` / `gpu_metal`: optional backend implementations
- `main`: CLI wiring

Key transformer files:
| File | Responsibility |
|---|---|
| `src/transformer.c` | Top-level forward orchestration only. |
| `src/transformer/plan.c` | Layer/block plans and placement decisions. |
| `src/transformer/cpu.c` | CPU execution for attention, SSM, FFN, MoE, RoPE, residuals. |
| `src/transformer/gpu.c` | GPU-resident execution and CPU fallback boundaries. |
| `src/transformer/gpu_emit.c` | Emits backend-neutral `BnGPUOp` commands. |
| `src/transformer/kv.c` | FP32, FP16, TurboQuant KV helpers. |
| `src/transformer/logits.c` | CPU logits routing. |
| `src/transformer/prefill.c` | Batch prefill. |

- `BnModel` is shared and immutable after load. It owns config, architecture metadata, CPU-visible weights, file state, thread pool, and shared MoE I/O.
- `BnSession` is per request. It owns KV cache, activations, SSM state, MoE scratch, and generation position.
- `BnBackendModel` and backend session state own GPU/backend-resident buffers, stacked QKV/gate-up/SSM layouts, fused buffers, activation buffers, and future CUDA backend state. CPU SIMD kernels, including AVX512, stay in `src/quant/` and do not attach handles to model weights.
- `BnQWeight`, `BnLayerWeights`, and `BnWeights` must not expose backend handles.
- `BnQuantFormatOps` owns quant block geometry, sizing, CPU hooks, repack/native layout support, split/fused capability, and backend capability metadata.
- `BnModelArchOps` owns model-family rules: tensor names, tensor roles, activation/norm choices, SSM/MoE/shared-expert rules, and architecture flags.
- Public GPU graph code uses `BnGPUOpKind`, `BnGPUOpCode`, and `BN_GPU_VALUE_*`. `BN_GPU_SHADER_*` IDs are backend-private in `src/gpu_shader.h`.
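
The shared-model / per-request-session split above can be illustrated with a minimal sketch. `ExModel`, `ExSession`, `ex_session_create`, and `ex_session_free` are hypothetical stand-ins invented for this example; they are not the real `BnModel`/`BnSession` definitions or signatures, which carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared model: read-only after load. */
typedef struct {
    int n_layers;
    int dim;
} ExModel;

/* Hypothetical stand-in for a per-request session. */
typedef struct {
    const ExModel *model; /* borrowed, never written through */
    float *kv;            /* mutable state owned by this session */
    int pos;              /* generation position */
} ExSession;

/* Create a session that borrows the model and owns its own KV buffer. */
ExSession *ex_session_create(const ExModel *m, int max_seq) {
    assert(m != NULL && max_seq > 0);
    ExSession *s = calloc(1, sizeof *s);
    if (!s) return NULL;
    size_t n = (size_t)m->n_layers * (size_t)max_seq * (size_t)m->dim;
    s->kv = calloc(n, sizeof(float));
    if (!s->kv) { free(s); return NULL; }
    s->model = m;
    return s;
}

/* Free only what the session owns; the model outlives every session. */
void ex_session_free(ExSession *s) {
    if (!s) return;
    free(s->kv);
    free(s);
}
```

Because the session holds the model through a `const` pointer, two sessions can share one model while advancing `pos` and writing `kv` independently.
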
- C11, `-Wall -Wextra`, no GNU extensions unless already isolated behind a build guard.
- No external dependencies for CPU builds beyond libc/libm/pthread.
- Public functions use module prefixes.
- Internal helpers are `static`.
- Use `_init`/`_free` pairs for caller-owned structs. `BnSession` is created with `bn_session_create` and freed with `bn_session_free`.
- Keep platform-specific code behind feature macros such as `__EMSCRIPTEN__`, `BN_ENABLE_WEBGPU`, and `BN_ENABLE_METAL`.
- Avoid global mutable state in library modules. `main.c` and `wasm/api.c` are exceptions for application-level state.
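
As a rough illustration of these conventions, the sketch below shows a module-prefixed `_init`/`_free` pair, a `static` helper, and a feature-macro guard. `BnExampleBuf` and the `bn_example_*` functions are hypothetical names made up for this example and do not exist in the codebase.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical caller-owned struct; not part of bitnet.c. */
typedef struct {
    float *data;
    size_t len;
} BnExampleBuf;

/* Internal helper: static, so it stays private to this file. */
static int example_len_ok(size_t len) {
    return len > 0 && len <= ((size_t)-1) / sizeof(float);
}

/* Public function with a module prefix. Returns 0 on success, -1 on error. */
int bn_example_buf_init(BnExampleBuf *b, size_t len) {
    if (!b || !example_len_ok(len)) return -1;
    b->data = calloc(len, sizeof(float));
    if (!b->data) return -1;
    b->len = len;
    return 0;
}

/* Matching _free: releases owned memory and resets the struct. */
void bn_example_buf_free(BnExampleBuf *b) {
    if (!b) return;
    free(b->data);
    b->data = NULL;
    b->len = 0;
}

/* Platform-specific behavior stays behind a feature macro. */
const char *bn_example_platform(void) {
#ifdef __EMSCRIPTEN__
    return "wasm";
#else
    return "native";
#endif
}
```

The caller owns the struct storage (stack or embedded in another struct); the pair only manages what the struct points to, and no state lives in globals.
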
- Add a model-family rule: update `include/model_arch.h`, then add synthetic architecture tests.
- Add a quant format: update `include/quant.h`, `src/quant/registry.c`, format kernels in `src/quant/`, Makefile sources, and quant capability tests.
- Add a CPU SIMD kernel: keep it orthogonal under `src/quant/`, declare it in the matching `include/quant_kernels_*.h`, route it from dispatch/batch/multi code with feature guards, and compare it against scalar and existing SIMD references.
- Add backend layout behavior: update `include/backend_layout.h` and `src/backend_layout.c`; keep model load backend-neutral.
- Modify transformer behavior: update the relevant plan/execution module under `src/transformer/`, not just `src/transformer.c`.
- Add GPU behavior: emit backend-neutral op codes in `gpu_emit.c`, then lower in `src/gpu_metal.m` or `src/gpu_wgpu.c`. Keep shader IDs private.
- Modify MoE expert dispatch: update `include/moe.h` and `src/moe.c`.
- Add a sampling strategy: extend `src/sampler.c`.
- Add a CLI flag: update `src/main.c` and docs.
- Export a WASM API: add the `EMSCRIPTEN_KEEPALIVE` wrapper in `wasm/api.c` and update `wasm/build.sh`.
- Integrate as a library: include `generate.h` and `session.h`, load a `BnModel`, create one `BnSession` per request, then call `bn_prefill` and `bn_generate`.
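
For the CPU SIMD kernel item, one possible shape of "feature-guarded dispatch plus a scalar reference" is sketched below. The `ex_dot_i32*` functions are hypothetical and much simpler than any real quant kernel; the sketch assumes inputs small enough that the 32-bit accumulator does not overflow.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference: always compiled, used as ground truth in tests. */
int32_t ex_dot_i32_scalar(const int32_t *a, const int32_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

#if defined(__AVX2__)
/* AVX2 path: 8 lanes per iteration, scalar tail for the remainder. */
static int32_t ex_dot_i32_avx2(const int32_t *a, const int32_t *b, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(va, vb));
    }
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int32_t sum = 0;
    for (int k = 0; k < 8; k++) sum += lanes[k];
    return sum + ex_dot_i32_scalar(a + i, b + i, n - i);
}
#endif

/* Dispatch: use the SIMD path when the build enables it, else scalar. */
int32_t ex_dot_i32(const int32_t *a, const int32_t *b, size_t n) {
#if defined(__AVX2__)
    return ex_dot_i32_avx2(a, b, n);
#else
    return ex_dot_i32_scalar(a, b, n);
#endif
}
```

A test would call `ex_dot_i32` and `ex_dot_i32_scalar` on the same inputs, including lengths that are not a multiple of the vector width, and require identical results. Integer kernels can be compared exactly; floating-point kernels usually need a tolerance.
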
- MoE I/O modes are mmap, `--pread --cache-mb N`, and experimental `--madvise`.
- `--maxseq` is important on GPU and large-context models because KV allocation follows the selected sequence cap.
- `--kv16` halves KV cache storage.
- `--kv-tq 2|3|4` enables TurboQuant compressed KV cache.
- `--draft PATH` enables speculative decoding with a same-tokenizer draft model.
- WebGPU depends on wgpu-native adapter availability. Runtime checks may skip on machines with no suitable adapter.
- Metal is macOS-only and uses system Metal/Foundation frameworks. It is functional but can lag the CPU SIMD paths on local benchmarks; keep CPU fallback boundaries explicit when adding GPU coverage.
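
To see why `--maxseq` and `--kv16` matter, the sketch below estimates KV cache size with the usual dense-attention formula: two tensors (K and V) per layer, each `n_kv_heads * head_dim` elements per position. `ExKvShape` and `ex_kv_bytes` are hypothetical names, and the formula is an assumption about plain attention layers only; hybrid SSM layers and TurboQuant-compressed KV are sized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shape description; field names are not bitnet.c's. */
typedef struct {
    size_t n_layers;
    size_t n_kv_heads;
    size_t head_dim;
} ExKvShape;

/* Bytes for K and V across all layers at a given sequence cap. */
size_t ex_kv_bytes(ExKvShape s, size_t max_seq, size_t bytes_per_elem) {
    assert(bytes_per_elem > 0);
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * max_seq * bytes_per_elem;
}
```

For an illustrative shape of 32 layers, 8 KV heads, and head dimension 128, a cap of 4096 positions comes to 1 GiB at 4 bytes per element (FP32) and 512 MiB at 2 bytes per element (FP16). Doubling the sequence cap doubles both figures.
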
WASM builds use platform_load_buffer() rather than mmap and are constrained by
wasm32 memory limits. wasm/api.c is allowed to use global application state for
the browser demo. Keep exported functions listed in wasm/build.sh.