EDGESCRIBE currently uses onnxruntime-genai for LLM and Vision inference (Qwen3-VL-2B INT4). This document evaluates whether llama.cpp is a worthwhile replacement for the LLM/Vision engines, while keeping onnxruntime-genai for ASR (Nemotron) and plain onnxruntime for TTS (Piper/Kokoro).
edgescribe.exe (1.04 MB)
├── onnxruntime.dll (13.48 MB) ← TTS engine (Piper/Kokoro)
├── onnxruntime-genai.dll ( 2.30 MB) ← ASR + LLM + Vision
└── onnxruntime_providers_shared.dll (0.02 MB)
Models:
├── nemotron/ 670 MB (ASR, 6 ONNX files)
├── qwen3-vl/ 1500 MB (LLM+Vision, 7 ONNX files)
└── piper/ 60 MB (TTS, 2 files: .onnx + .onnx.json)
────────
Total: ~2.25 GB
| Aspect | ONNX (current) | GGUF (llama.cpp) |
|---|---|---|
| Qwen3-VL-2B model size | 1500 MB | ~990 MB |
| File count | 7 files | 1 file |
| Format overhead | Protobuf graph + sidecar JSONs | Flat binary, minimal metadata |
| Tokenizer | Separate tokenizer.json + config files | Embedded in .gguf |
| Quantization | INT4 (uniform per-tensor) | Q4_K_M (mixed precision per-block) |
ONNX INT4 applies uniform quantization per tensor. GGUF's K-quant methods (Q4_K_M, Q4_K_S) use mixed precision per block — important weights get higher precision, less important ones get lower. This yields ~5–10% smaller files with equal or better quality.
Additionally, ONNX files carry:
- Full computational graph in protobuf (operator definitions, shapes, types)
- External data file (.onnx.data) with alignment padding
- Multiple sidecar JSON configs (genai_config.json, tokenizer_config.json, special_tokens_map.json, preprocessor_config.json)
GGUF stores only:
- Tensor weights in a compact binary layout
- Minimal extensible key-value metadata
- Embedded tokenizer vocabulary
ONNX (INT4):
model.onnx ~XX MB (graph + partial weights)
model.onnx.data ~XX MB (bulk weights)
genai_config.json <1 KB
tokenizer.json ~XX KB
tokenizer_config.json <1 KB
special_tokens_map.json <1 KB
preprocessor_config.json <1 KB
──────────────────────────────────────
Total: 1500 MB
GGUF (Q4_K_M):
qwen3-vl-2b-Q4_K_M.gguf ~990 MB
──────────────────────────────────────
Total: 990 MB
Savings: -510 MB (34% smaller)
| Library | Size | Needed For | Can Remove? |
|---|---|---|---|
| onnxruntime.dll | 13.48 MB | TTS (Piper/Kokoro) | ❌ No: TTS depends on it |
| onnxruntime-genai.dll | 2.30 MB | ASR (Nemotron) | ❌ No: StreamingProcessor API |
| onnxruntime_providers_shared.dll | 0.02 MB | Execution providers | ❌ No: comes with ORT |
| llama.cpp (if added) | ~3 MB | LLM + Vision | New dependency |
You cannot remove onnxruntime or onnxruntime-genai even if you switch LLM/Vision to llama.cpp:
- onnxruntime.dll is required for TTS (Piper ONNX models)
- onnxruntime-genai.dll is required for ASR (Nemotron uses OgaStreamingProcessor)
- llama.cpp cannot run the Nemotron ASR model or the Piper/Kokoro TTS models
The runtime lib cost of adding llama.cpp is +3 MB — negligible compared to the 510 MB model savings.
| Model | onnxruntime-genai (INT4), tokens/s | llama.cpp (Q4_K_M), tokens/s | Speedup |
|---|---|---|---|
| Qwen 2B | ~120–160 | ~200–250 | +40–55% |
| Qwen 7B | ~60–80 | ~110–120 | +50–75% |
| Qwen 13B | ~30–50 | ~80–110 | +100%+ |
Benchmarks on modern x86 CPUs (Zen 4/5, 16 cores). llama.cpp benefits from highly optimized SIMD (AVX2/AVX-512) kernels and K-quant-aware inference paths.
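
Throughput figures like those in the table depend heavily on hardware, so it is worth re-measuring on the target machine. The following is a minimal sketch of a timing harness, assuming an engine that streams one piece per generated token through a callback. `GenerateFn`, `StreamCallback`, and `measure_throughput` are hypothetical names introduced for illustration and are not part of EDGESCRIBE, onnxruntime-genai, or llama.cpp.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Result of one timed generation run.
struct ThroughputResult {
    std::size_t tokens = 0;  // number of pieces the engine streamed
    double seconds = 0.0;    // wall-clock time for the whole run

    double tokens_per_second() const {
        return seconds > 0.0 ? static_cast<double>(tokens) / seconds : 0.0;
    }
};

// Hypothetical engine shape: runs generation and calls the callback once per token.
using StreamCallback = std::function<void(const std::string&)>;
using GenerateFn = std::function<void(const StreamCallback&)>;

// Times a full generation run and counts the streamed pieces.
ThroughputResult measure_throughput(const GenerateFn& generate) {
    ThroughputResult result;
    const auto start = std::chrono::steady_clock::now();
    generate([&result](const std::string&) { ++result.tokens; });
    const auto stop = std::chrono::steady_clock::now();
    result.seconds = std::chrono::duration<double>(stop - start).count();
    return result;
}
```

Wrapping both the existing onnxruntime-genai loop and a llama.cpp loop in a `GenerateFn` would allow a like-for-like comparison with the same prompt and the same token limit.
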
- Hand-tuned SIMD kernels — AVX2/AVX-512/NEON assembly for quantized matmul
- K-quant native support — inference kernel matches quantization layout exactly
- Lower framework overhead — minimal abstraction, direct hardware access
- Fine-grained threading — per-layer thread scheduling, NUMA-aware
onnxruntime-genai relies on ORT's general-purpose quantized operators which are portable but less aggressively optimized for specific quantization formats.
| Feature | onnxruntime-genai | llama.cpp |
|---|---|---|
| NVIDIA CUDA | ✅ | ✅ |
| macOS Metal | ❌ Limited | ✅ First-class |
| AMD ROCm | ✅ (via ORT EP) | ✅ |
| Intel SYCL | ✅ (via OpenVINO EP) | ✅ |
| Vulkan | ❌ No dedicated Vulkan EP (DirectML is Direct3D 12 based) | ✅ |
| Windows DirectML | ✅ | ❌ |
| NPU (Intel/Qualcomm) | ✅ (via QNN/OpenVINO EP) | ❌ Limited |
macOS Metal — llama.cpp has first-class Metal support, making it significantly faster on Apple Silicon Macs. onnxruntime-genai has limited/experimental macOS GPU support.
NPU offload — onnxruntime has better NPU support via QNN and OpenVINO execution providers. If future Intel/Qualcomm NPU offload is important, ORT has the edge.
// Loading
auto model = OgaModel::Create(model_path.c_str());
auto tokenizer = OgaTokenizer::Create(*model);
// Generation
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 2048);
params->SetInputSequences(*sequences);
auto generator = OgaGenerator::Create(*model, *params);
auto tok_stream = OgaTokenizerStream::Create(*tokenizer);
while (!generator->IsDone()) {
generator->GenerateNextToken();
auto token = generator->GetSequenceData(0)[generator->GetSequenceCount(0) - 1];
auto piece = tok_stream->Decode(token);
callback(piece); // stream to user
}

// llama.cpp equivalent
// Loading
auto model_params = llama_model_default_params();
auto model = llama_model_load_from_file(gguf_path.c_str(), model_params);
auto vocab = llama_model_get_vocab(model);  // tokenizer data lives in the vocab
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048;
auto ctx = llama_init_from_model(model, ctx_params);
// Tokenization: a first call with a null buffer returns the negated token count
int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, false);
std::vector<llama_token> tokens(n_prompt);
llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, false);
// Generation
auto batch = llama_batch_get_one(tokens.data(), tokens.size());
llama_decode(ctx, batch);
auto sampler = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(sampler, llama_sampler_init_temp(0.7f));
llama_sampler_chain_add(sampler, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
while (true) {
    auto token = llama_sampler_sample(sampler, ctx, -1);
    if (llama_vocab_is_eog(vocab, token)) break;  // end-of-generation token
    char buf[128];
    int n = llama_token_to_piece(vocab, token, buf, sizeof(buf), 0, true);
    if (n < 0) break;  // piece did not fit in buf
    callback(std::string(buf, n));  // stream to user
    // prepare next batch with the new token
    batch = llama_batch_get_one(&token, 1);
    if (llama_decode(ctx, batch) != 0) break;  // e.g. context window exhausted
}
// Cleanup
llama_sampler_free(sampler);
llama_free(ctx);
llama_model_free(model);

// onnxruntime-genai (current)
auto processor = OgaMultiModalProcessor::Create(*model);
auto images = OgaImages::Create(&image_path, 1);
auto inputs = processor->ProcessImages(prompt.c_str(), *images);
// ... then generate as above
// llama.cpp (would need llava/clip integration)
// Qwen3-VL GGUF includes vision encoder
// Use llama_chat_apply_template() for Qwen3-VL format
// Vision processing is handled internally when image tokens are detected

| File | Change Required | Complexity |
|---|---|---|
| src/llm/llm_engine.cpp | Full rewrite | Medium |
| src/vision/vision_engine.cpp | Full rewrite | Medium-High |
| src/core/model_manager.cpp | Add GGUF download support | Low |
| src/server/api_server.cpp | Minor (same string I/O) | Low |
| src/cli/main.cpp | Minor (same CLI interface) | Low |
| CMakeLists.txt | Add llama.cpp as dependency | Medium |
| src/asr/transcriber.cpp | No change: stays on ORT-GenAI | None |
| src/tts/tts_engine.cpp | No change: stays on ORT | None |
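
For the "Add GGUF download support" row, one cheap safeguard is to confirm that a downloaded file really is a GGUF file before handing it to llama.cpp. GGUF files begin with the four ASCII bytes `GGUF`, so a sketch of such a check could look like the following. The function name `looks_like_gguf` is hypothetical and is not taken from model_manager.cpp.

```cpp
#include <array>
#include <cstring>
#include <fstream>
#include <string>

// Returns true if the file at `path` starts with the 4-byte GGUF magic ("GGUF").
// This only checks the header prefix; it does not validate the rest of the file.
bool looks_like_gguf(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;  // missing or unreadable file

    std::array<char, 4> magic{};
    in.read(magic.data(), static_cast<std::streamsize>(magic.size()));
    if (in.gcount() != static_cast<std::streamsize>(magic.size())) {
        return false;  // file shorter than the magic itself
    }
    return std::memcmp(magic.data(), "GGUF", magic.size()) == 0;
}
```

A truncated download or an HTML error page saved under a .gguf name would fail this check immediately, before any model loading is attempted.
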
# CMakeLists.txt addition
set(LLAMA_BUILD_COMMON ON)
set(LLAMA_CURL OFF)
add_subdirectory(external/llama.cpp)
target_link_libraries(edgescribe PRIVATE llama common)
Pros: Self-contained, pinned version, builds from source
Cons: Adds ~2-3 min to build time, source in repo
set(LLAMA_PATH "" CACHE PATH "Path to llama.cpp installation")
find_library(LLAMA_LIBRARY NAMES llama PATHS "${LLAMA_PATH}/lib")
target_include_directories(edgescribe PRIVATE "${LLAMA_PATH}/include")
target_link_libraries(edgescribe PRIVATE ${LLAMA_LIBRARY})
Pros: Fast builds, consistent with ORT_GENAI_PATH pattern
Cons: User must provide pre-built library
Runtime:
onnxruntime.dll 13.48 MB
onnxruntime-genai.dll 2.30 MB
onnxruntime_providers_shared 0.02 MB
edgescribe.exe 1.04 MB
────────
Subtotal: 16.84 MB
Models:
nemotron/ (ASR) 670 MB
qwen3-vl/ (LLM+Vision) 1500 MB
piper/ (TTS) 60 MB
────────
Subtotal: 2230 MB
TOTAL: ~2.25 GB
Runtime:
onnxruntime.dll 13.48 MB (TTS)
onnxruntime-genai.dll 2.30 MB (ASR)
onnxruntime_providers_shared 0.02 MB
llama.dll ~3.00 MB (LLM+Vision)
edgescribe.exe ~1.10 MB (slightly larger)
────────
Subtotal: 19.90 MB (+3 MB)
Models:
nemotron/ (ASR, ONNX) 670 MB
qwen3-vl.gguf (LLM+Vision) 990 MB (-510 MB)
piper/ (TTS, ONNX) 60 MB
────────
Subtotal: 1720 MB
TOTAL: ~1.74 GB (-510 MB, 23% smaller)
| Component | All ONNX | Hybrid (llama.cpp) | Delta |
|---|---|---|---|
| Runtime libs | 16.8 MB | 19.9 MB | +3 MB |
| ASR model (Nemotron) | 670 MB | 670 MB | 0 |
| LLM+Vision model | 1500 MB | 990 MB | -510 MB |
| TTS model (Piper) | 60 MB | 60 MB | 0 |
| Total | 2.25 GB | 1.74 GB | -510 MB |
| Benefit | Impact |
|---|---|
| 510 MB smaller model download | High — 34% smaller LLM package |
| 25-40% faster CPU inference | High — noticeable in chat/SOAP |
| Single-file model (1 GGUF vs 7 files) | Medium — simpler download/cache |
| macOS Metal GPU support | High — for Mac users |
| K-quant quality/size ratio | Medium — better quality at same bits |
| Massive community, rapid model updates | Medium — new models available faster |
| Smaller total package (2.25 → 1.74 GB) | High |
| Benefit | Impact |
|---|---|
| Single runtime family (ORT) for all engines | High — simpler architecture |
| Zero additional dependencies | Medium — no new build/link complexity |
| ORT-GenAI is only 2.3 MB marginal cost | Low — already paid for ORT |
| NPU support (Intel/Qualcomm via ORT EPs) | Medium — future-proofing |
| DirectML for AMD/Intel GPUs on Windows | Medium |
| No code rewrite needed | High — saves development time |
| Consistent API across ASR/LLM/Vision | Medium — easier maintenance |
Both EDGESCRIBE and llama.cpp use cpp-httplib (by yhirose) as their HTTP server:
- EDGESCRIBE: include/httplib.h v0.37.1
- llama.cpp: same library, same API patterns
This shared dependency confirms compatible design sensibilities but is NOT the main enabler for migration — the engine abstraction is.
EDGESCRIBE's API server (api_server.cpp) is completely decoupled from inference backends.
It calls abstract engine interfaces and handles string I/O:
// api_server.cpp calls engines like this:
std::string result = llm_engine->Chat(system_prompt, user_prompt, max_length);
std::string description = vision_engine->Analyze(image_path, prompt);
// ... then wraps result in JSON and returns via httplib
// The server NEVER touches OgaModel, OgaGenerator, or any ORT-GenAI types.
// Swapping the engine internals requires ZERO changes to api_server.cpp.
REWRITE (2 files):
src/llm/llm_engine.cpp ← OgaModel/Generator → llama_model/context
src/vision/vision_engine.cpp ← OgaMultiModalProcessor → llama.cpp vision API
MODIFY (2 files):
src/core/model_manager.cpp ← add GGUF download support to manifest
CMakeLists.txt ← add llama.cpp as build dependency
UNCHANGED (6 files):
src/server/api_server.cpp ← engine-agnostic, string I/O only
src/cli/main.cpp ← routes commands, engine-agnostic
src/asr/transcriber.cpp ← stays on ORT-GenAI (StreamingProcessor)
src/tts/tts_engine.cpp ← stays on ORT (Piper/Kokoro ONNX)
src/tts/phonemizer.cpp ← no change
src/asr/audio_capture.cpp ← no change
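
The decoupling described above can be sketched as an abstract engine with two interchangeable backends. This is an illustration under assumptions, not EDGESCRIBE's actual class layout: only the `Chat(system_prompt, user_prompt, max_length)` call shape comes from this document, while `LlmEngine`, `BackendName`, the two stub classes, and `MakeLlmEngine` are hypothetical. The stubs return placeholder text instead of running inference.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical engine interface; the server would only ever see this type.
class LlmEngine {
public:
    virtual ~LlmEngine() = default;
    virtual std::string Chat(const std::string& system_prompt,
                             const std::string& user_prompt,
                             int max_length) = 0;
    virtual std::string BackendName() const = 0;
};

// Stub standing in for the current onnxruntime-genai implementation.
class OrtGenAiLlmEngine : public LlmEngine {
public:
    explicit OrtGenAiLlmEngine(std::string model_dir) : model_dir_(std::move(model_dir)) {}
    std::string Chat(const std::string& system_prompt, const std::string& user_prompt,
                     int max_length) override {
        (void)system_prompt; (void)max_length;  // unused in this stub
        return "[onnxruntime-genai stub] " + user_prompt;
    }
    std::string BackendName() const override { return "onnxruntime-genai"; }
private:
    std::string model_dir_;
};

// Stub standing in for a llama.cpp-backed implementation.
class LlamaCppLlmEngine : public LlmEngine {
public:
    explicit LlamaCppLlmEngine(std::string gguf_path) : gguf_path_(std::move(gguf_path)) {}
    std::string Chat(const std::string& system_prompt, const std::string& user_prompt,
                     int max_length) override {
        (void)system_prompt; (void)max_length;  // unused in this stub
        return "[llama.cpp stub] " + user_prompt;
    }
    std::string BackendName() const override { return "llama.cpp"; }
private:
    std::string gguf_path_;
};

// True if `s` ends with `suffix`.
bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Picks a backend from the model path: *.gguf goes to llama.cpp, anything else to ORT-GenAI.
std::unique_ptr<LlmEngine> MakeLlmEngine(const std::string& model_path) {
    if (ends_with(model_path, ".gguf")) {
        return std::make_unique<LlamaCppLlmEngine>(model_path);
    }
    return std::make_unique<OrtGenAiLlmEngine>(model_path);
}
```

Selecting the backend by file extension is one possible design; it would let both backends coexist during migration so the two can be compared on the same endpoints before the ONNX LLM path is retired.
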
| Engine | Can switch to llama.cpp? | Reason |
|---|---|---|
| LLM | ✅ Yes | Qwen3-VL GGUF fully supported |
| Vision | ✅ Yes | Qwen3-VL GGUF includes vision encoder |
| ASR | ❌ No | llama.cpp cannot run the Nemotron ASR model, which relies on OgaStreamingProcessor |
| TTS | ❌ No | llama.cpp cannot run the Piper/Kokoro ONNX models |
onnxruntime and onnxruntime-genai CANNOT be fully removed. ASR requires
OgaStreamingProcessor (only in ORT-GenAI) and TTS requires Ort::Session.
| Risk | Severity | Mitigation |
|---|---|---|
| Qwen3-VL vision API differences | Medium | Well-documented in llama.cpp, community examples |
| Build complexity (two runtimes) | Low | CMake handles both; llama.cpp builds cleanly |
| API surface compatibility | None | Server layer is engine-agnostic |
| Model availability | None | Official Qwen3-VL GGUF on HuggingFace |
| Prompt template compatibility | Low | llama.cpp supports Qwen chat templates natively |
| Streaming token output | Low | llama.cpp has native streaming, same callback pattern |
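
One detail behind the "Streaming token output" row: when pieces are produced token by token as raw bytes, a multi-byte UTF-8 character can end up split across two pieces, and forwarding each piece directly may hand the client invalid UTF-8. The following is a minimal sketch of a buffer that holds back an incomplete trailing character until the rest arrives. `Utf8Streamer` and `complete_utf8_prefix` are hypothetical helpers, not part of llama.cpp or EDGESCRIBE, and the sketch does not attempt full UTF-8 validation.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Length of the longest prefix of `s` that does not end inside a UTF-8 sequence.
std::size_t complete_utf8_prefix(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t lead = n;
    // Look back over at most 4 bytes to find the last lead (or ASCII) byte.
    for (std::size_t k = 0; k < 4 && lead > 0; ++k) {
        --lead;
        const unsigned char c = static_cast<unsigned char>(s[lead]);
        if ((c & 0xC0) != 0x80) {  // not a continuation byte (10xxxxxx)
            std::size_t need = 1;
            if ((c & 0xE0) == 0xC0) need = 2;
            else if ((c & 0xF0) == 0xE0) need = 3;
            else if ((c & 0xF8) == 0xF0) need = 4;
            return (n - lead >= need) ? n : lead;
        }
    }
    return n;  // empty input or malformed tail: pass through unchanged
}

// Buffers pieces and forwards only complete UTF-8 text to the sink.
class Utf8Streamer {
public:
    explicit Utf8Streamer(std::function<void(const std::string&)> sink)
        : sink_(std::move(sink)) {}

    void push(const std::string& piece) {
        pending_ += piece;
        const std::size_t ready = complete_utf8_prefix(pending_);
        if (ready > 0) {
            sink_(pending_.substr(0, ready));
            pending_.erase(0, ready);
        }
    }

    // Emits whatever is left, e.g. at end of generation.
    void flush() {
        if (!pending_.empty()) {
            sink_(pending_);
            pending_.clear();
        }
    }

private:
    std::function<void(const std::string&)> sink_;
    std::string pending_;
};
```

In a generation loop, the existing `callback` would become the sink and each decoded piece would go through `push`, with a single `flush` after the loop ends.
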
Given that:
- Package size is a priority — saves 510 MB (34% smaller LLM model)
- CPU performance matters — 25-40% faster token generation
- Migration is low-risk — only 2 engine files to rewrite, server untouched
- Dependencies stay clean — ORT remains for ASR+TTS (sunk cost), llama.cpp adds ~3 MB
- Shared httplib.h confirms compatible design patterns
- macOS Metal support — major win for Mac users (currently unsupported)
┌─────────────────────────┐ ┌──────────────────┐
│ onnxruntime-genai │ │ llama.cpp │
│ ┌─────────────────────┐ │ │ ┌────────────┐ │
│ │ ASR (Nemotron) │ │ │ │ LLM Qwen3 │ │
│ │ 670 MB, ONNX │ │ │ │ 990MB GGUF │ │
│ └─────────────────────┘ │ │ ├────────────┤ │
├─────────────────────────┤ │ │ Vision VL │ │
│ onnxruntime (C++) │ │ │ (same GGUF)│ │
│ ┌─────────────────────┐ │ │ └────────────┘ │
│ │ TTS (Piper) │ │ └──────────────────┘
│ │ 60 MB, ONNX │ │
│ └─────────────────────┘ │ Total: ~1.74 GB
└─────────────────────────┘ (was 2.25 GB)
Both sides share: httplib.h (HTTP server), espeak-ng (phonemizer)
- Add llama.cpp as CMake dependency (submodule or pre-built)
- Rewrite src/llm/llm_engine.cpp using llama.cpp C API
- Rewrite src/vision/vision_engine.cpp using llama.cpp vision API
- Update model manifest in model_manager.cpp with GGUF variants
- Test all /v1/chat/* and /v1/vision/* endpoints (server unchanged)
- Verify ASR and TTS still work (should be zero-impact)