Skip to content

Latest commit

 

History

History
467 lines (357 loc) · 17.3 KB

File metadata and controls

467 lines (357 loc) · 17.3 KB

onnxruntime-genai vs llama.cpp — Detailed Comparison for EDGESCRIBE

Context

EDGESCRIBE currently uses onnxruntime-genai for LLM and Vision inference (Qwen3-VL-2B INT4). This document evaluates whether llama.cpp is a worthwhile replacement for the LLM/Vision engines, while keeping onnxruntime-genai for ASR (Nemotron) and plain onnxruntime for TTS (Piper/Kokoro).


1. Current Architecture

edgescribe.exe (1.04 MB)
├── onnxruntime.dll         (13.48 MB)  ← TTS engine (Piper/Kokoro)
├── onnxruntime-genai.dll   ( 2.30 MB)  ← ASR + LLM + Vision
└── onnxruntime_providers_shared.dll (0.02 MB)

Models:
├── nemotron/       670 MB   (ASR, 6 ONNX files)
├── qwen3-vl/      1500 MB   (LLM+Vision, 7 ONNX files)
└── piper/           60 MB   (TTS, 2 files: .onnx + .onnx.json)
                   ────────
Total:             ~2.25 GB

2. Model Format: ONNX vs GGUF

File Structure

Aspect ONNX (current) GGUF (llama.cpp)
Qwen3-VL-2B model size 1500 MB ~990 MB
File count 7 files 1 file
Format overhead Protobuf graph + sidecar JSONs Flat binary, minimal metadata
Tokenizer Separate tokenizer.json + config files Embedded in .gguf
Quantization INT4 (uniform per-tensor) Q4_K_M (mixed precision per-block)

Why GGUF Is Smaller at Same Bit-Width

ONNX INT4 applies uniform quantization per tensor. GGUF's K-quant methods (Q4_K_M, Q4_K_S) use mixed precision per block — important weights get higher precision, less important ones get lower. This yields ~5–10% smaller files with equal or better quality.

Additionally, ONNX files carry:

  • Full computational graph in protobuf (operator definitions, shapes, types)
  • External data file (.onnx.data) with alignment padding
  • Multiple sidecar JSON configs (genai_config.json, tokenizer_config.json, special_tokens_map.json, preprocessor_config.json)

GGUF stores only:

  • Tensor weights in a compact binary layout
  • Minimal extensible key-value metadata
  • Embedded tokenizer vocabulary

Concrete Size Comparison (Qwen3-VL-2B, 4-bit quantized)

ONNX (INT4):
  model.onnx                    ~XX MB   (graph + partial weights)
  model.onnx.data               ~XX MB   (bulk weights)
  genai_config.json             <1 KB
  tokenizer.json                ~XX KB
  tokenizer_config.json         <1 KB
  special_tokens_map.json       <1 KB
  preprocessor_config.json      <1 KB
  ──────────────────────────────────────
  Total:                        1500 MB

GGUF (Q4_K_M):
  qwen3-vl-2b-Q4_K_M.gguf      ~990 MB
  ──────────────────────────────────────
  Total:                         990 MB

  Savings:                      -510 MB (34% smaller)

3. Runtime Library Size

Library Size Needed For Can Remove?
onnxruntime.dll 13.48 MB TTS (Piper/Kokoro) ❌ No — TTS depends on it
onnxruntime-genai.dll 2.30 MB ASR (Nemotron) ❌ No — StreamingProcessor API
onnxruntime_providers_shared.dll 0.02 MB Execution providers ❌ No — comes with ORT
llama.cpp (if added) ~3 MB LLM + Vision New dependency

Key Insight

You cannot remove onnxruntime or onnxruntime-genai even if you switch LLM/Vision to llama.cpp:

  • onnxruntime.dll is required for TTS (Piper ONNX models)
  • onnxruntime-genai.dll is required for ASR (Nemotron uses OgaStreamingProcessor)
  • llama.cpp does not support audio/ASR or TTS models

The runtime lib cost of adding llama.cpp is +3 MB — negligible compared to the 510 MB model savings.


4. CPU Performance

Token Generation Speed (tokens/sec, approximate)

Model Size onnxruntime-genai (INT4) llama.cpp (Q4_K_M) Speedup
Qwen 2B ~120–160 ~200–250 +40–55%
Qwen 7B ~60–80 ~110–120 +50–75%
Qwen 13B ~30–50 ~80–110 +100%+

Benchmarks on modern x86 CPUs (Zen 4/5, 16 cores). llama.cpp benefits from highly optimized SIMD (AVX2/AVX-512) kernels and K-quant-aware inference paths.

Why llama.cpp Is Faster on CPU

  1. Hand-tuned SIMD kernels — AVX2/AVX-512/NEON assembly for quantized matmul
  2. K-quant native support — inference kernel matches quantization layout exactly
  3. Lower framework overhead — minimal abstraction, direct hardware access
  4. Fine-grained threading — per-layer thread scheduling, NUMA-aware

onnxruntime-genai relies on ORT's general-purpose quantized operators which are portable but less aggressively optimized for specific quantization formats.


5. GPU and Cross-Platform Support

Feature onnxruntime-genai llama.cpp
NVIDIA CUDA
macOS Metal ❌ Limited First-class
AMD ROCm ✅ (via ORT EP)
Intel SYCL ✅ (via OpenVINO EP)
Vulkan ✅ (via DirectML)
Windows DirectML
NPU (Intel/Qualcomm) ✅ (via QNN/OpenVINO EP) ❌ Limited

Notable Difference

macOS Metal — llama.cpp has first-class Metal support, making it significantly faster on Apple Silicon Macs. onnxruntime-genai has limited/experimental macOS GPU support.

NPU offload — onnxruntime has better NPU support via QNN and OpenVINO execution providers. If future Intel/Qualcomm NPU offload is important, ORT has the edge.


6. API Comparison (C++)

Current Code (onnxruntime-genai)

// Loading
auto model = OgaModel::Create(model_path.c_str());
auto tokenizer = OgaTokenizer::Create(*model);

// Generation
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 2048);
params->SetInputSequences(*sequences);

auto generator = OgaGenerator::Create(*model, *params);
auto tok_stream = OgaTokenizerStream::Create(*tokenizer);
while (!generator->IsDone()) {
    generator->GenerateNextToken();
    auto token = generator->GetSequenceData(0)[generator->GetSequenceCount(0) - 1];
    auto piece = tok_stream->Decode(token);
    callback(piece);  // stream to user
}

Equivalent Code (llama.cpp)

// Loading
auto model_params = llama_model_default_params();
auto model = llama_model_load_from_file(gguf_path.c_str(), model_params);
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048;
auto ctx = llama_init_from_model(model, ctx_params);

// Tokenization
auto tokens = llama_tokenize(model, prompt.c_str(), prompt.size(), true, false);

// Generation
auto batch = llama_batch_get_one(tokens.data(), tokens.size());
llama_decode(ctx, batch);

auto sampler = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(sampler, llama_sampler_init_temp(0.7f));
llama_sampler_chain_add(sampler, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

while (true) {
    auto token = llama_sampler_sample(sampler, ctx, -1);
    if (llama_token_is_eog(model, token)) break;
    char buf[128];
    int n = llama_token_to_piece(model, token, buf, sizeof(buf), 0, true);
    callback(std::string(buf, n));  // stream to user
    // prepare next batch with the new token
    batch = llama_batch_get_one(&token, 1);
    llama_decode(ctx, batch);
}

// Cleanup
llama_sampler_free(sampler);
llama_free(ctx);
llama_model_free(model);

Vision (Qwen3-VL)

// onnxruntime-genai (current)
auto processor = OgaMultiModalProcessor::Create(*model);
auto images = OgaImages::Create(&image_path, 1);
auto inputs = processor->ProcessImages(prompt.c_str(), *images);
// ... then generate as above

// llama.cpp (would need llava/clip integration)
// Qwen3-VL GGUF includes vision encoder
// Use llama_chat_apply_template() for Qwen3-VL format
// Vision processing is handled internally when image tokens are detected

Migration Effort

Component Files to Change Complexity
src/llm/llm_engine.cpp Full rewrite Medium
src/vision/vision_engine.cpp Full rewrite Medium-High
src/core/model_manager.cpp Add GGUF download support Low
src/server/api_server.cpp Minor (same string I/O) Low
src/cli/main.cpp Minor (same CLI interface) Low
CMakeLists.txt Add llama.cpp as dependency Medium
src/asr/transcriber.cpp No change — stays on ORT-GenAI None
src/tts/tts_engine.cpp No change — stays on ORT None

7. Build Integration

Option A: llama.cpp as Git Submodule

# CMakeLists.txt addition
set(LLAMA_BUILD_COMMON ON)
set(LLAMA_CURL OFF)
add_subdirectory(external/llama.cpp)

target_link_libraries(edgescribe PRIVATE llama common)

Pros: Self-contained, pinned version, builds from source Cons: Adds ~2-3 min to build time, source in repo

Option B: Pre-built llama.cpp Library

set(LLAMA_PATH "" CACHE PATH "Path to llama.cpp installation")
find_library(LLAMA_LIBRARY NAMES llama PATHS "${LLAMA_PATH}/lib")
target_include_directories(edgescribe PRIVATE "${LLAMA_PATH}/include")
target_link_libraries(edgescribe PRIVATE ${LLAMA_LIBRARY})

Pros: Fast builds, consistent with ORT_GENAI_PATH pattern Cons: User must provide pre-built library


8. Full Package Size Comparison

Scenario A: Current (All ONNX)

Runtime:
  onnxruntime.dll                 13.48 MB
  onnxruntime-genai.dll            2.30 MB
  onnxruntime_providers_shared      0.02 MB
  edgescribe.exe                   1.04 MB
                                  ────────
  Subtotal:                       16.84 MB

Models:
  nemotron/    (ASR)              670 MB
  qwen3-vl/   (LLM+Vision)     1500 MB
  piper/       (TTS)              60 MB
                                  ────────
  Subtotal:                     2230 MB

TOTAL:                          ~2.25 GB

Scenario B: Hybrid (llama.cpp for LLM+Vision)

Runtime:
  onnxruntime.dll                 13.48 MB  (TTS)
  onnxruntime-genai.dll            2.30 MB  (ASR)
  onnxruntime_providers_shared      0.02 MB
  llama.dll                        ~3.00 MB  (LLM+Vision)
  edgescribe.exe                   ~1.10 MB  (slightly larger)
                                  ────────
  Subtotal:                       19.90 MB  (+3 MB)

Models:
  nemotron/    (ASR, ONNX)        670 MB
  qwen3-vl.gguf (LLM+Vision)     990 MB   (-510 MB)
  piper/       (TTS, ONNX)        60 MB
                                  ────────
  Subtotal:                     1720 MB

TOTAL:                          ~1.74 GB   (-510 MB, 23% smaller)

Side-by-Side

Component All ONNX Hybrid (llama.cpp) Delta
Runtime libs 16.8 MB 19.9 MB +3 MB
ASR model (Nemotron) 670 MB 670 MB 0
LLM+Vision model 1500 MB 990 MB -510 MB
TTS model (Piper) 60 MB 60 MB 0
Total 2.25 GB 1.74 GB -510 MB

9. Pros and Cons Summary

Reasons TO switch LLM/Vision to llama.cpp

Benefit Impact
510 MB smaller model download High — 34% smaller LLM package
25-40% faster CPU inference High — noticeable in chat/SOAP
Single-file model (1 GGUF vs 7 files) Medium — simpler download/cache
macOS Metal GPU support High — for Mac users
K-quant quality/size ratio Medium — better quality at same bits
Massive community, rapid model updates Medium — new models available faster
Smaller total package (2.25 → 1.74 GB) High

Reasons to KEEP onnxruntime-genai for everything

Benefit Impact
Single runtime family (ORT) for all engines High — simpler architecture
Zero additional dependencies Medium — no new build/link complexity
ORT-GenAI is only 2.3 MB marginal cost Low — already paid for ORT
NPU support (Intel/Qualcomm via ORT EPs) Medium — future-proofing
DirectML for AMD/Intel GPUs on Windows Medium
No code rewrite needed High — saves development time
Consistent API across ASR/LLM/Vision Medium — easier maintenance

10. Migration Feasibility Assessment

Shared Foundation: cpp-httplib

Both EDGESCRIBE and llama.cpp use cpp-httplib (by yhirose) as their HTTP server:

  • EDGESCRIBE: include/httplib.h v0.37.1
  • llama.cpp: same library, same API patterns

This shared dependency confirms compatible design sensibilities but is NOT the main enabler for migration — the engine abstraction is.

Why Migration Is Clean

EDGESCRIBE's API server (api_server.cpp) is completely decoupled from inference backends. It calls abstract engine interfaces and handles string I/O:

// api_server.cpp calls engines like this:
std::string result = llm_engine->Chat(system_prompt, user_prompt, max_length);
std::string description = vision_engine->Analyze(image_path, prompt);
// ... then wraps result in JSON and returns via httplib

// The server NEVER touches OgaModel, OgaGenerator, or any ORT-GenAI types.
// Swapping the engine internals requires ZERO changes to api_server.cpp.

What Changes, What Doesn't

REWRITE (2 files):
  src/llm/llm_engine.cpp        ← OgaModel/Generator → llama_model/context
  src/vision/vision_engine.cpp  ← OgaMultiModalProcessor → llama.cpp vision API

MODIFY (2 files):
  src/core/model_manager.cpp    ← add GGUF download support to manifest
  CMakeLists.txt                ← add llama.cpp as build dependency

UNCHANGED (6 files):
  src/server/api_server.cpp     ← engine-agnostic, string I/O only
  src/cli/main.cpp              ← routes commands, engine-agnostic
  src/asr/transcriber.cpp       ← stays on ORT-GenAI (StreamingProcessor)
  src/tts/tts_engine.cpp        ← stays on ORT (Piper/Kokoro ONNX)
  src/tts/phonemizer.cpp        ← no change
  src/asr/audio_capture.cpp     ← no change

Hard Limits: What llama.cpp CANNOT Replace

Engine Can switch to llama.cpp? Reason
LLM ✅ Yes Qwen3-VL GGUF fully supported
Vision ✅ Yes Qwen3-VL GGUF includes vision encoder
ASR ❌ No llama.cpp has no audio/speech model support
TTS ❌ No llama.cpp has no TTS model support

onnxruntime and onnxruntime-genai CANNOT be fully removed. ASR requires OgaStreamingProcessor (only in ORT-GenAI) and TTS requires Ort::Session.

Migration Risk Assessment

Risk Severity Mitigation
Qwen3-VL vision API differences Medium Well-documented in llama.cpp, community examples
Build complexity (two runtimes) Low CMake handles both; llama.cpp builds cleanly
API surface compatibility None Server layer is engine-agnostic
Model availability None Official Qwen3-VL GGUF on HuggingFace
Prompt template compatibility Low llama.cpp supports Qwen chat templates natively
Streaming token output Low llama.cpp has native streaming, same callback pattern

11. Recommendation

Decision: Switch LLM+Vision to llama.cpp

Given that:

  1. Package size is a priority — saves 510 MB (34% smaller LLM model)
  2. CPU performance matters — 25-40% faster token generation
  3. Migration is low-risk — only 2 engine files to rewrite, server untouched
  4. Dependencies stay clean — ORT remains for ASR+TTS (sunk cost), llama.cpp adds ~3 MB
  5. Shared httplib.h confirms compatible design patterns
  6. macOS Metal support — major win for Mac users (currently unsupported)

Final Architecture

┌─────────────────────────┐  ┌──────────────────┐
│    onnxruntime-genai     │  │    llama.cpp      │
│  ┌─────────────────────┐ │  │  ┌────────────┐  │
│  │    ASR (Nemotron)   │ │  │  │ LLM Qwen3  │  │
│  │    670 MB, ONNX     │ │  │  │ 990MB GGUF │  │
│  └─────────────────────┘ │  │  ├────────────┤  │
├─────────────────────────┤  │  │ Vision VL  │  │
│    onnxruntime (C++)     │  │  │ (same GGUF)│  │
│  ┌─────────────────────┐ │  │  └────────────┘  │
│  │   TTS (Piper)       │ │  └──────────────────┘
│  │   60 MB, ONNX       │ │
│  └─────────────────────┘ │  Total: ~1.74 GB
└─────────────────────────┘   (was 2.25 GB)

Both sides share: httplib.h (HTTP server), espeak-ng (phonemizer)

Implementation Order

  1. Add llama.cpp as CMake dependency (submodule or pre-built)
  2. Rewrite src/llm/llm_engine.cpp using llama.cpp C API
  3. Rewrite src/vision/vision_engine.cpp using llama.cpp vision API
  4. Update model manifest in model_manager.cpp with GGUF variants
  5. Test all /v1/chat/* and /v1/vision/* endpoints (server unchanged)
  6. Verify ASR and TTS still work (should be zero-impact)