EDGESCRIBE Memory Management & Model Loading Strategy

Overview

EDGESCRIBE runs four independent AI engines in a single process. Naively loading all models at startup would consume ~5.5 GB of RAM even when idle. This document describes the hybrid loading strategy that keeps idle RAM under 200 MB while avoiding cold-start latency for the most-used engines.

The Two Loading Strategies

EDGESCRIBE uses different loading strategies based on how each inference runtime manages memory:

Engine	Runtime	Loading	Why
LLM	llama.cpp	Eager (mmap)	Near-zero idle RAM, no cold start
Vision	llama.cpp	Eager (mmap)	Same model as LLM, shared mapping
ASR	onnxruntime	Lazy + idle unload	Full RAM load; defer until needed
TTS	onnxruntime	Lazy + idle unload	Full RAM load; defer until needed

How mmap Works (LLM/Vision)

llama.cpp loads GGUF models via mmap() — the operating system's memory-mapped file mechanism. This is fundamentally different from reading a file into RAM:

Traditional file loading (what ONNX Runtime does):
─────────────────────────────────────────────────
  1. open("model.onnx")
  2. malloc(670 MB)          ← physical RAM allocated immediately
  3. read(fd, buffer, 670MB) ← entire file copied into RAM
  4. RAM stays occupied until explicitly freed

  Result: 670 MB physical RAM consumed, whether or not the model is being used.


Memory-mapped loading (what llama.cpp does):
─────────────────────────────────────────────
  1. open("model.gguf")
  2. mmap(990 MB)            ← virtual address space reserved
                               NO physical RAM allocated yet
  3. First access to page    ← OS page fault → loads 4 KB from disk
  4. Subsequent accesses     ← already in RAM, instant
  5. Under memory pressure   ← OS evicts least-recently-used pages
  6. Re-access evicted page  ← transparent page-in, no crash, no reload

  Result: Physical RAM usage adapts to actual access patterns automatically.

What This Means in Practice

State	Virtual Memory	Physical RAM	Explanation
Model just loaded (mmap)	990 MB	~0 MB	Address space mapped, nothing paged in yet
First chat request	990 MB	~1.5 GB	Active weight pages faulted in + KV cache
Chat complete, 30s idle	990 MB	~200 MB	OS reclaims unused pages for other processes
Chat complete, 5 min idle	990 MB	~50 MB	Most pages evicted, only hot metadata remains
New chat request	990 MB	~1.5 GB	Pages fault back in transparently (no "reload")

Key insight: The mmap'd model handle stays open. There is no "unload" and "reload" cycle — the OS manages physical RAM as a cache. From the application's perspective, the model is always loaded. From the OS's perspective, it only uses RAM for pages that are actually being accessed.

Why Not mmap Everything?

ONNX Runtime does not use mmap by default. It reads the entire model file into a malloc'd buffer and keeps it resident. This means:

ASR model (670 MB ONNX) → 670 MB physical RAM the moment it loads
TTS model (300 MB ONNX) → 300-700 MB physical RAM the moment it loads
These stay resident until the Ort::Session object is destroyed

This is why ASR and TTS use lazy loading with idle unloading — the only way to reclaim their RAM is to destroy the engine object.

Server Memory Lifecycle

Startup

edgescribe serve --port 8080
  │
  ├─ LLM/Vision (llama.cpp)
  │    mmap("qwen3-vl.gguf")     → instant, ~0 MB physical RAM
  │    Model handle ready         → no cold start on first request
  │
  ├─ ASR (onnxruntime)
  │    Not loaded yet             → 0 MB
  │    Will load on first /v1/transcribe/* request
  │
  └─ TTS (onnxruntime)
       Not loaded yet             → 0 MB
       Will load on first /v1/tts/* request

Server RAM at startup: ~50 MB (process + httplib + mmap metadata)

Under Load

Time     Event                         Physical RAM    Notes
──────   ──────                        ────────────    ─────
0s       Server starts                 ~50 MB          mmap'd, not resident
5s       GET /v1/health                ~50 MB          No model access
10s      POST /v1/chat                 ~1.5 GB         LLM pages fault in
11s      Chat streaming...             ~1.5 GB         KV cache active
15s      Chat complete                 ~1.5 GB         Pages still hot
60s      No requests (1 min idle)      ~500 MB         OS reclaims LLM pages
90s      POST /v1/transcribe/file      ~1.7 GB         ASR lazy-loads (1.2 GB)
95s      Transcription complete        ~1.7 GB         ASR stays resident
120s     POST /v1/tts/synthesize       ~2.4 GB         TTS lazy-loads (700 MB)
125s     TTS complete                  ~2.4 GB         TTS stays resident
420s     5 min idle (no requests)      ~200 MB         ASR unloaded, TTS unloaded
                                                        LLM pages evicted by OS
600s     POST /v1/chat                 ~1.5 GB         LLM pages fault back in
                                                        (transparent, no "reload")
601s     POST /v1/transcribe/file      ~2.7 GB         ASR re-loads (~2s delay)

Idle Unloading (ASR/TTS Only)

A background monitor thread checks every 30 seconds for idle ONNX engines:

// Pseudocode — see api_server.cpp for actual implementation
while (server_running) {
    sleep(30 seconds);

    if (asr_loaded && asr_idle > 5 minutes) {
        unload ASR;  // frees ~1.2 GB
    }
    if (tts_loaded && tts_idle > 5 minutes) {
        unload TTS;  // frees ~700 MB
    }
    // LLM/Vision: never explicitly unloaded — OS manages via mmap
}

Default idle timeout: 300 seconds (5 minutes). Defined as kIdleTimeoutSeconds in api_server.cpp.

Memory Budget by Scenario

Scenario 1: Chat Only (Most Common)

LLM loaded (mmap)     → ~1.5 GB active, ~50 MB idle
ASR not loaded         → 0 MB
TTS not loaded         → 0 MB
────────────────────────────────
Peak:                    ~1.5 GB
Idle:                    ~50 MB

Scenario 2: Transcription + SOAP Notes (Medical Workflow)

LLM loaded (mmap)     → ~1.5 GB (for SOAP note generation)
ASR loaded (lazy)     → ~1.2 GB (for transcription)
TTS not loaded         → 0 MB
────────────────────────────────
Peak:                    ~2.7 GB
After ASR idle unload:   ~1.5 GB (then ~50 MB when LLM pages evicted)

Scenario 3: Full Stack (Server Mode, All Features Used)

LLM loaded (mmap)     → ~1.5 GB active
Vision (shared model) → ~3.5 GB active (with image embeddings)
ASR loaded (lazy)     → ~1.2 GB
TTS loaded (lazy)     → ~0.7 GB
────────────────────────────────
Peak:                    ~5.4 GB (all engines active simultaneously)
After 5 min idle:        ~50 MB  (ASR+TTS unloaded, LLM pages evicted)

Scenario 4: Jetson Orin Nano (8 GB Shared Memory)

Available after OS:      ~6.5 GB
LLM loaded (mmap)     → ~1.5 GB active
ASR loaded (lazy)     → ~1.2 GB
TTS not loaded         → 0 MB (skip TTS to save memory)
────────────────────────────────
Peak:                    ~2.7 GB
Headroom:                ~3.8 GB (enough for CUDA kernels + OS)

Comparison with Naive Loading

Approach	RAM at Startup	RAM When Idle	Cold Start	Implementation
Load everything at startup	5.5 GB	5.5 GB	None	Simple
Lazy load everything	50 MB	50 MB	2-5s per engine	Complex
Hybrid (current)	50 MB	50 MB	None for LLM	Moderate

The hybrid approach gives you the best of both worlds:

No cold start for LLM/Vision (the most-used engines) — mmap is instant
Near-zero idle RAM — OS manages mmap pages, ONNX engines unloaded
Minimal peak RAM — only load what's actually being used

Configuration

Idle Timeout

The default idle timeout for ONNX engines (ASR/TTS) is 300 seconds (5 minutes). To change it, modify kIdleTimeoutSeconds in src/server/api_server.cpp:

static constexpr int kIdleTimeoutSeconds = 300;  // Change this value

Disabling Lazy Loading

To preload all engines at startup (original behavior), modify InitEngines() in api_server.cpp to eagerly create all engine objects. This uses more RAM but eliminates any cold-start delay for ASR/TTS.

Disabling Idle Unloading

To keep ONNX engines loaded permanently once they're first used, set the timeout to a very large value or remove the idle monitor logic.

Implementation Details

Source Files

File	What It Does
`src/server/api_server.cpp`	`EnsureASR()`, `EnsureTTS()` — lazy loaders
	`EnsureLLM()`, `EnsureVision()` — eager (mmap) availability check
	`StartIdleMonitor()` — background thread for ONNX unloading
	`InitEngines()` — startup: mmap LLM/Vision, defer ASR/TTS

Thread Safety

Each engine has its own std::mutex (asr_mutex, llm_mutex, etc.)
EnsureASR() / EnsureTTS() acquire the lock before checking/loading
The idle monitor acquires locks before unloading
Endpoint handlers acquire locks before using engines
No deadlock risk — each handler touches only one engine's mutex

Ensure* Pattern

Every endpoint uses this pattern:

svr.Post("/v1/chat", [this](const Request& req, Response& res) {
    if (!EnsureLLM(res)) return;  // loads if needed, or returns 503

    std::lock_guard<std::mutex> lock(llm_mutex);
    std::string result = llm->Chat(system, prompt, max_length);
    res.set_content(JsonObj({{"text", JsonStr(result)}}), "application/json");
});

For LLM/Vision (mmap'd), Ensure* just checks the pointer is non-null. For ASR/TTS (lazy), Ensure* loads the engine on first call and updates the last-used timestamp.

FAQ

Q: Does mmap cause latency on first request?

The first request will experience page faults as the OS loads model weights from disk. On an SSD, this adds ~100-500ms total (spread across the generation). On HDD, it can be 1-2s. Subsequent requests are instant because pages are cached. This is much faster than a full model load (which would be 3-5s for a 990 MB model).

Q: What if the system runs low on RAM?

The OS evicts mmap'd LLM pages first (they can be re-read from disk). If RAM is still insufficient, the ONNX engines' idle unloading kicks in after 5 min. On extremely constrained systems (4 GB), you may want to only configure the engines you actually need via edgescribe serve --asr-model ... --vlm-model ... and omit the others.

Q: Can I run on a 4 GB Raspberry Pi?

Yes, but only one engine at a time. Configure the server with only the models you need. The mmap'd LLM alone uses ~1.5 GB active, which fits in 4 GB with OS overhead. Don't load ASR and LLM simultaneously on 4 GB.

Q: What about GPU VRAM management?

When using --device cuda, llama.cpp offloads model layers to GPU VRAM. The mmap behavior applies to the CPU portion. VRAM-resident layers stay loaded and are not subject to OS paging. For GPU deployments, memory management is simpler — VRAM is dedicated and not shared with the OS.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EDGESCRIBE Memory Management & Model Loading Strategy

Overview

The Two Loading Strategies

How mmap Works (LLM/Vision)

What This Means in Practice

Why Not mmap Everything?

Server Memory Lifecycle

Startup

Under Load

Idle Unloading (ASR/TTS Only)

Memory Budget by Scenario

Scenario 1: Chat Only (Most Common)

Scenario 2: Transcription + SOAP Notes (Medical Workflow)

Scenario 3: Full Stack (Server Mode, All Features Used)

Scenario 4: Jetson Orin Nano (8 GB Shared Memory)

Comparison with Naive Loading

Configuration

Idle Timeout

Disabling Lazy Loading

Disabling Idle Unloading

Implementation Details

Source Files

Thread Safety

Ensure* Pattern

FAQ

FilesExpand file tree

RAM-memory-management.md

Latest commit

History

RAM-memory-management.md

File metadata and controls

EDGESCRIBE Memory Management & Model Loading Strategy

Overview

The Two Loading Strategies

How mmap Works (LLM/Vision)

What This Means in Practice

Why Not mmap Everything?

Server Memory Lifecycle

Startup

Under Load

Idle Unloading (ASR/TTS Only)

Memory Budget by Scenario

Scenario 1: Chat Only (Most Common)

Scenario 2: Transcription + SOAP Notes (Medical Workflow)

Scenario 3: Full Stack (Server Mode, All Features Used)

Scenario 4: Jetson Orin Nano (8 GB Shared Memory)

Comparison with Naive Loading

Configuration

Idle Timeout

Disabling Lazy Loading

Disabling Idle Unloading

Implementation Details

Source Files

Thread Safety

Ensure* Pattern

FAQ