Merged
Changes from 2 commits
23 commits
5922000
feat: make the 4096-token prompt ceiling explicit in EngineConfig
nmrenyi May 16, 2026
3318896
analysis: document the maxNumTokens experiment in latency_report_v2
nmrenyi May 16, 2026
31f310d
docs: maxnumtoken_investigation.md — explain why GPU breaks at lifted…
nmrenyi May 16, 2026
7e570d6
analysis: cross-link the maxnumtoken investigation from latency_repor…
nmrenyi May 16, 2026
04d88f1
docs: maxnumtoken_investigation — Step 2 reproducibility + reachabili…
nmrenyi May 16, 2026
3ac6b94
analysis: cross-link Step 2 reproducibility + FP32-reachability into …
nmrenyi May 16, 2026
14b03ee
docs: maxnumtoken_investigation — Step 3 confirms FP16 as root cause
nmrenyi May 16, 2026
06c1105
analysis: cross-link Step 3 FP32-control finding into latency_report_v2
nmrenyi May 16, 2026
8b6a813
docs: maxnumtoken_investigation — Step 5 full FP32 GPU sweep at max=4096
nmrenyi May 16, 2026
520d0b7
docs: lead investigation + report with prominent FP16 default warning
nmrenyi May 16, 2026
52e11e9
feat: consolidate max_num_tokens to single source + record provenance…
nmrenyi May 16, 2026
fb35546
docs: record artifact-fingerprint → precision mapping (verified)
nmrenyi May 16, 2026
4076a1f
docs: maxnumtoken_investigation — Step 6 instrumented FP32-vs-FP16 GP…
nmrenyi May 17, 2026
7a49c69
analysis: cross-link Step 6 instrumented sweep into latency_report_v2
nmrenyi May 17, 2026
0de3e02
docs: maxnumtoken_investigation — note FP32 not directly verified on …
nmrenyi May 17, 2026
1da590e
analysis: latency_report_v2 — surface FP16/FP32 story prominently + f…
nmrenyi May 17, 2026
11fb5d3
analysis: latency_report_v2 — tighten + fix remaining stale claims
nmrenyi May 17, 2026
eec2bff
analysis: order per-model sections with production model first (E4B b…
nmrenyi May 17, 2026
b05f6fc
analysis: latency_report_v2 — third-round polish (12 review items)
nmrenyi May 17, 2026
6bbff86
docs: maxnumtoken_investigation — refresh 'Last updated' to 2026-05-17
nmrenyi May 17, 2026
dd0596c
test(config): add schema test for runtime_config.engine.max_num_tokens
nmrenyi May 17, 2026
e78f165
fix(bench): record ACTUAL backend in benchmark JSON, not the requeste…
nmrenyi May 17, 2026
b23b129
docs: check in canonical FP16 GPU failure JSON (long_01, k=20)
nmrenyi May 17, 2026
14 changes: 13 additions & 1 deletion app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt
@@ -332,7 +332,19 @@ class RagPipeline(application: Application) {
     }
 
     private fun buildEngine(modelPath: String, backend: Backend, cacheDir: String) {
-        val e = Engine(EngineConfig(modelPath = modelPath, backend = backend, cacheDir = cacheDir))
+        // Set the prompt-budget ceiling explicitly so the limit is visible at the
+        // call site rather than inferred from runtime errors at high k. Empirical
+        // testing on Gemma 4 E4B/E2B .litertlm (see latency_report_v2.md §context
+        // wall): the engine accepts higher values (8192 init OK on both backends),
+        // but on GPU the output degenerates into a repetition loop past 4096 even
+        // when no error is thrown. 4096 is the highest value that produces clean
+        // generations across both backends for this artifact family.
+        val e = Engine(EngineConfig(
+            modelPath = modelPath,
+            backend = backend,
+            maxNumTokens = 4096,
+            cacheDir = cacheDir,
+        ))
         e.initialize()
         engine = e
     }
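One way to respect this ceiling before the engine is ever called is to count how many retrieved documents fit under it. The sketch below is in Python, to match `evaluation/aggregate_k_sweep.py`, and is not part of this PR. It assumes the runtime rejects any prompt whose token count is `>= maxNumTokens` (as the `>= 4096` error text suggests) and that per-document token counts are known up front. The function name, the `reserve_for_output` parameter, and the numbers in the usage line are hypothetical.

```python
def max_docs_within_budget(base_tokens, doc_tokens, max_num_tokens=4096,
                           reserve_for_output=0):
    """Return how many leading documents fit under the prompt ceiling.

    base_tokens: tokens used by the system prompt and the query (assumed known).
    doc_tokens: token count of each retrieved document, in rank order.
    reserve_for_output: hypothetical headroom for generated tokens; whether the
        real ceiling also covers output is an assumption, so it defaults to 0.
    """
    limit = max_num_tokens - reserve_for_output
    total = base_tokens
    kept = 0
    for n in doc_tokens:
        # The prompt must stay strictly below the limit, because the runtime
        # is assumed to reject counts that are >= the limit.
        if total + n >= limit:
            break
        total += n
        kept += 1
    return kept


# Made-up sizes: a 500-token preamble and twenty 250-token documents.
print(max_docs_within_budget(500, [250] * 20))  # prints 14
```

A check like this turns a runtime error at high k into a decision made at the call site, which is the same intent as passing `maxNumTokens` explicitly.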
48 changes: 40 additions & 8 deletions evaluation/aggregate_k_sweep.py
@@ -463,10 +463,37 @@ def write_report(runs: list[dict], out_path: Path) -> None:
               "(model × backend) combination tested: ")
     md.append("`long_01, long_03, medium_02, medium_04, short_01, short_03, short_04, short_05`. ")
     md.append("Each failure reports `Input token ids are too long. Exceeding the maximum "
-              "number of tokens allowed: …>= 4096`. ")
-    md.append("Both Gemma 4 E4B and Gemma 4 E2B ship the same 4096-token context window; "
-              "the wall is a property of the `.litertlm` artifact format, not the "
-              "parameter count or backend. **k_max ≈ 17–18** for both models.")
+              "number of tokens allowed: …>= 4096`. The cap is enforced by LiteRT-LM's "
+              "native runtime (verified by extracting `liblitertlm_jni.so` from the AAR "
+              "and locating the literal error template).")
+    md.append("")
+    md.append("### Where the 4096 comes from — and why we set it explicitly\n")
+    md.append("The Kotlin `EngineConfig` constructor exposes a `maxNumTokens` parameter; "
+              "leaving it `null` falls back to whatever the engine's default is for the "
+              "loaded artifact. The original `RagPipeline.kt` left it null, so the 4096 "
+              "ceiling was an inferred property of *somewhere* in the stack rather than a "
+              "stated choice. **A 2026-05-16 experiment on the test device pinned this "
+              "down** — see commit log for `feat/explicit-max-num-tokens`:")
+    md.append("")
+    md.append("- **Lower-bound test (`maxNumTokens = 2048`)**: queries with prompts "
+              "between 2048–4096 tokens that previously succeeded now fail, with the "
+              "error message reporting the new ceiling verbatim (`>= 2048`). Both GPU "
+              "and CPU clamp identically. **The knob is wired through to the native "
+              "runtime as-advertised.**")
+    md.append("- **Upper-bound test (`maxNumTokens = 8192`)**: `Engine.initialize()` "
+              "succeeds on both backends; the artifact is *not* hard-bounded at 4096. "
+              "Previously-failing k=20 queries now run end-to-end on both backends. "
+              "**However:** on CPU the output stays coherent (real medical reasoning, "
+              "ends with reference numbers); on GPU the output degenerates into a long "
+              "repetition loop (`* * * * ...`) past the 4096-token mark. "
+              "Same artifact, same query — output diverges by backend at lifted context.")
+    md.append("")
+    md.append("**Operational conclusion:** 4096 is the highest value that produces clean "
+              "generations across both backends for this artifact family, and is "
+              "therefore the right value to ship. `RagPipeline.kt:buildEngine()` now "
+              "passes it explicitly so the ceiling is visible at the call site rather "
+              "than left implicit. **k_max ≈ 17–18** for both models — a deployment "
+              "ceiling driven by output quality on GPU, not by a runtime hard cap.")
     md.append("")
 
     md.append("## Key findings\n")
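The lower-bound test relies on the error text reporting the configured ceiling verbatim, so a small parser could recover which ceiling was in force for each failed run. The sketch below is hypothetical and not part of `aggregate_k_sweep.py`. It relies only on the two fragments quoted in the report text (`Exceeding the maximum number of tokens allowed:` and `>= N`) and makes no assumption about the elided text between them.

```python
import re

# Matches the quoted error template and captures the number after ">=".
# Whatever sits between the colon and ">=" is skipped, since the report
# elides it and its exact shape is not known here.
_CEILING_RE = re.compile(
    r"Exceeding the maximum number of tokens allowed:.*?>=\s*(\d+)"
)


def reported_ceiling(error_message):
    """Return the token ceiling named in a failure message, or None."""
    match = _CEILING_RE.search(error_message)
    return int(match.group(1)) if match else None


# The sample string keeps the report's own elision ("…") as-is.
sample = ("Input token ids are too long. Exceeding the maximum "
          "number of tokens allowed: …>= 2048")
print(reported_ceiling(sample))  # prints 2048
```

Grouping failures by this value would make it obvious if a run was accidentally executed with a ceiling other than the one the report claims.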
@@ -495,11 +522,16 @@ def write_report(runs: list[dict], out_path: Path) -> None:
               "(mid-tier MediaTek, older Snapdragon without OpenCL) now have a realistic path: "
               "ship E2B on CPU, restrict k to small values.")
     md.append("")
-    md.append("### 5. 4096-token context wall is the binding ceiling at high k")
+    md.append("### 5. 4096-token context wall is the binding ceiling at high k — and the right one")
     md.append("k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically "
-              "across all four: same 8 queries, same 24 (query × rep) failures. The cap is in the "
-              "model artifact, not the runtime, and is **shared between E4B and E2B**. "
-              "**Latency is not the constraint at the upper end of k — context window is.**")
+              "across all four: same 8 queries, same 24 (query × rep) failures, same `>= 4096` "
+              "error. Phase B/C experiments on 2026-05-16 (see §context wall above) show the cap "
+              "is **liftable** — passing `maxNumTokens = 8192` makes the runtime accept larger "
+              "prompts — but the lift produces **quality degradation on GPU** (response loops "
+              "into repetition past 4096 tokens) while CPU output stays clean. 4096 is therefore "
+              "the right deployment ceiling for cross-backend safety, not just a memory or "
+              "runtime constraint. **Latency is not the constraint at the upper end of k — "
+              "output quality is.**")
     md.append("")
     md.append("### 6. TTFT scales linearly with retrieved-doc content past k=3")
     md.append("On both backends and both models, TTFT-per-doc-char is roughly constant past k=3, so "
18 changes: 13 additions & 5 deletions evaluation/reports/latency_report_v2.md
@@ -1,6 +1,6 @@
 # MAM-AI On-Device Latency Sweep — Model × Backend × k
 
-_Generated: 2026-05-16T09:00:40_
+_Generated: 2026-05-16T10:56:17_
 
 
 ## Device & stack
Expand Down Expand Up @@ -229,8 +229,16 @@ Each table below compares **Gemma 4 E4B** (baseline) against each comparator mod

At k=20, the **same 8 queries × 3 reps = 24 runs** failed across every (model × backend) combination tested:
`long_01, long_03, medium_02, medium_04, short_01, short_03, short_04, short_05`.
Each failure reports `Input token ids are too long. Exceeding the maximum number of tokens allowed: …>= 4096`.
Both Gemma 4 E4B and Gemma 4 E2B ship the same 4096-token context window; the wall is a property of the `.litertlm` artifact format, not the parameter count or backend. **k_max ≈ 17–18** for both models.
Each failure reports `Input token ids are too long. Exceeding the maximum number of tokens allowed: …>= 4096`. The cap is enforced by LiteRT-LM's native runtime (verified by extracting `liblitertlm_jni.so` from the AAR and locating the literal error template).

### Where the 4096 comes from — and why we set it explicitly

The Kotlin `EngineConfig` constructor exposes a `maxNumTokens` parameter; leaving it `null` falls back to whatever the engine's default is for the loaded artifact. The original `RagPipeline.kt` left it null, so the 4096 ceiling was an inferred property of *somewhere* in the stack rather than a stated choice. **A 2026-05-16 experiment on the test device pinned this down** — see commit log for `feat/explicit-max-num-tokens`:

- **Lower-bound test (`maxNumTokens = 2048`)**: queries with prompts between 2048–4096 tokens that previously succeeded now fail, with the error message reporting the new ceiling verbatim (`>= 2048`). Both GPU and CPU clamp identically. **The knob is wired through to the native runtime as-advertised.**
- **Upper-bound test (`maxNumTokens = 8192`)**: `Engine.initialize()` succeeds on both backends; the artifact is *not* hard-bounded at 4096. Previously-failing k=20 queries now run end-to-end on both backends. **However:** on CPU the output stays coherent (real medical reasoning, ends with reference numbers); on GPU the output degenerates into a long repetition loop (`* * * * ...`) past the 4096-token mark. Same artifact, same query — output diverges by backend at lifted context.

**Operational conclusion:** 4096 is the highest value that produces clean generations across both backends for this artifact family, and is therefore the right value to ship. `RagPipeline.kt:buildEngine()` now passes it explicitly so the ceiling is visible at the call site rather than left implicit. **k_max ≈ 17–18** for both models — a deployment ceiling driven by output quality on GPU, not by a runtime hard cap.

## Key findings

@@ -246,8 +254,8 @@ Decode speedup from E4B → E2B is **~1.5× on GPU** but **~2× on CPU**. Decode
 ### 4. GPU still wins, but E2B CPU opens up the no-GPU device tier
 E2B CPU is 1.4–2.4× slower than E2B GPU at every k — GPU remains the preferred backend where available. But E2B CPU at k=1 (~16 s median) is comparable to E4B GPU at k=1 (~14 s), which means devices that previously could *not* deploy MAM-AI at acceptable latency (mid-tier MediaTek, older Snapdragon without OpenCL) now have a realistic path: ship E2B on CPU, restrict k to small values.
 
-### 5. 4096-token context wall is the binding ceiling at high k
-k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically across all four: same 8 queries, same 24 (query × rep) failures. The cap is in the model artifact, not the runtime, and is **shared between E4B and E2B**. **Latency is not the constraint at the upper end of k — context window is.**
+### 5. 4096-token context wall is the binding ceiling at high k — and the right one
+k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically across all four: same 8 queries, same 24 (query × rep) failures, same `>= 4096` error. Phase B/C experiments on 2026-05-16 (see §context wall above) show the cap is **liftable** — passing `maxNumTokens = 8192` makes the runtime accept larger prompts — but the lift produces **quality degradation on GPU** (response loops into repetition past 4096 tokens) while CPU output stays clean. 4096 is therefore the right deployment ceiling for cross-backend safety, not just a memory or runtime constraint. **Latency is not the constraint at the upper end of k — output quality is.**
 
 ### 6. TTFT scales linearly with retrieved-doc content past k=3
 On both backends and both models, TTFT-per-doc-char is roughly constant past k=3, so the prefill story scales predictably. The model shrink translates directly into a TTFT shrink across the whole range.
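If TTFT does grow linearly with retrieved-document characters past k=3, as finding 6 states, an ordinary least-squares line over (characters, TTFT) pairs gives a per-character cost and a fixed overhead. The sketch below is one way that fit could be done; the helper name is hypothetical and the sample numbers are invented for illustration, not taken from this sweep.

```python
def fit_ttft_line(doc_chars, ttft_seconds):
    """Least-squares fit of ttft = slope * chars + intercept.

    Returns (slope, intercept). Requires at least two distinct x values.
    """
    n = len(doc_chars)
    mean_x = sum(doc_chars) / n
    mean_y = sum(ttft_seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in doc_chars)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(doc_chars, ttft_seconds))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented points that lie exactly on a line, purely to show the call shape.
slope, intercept = fit_ttft_line([1000, 2000, 3000], [3.0, 5.0, 7.0])
predicted = slope * 4000 + intercept  # extrapolated TTFT for 4000 characters
```

Comparing the fitted slope between E4B and E2B on the same backend would be one way to quantify the claim that the model shrink carries over to TTFT across the whole range.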