nmrenyi · May 17, 2026 · May 16, 2026
diff --git a/app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt b/app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt
@@ -332,7 +332,19 @@ class RagPipeline(application: Application) {
         }
 
     private fun buildEngine(modelPath: String, backend: Backend, cacheDir: String) {
-        val e = Engine(EngineConfig(modelPath = modelPath, backend = backend, cacheDir = cacheDir))
+        // Set the prompt-budget ceiling explicitly so the limit is visible at the
+        // call site rather than inferred from runtime errors at high k. Empirical
+        // testing on Gemma 4 E4B/E2B .litertlm (see latency_report_v2.md §context
+        // wall): the engine accepts higher values (8192 init OK on both backends),
+        // but on GPU the output degenerates into a repetition loop past 4096 even
+        // when no error is thrown. 4096 is the highest value that produces clean
+        // generations across both backends for this artifact family.
+        val e = Engine(EngineConfig(
+            modelPath = modelPath,
+            backend = backend,
+            maxNumTokens = 4096,
+            cacheDir = cacheDir,
+        ))
         e.initialize()
         engine = e
     }

diff --git a/evaluation/aggregate_k_sweep.py b/evaluation/aggregate_k_sweep.py
@@ -463,10 +463,37 @@ def write_report(runs: list[dict], out_path: Path) -> None:
               "(model × backend) combination tested: ")
     md.append("`long_01, long_03, medium_02, medium_04, short_01, short_03, short_04, short_05`. ")
     md.append("Each failure reports `Input token ids are too long. Exceeding the maximum "
-              "number of tokens allowed: …>= 4096`. ")
-    md.append("Both Gemma 4 E4B and Gemma 4 E2B ship the same 4096-token context window; "
-              "the wall is a property of the `.litertlm` artifact format, not the "
-              "parameter count or backend. **k_max ≈ 17–18** for both models.")
+              "number of tokens allowed: …>= 4096`. The cap is enforced by LiteRT-LM's "
+              "native runtime (verified by extracting `liblitertlm_jni.so` from the AAR "
+              "and locating the literal error template).")
+    md.append("")
+    md.append("### Where the 4096 comes from — and why we set it explicitly\n")
+    md.append("The Kotlin `EngineConfig` constructor exposes a `maxNumTokens` parameter; "
+              "leaving it `null` falls back to whatever the engine's default is for the "
+              "loaded artifact. The original `RagPipeline.kt` left it null, so the 4096 "
+              "ceiling was an inferred property of *somewhere* in the stack rather than a "
+              "stated choice. **A 2026-05-16 experiment on the test device pinned this "
+              "down** — see commit log for `feat/explicit-max-num-tokens`:")
+    md.append("")
+    md.append("- **Lower-bound test (`maxNumTokens = 2048`)**: queries with prompts "
+              "between 2048–4096 tokens that previously succeeded now fail, with the "
+              "error message reporting the new ceiling verbatim (`>= 2048`). Both GPU "
+              "and CPU clamp identically. **The knob is wired through to the native "
+              "runtime as advertised.**")
+    md.append("- **Upper-bound test (`maxNumTokens = 8192`)**: `Engine.initialize()` "
+              "succeeds on both backends; the artifact is *not* hard-bounded at 4096. "
+              "Previously-failing k=20 queries now run end-to-end on both backends. "
+              "**However:** on CPU the output stays coherent (real medical reasoning, "
+              "ends with reference numbers); on GPU the output degenerates into a long "
+              "repetition loop (`*   *   *   *   ...`) past the 4096-token mark. "
+              "Same artifact, same query — output diverges by backend at lifted context.")
+    md.append("")
+    md.append("**Operational conclusion:** 4096 is the highest value that produces clean "
+              "generations across both backends for this artifact family, and is "
+              "therefore the right value to ship. `RagPipeline.kt:buildEngine()` now "
+              "passes it explicitly so the ceiling is visible at the call site rather "
+              "than left implicit. **k_max ≈ 17–18** for both models — a deployment "
+              "ceiling driven by output quality on GPU, not by a runtime hard cap.")
     md.append("")
 
     md.append("## Key findings\n")
@@ -495,11 +522,16 @@ def write_report(runs: list[dict], out_path: Path) -> None:
               "(mid-tier MediaTek, older Snapdragon without OpenCL) now have a realistic path: "
               "ship E2B on CPU, restrict k to small values.")
     md.append("")
-    md.append("### 5. 4096-token context wall is the binding ceiling at high k")
+    md.append("### 5. 4096-token context wall is the binding ceiling at high k — and the right one")
     md.append("k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically "
-              "across all four: same 8 queries, same 24 (query × rep) failures. The cap is in the "
-              "model artifact, not the runtime, and is **shared between E4B and E2B**. "
-              "**Latency is not the constraint at the upper end of k — context window is.**")
+              "across all four: same 8 queries, same 24 (query × rep) failures, same `>= 4096` "
+              "error. Phase B/C experiments on 2026-05-16 (see §context wall above) show the cap "
+              "is **liftable** — passing `maxNumTokens = 8192` makes the runtime accept larger "
+              "prompts — but the lift produces **quality degradation on GPU** (response loops "
+              "into repetition past 4096 tokens) while CPU output stays clean. 4096 is therefore "
+              "the right deployment ceiling for cross-backend safety, not just a memory or "
+              "runtime constraint. **Latency is not the constraint at the upper end of k — "
+              "output quality is.**")
     md.append("")
     md.append("### 6. TTFT scales linearly with retrieved-doc content past k=3")
     md.append("On both backends and both models, TTFT-per-doc-char is roughly constant past k=3, so "

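The report text above ties the `>= 4096` rejection to a maximum usable k. A minimal Python sketch of that relationship follows. It is not code from this repository: `max_k_within_budget`, `fixed_overhead_tokens`, and the toy token counts are hypothetical, chosen only for illustration. The one behavior taken from the report is that a prompt is rejected when its token count is `>=` the configured ceiling.

```python
from bisect import bisect_left
from itertools import accumulate

MAX_NUM_TOKENS = 4096  # ceiling passed explicitly in RagPipeline.kt in this change


def max_k_within_budget(doc_token_counts, fixed_overhead_tokens, budget=MAX_NUM_TOKENS):
    """Largest k such that overhead plus the first k docs stays strictly below budget.

    Hypothetical helper: assumes docs are added in retrieval order and that the
    runtime rejects any prompt with token count >= budget, as the error text says.
    """
    # running[k] is the prompt size in tokens when k documents are included.
    running = list(accumulate(doc_token_counts, initial=fixed_overhead_tokens))
    # Index of the first prompt size that would be rejected (size >= budget).
    first_rejected = bisect_left(running, budget)
    # Everything before that index fits; clamp to 0 if even the bare prompt is too long.
    return max(first_rejected - 1, 0)


# Toy numbers, not measurements: 100 overhead tokens, 300 tokens per document.
# k=2 gives 700 tokens (accepted); k=3 gives exactly 1000, which a ">= 1000" check rejects.
toy_k = max_k_within_budget([300] * 10, fixed_overhead_tokens=100, budget=1000)
```

This sketch budgets only the prompt. Whether the configured ceiling also has to cover generated tokens is not stated in this diff, so a real guard may need to reserve extra headroom.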
diff --git a/evaluation/reports/latency_report_v2.md b/evaluation/reports/latency_report_v2.md
@@ -1,6 +1,6 @@
 # MAM-AI On-Device Latency Sweep — Model × Backend × k
 
-_Generated: 2026-05-16T09:00:40_
+_Generated: 2026-05-16T10:56:17_
 
 
 ## Device & stack
@@ -229,8 +229,16 @@ Each table below compares **Gemma 4 E4B** (baseline) against each comparator mod
 
 At k=20, the **same 8 queries × 3 reps = 24 runs** failed across every (model × backend) combination tested: 
 `long_01, long_03, medium_02, medium_04, short_01, short_03, short_04, short_05`. 
-Each failure reports `Input token ids are too long. Exceeding the maximum number of tokens allowed: …>= 4096`. 
-Both Gemma 4 E4B and Gemma 4 E2B ship the same 4096-token context window; the wall is a property of the `.litertlm` artifact format, not the parameter count or backend. **k_max ≈ 17–18** for both models.
+Each failure reports `Input token ids are too long. Exceeding the maximum number of tokens allowed: …>= 4096`. The cap is enforced by LiteRT-LM's native runtime (verified by extracting `liblitertlm_jni.so` from the AAR and locating the literal error template).
+
+### Where the 4096 comes from — and why we set it explicitly
+
+The Kotlin `EngineConfig` constructor exposes a `maxNumTokens` parameter; leaving it `null` falls back to whatever the engine's default is for the loaded artifact. The original `RagPipeline.kt` left it null, so the 4096 ceiling was an inferred property of *somewhere* in the stack rather than a stated choice. **A 2026-05-16 experiment on the test device pinned this down** — see commit log for `feat/explicit-max-num-tokens`:
+
+- **Lower-bound test (`maxNumTokens = 2048`)**: queries with prompts between 2048–4096 tokens that previously succeeded now fail, with the error message reporting the new ceiling verbatim (`>= 2048`). Both GPU and CPU clamp identically. **The knob is wired through to the native runtime as advertised.**
+- **Upper-bound test (`maxNumTokens = 8192`)**: `Engine.initialize()` succeeds on both backends; the artifact is *not* hard-bounded at 4096. Previously-failing k=20 queries now run end-to-end on both backends. **However:** on CPU the output stays coherent (real medical reasoning, ends with reference numbers); on GPU the output degenerates into a long repetition loop (`*   *   *   *   ...`) past the 4096-token mark. Same artifact, same query — output diverges by backend at lifted context.
+
+**Operational conclusion:** 4096 is the highest value that produces clean generations across both backends for this artifact family, and is therefore the right value to ship. `RagPipeline.kt:buildEngine()` now passes it explicitly so the ceiling is visible at the call site rather than left implicit. **k_max ≈ 17–18** for both models — a deployment ceiling driven by output quality on GPU, not by a runtime hard cap.
 
 ## Key findings
 
@@ -246,8 +254,8 @@ Decode speedup from E4B → E2B is **~1.5× on GPU** but **~2× on CPU**. Decode
 ### 4. GPU still wins, but E2B CPU opens up the no-GPU device tier
 E2B CPU is 1.4–2.4× slower than E2B GPU at every k — GPU remains the preferred backend where available. But E2B CPU at k=1 (~16 s median) is comparable to E4B GPU at k=1 (~14 s), which means devices that previously could *not* deploy MAM-AI at acceptable latency (mid-tier MediaTek, older Snapdragon without OpenCL) now have a realistic path: ship E2B on CPU, restrict k to small values.
 
-### 5. 4096-token context wall is the binding ceiling at high k
-k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically across all four: same 8 queries, same 24 (query × rep) failures. The cap is in the model artifact, not the runtime, and is **shared between E4B and E2B**. **Latency is not the constraint at the upper end of k — context window is.**
+### 5. 4096-token context wall is the binding ceiling at high k — and the right one
+k=15 works cleanly on all four (model × backend) combinations. k=20 fails identically across all four: same 8 queries, same 24 (query × rep) failures, same `>= 4096` error. Phase B/C experiments on 2026-05-16 (see §context wall above) show the cap is **liftable** — passing `maxNumTokens = 8192` makes the runtime accept larger prompts — but the lift produces **quality degradation on GPU** (response loops into repetition past 4096 tokens) while CPU output stays clean. 4096 is therefore the right deployment ceiling for cross-backend safety, not just a memory or runtime constraint. **Latency is not the constraint at the upper end of k — output quality is.**
 
 ### 6. TTFT scales linearly with retrieved-doc content past k=3
 On both backends and both models, TTFT-per-doc-char is roughly constant past k=3, so the prefill story scales predictably. The model shrink translates directly into a TTFT shrink across the whole range.
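
The GPU failure described in finding 5 throws no error, so it only shows up by reading the output. One way such a loop could be flagged automatically in a sweep is sketched below. This is an assumption-laden illustration, not part of the evaluation code: `trailing_repeat_count`, `looks_degenerate`, and the `min_repeats=20` threshold are all hypothetical, and the check only catches loops that run to the end of the response.

```python
def trailing_repeat_count(text, max_unit_len=32):
    """Largest number of consecutive repeats of any short unit at the end of text.

    Tries every unit length from 1 to max_unit_len characters and counts how many
    back-to-back copies of the final unit the text ends with.
    """
    best = 1 if text else 0
    for unit_len in range(1, min(max_unit_len, len(text)) + 1):
        unit = text[-unit_len:]
        count = 1
        pos = len(text) - 2 * unit_len
        # Walk backwards one unit at a time while the same unit keeps appearing.
        while pos >= 0 and text[pos:pos + unit_len] == unit:
            count += 1
            pos -= unit_len
        best = max(best, count)
    return best


def looks_degenerate(text, min_repeats=20):
    """Heuristic flag: the response ends in a long run of one repeated unit."""
    # Strip trailing whitespace so a loop cut off after a space still lines up.
    return trailing_repeat_count(text.rstrip()) >= min_repeats


# A response that ends in the `*   *   *` pattern quoted in the report is flagged;
# an ordinary sentence ending in reference numbers is not.
flagged = looks_degenerate("Some answer text. " + "*   " * 50)
clean = looks_degenerate("Take the medication twice daily [1][3].")
```

A threshold of 20 repeats is deliberately loose so that short legitimate runs (ellipses, table rules) are unlikely to trip it; the right value would need tuning against real outputs.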