feat: make 4096-token prompt ceiling explicit + experimental findings #60

Open
nmrenyi wants to merge 12 commits into main from feat/explicit-max-num-tokens

Conversation

nmrenyi (Owner) commented May 16, 2026

Summary

  • Code change (1 line + comment): pass maxNumTokens = 4096 explicitly in EngineConfig at RagPipeline.kt:buildEngine() (see the sketch below). The value was already being applied via the engine's null-default; this commit makes the constraint visible at the call site.
  • Experimental confirmation (no committed runtime change, only documentation): tested both lower (2048) and upper (8192) values on both backends to pin down what 4096 actually means. Findings landed in latency_report_v2.md §"Errors and the 4096-token context wall".
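
For readers skimming without opening the diff, here is a minimal sketch of the call-site shape. Only EngineConfig and maxNumTokens are taken from this PR; the surrounding parameters are placeholders, not the actual RagPipeline.kt code.

```kotlin
// RagPipeline.kt, buildEngine(): illustrative sketch only.
val engineConfig = EngineConfig(
    /* ...existing model/backend settings unchanged... */
    // Explicit prompt budget (prompt + response tokens). Previously left
    // null, so the engine silently fell back to the same 4096; naming it
    // here makes the deployment ceiling visible at the call site instead
    // of only in ">= 4096" runtime errors at high k.
    maxNumTokens = 4096,
)
```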

Why this matters

The prior latency report claimed the 4096 wall was "a property of the .litertlm artifact format." That was an inference, based on (a) the fact that we never set the value in our code and (b) the API docstring saying "default from model or engine." This PR actually tested it, and the inference turned out to be wrong.

What the experiment found

| Test | Result | Implication |
| --- | --- | --- |
| maxNumTokens = 2048 (both backends) | k=10 prompts fail with `>= 2048` instead of `>= 4096` | The Kotlin knob actually clamps the native runtime — value flows through |
| maxNumTokens = 8192 (CPU) | Engine.initialize() succeeds; long_01 and long_03 at k=20 (previously failed at 4096) now generate clean responses ending with proper references | Artifact does not hard-cap at 4096 |
| maxNumTokens = 8192 (GPU) | Engine.initialize() succeeds; long_01 at k=20 generates real medical text initially, then degenerates into a repetition loop (`* * * * ...`) for ~5000 chars | GPU has a soft quality boundary near 4096 even though it accepts larger inputs |

So 4096 isn't an architectural cap — it's the highest cross-backend-safe value for clean output on the current Gemma 4 .litertlm artifacts. Lifting it works mechanically but breaks GPU output, which is the deployment-relevant backend.

Test plan

  • Build CPU APK (flutter build apk --release -PuseGpuForLlm=false) → install → smoke test at k=3 succeeds.
  • Build GPU APK (flutter build apk --release -PuseGpuForLlm=true) → install → smoke test at k=3 succeeds.
  • Verify k=20 queries still fail with the same >= 4096 error as before this PR (no regression in wall behavior).
  • CI green.

What this PR does NOT do

  • Does not lift the wall in production (we just documented why we wouldn't want to).
  • Does not change deployment defaults — app_config.json still ships E4B.
  • Does not open the door to k=20 deployment — the GPU quality regression makes that a non-starter without a different artifact.

Future work (not in this PR)

  • Open an issue at litert-community/ for a Gemma 4 .litertlm build with a larger context window + matching higher-quality decode behavior.
  • If we ever want to ship k=17–19 we'd need to test response quality at that k more carefully — the wall measurements above are init/error-based, not quality-based.

🤖 Generated with Claude Code

nmrenyi added 2 commits May 16, 2026 10:56
The Kotlin EngineConfig accepts a `maxNumTokens` parameter that defaults
to null. The previous buildEngine() call left it null, so the prompt
ceiling that nurses and midwives ran into at high k was an inferred
property of the stack rather than a stated choice in source.

Pass `maxNumTokens = 4096` explicitly. This makes the ceiling visible
at the call site and self-documents the deployment constraint — anyone
reading RagPipeline.kt can now find the number directly instead of
grepping LiteRT-LM error strings.

Why 4096 specifically (verified empirically on 2026-05-16):
- Lower-bound test: passing 2048 clamps the engine to 2048 and the
  error message reports `>= 2048`, so the value really is plumbed
  through to the native runtime, not a no-op.
- Upper-bound test: passing 8192 does NOT cause an init failure —
  the artifact does not enforce a hard 4096 cap. However, on the GPU
  backend the model produces degenerate output (repetition loops)
  for prompts past 4096 tokens, even when no error is thrown. CPU
  stays coherent in the same scenario.
- 4096 is therefore the highest cross-backend-safe value for this
  artifact family.

Full experimental notes will land in latency_report_v2.md in the next
commit.
Expand the "Errors and the 4096-token context wall" section with the
findings from the 2026-05-16 Phase B/C experiments on the OnePlus
OPD2413 test device:

- The cap is enforced in liblitertlm_jni.so (native runtime), confirmed
  by extracting the AAR and locating the error template.
- Passing maxNumTokens=2048 actually clamps to 2048 on both backends
  (lower-bound knob check).
- Passing maxNumTokens=8192 succeeds at init on both backends — the
  4096 ceiling is therefore NOT artifact-baked, contrary to the prior
  claim in this section.
- At 8192, k=20 prompts now run end-to-end. CPU output stays clean;
  GPU output degenerates into a repetition loop past the 4096-token
  boundary.

Update Key finding #5 to reflect the corrected story: 4096 isn't a
hard runtime cap, it's the highest cross-backend-safe value. The
deployment ceiling is driven by GPU output quality, not by an init-
time or runtime constraint.

Both updates land in aggregate_k_sweep.py (the canonical source), and the
markdown report is regenerated from it.
Copilot AI review requested due to automatic review settings May 16, 2026 02:57

Copilot AI left a comment


Pull request overview

This PR makes the LiteRT-LM 4096-token ceiling explicit in the Android RAG engine setup and updates the latency report plus its generator to document the experimental basis for keeping that ceiling.

Changes:

  • Sets maxNumTokens = 4096 explicitly when constructing EngineConfig.
  • Updates the generated latency report with findings from 2048/8192 token-limit experiments.
  • Keeps aggregate_k_sweep.py in sync so future regenerated reports preserve the updated explanation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt | Makes the LLM token ceiling explicit in EngineConfig with explanatory context. |
| evaluation/aggregate_k_sweep.py | Updates generated report text for the 4096-token context-wall findings. |
| evaluation/reports/latency_report_v2.md | Regenerates/updates the latency report with the new experimental conclusion. |


nmrenyi added 4 commits May 16, 2026 14:51
… context

A standalone investigation note documenting the follow-up to PR #60's
Phase A/B/C experiments. Combines two pieces of work:

Step 4 — Source-code precision check (LiteRT-LM OSS + native lib):
- GPU on Android defaults to FP16 for text decoder activations.
  Verified from a literal string in liblitertlm_jni.so: "System's
  default activation type for Text decoder is fp16. Vision encoder
  and audio encoder default is fp32."
- CPU runs FP32 via XNNPACK. No FP16 attention kernels in CPU paths.
- Our Kotlin code doesn't override the default — we are running on
  whatever the system picks per backend.
- Side finding: maxNumTokens is *total context* (prompt + response),
  equivalent to KV cache size, per the upstream header comment.

Step 1 — Transition-point analysis on the bad GPU response:
- long_01 at k=20 with maxNumTokens=8192 has a deterministic
  4917-token prompt. Response begins coherent for ~50 generated
  tokens (medical prose, references, formatting) and then collapses
  sharply into `* * * ...` for the remaining ~1450 tokens.
- Transition is at total context ~5000 — not 4096, not gradual.
- The model successfully prefills past 4096; what breaks is decode-
  side KV writes past prompt end. Most consistent with a kernel
  boundary that FP32 (CPU) absorbs and FP16 (GPU) does not.
- Operational consequence: the 4096 deployment ceiling has ~900
  tokens of safety margin. k=15 deployment is nowhere near the
  cliff — rules out a silent-degradation concern.
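
A quick arithmetic check of that transition point, using only the figures above: a 4917-token prompt plus roughly 50 clean generated tokens puts total context at about 4967 when the collapse starts, i.e. right around the ~5000 mark rather than at 4096. That is consistent with the Step 4 side finding that maxNumTokens counts prompt and response together (total context / KV cache size).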

Doc also lays out three open questions (Steps 2/3 and the artifact's
prefer_activation_type field) for follow-up if anyone wants to write
this up rigorously upstream.
…t_v2

The latency report's context-wall section now points to the
standalone investigation doc and lists the four headline findings:

- GPU defaults to FP16 (Android Adreno OpenCL); CPU runs FP32
  (XNNPACK)
- At maxNumTokens=8192, GPU collapse happens at total context
  ~5000 — not 4096, with a sharp transition over ~50 generated
  tokens
- Failure is decode-side (KV writes past prompt end), not prefill-
  side
- 4096 deployment ceiling has ~900 tokens of safety margin;
  current k=15 ships well below the breakdown zone

Source is in aggregate_k_sweep.py so the cross-link survives any
future report regeneration. Regenerated the markdown report.
…ty constraint

Step 2 result (bit-identical across 3 reps on GPU at maxNumTokens=8192):

- All 3 reps produced response_length_chars=6027, est_tokens=1506,
  transition at char 400 (total context ~5017), and bit-identical
  response head and tail.
- Decode times vary by 0.05% — system jitter only. Model outputs are
  identical token-for-token.
- This rules out stochastic FP16 drift as the proximate cause. GPU
  uses greedy decoding by default (max_top_k=1 from GpuConfig), so
  with the same prompt and same numerical paths the output is the
  same every time. The breakdown is a *deterministic* kernel-level
  issue that FP16 produces consistently.

Reachability constraint added (a new section):

- The natural FP32-on-GPU control test is not possible from the
  public Kotlin API in LiteRT-LM 0.11.0. Verified from four sources:
  Config.kt (no precision field on EngineConfig), Engine.kt (the
  nativeCreateEngine signature has no precision arg), LiteRtLmJni.kt
  (no SetActivationDataType bridge), and the C++ CreateDefault()
  factory (doesn't set activation_data_type_).
- The C++ side has the SetActivationDataType method, but it is not
  wired to the Kotlin/JNI layer in this version.
- Three plausible unblocks: (b) modify the .litertlm artifact's
  metadata to set prefer_activation_type=FLOAT32 — half day of
  flatc work; (c) file upstream to expose the API — right systemic
  fix but doesn't help us now; (d) build a custom AAR — multi-day,
  out of scope.
- Mechanism story is well-anchored even without the control test;
  4096 stays the ship value regardless.

Updates: Refined mechanism hypothesis paragraph to mark FP16 as
"candidate differentiator" rather than confirmed; "What's still open"
table marks Step 2 + reachability resolved; adds new open questions
for the artifact-modification path.
…latency_report_v2

The latency report's context-wall section now mentions:

- 3-rep reproducibility test returned bit-identical output — same chars,
  same transition position, same head, same tail. With greedy decoding
  this rules out stochastic FP16 noise; the breakdown is deterministic.
- The FP32-on-GPU control test would directly confirm or refute FP16 as
  the root cause, but the Kotlin API in LiteRT-LM 0.11.0 does not expose
  SetActivationDataType — the JNI bridge has no precision parameter.
- The investigation doc lists three plausible unblocks (artifact-header
  modification, upstream API exposure, custom AAR build).

Source is in aggregate_k_sweep.py so the cross-link survives any future
report regeneration. Regenerated the markdown report.
Copilot AI review requested due to automatic review settings May 16, 2026 07:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +335 to +341
// Set the prompt-budget ceiling explicitly so the limit is visible at the
// call site rather than inferred from runtime errors at high k. Empirical
// testing on Gemma 4 E4B/E2B .litertlm (see latency_report_v2.md §context
// wall): the engine accepts higher values (8192 init OK on both backends),
// but on GPU the output degenerates into a repetition loop past 4096 even
// when no error is thrown. 4096 is the highest value that produces clean
// generations across both backends for this artifact family.
- **(c) File an upstream issue with `google-ai-edge/LiteRT-LM`** to expose `SetActivationDataType` in the Kotlin `EngineConfig` API. Doesn't unblock us now, but is the right systemic fix.
- **(d) Build a custom LiteRT-LM AAR.** Clone the repo, add a field to the Kotlin `EngineConfig` + parameter to `nativeCreateEngine` + plumbing to the C++ setter. Multi-day project; out of scope.

For now this PR documents the constraint and the experimental options. The mechanism story is already well-anchored even without the FP32 control: Steps 4, 1, 2 together make the FP16-induced deterministic kernel-zone breakdown the obvious hypothesis. The control test would *confirm* it; it doesn't change the deployment recommendation either way.
nmrenyi added 4 commits May 16, 2026 16:45
The FP32-on-GPU control test that we'd flagged as blocked-by-API-gap
is now done, via option (b) — modifying the .litertlm artifact's
section metadata to inject `prefer_activation_type=float32`. The
litert-lm-builder pip package supports this directly via the TOML
`additional_metadata` field; no AAR fork or Kotlin patching needed.

Mechanism:
- Used litert-lm-peek to dump E4B sections + auto-generated model.toml.
- Added `prefer_activation_type=float32` under additional_metadata
  on the prefill_decode section in the TOML.
- Rebuilt the .litertlm with litert-lm-builder. Section data
  byte-identical; only metadata header changed.
- Pushed to device, rebuilt GPU APK at maxNumTokens=5000, ran
  long_01 at k=15.

Result:
- Logcat at engine init: "section_prefer_activation_type: float32"
  and "activation_data_type: FLOAT32" — runtime honored the override.
- Response: 998 chars / 249 tokens of clean medical reasoning, no
  degeneration anywhere in the response. Total context at response
  end was ~4514 — past 4096 but below the FP16 cliff at ~5000.

Conclusion: FP16 is the root cause of the GPU breakdown, not a
symptom amplifier. Same artifact, same prompt, same backend, same
greedy decoding — only the activation precision changed, and it
fixed the output. The "kernel boundary independent of precision"
hypothesis is ruled out.

Memory observations on the test device:
- maxNumTokens=8192 with FP32 OOMs the app at first large prefill
  (KV cache ~5.8 GB + model ~3.4 GB + activations + overhead
  exceeds the ~10 GB available RAM).
- maxNumTokens=5000 works cleanly.
- Practical FP32-GPU ceiling on this hardware tier is somewhere in
  the 6500–7500 token range; we didn't bisect since latency is the
  bigger deployment blocker.

Updates: Step 3 section added, Refined mechanism hypothesis #4
upgraded from "candidate differentiator" to "confirmed root cause",
What's still open table refreshed (FP16-root-cause and artifact-
metadata questions moved to Resolved), TL;DR rewritten.
The latency report's context-wall section bullet about the FP32 test
gap is updated to reflect Step 3's result: FP16 is confirmed as the
root cause via the artifact-metadata override path. The Kotlin API
still doesn't expose SetActivationDataType, but the per-section
`prefer_activation_type` key in the .litertlm metadata is honored
by the runtime and gives us the same control.

Bullet now states the resolved mechanism plus the deployment-relevant
caveats: FP32 GPU is ~2-3x slower at decode and the KV cache doubles
in size, so it's not a free swap from FP16.

Source change is in aggregate_k_sweep.py so the cross-link survives
any future report regeneration. Regenerated the report.
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but
against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours
total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). Result
is a clean cell-by-cell comparison.

Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s
extra wait per query), much less than the ~3× I'd estimated from
the single-data-point Step 3 measurement. That earlier number was
wrong — I had accidentally been comparing FP32-E4B against
FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.

The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0s, FP32 TTFT 2.0–9.8s (~2–2.5× across all k)
- FP16 decode 11–18s, FP32 decode 12–19s (essentially identical)

Mechanism: prefill is compute-bound (one parallel forward pass over
the input), so FP16's 2× arithmetic throughput on Adreno helps a
lot. Decode is bandwidth-bound (sequential token-at-a-time loading
of weights), so the FP16/FP32 precision choice barely matters.
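
Back-of-envelope check using the upper ends of those ranges (illustrative only, not a new measurement): FP16 at high k is roughly 4 s TTFT + 18 s decode ≈ 22 s end to end, while FP32 is roughly 10 s + 19 s ≈ 29 s. The extra ~6 s sits almost entirely in prefill, and the overall slowdown lands in the same ballpark as the ~25% headline figure.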

Same 24 errors at k=20 on FP32 GPU as on FP16 GPU — the prompt-cap
rejection at maxNumTokens=4096 is precision-agnostic, just a config
check.

What this means: **FP32 GPU is a real shipping option, not just an
experiment.** At maxNumTokens=4096 the latency cost is ~25% (no
quality benefit — we're below the FP16 cliff anyway). At higher
maxNumTokens (e.g., 5500), FP32 GPU enables clean output past the
FP16 cliff at the same ~25% latency hit. Memory ceiling caps
maxNumTokens at ~6500–7500 on this 16 GB device since KV cache
doubles vs FP16.

The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin
tradeoff at the deployment level, not a feasibility question.

Updates: new Step 5 section with full sweep table + corrected
slowdown narrative; updated TL;DR bullet on FP32 latency cost;
"What's still open" table marks FP32 latency curve resolved; full
8-JSON inventory added to References.
The investigation doc and the latency report now open with a clear,
unmissable callout about the failure mode anyone touching this stack
needs to know about:

- The default activation precision on Android GPU in LiteRT-LM is
  FP16.
- FP16 attention causes a silent, deterministic decoding failure
  (collapse into `*` repetition) once total context exceeds the
  artifact's calibrated zone (~5000 tokens for the current Gemma 4
  .litertlm artifacts).
- No error is raised — the response just becomes garbage tokens.
- The breakdown is bit-exactly reproducible because GPU defaults to
  greedy decoding.

References the concrete example JSON that captures this failure
(benchmark_20260516T104730_k20.json) and notes it should be kept
in the repo as the reference case.

The current MAM-AI deployment stays below the breakdown by shipping
maxNumTokens=4096, so we're safe today. The warning is for future
contributors who consider lifting the cap, switching artifacts, or
debugging unexpected output on GPU.

Updates:
- New blockquote callout at the top of maxnumtoken_investigation.md
  before the existing TL;DR section.
- Headline bullets in latency_report_v2.md (sourced from
  aggregate_k_sweep.py) lead with the same warning + pointer to
  the example JSON.
Copilot AI review requested due to automatic review settings May 16, 2026 13:48

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines +7 to +8
> A concrete example of this failure is captured in [`benchmark_20260516T104730_k20.json`](../latency_results/benchmark_20260516T104730_k20.json): query `long_01` at k=20, FP16 GPU, maxNumTokens=8192. The response opens with coherent medical reasoning for the first ~50 generated tokens, then deterministically collapses into an asterisk-repetition loop for the remaining ~1450 tokens. **Keep this file in the repo as the reference example of the failure mode.**
>
Comment on lines +335 to +341
// Set the prompt-budget ceiling explicitly so the limit is visible at the
// call site rather than inferred from runtime errors at high k. Empirical
// testing on Gemma 4 E4B/E2B .litertlm (see latency_report_v2.md §context
// wall): the engine accepts higher values (8192 init OK on both backends),
// but on GPU the output degenerates into a repetition loop past 4096 even
// when no error is thrown. 4096 is the highest value that produces clean
// generations across both backends for this artifact family.
Comment on lines +507 to +510
"[`benchmark_20260516T104730_k20.json`](../latency_results/benchmark_20260516T104730_k20.json) "
"(long_01, k=20, FP16 GPU, maxNumTokens=8192). Today's 4096 cap stays "
"well below the breakdown; lifting the cap on FP16 GPU is unsafe without "
"switching to FP32.")
nmrenyi added 2 commits May 16, 2026 22:26
… in JSON

Three connected changes to make benchmark JSONs self-describing for the
FP16-vs-FP32 / context-window comparison:

1. Single source of truth for max_num_tokens
   - Add `engine.max_num_tokens: 4096` to runtime_config.json
   - RagPipeline.kt reads it from there instead of hardcoded literal in
     EngineConfig(maxNumTokens = 4096)
   - BenchmarkForegroundService.kt reads it from the same source and
     records the actual value in the JSON's config block as
     `max_num_tokens`

2. Remove the max_tokens=32000 fiction
   - The previous JSON config block recorded `"max_tokens": 32000` as a
     hardcoded literal. The value was never passed to any API or
     enforced anywhere; it was a documentation lie that obscured what
     the runtime actually did. Removed entirely.
   - Sampler params (temperature/top_p/top_k) were also hardcoded in
     two places (BenchmarkForegroundService + the runtime_config.json
     RagPipeline reads from). Consolidated to read from runtime_config.

3. Add provenance fields to the benchmark JSON
   - `artifact_fingerprint`: SHA-256 of the first 64 KB of the loaded
     .litertlm. The FlatBuffers header lives in that region, so this
     hash uniquely identifies the artifact variant (e.g. distinguishing
     the default Gemma 4 build from a `prefer_activation_type=float32`-
     tagged rebuild). Critical for reviewing FP16-vs-FP32 comparisons
     where the only difference is the artifact metadata.
   - `git_commit_sha`: wired into BuildConfig via a new gitShortSha()
     helper in app/android/app/build.gradle.kts. Reviewers can trace
     any benchmark JSON back to the exact source state that produced
     it.
   - `litertlm_version`: existed in BuildConfig already, now also
     surfaced in benchmark JSON metadata.
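
A sketch of the read in point 1 above: the engine.max_num_tokens key is from this commit, while the asset location and the JSON shape around it are assumptions.

```kotlin
// Sketch only: assumes runtime_config.json ships as an app asset and that the
// key lives under a top-level "engine" object, e.g.
// {"engine": {"max_num_tokens": 4096, ...}, ...}
import android.content.Context
import org.json.JSONObject

fun readMaxNumTokens(context: Context): Int {
    val raw = context.assets.open("runtime_config.json")
        .bufferedReader().use { it.readText() }
    return JSONObject(raw)
        .getJSONObject("engine")
        .getInt("max_num_tokens")
}
```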

Verified via smoke benchmark (medium_01 k=3 GPU): JSON now contains
max_num_tokens=4096, the actual sampler params from runtime_config,
the artifact_fingerprint of the FP32-tagged Gemma 4 E4B artifact
currently installed on the test device, and the git_commit_sha of the
parent commit. No max_tokens=32000 field.

Motivation: while writing up the FP32-vs-FP16 latency comparison,
discovered that the existing JSONs cannot distinguish runs from
different (maxNumTokens, activation_precision) configurations because
none of those settings were recorded. The first benchmark file in the
investigation (long_01 at k=20, asterisk-loop failure) doesn't say
which maxNumTokens it ran at — that fact came from session memory only.
Going forward, JSONs are self-describing for these comparisons.
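
For completeness, a sketch of the gitShortSha() helper from point 3. The function name and the BuildConfig wiring are from this commit message; the implementation and the GIT_COMMIT_SHA field name are assumptions.

```kotlin
// app/android/app/build.gradle.kts (sketch only)
fun gitShortSha(): String = runCatching {
    ProcessBuilder("git", "rev-parse", "--short", "HEAD")
        .directory(rootDir)              // run from the checkout Gradle is building
        .redirectErrorStream(true)
        .start()
        .inputStream.bufferedReader().readText().trim()
}.getOrDefault("").ifBlank { "unknown" }

android {
    defaultConfig {
        // Surfaced to the app and copied from BuildConfig into each benchmark
        // JSON's config block. Field name is an assumed choice.
        buildConfigField("String", "GIT_COMMIT_SHA", "\"${gitShortSha()}\"")
    }
}
```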
The benchmark JSONs from 52e11e9 onward record `config.artifact_fingerprint`
(SHA-256 of the first 64 KB of the loaded .litertlm). This is critical
because the FP32-tagged rebuild and the original FP16 default share the
same filename — without the fingerprint, a JSON can't tell us which
precision the run used.

Add a new "Reference: artifact fingerprint mapping" section to the
investigation doc with the two fingerprints + the precision they
correspond to, verified cryptographically against the local files:

- 9fdf9dd1... = FP32-tagged Gemma 4 E4B (prefer_activation_type=float32
  injected on prefill_decode section). All Step 3 and Step 5 FP32 GPU
  runs use this artifact.
- cfa067b6... = Original litert-community/gemma-4-E4B-it-litert-lm from
  HuggingFace. FP16 by runtime default per Step 4.

Both files are exactly 3,654,467,584 bytes — litert-lm-builder
preserved section offsets, only metadata bytes differ.

Includes a 5-line Python snippet so anyone can re-verify the mapping
against their own local artifacts. Future artifact variants should
extend the table.
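
The committed doc carries that snippet in Python; an equivalent Kotlin/JVM sketch (the file path is a placeholder) for anyone re-checking from the Android side:

```kotlin
// Recomputes config.artifact_fingerprint: SHA-256 of the first 64 KB of a
// .litertlm file, the region that holds the FlatBuffers header distinguishing
// the FP32-tagged rebuild from the original FP16-default artifact.
import java.io.File
import java.security.MessageDigest

fun artifactFingerprint(path: String): String {
    val head = File(path).inputStream().use { it.readNBytes(64 * 1024) }
    val digest = MessageDigest.getInstance("SHA-256").digest(head)
    return digest.joinToString("") { "%02x".format(it) }
}

fun main() {
    // Expect 9fdf9dd1... for the FP32-tagged E4B and cfa067b6... for the original.
    println(artifactFingerprint("gemma-4-E4B-it.litertlm"))  // placeholder filename
}
```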

Why this matters now: the overnight FP32-vs-FP16 GPU sweep produces
JSONs whose precision condition is only verifiable via this mapping
(no direct activation_data_type field in the JSON — adding that would
require parsing the .litertlm FlatBuffers header in Kotlin, out of
scope for tonight). With this doc committed, every future JSON is
classifiable.
