analysis: Gemma 4 E2B latency sweep + cross-model report #59
Merged
The benchmark service hardcoded "gemma-4-E4B-it.litertlm" in the results JSON metadata, so switching models in app_config.json silently left stale data in benchmark output. Read llm_model from the same asset RagPipeline uses so the JSON always reflects what was actually loaded. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Pointing the app at gemma-4-E2B-it.litertlm so we can run the same k-sweep benchmarks as the E4B baseline for a direct comparison on the OPPO Snapdragon 8 Elite device. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The E2B sweep started by another session (Phase 1 committed at 976a8ac and 3042d38) hit issues with the agent-orchestration wait pattern in two attempts. Writing a self-contained runbook so a new CLI session can pick it up from Phase 2 with full context — what's already done, what to verify, the exact 16 measurements to collect, the analysis work for Phase 4, and the constraints (no push, no scope change). Lives at evaluation/runbooks/e2b_sweep.md. If we end up doing more sweeps in this style, the directory is a natural home for similar job-aid documents. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Group runs by (model, backend, k) instead of (backend, k) so the matrix can hold both Gemma 4 E4B and E2B side-by-side. Emit one per-model section (the existing six tables) for each model, plus a new cross-model comparison section with total/TTFT/decode ratio tables.
Why: with the E2B sweep landing, the prior aggregator would have collapsed E4B and E2B into the same cells. The (model, backend, k) grouping keeps the existing single-model view intact while making the E4B-vs-E2B story explicit.
Notes:
- Added LEGACY_DEFAULT_MODEL fallback for any JSON missing config.model. All current files have it set, so this is purely defensive.
- Refactored the per-model tables into _write_per_model_section() and the cross-model tables into _write_cross_model_table() to keep write_report() readable.
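The regrouping described above can be sketched as follows. This is a minimal illustration, not the code in `aggregate_k_sweep.py`: only `config.model` and `LEGACY_DEFAULT_MODEL` are named in this PR, while the `backend` and `k` field names and the `build_matrix` helper are assumptions made for the example.

```python
from collections import defaultdict

# Fallback for runs recorded before config.model existed (those were all E4B).
LEGACY_DEFAULT_MODEL = "gemma-4-E4B-it.litertlm"


def build_matrix(runs):
    """Group parsed benchmark runs by (model, backend, k).

    `runs` is a list of dicts. The "backend" and "k" keys under "config"
    are assumed field names, not confirmed by the real JSON schema.
    """
    matrix = defaultdict(list)
    for run in runs:
        config = run.get("config", {})
        model = config.get("model") or LEGACY_DEFAULT_MODEL
        matrix[(model, config["backend"], config["k"])].append(run)
    return dict(matrix)


runs = [
    {"config": {"model": "gemma-4-E2B-it.litertlm", "backend": "gpu", "k": 3}},
    {"config": {"backend": "gpu", "k": 3}},  # legacy run without config.model
]
matrix = build_matrix(runs)
```

With the model in the key, the two runs above land in separate cells instead of colliding in a single (backend, k) cell.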
Re-run aggregate_k_sweep.py over the full set of 32 canonical runs (16 E4B + 16 E2B). The report now has:
- Per-model sections for both Gemma 4 E4B and Gemma 4 E2B, each with the existing six tables (headline / TTFT / decode / p95 / errors / wall-clock).
- A new cross-model comparison section with E4B-vs-E2B ratio tables for total query latency, TTFT, and decode.
- Rewritten Key findings to reflect the measured speedup ratios: ~1.5× total GPU, ~2.2× total CPU.
Prefill is compute-bound on both backends (~2.3× speedup); decode is bandwidth-bound on GPU (~1.5×) and compute-bound on CPU (~2×). The architectural story cleanly explains why total speedup is decode-dominated at low k and climbs toward the prefill ratio at high k.
The 4096-token context wall is now confirmed across all four (model × backend) combinations: same 8 queries × 3 reps = 24 errors on each, anchored as a property of the .litertlm artifact format.
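One way to see why the total speedup moves from the decode ratio toward the prefill ratio as k grows: if total latency is TTFT plus decode, the E4B ÷ E2B total ratio is a weighted average of the two per-phase ratios, weighted by the share of the E2B total spent in prefill. The sketch below uses the GPU ratios quoted above; the prefill shares are made-up illustration values, not measurements, and `total_speedup` is a hypothetical helper.

```python
def total_speedup(prefill_ratio, decode_ratio, prefill_share):
    """Blend per-phase speedups into a total-latency speedup.

    prefill_share is the fraction of the faster model's total latency
    spent in prefill (TTFT), between 0 and 1.
    """
    if not 0.0 <= prefill_share <= 1.0:
        raise ValueError("prefill_share must be in [0, 1]")
    return prefill_share * prefill_ratio + (1.0 - prefill_share) * decode_ratio


# GPU ratios quoted in this PR: ~2.3x prefill, ~1.5x decode.
# The prefill shares are illustrative assumptions only.
low_k = total_speedup(2.3, 1.5, prefill_share=0.1)
high_k = total_speedup(2.3, 1.5, prefill_share=0.5)
```

Because prefill time grows with the retrieved context while decode length stays roughly fixed, a larger k raises the prefill share and pulls the blended ratio upward.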
Replace the projected E2B latency table in §2 with measured numbers
from the 2026-05-16 E2B sweep on Snapdragon 8 Elite. The measured
speedup ratio is ~1.5× on GPU and ~2× on CPU (not the originally
projected uniform 2×), and the architectural reason is now spelled
out in latency_report_v2.md: GPU prefill is compute-bound, GPU
decode is bandwidth-bound, CPU is compute-bound throughout.
Specific changes:
- §6 Open questions: mark "Actual E2B CPU latency" as resolved
with a pointer to the new report. Add a new follow-up question
about validating the mid-tier MediaTek 2×-slowdown extrapolation
on real hardware.
- §2 Backend × model × k feasibility: replace projected E2B table
with measured Snapdragon 8 Elite values; keep the MediaTek row
flagged as extrapolation and call that out in the section preamble.
- TL;DR: add a fourth rule covering E2B CPU's newly-measured
deployment envelope (k=10 comfortable on flagship CPU, k=3–5
borderline on mid-tier).
Why: the original notes shipped with projections marked clearly as
"halve E4B numbers". With real measurements in hand, those rows now
carry actual data, and the deployment-relevant rule of thumb
("E2B CPU opens up the no-GPU device tier") is anchored on numbers
rather than a speculative ratio.
Open the doc with a bolded single sentence that captures the deployment-relevant takeaway: which (model × backend × k) cells fit a 60 s latency budget on the Snapdragon 8 Elite test device, anchored on the RAM floors and the 4096-token context wall. Why: a reader skimming for "can I ship E2B on CPU?" should get the answer from the first line, before the TL;DR's four rules and the detail tables.
Pull request overview
Extends the existing on-device latency benchmarking/analysis workflow to include Gemma 4 E2B alongside E4B, producing a combined “model × backend × k” report and updating deployment guidance accordingly.
Changes:
- Refactors `evaluation/aggregate_k_sweep.py` to group and report by `(model, backend, k)`, and adds cross-model comparison tables.
- Regenerates `evaluation/reports/latency_report_v2.md` and updates `evaluation/reports/device_compatibility_notes.md` with measured E2B results and revised findings.
- Updates benchmark JSON metadata to record the loaded model from `app_config.json`, and adds a runbook for reproducing/continuing the E2B sweep.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| evaluation/runbooks/e2b_sweep.md | New operational runbook for running the E2B GPU/CPU k-sweeps and updating analysis artifacts. |
| evaluation/reports/latency_report_v2.md | Regenerated combined report with per-model sections and E4B↔E2B comparison tables. |
| evaluation/reports/device_compatibility_notes.md | Updates deployment feasibility guidance using measured E2B latencies and revised open questions. |
| evaluation/aggregate_k_sweep.py | Adds model dimension to aggregation + emits per-model and cross-model sections in the report. |
| config/app_config.json | Switches default configured LLM model from E4B to E2B. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt | Writes config.model into benchmark JSON by reading llm_model from app_config.json. |
Comment on lines 1 to 3
| { | ||
| "llm_model": "gemma-4-E4B-it.litertlm", | ||
| "llm_model": "gemma-4-E2B-it.litertlm", | ||
| "embedding_model": "Gecko_1024_quant.tflite", |
- This work mirrors the E4B latency sweep that landed in PR #57 (commit `1be0a55` on `main`). The E4B results are in `evaluation/reports/latency_report_v2.md` and the device-compatibility analysis is in `evaluation/reports/device_compatibility_notes.md`.
- We're now measuring the **smaller** Gemma 4 E2B variant (~2 GB instead of E4B's 3.66 GB) to find out how much faster it is in real terms on the same hardware. Same 16 measurements as E4B: 8 GPU (k ∈ {1, 3, 5, 7, 10, 15, 20} + No-RAG) + 8 CPU.
- Test device: **OPPO OPD2413 (Snapdragon 8 Elite, SM8750P)** connected via ADB. OPPO Hans battery-optimization whitelist is **already configured** by the user; don't re-do it.
Comment on lines 10 to 13
2. **E2B minimum RAM: 4 GB** total. The smaller model halves the runtime memory footprint (~1.7 GB), opening up the $100–$150 device tier that's the largest slice of the African market.
3. **E4B on CPU: k=3 is the borderline.** Beyond k=3, CPU totals exceed the 60 s budget on most mid-tier silicon. **E4B on GPU: no latency worry.** Totals stay 13–25 s across k=0–15 on Snapdragon 8 Elite + Adreno.
4. **E2B on CPU: k=10 is comfortable on flagship CPU; k=3–5 on mid-tier MediaTek.** Measured E2B CPU at k=10 on Snapdragon 8 Elite is 26 s median; extrapolating ~2× slower for mid-tier MediaTek gives ~50 s at k=5–7 (borderline). This is the deployment-relevant change from the May 2026 sweep: E2B CPU is **~2× faster than E4B CPU**, not the originally-projected speedup, and that means CPU-only deployment is finally viable up to mid-range k on the no-GPU device tier.
Comment on lines +420 to +426
| # Use E4B as baseline when present; ratio is E4B/E2B so >1 means E2B is faster. | ||
| if len(models) > 1: | ||
| baseline = "gemma-4-E4B-it.litertlm" if "gemma-4-E4B-it.litertlm" in models else models[0] | ||
| others = [m for m in models if m != baseline] | ||
| md.append("## Cross-model comparison\n") | ||
| md.append( | ||
| f"Ratios below are **{_short_model_label(baseline)} ÷ {_short_model_label(others[0])}**, " |
The comparison-section intro previously named the first comparator model explicitly (`others[0]`), which would silently mislead readers if a third model were ever added to the matrix — the intro would still mention only one comparator while the loop below rendered a table per `other`. Reword the intro to describe the comparison generically (baseline ÷ each comparator) and list all comparators inline so it self-describes no matter how many models the matrix contains. Also align the architectural-context phrasing with the Key-findings section: GPU prefill is compute-bound (tracks parameter count), GPU decode is bandwidth-bound (gains less from shrinkage), CPU is compute-bound throughout. Regenerate the report. Addresses Copilot review comment on PR #59.
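A minimal sketch of the generic intro this commit describes, not the actual code in `aggregate_k_sweep.py`. The helper name `comparison_intro`, its `label` parameter (standing in for `_short_model_label`), and the exact wording of the sentence are assumptions made for illustration.

```python
def comparison_intro(baseline, others, label=lambda m: m):
    """Describe baseline ÷ comparator ratios for any number of comparators."""
    if not others:
        raise ValueError("need at least one comparator model")
    names = ", ".join(label(m) for m in others)
    return (
        f"Ratios below are **{label(baseline)} ÷ each comparator** "
        f"({names}), so values above 1 mean the comparator is faster."
    )


models = ["gemma-4-E2B-it.litertlm", "gemma-4-E4B-it.litertlm"]
# Same baseline rule as the reviewed snippet: prefer E4B when present.
baseline = "gemma-4-E4B-it.litertlm" if "gemma-4-E4B-it.litertlm" in models else models[0]
others = [m for m in models if m != baseline]
intro = comparison_intro(baseline, others)
```

Listing every comparator from `others` keeps the sentence accurate if a third model is added, which is the failure the review comment pointed at.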
The benchmark JSONs record `device.manufacturer="OnePlus"`, but the runbook and the §5 deployment-market table in device_compatibility_notes both called the device "OPPO" (OPPO Find X8). Same physical hardware, different brand label — OPPO and OnePlus share platforms and the OPD2413 ships under both brands depending on market — but the inconsistency makes it hard for a reader to cross-reference the runbook against the regenerated latency report. Settle on "OnePlus OPD2413" (firmware-reported) as the canonical name and add the OPPO branding parenthetically wherever the term first appears in a doc. Addresses Copilot review comment on PR #59.
The previous wording said E2B CPU is "~2× faster than E4B CPU, not the originally-projected speedup," which reads as a contradiction since the original notes projected a uniform ~2× speedup across both backends. The thing that actually diverged from the projection is **GPU total speedup** (~1.5×, not the projected 2×), and the reason is architectural: GPU decode is bandwidth-bound and benefits less from parameter-count shrinkage. CPU matches the projection. Rewrite the rule to make the backend split explicit instead of contrasting CPU against an unspecified projection. Addresses Copilot review comment on PR #59.
Commit 3042d38 (Phase 1 of the E2B latency sweep) switched the production `llm_model` in config/app_config.json from E4B to E2B so the benchmark would load the smaller model. That change rode into PR #59 even though the PR's own body — and the safety note in device_compatibility_notes.md §6 — explicitly state that this work does **not** authorize a deployment swap: the answer-quality regression check on kenya_vignettes and the AfriMed-QA SAQ judge run is still listed as "Critical before any model swap decision". Per CLAUDE.md, evaluation quality and response safety are the top priorities for this medical app. Shipping a default-model swap without the safety eval would mean the production app loads a model whose accuracy on safety-critical medical-advice metrics has not been validated. Revert the default to `gemma-4-E4B-it.litertlm`. The benchmark measurements that motivated the PR are unaffected — the .litertlm file used for the sweep is recorded in each benchmark JSON's config.model field, so the data is preserved regardless of the deployed default. To re-run the E2B benchmark on a future branch, the runbook author will need to temporarily flip llm_model to E2B before kicking off benchmark_latency.py (and revert before opening the PR). Addresses Copilot review comment on PR #59.
Comment on lines +374 to +379
| models = sorted(set(m for (m, _b, _k) in matrix.keys())) | ||
| all_ks = sorted(set(k for (_m, _b, k) in matrix.keys())) | ||
|
|
||
| sample = next(iter(matrix.values())) | ||
| dev = sample["data"]["device"] | ||
|
|
Comment on lines +3 to +33
Self-contained instructions for finishing the E2B latency sweep started by another session. **Phase 1 (setup) is already complete on branch `feat/e2b-latency-sweep`.** Your job is Phase 2 (GPU sweep), Phase 3 (CPU sweep), and Phase 4 (analysis + local commits, **no push**). Expected wall-clock: **~5 hours**.

## 0. Context: read this first

- This work mirrors the E4B latency sweep that landed in PR #57 (commit `1be0a55` on `main`). The E4B results are in `evaluation/reports/latency_report_v2.md` and the device-compatibility analysis is in `evaluation/reports/device_compatibility_notes.md`.
- We're now measuring the **smaller** Gemma 4 E2B variant (~2 GB instead of E4B's 3.66 GB) to find out how much faster it is in real terms on the same hardware. Same 16 measurements as E4B: 8 GPU (k ∈ {1, 3, 5, 7, 10, 15, 20} + No-RAG) + 8 CPU.
- Test device: **OnePlus OPD2413 (Snapdragon 8 Elite, SM8750P)** connected via ADB. That is the firmware-reported manufacturer (`device.manufacturer="OnePlus"` in the benchmark JSONs); the same OPD2413 hardware ships under the OPPO brand in some markets. The OPPO/OnePlus Hans battery-optimization whitelist is **already configured** by the user; don't re-do it.
- The benchmark infrastructure is in `evaluation/benchmark_latency.py`; the aggregator is `evaluation/aggregate_k_sweep.py`. Both are already correct for this work, with one expected exception in Phase 4 (the aggregator needs a `model` dimension added).

### Why this is a runbook and not a single Bash command

Each benchmark run takes **12–20 minutes wall-clock** (E2B is ~1.5× faster than E4B based on the smoke test, not 2×). You can't realistically loop them in one foreground shell command; bash timeouts cap at 10 minutes in our tooling. Use `Bash run_in_background: true` and **wait for the harness completion notification** between runs. Don't use `tail -F`, sleep loops, or watchdog patterns; those caused the previous subagent to bail at 87 seconds.

---

## 1. Verify Phase 1 state: fail loud if anything's missing

Run these checks before touching anything:

```bash
cd ~/Downloads/mamai
git status            # should show clean working tree on branch feat/e2b-latency-sweep
git log --oneline -3  # should show 3042d38, 976a8ac at the top
```

Expected log:
```
3042d38 config: switch llm_model to Gemma 4 E2B
976a8ac fix(benchmark): read model name from app_config asset
a2205ff docs: device compatibility notes — which phones can run E4B / E2B
```
Comment on lines +384 to +389
| // Read model name from the same app_config.json asset the RagPipeline uses, | ||
| // so the JSON metadata reflects whatever model is actually loaded rather than | ||
| // a hardcoded string that goes stale when we switch model artifacts. | ||
| put("model", JSONObject( | ||
| application.assets.open("app_config.json").bufferedReader().use { it.readText() } | ||
| ).getString("llm_model")) |
…atrix The matrix builder assumed at least one canonical benchmark JSON was loaded and indexed via `next(iter(matrix.values()))`. On a fresh checkout `evaluation/latency_results/` is gitignored and may not exist, so the script would crash with a bare StopIteration that gives a new contributor no useful direction. Detect the empty case and exit with a message pointing at benchmark_latency.py and the runbooks directory. Verified two paths: existing 32-JSON workflow still produces the report; a /tmp checkout with no JSONs now prints the directional error and exits cleanly. Addresses Copilot review comment on PR #59.
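The empty-matrix guard described above can be sketched like this. The helper name `pick_sample_run` and the exact message text are assumptions; the paths are the ones named in this PR.

```python
import sys


def pick_sample_run(matrix):
    """Return one run for device metadata, or exit with guidance if none exist."""
    if not matrix:
        # sys.exit with a string prints it to stderr and exits with status 1,
        # replacing the bare StopIteration from next(iter(...)).
        sys.exit(
            "No canonical benchmark JSONs found under evaluation/latency_results/. "
            "Run evaluation/benchmark_latency.py first; see evaluation/runbooks/."
        )
    return next(iter(matrix.values()))
```

Checking for emptiness before calling `next(iter(...))` is the whole fix; the message only needs to tell a new contributor where the inputs come from.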
PR #59 reverted llm_model in config/app_config.json back to E4B (commit 84e1bfd), but the runbook's Phase 1 verification still expects fresh benchmark JSONs to record `config.model == "gemma-4-E2B-it.litertlm"`. Anyone reading this runbook on a fresh checkout would hit that mismatch without an explanation. Add a preamble note at the top of the runbook covering the four-step re-run procedure (flip config → rebuild/install → sweep → revert config) and a sentence explaining the git-log expectation drift for replays after PR merge. Keep the rest of the runbook intact — the Phase 2/3 sweep instructions still apply verbatim once the config is flipped. Addresses Copilot review comment on PR #59.
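The flip and revert steps of that procedure could be automated along the following lines. This is a hypothetical helper, not something the runbook or the repository provides; it only covers editing `llm_model` and restoring the file, while the rebuild/install and sweep steps still happen inside the `with` block.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_llm_model(config_path, model_name):
    """Set llm_model for the duration of a sweep, then restore the file exactly."""
    path = Path(config_path)
    original_text = path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    config["llm_model"] = model_name
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    try:
        yield path
    finally:
        # Revert even if the sweep raises, so the flipped default never reaches a PR.
        path.write_text(original_text, encoding="utf-8")


# Intended use (not executed here):
# with temporary_llm_model("config/app_config.json", "gemma-4-E2B-it.litertlm"):
#     ...  # rebuild and install the app, then run the sweep
```

Restoring the original text rather than re-serialising the JSON keeps `git status` clean afterwards. A hard kill of the process would skip the `finally` block, so a manual check before opening a PR is still worthwhile.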
…ures The model-name lookup in writeResults() opens and parses app_config.json from the APK assets at the moment the final benchmark JSON is being built. That happens at the *end* of a 20-minute sweep, after all 54 result records have been accumulated in memory but before they hit disk. An IOException on the asset open, or a JSONException on a malformed config, would propagate up and discard the entire results array. The failure probability is low (the asset is bundled inside the APK and the file is generated at build time), but the consequence is severe — we'd lose every measurement from a multi-hour run because we couldn't read a metadata string. Wrap the read in try/catch and fall back to `"unknown"` for the model field if the asset can't be read or parsed. The model field in the JSON is metadata for downstream analysis (aggregate_k_sweep.py groups by it); an "unknown" tag will just route the run into its own matrix cell rather than colliding with a real model, which is the right failure mode for an unexpected build state. Addresses Copilot review comment on PR #59.
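The actual change is Kotlin in `BenchmarkForegroundService.kt` and is not reproduced in this excerpt. The Python below is only an illustration of the same failure mode: a metadata lookup that degrades to `"unknown"` instead of raising. The function name `read_model_name` is hypothetical.

```python
import json


def read_model_name(config_text, default="unknown"):
    """Extract llm_model from app_config.json text, falling back on any parse problem."""
    try:
        return str(json.loads(config_text)["llm_model"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing key, or a non-object document all end up here,
        # so a metadata failure can never discard the accumulated results.
        return default
```

The fallback value matters as much as the catch: a distinct `"unknown"` tag puts the run in its own (model, backend, k) cell downstream rather than mixing it with a real model's runs.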
1. **E4B minimum RAM: 6 GB** total. 4 GB phones cannot run E4B reliably (model alone needs ~3.3 GB at runtime; Android + bundled apps eat 1.5–2 GB).
2. **E2B minimum RAM: 4 GB** total. The smaller model halves the runtime memory footprint (~1.7 GB), opening up the $100–$150 device tier that's the largest slice of the African market.
3. **E4B on CPU: k=3 is the borderline.** Beyond k=3, CPU totals exceed the 60 s budget on most mid-tier silicon. **E4B on GPU: no latency worry.** Totals stay 13–25 s across k=0–15 on Snapdragon 8 Elite + Adreno.
4. **E2B on CPU: k=10 is comfortable on flagship CPU; k=3–5 on mid-tier MediaTek.** Measured E2B CPU at k=10 on Snapdragon 8 Elite is 26 s median; extrapolating ~2× slower for mid-tier MediaTek gives ~50 s at k=5–7 (borderline). The original notes projected a uniform ~2× speedup across backends; measurements show **CPU matches that projection (~2× total speedup)** but **GPU total speedup is closer to ~1.5×** because decode is bandwidth-bound and gains less from the parameter-count shrink. Either way, CPU-only deployment is finally viable up to mid-range k on the no-GPU device tier; that's the deployment-relevant change from the May 2026 sweep.
Comment on lines +13 to +16
Notes on model identification: post-fix JSONs (commit 976a8ac onward) record
`config.model` from the app asset; earlier runs do not. For any JSON missing
`config.model` we default to `gemma-4-E4B-it.litertlm` since the only sweeps
that predate the fix were E4B. Future runs of any model are unaffected.
nmrenyi added a commit that referenced this pull request on May 16, 2026
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). Result is a clean cell-by-cell comparison.
Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s extra wait per query), much less than the ~3× I'd estimated from the single-data-point Step 3 measurement. That earlier number was wrong: I had accidentally been comparing FP32-E4B against FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.
The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0 s, FP32 TTFT 2.0–9.8 s (~2–2.5× across all k)
- FP16 decode 11–18 s, FP32 decode 12–19 s (essentially identical)
Mechanism: prefill is compute-bound (one parallel forward pass over the input), so FP16's 2× arithmetic throughput on Adreno helps a lot. Decode is bandwidth-bound (sequential token-at-a-time loading of weights), so the FP16/FP32 precision choice barely matters.
Same 24 errors at k=20 on FP32 GPU as on FP16 GPU: the prompt-cap rejection at maxNumTokens=4096 is precision-agnostic, just a config check.
What this means: **FP32 GPU is a real shipping option, not just an experiment.** At maxNumTokens=4096 the latency cost is ~25% (with no quality benefit, since we're below the FP16 cliff anyway). At higher maxNumTokens (e.g., 5500), FP32 GPU enables clean output past the FP16 cliff at the same ~25% latency hit. Memory ceiling caps maxNumTokens at ~6500–7500 on this 16 GB device since KV cache doubles vs FP16. The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin tradeoff at the deployment level, not a feasibility question.
Updates: new Step 5 section with full sweep table + corrected slowdown narrative; updated TL;DR bullet on FP32 latency cost; "What's still open" table marks FP32 latency curve resolved; full 8-JSON inventory added to References.
Summary
- `evaluation/latency_results/` (16 E4B + 16 E2B).
- `aggregate_k_sweep.py` from `(backend, k)` to `(model, backend, k)` grouping so both models sit side-by-side. Adds per-model sections plus a cross-model comparison section with total / TTFT / decode ratio tables.
- `latency_report_v2.md` with E4B-vs-E2B comparison and updated Key findings. Headline: ~1.5× total speedup on GPU, ~2.2× on CPU, with the architecturally-clean split that GPU decode is bandwidth-bound while CPU is compute-bound throughout. The 4096-token context wall is now confirmed across all four (model × backend) combinations: the same 8 queries × 3 reps = 24 errors fail identically on every cell.
- `device_compatibility_notes.md` §2 with measurements; add a bolded one-line deployment summary as the document lead.
Deployment-relevant takeaway
On the Snapdragon 8 Elite test device under a 60 s latency budget, E4B (6 GB RAM floor) deploys only on GPU across all k ≤ 15 or on CPU at k ≤ 3, while E2B (4 GB RAM floor) deploys on both GPU and CPU across all k ≤ 15, with k = 20 ruled out for both models by the 4096-token context wall regardless of backend.
E2B CPU at k=1 (~16 s median) is now comparable to E4B GPU at k=1 (~14 s), opening up the no-GPU mid-tier-device deployment path. Caveat: the mid-tier MediaTek 2×-slowdown row in §2 is still a Geekbench-anchored extrapolation, not a measurement; it is flagged as a new high-priority open question in §6.
Test plan
- `Loaded 32 canonical runs across 2 models: gemma-4-E2B-it.litertlm, gemma-4-E4B-it.litertlm` (no SKIPs of canonical files).
- `evaluation/reports/latency_report_v2.md` per-model tables match the source JSON medians for at least one spot-checked (model, backend, k) cell.
- `evaluation/reports/device_compatibility_notes.md` §2 measured E2B row matches the report's per-category medians.
- No files under `evaluation/latency_results/` are tracked (directory is gitignored).
Safety note
This PR establishes the latency leg of the E4B → E2B swap decision. It does not authorize a deployment swap: per CLAUDE.md, the answer-quality regression check on `kenya_vignettes` and the AfriMed-QA SAQ judge run remains "Critical before any model swap decision" (still listed as open in §6 of the compatibility notes).
🤖 Generated with Claude Code