analysis: Gemma 4 E2B latency sweep + cross-model report #59

Merged: nmrenyi merged 14 commits into main from feat/e2b-latency-sweep on May 16, 2026

@nmrenyi
Owner

@nmrenyi nmrenyi commented May 16, 2026

Summary

  • Sweep Gemma 4 E2B latency on Snapdragon 8 Elite (OPPO OPD2413, Adreno 830), mirroring the E4B sweep from feat(benchmark): per-k latency sweep infrastructure + GPU/CPU report #57: 8 GPU + 8 CPU runs across k ∈ {0, 1, 3, 5, 7, 10, 15, 20}. 32 canonical runs now on disk in evaluation/latency_results/ (16 E4B + 16 E2B).
  • Refactor aggregate_k_sweep.py from (backend, k) to (model, backend, k) grouping so both models sit side-by-side. Adds per-model sections plus a cross-model comparison section with total / TTFT / decode ratio tables.
  • Regenerate latency_report_v2.md with an E4B-vs-E2B comparison and updated Key findings. Headline: ~1.5× total speedup on GPU and ~2.2× on CPU, with an architecturally clean split: GPU decode is bandwidth-bound, while CPU is compute-bound throughout. The 4096-token context wall is now confirmed across all four (model × backend) combinations: at k = 20 the same 8 queries × 3 reps fail identically in each combination, for 24 errors apiece.
  • Replace projected E2B numbers in device_compatibility_notes.md §2 with measurements; add a bolded one-line deployment summary as the document lead.

Deployment-relevant takeaway

On the Snapdragon 8 Elite test device under a 60 s latency budget, E4B (6 GB RAM floor) deploys only on GPU across all k ≤ 15 or on CPU at k ≤ 3, while E2B (4 GB RAM floor) deploys on both GPU and CPU across all k ≤ 15 — with k = 20 ruled out for both models by the 4096-token context wall regardless of backend.

E2B CPU at k=1 (~16 s median) is now comparable to E4B GPU at k=1 (~14 s), opening up the no-GPU mid-tier-device deployment path. Caveat: the mid-tier MediaTek 2×-slowdown row in §2 is still a Geekbench-anchored extrapolation, not a measurement — flagged as a new high-priority open question in §6.

Test plan

  • Aggregator reports Loaded 32 canonical runs across 2 models: gemma-4-E2B-it.litertlm, gemma-4-E4B-it.litertlm (no SKIPs of canonical files).
  • evaluation/reports/latency_report_v2.md per-model tables match the source JSON medians for at least one spot-checked (model, backend, k) cell.
  • evaluation/reports/device_compatibility_notes.md §2 measured E2B row matches the report's per-category medians.
  • No JSONs in evaluation/latency_results/ are tracked (directory is gitignored).
  • CI / build green.

Safety note

This PR establishes the latency leg of the E4B → E2B swap decision. It does not authorize a deployment swap — per CLAUDE.md, the answer-quality regression check on kenya_vignettes and the AfriMed-QA SAQ judge run remains "Critical before any model swap decision" (still listed as open in §6 of the compatibility notes).

🤖 Generated with Claude Code

nmrenyi and others added 7 commits May 15, 2026 15:00
The benchmark service hardcoded "gemma-4-E4B-it.litertlm" in the
results JSON metadata, so switching models in app_config.json silently
left stale data in benchmark output. Read llm_model from the same
asset RagPipeline uses so the JSON always reflects what was actually
loaded.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Pointing the app at gemma-4-E2B-it.litertlm so we can run the same
k-sweep benchmarks as the E4B baseline for a direct comparison on the
OPPO Snapdragon 8 Elite device.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The E2B sweep started by another session (Phase 1 committed at 976a8ac
and 3042d38) hit issues with the agent-orchestration wait pattern in
two attempts. Writing a self-contained runbook so a new CLI session
can pick it up from Phase 2 with full context — what's already done,
what to verify, the exact 16 measurements to collect, the analysis
work for Phase 4, and the constraints (no push, no scope change).

Lives at evaluation/runbooks/e2b_sweep.md. If we end up doing more
sweeps in this style, the directory is a natural home for similar
job-aid documents.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Group runs by (model, backend, k) instead of (backend, k) so the matrix
can hold both Gemma 4 E4B and E2B side-by-side. Emit one per-model
section (the existing six tables) for each model, plus a new
cross-model comparison section with total/TTFT/decode ratio tables.

Why: with the E2B sweep landing, the prior aggregator would have
collapsed E4B and E2B into the same cells. The (model, backend, k)
grouping keeps the existing single-model view intact while making the
E4B-vs-E2B story explicit.

Notes:
- Added LEGACY_DEFAULT_MODEL fallback for any JSON missing config.model.
  All current files have it set, so this is purely defensive.
- Refactored the per-model tables into _write_per_model_section() and
  the cross-model tables into _write_cross_model_table() to keep
  write_report() readable.
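
The regrouping described in this commit message can be illustrated with a minimal sketch. It is not the aggregator's actual code: `cell_key` and `build_matrix` are hypothetical names, and the `config.backend` / `config.k` field names are assumptions (only `config.model` and the `LEGACY_DEFAULT_MODEL` fallback are named in this PR).

```python
from collections import defaultdict

# Fallback for JSONs that predate the config.model fix; this PR states that
# the only such sweeps were E4B.
LEGACY_DEFAULT_MODEL = "gemma-4-E4B-it.litertlm"


def cell_key(run: dict) -> tuple:
    """Return the (model, backend, k) cell a benchmark run belongs to."""
    config = run.get("config", {})
    # "backend" and "k" are assumed field names, used here for illustration.
    model = config.get("model") or LEGACY_DEFAULT_MODEL
    return (model, config["backend"], config["k"])


def build_matrix(runs: list) -> dict:
    """Group runs by cell so that two models never collapse into one key."""
    matrix = defaultdict(list)
    for run in runs:
        matrix[cell_key(run)].append(run)
    return dict(matrix)
```

With a plain (backend, k) key, an E4B run and an E2B run at the same backend and k would land in the same cell; adding the model to the key keeps them apart.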
Re-run aggregate_k_sweep.py over the full set of 32 canonical runs
(16 E4B + 16 E2B). The report now has:

- Per-model sections for both Gemma 4 E4B and Gemma 4 E2B, each with
  the existing six tables (headline / TTFT / decode / p95 / errors /
  wall-clock).
- A new cross-model comparison section with E4B-vs-E2B ratio tables
  for total query latency, TTFT, and decode.
- Rewritten Key findings to reflect the measured speedup ratios:
  ~1.5× total GPU, ~2.2× total CPU. Prefill is compute-bound on both
  backends (~2.3× speedup); decode is bandwidth-bound on GPU
  (~1.5×) and compute-bound on CPU (~2×). The architectural story
  cleanly explains why total speedup is decode-dominated at low k
  and climbs toward the prefill ratio at high k.

The 4096-token context wall is now confirmed across all four
(model × backend) combinations: same 8 queries × 3 reps = 24 errors
on each, anchored as a property of the .litertlm artifact format.
Replace the projected E2B latency table in §2 with measured numbers
from the 2026-05-16 E2B sweep on Snapdragon 8 Elite. The measured
speedup ratio is ~1.5× on GPU and ~2× on CPU (not the originally
projected uniform 2×), and the architectural reason is now spelled
out in latency_report_v2.md: GPU prefill is compute-bound, GPU
decode is bandwidth-bound, CPU is compute-bound throughout.

Specific changes:
- §6 Open questions: mark "Actual E2B CPU latency" as resolved
  with a pointer to the new report. Add a new follow-up question
  about validating the mid-tier MediaTek 2×-slowdown extrapolation
  on real hardware.
- §2 Backend × model × k feasibility: replace projected E2B table
  with measured Snapdragon 8 Elite values; keep the MediaTek row
  flagged as extrapolation and call that out in the section preamble.
- TL;DR: add a fourth rule covering E2B CPU's newly-measured
  deployment envelope (k=10 comfortable on flagship CPU, k=3–5
  borderline on mid-tier).

Why: the original notes shipped with projections marked clearly as
"halve E4B numbers". With real measurements in hand, those rows now
carry actual data, and the deployment-relevant rule of thumb
("E2B CPU opens up the no-GPU device tier") is anchored on numbers
rather than a speculative ratio.
Open the doc with a bolded single sentence that captures the
deployment-relevant takeaway: which (model × backend × k) cells fit
a 60 s latency budget on the Snapdragon 8 Elite test device, anchored
on the RAM floors and the 4096-token context wall.

Why: a reader skimming for "can I ship E2B on CPU?" should get the
answer from the first line, before the TL;DR's four rules and the
detail tables.
Copilot AI review requested due to automatic review settings May 16, 2026 00:45

Copilot AI left a comment

Pull request overview

Extends the existing on-device latency benchmarking/analysis workflow to include Gemma 4 E2B alongside E4B, producing a combined “model × backend × k” report and updating deployment guidance accordingly.

Changes:

  • Refactors evaluation/aggregate_k_sweep.py to group and report by (model, backend, k), and adds cross-model comparison tables.
  • Regenerates evaluation/reports/latency_report_v2.md and updates evaluation/reports/device_compatibility_notes.md with measured E2B results and revised findings.
  • Updates benchmark JSON metadata to record the loaded model from app_config.json, and adds a runbook for reproducing/continuing the E2B sweep.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| evaluation/runbooks/e2b_sweep.md | New operational runbook for running the E2B GPU/CPU k-sweeps and updating analysis artifacts. |
| evaluation/reports/latency_report_v2.md | Regenerated combined report with per-model sections and E4B↔E2B comparison tables. |
| evaluation/reports/device_compatibility_notes.md | Updates deployment feasibility guidance using measured E2B latencies and revised open questions. |
| evaluation/aggregate_k_sweep.py | Adds model dimension to aggregation + emits per-model and cross-model sections in the report. |
| config/app_config.json | Switches default configured LLM model from E4B to E2B. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt | Writes config.model into benchmark JSON by reading llm_model from app_config.json. |

Comment thread config/app_config.json
Comment on lines 1 to 3
  {
-   "llm_model": "gemma-4-E4B-it.litertlm",
+   "llm_model": "gemma-4-E2B-it.litertlm",
    "embedding_model": "Gecko_1024_quant.tflite",
Comment thread evaluation/runbooks/e2b_sweep.md Outdated

- This work mirrors the E4B latency sweep that landed in PR #57 (commit `1be0a55` on `main`). The E4B results are in `evaluation/reports/latency_report_v2.md` and the device-compatibility analysis is in `evaluation/reports/device_compatibility_notes.md`.
- We're now measuring the **smaller** Gemma 4 E2B variant (~2 GB instead of E4B's 3.66 GB) to find out how much faster it is in real terms on the same hardware. Same 16 measurements as E4B: 8 GPU (k ∈ {1, 3, 5, 7, 10, 15, 20} + No-RAG) + 8 CPU.
- Test device: **OPPO OPD2413 (Snapdragon 8 Elite, SM8750P)** connected via ADB. OPPO Hans battery-optimization whitelist is **already configured** by the user — don't re-do it.
Comment on lines 10 to 13
2. **E2B minimum RAM: 4 GB** total. The smaller model halves the runtime memory footprint (~1.7 GB), opening up the $100–$150 device tier that's the largest slice of the African market.
3. **E4B on CPU: k=3 is the borderline.** Beyond k=3, CPU totals exceed the 60 s budget on most mid-tier silicon. **E4B on GPU: no latency worry** — totals stay 13–25 s across k=0–15 on Snapdragon 8 Elite + Adreno.
4. **E2B on CPU: k=10 is comfortable on flagship CPU; k=3–5 on mid-tier MediaTek.** Measured E2B CPU at k=10 on Snapdragon 8 Elite is 26 s median; extrapolating ~2× slower for mid-tier MediaTek gives ~50 s at k=5–7 (borderline). This is the deployment-relevant change from the May 2026 sweep: E2B CPU is **~2× faster than E4B CPU**, not the originally-projected speedup — and that means CPU-only deployment is finally viable up to mid-range k on the no-GPU device tier.

Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment on lines +420 to +426
# Use E4B as baseline when present; ratio is E4B/E2B so >1 means E2B is faster.
if len(models) > 1:
    baseline = "gemma-4-E4B-it.litertlm" if "gemma-4-E4B-it.litertlm" in models else models[0]
    others = [m for m in models if m != baseline]
    md.append("## Cross-model comparison\n")
    md.append(
        f"Ratios below are **{_short_model_label(baseline)} ÷ {_short_model_label(others[0])}**, "
nmrenyi added 4 commits May 16, 2026 09:02
The comparison-section intro previously named the first comparator
model explicitly (`others[0]`), which would silently mislead readers
if a third model were ever added to the matrix — the intro would
still mention only one comparator while the loop below rendered a
table per `other`.

Reword the intro to describe the comparison generically (baseline ÷
each comparator) and list all comparators inline so it self-describes
no matter how many models the matrix contains. Also align the
architectural-context phrasing with the Key-findings section: GPU
prefill is compute-bound (tracks parameter count), GPU decode is
bandwidth-bound (gains less from shrinkage), CPU is compute-bound
throughout. Regenerate the report.

Addresses Copilot review comment on PR #59.
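
One way to picture the generic intro described in this commit is the sketch below. It is an illustration under assumptions, not the committed code: `short_label` is a hypothetical stand-in for the aggregator's `_short_model_label`, and the exact wording of the intro string is invented for the example.

```python
def short_label(model: str) -> str:
    # Hypothetical stand-in for _short_model_label: strips the shared
    # prefix and suffix from the artifact filename.
    return model.replace("gemma-4-", "").replace("-it.litertlm", "")


def comparison_intro(models, preferred_baseline="gemma-4-E4B-it.litertlm"):
    """Describe the ratio tables without naming a single comparator."""
    if len(models) < 2:
        return None  # nothing to compare against
    baseline = preferred_baseline if preferred_baseline in models else models[0]
    others = [m for m in models if m != baseline]
    comparators = ", ".join(short_label(m) for m in others)
    return (
        f"Ratios below are **{short_label(baseline)} ÷ each comparator** "
        f"({comparators}); values above 1 mean the comparator is faster."
    )
```

Because the comparators are joined from the full `others` list, the sentence stays accurate if a third model is added to the matrix.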
The benchmark JSONs record `device.manufacturer="OnePlus"`, but the
runbook and the §5 deployment-market table in device_compatibility_notes
both called the device "OPPO" (OPPO Find X8). Same physical hardware,
different brand label — OPPO and OnePlus share platforms and the OPD2413
ships under both brands depending on market — but the inconsistency
makes it hard for a reader to cross-reference the runbook against the
regenerated latency report.

Settle on "OnePlus OPD2413" (firmware-reported) as the canonical name
and add the OPPO branding parenthetically wherever the term first
appears in a doc.

Addresses Copilot review comment on PR #59.
The previous wording said E2B CPU is "~2× faster than E4B CPU, not the
originally-projected speedup," which reads as a contradiction since the
original notes projected a uniform ~2× speedup across both backends.

The thing that actually diverged from the projection is **GPU total
speedup** (~1.5×, not the projected 2×), and the reason is architectural:
GPU decode is bandwidth-bound and benefits less from parameter-count
shrinkage. CPU matches the projection. Rewrite the rule to make the
backend split explicit instead of contrasting CPU against an
unspecified projection.

Addresses Copilot review comment on PR #59.
Commit 3042d38 (Phase 1 of the E2B latency sweep) switched the
production `llm_model` in config/app_config.json from E4B to E2B so
the benchmark would load the smaller model. That change rode into
PR #59 even though the PR's own body — and the safety note in
device_compatibility_notes.md §6 — explicitly state that this work
does **not** authorize a deployment swap: the answer-quality regression
check on kenya_vignettes and the AfriMed-QA SAQ judge run is still
listed as "Critical before any model swap decision".

Per CLAUDE.md, evaluation quality and response safety are the top
priorities for this medical app. Shipping a default-model swap
without the safety eval would mean the production app loads a model
whose accuracy on safety-critical medical-advice metrics has not
been validated.

Revert the default to `gemma-4-E4B-it.litertlm`. The benchmark
measurements that motivated the PR are unaffected — the .litertlm
file used for the sweep is recorded in each benchmark JSON's
config.model field, so the data is preserved regardless of the
deployed default.

To re-run the E2B benchmark on a future branch, the runbook author
will need to temporarily flip llm_model to E2B before kicking off
benchmark_latency.py (and revert before opening the PR).

Addresses Copilot review comment on PR #59.

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines +374 to +379
models = sorted(set(m for (m, _b, _k) in matrix.keys()))
all_ks = sorted(set(k for (_m, _b, k) in matrix.keys()))

sample = next(iter(matrix.values()))
dev = sample["data"]["device"]

Comment on lines +3 to +33
Self-contained instructions for finishing the E2B latency sweep started by another session. **Phase 1 (setup) is already complete on branch `feat/e2b-latency-sweep`.** Your job is Phase 2 (GPU sweep), Phase 3 (CPU sweep), and Phase 4 (analysis + local commits, **no push**). Expected wall-clock: **~5 hours**.

## 0. Context — read this first

- This work mirrors the E4B latency sweep that landed in PR #57 (commit `1be0a55` on `main`). The E4B results are in `evaluation/reports/latency_report_v2.md` and the device-compatibility analysis is in `evaluation/reports/device_compatibility_notes.md`.
- We're now measuring the **smaller** Gemma 4 E2B variant (~2 GB instead of E4B's 3.66 GB) to find out how much faster it is in real terms on the same hardware. Same 16 measurements as E4B: 8 GPU (k ∈ {1, 3, 5, 7, 10, 15, 20} + No-RAG) + 8 CPU.
- Test device: **OnePlus OPD2413 (Snapdragon 8 Elite, SM8750P)** connected via ADB — that's the firmware-reported manufacturer (`device.manufacturer="OnePlus"` in the benchmark JSONs); the same OPD2413 hardware ships under the OPPO brand in some markets. The OPPO/OnePlus Hans battery-optimization whitelist is **already configured** by the user — don't re-do it.
- The benchmark infrastructure is in `evaluation/benchmark_latency.py`; the aggregator is `evaluation/aggregate_k_sweep.py`. Both are already correct for this work, with one expected exception in Phase 4 (the aggregator needs a `model` dimension added).

### Why this is a runbook and not a single Bash command

Each benchmark run takes **12–20 minutes wall-clock** (E2B is ~1.5× faster than E4B based on the smoke test, not 2×). You can't realistically loop them in one foreground shell command; bash timeouts cap at 10 minutes in our tooling. Use `Bash run_in_background: true` and **wait for the harness completion notification** between runs. Don't use `tail -F`, sleep loops, or watchdog patterns — those caused the previous subagent to bail at 87 seconds.

---

## 1. Verify Phase 1 state — fail loud if anything's missing

Run these checks before touching anything:

```bash
cd ~/Downloads/mamai
git status # should show clean working tree on branch feat/e2b-latency-sweep
git log --oneline -3 # should show 3042d38, 976a8ac at the top
```

Expected log:
```
3042d38 config: switch llm_model to Gemma 4 E2B
976a8ac fix(benchmark): read model name from app_config asset
a2205ff docs: device compatibility notes — which phones can run E4B / E2B
```
Comment on lines +384 to +389
// Read model name from the same app_config.json asset the RagPipeline uses,
// so the JSON metadata reflects whatever model is actually loaded rather than
// a hardcoded string that goes stale when we switch model artifacts.
put("model", JSONObject(
application.assets.open("app_config.json").bufferedReader().use { it.readText() }
).getString("llm_model"))
nmrenyi added 3 commits May 16, 2026 09:38
…atrix

The matrix builder assumed at least one canonical benchmark JSON was
loaded and indexed via `next(iter(matrix.values()))`. On a fresh
checkout `evaluation/latency_results/` is gitignored and may not
exist, so the script would crash with a bare StopIteration that
gives a new contributor no useful direction.

Detect the empty case and exit with a message pointing at
benchmark_latency.py and the runbooks directory. Verified two paths:
existing 32-JSON workflow still produces the report; a /tmp checkout
with no JSONs now prints the directional error and exits cleanly.

Addresses Copilot review comment on PR #59.
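
The empty-matrix guard described above might look roughly like this. It is a sketch only: `first_run_or_exit` is a hypothetical helper name, the message text is invented, and the exit status the real script uses is not stated in this PR.

```python
import sys


def first_run_or_exit(matrix: dict):
    """Return one sample run, or exit with a directional message if none exist."""
    if not matrix:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(
            "No canonical benchmark JSONs found in evaluation/latency_results/. "
            "Run evaluation/benchmark_latency.py first; see evaluation/runbooks/."
        )
    return next(iter(matrix.values()))
```

Checking for emptiness before calling `next(iter(...))` replaces the bare StopIteration with a message that tells a new contributor where the inputs come from.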
PR #59 reverted llm_model in config/app_config.json back to E4B
(commit 84e1bfd), but the runbook's Phase 1 verification still expects
fresh benchmark JSONs to record `config.model == "gemma-4-E2B-it.litertlm"`.
Anyone reading this runbook on a fresh checkout would hit that mismatch
without an explanation.

Add a preamble note at the top of the runbook covering the four-step
re-run procedure (flip config → rebuild/install → sweep → revert config)
and a sentence explaining the git-log expectation drift for replays
after PR merge. Keep the rest of the runbook intact — the Phase 2/3
sweep instructions still apply verbatim once the config is flipped.

Addresses Copilot review comment on PR #59.
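
The flip-and-revert steps of that procedure could be made harder to forget with a small context manager. This is a hypothetical helper, not part of the PR; it assumes the config is a JSON object with an `llm_model` key (as in config/app_config.json), and the rebuild/install and sweep steps still happen inside the `with` body.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_llm_model(config_path, model):
    """Set llm_model for the duration of the block, then restore the file."""
    path = Path(config_path)
    original_text = path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    config["llm_model"] = model
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    try:
        yield config
    finally:
        # Restore the exact original bytes, even if the sweep raised.
        path.write_text(original_text, encoding="utf-8")
```

Usage would be along the lines of `with temporary_llm_model("config/app_config.json", "gemma-4-E2B-it.litertlm"): ...`, with the rebuild and sweep run inside the block. Restoring the original text verbatim keeps `git status` clean afterwards.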
…ures

The model-name lookup in writeResults() opens and parses
app_config.json from the APK assets at the moment the final benchmark
JSON is being built. That happens at the *end* of a 20-minute sweep,
after all 54 result records have been accumulated in memory but
before they hit disk. An IOException on the asset open, or a
JSONException on a malformed config, would propagate up and discard
the entire results array.

The failure probability is low (the asset is bundled inside the APK
and the file is generated at build time), but the consequence is
severe — we'd lose every measurement from a multi-hour run because
we couldn't read a metadata string.

Wrap the read in try/catch and fall back to `"unknown"` for the
model field if the asset can't be read or parsed. The model field
in the JSON is metadata for downstream analysis (aggregate_k_sweep.py
groups by it); an "unknown" tag will just route the run into its own
matrix cell rather than colliding with a real model, which is the
right failure mode for an unexpected build state.

Addresses Copilot review comment on PR #59.

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

1. **E4B minimum RAM: 6 GB** total. 4 GB phones cannot run E4B reliably (model alone needs ~3.3 GB at runtime; Android + bundled apps eat 1.5–2 GB).
2. **E2B minimum RAM: 4 GB** total. The smaller model halves the runtime memory footprint (~1.7 GB), opening up the $100–$150 device tier that's the largest slice of the African market.
3. **E4B on CPU: k=3 is the borderline.** Beyond k=3, CPU totals exceed the 60 s budget on most mid-tier silicon. **E4B on GPU: no latency worry** — totals stay 13–25 s across k=0–15 on Snapdragon 8 Elite + Adreno.
4. **E2B on CPU: k=10 is comfortable on flagship CPU; k=3–5 on mid-tier MediaTek.** Measured E2B CPU at k=10 on Snapdragon 8 Elite is 26 s median; extrapolating ~2× slower for mid-tier MediaTek gives ~50 s at k=5–7 (borderline). The original notes projected a uniform ~2× speedup across backends; measurements show **CPU matches that projection (~2× total speedup)** but **GPU total speedup is closer to ~1.5×** because decode is bandwidth-bound and gains less from the parameter-count shrink. Either way, CPU-only deployment is finally viable up to mid-range k on the no-GPU device tier — that's the deployment-relevant change from the May 2026 sweep.
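
The extrapolation in rule 4 is simple arithmetic, sketched here as a worked example. The 26 s median and the 60 s budget come from the notes; the ~2× mid-tier factor is the notes' own Geekbench-anchored extrapolation, not a measurement, and the function names are hypothetical.

```python
BUDGET_S = 60.0  # latency budget used throughout the notes


def extrapolated_median(measured_s: float, slowdown: float = 1.0) -> float:
    """Scale a flagship median by an assumed device slowdown factor."""
    return measured_s * slowdown


def headroom(measured_s: float, slowdown: float = 1.0, budget_s: float = BUDGET_S) -> float:
    """Seconds left under the budget; negative means the cell does not fit."""
    return budget_s - extrapolated_median(measured_s, slowdown)


# Measured E2B CPU median at k=10 on Snapdragon 8 Elite: 26 s.
flagship_headroom = headroom(26.0)        # 34.0 s to spare
mid_tier_headroom = headroom(26.0, 2.0)   # 8.0 s to spare under the ~2x assumption
```

The thin margin in the second case is why the mid-tier rows are described as borderline and flagged for validation on real hardware.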
Comment on lines +13 to +16
Notes on model identification: post-fix JSONs (commit 976a8ac onward) record
`config.model` from the app asset; earlier runs do not. For any JSON missing
`config.model` we default to `gemma-4-E4B-it.litertlm` since the only sweeps
that predate the fix were E4B. Future runs of any model are unaffected.
@nmrenyi nmrenyi merged commit 28737d8 into main May 16, 2026
6 checks passed
nmrenyi added a commit that referenced this pull request May 16, 2026
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but
against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours
total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). Result
is a clean cell-by-cell comparison.

Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s
extra wait per query), much less than the ~3× I'd estimated from
the single-data-point Step 3 measurement. That earlier number was
wrong — I had accidentally been comparing FP32-E4B against
FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.

The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0s, FP32 TTFT 2.0–9.8s (~2–2.5× across all k)
- FP16 decode 11–18s, FP32 decode 12–19s (essentially identical)

Mechanism: prefill is compute-bound (one parallel forward pass over
the input), so FP16's 2× arithmetic throughput on Adreno helps a
lot. Decode is bandwidth-bound (sequential token-at-a-time loading
of weights), so the FP16/FP32 precision choice barely matters.
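
As a quick check on the quoted ratios, the range endpoints above can be divided directly. This sketch assumes the low and high ends of each range correspond to the same k on both precisions, which the commit message does not state explicitly.

```python
# Range endpoints quoted above, in seconds (low end, high end).
fp16_ttft, fp32_ttft = (0.96, 4.0), (2.0, 9.8)
fp16_decode, fp32_decode = (11.0, 18.0), (12.0, 19.0)


def endpoint_ratios(slow, fast):
    """FP32 / FP16 ratio at each range endpoint, rounded to 2 decimals."""
    return tuple(round(s / f, 2) for s, f in zip(slow, fast))


ttft_ratios = endpoint_ratios(fp32_ttft, fp16_ttft)        # about 2.08 and 2.45
decode_ratios = endpoint_ratios(fp32_decode, fp16_decode)  # about 1.09 and 1.06
```

The TTFT ratios land in the stated ~2–2.5× band, while the decode ratios stay within roughly 10% of parity, consistent with the compute-bound versus bandwidth-bound explanation.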

Same 24 errors at k=20 on FP32 GPU as on FP16 GPU — the prompt-cap
rejection at maxNumTokens=4096 is precision-agnostic, just a config
check.

What this means: **FP32 GPU is a real shipping option, not just an
experiment.** At maxNumTokens=4096 the latency cost is ~25% (no
quality benefit — we're below the FP16 cliff anyway). At higher
maxNumTokens (e.g., 5500), FP32 GPU enables clean output past the
FP16 cliff at the same ~25% latency hit. Memory ceiling caps
maxNumTokens at ~6500–7500 on this 16 GB device since KV cache
doubles vs FP16.

The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin
tradeoff at the deployment level, not a feasibility question.

Updates: new Step 5 section with full sweep table + corrected
slowdown narrative; updated TL;DR bullet on FP32 latency cost;
"What's still open" table marks FP32 latency curve resolved; full
8-JSON inventory added to References.