
feat(benchmark): per-k latency sweep infrastructure + GPU/CPU report #57

Merged
nmrenyi merged 30 commits into
main from feat/benchmark-retrieve-k-override
May 15, 2026

Conversation

@nmrenyi
Owner

@nmrenyi nmrenyi commented May 14, 2026

Summary

  • Adds a runtime --retrieve-k override so the latency benchmark can sweep retrieval depth without rebuilding the APK. One build + install, then loop the Python wrapper across k values.
  • Refactors BenchmarkActivity → new BenchmarkForegroundService with a PARTIAL_WAKE_LOCK and sticky notification, so multi-hour sweeps survive screen-off + device-lock. (OPPO ColorOS also requires the per-app battery-optimization whitelist set in Settings — documented in the report.)
  • Captures per-run retrieved chunks + full response text in the benchmark JSON, not just lengths — enables downstream answer-quality analysis at any k.
  • --rag-only flag to skip the redundant No-RAG mode during k-sweeps (No-RAG is k-invariant; only needs to be measured once).
  • Fixes a pre-existing ANR (Thread.sleep → delay()), a stale gemma-3n model-file check in the pre-flight script, and a hard-coded "backend":"CPU" JSON-metadata bug that mislabeled all GPU runs.
  • New evaluation/aggregate_k_sweep.py aggregation script; renders evaluation/reports/latency_report_v2.md — full per-k GPU↔CPU comparison.
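The one-build-many-runs workflow in the first bullet can be sketched as a thin wrapper loop. This builds the command lines only (the script path is from this PR; the k-values match the sweep described below — executing each via `subprocess.run(cmd, check=True)` is left out):

```python
def sweep_commands(k_values):
    # One APK build + install up front, then one wrapper invocation per k.
    base = ["python", "evaluation/benchmark_latency.py", "--rag-only", "--retrieve-k"]
    return [base + [str(k)] for k in k_values]

cmds = sweep_commands([1, 3, 5, 7, 10, 15, 20])
```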

Headline findings

On OPPO Snapdragon 8 Elite + Gemma 4 E4B + LiteRT-LM 0.11.0 (16 canonical 54-run benchmarks):

  • GPU is 13–19× faster on TTFT than CPU and 2–4× faster on total latency. Decode is memory-bandwidth-bound (within ~1.4× across backends); the GPU win is entirely in compute-heavy prefill.
  • The 4096-token context window is the binding upper limit, not latency. At k=20, the same 24 of 54 runs (same 8 queries × 3 reps) fail identically on both GPU and CPU with `Input token ids are too long … >= 4096`. The cap is a property of the `gemma-4-E4B-it.litertlm` artifact, not a runtime config and not backend-dependent. k_max ≈ 17–18.
  • All GPU totals at k ≤ 15 stay between 13 s (no-RAG) and 25 s — well under any reasonable UX budget. CPU is unusable past k≈3 for a 60 s budget; at k=15 CPU p95 hits 113 s.

Test plan

  • Build default (flutter build apk --release) → CPU. With -PuseGpuForLlm=true → GPU. Verify the [BACKEND] *** LLM running on … *** log line on startup matches.
  • python evaluation/benchmark_latency.py --filter medium_01 --repeats 1 --rag-only --retrieve-k 5. Verify the JSON output has num_retrieved_docs == 5, populated retrieved_chunks array with chunk text/source/page/chars, and populated response_text. Filename contains `_k5`.
  • Screen-off survival: launch a multi-query benchmark, lock the device after init, confirm runs continue. Requires Settings → Battery → App Battery Management → MAM-AI → "Allow background activity" on OPPO/Xiaomi devices (otherwise OplusHansManager freezes the process at the OS level).
  • python evaluation/aggregate_k_sweep.py regenerates evaluation/reports/latency_report_v2.md from the JSONs in evaluation/latency_results/.
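The JSON checks in the second bullet can be automated. A minimal sketch (helper name hypothetical; the field names are the ones this PR writes into the benchmark JSON):

```python
import json

def verify_k_override(path, expected_k):
    # Post-run sanity check mirroring the test plan above.
    with open(path) as f:
        data = json.load(f)
    run = data["results"][0]
    assert run["num_retrieved_docs"] == expected_k
    assert run["retrieved_chunks"]      # chunk text/source/page/chars present
    assert run["response_text"]         # full response captured
    return True
```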

Commits (10)

```
4daf626 analysis: add CPU k=20 — confirms 4096-token wall is backend-invariant
ede273f analysis: k-sweep latency report (GPU + CPU on Snapdragon 8 Elite)
ef96538 fix(benchmark): record actual backend (GPU/CPU) in config metadata
7ac3b36 refactor(benchmark): move benchmark to a foreground service
12fd358 fix(benchmark): keep CPU alive through screen-off (PARTIAL_WAKE_LOCK + Default dispatcher)
795ac84 fix(benchmark): use suspending delay() instead of Thread.sleep()
197a7bc feat(benchmark): add --rag-only flag to skip No-RAG mode
33604df fix(benchmark): update model-file check to match production stack
8848bee feat(benchmark): capture retrieved chunks + response text per run
fd85cd7 feat(benchmark): add --retrieve-k override for per-k latency sweep
```

🤖 Generated with Claude Code

nmrenyi and others added 10 commits May 14, 2026 14:24
Lets the latency benchmark vary retrieval top_k without rebuilding the APK
or editing runtime_config.json. One build + install up front, then sweep
k via the CLI flag — needed to bound k_max on the Snapdragon 8 Elite + GPU
stack now that TTFT (~1–2 s at k=3) is no longer the binding constraint.

Wiring:
- RagPipeline.generateResponse() gains an optional retrieveKOverride:
  Int? parameter (default null). When non-null it replaces
  retrievalConfig.top_k for that call only; production callers leave it
  null. Param added at the end of the list so RagStream's positional call
  is unaffected.
- BenchmarkActivity reads a "retrieve_k" Intent extra (-1 sentinel = no
  override), threads it through runBenchmark → runQuery →
  generateResponse, and records "retrieval_top_k_override" in the
  config block of the results JSON.
- benchmark_latency.py adds --retrieve-k N, forwards via
  am start --ei retrieve_k N, and appends "_kN" to the output filename
  so a sweep across k values is legible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
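
A minimal Python sketch of the wrapper-side wiring above. The component name is a placeholder; the `--ei` extra and `_kN` filename suffix follow the commit message:

```python
ACTIVITY = "com.example.app/.BenchmarkActivity"  # placeholder component name

def launch_cmd(retrieve_k=None):
    # Forward the per-call top_k override as an integer Intent extra.
    cmd = ["adb", "shell", "am", "start", "-n", ACTIVITY]
    if retrieve_k is not None:
        cmd += ["--ei", "retrieve_k", str(retrieve_k)]
    return cmd

def output_name(stem, retrieve_k=None):
    # Append "_kN" so a sweep across k values is legible on disk.
    return f"{stem}_k{retrieve_k}.json" if retrieve_k is not None else f"{stem}.json"
```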
Previously the benchmark recorded only counts/lengths (num_retrieved_docs,
response_length_chars). For the per-k latency sweep we want to see what
the retriever actually surfaced at each k and what the model generated —
both for content review and because the total chunk-text length is what
drives prefill cost as k grows.

New per-run fields in the results JSON:
- retrieved_chunks: array of {text, source, page, chars} for every chunk
  the retriever returned. Lets us inspect what changed as k grew.
- retrieved_total_chars: sum of chunk text lengths. The real
  prompt-length proxy (vs. query_word_count which is static).
- response_text: full model response (the generation listener was
  already accumulating it; we now record the final string).

Size: at k=20 with ~1500-3000 chars/chunk + ~3KB response, per-run
overhead is ~30-60KB. A full 108-run benchmark file grows from ~50KB to
~4-6MB. Acceptable; can add a --no-content opt-out later if needed.

Note: Gemma 4 E4B doesn't emit a separate reasoning channel — any inline
reasoning the model writes shows up in response_text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
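
The new per-run fields can be assembled as in this sketch (illustrative helper; field names as listed above):

```python
def enrich_run_record(record, retrieved_chunks, response_text):
    # retrieved_total_chars is the prompt-length proxy that drives
    # prefill cost as k grows.
    record["retrieved_chunks"] = retrieved_chunks
    record["retrieved_total_chars"] = sum(c["chars"] for c in retrieved_chunks)
    record["response_text"] = response_text
    return record
```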
check_models_downloaded() was looking for gemma-3n-E4B-it-int4.task —
left over from the pre-Gemma-4 era. config/app_config.json now declares
"llm_model": "gemma-4-E4B-it.litertlm", so the script's pre-flight check
falsely failed even when the right model was on device.

Caught during the smoke test for the --retrieve-k override feature: the
script aborted before launching BenchmarkActivity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For the k-sweep, the No-RAG baseline doesn't change with k (retrieval is
disabled, so the override is ignored). Without this flag, running the
sweep at 7 k-values would re-run the identical 54 No-RAG measurements
seven times — ~1.5 hours of redundant work.

The Intent extra "rag_only" (bool, default false) tells BenchmarkActivity
to run only the RAG mode. Mutually exclusive with skip_retrieval, which
wins if both are set. Python --rag-only forwards via am start --ez.

Recorded in the session config JSON as "rag_only" so reruns are
unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
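
The mode-selection logic described above, sketched in Python (the on-device code is Kotlin; this mirrors its behavior):

```python
def retrieval_modes(skip_retrieval, rag_only):
    # skip_retrieval wins when both are set, matching the on-device rule.
    if skip_retrieval:
        return [False]          # No-RAG only
    if rag_only:
        return [True]           # RAG only — skips the k-invariant baseline
    return [True, False]        # default: measure both modes
```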
Thread.sleep(cooldownMs) was running on the UI thread (BenchmarkActivity
is an Activity; the calling coroutine was scope.launch(Dispatchers.Main)
by default). With cooldown ≥ 5000 ms, Android's input-dispatching timeout
fires and the activity gets killed mid-sweep — exactly what happened on
the OPPO Snapdragon 8 Elite at cooldown=10000:

  AnrInfo{reason='Input Dispatching Timeout',
    stackTrace='at java.lang.Thread.sleep(...)
                at BenchmarkActivity.runBenchmark(BenchmarkActivity.kt:279)'}

delay() is a suspending function that doesn't block the underlying
thread, so the UI stays responsive while the benchmark waits. Both
cooldown call sites (post-init and between-runs) are inside the same
suspend coroutine, so this is a drop-in replacement.

The existing latency_report.md used cooldown=10000 on Pixel 7 without
issue, which suggests the older Tensor G2 chipset had laxer ANR
enforcement or Athena memory-kill behavior — either way, the
correct-by-construction fix is to never block the UI thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ Default dispatcher)

The benchmark coroutine was running on Dispatchers.Main, so as soon as the
device screen went off (~10 min on OPPO) the activity backgrounded and the
coroutine's delay() never resumed. The process stayed alive but stopped
making progress — the Python wrapper hung waiting for [BENCHMARK] COMPLETE
that would never come. Observed on the OPPO Snapdragon 8 Elite mid-sweep:
~24 min between the last [BENCHMARK] log line and the most recent Athena
heartbeat, with mWakefulness=Asleep.

Two changes:

1. Acquire a PARTIAL_WAKE_LOCK in onCreate (released in onDestroy). Keeps
   the CPU running even when the screen is off; the screen itself can
   still sleep. 6-hour failsafe timeout. Required permission added to
   AndroidManifest.xml — used only by BenchmarkActivity.

2. Switch the coroutine scope from Dispatchers.Main to Dispatchers.Default.
   The benchmark logic doesn't touch the UI directly (logStatus already
   marshals back to Main via runOnUiThread), so there's no reason to run
   on Main — doing so just makes the work pause whenever the activity
   loses focus. Default keeps running in any lifecycle state.

These together let the sweep run while the device is screen-off or
locked. Screen lifespan and battery thank you.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPPO ColorOS aggressively freezes background activities and even
force-releases PARTIAL_WAKE_LOCKs held by plain Activities (visible as
'add wakelock … to ForceReleaseWakeLock list' in OplusProxyWakeLock).
Foreground services with a sticky notification are respected — once the
app is also whitelisted in Settings → Battery → App Battery Management
("Allow background activity"), the benchmark runs cleanly with the
screen off and the device locked.

Architecture:
- BenchmarkForegroundService (NEW) — holds the wake lock, posts a
  sticky progress notification, and runs the entire benchmark loop.
  Uses Dispatchers.Default so it isn't tied to a UI thread. Stops
  itself when done; the OS reclaims the process.
- BenchmarkActivity — reduced from ~470 lines to ~60. Now a thin
  launcher: receives `am start` Intent extras, forwards them to the
  service via startForegroundService(), and finishes immediately.
  Existing Python wrapper (benchmark_latency.py) is unchanged — it
  still launches the Activity and reads progress from logcat.
- AndroidManifest — registers the new service with
  foregroundServiceType="dataSync" (reuses the existing
  FOREGROUND_SERVICE_DATA_SYNC permission) and android:process=
  ":benchmark" (same isolated process as BenchmarkActivity).

Verified end-to-end on OPPO Snapdragon 8 Elite (OPD2413): launched the
benchmark, locked the screen 30 s in, watched a full medium_01 RAG run
complete with `hans_freeze=0` Athena freeze events and the screen
asleep through the entire decode phase. TTFT 1025 ms, total 13.8 s —
within the same envelope as foreground runs.

Pre-flight on a new device:
  Settings → Battery → App Battery Management → MAM-AI
                                              → "Allow background activity"

Without that, OPPO's OplusHansManager freezes the process at the OS
level regardless of foreground-service status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The config block dump in benchmark_results.json was hard-coded to
"backend":"CPU" — wrong for any build with useGpuForLlm=true. The
GPU-sweep JSONs we just ran on the OPPO Snapdragon 8 Elite all carry
the incorrect "CPU" label even though they were measured on GPU.

Now reads from BuildConfig.USE_GPU_FOR_LLM at compile time and writes
"GPU" or "CPU" accordingly. Also adds "mtp_enabled" from
BuildConfig.USE_MTP_FOR_LLM for full provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aggregates the 15 canonical 54-run benchmark JSONs into a single
GPU↔CPU comparison report covering k ∈ {0, 1, 3, 5, 7, 10, 15} (and
GPU-only k=20). Produced by a new evaluation/aggregate_k_sweep.py
that's re-runnable as more JSONs land.

Headline numbers (median total query latency on OPPO OPD2413,
Snapdragon 8 Elite, Gemma 4 E4B, LiteRT-LM 0.11.0):

  k=0 (no-RAG):  GPU 13–16 s | CPU 27–30 s    (1.9× slower on CPU)
  k=3         :  GPU 19–21 s | CPU 37–45 s    (2.2× slower)
  k=10        :  GPU 21–22 s | CPU 62–78 s    (3.1× slower)
  k=15        :  GPU 22–25 s | CPU 81–90 s    (3.5× slower)
  k=20        :  GPU 44% of runs fail at the 4096-token model ceiling

Key findings:
- GPU is the practical choice for this device tier — TTFT is 13–19×
  faster than CPU; total latency is 2–3.5× faster.
- The model's 4096-token context window is the binding upper limit
  (k_max ≈ 17–18), not latency. GPU has comfortable headroom below
  that ceiling.
- CPU is unusable past k≈3 for any reasonable UX budget. At k=15,
  CPU p95 latency hits 113 s.
- Decode is memory-bandwidth-bound (GPU/CPU within ~1.4×); the GPU
  win is entirely in compute-heavy prefill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
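
A sketch of the aggregator's grouping step (illustrative field names; the real script reads the backend and k out of each JSON's config block):

```python
from collections import defaultdict

def group_runs(runs):
    # Bucket runs into (backend, k) cells for the GPU↔CPU comparison tables.
    cells = defaultdict(list)
    for r in runs:
        cells[(r["backend"], r["k"])].append(r)
    return cells
```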
CPU k=20 reproduces the GPU k=20 failure pattern exactly:
- 24/54 errors on both backends (44% failure rate)
- Identical 8 queries fail on both (long_01, long_03, medium_02,
  medium_04, short_01, short_03, short_04, short_05)
- Same 24 (query × rep) pairs across both runs

This is direct evidence that the 4096-token context cap is a property
of the .litertlm model artifact itself — not a runtime config, not a
backend choice. Strengthens finding #2 from "model is the ceiling, GPU
specifically hits it" to "model is the ceiling, both backends hit it
identically."

Successful CPU k=20 runs: TTFT 65–73 s, total 89–96 s — well past any
deployment budget even when the request fits the window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
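
The backend-invariance claim can be checked mechanically; a sketch with an illustrative `query_id` field name:

```python
def failing_queries(results):
    # Set of query labels whose runs carry an error field.
    return {r["query_id"] for r in results if r.get("error")}

# Backend-invariance check: identical failure sets on GPU and CPU imply
# the cap lives in the model artifact, not the backend:
#   failing_queries(gpu_results) == failing_queries(cpu_results)
```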
Copilot AI review requested due to automatic review settings May 14, 2026 23:49

Copilot AI left a comment


Pull request overview

Adds infrastructure for per-k retrieval-depth latency sweeps on the Android RAG app: a runtime --retrieve-k override, a foreground-service-based benchmark harness that survives screen-off, richer per-run JSON output, and a Python aggregator that emits a GPU↔CPU markdown report. Also fixes a few latent bugs (Thread.sleep ANR, hard-coded "backend":"CPU", stale model filename in pre-flight).

Changes:

  • New BenchmarkForegroundService (with PARTIAL_WAKE_LOCK + sticky notification, in :benchmark process); BenchmarkActivity is now a thin shim that forwards Intent extras and finishes.
  • RagPipeline.generateResponse gains an optional retrieveKOverride; benchmark JSON now records full retrieved_chunks/response_text and the actual GPU/CPU backend.
  • New evaluation/aggregate_k_sweep.py aggregator + generated evaluation/reports/latency_report_v2.md; benchmark_latency.py adds --retrieve-k, --rag-only, k-suffixed output filenames, and updated model filename.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| app/android/app/src/main/AndroidManifest.xml | Adds WAKE_LOCK permission and BenchmarkForegroundService declaration in :benchmark process. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkActivity.kt | Reduced to ~60-line shim that forwards extras to the new service. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt | New service hosting the entire benchmark loop, wake lock, notification, and JSON output. |
| app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt | Adds retrieveKOverride param to generateResponse for per-call top_k override. |
| evaluation/benchmark_latency.py | Adds --retrieve-k/--rag-only flags, k-suffixed filename, updated model file check. |
| evaluation/aggregate_k_sweep.py | New aggregator that groups runs by (backend, k) and renders the v2 report. |
| evaluation/reports/latency_report_v2.md | Generated GPU↔CPU latency-sweep report. |


Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/benchmark_latency.py
Comment thread app/android/app/src/main/AndroidManifest.xml
Comment thread evaluation/aggregate_k_sweep.py Outdated
Real fixes:

- BenchmarkForegroundService: shut down the single-thread executor in
  onDestroy. Without this its worker thread keeps the :benchmark process
  alive after the service stops.
- BenchmarkForegroundService: remove dead `pct` variable left over from
  the Activity-era ASCII progress bar.
- AndroidManifest: stale comment said BenchmarkActivity holds the wake
  lock; updated to reflect the foreground-service refactor.
- benchmark_latency.py: error out if both --no-retrieval and --rag-only
  are passed (previously they silently coexisted; on-device skipRetrieval
  won, but the result was confusing).
- aggregate_k_sweep.py:
  * backend_of() now only overrides recorded="CPU" when the timestamp
    predates the metadata fix. Future GPU runs (which write backend="GPU"
    correctly) and future CPU runs are trusted as-is — fixes the silent
    mislabeling Copilot flagged.
  * Drop the May-2026-only glob (`benchmark_2026051*`) — use
    `benchmark_*.json` and rely on the schema/length filters.
  * Use `with open(...)` context manager — avoid file-handle leak.
  * Rename `avg_doc_chars` → `median_doc_chars` (function used median
    despite the name).
  * Remove dead loop `for col, run, key in [...]: pass`.
  * Update module docstring to describe the new backfill-only logic.

Verified: Kotlin still compiles (flutter build apk --release succeeds);
aggregate script still loads all 16 canonical runs and regenerates the
same report; the new mutual-exclusion error fires when both Python flags
are passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (3)

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:99

  • retrieveKOverride is gated only with takeIf { it >= 0 }, so passing --ei retrieve_k 0 is forwarded as a 0 to RagPipeline.generateResponse, which sets effectiveTopK = 0 and calls RetrievalConfig.create(0, …). The Python wrapper also has no validation on --retrieve-k (it's a plain int arg). The semantically correct way to "retrieve nothing" in this codebase is --no-retrieval / useRetrieval=false; a top_k=0 retrieval call has undefined behavior in MediaPipe's RetrievalConfig. Recommend rejecting non-positive values either in the Python CLI (type=positive_int) or in the service (takeIf { it > 0 }).
        val retrieveKOverride: Int? = intent?.getIntExtra("retrieve_k", -1)?.takeIf { it >= 0 }

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:240

  • loopStart is computed but never read after the activity-side ETA reporting was removed. Dead code — safe to delete.
        val loopStart = System.currentTimeMillis()

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:234

  • The KDoc says "skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins)," and the Python wrapper enforces this with parser.error(...). However, the service itself silently chooses skipRetrieval over ragOnly if both extras are passed via am start directly. A user invoking am start by hand could end up surprised. Consider logging a warning (or refusing to start) when both flags are true rather than silently picking one.
        // skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins).
        val retrievalModes = when {
            skipRetrieval -> listOf(false)
            ragOnly -> listOf(true)
            else -> listOf(true, false)
        }

Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
nmrenyi and others added 6 commits May 15, 2026 08:13
Replace truthy `if gw else "—"` style with `if gw is not None else "—"`,
and add `> 0` guards on division ratios. Defensive against a hypothetical
0-valued median from a corrupted/aborted run JSON, which the truthy form
would have silently rendered as "—" instead of "0.0".

Affects four spots: TTFT ratio, decode ratio, wall-clock ratio +
formatting, and the headline-table ratio.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
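
The guarded formatting described above, as a sketch (helper name hypothetical):

```python
def fmt_ratio(gpu_ms, cpu_ms):
    # "—" only for genuinely missing values; the > 0 guard keeps a
    # 0-valued median from a corrupted run out of the division.
    if gpu_ms is None or cpu_ms is None or gpu_ms <= 0:
        return "—"
    return f"{cpu_ms / gpu_ms:.2f}×"
```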
The timestamp threshold (`if timestamp < "20260514T2130": return "GPU"`)
silently rewrites any pre-threshold CPU JSON as GPU. Anyone running this
aggregator with their own historical genuine-CPU runs in
latency_results/ would have those mislabeled as GPU and potentially
double-counted via the "most successful entries" tiebreaker in
write_report.

Replace with `PRE_FIX_GPU_FILES`: a frozenset of the exact 8 filenames
known to predate the metadata fix in commit ef96538. Any file not in
the allowlist uses its recorded backend value. Anyone else's historical
files are unaffected.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
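
A sketch of the allowlist-based backend_of described above (the two filenames are placeholders, not the real 8-file allowlist):

```python
PRE_FIX_GPU_FILES = frozenset({
    # Placeholder names; the real frozenset lists the exact 8 files
    # known to predate the metadata fix in commit ef96538.
    "benchmark_20260514T0101.json",
    "benchmark_20260514T0202.json",
})

def backend_of(filename, config):
    if filename in PRE_FIX_GPU_FILES:
        return "GPU"   # recorded "CPU" label is known-wrong for these
    return config.get("backend", "CPU")  # everyone else is trusted as-is
```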
The service runs in android:process=":benchmark", separate from the main
app process. RagPipeline(application) here therefore constructs an
entirely fresh pipeline (Gecko + SQLite + LLM load) in that benchmark
process, not the main app's. Worth documenting because:

1. Application.onCreate() will run a second time when the benchmark
   process spawns.
2. If the main app has the LLM loaded simultaneously, two LLM instances
   may briefly contend for GPU/memory during init.

Add a "Process model" section to the class KDoc explaining both.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "-1 = use config default" semantic for retrieve_k is non-obvious and
wasn't documented anywhere. Update both KDoc blocks (BenchmarkActivity
and BenchmarkForegroundService) to spell out:

- retrieve_k: pass any value >= 0 to override; pass -1 (or omit) to use
  runtime_config.json's value. The activity normalises -1 → null before
  forwarding to the service.
- repeats default 3, cooldown_ms default 5000 (were missing).
- skip_retrieval and rag_only are mutually exclusive; skip_retrieval
  wins if both are set.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
foregroundServiceType="dataSync" is documented for "transferring data
between a device and the cloud or other peers" — clearly a misuse for an
on-device latency benchmark. Google Play's FGS-type enforcement would
reject this on submission.

Acceptable here because BenchmarkForegroundService is only launched via
`adb shell am start` for in-house benchmarking and never appears in any
user-facing flow. Add an explicit DEV-ONLY comment documenting:

- the type is technically wrong,
- the correct type to switch to if we ever ship benchmark capabilities
  (specialUse + the FOREGROUND_SERVICE_SPECIAL_USE permission +
  PROPERTY_SPECIAL_USE_FGS_SUBTYPE),
- this declaration should be stripped from any Play Store build.

No runtime change; only a comment to prevent surprises later.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aggregate_overall returns {} when every run in a (backend, k) cell errored
out (all results have an error field). Callers then subscript with
["median"] / ["p95"] and crash with KeyError.

Today this only happens partially — e.g. GPU k=20 has 24/54 errors but
30 successful runs, so the dict is populated. But a future sweep where
all runs error (entirely possible at higher k once the 4096-token wall
is hit broadly, or if the LLM dies during init) would crash the report
generation.

Switch all four subscript sites in write_report() to .get("median") and
.get("p95") so an empty dict propagates as None, which fmt_ms / fmt_s
already render as "—".

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
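
The empty-cell-safe pattern described above, sketched (simplified to the median only):

```python
import statistics

def aggregate_overall(results, key):
    vs = [r[key] for r in results if not r.get("error")]
    if not vs:
        return {}          # every run in this cell errored out
    return {"n": len(vs), "median": int(statistics.median(vs))}

# .get() propagates the empty dict as None instead of a KeyError crash;
# the fmt_* helpers already render None as "—".
stats = aggregate_overall([{"total_ms": 100, "error": "oom"}], "total_ms")
median = stats.get("median")
```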
Copilot AI review requested due to automatic review settings May 15, 2026 00:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (4)

evaluation/aggregate_k_sweep.py:110

  • For canonical 54-run files this p95 index int(len(s) * 0.95) evaluates correctly (51 → 52nd of 54), but for runs with errors (e.g. the k=20 set with only 30 successes) it yields int(30*0.95)=28 → the 29th of 30 sorted, which is the 96.7th percentile, not the 95th. math.ceil(0.95*n)-1 (or statistics.quantiles) would give the standard nearest-rank p95 across all sample sizes.
        s = sorted(vs)
        out[c] = {
            "n": len(vs),
            "median": int(statistics.median(vs)),
            "p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),
        }
    return out


def aggregate_overall(d: dict, key: str) -> dict:
    vs = [r[key] for r in d["results"] if not r.get("error")]
    if not vs:
        return {}
    s = sorted(vs)
    return {
        "n": len(vs),
        "median": int(statistics.median(vs)),
        "p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),

evaluation/reports/latency_report_v2.md:115

  • The "Wall-clock comparison" table mixes runs that completed all 54 queries with runs that errored out partway: at k=20, 24 of 54 runs failed-fast with the context-overflow error before any decode, so the wall time (22.8 / 58.6 min) is artificially low compared to k=15 and is not a meaningful CPU÷GPU comparison. This row should either be flagged in the table (e.g. an asterisk + footnote) or excluded from the CPU÷GPU column.
| 15 | 32.4 | 90.8 | 2.80× |
| 20 | 22.8 | 58.6 | 2.57× |

evaluation/aggregate_k_sweep.py:259

  • fmt_ms(ge) / fmt_ms(ce) are reused here just to render an integer count of failed runs. The helper is named fmt_ms and its docstring/sibling fmt_s strongly suggest it is for milliseconds. Either rename fmt_ms to a generic int-or-dash helper, or call str(ge) if ge is not None else "—" inline so the table's intent is clear.
        ge = sum(1 for r in gpu_run["data"]["results"] if r.get("error")) if gpu_run else None
        ce = sum(1 for r in cpu_run["data"]["results"] if r.get("error")) if cpu_run else None
        label = "**0 (no-RAG)**" if k == 0 else str(k)
        md.append(f"| {label} | {fmt_ms(ge)} | {fmt_ms(ce)} |")

evaluation/reports/latency_report_v2.md:37

  • The headline methodology says "Reported as median across the 54 runs unless noted (p95 in tables marked p95)", but the very next "Headline — Median total query latency" table breaks results down by category (short / medium / long) — i.e. medians over ~18 runs each, not over the full 54. The methodology line should mention the per-category breakdown so readers don't misinterpret the 13.1 / 12.6 / 17.3 values as 54-run statistics.
- `TTFT` excludes retrieval — measured from end-of-retrieval to first generated token.
- `decode` is first-token to last-token.
- `total_query` is everything: `retrieval + TTFT + decode`.
- Reported as median across the 54 runs unless noted (p95 in tables marked `p95`).

## Headline — Median total query latency (seconds)

| k | doc_chars med | GPU short / med / long | CPU short / med / long | CPU÷GPU |
|---:|---:|---:|---:|---:|
| **0 (no-RAG)** | 0 | 12.9 / 15.6 / 16.1 | 27.2 / 26.9 / 29.8 | 1.94× |
| 1 | 561 | 13.1 / 12.6 / 17.3 | 29.3 / 31.9 / 30.3 | 2.14× |
| 3 | 2098 | 18.6 / 18.6 / 21.0 | 37.3 / 44.5 / 42.5 | 2.24× |
| 5 | 3547 | 18.2 / 20.0 / 21.4 | 54.8 / 60.7 / 63.0 | 3.07× |
| 7 | 5139 | 21.3 / 23.2 / 22.8 | 61.4 / 62.3 / 60.4 | 2.72× |
| 10 | 7482 | 22.5 / 20.5 / 20.4 | 61.8 / 70.6 / 77.9 | 3.10× |
| 15 | 11297 | 25.3 / 24.0 / 22.4 | 84.8 / 80.8 / 89.7 | 3.48× |
| 20 | 14520 | 23.9 / 20.5 / 18.5 | 88.7 / 95.6 / 95.6 | 4.46× |

Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/reports/latency_report_v2.md Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
nmrenyi and others added 4 commits May 15, 2026 10:22
Report claimed "The 8 surviving queries on either side" — but the math
doesn't check out: 18 queries × 3 reps = 54 runs, 24 errored = 8 unique
queries failed × 3 reps. So 18 − 8 = 10 unique queries survived, not 8.

Corrected to "The other 10 queries (10 × 3 reps = 30 successful runs)".

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The service normalises -1 to null but treats any value >= 0 as an
explicit override, so passing --retrieve-k 0 would silently call
RetrievalConfig.create(0, ...) — a confusing footgun. If you want to
disable retrieval entirely, --no-retrieval is the proper flag.

argparse now rejects --retrieve-k < 1 with a clear error pointing to
--no-retrieval. Negative values were already filtered by the service's
takeIf { it >= 0 } but a 0 was slipping through.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the PARTIAL_WAKE_LOCK was acquired in onCreate(), before
onStartCommand calls startForegroundCompat. Two problems with that:

1. If the system creates the service but onStartCommand is delayed or
   never invoked (bind-only path, framework deferral), the wake lock is
   held without a foreground notification — and Android 12+ can trip
   the foreground-service-start-while-in-background restriction in that
   state.
2. Even on the normal path, there is a brief window where the CPU is
   pinned awake without the user-visible notification that justifies it.

Move the wake-lock acquisition into onStartCommand, immediately AFTER
startForegroundCompat. The lock is now strictly paired with the
foreground notification's lifetime. Guarded with `wakeLock == null` so
duplicate onStartCommand invocations (which can happen via START_NOT_STICKY
restarts) don't try to re-acquire.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-thread executor is what ferries pipeline calls off the
coroutine dispatchers (LiteRT-LM generation, Gecko retrieval), and
scope.cancel() does NOT propagate cancellation into those blocking
native calls. A plain executor.shutdown() then returns immediately and
leaves the worker thread alive, keeping the :benchmark process running
until generation finishes naturally — stopForeground might run with
the worker still busy.

Use shutdownNow() to interrupt the worker, plus a brief 2 s
awaitTermination() so a worker that's tearing down cleanly gets a chance
to do so. If it doesn't, the OS will reclaim the process eventually.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nmrenyi and others added 2 commits May 15, 2026 10:25
backend_of's fallback was `d["config"].get("backend", "CPU")`, which
silently labels any backend-less JSON as "CPU". Pre-fix files are
handled by the explicit allowlist; the silent default only fires on
"unexpected" JSONs — which is exactly when a future regression in
BenchmarkForegroundService (e.g. metadata accidentally dropped) would
slip past us.

Now: if config.backend is missing AND the file isn't on the pre-fix
allowlist, print a warning to stderr explaining the assumption, then
default to "CPU". Post-fix JSONs always carry the field, so this
warning only fires when something is genuinely off.
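A sketch of the fallback as described above; the allowlist contents and
the function signature here are illustrative assumptions, not the
script's actual names:

```python
import sys

# Hypothetical allowlist of pre-fix result files known to be CPU runs.
PRE_FIX_CPU_FILES = {"benchmark_results_pre_fix.json"}

def backend_of(d: dict, filename: str) -> str:
    backend = d.get("config", {}).get("backend")
    if backend is not None:
        return backend
    if filename not in PRE_FIX_CPU_FILES:
        # Post-fix JSONs always carry config.backend, so reaching this
        # branch means something upstream is genuinely off.
        print(
            f"WARNING: {filename} missing config.backend; assuming CPU",
            file=sys.stderr,
        )
    return "CPU"
```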

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Python wrapper rejects --no-retrieval + --rag-only via parser.error()
before the launch ever fires. But a direct `am start --ez skip_retrieval
true --ez rag_only true ...` bypasses Python entirely and silently runs
in No-RAG mode (skipRetrieval wins) with no indication anything is off.

Add a Log.w under the same mam-ai-bench tag as the other [BENCHMARK]
markers so the mismatch is visible in `adb logcat -s mam-ai-bench:W`
output during debugging.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nmrenyi nmrenyi requested a review from Copilot May 15, 2026 02:35
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread evaluation/aggregate_k_sweep.py
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
nmrenyi and others added 7 commits May 15, 2026 10:48
The 6 h ceiling could expire mid-sweep on long CPU runs — the full GPU
+ CPU k-sweep documented in latency_report_v2.md took ~7 h end-to-end,
and a CPU-only sweep across k ∈ {1, 3, 5, 7, 10, 15} hit similar
totals. When the lock auto-released, the OS could idle the CPU and
the benchmark would silently stall (no failure log, just no progress).

Bump to 24 h, with a comment that anything longer should switch to
periodic re-acquire instead of bumping the constant further.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scope.launch { runBenchmark(...) } was unconditional, so a re-delivered
intent (another `am start` before stopSelf() completes) would spawn a
second coroutine running on the same single-threaded executor and
writing to the same benchmark_results.json — both timings and the
output JSON would be corrupted.

Add a `benchmarkStarted` volatile flag that's set on first entry. Any
later onStartCommand call returns immediately with a logcat warning,
keeping the in-flight run intact.

START_NOT_STICKY makes this unlikely in practice, but the right belt-
and-braces fix is cheap and removes a race even on edge cases (e.g.
process recreated after low-memory kill before stopSelf returned).

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The catch handler in runQuery() was the only [BENCHMARK] log line in
the file using TAG="mam-ai" instead of BENCH_TAG="mam-ai-bench". This
made `adb logcat -s mam-ai-bench:E` filter out exactly the messages
most worth surfacing — query-level errors.

One-line fix.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
load_runs() was silently dropping JSONs that didn't match the 54-run
canonical-sweep shape (missing config/results keys, or < 30 results
which is the smoke-test guard). For users running a narrow sweep
(e.g. --filter long_01 --repeats 3 yields 3 results), the file would
silently never appear in the report with no indication why.

Log SKIP lines to stderr with the file name and reason. The output is
still clean for normal runs (only emits when something is actually
dropped).
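The skip-with-reason pattern could look like the following sketch. The
30-result guard is from this commit; the glob pattern and key names are
assumptions about the script's layout:

```python
import json
import sys
from pathlib import Path

def load_runs(results_dir: str, min_results: int = 30) -> list[dict]:
    """Load benchmark JSONs, logging a SKIP line for anything dropped."""
    runs = []
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        if "config" not in data or "results" not in data:
            print(f"SKIP {path.name}: missing config/results keys",
                  file=sys.stderr)
            continue
        if len(data["results"]) < min_results:
            print(f"SKIP {path.name}: only {len(data['results'])} results "
                  f"(< {min_results}, smoke-test guard)", file=sys.stderr)
            continue
        runs.append(data)
    return runs
```

Normal 54-run sweep files load silently; only dropped files produce
stderr output, so report generation stays clean.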

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Methodology paragraph hardcoded "18 queries × … 10-second cooldown"
even though the JSONs carry the actual config.repeats, config.cooldown_ms,
and a results count that proves the math. If a future run uses different
defaults (or this script is pointed at a different sweep), the
methodology text would silently lie.

Now reads from the sample run's config: pulls repeats, cooldown_ms,
results count; infers (queries × modes) from results / repeats. Output
for the current data set is unchanged ("18 (query × mode) cells × 3
repeats = 54 timed runs … 10-second cooldown") because that's what the
JSONs actually say.
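The derivation can be sketched roughly as follows; the key names
config.repeats and config.cooldown_ms are from this commit, while the
function name and sentence wording are illustrative:

```python
def methodology_line(sample_run: dict) -> str:
    # Derive the counts from the JSON instead of hardcoding them.
    repeats = sample_run["config"]["repeats"]
    cooldown_s = sample_run["config"]["cooldown_ms"] // 1000
    n_results = len(sample_run["results"])
    cells = n_results // repeats  # inferred (queries x modes)
    return (
        f"{cells} (query × mode) cells × {repeats} repeats = "
        f"{n_results} timed runs, with a {cooldown_s}-second cooldown"
    )
```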

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous approach indexed the sorted sample at `int(len(s) * 0.95)`,
which is biased high and collapses to the max for any sample size
n < 20. For the per-category short bucket with 24 runs, int(24*0.95) = 22
picks the 23rd of 24 sorted values, close to but not actually the 95th
percentile; for a hypothetical narrower sample with n=3 (e.g. a
single-query small sweep), int(2.85) = 2 is the last index, so
p95 == max by construction.

Centralise the calculation in a `_p95(values)` helper that uses
`statistics.quantiles(values, n=20, method="exclusive")[18]` — the
linear-interpolation 95th percentile from a 20-quantile partition.
Falls back to max only when n < 2 (genuinely no quantiles to compute).

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stopForeground(boolean) has been deprecated since Android 13. Replace
with the SDK-aware form: STOP_FOREGROUND_REMOVE on API 24+ (where the
int overload was introduced), fall back to the boolean variant only on
older devices (where it isn't deprecated).

Drops the @Suppress("DEPRECATION") on the modern path; we still
suppress on the legacy path because the boolean variant *is* the
non-deprecated API there.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


@nmrenyi nmrenyi merged commit 1be0a55 into main May 15, 2026
14 of 18 checks passed
@nmrenyi nmrenyi deleted the feat/benchmark-retrieve-k-override branch May 15, 2026 03:57
nmrenyi added a commit that referenced this pull request May 16, 2026
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but
against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours
total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). Result
is a clean cell-by-cell comparison.

Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s
extra wait per query), much less than the ~3× I'd estimated from
the single-data-point Step 3 measurement. That earlier number was
wrong — I had accidentally been comparing FP32-E4B against
FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.

The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0s, FP32 TTFT 2.0–9.8s (~2–2.5× across all k)
- FP16 decode 11–18s, FP32 decode 12–19s (essentially identical)

Mechanism: prefill is compute-bound (one parallel forward pass over
the input), so FP16's 2× arithmetic throughput on Adreno helps a
lot. Decode is bandwidth-bound (sequential token-at-a-time loading
of weights), so the FP16/FP32 precision choice barely matters.
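A back-of-envelope check of that decomposition, using only the range
endpoints quoted above (illustrative numbers from this message, not
fresh measurements):

```python
# Upper ends of the quoted ranges (roughly the largest-k cells), seconds.
fp16_ttft, fp16_decode = 4.0, 18.0
fp32_ttft, fp32_decode = 9.8, 19.0

fp16_total = fp16_ttft + fp16_decode  # 22.0 s
fp32_total = fp32_ttft + fp32_decode  # 28.8 s

# Near-identical decode dilutes the ~2.4x TTFT gap into a ~30% total
# slowdown at the top of the sweep, consistent with ~25% at k=15.
slowdown = fp32_total / fp16_total - 1.0
print(f"FP32 total slowdown: {slowdown:.0%}")
```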

Same 24 errors at k=20 on FP32 GPU as on FP16 GPU — the prompt-cap
rejection at maxNumTokens=4096 is precision-agnostic, just a config
check.

What this means: **FP32 GPU is a real shipping option, not just an
experiment.** At maxNumTokens=4096 the latency cost is ~25% (no
quality benefit — we're below the FP16 cliff anyway). At higher
maxNumTokens (e.g., 5500), FP32 GPU enables clean output past the
FP16 cliff at the same ~25% latency hit. Memory ceiling caps
maxNumTokens at ~6500–7500 on this 16 GB device since KV cache
doubles vs FP16.

The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin
tradeoff at the deployment level, not a feasibility question.

Updates: new Step 5 section with full sweep table + corrected
slowdown narrative; updated TL;DR bullet on FP32 latency cost;
"What's still open" table marks FP32 latency curve resolved; full
8-JSON inventory added to References.