feat(benchmark): per-k latency sweep infrastructure + GPU/CPU report #57
Lets the latency benchmark vary retrieval top_k without rebuilding the APK or editing runtime_config.json. One build + install up front, then sweep k via the CLI flag — needed to bound k_max on the Snapdragon 8 Elite + GPU stack now that TTFT (~1–2 s at k=3) is no longer the binding constraint.
Wiring:
- RagPipeline.generateResponse() gains an optional retrieveKOverride: Int? parameter (default null). When non-null it replaces retrievalConfig.top_k for that call only; production callers leave it null. The parameter is added at the end of the list so RagStream's positional call is unaffected.
- BenchmarkActivity reads a "retrieve_k" Intent extra (-1 sentinel = no override), threads it through runBenchmark → runQuery → generateResponse, and records "retrieval_top_k_override" in the config block of the results JSON.
- benchmark_latency.py adds --retrieve-k N, forwards it via am start --ei retrieve_k N, and appends "_kN" to the output filename so a sweep across k values is legible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
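A minimal sketch of the override described above; `generateResponse`, `retrieveKOverride`, and the top_k substitution come from the commit, while the class scaffolding, `RetrievalSettings`, and the stub bodies are assumptions for illustration.

```kotlin
// Sketch only; not the actual RagPipeline. Illustrates the optional per-call
// top_k override added at the end of the parameter list.
data class RetrievalSettings(val topK: Int)

class RagPipelineSketch(private val retrievalConfig: RetrievalSettings) {

    // Default null keeps existing positional callers (e.g. RagStream) unaffected.
    fun generateResponse(
        query: String,
        useRetrieval: Boolean = true,
        retrieveKOverride: Int? = null,
    ): String {
        // A non-null override replaces the configured top_k for this call only.
        val effectiveTopK = retrieveKOverride ?: retrievalConfig.topK
        val chunks = if (useRetrieval) retrieve(query, effectiveTopK) else emptyList()
        return generate(query, chunks)
    }

    // Placeholder stubs standing in for the real retriever and LLM calls.
    private fun retrieve(query: String, topK: Int): List<String> = emptyList()
    private fun generate(query: String, chunks: List<String>): String = ""
}
```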
Previously the benchmark recorded only counts/lengths (num_retrieved_docs,
response_length_chars). For the per-k latency sweep we want to see what
the retriever actually surfaced at each k and what the model generated —
both for content review and because the total chunk-text length is what
drives prefill cost as k grows.
New per-run fields in the results JSON:
- retrieved_chunks: array of {text, source, page, chars} for every chunk
the retriever returned. Lets us inspect what changed as k grew.
- retrieved_total_chars: sum of chunk text lengths. The real
prompt-length proxy (vs. query_word_count which is static).
- response_text: full model response (the generation listener was
already accumulating it; we now record the final string).
Size: at k=20 with ~1500-3000 chars/chunk + ~3KB response, per-run
overhead is ~30-60KB. A full 108-run benchmark file grows from ~50KB to
~4-6MB. Acceptable; can add a --no-content opt-out later if needed.
Note: Gemma 4 E4B doesn't emit a separate reasoning channel — any inline
reasoning the model writes shows up in response_text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
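A sketch of how the three new fields could be assembled per run using Android's bundled org.json; the `Chunk` holder and the function shape are assumptions, and only the field names come from the commit.

```kotlin
import org.json.JSONArray
import org.json.JSONObject

// Hypothetical chunk holder, for illustration only.
data class Chunk(val text: String, val source: String, val page: Int)

fun perRunContentJson(chunks: List<Chunk>, responseText: String): JSONObject {
    val retrieved = JSONArray()
    for (c in chunks) {
        retrieved.put(
            JSONObject()
                .put("text", c.text)
                .put("source", c.source)
                .put("page", c.page)
                .put("chars", c.text.length)
        )
    }
    return JSONObject()
        .put("retrieved_chunks", retrieved)                            // what the retriever surfaced
        .put("retrieved_total_chars", chunks.sumOf { it.text.length }) // prompt-length proxy
        .put("response_text", responseText)                            // final accumulated string
}
```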
check_models_downloaded() was looking for gemma-3n-E4B-it-int4.task — left over from the pre-Gemma-4 era. config/app_config.json now declares "llm_model": "gemma-4-E4B-it.litertlm", so the script's pre-flight check falsely failed even when the right model was on device. Caught during the smoke test for the --retrieve-k override feature: the script aborted before launching BenchmarkActivity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For the k-sweep, the No-RAG baseline doesn't change with k (retrieval is disabled, so the override is ignored). Without this flag, running the sweep at 7 k-values would re-run the identical 54 No-RAG measurements seven times — ~1.5 hours of redundant work. The Intent extra "rag_only" (bool, default false) tells BenchmarkActivity to run only the RAG mode. Mutually exclusive with skip_retrieval, which wins if both are set. Python --rag-only forwards via am start --ez. Recorded in the session config JSON as "rag_only" so reruns are unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thread.sleep(cooldownMs) was running on the UI thread (BenchmarkActivity
is an Activity; the calling coroutine was scope.launch(Dispatchers.Main)
by default). With cooldown ≥ 5000 ms, Android's input-dispatching timeout
fires and the activity gets killed mid-sweep — exactly what happened on
the OPPO Snapdragon 8 Elite at cooldown=10000:
AnrInfo{reason='Input Dispatching Timeout',
stackTrace='at java.lang.Thread.sleep(...)
at BenchmarkActivity.runBenchmark(BenchmarkActivity.kt:279)'}
delay() is a suspending function that doesn't block the underlying
thread, so the UI stays responsive while the benchmark waits. Both
cooldown call sites (post-init and between-runs) are inside the same
suspend coroutine, so this is a drop-in replacement.
The existing latency_report.md used cooldown=10000 on Pixel 7 without
issue, which suggests the older Tensor G2 chipset had laxer ANR
enforcement or Athena memory-kill behavior — either way, the
correct-by-construction fix is to never block the UI thread.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
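The change itself is a one-liner. A sketch of the pattern, assuming the cooldown lives in a suspend function as described:

```kotlin
import kotlinx.coroutines.delay

// Old: Thread.sleep(cooldownMs) blocked whatever thread the coroutine ran on
// (here the UI thread), tripping the input-dispatching ANR at cooldowns of ~5 s and up.
// New: delay() suspends the coroutine without blocking the underlying thread.
suspend fun cooldown(cooldownMs: Long) {
    delay(cooldownMs)
}
```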
fix(benchmark): keep CPU alive through screen-off (PARTIAL_WAKE_LOCK + Default dispatcher)
The benchmark coroutine was running on Dispatchers.Main, so as soon as the device screen went off (~10 min on OPPO) the activity backgrounded and the coroutine's delay() never resumed. The process stayed alive but stopped making progress — the Python wrapper hung waiting for a [BENCHMARK] COMPLETE that would never come. Observed on the OPPO Snapdragon 8 Elite mid-sweep: ~24 min between the last [BENCHMARK] log line and the most recent Athena heartbeat, with mWakefulness=Asleep.
Two changes:
1. Acquire a PARTIAL_WAKE_LOCK in onCreate (released in onDestroy). Keeps the CPU running even when the screen is off; the screen itself can still sleep. 6-hour failsafe timeout. Required permission added to AndroidManifest.xml — used only by BenchmarkActivity.
2. Switch the coroutine scope from Dispatchers.Main to Dispatchers.Default. The benchmark logic doesn't touch the UI directly (logStatus already marshals back to Main via runOnUiThread), so there's no reason to run on Main — doing so just makes the work pause whenever the activity loses focus. Default keeps running in any lifecycle state.
These together let the sweep run while the device is screen-off or locked. Screen lifespan and battery thank you.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
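Roughly what the two changes look like in an Activity. This is a sketch under assumptions: the wake-lock tag and scope names are invented; only PARTIAL_WAKE_LOCK, the 6-hour failsafe timeout, and the Dispatchers.Default switch come from the commit.

```kotlin
import android.app.Activity
import android.content.Context
import android.os.Bundle
import android.os.PowerManager
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel

class BenchmarkActivitySketch : Activity() {
    private var wakeLock: PowerManager.WakeLock? = null

    // 2. Default dispatcher: no UI work in the benchmark loop, so it keeps making
    //    progress even when the activity is backgrounded or the screen is off.
    private val benchmarkScope = CoroutineScope(Dispatchers.Default + SupervisorJob())

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // 1. PARTIAL_WAKE_LOCK keeps the CPU awake while the screen sleeps.
        //    Needs <uses-permission android:name="android.permission.WAKE_LOCK"/>.
        val pm = getSystemService(Context.POWER_SERVICE) as PowerManager
        wakeLock = pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "mam-ai:benchmark")
            .also { it.acquire(6 * 60 * 60 * 1000L) } // 6-hour failsafe timeout
    }

    override fun onDestroy() {
        wakeLock?.takeIf { it.isHeld }?.release()
        benchmarkScope.cancel()
        super.onDestroy()
    }
}
```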
OPPO ColorOS aggressively freezes background activities and even
force-releases PARTIAL_WAKE_LOCKs held by plain Activities (visible as
'add wakelock … to ForceReleaseWakeLock list' in OplusProxyWakeLock).
Foreground services with a sticky notification are respected — once the
app is also whitelisted in Settings → Battery → App Battery Management
("Allow background activity"), the benchmark runs cleanly with the
screen off and the device locked.
Architecture:
- BenchmarkForegroundService (NEW) — holds the wake lock, posts a
sticky progress notification, and runs the entire benchmark loop.
Uses Dispatchers.Default so it isn't tied to a UI thread. Stops
itself when done; the OS reclaims the process.
- BenchmarkActivity — reduced from ~470 lines to ~60. Now a thin
launcher: receives `am start` Intent extras, forwards them to the
service via startForegroundService(), and finishes immediately.
Existing Python wrapper (benchmark_latency.py) is unchanged — it
still launches the Activity and reads progress from logcat.
- AndroidManifest — registers the new service with
foregroundServiceType="dataSync" (reuses the existing
FOREGROUND_SERVICE_DATA_SYNC permission) and android:process=
":benchmark" (same isolated process as BenchmarkActivity).
Verified end-to-end on OPPO Snapdragon 8 Elite (OPD2413): launched the
benchmark, locked the screen 30 s in, watched a full medium_01 RAG run
complete with `hans_freeze=0` Athena freeze events and the screen
asleep through the entire decode phase. TTFT 1025 ms, total 13.8 s —
within the same envelope as foreground runs.
Pre-flight on a new device:
Settings → Battery → App Battery Management → MAM-AI
→ "Allow background activity"
Without that, OPPO's OplusHansManager freezes the process at the OS
level regardless of foreground-service status.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
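A sketch of the service shape the bullets above describe. Notification text, channel id, and helper names are assumptions; the real BenchmarkForegroundService also manages the wake lock, the results JSON, and per-run progress updates.

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

// Sketch only: a foreground service that owns the whole benchmark loop.
class BenchmarkForegroundServiceSketch : Service() {

    private val scope = CoroutineScope(Dispatchers.Default + SupervisorJob())

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        // Sticky notification first: the foreground state is what ColorOS respects.
        startForeground(NOTIFICATION_ID, buildNotification())
        scope.launch {
            runBenchmark(intent)   // the entire benchmark loop, off the UI thread
            stopForeground(true)   // see the later stopForeground() deprecation commit
            stopSelf()             // let the OS reclaim the :benchmark process
        }
        return START_NOT_STICKY
    }

    private fun buildNotification(): Notification {
        val channel = NotificationChannel(CHANNEL_ID, "Benchmark", NotificationManager.IMPORTANCE_LOW)
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
        return Notification.Builder(this, CHANNEL_ID)
            .setContentTitle("Latency benchmark running")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .setOngoing(true)
            .build()
    }

    private suspend fun runBenchmark(intent: Intent?) { /* placeholder for the sweep */ }

    override fun onBind(intent: Intent?): IBinder? = null

    private companion object {
        const val CHANNEL_ID = "benchmark"
        const val NOTIFICATION_ID = 1
    }
}
```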
The config block dump in benchmark_results.json was hard-coded to "backend":"CPU" — wrong for any build with useGpuForLlm=true. The GPU-sweep JSONs we just ran on the OPPO Snapdragon 8 Elite all carry the incorrect "CPU" label even though they were measured on GPU. Now reads from BuildConfig.USE_GPU_FOR_LLM at compile time and writes "GPU" or "CPU" accordingly. Also adds "mtp_enabled" from BuildConfig.USE_MTP_FOR_LLM for full provenance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
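A sketch of the metadata write. USE_GPU_FOR_LLM / USE_MTP_FOR_LLM are the BuildConfig flags the commit names; they are taken as plain parameters here to keep the sketch self-contained.

```kotlin
import org.json.JSONObject

// In the real code the values come from BuildConfig.USE_GPU_FOR_LLM and
// BuildConfig.USE_MTP_FOR_LLM, so the label always matches the compiled binary.
fun backendMetadata(useGpuForLlm: Boolean, useMtpForLlm: Boolean): JSONObject =
    JSONObject()
        .put("backend", if (useGpuForLlm) "GPU" else "CPU")
        .put("mtp_enabled", useMtpForLlm)
```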
Aggregates the 15 canonical 54-run benchmark JSONs into a single
GPU↔CPU comparison report covering k ∈ {0, 1, 3, 5, 7, 10, 15} (and
GPU-only k=20). Produced by a new evaluation/aggregate_k_sweep.py
that's re-runnable as more JSONs land.
Headline numbers (median total query latency on OPPO OPD2413,
Snapdragon 8 Elite, Gemma 4 E4B, LiteRT-LM 0.11.0):
k=0 (no-RAG): GPU 13–16 s | CPU 27–30 s (1.9× slower on CPU)
k=3 : GPU 19–21 s | CPU 37–45 s (2.2× slower)
k=10 : GPU 21–22 s | CPU 62–78 s (3.1× slower)
k=15 : GPU 22–25 s | CPU 81–90 s (3.5× slower)
k=20 : GPU 44% of runs fail at the 4096-token model ceiling
Key findings:
- GPU is the practical choice for this device tier — TTFT is 13–19×
faster than CPU; total latency is 2–3.5× faster.
- The model's 4096-token context window is the binding upper limit
(k_max ≈ 17–18), not latency. GPU has comfortable headroom below
that ceiling.
- CPU is unusable past k≈3 for any reasonable UX budget. At k=15,
CPU p95 latency hits 113 s.
- Decode is memory-bandwidth-bound (GPU/CPU within ~1.4×); the GPU
win is entirely in compute-heavy prefill.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CPU k=20 reproduces the GPU k=20 failure pattern exactly:
- 24/54 errors on both backends (44% failure rate)
- The identical 8 queries fail on both (long_01, long_03, medium_02, medium_04, short_01, short_03, short_04, short_05)
- The same 24 (query × rep) pairs fail across both runs
This is direct evidence that the 4096-token context cap is a property of the .litertlm model artifact itself — not a runtime config, not a backend choice. It strengthens finding #2 from "model is the ceiling, GPU specifically hits it" to "model is the ceiling, both backends hit it identically."
Successful CPU k=20 runs: TTFT 65–73 s, total 89–96 s — well past any deployment budget even when the request fits the window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds infrastructure for per-k retrieval-depth latency sweeps on the Android RAG app: a runtime --retrieve-k override, a foreground-service-based benchmark harness that survives screen-off, richer per-run JSON output, and a Python aggregator that emits a GPU↔CPU markdown report. Also fixes a few latent bugs (Thread.sleep ANR, hard-coded "backend":"CPU", stale model filename in pre-flight).
Changes:
- New `BenchmarkForegroundService` (with `PARTIAL_WAKE_LOCK` + sticky notification, in the `:benchmark` process); `BenchmarkActivity` is now a thin shim that forwards Intent extras and finishes.
- `RagPipeline.generateResponse` gains an optional `retrieveKOverride`; benchmark JSON now records full `retrieved_chunks` / `response_text` and the actual GPU/CPU backend.
- New `evaluation/aggregate_k_sweep.py` aggregator + generated `evaluation/reports/latency_report_v2.md`; `benchmark_latency.py` adds `--retrieve-k`, `--rag-only`, k-suffixed output filenames, and an updated model filename.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| app/android/app/src/main/AndroidManifest.xml | Adds WAKE_LOCK permission and BenchmarkForegroundService declaration in :benchmark process. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkActivity.kt | Reduced to ~60-line shim that forwards extras to the new service. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt | New service hosting the entire benchmark loop, wake lock, notification, and JSON output. |
| app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt | Adds retrieveKOverride param to generateResponse for per-call top_k override. |
| evaluation/benchmark_latency.py | Adds --retrieve-k/--rag-only flags, k-suffixed filename, updated model file check. |
| evaluation/aggregate_k_sweep.py | New aggregator that groups runs by (backend, k) and renders the v2 report. |
| evaluation/reports/latency_report_v2.md | Generated GPU↔CPU latency-sweep report. |
Real fixes:
- BenchmarkForegroundService: shut down the single-thread executor in
onDestroy. Without this its worker thread keeps the :benchmark process
alive after the service stops.
- BenchmarkForegroundService: remove dead `pct` variable left over from
the Activity-era ASCII progress bar.
- AndroidManifest: stale comment said BenchmarkActivity holds the wake
lock; updated to reflect the foreground-service refactor.
- benchmark_latency.py: error out if both --no-retrieval and --rag-only
are passed (previously they silently coexisted; on-device skipRetrieval
won, but the result was confusing).
- aggregate_k_sweep.py:
* backend_of() now only overrides recorded="CPU" when the timestamp
predates the metadata fix. Future GPU runs (which write backend="GPU"
correctly) and future CPU runs are trusted as-is — fixes the silent
mislabeling Copilot flagged.
* Drop the May-2026-only glob (`benchmark_2026051*`) — use
`benchmark_*.json` and rely on the schema/length filters.
* Use `with open(...)` context manager — avoid file-handle leak.
* Rename `avg_doc_chars` → `median_doc_chars` (function used median
despite the name).
* Remove dead loop `for col, run, key in [...]: pass`.
* Update module docstring to describe the new backfill-only logic.
Verified: Kotlin still compiles (flutter build apk --release succeeds);
aggregate script still loads all 16 canonical runs and regenerates the
same report; the new mutual-exclusion error fires when both Python flags
are passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (3)
app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:99
`retrieveKOverride` is gated only with `takeIf { it >= 0 }`, so passing `--ei retrieve_k 0` is forwarded as a 0 to `RagPipeline.generateResponse`, which sets `effectiveTopK = 0` and calls `RetrievalConfig.create(0, …)`. The Python wrapper also has no validation on `--retrieve-k` (it's a plain `int` arg). The semantically correct way to "retrieve nothing" in this codebase is `--no-retrieval` / `useRetrieval=false`; a `top_k=0` retrieval call has undefined behavior in MediaPipe's RetrievalConfig. Recommend rejecting non-positive values either in the Python CLI (`type=positive_int`) or in the service (`takeIf { it > 0 }`).
val retrieveKOverride: Int? = intent?.getIntExtra("retrieve_k", -1)?.takeIf { it >= 0 }
app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:240
`loopStart` is computed but never read after the activity-side ETA reporting was removed. Dead code — safe to delete.
val loopStart = System.currentTimeMillis()
app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:234
- The KDoc says "skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins)," and the Python wrapper enforces this with `parser.error(...)`. However, the service itself silently chooses `skipRetrieval` over `ragOnly` if both extras are passed via `am start` directly. A user invoking `am start` by hand could end up surprised. Consider logging a warning (or refusing to start) when both flags are true rather than silently picking one.
// skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins).
val retrievalModes = when {
skipRetrieval -> listOf(false)
ragOnly -> listOf(true)
else -> listOf(true, false)
}
Replace truthy `if gw else "—"` style with `if gw is not None else "—"`, and add `> 0` guards on division ratios. Defensive against a hypothetical 0-valued median from a corrupted/aborted run JSON, which the truthy form would have silently rendered as "—" instead of "0.0". Affects four spots: TTFT ratio, decode ratio, wall-clock ratio + formatting, and the headline-table ratio. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The timestamp threshold (`if timestamp < "20260514T2130": return "GPU"`) silently rewrites any pre-threshold CPU JSON as GPU. Anyone running this aggregator with their own historical genuine-CPU runs in latency_results/ would have those mislabeled as GPU and potentially double-counted via the "most successful entries" tiebreaker in write_report. Replace with `PRE_FIX_GPU_FILES`: a frozenset of the exact 8 filenames known to predate the metadata fix in commit ef96538. Any file not in the allowlist uses its recorded backend value. Anyone else's historical files are unaffected. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The service runs in android:process=":benchmark", separate from the main app process. RagPipeline(application) here therefore constructs an entirely fresh pipeline (Gecko + SQLite + LLM load) in that benchmark process, not the main app's. Worth documenting because:
1. Application.onCreate() will run a second time when the benchmark process spawns.
2. If the main app has the LLM loaded simultaneously, two LLM instances may briefly contend for GPU/memory during init.
Add a "Process model" section to the class KDoc explaining both.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "-1 = use config default" semantic for retrieve_k is non-obvious and wasn't documented anywhere. Update both KDoc blocks (BenchmarkActivity and BenchmarkForegroundService) to spell out: - retrieve_k: pass any value >= 0 to override; pass -1 (or omit) to use runtime_config.json's value. The activity normalises -1 → null before forwarding to the service. - repeats default 3, cooldown_ms default 5000 (were missing). - skip_retrieval and rag_only are mutually exclusive; skip_retrieval wins if both are set. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
foregroundServiceType="dataSync" is documented for "transferring data between a device and the cloud or other peers" — clearly a misuse for an on-device latency benchmark. Google Play's FGS-type enforcement would reject this on submission. Acceptable here because BenchmarkForegroundService is only launched via `adb shell am start` for in-house benchmarking and never appears in any user-facing flow. Add an explicit DEV-ONLY comment documenting: - the type is technically wrong, - the correct type to switch to if we ever ship benchmark capabilities (specialUse + the FOREGROUND_SERVICE_SPECIAL_USE permission + PROPERTY_SPECIAL_USE_FGS_SUBTYPE), - this declaration should be stripped from any Play Store build. No runtime change; only a comment to prevent surprises later. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aggregate_overall returns {} when every run in a (backend, k) cell errored
out (all results have an error field). Callers then subscript with
["median"] / ["p95"] and crash with KeyError.
Today this only happens partially — e.g. GPU k=20 has 24/54 errors but
30 successful runs, so the dict is populated. But a future sweep where
all runs error (entirely possible at higher k once the 4096-token wall
is hit broadly, or if the LLM dies during init) would crash the report
generation.
Switch all four subscript sites in write_report() to .get("median") and
.get("p95") so an empty dict propagates as None, which fmt_ms / fmt_s
already render as "—".
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (4)
evaluation/aggregate_k_sweep.py:110
- For canonical 54-run files this p95 index `int(len(s) * 0.95)` evaluates correctly (51 → the 52nd of 54), but for runs with errors (e.g. the k=20 set with only 30 successes) it yields `int(30*0.95) = 28` → the 29th of 30 sorted, which is the 96.7th percentile, not the 95th. `math.ceil(0.95*n) - 1` (or `statistics.quantiles`) would give the standard nearest-rank p95 across all sample sizes.
s = sorted(vs)
out[c] = {
"n": len(vs),
"median": int(statistics.median(vs)),
"p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),
}
return out
def aggregate_overall(d: dict, key: str) -> dict:
vs = [r[key] for r in d["results"] if not r.get("error")]
if not vs:
return {}
s = sorted(vs)
return {
"n": len(vs),
"median": int(statistics.median(vs)),
"p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),
evaluation/reports/latency_report_v2.md:115
- The "Wall-clock comparison" table mixes runs that completed all 54 queries with runs that errored out partway: at k=20, 24 of 54 runs failed-fast with the context-overflow error before any decode, so the wall time (22.8 / 58.6 min) is artificially low compared to k=15 and is not a meaningful CPU÷GPU comparison. This row should either be flagged in the table (e.g. an asterisk + footnote) or excluded from the CPU÷GPU column.
| 15 | 32.4 | 90.8 | 2.80× |
| 20 | 22.8 | 58.6 | 2.57× |
evaluation/aggregate_k_sweep.py:259
`fmt_ms(ge)` / `fmt_ms(ce)` are reused here just to render an integer count of failed runs. The helper is named `fmt_ms`, and its docstring/sibling `fmt_s` strongly suggest it is for milliseconds. Either rename `fmt_ms` to a generic int-or-dash helper, or call `str(ge) if ge is not None else "—"` inline so the table's intent is clear.
ge = sum(1 for r in gpu_run["data"]["results"] if r.get("error")) if gpu_run else None
ce = sum(1 for r in cpu_run["data"]["results"] if r.get("error")) if cpu_run else None
label = "**0 (no-RAG)**" if k == 0 else str(k)
md.append(f"| {label} | {fmt_ms(ge)} | {fmt_ms(ce)} |")
evaluation/reports/latency_report_v2.md:37
- The headline methodology says "Reported as median across the 54 runs unless noted (p95 in tables marked p95)", but the very next "Headline — Median total query latency" table breaks results down by category (short / medium / long) — i.e. medians over ~18 runs each, not over the full 54. The methodology line should mention the per-category breakdown so readers don't misinterpret the 13.1 / 12.6 / 17.3 values as 54-run statistics.
- `TTFT` excludes retrieval — measured from end-of-retrieval to first generated token.
- `decode` is first-token to last-token.
- `total_query` is everything: `retrieval + TTFT + decode`.
- Reported as median across the 54 runs unless noted (p95 in tables marked `p95`).
## Headline — Median total query latency (seconds)
| k | doc_chars med | GPU short / med / long | CPU short / med / long | CPU÷GPU |
|---:|---:|---:|---:|---:|
| **0 (no-RAG)** | 0 | 12.9 / 15.6 / 16.1 | 27.2 / 26.9 / 29.8 | 1.94× |
| 1 | 561 | 13.1 / 12.6 / 17.3 | 29.3 / 31.9 / 30.3 | 2.14× |
| 3 | 2098 | 18.6 / 18.6 / 21.0 | 37.3 / 44.5 / 42.5 | 2.24× |
| 5 | 3547 | 18.2 / 20.0 / 21.4 | 54.8 / 60.7 / 63.0 | 3.07× |
| 7 | 5139 | 21.3 / 23.2 / 22.8 | 61.4 / 62.3 / 60.4 | 2.72× |
| 10 | 7482 | 22.5 / 20.5 / 20.4 | 61.8 / 70.6 / 77.9 | 3.10× |
| 15 | 11297 | 25.3 / 24.0 / 22.4 | 84.8 / 80.8 / 89.7 | 3.48× |
| 20 | 14520 | 23.9 / 20.5 / 18.5 | 88.7 / 95.6 / 95.6 | 4.46× |
Report claimed "The 8 surviving queries on either side" — but the math doesn't check out: 18 queries × 3 reps = 54 runs, 24 errored = 8 unique queries failed × 3 reps. So 18 − 8 = 10 unique queries survived, not 8. Corrected to "The other 10 queries (10 × 3 reps = 30 successful runs)". Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The service normalises -1 to null but treats any value >= 0 as an
explicit override, so passing --retrieve-k 0 would silently call
RetrievalConfig.create(0, ...) — a confusing footgun. If you want to
disable retrieval entirely, --no-retrieval is the proper flag.
argparse now rejects --retrieve-k < 1 with a clear error pointing to
--no-retrieval. Negative values were already filtered by the service's
takeIf { it >= 0 } but a 0 was slipping through.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the PARTIAL_WAKE_LOCK was acquired in onCreate(), before onStartCommand calls startForegroundCompat. Two problems with that:
1. If the system creates the service but onStartCommand is delayed or never invoked (bind-only path, framework deferral), the wake lock is held without a foreground notification — and Android 12+ can trip the foreground-service-start-while-in-background restriction in that state.
2. Even on the normal path, there is a brief window where the CPU is pinned awake without the user-visible notification that justifies it.
Move the wake-lock acquisition into onStartCommand, immediately AFTER startForegroundCompat. The lock is now strictly paired with the foreground notification's lifetime. Guarded with `wakeLock == null` so duplicate onStartCommand invocations (which can happen via START_NOT_STICKY restarts) don't try to re-acquire.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-thread executor is what ferries pipeline calls off the coroutine dispatchers (LiteRT-LM generation, Gecko retrieval), and scope.cancel() does NOT propagate cancellation into those blocking native calls. A plain executor.shutdown() then returns immediately and leaves the worker thread alive, keeping the :benchmark process running until generation finishes naturally — stopForeground might run with the worker still busy. Use shutdownNow() to interrupt the worker, plus a brief 2 s awaitTermination() so a worker that's tearing down cleanly gets a chance to do so. If it doesn't, the OS will reclaim the process eventually. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
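A sketch of the teardown described above, assuming the executor is a plain java.util.concurrent ExecutorService.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.TimeUnit

// Called from onDestroy. scope.cancel() does not reach into blocking native calls
// (LiteRT-LM generation, Gecko retrieval) running on this worker, so interrupt it.
fun shutDownPipelineExecutor(executor: ExecutorService) {
    executor.shutdownNow()                          // interrupt the busy worker thread
    executor.awaitTermination(2, TimeUnit.SECONDS)  // brief grace period for clean teardown
}
```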
backend_of's fallback was `d["config"].get("backend", "CPU")`, which
silently labels any backend-less JSON as "CPU". Pre-fix files are
handled by the explicit allowlist; the silent default only fires on
"unexpected" JSONs — which is exactly when a future regression in
BenchmarkForegroundService (e.g. metadata accidentally dropped) would
slip past us.
Now: if config.backend is missing AND the file isn't on the pre-fix
allowlist, print a warning to stderr explaining the assumption and
defaulting to "CPU". Post-fix JSONs always carry the field, so this
warning only fires when something is genuinely off.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Python wrapper rejects --no-retrieval + --rag-only via parser.error() before the launch ever fires. But a direct `am start --ez skip_retrieval true --ez rag_only true ...` bypasses Python entirely and silently runs in No-RAG mode (skipRetrieval wins) with no indication anything is off. Add a Log.w at the same priority as other [BENCHMARK] markers so the mismatch is visible in `adb logcat -s mam-ai-bench:W` output during debugging. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 6 h ceiling could expire mid-sweep on long CPU runs — the full GPU
+ CPU k-sweep documented in latency_report_v2.md took ~7 h end-to-end,
and a CPU-only sweep across k ∈ {1, 3, 5, 7, 10, 15} hit similar
totals. When the lock auto-released, the OS could idle the CPU and
the benchmark would silently stall (no failure log, just no progress).
Bump to 24 h, with a comment that anything longer should switch to
periodic re-acquire instead of bumping the constant further.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scope.launch { runBenchmark(...) } was unconditional, so a re-delivered
intent (another `am start` before stopSelf() completes) would spawn a
second coroutine running on the same single-threaded executor and
writing to the same benchmark_results.json — both timings and the
output JSON would be corrupted.
Add a `benchmarkStarted` volatile flag that's set on first entry. Any
later onStartCommand call returns immediately with a logcat warning,
keeping the in-flight run intact.
START_NOT_STICKY makes this unlikely in practice, but the right belt-
and-braces fix is cheap and removes a race even on edge cases (e.g.
process recreated after low-memory kill before stopSelf returned).
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
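A sketch of the guard with everything except the flag and the warning stripped out; the log tag matches the bench tag mentioned elsewhere in the PR, and the class name is invented.

```kotlin
import android.app.Service
import android.content.Intent
import android.os.IBinder
import android.util.Log

// Sketch of the re-entrancy guard only; notification, wake-lock, and benchmark
// plumbing are omitted.
class GuardedBenchmarkServiceSketch : Service() {

    @Volatile
    private var benchmarkStarted = false

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        if (benchmarkStarted) {
            // A second `am start` while a sweep is in flight would otherwise corrupt
            // both the timings and benchmark_results.json.
            Log.w("mam-ai-bench", "[BENCHMARK] Already running; ignoring duplicate start")
            return START_NOT_STICKY
        }
        benchmarkStarted = true
        // ... startForeground + launch the benchmark coroutine here ...
        return START_NOT_STICKY
    }

    override fun onBind(intent: Intent?): IBinder? = null
}
```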
The catch handler in runQuery() was the only [BENCHMARK] log line in the file using TAG="mam-ai" instead of BENCH_TAG="mam-ai-bench". This made `adb logcat -s mam-ai-bench:E` filter out exactly the messages most worth surfacing — query-level errors. One-line fix. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
load_runs() was silently dropping JSONs that didn't match the 54-run canonical-sweep shape (missing config/results keys, or < 30 results which is the smoke-test guard). For users running a narrow sweep (e.g. --filter long_01 --repeats 3 yields 3 results), the file would silently never appear in the report with no indication why. Log SKIP lines to stderr with the file name and reason. The output is still clean for normal runs (only emits when something is actually dropped). Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Methodology paragraph hardcoded "18 queries × … 10-second cooldown"
even though the JSONs carry the actual config.repeats, config.cooldown_ms,
and a results count that proves the math. If a future run uses different
defaults (or this script is pointed at a different sweep), the
methodology text would silently lie.
Now reads from the sample run's config: pulls repeats, cooldown_ms,
results count; infers (queries × modes) from results / repeats. Output
for the current data set is unchanged ("18 (query × mode) cells × 3
repeats = 54 timed runs … 10-second cooldown") because that's what the
JSONs actually say.
Per Copilot review on PR #57.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous formula `int(len(s) * 0.95)` collapses to max for any sample size n < 20 — e.g. for the per-category short bucket with 24 runs, int(24*0.95) = 22 (the 23rd of 24 sorted values), which is close to but not actually the 95th percentile. For a hypothetical narrower sample with n=3 (e.g. single-query small sweep), int(2.85) = 2 = the max, so p95 == max by construction. Centralise the calculation in a `_p95(values)` helper that uses `statistics.quantiles(values, n=20, method="exclusive")[18]` — the linear-interpolation 95th percentile from a 20-quantile partition. Falls back to max only when n < 2 (genuinely no quantiles to compute). Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stopForeground(boolean) has been deprecated since Android 13. Replace with the SDK-aware form: STOP_FOREGROUND_REMOVE on API 24+ (where the int overload was introduced), fall back to the boolean variant only on older devices (where it isn't deprecated). Drops the @Suppress("DEPRECATION") on the modern path; we still suppress on the legacy path because the boolean variant *is* the non-deprecated API there. Per Copilot review on PR #57. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
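The SDK-aware form, roughly; `stopForegroundCompat` as an extension name is an assumption.

```kotlin
import android.app.Service
import android.os.Build

// stopForeground(Int) and STOP_FOREGROUND_REMOVE exist from API 24; below that the
// boolean overload is the non-deprecated API, so suppression is confined to that path.
fun Service.stopForegroundCompat() {
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.N) {
        stopForeground(Service.STOP_FOREGROUND_REMOVE)
    } else {
        @Suppress("DEPRECATION")
        stopForeground(true)
    }
}
```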
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). The result is a clean cell-by-cell comparison.
Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s extra wait per query), much less than the ~3× I'd estimated from the single-data-point Step 3 measurement. That earlier number was wrong — I had accidentally been comparing FP32-E4B against FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.
The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0 s, FP32 TTFT 2.0–9.8 s (~2–2.5× across all k)
- FP16 decode 11–18 s, FP32 decode 12–19 s (essentially identical)
Mechanism: prefill is compute-bound (one parallel forward pass over the input), so FP16's 2× arithmetic throughput on Adreno helps a lot. Decode is bandwidth-bound (sequential token-at-a-time loading of weights), so the FP16/FP32 precision choice barely matters.
Same 24 errors at k=20 on FP32 GPU as on FP16 GPU — the prompt-cap rejection at maxNumTokens=4096 is precision-agnostic, just a config check.
What this means: **FP32 GPU is a real shipping option, not just an experiment.** At maxNumTokens=4096 the latency cost is ~25% (no quality benefit — we're below the FP16 cliff anyway). At higher maxNumTokens (e.g. 5500), FP32 GPU enables clean output past the FP16 cliff at the same ~25% latency hit. The memory ceiling caps maxNumTokens at ~6500–7500 on this 16 GB device, since the KV cache doubles vs FP16. The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin tradeoff at the deployment level, not a feasibility question.
Updates: new Step 5 section with full sweep table + corrected slowdown narrative; updated TL;DR bullet on FP32 latency cost; "What's still open" table marks the FP32 latency curve resolved; full 8-JSON inventory added to References.
Summary
- `--retrieve-k` override so the latency benchmark can sweep retrieval depth without rebuilding the APK. One build + install, then loop the Python wrapper across k values.
- `BenchmarkActivity` → new `BenchmarkForegroundService` with a `PARTIAL_WAKE_LOCK` and sticky notification, so multi-hour sweeps survive screen-off + device-lock. (OPPO ColorOS also requires the per-app battery-optimization whitelist set in Settings — documented in the report.)
- `--rag-only` flag to skip the redundant No-RAG mode during k-sweeps (No-RAG is k-invariant; only needs to be measured once).
- Bug fixes: the UI-thread ANR (`Thread.sleep` → `delay()`), a stale `gemma-3n` model-file check in the pre-flight script, and a hard-coded `"backend":"CPU"` JSON-metadata bug that mislabeled all GPU runs.
- New `evaluation/aggregate_k_sweep.py` aggregation script; renders `evaluation/reports/latency_report_v2.md` — the full per-k GPU↔CPU comparison.
Headline findings
On OPPO Snapdragon 8 Elite + Gemma 4 E4B + LiteRT-LM 0.11.0 (16 canonical 54-run benchmarks):
Test plan
- Build (`flutter build apk --release`) → CPU. With `-PuseGpuForLlm=true` → GPU. Verify the `[BACKEND] *** LLM running on … ***` log line on startup matches.
- `python evaluation/benchmark_latency.py --filter medium_01 --repeats 1 --rag-only --retrieve-k 5`. Verify the JSON output has `num_retrieved_docs == 5`, a populated `retrieved_chunks` array with chunk text/source/page/chars, and a populated `response_text`. Filename contains `_k5`.
- `python evaluation/aggregate_k_sweep.py` regenerates `evaluation/reports/latency_report_v2.md` from the JSONs in `evaluation/latency_results/`.
Commits (10)
```
4daf626 analysis: add CPU k=20 — confirms 4096-token wall is backend-invariant
ede273f analysis: k-sweep latency report (GPU + CPU on Snapdragon 8 Elite)
ef96538 fix(benchmark): record actual backend (GPU/CPU) in config metadata
7ac3b36 refactor(benchmark): move benchmark to a foreground service
12fd358 fix(benchmark): keep CPU alive through screen-off (PARTIAL_WAKE_LOCK + Default dispatcher)
795ac84 fix(benchmark): use suspending delay() instead of Thread.sleep()
197a7bc feat(benchmark): add --rag-only flag to skip No-RAG mode
33604df fix(benchmark): update model-file check to match production stack
8848bee feat(benchmark): capture retrieved chunks + response text per run
fd85cd7 feat(benchmark): add --retrieve-k override for per-k latency sweep
```
🤖 Generated with Claude Code