
feat(benchmark): per-k latency sweep infrastructure + GPU/CPU report #57

Merged
nmrenyi merged 30 commits into
main from feat/benchmark-retrieve-k-override
May 15, 2026

Conversation

@nmrenyi
Owner

@nmrenyi nmrenyi commented May 14, 2026

Summary

  • Adds a runtime --retrieve-k override so the latency benchmark can sweep retrieval depth without rebuilding the APK. One build + install, then loop the Python wrapper across k values.
  • Refactors BenchmarkActivity → new BenchmarkForegroundService with a PARTIAL_WAKE_LOCK and sticky notification, so multi-hour sweeps survive screen-off + device-lock. (OPPO ColorOS also requires the per-app battery-optimization whitelist set in Settings — documented in the report.)
  • Captures per-run retrieved chunks + full response text in the benchmark JSON, not just lengths — enables downstream answer-quality analysis at any k.
  • --rag-only flag to skip the redundant No-RAG mode during k-sweeps (No-RAG is k-invariant; only needs to be measured once).
  • Fixes a pre-existing ANR (Thread.sleep → delay()), a stale gemma-3n model-file check in the pre-flight script, and a hard-coded "backend":"CPU" JSON-metadata bug that mislabeled all GPU runs.
  • New evaluation/aggregate_k_sweep.py aggregation script; renders evaluation/reports/latency_report_v2.md — full per-k GPU↔CPU comparison.
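The one-build-many-runs workflow in the first bullet can be sketched as a thin wrapper loop. This builds the command lines only (the script path is from this PR; the k-values match the sweep described below — executing each via `subprocess.run(cmd, check=True)` is left out):

```python
def sweep_commands(k_values):
    # One APK build + install up front, then one wrapper invocation per k.
    base = ["python", "evaluation/benchmark_latency.py", "--rag-only", "--retrieve-k"]
    return [base + [str(k)] for k in k_values]

cmds = sweep_commands([1, 3, 5, 7, 10, 15, 20])
```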

Headline findings

On OPPO Snapdragon 8 Elite + Gemma 4 E4B + LiteRT-LM 0.11.0 (16 canonical 54-run benchmarks):

  • GPU is 13–19× faster on TTFT than CPU and 2–4× faster on total latency. Decode is memory-bandwidth-bound (within ~1.4× across backends); the GPU win is entirely in compute-heavy prefill.
  • The 4096-token context window is the binding upper limit, not latency. At k=20, the same 24 of 54 runs (same 8 queries × 3 reps) fail identically on both GPU and CPU with `Input token ids are too long … >= 4096`. The cap is a property of the `gemma-4-E4B-it.litertlm` artifact, not a runtime config and not backend-dependent. k_max ≈ 17–18.
  • All GPU totals at k ≤ 15 stay between 13 s (no-RAG) and 25 s — well under any reasonable UX budget. CPU is unusable past k≈3 for a 60 s budget; at k=15 CPU p95 hits 113 s.

Test plan

  • Build default (flutter build apk --release) → CPU. With -PuseGpuForLlm=true → GPU. Verify the [BACKEND] *** LLM running on … *** log line on startup matches.
  • python evaluation/benchmark_latency.py --filter medium_01 --repeats 1 --rag-only --retrieve-k 5. Verify the JSON output has num_retrieved_docs == 5, populated retrieved_chunks array with chunk text/source/page/chars, and populated response_text. Filename contains `_k5`.
  • Screen-off survival: launch a multi-query benchmark, lock the device after init, confirm runs continue. Requires Settings → Battery → App Battery Management → MAM-AI → "Allow background activity" on OPPO/Xiaomi devices (otherwise OplusHansManager freezes the process at the OS level).
  • python evaluation/aggregate_k_sweep.py regenerates evaluation/reports/latency_report_v2.md from the JSONs in evaluation/latency_results/.
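The JSON checks in the second bullet can be automated. A minimal sketch (helper name hypothetical; the field names are the ones this PR writes into the benchmark JSON):

```python
import json

def verify_k_override(path, expected_k):
    # Post-run sanity check mirroring the test plan above.
    with open(path) as f:
        data = json.load(f)
    run = data["results"][0]
    assert run["num_retrieved_docs"] == expected_k
    assert run["retrieved_chunks"]      # chunk text/source/page/chars present
    assert run["response_text"]         # full response captured
    return True
```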

Commits (10)

```
4daf626 analysis: add CPU k=20 — confirms 4096-token wall is backend-invariant
ede273f analysis: k-sweep latency report (GPU + CPU on Snapdragon 8 Elite)
ef96538 fix(benchmark): record actual backend (GPU/CPU) in config metadata
7ac3b36 refactor(benchmark): move benchmark to a foreground service
12fd358 fix(benchmark): keep CPU alive through screen-off (PARTIAL_WAKE_LOCK + Default dispatcher)
795ac84 fix(benchmark): use suspending delay() instead of Thread.sleep()
197a7bc feat(benchmark): add --rag-only flag to skip No-RAG mode
33604df fix(benchmark): update model-file check to match production stack
8848bee feat(benchmark): capture retrieved chunks + response text per run
fd85cd7 feat(benchmark): add --retrieve-k override for per-k latency sweep
```

🤖 Generated with Claude Code

nmrenyi and others added 10 commits May 14, 2026 14:24
Lets the latency benchmark vary retrieval top_k without rebuilding the APK
or editing runtime_config.json. One build + install up front, then sweep
k via the CLI flag — needed to bound k_max on the Snapdragon 8 Elite + GPU
stack now that TTFT (~1–2 s at k=3) is no longer the binding constraint.

Wiring:
- RagPipeline.generateResponse() gains an optional retrieveKOverride:
  Int? parameter (default null). When non-null it replaces
  retrievalConfig.top_k for that call only; production callers leave it
  null. Param added at the end of the list so RagStream's positional call
  is unaffected.
- BenchmarkActivity reads a "retrieve_k" Intent extra (-1 sentinel = no
  override), threads it through runBenchmark → runQuery →
  generateResponse, and records "retrieval_top_k_override" in the
  config block of the results JSON.
- benchmark_latency.py adds --retrieve-k N, forwards via
  am start --ei retrieve_k N, and appends "_kN" to the output filename
  so a sweep across k values is legible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
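
A minimal Python sketch of the wrapper-side wiring above. The component name is a placeholder; the `--ei` extra and `_kN` filename suffix follow the commit message:

```python
ACTIVITY = "com.example.app/.BenchmarkActivity"  # placeholder component name

def launch_cmd(retrieve_k=None):
    # Forward the per-call top_k override as an integer Intent extra.
    cmd = ["adb", "shell", "am", "start", "-n", ACTIVITY]
    if retrieve_k is not None:
        cmd += ["--ei", "retrieve_k", str(retrieve_k)]
    return cmd

def output_name(stem, retrieve_k=None):
    # Append "_kN" so a sweep across k values is legible on disk.
    return f"{stem}_k{retrieve_k}.json" if retrieve_k is not None else f"{stem}.json"
```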
Previously the benchmark recorded only counts/lengths (num_retrieved_docs,
response_length_chars). For the per-k latency sweep we want to see what
the retriever actually surfaced at each k and what the model generated —
both for content review and because the total chunk-text length is what
drives prefill cost as k grows.

New per-run fields in the results JSON:
- retrieved_chunks: array of {text, source, page, chars} for every chunk
  the retriever returned. Lets us inspect what changed as k grew.
- retrieved_total_chars: sum of chunk text lengths. The real
  prompt-length proxy (vs. query_word_count which is static).
- response_text: full model response (the generation listener was
  already accumulating it; we now record the final string).

Size: at k=20 with ~1500-3000 chars/chunk + ~3KB response, per-run
overhead is ~30-60KB. A full 108-run benchmark file grows from ~50KB to
~4-6MB. Acceptable; can add a --no-content opt-out later if needed.

Note: Gemma 4 E4B doesn't emit a separate reasoning channel — any inline
reasoning the model writes shows up in response_text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
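
The new per-run fields can be assembled as in this sketch (illustrative helper; field names as listed above):

```python
def enrich_run_record(record, retrieved_chunks, response_text):
    # retrieved_total_chars is the prompt-length proxy that drives
    # prefill cost as k grows.
    record["retrieved_chunks"] = retrieved_chunks
    record["retrieved_total_chars"] = sum(c["chars"] for c in retrieved_chunks)
    record["response_text"] = response_text
    return record
```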
check_models_downloaded() was looking for gemma-3n-E4B-it-int4.task —
left over from the pre-Gemma-4 era. config/app_config.json now declares
"llm_model": "gemma-4-E4B-it.litertlm", so the script's pre-flight check
falsely failed even when the right model was on device.

Caught during the smoke test for the --retrieve-k override feature: the
script aborted before launching BenchmarkActivity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For the k-sweep, the No-RAG baseline doesn't change with k (retrieval is
disabled, so the override is ignored). Without this flag, running the
sweep at 7 k-values would re-run the identical 54 No-RAG measurements
seven times — ~1.5 hours of redundant work.

The Intent extra "rag_only" (bool, default false) tells BenchmarkActivity
to run only the RAG mode. Mutually exclusive with skip_retrieval, which
wins if both are set. Python --rag-only forwards via am start --ez.

Recorded in the session config JSON as "rag_only" so reruns are
unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
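
The mode-selection logic described above, sketched in Python (the on-device code is Kotlin; this mirrors its behavior):

```python
def retrieval_modes(skip_retrieval, rag_only):
    # skip_retrieval wins when both are set, matching the on-device rule.
    if skip_retrieval:
        return [False]          # No-RAG only
    if rag_only:
        return [True]           # RAG only — skips the k-invariant baseline
    return [True, False]        # default: measure both modes
```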
Thread.sleep(cooldownMs) was running on the UI thread (BenchmarkActivity
is an Activity; the calling coroutine was scope.launch(Dispatchers.Main)
by default). With cooldown ≥ 5000 ms, Android's input-dispatching timeout
fires and the activity gets killed mid-sweep — exactly what happened on
the OPPO Snapdragon 8 Elite at cooldown=10000:

  AnrInfo{reason='Input Dispatching Timeout',
    stackTrace='at java.lang.Thread.sleep(...)
                at BenchmarkActivity.runBenchmark(BenchmarkActivity.kt:279)'}

delay() is a suspending function that doesn't block the underlying
thread, so the UI stays responsive while the benchmark waits. Both
cooldown call sites (post-init and between-runs) are inside the same
suspend coroutine, so this is a drop-in replacement.

The existing latency_report.md used cooldown=10000 on Pixel 7 without
issue, which suggests the older Tensor G2 chipset had laxer ANR
enforcement or Athena memory-kill behavior — either way, the
correct-by-construction fix is to never block the UI thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ Default dispatcher)

The benchmark coroutine was running on Dispatchers.Main, so as soon as the
device screen went off (~10 min on OPPO) the activity backgrounded and the
coroutine's delay() never resumed. The process stayed alive but stopped
making progress — the Python wrapper hung waiting for [BENCHMARK] COMPLETE
that would never come. Observed on the OPPO Snapdragon 8 Elite mid-sweep:
~24 min between the last [BENCHMARK] log line and the most recent Athena
heartbeat, with mWakefulness=Asleep.

Two changes:

1. Acquire a PARTIAL_WAKE_LOCK in onCreate (released in onDestroy). Keeps
   the CPU running even when the screen is off; the screen itself can
   still sleep. 6-hour failsafe timeout. Required permission added to
   AndroidManifest.xml — used only by BenchmarkActivity.

2. Switch the coroutine scope from Dispatchers.Main to Dispatchers.Default.
   The benchmark logic doesn't touch the UI directly (logStatus already
   marshals back to Main via runOnUiThread), so there's no reason to run
   on Main — doing so just makes the work pause whenever the activity
   loses focus. Default keeps running in any lifecycle state.

These together let the sweep run while the device is screen-off or
locked. Screen lifespan and battery thank you.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPPO ColorOS aggressively freezes background activities and even
force-releases PARTIAL_WAKE_LOCKs held by plain Activities (visible as
'add wakelock … to ForceReleaseWakeLock list' in OplusProxyWakeLock).
Foreground services with a sticky notification are respected — once the
app is also whitelisted in Settings → Battery → App Battery Management
("Allow background activity"), the benchmark runs cleanly with the
screen off and the device locked.

Architecture:
- BenchmarkForegroundService (NEW) — holds the wake lock, posts a
  sticky progress notification, and runs the entire benchmark loop.
  Uses Dispatchers.Default so it isn't tied to a UI thread. Stops
  itself when done; the OS reclaims the process.
- BenchmarkActivity — reduced from ~470 lines to ~60. Now a thin
  launcher: receives `am start` Intent extras, forwards them to the
  service via startForegroundService(), and finishes immediately.
  Existing Python wrapper (benchmark_latency.py) is unchanged — it
  still launches the Activity and reads progress from logcat.
- AndroidManifest — registers the new service with
  foregroundServiceType="dataSync" (reuses the existing
  FOREGROUND_SERVICE_DATA_SYNC permission) and android:process=
  ":benchmark" (same isolated process as BenchmarkActivity).

Verified end-to-end on OPPO Snapdragon 8 Elite (OPD2413): launched the
benchmark, locked the screen 30 s in, watched a full medium_01 RAG run
complete with `hans_freeze=0` Athena freeze events and the screen
asleep through the entire decode phase. TTFT 1025 ms, total 13.8 s —
within the same envelope as foreground runs.

Pre-flight on a new device:
  Settings → Battery → App Battery Management → MAM-AI
                                              → "Allow background activity"

Without that, OPPO's OplusHansManager freezes the process at the OS
level regardless of foreground-service status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The config block dump in benchmark_results.json was hard-coded to
"backend":"CPU" — wrong for any build with useGpuForLlm=true. The
GPU-sweep JSONs we just ran on the OPPO Snapdragon 8 Elite all carry
the incorrect "CPU" label even though they were measured on GPU.

Now reads from BuildConfig.USE_GPU_FOR_LLM at compile time and writes
"GPU" or "CPU" accordingly. Also adds "mtp_enabled" from
BuildConfig.USE_MTP_FOR_LLM for full provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aggregates the 15 canonical 54-run benchmark JSONs into a single
GPU↔CPU comparison report covering k ∈ {0, 1, 3, 5, 7, 10, 15} (and
GPU-only k=20). Produced by a new evaluation/aggregate_k_sweep.py
that's re-runnable as more JSONs land.

Headline numbers (median total query latency on OPPO OPD2413,
Snapdragon 8 Elite, Gemma 4 E4B, LiteRT-LM 0.11.0):

  k=0 (no-RAG):  GPU 13–16 s | CPU 27–30 s    (1.9× slower on CPU)
  k=3         :  GPU 19–21 s | CPU 37–45 s    (2.2× slower)
  k=10        :  GPU 21–22 s | CPU 62–78 s    (3.1× slower)
  k=15        :  GPU 22–25 s | CPU 81–90 s    (3.5× slower)
  k=20        :  GPU 44% of runs fail at the 4096-token model ceiling

Key findings:
- GPU is the practical choice for this device tier — TTFT is 13–19×
  faster than CPU; total latency is 2–3.5× faster.
- The model's 4096-token context window is the binding upper limit
  (k_max ≈ 17–18), not latency. GPU has comfortable headroom below
  that ceiling.
- CPU is unusable past k≈3 for any reasonable UX budget. At k=15,
  CPU p95 latency hits 113 s.
- Decode is memory-bandwidth-bound (GPU/CPU within ~1.4×); the GPU
  win is entirely in compute-heavy prefill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
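
A sketch of the aggregator's grouping step (illustrative field names; the real script reads the backend and k out of each JSON's config block):

```python
from collections import defaultdict

def group_runs(runs):
    # Bucket runs into (backend, k) cells for the GPU↔CPU comparison tables.
    cells = defaultdict(list)
    for r in runs:
        cells[(r["backend"], r["k"])].append(r)
    return cells
```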
CPU k=20 reproduces the GPU k=20 failure pattern exactly:
- 24/54 errors on both backends (44% failure rate)
- Identical 8 queries fail on both (long_01, long_03, medium_02,
  medium_04, short_01, short_03, short_04, short_05)
- Same 24 (query × rep) pairs across both runs

This is direct evidence that the 4096-token context cap is a property
of the .litertlm model artifact itself — not a runtime config, not a
backend choice. Strengthens finding #2 from "model is the ceiling, GPU
specifically hits it" to "model is the ceiling, both backends hit it
identically."

Successful CPU k=20 runs: TTFT 65–73 s, total 89–96 s — well past any
deployment budget even when the request fits the window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
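
The backend-invariance claim can be checked mechanically; a sketch with an illustrative `query_id` field name:

```python
def failing_queries(results):
    # Set of query labels whose runs carry an error field.
    return {r["query_id"] for r in results if r.get("error")}

# Backend-invariance check: identical failure sets on GPU and CPU imply
# the cap lives in the model artifact, not the backend:
#   failing_queries(gpu_results) == failing_queries(cpu_results)
```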
Copilot AI review requested due to automatic review settings May 14, 2026 23:49

Copilot AI left a comment


Pull request overview

Adds infrastructure for per-k retrieval-depth latency sweeps on the Android RAG app: a runtime --retrieve-k override, a foreground-service-based benchmark harness that survives screen-off, richer per-run JSON output, and a Python aggregator that emits a GPU↔CPU markdown report. Also fixes a few latent bugs (Thread.sleep ANR, hard-coded "backend":"CPU", stale model filename in pre-flight).

Changes:

  • New BenchmarkForegroundService (with PARTIAL_WAKE_LOCK + sticky notification, in :benchmark process); BenchmarkActivity is now a thin shim that forwards Intent extras and finishes.
  • RagPipeline.generateResponse gains an optional retrieveKOverride; benchmark JSON now records full retrieved_chunks/response_text and the actual GPU/CPU backend.
  • New evaluation/aggregate_k_sweep.py aggregator + generated evaluation/reports/latency_report_v2.md; benchmark_latency.py adds --retrieve-k, --rag-only, k-suffixed output filenames, and updated model filename.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| app/android/app/src/main/AndroidManifest.xml | Adds WAKE_LOCK permission and BenchmarkForegroundService declaration in :benchmark process. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkActivity.kt | Reduced to ~60-line shim that forwards extras to the new service. |
| app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt | New service hosting the entire benchmark loop, wake lock, notification, and JSON output. |
| app/android/app/src/main/kotlin/com/example/app/RagPipeline.kt | Adds retrieveKOverride param to generateResponse for per-call top_k override. |
| evaluation/benchmark_latency.py | Adds --retrieve-k/--rag-only flags, k-suffixed filename, updated model file check. |
| evaluation/aggregate_k_sweep.py | New aggregator that groups runs by (backend, k) and renders the v2 report. |
| evaluation/reports/latency_report_v2.md | Generated GPU↔CPU latency-sweep report. |


Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/benchmark_latency.py
Comment thread app/android/app/src/main/AndroidManifest.xml
Comment thread evaluation/aggregate_k_sweep.py Outdated
Real fixes:

- BenchmarkForegroundService: shut down the single-thread executor in
  onDestroy. Without this its worker thread keeps the :benchmark process
  alive after the service stops.
- BenchmarkForegroundService: remove dead `pct` variable left over from
  the Activity-era ASCII progress bar.
- AndroidManifest: stale comment said BenchmarkActivity holds the wake
  lock; updated to reflect the foreground-service refactor.
- benchmark_latency.py: error out if both --no-retrieval and --rag-only
  are passed (previously they silently coexisted; on-device skipRetrieval
  won, but the result was confusing).
- aggregate_k_sweep.py:
  * backend_of() now only overrides recorded="CPU" when the timestamp
    predates the metadata fix. Future GPU runs (which write backend="GPU"
    correctly) and future CPU runs are trusted as-is — fixes the silent
    mislabeling Copilot flagged.
  * Drop the May-2026-only glob (`benchmark_2026051*`) — use
    `benchmark_*.json` and rely on the schema/length filters.
  * Use `with open(...)` context manager — avoid file-handle leak.
  * Rename `avg_doc_chars` → `median_doc_chars` (function used median
    despite the name).
  * Remove dead loop `for col, run, key in [...]: pass`.
  * Update module docstring to describe the new backfill-only logic.

Verified: Kotlin still compiles (flutter build apk --release succeeds);
aggregate script still loads all 16 canonical runs and regenerates the
same report; the new mutual-exclusion error fires when both Python flags
are passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (3)

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:99

  • retrieveKOverride is gated only with takeIf { it >= 0 }, so passing --ei retrieve_k 0 is forwarded as a 0 to RagPipeline.generateResponse, which sets effectiveTopK = 0 and calls RetrievalConfig.create(0, …). The Python wrapper also has no validation on --retrieve-k (it's a plain int arg). The semantically correct way to "retrieve nothing" in this codebase is --no-retrieval / useRetrieval=false; a top_k=0 retrieval call has undefined behavior in MediaPipe's RetrievalConfig. Recommend rejecting non-positive values either in the Python CLI (type=positive_int) or in the service (takeIf { it > 0 }).
        val retrieveKOverride: Int? = intent?.getIntExtra("retrieve_k", -1)?.takeIf { it >= 0 }

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:240

  • loopStart is computed but never read after the activity-side ETA reporting was removed. Dead code — safe to delete.
        val loopStart = System.currentTimeMillis()

app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt:234

  • The KDoc says "skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins)," and the Python wrapper enforces this with parser.error(...). However, the service itself silently chooses skipRetrieval over ragOnly if both extras are passed via am start directly. A user invoking am start by hand could end up surprised. Consider logging a warning (or refusing to start) when both flags are true rather than silently picking one.
        // skipRetrieval and ragOnly are mutually exclusive (skipRetrieval wins).
        val retrievalModes = when {
            skipRetrieval -> listOf(false)
            ragOnly -> listOf(true)
            else -> listOf(true, false)
        }

Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
nmrenyi and others added 6 commits May 15, 2026 08:13
Replace truthy `if gw else "—"` style with `if gw is not None else "—"`,
and add `> 0` guards on division ratios. Defensive against a hypothetical
0-valued median from a corrupted/aborted run JSON, which the truthy form
would have silently rendered as "—" instead of "0.0".

Affects four spots: TTFT ratio, decode ratio, wall-clock ratio +
formatting, and the headline-table ratio.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
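
The guarded formatting described above, as a sketch (helper name hypothetical):

```python
def fmt_ratio(gpu_ms, cpu_ms):
    # "—" only for genuinely missing values; the > 0 guard keeps a
    # 0-valued median from a corrupted run out of the division.
    if gpu_ms is None or cpu_ms is None or gpu_ms <= 0:
        return "—"
    return f"{cpu_ms / gpu_ms:.2f}×"
```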
The timestamp threshold (`if timestamp < "20260514T2130": return "GPU"`)
silently rewrites any pre-threshold CPU JSON as GPU. Anyone running this
aggregator with their own historical genuine-CPU runs in
latency_results/ would have those mislabeled as GPU and potentially
double-counted via the "most successful entries" tiebreaker in
write_report.

Replace with `PRE_FIX_GPU_FILES`: a frozenset of the exact 8 filenames
known to predate the metadata fix in commit ef96538. Any file not in
the allowlist uses its recorded backend value. Anyone else's historical
files are unaffected.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
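
A sketch of the allowlist-based backend_of described above (the two filenames are placeholders, not the real 8-file allowlist):

```python
PRE_FIX_GPU_FILES = frozenset({
    # Placeholder names; the real frozenset lists the exact 8 files
    # known to predate the metadata fix in commit ef96538.
    "benchmark_20260514T0101.json",
    "benchmark_20260514T0202.json",
})

def backend_of(filename, config):
    if filename in PRE_FIX_GPU_FILES:
        return "GPU"   # recorded "CPU" label is known-wrong for these
    return config.get("backend", "CPU")  # everyone else is trusted as-is
```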
The service runs in android:process=":benchmark", separate from the main
app process. RagPipeline(application) here therefore constructs an
entirely fresh pipeline (Gecko + SQLite + LLM load) in that benchmark
process, not the main app's. Worth documenting because:

1. Application.onCreate() will run a second time when the benchmark
   process spawns.
2. If the main app has the LLM loaded simultaneously, two LLM instances
   may briefly contend for GPU/memory during init.

Add a "Process model" section to the class KDoc explaining both.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "-1 = use config default" semantic for retrieve_k is non-obvious and
wasn't documented anywhere. Update both KDoc blocks (BenchmarkActivity
and BenchmarkForegroundService) to spell out:

- retrieve_k: pass any value >= 0 to override; pass -1 (or omit) to use
  runtime_config.json's value. The activity normalises -1 → null before
  forwarding to the service.
- repeats default 3, cooldown_ms default 5000 (were missing).
- skip_retrieval and rag_only are mutually exclusive; skip_retrieval
  wins if both are set.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
foregroundServiceType="dataSync" is documented for "transferring data
between a device and the cloud or other peers" — clearly a misuse for an
on-device latency benchmark. Google Play's FGS-type enforcement would
reject this on submission.

Acceptable here because BenchmarkForegroundService is only launched via
`adb shell am start` for in-house benchmarking and never appears in any
user-facing flow. Add an explicit DEV-ONLY comment documenting:

- the type is technically wrong,
- the correct type to switch to if we ever ship benchmark capabilities
  (specialUse + the FOREGROUND_SERVICE_SPECIAL_USE permission +
  PROPERTY_SPECIAL_USE_FGS_SUBTYPE),
- this declaration should be stripped from any Play Store build.

No runtime change; only a comment to prevent surprises later.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aggregate_overall returns {} when every run in a (backend, k) cell errored
out (all results have an error field). Callers then subscript with
["median"] / ["p95"] and crash with KeyError.

Today this only happens partially — e.g. GPU k=20 has 24/54 errors but
30 successful runs, so the dict is populated. But a future sweep where
all runs error (entirely possible at higher k once the 4096-token wall
is hit broadly, or if the LLM dies during init) would crash the report
generation.

Switch all four subscript sites in write_report() to .get("median") and
.get("p95") so an empty dict propagates as None, which fmt_ms / fmt_s
already render as "—".

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
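
The empty-cell-safe pattern described above, sketched (simplified to the median only):

```python
import statistics

def aggregate_overall(results, key):
    vs = [r[key] for r in results if not r.get("error")]
    if not vs:
        return {}          # every run in this cell errored out
    return {"n": len(vs), "median": int(statistics.median(vs))}

# .get() propagates the empty dict as None instead of a KeyError crash;
# the fmt_* helpers already render None as "—".
stats = aggregate_overall([{"total_ms": 100, "error": "oom"}], "total_ms")
median = stats.get("median")
```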
Copilot AI review requested due to automatic review settings May 15, 2026 00:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (4)

evaluation/aggregate_k_sweep.py:110

  • For canonical 54-run files this p95 index int(len(s) * 0.95) evaluates correctly (51 → 52nd of 54), but for runs with errors (e.g. the k=20 set with only 30 successes) it yields int(30*0.95)=28 → the 29th of 30 sorted, which is the 96.7th percentile, not the 95th. math.ceil(0.95*n)-1 (or statistics.quantiles) would give the standard nearest-rank p95 across all sample sizes.
        s = sorted(vs)
        out[c] = {
            "n": len(vs),
            "median": int(statistics.median(vs)),
            "p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),
        }
    return out


def aggregate_overall(d: dict, key: str) -> dict:
    vs = [r[key] for r in d["results"] if not r.get("error")]
    if not vs:
        return {}
    s = sorted(vs)
    return {
        "n": len(vs),
        "median": int(statistics.median(vs)),
        "p95": int(s[min(len(s) - 1, int(len(s) * 0.95))]),

evaluation/reports/latency_report_v2.md:115

  • The "Wall-clock comparison" table mixes runs that completed all 54 queries with runs that errored out partway: at k=20, 24 of 54 runs failed-fast with the context-overflow error before any decode, so the wall time (22.8 / 58.6 min) is artificially low compared to k=15 and is not a meaningful CPU÷GPU comparison. This row should either be flagged in the table (e.g. an asterisk + footnote) or excluded from the CPU÷GPU column.
| 15 | 32.4 | 90.8 | 2.80× |
| 20 | 22.8 | 58.6 | 2.57× |

evaluation/aggregate_k_sweep.py:259

  • fmt_ms(ge) / fmt_ms(ce) are reused here just to render an integer count of failed runs. The helper is named fmt_ms and its docstring/sibling fmt_s strongly suggest it is for milliseconds. Either rename fmt_ms to a generic int-or-dash helper, or call str(ge) if ge is not None else "—" inline so the table's intent is clear.
        ge = sum(1 for r in gpu_run["data"]["results"] if r.get("error")) if gpu_run else None
        ce = sum(1 for r in cpu_run["data"]["results"] if r.get("error")) if cpu_run else None
        label = "**0 (no-RAG)**" if k == 0 else str(k)
        md.append(f"| {label} | {fmt_ms(ge)} | {fmt_ms(ce)} |")

evaluation/reports/latency_report_v2.md:37

  • The headline methodology says "Reported as median across the 54 runs unless noted (p95 in tables marked p95)", but the very next "Headline — Median total query latency" table breaks results down by category (short / medium / long) — i.e. medians over ~18 runs each, not over the full 54. The methodology line should mention the per-category breakdown so readers don't misinterpret the 13.1 / 12.6 / 17.3 values as 54-run statistics.
- `TTFT` excludes retrieval — measured from end-of-retrieval to first generated token.
- `decode` is first-token to last-token.
- `total_query` is everything: `retrieval + TTFT + decode`.
- Reported as median across the 54 runs unless noted (p95 in tables marked `p95`).

## Headline — Median total query latency (seconds)

| k | doc_chars med | GPU short / med / long | CPU short / med / long | CPU÷GPU |
|---:|---:|---:|---:|---:|
| **0 (no-RAG)** | 0 | 12.9 / 15.6 / 16.1 | 27.2 / 26.9 / 29.8 | 1.94× |
| 1 | 561 | 13.1 / 12.6 / 17.3 | 29.3 / 31.9 / 30.3 | 2.14× |
| 3 | 2098 | 18.6 / 18.6 / 21.0 | 37.3 / 44.5 / 42.5 | 2.24× |
| 5 | 3547 | 18.2 / 20.0 / 21.4 | 54.8 / 60.7 / 63.0 | 3.07× |
| 7 | 5139 | 21.3 / 23.2 / 22.8 | 61.4 / 62.3 / 60.4 | 2.72× |
| 10 | 7482 | 22.5 / 20.5 / 20.4 | 61.8 / 70.6 / 77.9 | 3.10× |
| 15 | 11297 | 25.3 / 24.0 / 22.4 | 84.8 / 80.8 / 89.7 | 3.48× |
| 20 | 14520 | 23.9 / 20.5 / 18.5 | 88.7 / 95.6 / 95.6 | 4.46× |

Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/reports/latency_report_v2.md Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
nmrenyi and others added 4 commits May 15, 2026 10:22
Report claimed "The 8 surviving queries on either side" — but the math
doesn't check out: 18 queries × 3 reps = 54 runs, 24 errored = 8 unique
queries failed × 3 reps. So 18 − 8 = 10 unique queries survived, not 8.

Corrected to "The other 10 queries (10 × 3 reps = 30 successful runs)".

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The service normalises -1 to null but treats any value >= 0 as an
explicit override, so passing --retrieve-k 0 would silently call
RetrievalConfig.create(0, ...) — a confusing footgun. If you want to
disable retrieval entirely, --no-retrieval is the proper flag.

argparse now rejects --retrieve-k < 1 with a clear error pointing to
--no-retrieval. Negative values were already filtered by the service's
takeIf { it >= 0 } but a 0 was slipping through.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the PARTIAL_WAKE_LOCK was acquired in onCreate(), before
onStartCommand calls startForegroundCompat. Two problems with that:

1. If the system creates the service but onStartCommand is delayed or
   never invoked (bind-only path, framework deferral), the wake lock is
   held without a foreground notification — and Android 12+ can trip
   the foreground-service-start-while-in-background restriction in that
   state.
2. Even on the normal path, there is a brief window where the CPU is
   pinned awake without the user-visible notification that justifies it.

Move the wake-lock acquisition into onStartCommand, immediately AFTER
startForegroundCompat. The lock is now strictly paired with the
foreground notification's lifetime. Guarded with `wakeLock == null` so
duplicate onStartCommand invocations (which can happen via START_NOT_STICKY
restarts) don't try to re-acquire.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-thread executor is what ferries pipeline calls off the
coroutine dispatchers (LiteRT-LM generation, Gecko retrieval), and
scope.cancel() does NOT propagate cancellation into those blocking
native calls. A plain executor.shutdown() then returns immediately and
leaves the worker thread alive, keeping the :benchmark process running
until generation finishes naturally — stopForeground might run with
the worker still busy.

Use shutdownNow() to interrupt the worker, plus a brief 2 s
awaitTermination() so a worker that's tearing down cleanly gets a chance
to do so. If it doesn't, the OS will reclaim the process eventually.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nmrenyi and others added 2 commits May 15, 2026 10:25
backend_of's fallback was `d["config"].get("backend", "CPU")`, which
silently labels any backend-less JSON as "CPU". Pre-fix files are
handled by the explicit allowlist; the silent default only fires on
"unexpected" JSONs — which is exactly when a future regression in
BenchmarkForegroundService (e.g. metadata accidentally dropped) would
slip past us.

Now: if config.backend is missing AND the file isn't on the pre-fix
allowlist, print a warning to stderr explaining the assumption, then
default to "CPU". Post-fix JSONs always carry the field, so this
warning only fires when something is genuinely off.
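A sketch of the fallback as described above; the allowlist contents and
the function signature here are illustrative assumptions, not the
script's actual names:

```python
import sys

# Hypothetical allowlist of pre-fix result files known to be CPU runs.
PRE_FIX_CPU_FILES = {"benchmark_results_pre_fix.json"}

def backend_of(d: dict, filename: str) -> str:
    backend = d.get("config", {}).get("backend")
    if backend is not None:
        return backend
    if filename not in PRE_FIX_CPU_FILES:
        # Post-fix JSONs always carry config.backend, so reaching this
        # branch means something upstream is genuinely off.
        print(
            f"WARNING: {filename} missing config.backend; assuming CPU",
            file=sys.stderr,
        )
    return "CPU"
```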

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Python wrapper rejects --no-retrieval + --rag-only via parser.error()
before the launch ever fires. But a direct `am start --ez skip_retrieval
true --ez rag_only true ...` bypasses Python entirely and silently runs
in No-RAG mode (skipRetrieval wins) with no indication anything is off.

Add a Log.w under the same mam-ai-bench tag as the other [BENCHMARK]
markers so the mismatch is visible in `adb logcat -s mam-ai-bench:W`
output during debugging.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nmrenyi nmrenyi requested a review from Copilot May 15, 2026 02:35
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread evaluation/aggregate_k_sweep.py
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread evaluation/aggregate_k_sweep.py Outdated
Comment thread app/android/app/src/main/kotlin/com/example/app/BenchmarkForegroundService.kt Outdated
Comment thread app/android/app/src/main/AndroidManifest.xml
nmrenyi and others added 7 commits May 15, 2026 10:48
The 6 h ceiling could expire mid-sweep on long CPU runs — the full GPU
+ CPU k-sweep documented in latency_report_v2.md took ~7 h end-to-end,
and a CPU-only sweep across k ∈ {1, 3, 5, 7, 10, 15} hit similar
totals. When the lock auto-released, the OS could idle the CPU and
the benchmark would silently stall (no failure log, just no progress).

Bump to 24 h, with a comment that anything longer should switch to
periodic re-acquire instead of bumping the constant further.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scope.launch { runBenchmark(...) } was unconditional, so a re-delivered
intent (another `am start` before stopSelf() completes) would spawn a
second coroutine running on the same single-threaded executor and
writing to the same benchmark_results.json — both timings and the
output JSON would be corrupted.

Add a `benchmarkStarted` volatile flag that's set on first entry. Any
later onStartCommand call returns immediately with a logcat warning,
keeping the in-flight run intact.

START_NOT_STICKY makes this unlikely in practice, but the right belt-
and-braces fix is cheap and removes a race even on edge cases (e.g.
process recreated after low-memory kill before stopSelf returned).

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The catch handler in runQuery() was the only [BENCHMARK] log line in
the file using TAG="mam-ai" instead of BENCH_TAG="mam-ai-bench". This
made `adb logcat -s mam-ai-bench:E` filter out exactly the messages
most worth surfacing — query-level errors.

One-line fix.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
load_runs() was silently dropping JSONs that didn't match the 54-run
canonical-sweep shape (missing config/results keys, or < 30 results
which is the smoke-test guard). For users running a narrow sweep
(e.g. --filter long_01 --repeats 3 yields 3 results), the file would
silently never appear in the report with no indication why.

Log SKIP lines to stderr with the file name and reason. The output is
still clean for normal runs (only emits when something is actually
dropped).
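The skip-with-reason pattern could look like the following sketch. The
30-result guard is from this commit; the glob pattern and key names are
assumptions about the script's layout:

```python
import json
import sys
from pathlib import Path

def load_runs(results_dir: str, min_results: int = 30) -> list[dict]:
    """Load benchmark JSONs, logging a SKIP line for anything dropped."""
    runs = []
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        if "config" not in data or "results" not in data:
            print(f"SKIP {path.name}: missing config/results keys",
                  file=sys.stderr)
            continue
        if len(data["results"]) < min_results:
            print(f"SKIP {path.name}: only {len(data['results'])} results "
                  f"(< {min_results}, smoke-test guard)", file=sys.stderr)
            continue
        runs.append(data)
    return runs
```

Normal 54-run sweep files load silently; only dropped files produce
stderr output, so report generation stays clean.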

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Methodology paragraph hardcoded "18 queries × … 10-second cooldown"
even though the JSONs carry the actual config.repeats, config.cooldown_ms,
and a results count that proves the math. If a future run uses different
defaults (or this script is pointed at a different sweep), the
methodology text would silently lie.

Now reads from the sample run's config: pulls repeats, cooldown_ms,
results count; infers (queries × modes) from results / repeats. Output
for the current data set is unchanged ("18 (query × mode) cells × 3
repeats = 54 timed runs … 10-second cooldown") because that's what the
JSONs actually say.
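The derivation can be sketched roughly as follows; the key names
config.repeats and config.cooldown_ms are from this commit, while the
function name and sentence wording are illustrative:

```python
def methodology_line(sample_run: dict) -> str:
    # Derive the counts from the JSON instead of hardcoding them.
    repeats = sample_run["config"]["repeats"]
    cooldown_s = sample_run["config"]["cooldown_ms"] // 1000
    n_results = len(sample_run["results"])
    cells = n_results // repeats  # inferred (queries x modes)
    return (
        f"{cells} (query × mode) cells × {repeats} repeats = "
        f"{n_results} timed runs, with a {cooldown_s}-second cooldown"
    )
```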

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous approach indexed the sorted sample at `int(len(s) * 0.95)`,
which is biased high and collapses to the max for any sample size
n < 20. For the per-category short bucket with 24 runs, int(24*0.95) = 22
picks the 23rd of 24 sorted values, close to but not actually the 95th
percentile; for a hypothetical narrower sample with n=3 (e.g. a
single-query small sweep), int(2.85) = 2 is the last index, so
p95 == max by construction.

Centralise the calculation in a `_p95(values)` helper that uses
`statistics.quantiles(values, n=20, method="exclusive")[18]` — the
linear-interpolation 95th percentile from a 20-quantile partition.
Falls back to max only when n < 2 (genuinely no quantiles to compute).

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stopForeground(boolean) has been deprecated since Android 13. Replace
with the SDK-aware form: STOP_FOREGROUND_REMOVE on API 24+ (where the
int overload was introduced), fall back to the boolean variant only on
older devices (where it isn't deprecated).

Drops the @Suppress("DEPRECATION") on the modern path; we still
suppress on the legacy path because the boolean variant *is* the
non-deprecated API there.

Per Copilot review on PR #57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


@nmrenyi nmrenyi merged commit 1be0a55 into main May 15, 2026
14 of 18 checks passed
@nmrenyi nmrenyi deleted the feat/benchmark-retrieve-k-override branch May 15, 2026 03:57
nmrenyi added a commit that referenced this pull request May 16, 2026
Ran the same 8-cell sweep we did for FP16 GPU in PR #57/#59, but
against the FP32-tagged artifact at maxNumTokens=4096. ~4.5 hours
total wall-clock on the OnePlus OPD2413 (Snapdragon 8 Elite). Result
is a clean cell-by-cell comparison.

Headline: FP32 GPU is **~25% slower than FP16 GPU at k=15** (~6 s
extra wait per query), much less than the ~3× I'd estimated from
the single-data-point Step 3 measurement. That earlier number was
wrong — I had accidentally been comparing FP32-E4B against
FP16-**E2B** (the smaller model), not the matched FP16-E4B baseline.

The slowdown is almost entirely in TTFT (prefill):
- FP16 TTFT 0.96–4.0s, FP32 TTFT 2.0–9.8s (~2–2.5× across all k)
- FP16 decode 11–18s, FP32 decode 12–19s (essentially identical)

Mechanism: prefill is compute-bound (one parallel forward pass over
the input), so FP16's 2× arithmetic throughput on Adreno helps a
lot. Decode is bandwidth-bound (sequential token-at-a-time loading
of weights), so the FP16/FP32 precision choice barely matters.
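A back-of-envelope check of that decomposition, using only the range
endpoints quoted above (illustrative numbers from this message, not
fresh measurements):

```python
# Upper ends of the quoted ranges (roughly the largest-k cells), seconds.
fp16_ttft, fp16_decode = 4.0, 18.0
fp32_ttft, fp32_decode = 9.8, 19.0

fp16_total = fp16_ttft + fp16_decode  # 22.0 s
fp32_total = fp32_ttft + fp32_decode  # 28.8 s

# Near-identical decode dilutes the ~2.4x TTFT gap into a ~30% total
# slowdown at the top of the sweep, consistent with ~25% at k=15.
slowdown = fp32_total / fp16_total - 1.0
print(f"FP32 total slowdown: {slowdown:.0%}")
```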

Same 24 errors at k=20 on FP32 GPU as on FP16 GPU — the prompt-cap
rejection at maxNumTokens=4096 is precision-agnostic, just a config
check.

What this means: **FP32 GPU is a real shipping option, not just an
experiment.** At maxNumTokens=4096 the latency cost is ~25% (no
quality benefit — we're below the FP16 cliff anyway). At higher
maxNumTokens (e.g., 5500), FP32 GPU enables clean output past the
FP16 cliff at the same ~25% latency hit. Memory ceiling caps
maxNumTokens at ~6500–7500 on this 16 GB device since KV cache
doubles vs FP16.

The choice between FP16 GPU and FP32 GPU is now a UX-vs-margin
tradeoff at the deployment level, not a feasibility question.

Updates: new Step 5 section with full sweep table + corrected
slowdown narrative; updated TL;DR bullet on FP32 latency cost;
"What's still open" table marks FP32 latency curve resolved; full
8-JSON inventory added to References.