## Why
Adding NPU (Qualcomm Hexagon, via QNN / AI Engine Direct) as a third backend option on `RagPipeline` could give us a meaningful perf upgrade on Snapdragon 8 Elite test devices: expected ~3–10× decode speedup over the current GPU path and sub-second TTFT. The runtime is ready — what's missing is the model artifact.
## What we know

| Component | Status |
|---|---|
| `Backend.NPU(nativeLibraryDir=…)` in `com.google.ai.edge.litertlm:litertlm-android:0.11.0` | ✅ Shipped |
| QAIRT native libs (`libQnnHtp`/`V79*`/`System`/`HtpPrepare`/`LiteRtDispatch_Qualcomm`) needed in `jniLibs/arm64-v8a/` | ✅ Available from Qualcomm AI Hub |
| Pre-compiled `gemma-4-E4B-it_qualcomm_sm8750.litertlm` artifact on HuggingFace | ❌ Not yet published |
Only the smaller sibling, `gemma-4-E2B-it_qualcomm_sm8750.litertlm` (~3.02 GB), exists on `litert-community/`. Loading our current generic `gemma-4-E4B-it.litertlm` on `Backend.NPU()` fails with `TF_LITE_AUX not found in the model` — see LiteRT-LM Issue #774.
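Until the E4B artifact exists, a cheap guard can keep us from attempting an NPU load on an incompatible file at all. A minimal sketch, assuming the `_qualcomm_sm8750` filename convention used by litert-community's published builds; the helper itself is hypothetical, not LiteRT-LM API:

```kotlin
// Hypothetical guard (not LiteRT-LM API): only offer the NPU backend when the
// bundled model is an SM8750-compiled build, so we fail fast instead of hitting
// the "TF_LITE_AUX not found in the model" error at engine init.
fun isSm8750NpuArtifact(modelFileName: String): Boolean =
    modelFileName.endsWith(".litertlm") && "_qualcomm_sm8750" in modelFileName

fun main() {
    // SM8750-compiled build: eligible for Backend.NPU()
    println(isSm8750NpuArtifact("gemma-4-E2B-it_qualcomm_sm8750.litertlm")) // true
    // Our current generic artifact: stay on the GPU path
    println(isSm8750NpuArtifact("gemma-4-E4B-it.litertlm"))                 // false
}
```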
## Why precompilation is needed (and a fixed-context-length gotcha)
Unlike CPU/GPU, the Hexagon NPU executes an ahead-of-time compiled binary graph. Someone has to run Qualcomm's QAIRT toolchain against the Gemma weights, target SM8750 specifically, and bundle the result into a custom `.litertlm`. The compiled graph also bakes in a fixed max sequence length — e.g. the Gemma 3 1B Qualcomm build is locked to 1280 tokens. Whatever E4B SM8750 build eventually appears will have its own cap, and we'd need to verify it's high enough for our RAG k=15 path (~3500 input tokens — see the latency report `evaluation/reports/latency_report_v2.md`).
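Since the cap is baked in at compile time, the budget check can be a startup one-liner rather than a runtime failure. A sketch under stated assumptions: `NPU_MAX_SEQ_LEN` uses the Gemma 3 1B figure as a placeholder, the output-token reserve is a guess, and the real cap must be read from whatever E4B SM8750 build eventually ships:

```kotlin
// Assumed numbers for illustration only: 1280 is the Gemma 3 1B Qualcomm cap
// cited above; the E4B build's real cap is unknown until it ships.
const val NPU_MAX_SEQ_LEN = 1280
const val RESERVED_OUTPUT_TOKENS = 256 // headroom for the generated answer

fun fitsNpuContext(promptTokens: Int): Boolean =
    promptTokens + RESERVED_OUTPUT_TOKENS <= NPU_MAX_SEQ_LEN

fun main() {
    // Our k=15 RAG path is ~3500 input tokens, far over a 1280-token cap.
    println(fitsNpuContext(3500)) // false
}
```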
## Watch criteria — when to revisit
Check monthly:
- HuggingFace `litert-community/gemma-4-E4B-it-litert-lm` for a `*_qualcomm_sm8750.litertlm` file
- LiteRT-LM repo releases for NPU-related feature notes
- LiteRT-LM issues filtered on `qnn` / `npu` labels
When the artifact appears, integration is a ~1 day patch:
- Add a `Backend.NPU()` branch in `RagPipeline.kt:112-124`, mirroring the existing GPU try/catch
- Bundle the QAIRT `.so` files in `app/android/app/src/main/jniLibs/arm64-v8a/`
- Bump `config/rag_assets.lock.json` to pull the new artifact
- Add a `useNpuForLlm` Gradle property + `BuildConfig` field, matching the existing `useGpuForLlm` pattern
- Pin the QAIRT lib version exactly to what the model artifact expects — version mismatches crash on the same SM8750 chip; see LiteRT-LM Issue #2226
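The first and fourth steps above can be sketched as one selection function. Only `Backend.NPU(nativeLibraryDir = …)` itself is shipped API (per the table above); the enum, the `tryInit` hook, and the flag wiring are hypothetical stand-ins for `RagPipeline`'s actual fields and error handling:

```kotlin
// Sketch of the proposed selection logic, mirroring the existing GPU try/catch.
// The BuildConfig flags and engine factory are assumptions for illustration.
enum class LlmBackend { NPU, GPU, CPU }

fun chooseBackend(
    useNpuForLlm: Boolean,
    useGpuForLlm: Boolean,
    tryInit: (LlmBackend) -> Boolean, // returns false if engine creation throws
): LlmBackend = when {
    useNpuForLlm && tryInit(LlmBackend.NPU) -> LlmBackend.NPU
    useGpuForLlm && tryInit(LlmBackend.GPU) -> LlmBackend.GPU
    else -> LlmBackend.CPU // last-resort fallback, unchanged from today
}

fun main() {
    // NPU init fails (e.g. wrong artifact), so we fall through to GPU.
    val picked = chooseBackend(true, true) { backend -> backend == LlmBackend.GPU }
    println(picked) // GPU
}
```

Keeping the NPU attempt first and strictly fallible preserves today's behavior whenever the artifact or QAIRT libs are missing.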
## Caveats and scope
- Snapdragon-only. Roughly 80% of our African target devices run MediaTek Dimensity / Helio chips, and their NPU (the MediaTek APU) needs a separately compiled artifact — not yet published for Gemma 4 at any size. This issue tracks only the Snapdragon path.
- E2B downgrade as a fallback: if we want NPU acceleration before the E4B artifact ships, the alternative is switching to Gemma 4 E2B (2B params, ~half the footprint), which already has a Snapdragon 8 Elite QNN build. That means accepting a quality regression vs E4B and would need its own answer-quality evaluation. Out of scope here; file separately if anyone wants to pursue it.
- Foreground service + Hans whitelist still apply. NPU inference runs on Hexagon but the orchestrator stays in-process, so the existing screen-off workarounds carry over unchanged.
## Recommendation
Long-tail watch, not active work. Keep optimizing the GPU path (which is fine for Snapdragon 8 Elite at any k ≤ 15). Recheck monthly. When the E4B-SM8750 artifact lands, it's a small patch on top of the existing backend-selection logic.