## Why
Adding NPU (Qualcomm Hexagon, via QNN / AI Engine Direct) as a third backend option on `RagPipeline` could give us a meaningful perf upgrade on Snapdragon 8 Elite test devices: expected ~3–10× decode speedup over the current GPU path and sub-second TTFT. The runtime is ready — what's missing is the model artifact.
## What we know

| Component | Status |
|---|---|
| `Backend.NPU(nativeLibraryDir=…)` in `com.google.ai.edge.litertlm:litertlm-android:0.11.0` | ✅ Shipped |
| QAIRT native libs (`libQnnHtp`/`V79*`/`System`/`HtpPrepare`/`LiteRtDispatch_Qualcomm`) needed in `jniLibs/arm64-v8a/` | ✅ Available from Qualcomm AI Hub |
| Pre-compiled `gemma-4-E4B-it_qualcomm_sm8750.litertlm` artifact on HuggingFace | ❌ Not yet published |
Only the smaller sibling, `gemma-4-E2B-it_qualcomm_sm8750.litertlm` (~3.02 GB), exists on `litert-community/`. Loading our current generic `gemma-4-E4B-it.litertlm` on `Backend.NPU()` fails with `TF_LITE_AUX not found in the model` — see LiteRT-LM Issue #774.
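Until the E4B artifact exists, a cheap guard can keep us from attempting an NPU load on an incompatible file at all. A minimal sketch, assuming the `_qualcomm_sm8750` filename convention used by litert-community's published builds; the helper itself is hypothetical, not LiteRT-LM API:

```kotlin
// Hypothetical guard (not LiteRT-LM API): only offer the NPU backend when the
// bundled model is an SM8750-compiled build, so we fail fast instead of hitting
// the "TF_LITE_AUX not found in the model" error at engine init.
fun isSm8750NpuArtifact(modelFileName: String): Boolean =
    modelFileName.endsWith(".litertlm") && "_qualcomm_sm8750" in modelFileName

fun main() {
    // SM8750-compiled build: eligible for Backend.NPU()
    println(isSm8750NpuArtifact("gemma-4-E2B-it_qualcomm_sm8750.litertlm")) // true
    // Our current generic artifact: stay on the GPU path
    println(isSm8750NpuArtifact("gemma-4-E4B-it.litertlm"))                 // false
}
```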
## Why precompilation is needed (and a fixed-context-length gotcha)
Unlike CPU/GPU, the Hexagon NPU executes an ahead-of-time compiled binary graph. Someone has to run Qualcomm's QAIRT toolchain against the Gemma weights, target SM8750 specifically, and bundle the result into a custom `.litertlm`. The compiled graph also bakes in a fixed max sequence length — e.g. the Gemma 3 1B Qualcomm build is locked to 1280 tokens. Whatever E4B SM8750 build eventually appears will have its own cap, and we'd need to verify it's high enough for our RAG k=15 path (~3500 input tokens — see the latency report `evaluation/reports/latency_report_v2.md`).
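Since the cap is baked in at compile time, the budget check can be a startup one-liner rather than a runtime failure. A sketch under stated assumptions: `NPU_MAX_SEQ_LEN` uses the Gemma 3 1B figure as a placeholder, the output-token reserve is a guess, and the real cap must be read from whatever E4B SM8750 build eventually ships:

```kotlin
// Assumed numbers for illustration only: 1280 is the Gemma 3 1B Qualcomm cap
// cited above; the E4B build's real cap is unknown until it ships.
const val NPU_MAX_SEQ_LEN = 1280
const val RESERVED_OUTPUT_TOKENS = 256 // headroom for the generated answer

fun fitsNpuContext(promptTokens: Int): Boolean =
    promptTokens + RESERVED_OUTPUT_TOKENS <= NPU_MAX_SEQ_LEN

fun main() {
    // Our k=15 RAG path is ~3500 input tokens, far over a 1280-token cap.
    println(fitsNpuContext(3500)) // false
}
```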
## Watch criteria — when to revisit
Check monthly:
- HuggingFace `litert-community/gemma-4-E4B-it-litert-lm` for a `*_qualcomm_sm8750.litertlm` file
- LiteRT-LM repo releases for NPU-related feature notes
- LiteRT-LM issues filtered on `qnn` / `npu` labels
When the artifact appears, integration is a ~1 day patch:
- Add a `Backend.NPU()` branch in `RagPipeline.kt:112-124`, mirroring the existing GPU try/catch
- Bundle the QAIRT `.so` files in `app/android/app/src/main/jniLibs/arm64-v8a/`
- Bump `config/rag_assets.lock.json` to pull the new artifact
- Add a `useNpuForLlm` Gradle property + `BuildConfig` field, matching the existing `useGpuForLlm` pattern
- Pin the QAIRT lib version exactly to what the model artifact expects — version mismatches crash on the same SM8750 chip; see LiteRT-LM Issue #2226
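The first and fourth steps above can be sketched as one selection function. Only `Backend.NPU(nativeLibraryDir = …)` itself is shipped API (per the table above); the enum, the `tryInit` hook, and the flag wiring are hypothetical stand-ins for `RagPipeline`'s actual fields and error handling:

```kotlin
// Sketch of the proposed selection logic, mirroring the existing GPU try/catch.
// The BuildConfig flags and engine factory are assumptions for illustration.
enum class LlmBackend { NPU, GPU, CPU }

fun chooseBackend(
    useNpuForLlm: Boolean,
    useGpuForLlm: Boolean,
    tryInit: (LlmBackend) -> Boolean, // returns false if engine creation throws
): LlmBackend = when {
    useNpuForLlm && tryInit(LlmBackend.NPU) -> LlmBackend.NPU
    useGpuForLlm && tryInit(LlmBackend.GPU) -> LlmBackend.GPU
    else -> LlmBackend.CPU // last-resort fallback, unchanged from today
}

fun main() {
    // NPU init fails (e.g. wrong artifact), so we fall through to GPU.
    val picked = chooseBackend(true, true) { backend -> backend == LlmBackend.GPU }
    println(picked) // GPU
}
```

Keeping the NPU attempt first and strictly fallible preserves today's behavior whenever the artifact or QAIRT libs are missing.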
## Caveats and scope
- Snapdragon-only. Roughly 80% of our African target devices run MediaTek Dimensity / Helio chips, and their NPU (the MediaTek APU) needs a separately compiled artifact — not yet published for Gemma 4 at any size. This issue tracks only the Snapdragon path.
- E2B downgrade as a fallback: if we want NPU acceleration before the E4B artifact ships, the alternative is switching to Gemma 4 E2B (2B params, ~half the footprint), which already has a Snapdragon 8 Elite QNN build. That means accepting a quality regression vs E4B and would need its own answer-quality evaluation. Out of scope here; file separately if anyone wants to pursue it.
- Foreground service + Hans whitelist still apply. NPU inference runs on Hexagon but the orchestrator stays in-process, so the existing screen-off workarounds carry over unchanged.
## Recommendation
Long-tail watch, not active work. Keep optimizing the GPU path (which is fine for Snapdragon 8 Elite at any k ≤ 15). Recheck monthly. When the E4B-SM8750 artifact lands, it's a small patch on top of the existing backend-selection logic.