
eval: reconsider deployment model — gemma3n-e4b outperforms gemma4-e4b on every open-ended metric #48

@nmrenyi


Finding

Under the app_parity_v1 evaluation protocol, gemma3n-e4b consistently outperforms the currently deployed gemma4-e4b on all three open-ended datasets in the no-RAG condition:

Dataset                  gemma4-e4b  gemma3n-e4b  Δ
kenya_vignettes (n=284)  2.76        3.02         +0.26
afrimedqa_saq (n=37)     2.57        3.28         +0.71
whb_stumps (n=20)        2.51        2.64         +0.13

The gap is largest on afrimedqa_saq (+0.71), driven mainly by completeness (2.49 for gemma3n-e4b vs 1.54 for gemma4-e4b) and helpfulness. On MCQ, gemma3n-e4b also edges out gemma4-e4b by 2–4 percentage points across all three datasets.

Open-ended performance is the deployment-relevant metric: it is the closest proxy we have for answer quality on real clinical queries in the app.
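For anyone who wants to double-check the table, here is a minimal sketch of how the per-dataset means and deltas could be recomputed from the per-item judge scores in the two run dirs. The scores.jsonl filename and its {"dataset", "score"} fields are assumptions about the run-dir layout, not the actual format used by the eval pipeline.

```python
# Sketch: recompute per-dataset mean open-ended scores and deltas from per-item
# judge outputs. ASSUMPTION: each run dir contains a scores.jsonl with one JSON
# object per item, holding "dataset" and "score" fields; adjust to the real layout.
import json
from collections import defaultdict
from pathlib import Path

def dataset_means(run_dir: str) -> dict[str, float]:
    """Average the open-ended judge score per dataset for one run."""
    sums, counts = defaultdict(float), defaultdict(int)
    for line in Path(run_dir, "scores.jsonl").read_text().splitlines():
        item = json.loads(line)
        sums[item["dataset"]] += item["score"]
        counts[item["dataset"]] += 1
    return {ds: sums[ds] / counts[ds] for ds in sums}

baseline = dataset_means("evaluation/results/gemma4-e4b/norag-full-20260411T095630")
candidate = dataset_means("evaluation/results/gemma3n-e4b/norag-full-20260411T114335")
for ds in sorted(baseline):
    print(f"{ds}: {baseline[ds]:.2f} -> {candidate[ds]:.2f} "
          f"(Δ {candidate[ds] - baseline[ds]:+.2f})")
```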

Question to resolve

Is there a concrete reason to keep gemma4-e4b as the deployment target over gemma3n-e4b? Candidates:

  • Inference speed / TTFT on target devices
  • Memory footprint
  • On-device compatibility (LiteRT-LM version requirements)
  • Quantization quality at E4B level

If no strong reason exists, the default should switch to gemma3n-e4b for the next pilot cohort.
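If the decision hinges on the first candidate above (inference speed / TTFT), a runtime-agnostic sketch like the one below could be run on target devices for both models. Note that `generate_stream` is a hypothetical stand-in for whichever streaming generation call the on-device runtime (e.g. the LiteRT-LM bindings) exposes; it is not an actual API.

```python
# Sketch: measure time-to-first-token and total latency for one prompt.
# ASSUMPTION: `generate_stream` is a hypothetical callable that takes a prompt
# string and yields tokens as they are produced by the on-device runtime.
import time
from typing import Callable, Iterator

def measure_ttft(generate_stream: Callable[[str], Iterator[str]],
                 prompt: str) -> tuple[float, float]:
    """Return (time-to-first-token, total latency) in seconds for one prompt."""
    start = time.perf_counter()
    first = None
    for _ in generate_stream(prompt):  # consume the full token stream
        if first is None:
            first = time.perf_counter()
    total = time.perf_counter() - start
    ttft = (first - start) if first is not None else float("nan")
    return ttft, total
```

Memory footprint, LiteRT-LM compatibility, and quantization quality would still need device-level profiling and a qualitative review of outputs; this sketch only covers latency.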

References

  • Eval report: evaluation/reports/eval_report_app_parity_v1.md — §1, §3
  • Run dirs: evaluation/results/gemma4-e4b/norag-full-20260411T095630/, evaluation/results/gemma3n-e4b/norag-full-20260411T114335/
