
eval: reconsider deployment model — gemma3n-e4b outperforms gemma4-e4b on every open-ended metric #48

@nmrenyi


Finding

Under the app_parity_v1 evaluation protocol, gemma3n-e4b consistently outperforms the currently deployed gemma4-e4b on all three open-ended datasets in the no-RAG condition:

Dataset                  gemma4-e4b  gemma3n-e4b  Δ
kenya_vignettes (n=284)  2.76        3.02         +0.26
afrimedqa_saq (n=37)     2.57        3.28         +0.71
whb_stumps (n=20)        2.51        2.64         +0.13

The gap is largest on afrimedqa_saq (+0.71), driven mainly by completeness (2.49 for gemma3n-e4b vs 1.54 for gemma4-e4b) and helpfulness. On MCQ, gemma3n-e4b also edges out gemma4-e4b by 2–4 percentage points across all three datasets.

Open-ended performance is the deployment-relevant metric: it is the closest proxy we have for answer quality on real clinical queries in the app.
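For anyone who wants to double-check the table, here is a minimal sketch of how the per-dataset means and deltas could be recomputed from the per-item judge scores in the two run dirs. The scores.jsonl filename and its {"dataset", "score"} fields are assumptions about the run-dir layout, not the actual format used by the eval pipeline.

```python
# Sketch: recompute per-dataset mean open-ended scores and deltas from per-item
# judge outputs. ASSUMPTION: each run dir contains a scores.jsonl with one JSON
# object per item, holding "dataset" and "score" fields; adjust to the real layout.
import json
from collections import defaultdict
from pathlib import Path

def dataset_means(run_dir: str) -> dict[str, float]:
    """Average the open-ended judge score per dataset for one run."""
    sums, counts = defaultdict(float), defaultdict(int)
    for line in Path(run_dir, "scores.jsonl").read_text().splitlines():
        item = json.loads(line)
        sums[item["dataset"]] += item["score"]
        counts[item["dataset"]] += 1
    return {ds: sums[ds] / counts[ds] for ds in sums}

baseline = dataset_means("evaluation/results/gemma4-e4b/norag-full-20260411T095630")
candidate = dataset_means("evaluation/results/gemma3n-e4b/norag-full-20260411T114335")
for ds in sorted(baseline):
    print(f"{ds}: {baseline[ds]:.2f} -> {candidate[ds]:.2f} "
          f"(Δ {candidate[ds] - baseline[ds]:+.2f})")
```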

Question to resolve

Is there a concrete reason to keep gemma4-e4b as the deployment target over gemma3n-e4b? Candidates:

  • Inference speed / TTFT on target devices
  • Memory footprint
  • On-device compatibility (LiteRT-LM version requirements)
  • Quantization quality at E4B level

If no strong reason exists, the default should switch to gemma3n-e4b for the next pilot cohort.
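If the decision hinges on the first candidate above (inference speed / TTFT), a runtime-agnostic sketch like the one below could be run on target devices for both models. Note that `generate_stream` is a hypothetical stand-in for whichever streaming generation call the on-device runtime (e.g. the LiteRT-LM bindings) exposes; it is not an actual API.

```python
# Sketch: measure time-to-first-token and total latency for one prompt.
# ASSUMPTION: `generate_stream` is a hypothetical callable that takes a prompt
# string and yields tokens as they are produced by the on-device runtime.
import time
from typing import Callable, Iterator

def measure_ttft(generate_stream: Callable[[str], Iterator[str]],
                 prompt: str) -> tuple[float, float]:
    """Return (time-to-first-token, total latency) in seconds for one prompt."""
    start = time.perf_counter()
    first = None
    for _ in generate_stream(prompt):  # consume the full token stream
        if first is None:
            first = time.perf_counter()
    total = time.perf_counter() - start
    ttft = (first - start) if first is not None else float("nan")
    return ttft, total
```

Memory footprint, LiteRT-LM compatibility, and quantization quality would still need device-level profiling and a qualitative review of outputs; this sketch only covers latency.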

References

  • Eval report: evaluation/reports/eval_report_app_parity_v1.md — §1, §3
  • Run dirs: evaluation/results/gemma4-e4b/norag-full-20260411T095630/, evaluation/results/gemma3n-e4b/norag-full-20260411T114335/
