Finding
The app_parity_v1 evaluation found a meaningful tail of low-safety responses (score 1–2 on a 1–5 scale) for both on-device models, particularly under RAG. Safety scores for these models range from 3.0 to 3.8 across conditions, which is mediocre compared to gpt-5 (3.7–4.6).
Responses requiring review
| Model | Condition | Dataset | safety=1 | safety=2 |
| --- | --- | --- | --- | --- |
| gemma4-e4b | +RAG | kenya_vignettes | 11 | 49 |
| gemma3n-e4b | No-RAG | kenya_vignettes | 9 | 58 |
| gemma4-e4b | No-RAG | kenya_vignettes | 2 | 31 |
The 11 gemma4-e4b +RAG safety=1 responses on kenya_vignettes are the highest-priority cases: the judge flagged these responses as potentially harmful if followed.
What needs doing
- Extract the specific questions and responses with safety=1 from evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json
- Have a medical professional (or the team lead) read them and classify: genuinely unsafe, or judge over-penalising?
- Identify any common patterns (e.g. specific clinical topics, RAG context confusing the model)
- Decide whether system prompt changes are needed before expanding beyond the current pilot cohort
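The extraction step could look roughly like the sketch below. The result file's schema is not described in this issue, so the layout is an assumption: a top-level JSON list of records, each with hypothetical `safety`, `question` and `response` keys. Adjust the key names to whatever the eval harness actually writes.

```python
import json
from pathlib import Path

# Path taken from this issue. The record layout is an ASSUMPTION:
# a top-level list of dicts with "safety", "question" and "response" keys.
RESULT_FILE = Path(
    "evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json"
)


def select_by_safety(records, score=1, score_key="safety"):
    """Return the records whose safety score equals `score`.

    Records with a missing or non-numeric score are skipped rather than
    treated as safe or unsafe.
    """
    selected = []
    for record in records:
        value = record.get(score_key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value == score:
            selected.append(record)
    return selected


def load_records(path):
    """Load the result file, assuming it holds a JSON list of records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    if RESULT_FILE.exists():
        for record in select_by_safety(load_records(RESULT_FILE)):
            # Print each flagged case for the reviewer to read.
            print("Q:", record.get("question"))
            print("A:", record.get("response"))
            print("-" * 40)
```

Calling `select_by_safety(records, score=2)` would pull the next tier if the reviewer wants to widen the sample after the safety=1 cases.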
Why this matters
This is a medical app for nurses and midwives. Even a small number of harmful responses in production is a patient safety risk. This review should happen before any significant expansion of the user base.
References
- Eval report: evaluation/reports/eval_report_app_parity_v1.md, §6
- Result file: evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json