Manual review of low-safety responses before expanding pilot #50

@nmrenyi

Description

Finding

The app_parity_v1 evaluation found a meaningful tail of low-safety responses (score 1–2 on a 1–5 scale) for both on-device models, particularly under RAG. Safety scores range from 3.0 to 3.8 across conditions, which is mediocre compared to gpt-5 (3.7–4.6).

Responses requiring review

Model        Condition  Dataset          safety=1  safety=2
gemma4-e4b   +RAG       kenya_vignettes        11        49
gemma3n-e4b  No-RAG     kenya_vignettes         9        58
gemma4-e4b   No-RAG     kenya_vignettes         2        31

The 11 gemma4-e4b +RAG safety=1 responses on kenya_vignettes are the highest-priority cases: these are the responses the judge flagged as potentially harmful if followed.

What needs doing

  1. Extract the specific questions and responses with safety=1 from evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json (a minimal extraction sketch follows this list)
  2. Have a medical professional (or the team lead) read them and classify: genuinely unsafe, or judge over-penalising?
  3. Identify any common patterns (e.g. specific clinical topics, RAG context confusing the model)
  4. Decide whether system prompt changes are needed before expanding beyond the current pilot cohort
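
For step 1, a minimal extraction sketch in Python, assuming the result file is a JSON array of per-item records with `question`, `response`, and `safety` keys; the actual schema may differ, so treat the field names as placeholders:

```python
import json
from pathlib import Path

# Path taken from the issue. The field names below ("question", "response",
# "safety") are assumptions about the per-item schema; adjust to match the
# actual keys in the result file.
RESULTS = Path(
    "evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json"
)

def extract_flagged(path: Path, max_score: int = 1) -> list[dict]:
    """Return the records whose judge safety score is <= max_score."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [
        {
            "question": item.get("question"),
            "response": item.get("response"),
            "safety": item["safety"],
        }
        for item in records
        if item.get("safety") is not None and item["safety"] <= max_score
    ]

if __name__ == "__main__":
    flagged = extract_flagged(RESULTS)
    print(f"{len(flagged)} responses with safety <= 1")
    # Dump a standalone review sheet for the medical reviewer (step 2).
    Path("safety1_review.json").write_text(
        json.dumps(flagged, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Writing the flagged items to a separate file gives the reviewer a self-contained sheet for step 2 without touching the eval pipeline.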

Why this matters

This is a medical app for nurses and midwives. Even a small number of harmful responses in production is a patient safety risk. This review should happen before any significant expansion of the user base.

References

  • Eval report: evaluation/reports/eval_report_app_parity_v1.md — §6
  • Result file: evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json

Metadata

Labels: priority:P1 (Highest current priority)
