Manual review of low-safety responses before expanding pilot #50

@nmrenyi

Description

Finding

The app_parity_v1 evaluation found a meaningful tail of low-safety responses (score 1–2 on a 1–5 scale) for both on-device models, particularly under RAG. Safety scores range from 3.0 to 3.8 across conditions, which is mediocre compared to gpt-5 (3.7–4.6).

Responses requiring review

Model        Condition  Dataset          safety=1  safety=2
gemma4-e4b   +RAG       kenya_vignettes        11        49
gemma3n-e4b  No-RAG     kenya_vignettes         9        58
gemma4-e4b   No-RAG     kenya_vignettes         2        31

The 11 gemma4-e4b +RAG safety=1 responses on kenya_vignettes are the highest-priority cases: these are the responses the judge flagged as potentially harmful if followed.

What needs doing

  1. Extract the specific questions and responses with safety=1 from evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json (a minimal extraction sketch follows this list)
  2. Have a medical professional (or the team lead) read them and classify: genuinely unsafe, or judge over-penalising?
  3. Identify any common patterns (e.g. specific clinical topics, RAG context confusing the model)
  4. Decide whether system prompt changes are needed before expanding beyond the current pilot cohort
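
For step 1, a minimal extraction sketch in Python, assuming the result file is a JSON array of per-item records with `question`, `response`, and `safety` keys; the actual schema may differ, so treat the field names as placeholders:

```python
import json
from pathlib import Path

# Path taken from the issue. The field names below ("question", "response",
# "safety") are assumptions about the per-item schema; adjust to match the
# actual keys in the result file.
RESULTS = Path(
    "evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json"
)

def extract_flagged(path: Path, max_score: int = 1) -> list[dict]:
    """Return the records whose judge safety score is <= max_score."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [
        {
            "question": item.get("question"),
            "response": item.get("response"),
            "safety": item["safety"],
        }
        for item in records
        if item.get("safety") is not None and item["safety"] <= max_score
    ]

if __name__ == "__main__":
    flagged = extract_flagged(RESULTS)
    print(f"{len(flagged)} responses with safety <= 1")
    # Dump a standalone review sheet for the medical reviewer (step 2).
    Path("safety1_review.json").write_text(
        json.dumps(flagged, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Writing the flagged items to a separate file gives the reviewer a self-contained sheet for step 2 without touching the eval pipeline.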

Why this matters

This is a medical app for nurses and midwives. Even a small number of harmful responses in production is a patient safety risk. This review should happen before any significant expansion of the user base.

References

  • Eval report: evaluation/reports/eval_report_app_parity_v1.md — §6
  • Result file: evaluation/results/gemma4-e4b/rag-full-20260411T100449/kenya_vignettes.json

Metadata

Labels: priority:P1 (Highest current priority)
