Overview
This issue captures the full evaluation strategy for MAM-AI, covering what existing RAG evaluation frameworks offer, how to make best use of them, and where MAM-AI could contribute back to the community.
Major RAG Evaluation Frameworks
RAGAS
- Most widely used; measures Faithfulness, Answer Relevancy, Context Precision, Context Recall
- Faithfulness and Answer Relevancy are reference-free (no ground truth needed) — useful since we don't always have a verified "correct answer"
- Context Recall requires ground truth answers
- Supports local judge models via Ollama
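For reference, a minimal RAGAS run might look like the sketch below, assuming the ragas 0.1.x-style API and a HuggingFace `datasets` input; the question/answer/chunk strings are placeholders, and the judge defaults to an OpenAI model unless swapped out:

```python
# Minimal RAGAS evaluation sketch (ragas 0.1.x-style API).
# All strings are placeholders; ground_truth is only needed for
# the reference-based metrics such as context_recall.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,   # reference-free
    context_precision,
    context_recall,     # requires ground_truth
    faithfulness,       # reference-free
)

data = {
    "question": ["What is the first-line treatment for X?"],
    "answer": ["Guideline Y recommends Z as first-line treatment."],
    "contexts": [["Retrieved chunk from the guideline corpus ..."]],
    "ground_truth": ["Z is first-line per guideline Y."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregate score per metric
```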
DeepEval
- Same core metrics as RAGAS
- Key advantage: self-explaining scores — tells you why a response scored low, not just the number
- Tight pytest/CI integration; one command to switch judge to a local Ollama model
- Best choice for fast development iteration
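The self-explaining loop might look like this sketch, based on deepeval's documented `LLMTestCase` and metric interface (strings and threshold are placeholders):

```python
# DeepEval faithfulness check with a self-explaining score.
# Placeholders throughout; the threshold is illustrative.
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the first-line treatment for X?",
    actual_output="Guideline Y recommends Z as first-line treatment.",
    retrieval_context=["Retrieved chunk from the guideline corpus ..."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score)   # numeric score in [0, 1]
print(metric.reason)  # why it scored that way, not just the number
```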
TruLens
- More focused on tracing and monitoring during development than batch evaluation
- Less suited for producing evaluation reports
ARES (Stanford)
- Most rigorous: trains a custom small classifier as judge using your own corpus
- Produces scores with statistical confidence intervals rather than point estimates
- Key advantage for MAM-AI: domain-specific judge that understands Zanzibar clinical guidelines rather than relying on a general LLM
- Requires an A100-class GPU — a research-level tool, not a quick integration
What Both RAGAS and DeepEval Evaluate
Both evaluate retrieval and generation independently, which is their core value:
| Component | Metric | What it measures |
|---|---|---|
| Retrieval | Context Precision | Are the retrieved chunks actually relevant? |
| Retrieval | Context Recall | Does the retrieved context cover the ground truth? |
| Generation | Faithfulness | Does the response only claim things the retrieved context supports? |
| Generation | Answer Relevancy | Does the response actually address the question? |
This component-level breakdown is important: a low faithfulness score points to a generation problem; low context precision points to a retrieval problem. Without it, we only know that the final response is wrong, not why.
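To make the routing concrete, a hypothetical triage helper — the function, score keys, and thresholds are ours, not part of either framework:

```python
# Hypothetical triage: route a low-scoring query to the right subsystem.
# Score keys and thresholds are illustrative, not from either framework.
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> str:
    if min(scores["context_precision"], scores["context_recall"]) < threshold:
        return "retrieval problem: revisit chunking, embeddings, or top-k"
    if scores["faithfulness"] < threshold:
        return "generation problem: claims unsupported by the retrieved context"
    if scores["answer_relevancy"] < threshold:
        return "generation problem: response does not address the question"
    return "ok"
```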
Recommended Evaluation Strategy for MAM-AI
When: Pre-deployment on a dev machine with good internet — not on-device.
Tooling:
- DeepEval for development iteration (fast, self-explaining feedback)
- RAGAS for evaluation reporting (more recognized in research community)
- MCQ accuracy (AfriMedQA, Kenya vignettes) for medical correctness — no LLM judge needed (scoring sketch after this list)
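Because MCQ items have a single correct option, scoring is plain string comparison; a minimal sketch (the `(predicted, expected)` pairing is our choice of input shape, not the actual AfriMedQA schema):

```python
# Exact-match MCQ scoring: no LLM judge needed.
# Each pair is (model's chosen option, answer key), e.g. ("B", "B").
def mcq_accuracy(pairs: list[tuple[str, str]]) -> float:
    if not pairs:
        return 0.0
    correct = sum(
        predicted.strip().upper() == expected.strip().upper()
        for predicted, expected in pairs
    )
    return correct / len(pairs)

print(mcq_accuracy([("B", "B"), ("C", "A")]))  # 0.5
```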
Judge model: Claude or GPT-4 for reliable scores during development; local Ollama model for cost-efficient large batch runs.
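The judge swap might look like this in RAGAS, assuming its documented "bring your own LLM" path accepts LangChain models (model names are placeholders, not a recommendation):

```python
# Sketch: point RAGAS at a local Ollama judge for large batch runs.
# Assumes ragas 0.1.x wraps LangChain chat models passed to evaluate().
from datasets import Dataset
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

dataset = Dataset.from_dict({
    "question": ["..."],   # placeholders; real rows come from the
    "answer": ["..."],     # per-query logging described below
    "contexts": [["..."]],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=ChatOllama(model="llama3"),                      # local judge
    embeddings=OllamaEmbeddings(model="nomic-embed-text"),
)
```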
Per query, log three fields (a minimal logging sketch follows this list):
- The question
- Retrieved chunks (from Gecko retrieval)
- Generated response (from Gemma on-device)
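A minimal JSONL logging sketch; the field names `question`/`contexts`/`answer` are our choice, picked to match what RAGAS and DeepEval expect:

```python
# Minimal JSONL logger for the three fields; field names are chosen to
# match the RAGAS inputs ("contexts" = Gecko chunks, "answer" = Gemma
# response).
import json

def log_query(path: str, question: str, chunks: list[str], response: str) -> None:
    record = {
        "question": question,
        "contexts": chunks,
        "answer": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Matching RAGAS's column names means the log can be loaded with `Dataset.from_list` and handed to `evaluate()` unchanged.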
Pass these into RAGAS/DeepEval for automatic scoring. See issue #33 for implementation details.
Critical Limitation
A high faithfulness score means the response is consistent with retrieved chunks — it does not mean the response is medically correct. A general LLM judge has no knowledge of Zanzibar's MOHSW guidelines or East African clinical norms, and will default to Western biomedical standards.
This means automated RAG metrics are a necessary but insufficient signal. Domain expert review (clinicians, LIGHT/Swiss TPH partners) remains essential, especially for:
- Medication dosing recommendations
- Emergency escalation thresholds
- Swahili-language response quality
Potential Contribution to the Community
East African / Swahili Clinical QA Dataset
No existing RAG benchmark evaluates small quantized on-device models (1–4B parameters) against East African clinical guidelines, and none include Swahili.
MAM-AI's document corpus (WHO guidelines, MOHSW Zanzibar guidelines) is the basis for a novel benchmark dataset:
- QA pairs derived from the existing guideline documents
- Swahili translations reviewed by clinical partners
- Ground-truth answers verified by domain experts
- Designed specifically for evaluating on-device models in resource-constrained settings
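A hypothetical record shape for such a dataset; every field name below is a proposal, not a settled schema:

```python
# Illustrative benchmark record; all field names are proposals.
example_record = {
    "id": "mohsw-zanzibar-0001",                            # hypothetical ID scheme
    "question_en": "English clinical question ...",
    "question_sw": "Clinician-reviewed Swahili translation ...",
    "ground_truth": "Expert-verified answer with guideline citation ...",
    "source_document": "e.g. a MOHSW Zanzibar or WHO guideline section",
    "expert_verified": True,
}
```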
This would be a meaningful research contribution — directly useful to anyone building health AI for East Africa and complementary to AfriMed-QA (which is English-dominant) and the Kenyan Clinical RAG benchmark (which is in progress).
Scope for dataset contribution