RAG evaluation strategy: frameworks, metrics, gaps, and potential contributions #34

@nmrenyi

Overview

This issue captures the full evaluation strategy for MAM-AI, covering what existing RAG evaluation frameworks offer, how to make best use of them, and where MAM-AI could contribute back to the community.


Major RAG Evaluation Frameworks

RAGAS

  • Most widely used RAG evaluation framework; measures Faithfulness, Answer Relevancy, Context Precision, and Context Recall
  • Faithfulness and Answer Relevancy are reference-free (no ground truth needed) — useful since we don't always have a verified "correct answer"
  • Context Recall requires ground truth answers
  • Supports local judge models via Ollama (see the scoring sketch below)
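
A minimal batch-scoring sketch against the classic RAGAS interface (column and metric names follow the v0.1-style API and may differ in newer releases; the sample row is illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # reference-free
    answer_relevancy,    # reference-free
    context_precision,
    context_recall,      # needs a ground-truth answer
)

# One illustrative row; real runs would load the logged queries (see below).
data = Dataset.from_dict({
    "question": ["What is the first-line treatment for uncomplicated malaria?"],
    "contexts": [["Artemether-lumefantrine is recommended as first-line treatment ..."]],
    "answer": ["Artemether-lumefantrine, per the retrieved guideline."],
    # Some RAGAS versions use "ground_truths" (a list) instead of "ground_truth".
    "ground_truth": ["Artemether-lumefantrine."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores in [0, 1]
```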

DeepEval

  • Same core metrics as RAGAS
  • Key advantage: self-explaining scores — tells you why a response scored low, not just the number
  • Tight pytest/CI integration; one command to switch judge to a local Ollama model
  • Best choice for fast development iteration (see the pytest sketch below)
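
A minimal pytest sketch using DeepEval's documented test-case API (names may shift between versions; the data is illustrative). Each metric carries a `reason` field explaining its score, which is the self-explanation advantage noted above:

```python
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_rag_response():
    case = LLMTestCase(
        input="What is the first-line treatment for uncomplicated malaria?",
        actual_output="Artemether-lumefantrine, per the retrieved guideline.",
        retrieval_context=["Artemether-lumefantrine is recommended as first-line treatment ..."],
    )
    # Fails the pytest test if any metric falls below its threshold;
    # each metric's `reason` field says why it scored as it did.
    assert_test(case, [FaithfulnessMetric(threshold=0.7),
                       AnswerRelevancyMetric(threshold=0.7)])
```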

TruLens

  • Focused more on tracing and monitoring during development than on batch evaluation
  • Less suited for producing evaluation reports

ARES (Stanford)

  • Most rigorous: trains a custom small classifier as judge using your own corpus
  • Produces statistically confident scores with confidence intervals
  • Key advantage for MAM-AI: domain-specific judge that understands Zanzibar clinical guidelines rather than relying on a general LLM
  • Requires A100-class GPU — research-level tool, not a quick integration

What Both RAGAS and DeepEval Evaluate

Both evaluate retrieval and generation independently, which is their core value:

| Component  | Metric            | What it measures                                                     |
|------------|-------------------|----------------------------------------------------------------------|
| Retrieval  | Context Precision | Are the retrieved chunks actually relevant?                          |
| Retrieval  | Context Recall    | Does the retrieved context cover the ground truth?                   |
| Generation | Faithfulness      | Does the response only claim things the retrieved context supports?  |
| Generation | Answer Relevancy  | Does the response actually address the question?                     |

This component-level breakdown is important: a low Faithfulness score points to a generation problem, while low Context Precision points to a retrieval problem. Without it, we only know that an end-to-end response is wrong, not why.


Recommended Evaluation Strategy for MAM-AI

When: Pre-deployment on a dev machine with good internet — not on-device.

Tooling:

  • DeepEval for development iteration (fast, self-explaining feedback)
  • RAGAS for evaluation reporting (more recognized in research community)
  • MCQ accuracy (AfriMedQA, Kenya vignettes) for medical correctness; no LLM judge needed (see the accuracy sketch below)
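
MCQ accuracy needs no judge model at all; a minimal exact-match sketch (the item schema here is illustrative, not the actual AfriMedQA format):

```python
def mcq_accuracy(items, answer_fn):
    """items: [{"question": str, "choices": list[str], "answer": "B"}, ...]
    answer_fn: the model under test; returns a choice letter such as "B".
    """
    correct = sum(
        answer_fn(item["question"], item["choices"]).strip().upper() == item["answer"]
        for item in items
    )
    return correct / len(items)
```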

Judge model: Claude or GPT-4 for reliable scores during development; local Ollama model for cost-efficient large batch runs.

Per query, log three fields:

  1. The question
  2. Retrieved chunks (from Gecko retrieval)
  3. Generated response (from Gemma on-device)

Pass these into RAGAS/DeepEval for automatic scoring. See issue #33 for implementation details.
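
A sketch of that per-query log as JSONL, so a batch can later be replayed through either evaluator (`retrieve` and `generate` are placeholder hooks for the Gecko and on-device Gemma calls):

```python
import json

def log_query(question, retrieve, generate, path="eval_log.jsonl"):
    chunks = retrieve(question)          # list[str] of retrieved chunks (Gecko)
    answer = generate(question, chunks)  # generated response (Gemma on-device)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "question": question,
            "contexts": chunks,   # maps to RAGAS "contexts" / DeepEval retrieval_context
            "answer": answer,
        }, ensure_ascii=False) + "\n")
    return answer
```

Each JSONL row maps directly onto the RAGAS dataset columns and the DeepEval LLMTestCase fields shown above.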


Critical Limitation

A high faithfulness score means the response is consistent with retrieved chunks — it does not mean the response is medically correct. A general LLM judge has no knowledge of Zanzibar's MOHSW guidelines or East African clinical norms, and will default to Western biomedical standards.

This means automated RAG metrics are a necessary but insufficient signal. Domain expert review (clinicians, LIGHT/Swiss TPH partners) remains essential, especially for:

  • Medication dosing recommendations
  • Emergency escalation thresholds
  • Swahili-language response quality

Potential Contribution to the Community

East African / Swahili Clinical QA Dataset

No existing RAG benchmark evaluates small quantized on-device models (1–4B parameters) against East African clinical guidelines, and none include Swahili.

MAM-AI's document corpus (WHO guidelines, MOHSW Zanzibar guidelines) is the basis for a novel benchmark dataset:

  • QA pairs derived from the existing guideline documents
  • Swahili translations reviewed by clinical partners
  • Ground-truth answers verified by domain experts
  • Designed specifically for evaluating on-device models in resource-constrained settings

This would be a meaningful research contribution — directly useful to anyone building health AI for East Africa and complementary to AfriMed-QA (which is English-dominant) and the Kenyan Clinical RAG benchmark (which is in progress).

Scope for dataset contribution

  • Derive QA pairs from WHO/MOHSW documents already in MAM-AI corpus
  • Add Swahili translations for key questions
  • Clinical expert review (LIGHT, Swiss TPH partners)
  • Release as a HuggingFace dataset (see the packaging sketch below)
  • Publish baseline evaluation results with MAM-AI (Gemma 3n E4B on-device) and server-side models for comparison
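
A packaging sketch for the release step (the repo id and column schema are hypothetical; `push_to_hub` is the standard datasets-library call):

```python
from datasets import Dataset

# Hypothetical schema and repo id; the actual release would follow
# clinical expert review of every row.
qa = Dataset.from_dict({
    "question_en": ["What is the first-line treatment for uncomplicated malaria?"],
    "question_sw": ["..."],        # clinician-reviewed Swahili translation
    "ground_truth": ["..."],       # verified by domain experts
    "source_doc": ["..."],         # WHO / MOHSW guideline the pair derives from
})
qa.push_to_hub("mam-ai/east-african-clinical-qa")  # hypothetical repo id
```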
