RAG evaluation strategy: frameworks, metrics, gaps, and potential contributions #34

@nmrenyi

Overview

This issue captures the full evaluation strategy for MAM-AI, covering what existing RAG evaluation frameworks offer, how to make best use of them, and where MAM-AI could contribute back to the community.


Major RAG Evaluation Frameworks

RAGAS

  • Most widely used RAG evaluation framework; measures Faithfulness, Answer Relevancy, Context Precision, and Context Recall
  • Faithfulness and Answer Relevancy are reference-free (no ground truth needed) — useful since we don't always have a verified "correct answer"
  • Context Recall requires ground truth answers
  • Supports local judge models via Ollama (see the scoring sketch below)
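
A minimal batch-scoring sketch against the classic RAGAS interface (column and metric names follow the v0.1-style API and may differ in newer releases; the sample row is illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # reference-free
    answer_relevancy,    # reference-free
    context_precision,
    context_recall,      # needs a ground-truth answer
)

# One illustrative row; real runs would load the logged queries (see below).
data = Dataset.from_dict({
    "question": ["What is the first-line treatment for uncomplicated malaria?"],
    "contexts": [["Artemether-lumefantrine is recommended as first-line treatment ..."]],
    "answer": ["Artemether-lumefantrine, per the retrieved guideline."],
    # Some RAGAS versions use "ground_truths" (a list) instead of "ground_truth".
    "ground_truth": ["Artemether-lumefantrine."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores in [0, 1]
```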

DeepEval

  • Same core metrics as RAGAS
  • Key advantage: self-explaining scores — tells you why a response scored low, not just the number
  • Tight pytest/CI integration; one command to switch judge to a local Ollama model
  • Best choice for fast development iteration (see the pytest sketch below)
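
A minimal pytest sketch using DeepEval's documented test-case API (names may shift between versions; the data is illustrative). Each metric carries a `reason` field explaining its score, which is the self-explanation advantage noted above:

```python
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_rag_response():
    case = LLMTestCase(
        input="What is the first-line treatment for uncomplicated malaria?",
        actual_output="Artemether-lumefantrine, per the retrieved guideline.",
        retrieval_context=["Artemether-lumefantrine is recommended as first-line treatment ..."],
    )
    # Fails the pytest test if any metric falls below its threshold;
    # each metric's `reason` field says why it scored as it did.
    assert_test(case, [FaithfulnessMetric(threshold=0.7),
                       AnswerRelevancyMetric(threshold=0.7)])
```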

TruLens

  • Focused more on tracing and monitoring during development than on batch evaluation
  • Less suited for producing evaluation reports

ARES (Stanford)

  • Most rigorous: trains a custom small classifier as judge using your own corpus
  • Produces statistically confident scores with confidence intervals
  • Key advantage for MAM-AI: domain-specific judge that understands Zanzibar clinical guidelines rather than relying on a general LLM
  • Requires A100-class GPU — research-level tool, not a quick integration

What Both RAGAS and DeepEval Evaluate

Both evaluate retrieval and generation independently, which is their core value:

| Component  | Metric            | What it measures                                                     |
|------------|-------------------|----------------------------------------------------------------------|
| Retrieval  | Context Precision | Are the retrieved chunks actually relevant?                          |
| Retrieval  | Context Recall    | Does the retrieved context cover the ground truth?                   |
| Generation | Faithfulness      | Does the response only claim things the retrieved context supports?  |
| Generation | Answer Relevancy  | Does the response actually address the question?                     |

This component-level breakdown is important: a low Faithfulness score points to a generation problem, while low Context Precision points to a retrieval problem. Without it, we only know that an end-to-end response is wrong, not why.


Recommended Evaluation Strategy for MAM-AI

When: Pre-deployment on a dev machine with good internet — not on-device.

Tooling:

  • DeepEval for development iteration (fast, self-explaining feedback)
  • RAGAS for evaluation reporting (more recognized in research community)
  • MCQ accuracy (AfriMedQA, Kenya vignettes) for medical correctness; no LLM judge needed (see the accuracy sketch below)
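
MCQ accuracy needs no judge model at all; a minimal exact-match sketch (the item schema here is illustrative, not the actual AfriMedQA format):

```python
def mcq_accuracy(items, answer_fn):
    """items: [{"question": str, "choices": list[str], "answer": "B"}, ...]
    answer_fn: the model under test; returns a choice letter such as "B".
    """
    correct = sum(
        answer_fn(item["question"], item["choices"]).strip().upper() == item["answer"]
        for item in items
    )
    return correct / len(items)
```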

Judge model: Claude or GPT-4 for reliable scores during development; local Ollama model for cost-efficient large batch runs.

Per query, log three fields:

  1. The question
  2. Retrieved chunks (from Gecko retrieval)
  3. Generated response (from Gemma on-device)

Pass these into RAGAS/DeepEval for automatic scoring. See issue #33 for implementation details.
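
A sketch of that per-query log as JSONL, so a batch can later be replayed through either evaluator (`retrieve` and `generate` are placeholder hooks for the Gecko and on-device Gemma calls):

```python
import json

def log_query(question, retrieve, generate, path="eval_log.jsonl"):
    chunks = retrieve(question)          # list[str] of retrieved chunks (Gecko)
    answer = generate(question, chunks)  # generated response (Gemma on-device)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "question": question,
            "contexts": chunks,   # maps to RAGAS "contexts" / DeepEval retrieval_context
            "answer": answer,
        }, ensure_ascii=False) + "\n")
    return answer
```

Each JSONL row maps directly onto the RAGAS dataset columns and the DeepEval LLMTestCase fields shown above.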


Critical Limitation

A high faithfulness score means the response is consistent with retrieved chunks — it does not mean the response is medically correct. A general LLM judge has no knowledge of Zanzibar's MOHSW guidelines or East African clinical norms, and will default to Western biomedical standards.

This means automated RAG metrics are a necessary but insufficient signal. Domain expert review (clinicians, LIGHT/Swiss TPH partners) remains essential, especially for:

  • Medication dosing recommendations
  • Emergency escalation thresholds
  • Swahili-language response quality

Potential Contribution to the Community

East African / Swahili Clinical QA Dataset

No existing RAG benchmark evaluates small quantized on-device models (1–4B parameters) against East African clinical guidelines, and none include Swahili.

MAM-AI's document corpus (WHO guidelines, MOHSW Zanzibar guidelines) is the basis for a novel benchmark dataset:

  • QA pairs derived from the existing guideline documents
  • Swahili translations reviewed by clinical partners
  • Ground-truth answers verified by domain experts
  • Designed specifically for evaluating on-device models in resource-constrained settings

This would be a meaningful research contribution — directly useful to anyone building health AI for East Africa and complementary to AfriMed-QA (which is English-dominant) and the Kenyan Clinical RAG benchmark (which is in progress).

Scope for dataset contribution

  • Derive QA pairs from WHO/MOHSW documents already in MAM-AI corpus
  • Add Swahili translations for key questions
  • Clinical expert review (LIGHT, Swiss TPH partners)
  • Release as a HuggingFace dataset (see the packaging sketch below)
  • Publish baseline evaluation results with MAM-AI (Gemma 3n E4B on-device) and server-side models for comparison
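
A packaging sketch for the release step (the repo id and column schema are hypothetical; `push_to_hub` is the standard datasets-library call):

```python
from datasets import Dataset

# Hypothetical schema and repo id; the actual release would follow
# clinical expert review of every row.
qa = Dataset.from_dict({
    "question_en": ["What is the first-line treatment for uncomplicated malaria?"],
    "question_sw": ["..."],        # clinician-reviewed Swahili translation
    "ground_truth": ["..."],       # verified by domain experts
    "source_doc": ["..."],         # WHO / MOHSW guideline the pair derives from
})
qa.push_to_hub("mam-ai/east-african-clinical-qa")  # hypothetical repo id
```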
