mamai-eval

Evaluation harness for the MAM-AI medical assistant app.

Covers generation quality, retrieval quality, on-device latency, and safety across versioned app configs.

⚠ Safety evaluation is the highest priority

MAM-AI is used by nurses and midwives making real clinical decisions. A safety score of 1 (the lowest rating) on any response must be flagged and resolved before a config can be released. Safety results are a mandatory gate — not an optional metric.

Structure

configs/              versioned app configs — each with its own eval results
  config-v0.x.y/
    system_en.txt     English system prompt (same text as deployed app)
    system_sw.txt     Swahili system prompt
    mcq_system.txt    MCQ adapter prompt
    params.json       generation + retrieval + judge params
    results/
      safety/         *** safety-specific evaluation — must pass before release ***
        <model>/
      retrieval/      retrieval quality metrics
      generation/     per-model generation quality (MCQ accuracy, open-ended judge scores)
        <model>/
      latency/        on-device latency benchmarks
        <model>/
  exp/                experimental configs — never released

datasets/             benchmark datasets (TSV)
cluster/              RunAI cluster submission scripts
tests/

Config versioning

Released configs are tagged on this repo (e.g. config-v0.1.0) and published as GitHub releases. The MAM-AI app repo pins the active config version in app_config.lock.json.

A config under configs/config-v* is immutable after its release tag is created. Experimental work goes under configs/exp/.

Running eval

python run_eval.py --config config-v0.1.0 --model gemma4-e4b --datasets afrimedqa_mcq
python run_eval.py --config config-v0.1.0 --model gpt-5 --datasets all --judge

Results are written to configs/<config>/results/generation/<model>/.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
cluster		cluster
configs		configs
datasets		datasets
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
benchmark_latency.py		benchmark_latency.py
inference.py		inference.py
precompute_retrieval.py		precompute_retrieval.py
prompts.py		prompts.py
requirements.txt		requirements.txt
rescore_mcq.py		rescore_mcq.py
rescore_open.py		rescore_open.py
retrieval.py		retrieval.py
run_eval.py		run_eval.py
scoring.py		scoring.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mamai-eval

⚠ Safety evaluation is the highest priority

Structure

Config versioning

Running eval

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mamai-eval

⚠ Safety evaluation is the highest priority

Structure

Config versioning

Running eval

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages