Why
The current benchmark suite is useful for overall model quality, but it is not tightly aligned with the actual product task: answering questions grounded in the documents bundled with MAM-AI.
We need a benchmark built from the currently collected guideline corpus so we can evaluate the product against the knowledge base it actually ships.
Scope
Create a new benchmark derived from the bundled docs / RAG corpus that includes:
- Direct lookup questions
- Multi-chunk synthesis questions
- Source-sensitive questions where the expected supporting guideline matters
- Negative / unsupported questions where the correct behavior is to abstain rather than produce unsupported claims
Each example should record enough structure to support both retrieval and generation evaluation.
Dataset fields
At minimum:
- question
- answer / reference answer
- expected supporting source(s)
- expected page(s) when feasible
- task type (lookup / synthesis / unsupported)
- language tag if Swahili is later added
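For concreteness, here is a minimal sketch of what a single record could look like. The field names, the dataclass shape, and the placeholder question are illustrative assumptions, not a finalized spec; the actual layout should be settled during spec review.

```python
# Hypothetical record schema for the benchmark file (e.g. JSON lines).
# Field names below are placeholders, not the final spec.
from dataclasses import dataclass, field
from typing import Literal

TaskType = Literal["lookup", "synthesis", "source_sensitive", "unsupported"]

@dataclass
class BenchmarkExample:
    question: str
    reference_answer: str                  # empty or an explicit refusal for "unsupported" items
    expected_sources: list[str]            # guideline document IDs / filenames from the bundled corpus
    expected_pages: list[int] = field(default_factory=list)  # optional, when page numbers are known
    task_type: TaskType = "lookup"
    language: str = "en"                   # "sw" reserved for a possible Swahili split

# One record per line, e.g.:
example = BenchmarkExample(
    question="What is the recommended first-line treatment for X?",  # placeholder question
    reference_answer="Guideline Y recommends ...",                   # placeholder answer
    expected_sources=["guideline_y.pdf"],
    expected_pages=[12],
    task_type="lookup",
)
```

Keeping expected sources and pages separate from the reference answer is what lets the same record drive both retrieval-only and end-to-end scoring.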
Notes
This is more concrete than the broader RAG evaluation strategy in #34. The goal here is a repo-local benchmark artifact we can actually run in CI/dev evaluation.
Deliverables
- New dataset spec under evaluation/data/
- Documentation for how examples are generated and reviewed
- Initial benchmark split with enough examples to compare retrieval and answer quality meaningfully
Acceptance criteria
- The benchmark is grounded in the current bundled corpus, not just generic medical QA.
- It is usable for both retrieval-only and end-to-end RAG evaluation.
- Unsupported / abstention behavior is explicitly represented.
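As a rough illustration of how the same records could serve both evaluation modes, here is a sketch of a retrieval-only check (hit rate against the expected sources) and a minimal abstention check for unsupported items. The function names and the refusal-phrase heuristic are assumptions for illustration, not part of the deliverable.

```python
# Sketch of scoring one benchmark record in the two evaluation modes.
# `retrieved_source_ids` and `answer` stand in for the product's RAG pipeline
# outputs; the refusal-phrase heuristic is a placeholder for a proper
# abstention judge.

def retrieval_hit(retrieved_source_ids: list[str], expected_sources: list[str], k: int = 5) -> bool:
    """Retrieval-only metric: did any expected guideline appear in the top-k results?"""
    return any(src in retrieved_source_ids[:k] for src in expected_sources)

REFUSAL_MARKERS = ("not covered", "cannot answer", "no supporting guideline")

def abstention_ok(answer: str) -> bool:
    """End-to-end check for 'unsupported' items: the model should decline, not improvise."""
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def score_example(example, retrieved_source_ids: list[str], answer: str) -> dict:
    scores = {}
    if example.task_type == "unsupported":
        # No supporting source exists by construction, so only abstention is scored.
        scores["abstained"] = abstention_ok(answer)
    else:
        scores["retrieval_hit"] = retrieval_hit(retrieved_source_ids, example.expected_sources)
    return scores
```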