Why
The current benchmark suite is useful for overall model quality, but it is not tightly aligned with the actual product task: answering questions grounded in the documents bundled with MAM-AI.
We need a benchmark built from the currently collected guideline corpus so we can evaluate the product against the knowledge base it actually ships.
Scope
Create a new benchmark derived from the bundled docs / RAG corpus that includes:
- Direct lookup questions
- Multi-chunk synthesis questions
- Source-sensitive questions where the expected supporting guideline matters
- Negative / unsupported questions where the correct behavior is to abstain rather than produce unsupported claims
Each example should record enough structure to support both retrieval and generation evaluation.
Dataset fields
At minimum:
- question
- answer / reference answer
- expected supporting source(s)
- expected page(s) when feasible
- task type (lookup / synthesis / unsupported)
- language tag if Swahili is later added
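For concreteness, here is a minimal sketch of what a single record could look like. The field names, the dataclass shape, and the placeholder question are illustrative assumptions, not a finalized spec; the actual layout should be settled during spec review.

```python
# Hypothetical record schema for the benchmark file (e.g. JSON lines).
# Field names below are placeholders, not the final spec.
from dataclasses import dataclass, field
from typing import Literal

TaskType = Literal["lookup", "synthesis", "source_sensitive", "unsupported"]

@dataclass
class BenchmarkExample:
    question: str
    reference_answer: str                  # empty or an explicit refusal for "unsupported" items
    expected_sources: list[str]            # guideline document IDs / filenames from the bundled corpus
    expected_pages: list[int] = field(default_factory=list)  # optional, when page numbers are known
    task_type: TaskType = "lookup"
    language: str = "en"                   # "sw" reserved for a possible Swahili split

# One record per line, e.g.:
example = BenchmarkExample(
    question="What is the recommended first-line treatment for X?",  # placeholder question
    reference_answer="Guideline Y recommends ...",                   # placeholder answer
    expected_sources=["guideline_y.pdf"],
    expected_pages=[12],
    task_type="lookup",
)
```

Keeping expected sources and pages separate from the reference answer is what lets the same record drive both retrieval-only and end-to-end scoring.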
Notes
This is more concrete than the broader RAG evaluation strategy in #34. The goal here is a repo-local benchmark artifact we can actually run in CI/dev evaluation.
Deliverables
- New dataset spec under evaluation/data/
- Documentation for how examples are generated and reviewed
- Initial benchmark split with enough examples to compare retrieval and answer quality meaningfully
Acceptance criteria
- The benchmark is grounded in the current bundled corpus, not just generic medical QA.
- It is usable for both retrieval-only and end-to-end RAG evaluation.
- Unsupported / abstention behavior is explicitly represented.
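As a rough illustration of how the same records could serve both evaluation modes, here is a sketch of a retrieval-only check (hit rate against the expected sources) and a minimal abstention check for unsupported items. The function names and the refusal-phrase heuristic are assumptions for illustration, not part of the deliverable.

```python
# Sketch of scoring one benchmark record in the two evaluation modes.
# `retrieved_source_ids` and `answer` stand in for the product's RAG pipeline
# outputs; the refusal-phrase heuristic is a placeholder for a proper
# abstention judge.

def retrieval_hit(retrieved_source_ids: list[str], expected_sources: list[str], k: int = 5) -> bool:
    """Retrieval-only metric: did any expected guideline appear in the top-k results?"""
    return any(src in retrieved_source_ids[:k] for src in expected_sources)

REFUSAL_MARKERS = ("not covered", "cannot answer", "no supporting guideline")

def abstention_ok(answer: str) -> bool:
    """End-to-end check for 'unsupported' items: the model should decline, not improvise."""
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def score_example(example, retrieved_source_ids: list[str], answer: str) -> dict:
    scores = {}
    if example.task_type == "unsupported":
        # No supporting source exists by construction, so only abstention is scored.
        scores["abstained"] = abstention_ok(answer)
    else:
        scores["retrieval_hit"] = retrieval_hit(retrieved_source_ids, example.expected_sources)
    return scores
```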