
Build a doc-grounded benchmark from the bundled clinical guideline corpus #42

@nmrenyi

Description


Why

The current benchmark suite is useful for assessing overall model quality, but it is not tightly aligned with the actual product task: answering questions grounded in the documents bundled with MAM-AI.

We need a benchmark built from the currently collected guideline corpus so we can evaluate the product against the knowledge base it actually ships with.

Scope

Create a new benchmark derived from the bundled docs / RAG corpus that includes:

  • Direct lookup questions
  • Multi-chunk synthesis questions
  • Source-sensitive questions where the expected supporting guideline matters
  • Negative / unsupported questions where the correct behavior is to abstain rather than make unsupported claims

Each example should record enough structure to support both retrieval and generation evaluation.

Dataset fields

At minimum (a sketch record follows the list):

  • question
  • answer / reference answer
  • expected supporting source(s)
  • expected page(s) when feasible
  • task type (lookup / synthesis / unsupported)
  • language tag, to accommodate Swahili if it is added later
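
A minimal sketch of one record, assuming the dataset is stored as JSONL under evaluation/data/; the file name, field names, and values are illustrative placeholders, not a final schema:

```python
# Sketch of one benchmark record as it might be stored in
# evaluation/data/guideline_benchmark.jsonl (hypothetical path).
import json

example = {
    "id": "lookup-0001",
    "question": "...",                         # question text
    "reference_answer": "...",                 # gold / reference answer
    "expected_sources": ["guideline_a.pdf"],   # expected supporting guideline(s)
    "expected_pages": [12],                    # optional, when feasible
    "task_type": "lookup",                     # lookup | synthesis | unsupported
    "language": "en",                          # room for "sw" later
}

with open("evaluation/data/guideline_benchmark.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

For "unsupported" examples, expected_sources and expected_pages would be empty and the reference answer would describe the expected abstention.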

Notes

This is more concrete than the broader RAG evaluation strategy in #34. The goal here is a repo-local benchmark artifact we can actually run in CI/dev evaluation.

Deliverables

  • New dataset spec under evaluation/data/
  • Documentation for how examples are generated and reviewed
  • Initial benchmark split with enough examples to compare retrieval and answer quality meaningfully

Acceptance criteria

  • The benchmark is grounded in the current bundled corpus, not generic medical QA only.
  • It is usable for both retrieval-only and end-to-end RAG evaluation (see the sketch after this list).
  • Unsupported / abstention behavior is explicitly represented.
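
A rough sketch of how both evaluation modes could consume the dataset. The retrieve and generate_answer hooks, and the chunk/answer attributes they return, are assumptions about the RAG pipeline interface, not existing functions in this repo:

```python
# Sketch of retrieval-only and abstention scoring over the benchmark.
import json

def load_benchmark(path):
    """Load JSONL benchmark records (schema sketched above)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def retrieval_recall(examples, retrieve, k=5):
    """Fraction of examples whose expected source appears in the top-k retrieved chunks."""
    hits, scored = 0, 0
    for ex in examples:
        if not ex["expected_sources"]:
            continue  # unsupported questions have no gold source
        scored += 1
        retrieved = {chunk.source for chunk in retrieve(ex["question"], k=k)}
        if retrieved & set(ex["expected_sources"]):
            hits += 1
    return hits / max(scored, 1)

def abstention_rate(examples, generate_answer):
    """Fraction of 'unsupported' questions where the system declines to answer."""
    unsupported = [ex for ex in examples if ex["task_type"] == "unsupported"]
    declined = sum(1 for ex in unsupported if generate_answer(ex["question"]).abstained)
    return declined / max(len(unsupported), 1)
```

End-to-end answer quality scoring would sit on top of the same records, comparing generated answers against reference_answer.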
