Automating multimodal evidence seeking for agentic clinical reasoning
Project page: https://ucsc-vlaa.github.io/ClinSeekAgent/
Quick Start • Data • Docs • Training • Responsible Use
ClinSeekAgent is a multimodal evidence-seeking pipeline for agentic clinical reasoning. It gives a host model access to patient-level EHR retrieval, browser tools for external medical knowledge, and medical imaging tools, then evaluates the model under the Automated Evidence-Seeking setting instead of the paired Curated Input setting.
This repository is prepared as the public code release for:
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
Important
This public release intentionally does not include raw MIMIC data, generated patient databases, chest X-ray files, private trajectories, model weights, or experiment logs. Data and model artifacts should be released separately on Hugging Face after the relevant access and license checks.
ClinSeekAgent shifts evaluation from passive consumption of pre-selected evidence to active evidence acquisition across raw EHR tables, external medical knowledge search, and medical imaging. Compared with the paired Curated Input setting (same task, same label, but evidence pre-selected by the source benchmark):
Text-only EHR tasks (ClinSeek-Bench, overall F1):
| Host model | Curated Input | ClinSeekAgent | Δ |
|---|---|---|---|
| Claude Opus 4.6 | 60.0 | 63.2 | +3.2 |
| MiniMax M2.5 | 43.1 | 47.3 | +4.2 |
7 / 9 evaluated host models improve on the risk-prediction split.
Multimodal tasks (ClinSeek-Bench, overall F1):
| Host model | Curated Input | ClinSeekAgent | Δ |
|---|---|---|---|
| Claude Opus 4.6 | 47.5 | 62.6 | +15.1 |
| Claude Sonnet 4.6 | 48.0 | 54.9 | +6.9 |
| Qwen3-VL-235B | 43.9 | 49.8 | +5.9 |
| Gemma-4-26B-A4B-it | 38.2 | 44.9 | +6.7 |
5 / 6 evaluated host models improve overall; on Phenotype reasoning Opus 4.6 alone gains +34.0 points.
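The tables in this section report F1. As a rough illustration of the metric only, the sketch below computes a set-based F1 between predicted and gold label sets. The function name `set_f1` and the example labels are hypothetical, and the scorers under clinseekagent/ may normalize or aggregate differently.

```python
def set_f1(predicted, gold):
    """Set-based F1 between predicted and gold labels (illustrative only)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 of 3 predictions are correct, 2 of 4 gold labels are found.
score = set_f1({"sepsis", "aki", "copd"}, {"sepsis", "aki", "chf", "dm2"})
print(f"{score:.3f}")
```

Here precision is 2/3 and recall is 2/4, so F1 is 4/7. The reported numbers are this kind of score scaled to 0-100 and aggregated over tasks.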
Distilled student (AgentEHR-Bench, average F1):
| Model | Average F1 |
|---|---|
| Qwen3.5-35B-A3B (base) | 22.1 |
| ClinSeek-35B-A3B (ours, SFT on ClinSeekAgent trajectories) | 34.0 (+11.9) |
| Claude Sonnet 4.6 | 32.7 |
| Claude Opus 4.6 (teacher) | 36.0 |
ClinSeek-35B-A3B is the strongest open-source model in our evaluation, surpassing Kimi K2.5 (29.9), MiniMax M2.5 (27.7), GLM-4.7 (27.6), and Qwen3-235B-A22B (20.5), while reaching 94.4% of its teacher's performance.
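The gains and the teacher ratio quoted above follow directly from the table values. A small sketch that recomputes them, using only numbers copied from this README (the variable names are illustrative):

```python
# (Curated Input, ClinSeekAgent) overall F1, copied from the multimodal table.
multimodal = {
    "Claude Opus 4.6": (47.5, 62.6),
    "Claude Sonnet 4.6": (48.0, 54.9),
    "Qwen3-VL-235B": (43.9, 49.8),
    "Gemma-4-26B-A4B-it": (38.2, 44.9),
}
deltas = {name: round(agent - curated, 1) for name, (curated, agent) in multimodal.items()}

# Distilled student versus base and teacher on AgentEHR-Bench (average F1).
base_f1, student_f1, teacher_f1 = 22.1, 34.0, 36.0
gain = round(student_f1 - base_f1, 1)
teacher_ratio = round(100 * student_f1 / teacher_f1, 1)

print(deltas)
print(gain, teacher_ratio)
```

The rounded results match the Δ column, the +11.9 gain over the base model, and the 94.4% teacher ratio.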
| Path | Purpose |
|---|---|
| clinseekagent/ | Automated Evidence-Seeking and Curated Input drivers, LLM backends (Bedrock + vLLM), tool pools, and scorers |
| src/run_mcp_server.py | EHR MCP server |
| src/agentlite/mcp_tools/ | EHR table, SQL, candidate, and utility tools |
| src/mcp_image/ | Medical-image MCP server and image tools |
| verl/ | Vendored VERL training code and ClinSeekAgent SFT recipes |
| scripts/ | Public launchers for MCP servers, evaluation, vLLM, and SFT |
| docs/ | Release, data, benchmark, and training documentation |
| venvs/requirements/ | Per-role dependency files (agent driver, EHR MCP, image MCP, SFT training) |
| assets/ | Figures used in this README (teaser, performance plot, case study) |
| examples/ | Synthetic manifest examples only |
ClinSeekAgent is intentionally split into four roles with separate dependency files so you only install what you need. The agent driver and the MCP servers should each live in their own venv: the agent driver is a thin HTTP/SDK client, but the MCP servers and SFT training pull in heavy GPU and ML stacks.
cp .env.example .env # fill values for the roles you plan to run
Agent driver: runs inference and scoring. No GPU needed. Install this whether you use Bedrock or a self-hosted vLLM endpoint.
python -m venv .venvs/agent && source .venvs/agent/bin/activate
pip install -r venvs/requirements/bedrock_agent.txt
EHR MCP server: serves ehr.* tool calls over MCP. GPU optional (used only for BioLORD semantic search). Install only on the host that runs scripts/run_ehr_mcp.sh.
python -m venv .venvs/mcp-ehr && source .venvs/mcp-ehr/bin/activate
# Install a torch build matching your CUDA before pip-installing this file.
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r venvs/requirements/mcp_ehr.txt
Image MCP server: serves image.* CXR tool calls (classifier, report generator, phrase grounding, segmentation). GPU required (4 GPUs recommended). Install only on the host that runs scripts/run_image_mcp.sh.
python -m venv .venvs/mcp-image && source .venvs/mcp-image/bin/activate
pip install --pre torch==2.9.0+cu128 torchvision==0.24.0+cu128 \
--index-url https://download.pytorch.org/whl/cu128
pip install -r venvs/requirements/mcp_image.txt
Important: the image MCP pins transformers==4.46.3 to keep MAIRA-2 working. Do not co-install this venv with the agent venv.
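Because the pin matters for MAIRA-2, it can help to verify it before starting the image MCP server. A minimal sketch using only the standard library; the helper name `pin_matches` is hypothetical and not part of this repository:

```python
from importlib import metadata


def pin_matches(package, expected):
    """Return True if `package` is installed at exactly version `expected`."""
    try:
        return metadata.version(package) == expected
    except metadata.PackageNotFoundError:
        return False


# Run inside the image MCP venv; the pinned version comes from the note above.
if not pin_matches("transformers", "4.46.3"):
    print("transformers is missing or not 4.46.3 in this environment")
```

Running the same check inside the agent venv is a quick way to confirm the two environments have not been mixed.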
SFT training: distill ClinSeekAgent trajectories into a smaller student (paper recipe: Qwen3.5-35B-A3B on 8× H200). Install only on the training node.
python -m venv .venvs/sft && source .venvs/sft/bin/activate
pip install -r venvs/requirements/sft_training.txt
See docs/sft_training.md for the full paper recipe.
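The four roles and their requirement files can be summarized in a small lookup. This is a sketch, not a script shipped with the repository: the short role keys and both helper functions are hypothetical, and the role-specific torch installs shown above are deliberately left out.

```python
from pathlib import Path

# Role -> (venv directory, requirements file), as listed in the install steps above.
ROLES = {
    "agent": (".venvs/agent", "venvs/requirements/bedrock_agent.txt"),
    "mcp-ehr": (".venvs/mcp-ehr", "venvs/requirements/mcp_ehr.txt"),
    "mcp-image": (".venvs/mcp-image", "venvs/requirements/mcp_image.txt"),
    "sft": (".venvs/sft", "venvs/requirements/sft_training.txt"),
}


def install_commands(role):
    """Return the shell commands that set up one role's venv (nothing is executed)."""
    venv_dir, requirements = ROLES[role]
    return [
        f"python -m venv {venv_dir}",
        f"source {venv_dir}/bin/activate",
        f"pip install -r {requirements}",
    ]


def missing_requirements(repo_root="."):
    """List roles whose requirements file is absent under `repo_root`."""
    root = Path(repo_root)
    return [role for role, (_, req) in ROLES.items() if not (root / req).is_file()]


for line in install_commands("agent"):
    print(line)
```

Calling `missing_requirements()` from the repository root is a cheap sanity check that the checkout is complete before creating any venv.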
Data is not stored in Git. Set CLINSEEK_DATA_ROOT to a prepared benchmark tree after obtaining the required credentialed datasets from their official sources.
export CLINSEEK_DATA_ROOT=./data/clinseek_bench
export CLINSEEK_MODEL_DIR=./models/clinseek-35b-a3b
The ClinSeekAgent Hugging Face collection contains the released model, benchmark metadata, and evaluation-result artifacts:
- ClinSeek-Bench data: UCSC-VLAA/ClinSeek-Bench
- ClinSeek-35B-A3B model checkpoint: UCSC-VLAA/ClinSeek-35B-A3B
- Evaluation results: UCSC-VLAA/ClinSeek-Evaluation-Results
See RESOURCES.md and the documentation below for access notes and expected release structure.
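Before launching anything, it can be useful to confirm that CLINSEEK_DATA_ROOT and CLINSEEK_MODEL_DIR point at real directories. A minimal sketch; the helper `resolve_env_dir` is hypothetical and only checks that the directory exists, not that its layout matches the benchmark:

```python
import os
from pathlib import Path


def resolve_env_dir(name, environ=None):
    """Return the directory an environment variable points to, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    path = Path(value).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"{name}={value} is not an existing directory")
    return path


for var in ("CLINSEEK_DATA_ROOT", "CLINSEEK_MODEL_DIR"):
    try:
        print(var, "->", resolve_env_dir(var))
    except RuntimeError as err:
        print("check failed:", err)
```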
For the text-only split of ClinSeek-Bench, use the following release guides:
- docs/ClinSeek-Bench_text_data_prepare.md: prepare the patient-level EHR assets used by ClinSeekAgent, including MIMIC-IV / MIMIC-IV-Note / MIMIC-IV-ED layout and per-patient SQLite database generation.
- docs/ClinSeek-Bench_text_evaluation.md: run ClinSeekAgent under the Automated Evidence-Seeking setting, where the model retrieves evidence from raw EHR tables through ClinSeekAgent tools.
- docs/ClinSeek-Bench_text_curated_input_evaluation.md: run the paired Curated Input baseline, where the model answers from the benchmark-provided evidence package without tool access.
Start an EHR MCP server:
bash scripts/run_ehr_mcp.sh
Run a text-only evaluation file:
DATA_PATH=examples/synthetic_text_sample.jsonl \
OUTPUT_DIR=outputs/text_smoke \
bash scripts/run_text_eval.sh
For multimodal runs, start the image MCP server and provide a manifest whose image paths resolve under BENCH_ROOT:
bash scripts/run_image_mcp.sh
DATA_PATH=examples/synthetic_multimodal_sample.jsonl \
OUTPUT_DIR=outputs/mm_smoke \
bash scripts/run_mm_eval.sh
The example files are schema examples, not a replacement for the benchmark data.
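Since multimodal manifests must reference images that resolve under BENCH_ROOT, a pre-flight check can catch broken paths early. The sketch below assumes a JSONL manifest in which each record may carry a relative image path under a key named `image_path`; that key name is hypothetical, so adjust it to the real schema. It requires Python 3.9+ for `Path.is_relative_to`.

```python
import json
from pathlib import Path


def paths_outside_root(manifest_path, bench_root, field="image_path"):
    """Return (line number, path) pairs that are missing or fall outside `bench_root`.

    `field` is a hypothetical key name; adjust it to the real manifest schema.
    """
    root = Path(bench_root).resolve()
    problems = []
    with open(manifest_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            value = record.get(field)
            if value is None:
                continue  # record without an image, nothing to check
            candidate = (root / value).resolve()
            if not candidate.is_relative_to(root) or not candidate.is_file():
                problems.append((line_no, value))
    return problems
```

An empty list means every referenced image exists under the root. Resolving each path before the check also rejects entries that escape the root through `..` segments.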
- Automated Evidence-Seeking text-only evaluation over patient-level EHR tables and candidate sets.
- Multimodal evaluation that combines EHR evidence with linked chest X-ray inputs.
- Curated Input baselines for comparing pre-selected evidence against tool-mediated evidence seeking.
- MCP tool serving for EHR tables, SQL-style access, candidates, image inputs, and utility tools.
- SFT data preparation and training recipes using the vendored verl/ training code.
Prepare trajectory parquet files:
python verl/examples/sft/clinseek/prepare_clinseek_data.py \
--repo_id <hf-org-or-user>/<trajectory-dataset> \
--filename clinseek_trajectories.jsonl \
--model_name Qwen/Qwen3.5-35B-A3B \
--max_token_length 52000 \
--output_dir data/clinseek_trajectory_qwen35_52k
Run SFT with the vendored VERL training code:
TRAIN_FILES=data/clinseek_trajectory_qwen35_52k/train.parquet \
VAL_FILES=data/clinseek_trajectory_qwen35_52k/val.parquet \
MODEL_PATH=./models/Qwen3.5-35B-A3B \
bash scripts/train_sft.sh
More details are in docs/sft_training.md.
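The preparation step takes a `--max_token_length` limit. How the script applies it is not described here, so the sketch below only illustrates one plausible behavior: dropping trajectories that exceed the limit. It assumes a record field named `text` (hypothetical) and uses a whitespace split as a crude stand-in for the model tokenizer, so its counts will not match real token counts.

```python
def filter_by_token_length(records, max_tokens, count_tokens=None, text_key="text"):
    """Keep records whose token count is at most `max_tokens`.

    `count_tokens` defaults to a whitespace split, a crude stand-in for the
    model tokenizer; `text_key` is a hypothetical field name.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    kept, dropped = [], 0
    for record in records:
        if count_tokens(record[text_key]) <= max_tokens:
            kept.append(record)
        else:
            dropped += 1
    return kept, dropped


# Toy data with a tiny limit so the effect is visible.
samples = [
    {"text": "short trajectory"},
    {"text": "a much longer trajectory with many more tokens"},
]
kept, dropped = filter_by_token_length(samples, max_tokens=4)
print(len(kept), dropped)
```

For a realistic count, pass a `count_tokens` callable backed by the tokenizer of the model being trained.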
ClinSeekAgent is for research on clinical evidence seeking. It is not a medical device and must not be used for clinical diagnosis, treatment, triage, or patient management without separate validation, governance, and regulatory review.
If you use ClinSeekAgent, ClinSeek-Bench, or ClinSeek-35B-A3B, please cite:
@misc{wu2026clinseekagent,
title = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
author = {Wu, Juncheng and Zhang, Letian and Wang, Yuhan and Tu, Haoqin and Chen, Hardy and Wang, Zijun and Xie, Cihang and Zhou, Yuyin},
year = {2026},
eprint = {2605.20176},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.20176}
}
