
PhenoGPT2

PhenoGPT2 is an advanced phenotype recognition model that leverages the capabilities of large language models. It is an improved version of PhenoGPT (Jingye et al. 2023). The model is fine-tuned on synthetic medical data generated by Llama 3.1 70B, de-identified MIMIC-IV clinical notes, and the Human Phenotype Ontology database to improve prediction accuracy and HPO alignment. Like general-purpose GPT models, PhenoGPT2 can process diverse clinical text, giving it broad flexibility. For greater precision and specialization, you have the option to further fine-tune PhenoGPT2 on your own clinical datasets.

PhenoGPT2 is distributed under the MIT License by Wang Genomics Lab.

Contents

Installation

  1. Clone this repository and navigate to the PhenoGPT2 folder
git clone https://github.com/WGLab/PhenoGPT2.git
cd PhenoGPT2
  2. Install system/conda dependencies.
conda create -n phenogpt2 python=3.11
conda activate phenogpt2
conda install pandas numpy scikit-learn matplotlib seaborn requests
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit
conda install -c nvidia cuda-compiler
conda install -c conda-forge jupyter
conda install intel-openmp blas mpi4py
conda install -c anaconda ipykernel
conda install pytorch::faiss-cpu
conda install -c conda-forge libstdcxx-ng libgcc-ng
python -m ipykernel install --user --name=phenogpt2
  3. Install PhenoGPT2 packages
pip install --upgrade pip
pip install -e .
python -m spacy download en_core_web_sm
  4. Install extra pip-only package
## Make sure to load the CUDA module properly before installing flash-attn
# Try pip installing the wheel below first; if it fails, load the CUDA module (e.g., module load CUDA/12.1.1) and retry
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
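
After installation, a quick sanity check (a minimal sketch, not part of the repository) confirms that PyTorch sees the GPU and that flash-attn imports cleanly:

python - <<'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())  # should print True on a GPU node
print("Torch version:", torch.__version__)
try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; reinstall the wheel above after loading the CUDA module")
EOF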

Model Download

  • PhenoGPT2 is built upon the LLaMA 3.1 8B model, so please apply for access first. Apply here
  • OPTIONAL: You can download the HPO Aware Pretrain model first if you want to fine-tune on your own extraction/normalization data or use LoRA variants.
  • Then, you can simply download either PhenoGPT2-Short or PhenoGPT2-EHR (full parameters) for inference.
  • If you plan to extract phenotypes from images, also download PhenoGPT2-Vision.
  • ATTENTION: PhenoGPT2 is in testing. To access the model weights, please contact us.
  • LLaVA-Med delivers the best performance, but its installation requires manual modifications to the original code, which can be complex. Please contact us if you wish to use the LLaVA-Med version. Otherwise, the fine-tuned LLaMA 3.2 11B Vision-Instruct offers seamless integration.
Model Descriptions

| Model | Module | Base Model | 🤗 Huggingface Hub |
| --- | --- | --- | --- |
| HPO Aware Pretrain | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-Short | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-EHR (main) | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-Vision | Vision | LLaVA-Med/LLaMA | Not released yet |
| PhenoGPT2-Vision (default) | Vision | LLaMA 3.2 11B Vision-Instruct | Not released yet |
  • If you plan to fine-tune or pretrain the models from scratch, make sure to download the original base model weights from the Meta and LLaVA-Med repos.
  • Save all models in the ./models directory.

Data Input Guide

  • Input files (for inference) should be a dictionary (key: patient ID, value: patient metadata) or a list of such dictionaries, saved in JSON or PICKLE format (.json or .pkl). Each patient dictionary should have the following format:
{
  "pid1": {
    "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
    "image": NaN,
    "pid": "pid1"
  },
  "pid2": {
    "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
    "image": "image_pid2.png",
    "pid": "pid2"
  }
}
  • Please see the ./data/example folder for reference; a minimal sketch for preparing such an input file follows below.
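
The Python sketch below (the file names and note text are illustrative, not shipped with the repository) builds the dictionary above and saves it in both accepted formats:

import json
import pickle

patients = {
    "pid1": {
        "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
        "image": None,               # no image for this patient
        "pid": "pid1",
    },
    "pid2": {
        "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
        "image": "image_pid2.png",   # relative path to the patient's image
        "pid": "pid2",
    },
}

with open("my_patients.json", "w") as f:   # JSON input
    json.dump(patients, f, indent=2)

with open("my_patients.pkl", "wb") as f:   # PICKLE input
    pickle.dump(patients, f)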

JSON-formatted answer

  • Ideally, the output files include the raw results in phenogpt2_repX.json:
{
  "pid1": {
    "text": {
      "demographics": {
        "age": "1-year-old",
        "sex": "male",
        "ethnicity": "Korean",
        "race": "Asian"
      },
      "phenotypes": {
        "persistent fever": {
          "HPO_ID": "HP:0033399", "onset": "unknown"
        },
        "shortness of breath": {
          "HPO_ID": "HP:0002094", "onset": "unknown"
        },
        "brachycephaly": {
          "HPO_ID": "HP:0000248", "onset": "5 months old"
        }
      },
      "pid": "pid1"
    },
    "image": {}
  },
  ...
}
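
For downstream analysis, a short sketch like the following (the file path assumes the example_testing run shown later and replicate index 0; adjust to your own output naming) flattens the results into tabular rows:

import json

with open("example_testing/phenogpt2_rep0.json") as f:
    results = json.load(f)

rows = []
for pid, record in results.items():
    for term, info in record.get("text", {}).get("phenotypes", {}).items():
        rows.append((pid, term, info.get("HPO_ID"), info.get("onset")))

for row in rows:
    print(row)   # (patient id, phenotype term, HPO ID, onset)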

WARNING

However, due to the nature of LLMs, the generated output sometimes does not conform to valid JSON. In that case you will receive an "error_response" in the answer instead of the demographics and phenotypes fields. This usually means the JSON was malformed because of repetitive outputs or unexpected strings. We suggest checking these cases manually or rerunning with modified notes (you can try to denoise the note first).
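
One way to find such cases (a sketch assuming the output layout shown above) is to scan the result file for "error_response" entries and collect the affected patient IDs for manual review or a rerun:

import json

with open("example_testing/phenogpt2_rep0.json") as f:
    results = json.load(f)

failed = [pid for pid, record in results.items()
          if "error_response" in record.get("text", {})]
print(len(failed), "notes need manual review or a rerun:", failed)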

Inference

If you simply want to run PhenoGPT2 on your local machine for inference, the fine-tuned models are saved in the models directory. Make sure to format your input data as described above before running inference.

Please note that the first run may take some time as it needs to load all the models. Subsequent runs will be significantly faster.

Please use the following command (together with your scheduler system, e.g., SLURM):

bash run_inference.sh -i ./data/example/text_examples.json \
         -o example_testing \
         -model_dir ./models/phenogpt2/ \
         -index 0 \
         -negation \
         -wc 0
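
On a SLURM cluster, the same command can be submitted through a minimal batch script (a sketch; the GPU count, memory, and walltime below are placeholders to adapt to your system):

#!/bin/bash
#SBATCH --job-name=phenogpt2
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=12:00:00

bash run_inference.sh -i ./data/example/text_examples.json \
         -o example_testing \
         -model_dir ./models/phenogpt2/ \
         -index 0 \
         -negation \
         -wc 0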

Required Arguments

| Argument | Description |
| --- | --- |
| -i, --input | Required. Path to your input data. Can be a .json, .pkl, or a folder containing .txt or image files. |
| -o, --output | Required. Output directory name. This is where results will be saved. The directory will be created if it does not exist. |

Optional Arguments

| Argument | Description |
| --- | --- |
| -model_dir, --model_dir | Path to the base model directory (e.g., a pretrained LLaVA or LLaMA 3 model). If not provided, defaults will be used. |
| -lora, --lora | Enable this flag if your model is LoRA-adapted. |
| -index, --index | Identifier string for saving outputs. Useful for tracking multiple runs. |
| -negation, --negation | By default, negation filtering is disabled. Use this flag to enable it. |
| --text_only | Use only the text module of the model, ignoring visual inputs. |
| --vision_only | Use only the vision module, ignoring text inputs. |
| -vision, --vision | Choose the vision model. Options: llava-med or llama-vision (default). This flag is used together with the text module; to run the vision module alone, use --vision_only instead. |
| -wc, --wc | Word count per chunk. Use this to split long text into smaller chunks (default is 0, meaning no splitting). We recommend either full length (no split) or 300/384 words per chunk (improving recall), depending on your task. A sketch of this chunking logic follows below. |
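
To illustrate what the -wc option does, here is a hypothetical re-implementation of word-count chunking (not the code run_inference.sh actually uses):

def split_by_word_count(note: str, wc: int) -> list[str]:
    """Split a note into chunks of at most wc words; wc=0 means no splitting."""
    words = note.split()
    if wc <= 0 or len(words) <= wc:
        return [note]
    return [" ".join(words[i:i + wc]) for i in range(0, len(words), wc)]

chunks = split_by_word_count(open("note.txt").read(), wc=300)
print(len(chunks), "chunks")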

Pretraining & Fine-tuning

You can reproduce the PhenoGPT2 model with your own datasets or other foundation models.

Text Module

  1. Pretrain your model on synthetic data compiled from the HPO database to obtain the HPO Aware Pretrained model.
  2. Then, fine-tune the HPO Aware Pretrained model on synthetic training and validation data compiled from MIMIC-IV and PhenoPackets; a generic sketch follows after this list.
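
The exact procedures are defined by the repository's training scripts. As a rough orientation only, a generic LoRA instruction fine-tuning setup with Hugging Face transformers and peft looks like the sketch below (the base-model path, dataset files, and every hyperparameter are assumptions, not the values used to train PhenoGPT2):

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "./models/llama-3.1-8b"   # assumed local path to the Meta base weights
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA keeps fine-tuning affordable; rank and target modules here are illustrative
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

data = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

data = data.map(tokenize, batched=True, remove_columns=data["train"].column_names)

args = TrainingArguments(output_dir="./models/phenogpt2-ft", per_device_train_batch_size=1,
                         gradient_accumulation_steps=16, num_train_epochs=1, bf16=True)
Trainer(model=model, args=args, train_dataset=data["train"], eval_dataset=data["validation"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()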

Vision Module

  1. If you want to fine-tune the LLaVA-Med model, we recommend following the instructions in the LLaVA GitHub, but changing the weights to LLaVA-Med.
  2. Otherwise, you can use our phenogpt2_vision_training.py to fine-tune LLaMA Vision (or other models with similar architectures).

Developers

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania

Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia

Citations

The publication is in preparation. Thank you for reading! In the meantime, please cite our GitHub repository if you use PhenoGPT2.

@misc{nguyen2025phenogpt2,
  author = {Quan Minh Nguyen and Kai Wang},
  title = {PhenoGPT2},
  year = {2025},
  howpublished = {\url{https://github.com/WGLab/PhenoGPT2}},
  note = {Accessed: 2025-08-07}
}
