
PhenoGPT2

PhenoGPT2 is an advanced phenotype recognition model that leverages the capabilities of large language models. It is an improved version of PhenoGPT (Jingye et al. 2023). The model is fine-tuned on synthetic medical data generated by Llama 3.1 70B, de-identified MIMIC-IV clinical notes, and the Human Phenotype Ontology database to improve prediction accuracy and HPO alignment. Like general-purpose GPT models, PhenoGPT2 can process diverse clinical text, giving it broad flexibility. For greater precision and specialization, you have the option to further fine-tune PhenoGPT2 on your own clinical datasets.

PhenoGPT2 is distributed under the MIT License by Wang Genomics Lab.

Contents

Installation

  1. Clone this repository and navigate to the PhenoGPT2 folder
git clone https://github.com/WGLab/PhenoGPT2.git
cd PhenoGPT2
  2. Install system/conda dependencies.
conda create -n phenogpt2 python=3.11
conda activate phenogpt2
conda install pandas numpy scikit-learn matplotlib seaborn requests
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit
conda install -c nvidia cuda-compiler
conda install -c conda-forge jupyter
conda install intel-openmp blas mpi4py
conda install -c anaconda ipykernel
conda install pytorch::faiss-cpu
conda install -c conda-forge libstdcxx-ng libgcc-ng
python -m ipykernel install --user --name=phenogpt2
  3. Install PhenoGPT2 packages
pip install --upgrade pip
pip install -e .
python -m spacy download en_core_web_sm
  4. Install extra pip-only package
## Make sure to load the CUDA module properly before installing flash-attn
# Try pip installing the wheel below first; if it fails, load the CUDA module (e.g., module load CUDA/12.1.1) and retry
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
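
After installation, a quick sanity check (a minimal sketch, not part of the repository) confirms that PyTorch sees the GPU and that flash-attn imports cleanly:

python - <<'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())  # should print True on a GPU node
print("Torch version:", torch.__version__)
try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; reinstall the wheel above after loading the CUDA module")
EOF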

Model Download

  • PhenoGPT2 is built upon the LLaMA 3.1 8B model, so please apply for access first. Apply here
  • OPTIONAL: You can download the HPO Aware Pretrain model first if you want to fine-tune on your own extraction/normalization data or use LoRA variants.
  • Then, you can simply download either PhenoGPT2-Short or PhenoGPT2-EHR (full parameters) for inference.
  • If you plan to extract phenotypes from images, also download PhenoGPT2-Vision.
  • ATTENTION: PhenoGPT2 is in testing. To access the model weights, please contact us.
  • LLaVA-Med delivers the best performance, but its installation requires manual modifications to the original code, which can be complex. Please contact us if you wish to use the LLaVA-Med version. Otherwise, the fine-tuned LLaMA 3.2 11B Vision-Instruct offers seamless integration.
Model Descriptions

| Model | Module | Base Model | 🤗 Huggingface Hub |
| --- | --- | --- | --- |
| HPO Aware Pretrain | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-Short | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-EHR (main) | Text | LLaMA 3.1 8B | Not released yet |
| PhenoGPT2-Vision | Vision | LLaVA-Med/LLaMA | Not released yet |
| PhenoGPT2-Vision (default) | Vision | LLaMA 3.2 11B Vision-Instruct | Not released yet |
  • If you plan to fine-tune or pretrain the models from scratch, make sure to download the original base model weights from the Meta and LLaVA-Med repos.
  • Save all models in the ./models directory.

Data Input Guide

  • Input files (for inference) should be a dictionary (key: patient ID, value: patient metadata) or a list of such dictionaries, saved in JSON or PICKLE format (.json or .pkl). Each patient dictionary should have the following format:
{
  "pid1": {
    "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
    "image": NaN,
    "pid": "pid1"
  },
  "pid2": {
    "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
    "image": "image_pid2.png",
    "pid": "pid2"
  }
}
  • Please see the ./data/example folder for reference; a minimal sketch for preparing such an input file follows below.
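
The Python sketch below (the file names and note text are illustrative, not shipped with the repository) builds the dictionary above and saves it in both accepted formats:

import json
import pickle

patients = {
    "pid1": {
        "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
        "image": None,               # no image for this patient
        "pid": "pid1",
    },
    "pid2": {
        "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
        "image": "image_pid2.png",   # relative path to the patient's image
        "pid": "pid2",
    },
}

with open("my_patients.json", "w") as f:   # JSON input
    json.dump(patients, f, indent=2)

with open("my_patients.pkl", "wb") as f:   # PICKLE input
    pickle.dump(patients, f)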

JSON-formatted answer

  • Ideally, the output files include the raw results in phenogpt2_repX.json:
{
  "pid1": {
    "text": {
      "demographics": {
        "age": "1-year-old",
        "sex": "male",
        "ethnicity": "Korean",
        "race": "Asian"
      },
      "phenotypes": {
        "persistent fever": {
          "HPO_ID": "HP:0033399", "onset": "unknown"
        },
        "shortness of breath": {
          "HPO_ID": "HP:0002094", "onset": "unknown"
        },
        "brachycephaly": {
          "HPO_ID": "HP:0000248", "onset": "5 months old"
        }
      },
      "pid": "pid1"
    },
    "image": {}
  },
  ...
}
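
For downstream analysis, a short sketch like the following (the file path assumes the example_testing run shown later and replicate index 0; adjust to your own output naming) flattens the results into tabular rows:

import json

with open("example_testing/phenogpt2_rep0.json") as f:
    results = json.load(f)

rows = []
for pid, record in results.items():
    for term, info in record.get("text", {}).get("phenotypes", {}).items():
        rows.append((pid, term, info.get("HPO_ID"), info.get("onset")))

for row in rows:
    print(row)   # (patient id, phenotype term, HPO ID, onset)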

WARNING

However, due to the nature of LLMs, the generated output sometimes does not conform to valid JSON. In that case you will receive an "error_response" in the answer instead of the demographics and phenotypes fields. This usually means the JSON was malformed because of repetitive outputs or unexpected strings. We suggest checking these cases manually or rerunning with modified notes (you can try to denoise the note first).
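
One way to find such cases (a sketch assuming the output layout shown above) is to scan the result file for "error_response" entries and collect the affected patient IDs for manual review or a rerun:

import json

with open("example_testing/phenogpt2_rep0.json") as f:
    results = json.load(f)

failed = [pid for pid, record in results.items()
          if "error_response" in record.get("text", {})]
print(len(failed), "notes need manual review or a rerun:", failed)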

Inference

If you simply want to run PhenoGPT2 on your local machine for inference, the fine-tuned models are saved in the models directory. Make sure to format your input data as described above before running inference.

Please note that the first run may take some time as it needs to load all the models. Subsequent runs will be significantly faster.

Please use the following command (together with your scheduler system, e.g., SLURM):

bash run_inference.sh -i ./data/example/text_examples.json \
         -o example_testing \
         -model_dir ./models/phenogpt2/ \
         -index 0 \
         -negation \
         -wc 0
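
On a SLURM cluster, the same command can be submitted through a minimal batch script (a sketch; the GPU count, memory, and walltime below are placeholders to adapt to your system):

#!/bin/bash
#SBATCH --job-name=phenogpt2
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=12:00:00

bash run_inference.sh -i ./data/example/text_examples.json \
         -o example_testing \
         -model_dir ./models/phenogpt2/ \
         -index 0 \
         -negation \
         -wc 0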

Required Arguments

| Argument | Description |
| --- | --- |
| -i, --input | Required. Path to your input data. Can be a .json, .pkl, or a folder containing .txt or image files. |
| -o, --output | Required. Output directory name. This is where results will be saved. The directory will be created if it does not exist. |

Optional Arguments

| Argument | Description |
| --- | --- |
| -model_dir, --model_dir | Path to the base model directory (e.g., a pretrained LLaVA or LLaMA 3 model). If not provided, defaults will be used. |
| -lora, --lora | Enable this flag if your model is LoRA-adapted. |
| -index, --index | Identifier string for saving outputs. Useful for tracking multiple runs. |
| -negation, --negation | By default, negation filtering is disabled. Use this flag to enable it. |
| --text_only | Use only the text module of the model, ignoring visual inputs. |
| --vision_only | Use only the vision module, ignoring text inputs. |
| -vision, --vision | Choose the vision model. Options: llava-med or llama-vision (default). This flag is used together with the text module; to run the vision module alone, use --vision_only instead. |
| -wc, --wc | Word count per chunk. Use this to split long text into smaller chunks (default is 0, meaning no splitting). We recommend either full length (no split) or 300/384 words per chunk (improving recall), depending on your task. A sketch of this chunking logic follows below. |
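
To illustrate what the -wc option does, here is a hypothetical re-implementation of word-count chunking (not the code run_inference.sh actually uses):

def split_by_word_count(note: str, wc: int) -> list[str]:
    """Split a note into chunks of at most wc words; wc=0 means no splitting."""
    words = note.split()
    if wc <= 0 or len(words) <= wc:
        return [note]
    return [" ".join(words[i:i + wc]) for i in range(0, len(words), wc)]

chunks = split_by_word_count(open("note.txt").read(), wc=300)
print(len(chunks), "chunks")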

Pretraining & Fine-tuning

You can reproduce the PhenoGPT2 model with your own datasets or other foundation models.

Text Module

  1. Pretrain your model on synthetic data compiled from the HPO database to obtain the HPO Aware Pretrained model.
  2. Then, fine-tune the HPO Aware Pretrained model on synthetic training and validation data compiled from MIMIC-IV and PhenoPackets; a generic sketch follows after this list.
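
The exact procedures are defined by the repository's training scripts. As a rough orientation only, a generic LoRA instruction fine-tuning setup with Hugging Face transformers and peft looks like the sketch below (the base-model path, dataset files, and every hyperparameter are assumptions, not the values used to train PhenoGPT2):

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "./models/llama-3.1-8b"   # assumed local path to the Meta base weights
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA keeps fine-tuning affordable; rank and target modules here are illustrative
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

data = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

data = data.map(tokenize, batched=True, remove_columns=data["train"].column_names)

args = TrainingArguments(output_dir="./models/phenogpt2-ft", per_device_train_batch_size=1,
                         gradient_accumulation_steps=16, num_train_epochs=1, bf16=True)
Trainer(model=model, args=args, train_dataset=data["train"], eval_dataset=data["validation"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()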

Vision Module

  1. If you want to fine-tune the LLaVA-Med model, we recommend following the instructions in the LLaVA GitHub, but changing the weights to LLaVA-Med.
  2. Otherwise, you can use our phenogpt2_vision_training.py to fine-tune LLaMA Vision (or other models with similar architectures).

Developers

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania

Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia

Citations

The publication is in preparation. Thank you for reading! In the meantime, please cite our GitHub repository if you use PhenoGPT2.

@misc{nguyen2025phenogpt2,
  author = {Quan Minh Nguyen and Kai Wang},
  title = {PhenoGPT2},
  year = {2025},
  howpublished = {\url{https://github.com/WGLab/PhenoGPT2}},
  note = {Accessed: 2025-08-07}
}
