Important
- Join our PyHealth Discord Community! We are actively looking for contributors and want to get to know our users better! Click here to join Discord
- Sign up for our mailing list! We will email you about significant upcoming PyHealth changes. Click here to subscribe
PyHealth is designed for both ML researchers and medical practitioners. We can make your healthcare AI applications easier to develop, test and validate. Your development process becomes more flexible and more customizable. [GitHub]
[News!] We are continuously implementing good papers and benchmarks into PyHealth; check out the [Planned List]. You are welcome to pick one from the list and send us a PR, or to add more influential and new papers to the list.
Introduction [Video]
PyHealth supports diverse electronic health records (EHRs) such as MIMIC and eICU, as well as all OMOP-CDM based databases, and provides various advanced deep learning algorithms for important healthcare tasks such as diagnosis-based drug recommendation, patient hospitalization and mortality prediction, and ICU length-of-stay forecasting.
Building a healthcare AI pipeline can take as few as 10 lines of code in PyHealth.
All healthcare tasks in our package follow a five-stage pipeline:
load dataset -> define task function -> build ML/DL model -> model training -> inference
We try hard to keep each stage as separate as possible, so that people can customize their own pipeline by using only our data processing steps or only the ML models. Each step calls one module, and we introduce them using an example.
- STEP 1: <pyhealth.datasets> provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV and eICU, as well as the standard OMOP-formatted data. The dataset is stored in a unified Patient-Visit-Event structure.
from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    # map all NDC codes to third-level ATC codes in these tables
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)
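The code_mapping argument above asks for third-level ATC codes. ATC codes are hierarchical, and each level corresponds to a fixed-length prefix of the full code, so moving to a coarser level amounts to truncation. The helper below is a standalone sketch of that idea; atc_to_level is a hypothetical name used for illustration and is not part of PyHealth.

```python
# Prefix length of an ATC code at each of the five ATC levels,
# e.g. A10BA02 -> A (1), A10 (2), A10B (3), A10BA (4), A10BA02 (5).
ATC_LEVEL_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_to_level(code: str, level: int) -> str:
    """Truncate a full ATC code to the given level (hypothetical helper)."""
    if level not in ATC_LEVEL_LENGTH:
        raise ValueError(f"ATC level must be between 1 and 5, got {level}")
    length = ATC_LEVEL_LENGTH[level]
    if len(code) < length:
        raise ValueError(f"{code!r} is too short for ATC level {level}")
    return code[:length]


print(atc_to_level("A10BA02", 3))  # A10B
```

Working at level 3 merges many distinct drug products into a few hundred pharmacological subgroups, which keeps the label space of a drug recommendation task manageable.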
Users can also store their own dataset in our <pyhealth.datasets.SampleBaseDataset> structure and then follow the same pipeline below; see the Tutorial.
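The unified Patient-Visit-Event structure from STEP 1 can be pictured as three nested containers: a patient holds visits, and a visit holds coded events. The dataclasses below are a minimal sketch of that shape only; the class and field names are assumptions made for illustration and do not reproduce PyHealth's actual classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """A single coded clinical event, e.g. one diagnosis or one prescription."""
    code: str
    table: str       # source table, e.g. "DIAGNOSES_ICD"
    vocabulary: str  # coding system, e.g. "ICD9CM"


@dataclass
class Visit:
    """One hospital visit, holding its events grouped by source table."""
    visit_id: str
    events: Dict[str, List[Event]] = field(default_factory=dict)

    def add_event(self, event: Event) -> None:
        self.events.setdefault(event.table, []).append(event)

    def codes(self, table: str) -> List[str]:
        """Return the codes recorded in one table during this visit."""
        return [e.code for e in self.events.get(table, [])]


@dataclass
class Patient:
    """One patient, holding visits in chronological order."""
    patient_id: str
    visits: List[Visit] = field(default_factory=list)


visit = Visit(visit_id="v1")
visit.add_event(Event(code="428.0", table="DIAGNOSES_ICD", vocabulary="ICD9CM"))
visit.add_event(Event(code="A10B", table="PRESCRIPTIONS", vocabulary="ATC"))
patient = Patient(patient_id="p1", visits=[visit])

print(patient.visits[0].codes("DIAGNOSES_ICD"))  # ['428.0']
```

Keeping the structure independent of any task is what lets the same loaded dataset feed several different task functions.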
- STEP 2: <pyhealth.tasks> takes the <pyhealth.datasets> object as input and defines how to process each patient's data into a set of samples for the task. The package provides several task examples, such as drug recommendation and length of stay prediction.
from pyhealth.tasks import drug_recommendation_mimic3_fn
from pyhealth.datasets import split_by_patient, get_dataloader
mimic3sample = mimic3base.set_task(task_fn=drug_recommendation_mimic3_fn) # use default task
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])
# create dataloaders (torch.utils.data.DataLoader)
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)
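To make STEP 2 more concrete, the sketch below shows what a task function and a patient-level split do conceptually, using plain dictionaries. Both drug_recommendation_sketch and split_by_patient_sketch are hypothetical stand-ins written for this example; the sample format and the behavior of the real PyHealth functions may differ.

```python
import random
from typing import Dict, List, Sequence, Tuple

# One toy patient: a list of visits with coded conditions, procedures and drugs.
patient = {
    "patient_id": "p1",
    "visits": [
        {"visit_id": "v1", "conditions": ["4280"], "procedures": ["3893"], "drugs": ["A10B"]},
        {"visit_id": "v2", "conditions": ["5849"], "procedures": [], "drugs": ["C03C"]},
        {"visit_id": "v3", "conditions": ["4280", "5849"], "procedures": ["3995"], "drugs": ["A10B", "C03C"]},
    ],
}


def drug_recommendation_sketch(patient: Dict) -> List[Dict]:
    """Turn one patient into samples: one sample per visit with complete data."""
    samples = []
    for visit in patient["visits"]:
        # Skip visits that lack any of the three code types.
        if not (visit["conditions"] and visit["procedures"] and visit["drugs"]):
            continue
        samples.append({
            "patient_id": patient["patient_id"],
            "visit_id": visit["visit_id"],
            "conditions": visit["conditions"],
            "procedures": visit["procedures"],
            "drugs": visit["drugs"],  # prediction target
        })
    return samples


def split_by_patient_sketch(
    samples: List[Dict], ratios: Sequence[float], seed: int = 0
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split samples so that all samples of one patient land in the same subset."""
    patient_ids = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patient_ids)
    n_train = int(len(patient_ids) * ratios[0])
    n_val = int(len(patient_ids) * ratios[1])
    train_ids = set(patient_ids[:n_train])
    val_ids = set(patient_ids[n_train:n_train + n_val])
    train = [s for s in samples if s["patient_id"] in train_ids]
    val = [s for s in samples if s["patient_id"] in val_ids]
    test = [s for s in samples if s["patient_id"] not in train_ids | val_ids]
    return train, val, test


samples = drug_recommendation_sketch(patient)
print([s["visit_id"] for s in samples])  # ['v1', 'v3']

# Ten toy patients with two samples each.
all_samples = [{"patient_id": f"p{i}", "visit_id": f"v{j}"} for i in range(10) for j in range(2)]
train, val, test = split_by_patient_sketch(all_samples, [0.8, 0.1, 0.1])
print(len(train), len(val), len(test))  # 16 2 2
```

Splitting by patient rather than by sample avoids leaking one patient's visits across the training and test sets.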
- STEP 3: <pyhealth.models> provides the healthcare ML models, which take the sample dataset from STEP 2 as input. This module also provides model layers, such as pyhealth.models.RETAINLayer, for building customized ML architectures. Our model layers can be used as easily as torch.nn.Linear.
from pyhealth.models import Transformer
model = Transformer(
    dataset=mimic3sample,
    feature_keys=["conditions", "procedures"],
    label_key="drugs",
    mode="multilabel",
)
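The mode="multilabel" setting above reflects that each sample can carry several labels at once, here several drugs per visit. A common way to encode such targets is a multi-hot vector with one position per known label. The following is a minimal pure-Python sketch of that encoding; the function names are hypothetical, and this is not PyHealth's internal code.

```python
from typing import Dict, List


def build_label_index(label_sets: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct label a fixed position, in sorted order."""
    labels = sorted({label for labels in label_sets for label in labels})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(labels: List[str], label_index: Dict[str, int]) -> List[int]:
    """Encode a set of labels as a 0/1 vector over all known labels."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


drug_sets = [["A10B", "C03C"], ["C03C"], ["B01A", "A10B"]]
label_index = build_label_index(drug_sets)  # {'A10B': 0, 'B01A': 1, 'C03C': 2}
targets = [multi_hot(drugs, label_index) for drugs in drug_sets]
print(targets)  # [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
```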
- STEP 4: <pyhealth.trainer> is the training manager. Pass it the train_loader, the val_loader and the validation metric to monitor, and specify other arguments such as epochs, optimizer and learning rate. The trainer automatically saves the best model and outputs its path at the end.
from pyhealth.trainer import Trainer
trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc_samples",
)
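Keeping the best model according to a monitored validation metric comes down to simple bookkeeping inside the training loop. The sketch below illustrates that logic with stand-in callables and made-up scores; fit_with_monitor is a hypothetical function, and the real Trainer handles checkpointing, logging and optimization in its own way.

```python
from typing import Callable, Optional, Tuple


def fit_with_monitor(
    train_one_epoch: Callable[[int], None],
    evaluate: Callable[[int], float],
    epochs: int,
    higher_is_better: bool = True,
) -> Tuple[Optional[int], Optional[float]]:
    """Run a training loop and remember the epoch with the best monitored score."""
    best_epoch: Optional[int] = None
    best_score: Optional[float] = None
    for epoch in range(epochs):
        train_one_epoch(epoch)
        score = evaluate(epoch)
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            # A real trainer would write a checkpoint to disk at this point.
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Made-up validation scores, one per epoch, standing in for a real evaluation.
fake_scores = [0.41, 0.55, 0.62, 0.60, 0.58]
best_epoch, best_score = fit_with_monitor(
    train_one_epoch=lambda epoch: None,
    evaluate=lambda epoch: fake_scores[epoch],
    epochs=len(fake_scores),
)
print(best_epoch, best_score)  # 2 0.62
```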
- STEP 5: <pyhealth.metrics> provides several common evaluation metrics (see the documentation for the full list) as well as metrics specific to healthcare, such as the drug-drug interaction (DDI) rate.
trainer.evaluate(test_loader)
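As an illustration of a healthcare-specific metric, one common way to define a DDI rate is the fraction of drug pairs within the recommended sets that appear in a list of known interactions. The sketch below follows that definition under stated assumptions: the function ddi_rate and the toy drug names are made up for this example, and the metric implemented in PyHealth may be computed differently.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def ddi_rate(
    recommendations: List[List[str]],
    interacting_pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of recommended drug pairs that are known to interact."""
    known = {frozenset(pair) for pair in interacting_pairs}
    total_pairs = 0
    interacting = 0
    for drugs in recommendations:
        # Count every unordered pair of distinct drugs in one recommendation.
        for pair in combinations(sorted(set(drugs)), 2):
            total_pairs += 1
            if frozenset(pair) in known:
                interacting += 1
    return interacting / total_pairs if total_pairs else 0.0


recommendations = [["A", "B", "C"], ["A", "D"]]
interacting_pairs = [("A", "B"), ("C", "D")]
print(ddi_rate(recommendations, interacting_pairs))  # 0.25
```

In this toy case, four pairs are recommended in total (A-B, A-C, B-C and A-D) and only A-B is a known interaction, giving 1/4.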
- <pyhealth.medcode> provides two core functionalities: (i) looking up information for a given medical code (e.g., name, category, sub-concept); (ii) mapping codes across coding systems (e.g., ICD9CM to CCSCM). This module can be applied independently to your research.
- For code mapping between two coding systems
from pyhealth.medcode import CrossMap
codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("82101") # use it like a dict
codemap = CrossMap.load("NDC", "ATC")
codemap.map("00527051210")
- For code ontology lookup within one system
from pyhealth.medcode import InnerMap
icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0") # get detailed info
icd9cm.get_ancestors("428.0") # get ancestor codes
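Both functionalities can be pictured with two small data structures: a dictionary from source codes to lists of target codes for cross-system mapping, and child-to-parent links for the hierarchy within one system. The sketch below uses tiny toy tables to show the idea; map_code and walk_ancestors are hypothetical helpers, and the tables are illustrative excerpts rather than the resources that PyHealth loads.

```python
from typing import Dict, List

# Toy cross-system table: one source code may map to several target codes.
toy_icd9_to_ccs: Dict[str, List[str]] = {"428.0": ["108"]}

# Toy child -> parent links within one coding system.
toy_parent: Dict[str, str] = {
    "428.0": "428",
    "428": "420-429",
    "420-429": "390-459",
}


def map_code(code: str, table: Dict[str, List[str]]) -> List[str]:
    """Return all target codes for a source code, or an empty list if unmapped."""
    return table.get(code, [])


def walk_ancestors(code: str, parent: Dict[str, str]) -> List[str]:
    """Follow parent links upward, returning ancestors from nearest to farthest."""
    ancestors = []
    while code in parent:
        code = parent[code]
        ancestors.append(code)
    return ancestors


print(map_code("428.0", toy_icd9_to_ccs))    # ['108']
print(map_code("999.9", toy_icd9_to_ccs))    # []
print(walk_ancestors("428.0", toy_parent))   # ['428', '420-429', '390-459']
```

Returning a list from the mapping reflects that a code in one system can correspond to several codes in another.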
- <pyhealth.tokenizer> is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be independently applied to your research.
from pyhealth.tokenizer import Tokenizer
# Example: we use a list of ATC3 codes as the tokens
# ('A03E' is included so that the indices shown in the comments below line up)
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D',
               'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A',
               'A07B', 'A07C', 'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens) # [[8, 9, 10, 11], [12, 1, 1, 0]]
# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices) # [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
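The mapping between tokens and indices can be sketched in a few lines: special tokens take the first indices, unknown tokens fall back to the unknown index, and shorter rows are padded to the length of the longest row. ToyTokenizer below is a hypothetical, simplified stand-in written for this illustration; it is not the PyHealth Tokenizer and covers only the 2D case.

```python
from typing import List, Sequence


class ToyTokenizer:
    """Minimal vocabulary with padding and unknown-token handling."""

    def __init__(self, tokens: Sequence[str],
                 special_tokens: Sequence[str] = ("<pad>", "<unk>")) -> None:
        # Special tokens come first, so <pad> is index 0 and <unk> is index 1.
        self.idx2token = list(special_tokens) + list(tokens)
        self.token2idx = {token: i for i, token in enumerate(self.idx2token)}
        self.pad = self.token2idx["<pad>"]
        self.unk = self.token2idx["<unk>"]

    def encode_2d(self, batch: List[List[str]]) -> List[List[int]]:
        """Map tokens to indices and pad every row to the longest row."""
        width = max((len(row) for row in batch), default=0)
        return [
            [self.token2idx.get(token, self.unk) for token in row]
            + [self.pad] * (width - len(row))
            for row in batch
        ]

    def decode_2d(self, batch: List[List[int]]) -> List[List[str]]:
        """Map indices back to tokens, dropping padding."""
        return [[self.idx2token[i] for i in row if i != self.pad] for row in batch]


tok = ToyTokenizer(["A01A", "A02A", "A03C"])
indices = tok.encode_2d([["A01A", "A03C", "A02A"], ["A02A", "B035"]])
print(indices)                 # [[2, 4, 3], [3, 1, 0]]
print(tok.decode_2d(indices))  # [['A01A', 'A03C', 'A02A'], ['A02A', '<unk>']]
```

Note that decoding is not a perfect inverse: a token outside the vocabulary, such as B035 here, comes back as the unknown token.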
Users can customize their healthcare AI pipeline as simply as calling one module:
- process your OMOP data via pyhealth.datasets
- process open ICU data (e.g., MIMIC, eICU) via pyhealth.datasets
- define your own task on existing databases via pyhealth.tasks
- use existing healthcare models or build upon them (e.g., RETAIN) via pyhealth.models
- map codes for conditions and medications between coding systems via pyhealth.medcode
We provide the following datasets for general purpose healthcare AI research:
Dataset | Module | Year | Information |
---|---|---|---|
MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | MIMIC-III Clinical Database |
MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | MIMIC-IV Clinical Database |
eICU | pyhealth.datasets.eICUDataset | 2018 | eICU Collaborative Research Database |
OMOP | pyhealth.datasets.OMOPDataset | | OMOP-CDM schema based dataset |
SleepEDF | pyhealth.datasets.SleepEDFDataset | 2018 | Sleep-EDF dataset |
SHHS | pyhealth.datasets.SHHSDataset | 2016 | Sleep Heart Health Study dataset |
ISRUC | pyhealth.datasets.ISRUCDataset | 2016 | ISRUC-SLEEP dataset |
Model Name | Type | Module | Year | Summary | Reference |
---|---|---|---|---|---|
Multi-layer Perceptron | deep learning | pyhealth.models.MLP | 1986 | MLP treats each feature as static | Backpropagation: theory, architectures, and applications |
Convolutional Neural Network (CNN) | deep learning | pyhealth.models.CNN | 1989 | CNN runs on the conceptual patient-by-visit grids | Handwritten Digit Recognition with a Back-Propagation Network |
Recurrent Neural Nets (RNN) | deep learning | pyhealth.models.RNN | 2011 | RNN (includes LSTM and GRU) can run on any sequential level (e.g., visit-by-visit sequences) | Recurrent neural network based language model |
Transformer | deep learning | pyhealth.models.Transformer | 2017 | Transformer can run on any sequential level (e.g., visit-by-visit sequences) | Attention Is All You Need |
RETAIN | deep learning | pyhealth.models.RETAIN | 2016 | RETAIN uses two RNNs to learn patient embeddings while providing feature-level and visit-level importance | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism |
GAMENet | deep learning | pyhealth.models.GAMENet | 2019 | GAMENet uses memory networks; used only for the drug recommendation task | GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination |
MICRON | deep learning | pyhealth.models.MICRON | 2021 | MICRON predicts the future drug combination by predicting the changes relative to the current combination; used only for the drug recommendation task | Change Matters: Medication Change Prediction with Recurrent Residual Networks |
SafeDrug | deep learning | pyhealth.models.SafeDrug | 2021 | SafeDrug encodes drug molecule structures with graph neural networks; used only for the drug recommendation task | SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations |
MoleRec | deep learning | pyhealth.models.MoleRec | 2023 | MoleRec encodes drug molecules at the substructure level, together with the patient's information, into a drug combination representation; used only for the drug recommendation task | MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning |
Deepr | deep learning | pyhealth.models.Deepr | 2017 | Deepr is based on 1D CNN. General purpose | Deepr: A Convolutional Net for Medical Records |
ContraWR Encoder (STFT+CNN) | deep learning | pyhealth.models.ContraWR | 2021 | ContraWR encoder uses short-time Fourier transform (STFT) + 2D CNN; used for biosignal learning | Self-supervised EEG Representation Learning for Automatic Sleep Staging |
SparcNet (1D CNN) | deep learning | pyhealth.models.SparcNet | 2023 | SparcNet is based on 1D CNN; used for biosignal learning | Development of Expert-level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation |
TCN | deep learning | pyhealth.models.TCN | 2018 | TCN is based on dilated 1D CNN. General purpose | Temporal Convolutional Networks |
AdaCare | deep learning | pyhealth.models.AdaCare | 2020 | AdaCare uses CNNs with dilated filters to learn enriched patient embeddings. It uses a feature calibration module to provide feature-level and visit-level interpretability | AdaCare: Explainable Clinical Health Status Representation Learning via Scale-Adaptive Feature Extraction and Recalibration |
ConCare | deep learning | pyhealth.models.ConCare | 2020 | ConCare uses transformers to learn patient embeddings and calculate inter-feature correlations | ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context |
StageNet | deep learning | pyhealth.models.StageNet | 2020 | StageNet uses a stage-aware LSTM to conduct clinical predictive tasks while learning changes in patient disease progression stage in an unsupervised manner | StageNet: Stage-Aware Neural Networks for Health Risk Prediction |
Dr. Agent | deep learning | pyhealth.models.Agent | 2020 | Dr. Agent uses two reinforcement learning agents to learn patient embeddings by mimicking clinical second opinions | Dr. Agent: Clinical predictive model via mimicked second opinions |
GRASP | deep learning | pyhealth.models.GRASP | 2021 | GRASP uses a graph neural network to identify latent patient clusters and uses the clustering information to learn patient | GRASP: Generic Framework for Health Status Representation Learning Based on Incorporating Knowledge from Similar Patients |