- Obtain access to the MIMIC-CXR-JPG Database on PhysioNet and download the dataset. We recommend downloading from the GCP bucket:
gcloud auth login
mkdir MIMIC-CXR-JPG
gsutil -m rsync -d -r gs://mimic-cxr-jpg-2.0.0.physionet.org MIMIC-CXR-JPG
- Sign up for access to the CheXpert dataset with your email address.
- Download either the original or the downsampled dataset (we recommend the downsampled version, `CheXpert-v1.0-small.zip`) and extract it.
- Download the `images` folder and `Data_Entry_2017_v2020.csv` from the NIH website.
- Unzip all of the files in the `images` folder.
- In `Constants.py`, update `image_paths` to point to each of the three directories that you downloaded.
- Run `python -m data.preprocess.preprocess_cxr`.
- (Optional) If you are training many models, it may be faster to first cache all images as binary 224x224 files on disk. To do so, update the `cache_dir` path in `Constants.py` and then run `python -m data.preprocess.cache_data`, optionally parallelizing over `--env_id {0, 1, 2}` for speed. To use the cached files, pass `--cache_cxr` to `train.py`.
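
The optional caching step can be run once per `--env_id` value at the same time. The following is a minimal sketch of one way to do that from Python, not part of the repository: it assumes that each invocation of `data.preprocess.cache_data` accepts a single integer for `--env_id` and that it is launched from the repository root. The helper names `build_cache_commands` and `run_in_parallel` are hypothetical.

```python
import subprocess
import sys

# The three values listed for --env_id in the step above.
ENV_IDS = (0, 1, 2)


def build_cache_commands(env_ids=ENV_IDS, python=sys.executable):
    """Return one argv list per env_id for the caching module.

    Assumption: --env_id takes a single integer per invocation.
    """
    return [
        [python, "-m", "data.preprocess.cache_data", "--env_id", str(env_id)]
        for env_id in env_ids
    ]


def run_in_parallel(commands):
    """Start every command at once, wait for all, and return exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling `run_in_parallel(build_cache_commands())` from the repository root would start the three caching jobs together; a non-zero entry in the returned list indicates that the corresponding job failed.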
- Download the Camelyon17-v1.0 and PovertyMap-1.0 datasets and extract them.
- Update `wilds_root_dir` in `Constants.py`.
- Run `python -m data.preprocess.preprocess_wilds`.
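
Before running the preprocessing modules, it can help to confirm that the directories configured in `Constants.py` actually exist. The sketch below is a hypothetical helper, not part of the repository; the dictionary keys and locations are placeholders, since the exact layout of `image_paths` and `wilds_root_dir` is not shown here.

```python
from pathlib import Path


def find_missing_dirs(named_paths):
    """Return the subset of {name: path} whose path is not an existing directory."""
    return {
        name: str(path)
        for name, path in named_paths.items()
        if not Path(path).expanduser().is_dir()
    }


# Placeholder entries: the keys and locations are illustrative only,
# not the actual contents of Constants.py.
example_paths = {
    "MIMIC-CXR-JPG": "MIMIC-CXR-JPG",
    "wilds_root_dir": "/path/to/wilds",
}

for name, path in find_missing_dirs(example_paths).items():
    print(f"missing directory for {name}: {path}")
```

Replacing `example_paths` with the values you set in `Constants.py` gives a quick check that every download and extraction step completed in the expected location.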